Problems with Dutch and converting files to UNIX


 
# 1  
Old 03-22-2014

Dear All,
For the last few days I have been dealing with the silly problem of converting Dutch files of different types and encodings (ANSI, UTF) into normal UNIX files in UTF-8 without a BOM ...
I have been using both a Linux terminal and Cygwin, and no matter what I try I have not been successful.

The files have hidden characters as well... I have tried
Code:
dos2unix -iso -o files1.txt (cygwin)
fromdos (linux)
find . | grep .txt | xargs fromdos -d

and I even tried it manually in Notepad++, but no matter what, it does not work. There must be something that I don't know how to deal with, and I would be grateful if you could help me.

I am trying to normalize the files later with the code below. The files are laid out as
Code:
folder01/folder02/files1.txt

Code:
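# Strip punctuation, digits and blank lines from every file in folder01/*/
for W in folder01/*
do
	doc=$(basename "$W")
	# -p also creates res/ and res/${doc} and does not fail if they exist
	mkdir -p "res/${doc}/normal"
	for X in "folder01/${doc}"/*
	do
		name=$(basename "$X")
		# remove punctuation and digits, then drop empty or space-only lines
		sed -e 's/[[:punct:][:digit:]]//g' -e '/^ *$/d' "$X" > "res/${doc}/normal/${name}"
	done
done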

I have attached a sample of the files mentioned above as well.
Thank you in advance
A-V
# 2  
Old 03-22-2014
Try using the iconv program. That is the way I always translate character sets. For example, if I have utf-16 and I want ascii, it is just:
Code:
iconv -f utf-16 -t ascii < input.file

Read the man page on it, and then run iconv -l to get a list of the character sets that your iconv knows. Also use our search tool to look for threads containing "iconv". You are not the first person with this problem and you will probably find a few dozen threads.
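
Since the original post mentions a mix of ANSI and UTF files, here is a minimal sketch of how iconv could be used on such a mix. It is only an illustration: the helper name to_utf8 is made up, and the fallback source encoding WINDOWS-1252 is an assumption about what "ANSI" means for these files, so check a sample before trusting it.

```shell
#!/bin/sh
# to_utf8 FILE [SOURCE_CHARSET]
# Hypothetical helper: leaves FILE alone if it already decodes as UTF-8,
# otherwise converts it in place from SOURCE_CHARSET.
# ASSUMPTION: non-UTF-8 input is WINDOWS-1252 unless a charset is given.
to_utf8() {
    src=$1
    from=${2:-WINDOWS-1252}
    # iconv exits non-zero when the input is not valid in the -f charset,
    # so converting UTF-8 to UTF-8 works as a validity check.
    if iconv -f UTF-8 -t UTF-8 "$src" > /dev/null 2>&1; then
        return 0    # already valid UTF-8 (plain ASCII passes as well)
    fi
    # Write to a temporary file first so a failed conversion
    # does not destroy the original.
    iconv -f "$from" -t UTF-8 "$src" > "$src.utf8" && mv "$src.utf8" "$src"
}
```

It would be called as to_utf8 file1.txt, or to_utf8 file1.txt ISO-8859-1 if the source turns out to be something else. Note that a UTF-8 BOM counts as valid UTF-8, so this does not remove it.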
# 3  
Old 03-22-2014
That file1.txt starts with the three-byte UTF-8 representation of FEFF; the rest is normal UTF-8 chars, even the é (=0xE9). The normal UTF-16 intro would be a two-byte FFFE. So maybe that file has undergone another uncontrolled conversion before, e.g. a little endian - big endian one?
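
To check what a file starts with, the leading bytes can be dumped with od. A small sketch, assuming head -c is available (it is on GNU and BSD systems); the function name bom_kind is made up for this example and it only looks at the UTF-8 and UTF-16 byte order marks:

```shell
#!/bin/sh
# bom_kind FILE
# Hypothetical helper: prints which byte order mark, if any, FILE starts with.
bom_kind() {
    # First three bytes as lowercase hex without spaces, e.g. "efbbbf"
    bytes=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    case $bytes in
        efbbbf) echo utf-8 ;;      # U+FEFF encoded as UTF-8
        fffe*)  echo utf-16le ;;   # U+FEFF stored little endian
        feff*)  echo utf-16be ;;   # U+FEFF stored big endian
        *)      echo none ;;
    esac
}
```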
# 4  
Old 03-22-2014
Perderabo, it tells me that I have an illegal input sequence at position 0.
# 5  
Old 03-22-2014
Why don't you just remove those leading bytes?
# 6  
Old 03-22-2014
Rudi, I have about 700 files with that structure ... unless there is a way to do that on a whole folder, I don't think it would be feasible.
I have also tried a perl one-liner:
Code:
perl -CD -pe 'tr/\x{feff}//d' file1.txt > new-file1.txt

but the code above still does not work for it, and when I use iconv it stops at position 1251 because of some other characters

Last edited by A-V; 03-22-2014 at 01:33 PM..
# 7  
Old 03-22-2014
Like RudiC says, it appears to be UTF-8 encoding with the first three bytes being the BOM 0xEF 0xBB 0xBF. So you could try removing those using:
Code:
tail -c+4 file > file.new

or

If you have GNU sed, you could try:
Code:
sed '1s/^\xEF\xBB\xBF//' file > file.new

You might be able to use GNU sed's -i option, but use that with care and, as always, test it carefully first.
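
For the roughly 700 files mentioned in post 6, the tail approach could be wrapped in a loop. This is only a sketch: the names strip_bom and strip_bom_tree are made up, it assumes the files end in .txt and that head -c is available, and it rewrites files in place, so try it on a copy of the folder first.

```shell
#!/bin/sh
# strip_bom FILE
# Hypothetical helper: removes a leading UTF-8 BOM (EF BB BF) if present
# and leaves files without one untouched.
strip_bom() {
    file=$1
    first=$(head -c 3 "$file" | od -An -tx1 | tr -d ' \n')
    if [ "$first" = "efbbbf" ]; then
        # tail -c +4 prints everything from the fourth byte onwards
        tail -c +4 "$file" > "$file.nobom" && mv "$file.nobom" "$file"
    fi
}

# strip_bom_tree DIR
# Runs strip_bom on every *.txt file below DIR.
# ASSUMPTION: file names contain no newline characters.
strip_bom_tree() {
    find "$1" -type f -name '*.txt' | while IFS= read -r f
    do
        strip_bom "$f"
    done
}
```

With the layout from the first post it would be run as strip_bom_tree folder01.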

Last edited by Scrutinizer; 03-22-2014 at 02:01 PM..