Ok, I think ahamed's observation is correct. The files are not plain ASCII text files.
So I downloaded both files to my Windows machine and used Cygwin Bash to poke around. Here's what I see:
The first character looks quite unusual, and there appears to be a space between each character, e.g. "c" + space + "h" + space + "r" instead of "chr".
The octal dump of the first line shows this:
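So the first two bytes are octal 377 and 376; that's decimal 255 and 254 (hex FF FE), which is the byte order mark of UTF-16 little-endian text. Also, each character is followed by a byte with value 0, i.e. chr(0), which is what you'd expect from that encoding. It is seen as "\0" in the octal dump above, and it is what looked like a "space" between the characters.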
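For anyone who wants to reproduce that pattern without the original files, here's a minimal sketch. The file name sample_utf16.txt is made up for illustration; the bytes are just "chr" written the way the dump describes it, i.e. 377 376 first and a chr(0) after each character. I'd guess (not verified against your Excel version) that the file was saved with Excel's "Unicode Text" option, which is known to produce this layout.

```shell
# Build a tiny sample by hand (hypothetical file name, not the original data).
# \377\376 are the two leading bytes; \000 is the chr(0) after each character.
printf '\377\376c\000h\000r\000' > sample_utf16.txt

# od -c shows printable bytes as characters, chr(0) as \0,
# and other non-printable bytes as 3-digit octal numbers.
od -c sample_utf16.txt

# The same bytes as unsigned decimal numbers: 255 254 99 0 104 0 114 0
od -An -tu1 sample_utf16.txt
```

If your own file starts with 377 376 and shows "\0" after every character in `od -c`, it has the same problem.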
The newline or End-of-Line (EOL) character should be "\r\n" for Windows and "\n" for Unix/Linux. (Not sure, but I think it's "\r" for the old Mac OS and "\n" for Mac OS X.) None of those EOL characters are present in the text file, which would confuse awk or Perl.
The other file, "file2.txt", appears to have "\r" characters as EOL.
The "\r" character is the "Carriage Return" character (from the good ol' days of the typewriter); it moves back to the start of the line, so the next record overwrites the text that was already printed. That's why the file looks like "one single line", and why tools that count "\n" characters see only one line. The octal dump shows the difference clearly.
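A quick way to check which EOL character a file uses is to count them. This is only a rough sketch on a made-up sample file (sample_cr.txt), not on the real data:

```shell
# Hypothetical sample: three tab-delimited records, each ended by a bare "\r".
printf 'a\tb\rc\td\re\tf\r' > sample_cr.txt

# wc -l counts "\n" bytes only, so a file with "\r" as EOL reports 0 lines.
wc -l < sample_cr.txt

# Count the "\r" bytes instead: delete every other byte, then count what is left.
tr -cd '\r' < sample_cr.txt | wc -c
```

If the first number is 0 (or 1) and the second is close to the number of records you expect, the file is using "\r" as EOL.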
Now if you are working with "file2.txt" in Mac OS, then you'd want to use MacPerl for processing, and I'd assume it takes care of EOL characters. I have no experience with any Mac system though, so don't quote me on that.
On the other hand, if you want to work in RedHat Linux, then you may want to ensure that the EOL characters are "\n" only, before running any of those shell scripts.
You mentioned that those files exist as ".xlsx" files, i.e. MS Excel 2007 or higher. In that case, saving them as "tab delimited files" should be pretty straightforward.
Last edited by durden_tyler; 09-14-2011 at 11:46 PM..
I think you are right that there is definitely a problem with the files.
Thank you for doing that analysis - way over my beginner unix head!
I performed the following:
And there should be ~7000 in each file...
So this has turned into a much worse problem than I thought. I'm guessing that all of the scripts described above in this thread will work properly once I work out how to turn my .xlsx into tab-delimited text or .csv correctly, without all of the data being joined into one line.
So the examples I provided (file1.txt and file2.txt) are only about 10% of the size of the actual files (I couldn't upload bigger files). I generated these modified tab-delimited files using Microsoft Excel and using the Save As feature. I have no idea why it would write all of the data into one line.
I cannot Save As an .xls or .xml (which would then be easy to convert to a .txt or .csv) because there are too many lines (~90000) in the original file. I also cannot open the .xlsx on RedHat because it only recognises this as a zip file!
You may want to clean up the files you posted and then try the scripts on them.
I do hope you have Perl on your RedHat Linux box.
If not, then the following suggestion won't work.
Otherwise, do this:
(1) Download the files "file1.txt" and "file2.txt" you attached in your post, to your RedHat Linux system. Put them in a new/freshly created directory.
(2) Back them up first, using the following commands, which will create copies of the two files:
(3) Now clean up file1.txt using the following commands:
The Perl one-liner shown above strips all bytes with the values 0, 254 and 255 (254 and 255 are not ASCII characters at all; they are the two leading bytes seen in the octal dump). Then it removes all "\r" characters as well. Hopefully those were the only offending characters. The output is redirected to "file1.txt.new", which is then renamed back to "file1.txt".
So, we should be left with a "file1.txt" that has the Linux EOL character "\n" and no non-printable characters.
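If Perl isn't available, roughly the same cleanup can be sketched with `tr`. To be clear, this is a stand-in and not the Perl one-liner from step (3). It is shown on a made-up sample file (sample1.txt) and, like the Perl version, it only makes sense if the real text is plain ASCII underneath:

```shell
# Hypothetical sample: "a<TAB>b" plus "\r\n", with the two leading bytes
# 377 376 and a chr(0) after every character.
printf '\377\376a\000\t\000b\000\r\000\n\000' > sample1.txt

# Delete bytes 0, 254 (octal 376), 255 (octal 377) and "\r".
# LC_ALL=C makes tr treat the input as raw bytes.
LC_ALL=C tr -d '\000\376\377\r' < sample1.txt > sample1.txt.new &&
    mv sample1.txt.new sample1.txt

# What is left should be plain "a<TAB>b" followed by "\n".
od -c sample1.txt
```

If the data contains anything beyond ASCII, converting the encoding properly (for example with `iconv -f UTF-16 -t UTF-8`) is probably the safer route than deleting bytes.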
(4) Next, clean up file2.txt using the following command:
This one simply replaces every "\r" character with "\n", which is the EOL character for Linux.
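The same substitution can also be sketched with `tr` instead of Perl. Again this is a stand-in on a made-up sample file (sample2.txt), so try it on a copy first:

```shell
# Hypothetical sample: two tab-delimited records with "\r" as EOL.
printf 'a\tb\rc\td\r' > sample2.txt

# Translate every "\r" into "\n", then rename the result back.
tr '\r' '\n' < sample2.txt > sample2.txt.new &&
    mv sample2.txt.new sample2.txt

# wc -l now sees one line per record.
wc -l < sample2.txt
```

Don't run this on a file that has Windows "\r\n" line endings, because each "\r\n" would turn into "\n\n" and you'd get an empty line after every record.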
If everything has worked fine till now, then you should be left with the following files in your directory:
You may now want to go back and create the shell scripts posted earlier and test those. They will process the files "file1.txt" and "file2.txt" and create a new file in the current directory.