Help with generating a script


 
# 15  
Old 09-14-2011
OK, I think ahamed's observation is correct: the files are not plain ASCII text files.

So I downloaded both files to my Windows machine and used Cygwin Bash to poke around. Here's what I see:

Code:
$
$
$ # print the first line of file1.txt
$
$ head -1 file1.txt
▒▒c h r _ n a m e        c h r _ s t a r t       c h r _ e n d   r e f _ b a s e         a l t _ b a s e         h o m _ h e t   s n p _ q u a l i t y   t o t _ dqa
$
$
$

The first two characters look quite unusual, and there appears to be a space between each character, e.g. "c" + space + "h" + space + "r" instead of "chr".
The octal dump of the first line shows this:

Code:
$
$ # octal dump of the first line of file1.txt
$
$ head -1 file1.txt | od -bc
0000000 377 376 143 000 150 000 162 000 137 000 156 000 141 000 155 000
        377 376   c  \0   h  \0   r  \0   _  \0   n  \0   a  \0   m  \0
0000020 145 000 011 000 143 000 150 000 162 000 137 000 163 000 164 000
          e  \0  \t  \0   c  \0   h  \0   r  \0   _  \0   s  \0   t  \0
0000040 141 000 162 000 164 000 011 000 143 000 150 000 162 000 137 000
          a  \0   r  \0   t  \0  \t  \0   c  \0   h  \0   r  \0   _  \0
0000060 145 000 156 000 144 000 011 000 162 000 145 000 146 000 137 000
          e  \0   n  \0   d  \0  \t  \0   r  \0   e  \0   f  \0   _  \0
0000100 142 000 141 000 163 000 145 000 011 000 141 000 154 000 164 000
          b  \0   a  \0   s  \0   e  \0  \t  \0   a  \0   l  \0   t  \0
0000120 137 000 142 000 141 000 163 000 145 000 011 000 150 000 157 000
          _  \0   b  \0   a  \0   s  \0   e  \0  \t  \0   h  \0   o  \0
0000140 155 000 137 000 150 000 145 000 164 000 011 000 163 000 156 000
          m  \0   _  \0   h  \0   e  \0   t  \0  \t  \0   s  \0   n  \0
0000160 160 000 137 000 161 000 165 000 141 000 154 000 151 000 164 000
          p  \0   _  \0   q  \0   u  \0   a  \0   l  \0   i  \0   t  \0
0000200 171 000 011 000 164 000 157 000 164 000 137 000 144 000 145 000
          y  \0  \t  \0   t  \0   o  \0   t  \0   _  \0   d  \0   e  \0
0000220 160 000 164 000 150 000 011 000 141 000 154 000 164 000 137 000
          p  \0   t  \0   h  \0  \t  \0   a  \0   l  \0   t  \0   _  \0
0000240 144 000 145 000 160 000 164 000 150 000 011 000 144 000 142 000
          d  \0   e  \0   p  \0   t  \0   h  \0  \t  \0   d  \0   b  \0
0000260 123 000 116 000 120 000 011 000 144 000 142 000 123 000 116 000
          S  \0   N  \0   P  \0  \t  \0   d  \0   b  \0   S  \0   N  \0
0000300 120 000 061 000 063 000 061 000 011 000 162 000 145 000 147 000
          P  \0   1  \0   3  \0   1  \0  \t  \0   r  \0   e  \0   g  \0
0000320 151 000 157 000 156 000 011 000 147 000 145 000 156 000 145 000
          i  \0   o  \0   n  \0  \t  \0   g  \0   e  \0   n  \0   e  \0
0000340 011 000 143 000 150 000 141 000 156 000 147 000 145 000 011 000
         \t  \0   c  \0   h  \0   a  \0   n  \0   g  \0   e  \0  \t  \0
0000360 141 000 156 000 156 000 157 000 164 000 141 000 164 000 151 000
          a  \0   n  \0   n  \0   o  \0   t  \0   a  \0   t  \0   i  \0
0000400 157 000 156 000 011 000 144 000 142 000 123 000 116 000 120 000
          o  \0   n  \0  \t  \0   d  \0   b  \0   S  \0   N  \0   P  \0
0000420 061 000 063 000 062 000 011 000 061 000 060 000 060 000 060 000
          1  \0   3  \0   2  \0  \t  \0   1  \0   0  \0   0  \0   0  \0
0000440 147 000 145 000 156 000 157 000 155 000 145 000 163 000 011 000
          g  \0   e  \0   n  \0   o  \0   m  \0   e  \0   s  \0  \t  \0
0000460 141 000 154 000 154 000 145 000 154 000 145 000 040 000 146 000
          a  \0   l  \0   l  \0   e  \0   l  \0   e  \0      \0   f  \0
0000500 162 000 145 000 161 000 015 000 012
          r  \0   e  \0   q  \0  \r  \0  \n
0000511
$
$

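So the first two bytes are octal 377 and 376; that's decimal 255 and 254 (hex FF FE). That pair is the UTF-16 little-endian byte order mark, so the file is UTF-16 text rather than ASCII. Also, every character is followed by a null byte, i.e. chr(0), which is how UTF-16 stores ASCII characters. It is seen as "\0" in the octal dump above, and it is what the terminal displayed as a space.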
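
The check for the leading bytes can also be scripted, so nobody has to read an octal dump by eye. Below is a minimal sketch that prints the first two bytes of a file in hex and compares them against ff fe, the UTF-16 little-endian byte order mark. The function name first_two_bytes and the temporary sample file are made up for illustration; the sample only imitates the start of file1.txt.

```shell
# Print the first two bytes of a file as hex digits, e.g. "fffe".
first_two_bytes() {
    head -c 2 "$1" | od -An -tx1 | tr -d ' \n'
}

# Hypothetical sample imitating the start of file1.txt:
# bytes 377 376, then "chr" with a NUL byte after each letter.
sample=$(mktemp)
printf '\377\376c\000h\000r\000' > "$sample"

if [ "$(first_two_bytes "$sample")" = "fffe" ]; then
    echo "starts with FF FE: looks like UTF-16 little-endian"
else
    echo "no UTF-16LE byte order mark"
fi
```

On the real file that would be first_two_bytes file1.txt.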

The newline or End-of-Line (EOL) sequence should be "\r\n" for Windows and "\n" for Unix/Linux. (Not sure, but I think it's "\r" for classic Mac OS and "\n" for Mac OS X.) In file1.txt the line does end with "\r" and "\n", but with a "\0" in between (see the last row of the dump), so no clean EOL sequence is present in the text file, which would confuse awk or Perl.

The other file - "file2.txt" appears to have "\r" characters as EOL.

Code:
$
$
$ # does "file2.txt" have any "\n" characters?
$
$ cat file2.txt | perl -lne '$count = s/\n//g; print "Number of \\n characters = $count"'
Number of \n characters =
$
$
$ # does "file2.txt" have any "\r" characters?
$
$ cat file2.txt | perl -lne '$count = s/\r//g; print "Number of \\r characters = $count"'
Number of \r characters = 13421
$
$
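
One caveat about the two checks above: with -l, Perl strips the trailing "\n" from each record before the substitution runs, so the s/\n//g test can never find anything, and its empty result doesn't prove much on its own. Counting the raw bytes with tr and wc sidesteps that. This is only a sketch; count_byte and the sample file are hypothetical names, not something from the thread.

```shell
# Count how often one byte occurs in a file.
# $1 is the byte written as a tr escape (e.g. '\r'), $2 is the file.
count_byte() {
    tr -cd "$1" < "$2" | wc -c | tr -d ' '
}

# Hypothetical sample: three tab-separated records that use "\r" as EOL
# and contain no "\n" at all.
eol_sample=$(mktemp)
printf 'a\tb\rc\td\re\tf' > "$eol_sample"

echo "CR count: $(count_byte '\r' "$eol_sample")"
echo "LF count: $(count_byte '\n' "$eol_sample")"
```

Running count_byte '\r' file2.txt and count_byte '\n' file2.txt would give the real numbers for the attached file.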

The "\r" character is the "Carriage Return" character (from the good ol' days of the typewriter); on a terminal it moves the cursor back to the start of the line, so whatever follows overwrites the text that was already printed. And since there is no "\n" anywhere, line-oriented tools see the whole file as one single line. The octal dump shows the difference clearly.

Code:
$
$
$ # what's the first occurrence of "\r" in file2.txt?
$
$ perl -lne 'print index($_, "\r")' file2.txt
162
$
$ # and the second?
$
$ perl -lne 'print index($_, "\r", 163)' file2.txt
264
$
$ # print the first 160 characters of file2.txt
$
$ perl -lne 'print substr($_,0,160)' file2.txt
chr_name        chr_start       chr_end ref_base        alt_base        hom_het snp_quality     tot_depth       alt_depth       dbSNP   dbSNP131        region  geallele fres
$
$ # looks good, but print the first 200 characters now
$
$ perl -lne 'print substr($_,0,200)' file2.txt
chr01ame14930   14930tarA       Ghr_end het_base137     65t_base33      som_het snp_quality     tot_depth       alt_depth       dbSNP   dbSNP131        region  geallele freq
$
$
$ # doesn't look good because "\r" started overwriting the characters printed already
$ # od -bc shows it better; notice the "\r" (octal 015) in the row at offset 0000240
$
$ perl -lne 'print substr($_,0,200)' file2.txt | od -bc
0000000 143 150 162 137 156 141 155 145 011 143 150 162 137 163 164 141
          c   h   r   _   n   a   m   e  \t   c   h   r   _   s   t   a
0000020 162 164 011 143 150 162 137 145 156 144 011 162 145 146 137 142
          r   t  \t   c   h   r   _   e   n   d  \t   r   e   f   _   b
0000040 141 163 145 011 141 154 164 137 142 141 163 145 011 150 157 155
          a   s   e  \t   a   l   t   _   b   a   s   e  \t   h   o   m
0000060 137 150 145 164 011 163 156 160 137 161 165 141 154 151 164 171
          _   h   e   t  \t   s   n   p   _   q   u   a   l   i   t   y
0000100 011 164 157 164 137 144 145 160 164 150 011 141 154 164 137 144
         \t   t   o   t   _   d   e   p   t   h  \t   a   l   t   _   d
0000120 145 160 164 150 011 144 142 123 116 120 011 144 142 123 116 120
          e   p   t   h  \t   d   b   S   N   P  \t   d   b   S   N   P
0000140 061 063 061 011 162 145 147 151 157 156 011 147 145 156 145 011
          1   3   1  \t   r   e   g   i   o   n  \t   g   e   n   e  \t
0000160 143 150 141 156 147 145 011 141 156 156 157 164 141 164 151 157
          c   h   a   n   g   e  \t   a   n   n   o   t   a   t   i   o
0000200 156 011 144 142 123 116 120 061 063 062 011 061 060 060 060 147
          n  \t   d   b   S   N   P   1   3   2  \t   1   0   0   0   g
0000220 145 156 157 155 145 163 011 141 154 154 145 154 145 040 146 162
          e   n   o   m   e   s  \t   a   l   l   e   l   e       f   r
0000240 145 161 015 143 150 162 060 061 011 061 064 071 063 060 011 061
          e   q  \r   c   h   r   0   1  \t   1   4   9   3   0  \t   1
0000260 064 071 063 060 011 101 011 107 011 150 145 164 011 061 063 067
          4   9   3   0  \t   A  \t   G  \t   h   e   t  \t   1   3   7
0000300 011 066 065 011 063 063 011 163 012
         \t   6   5  \t   3   3  \t   s  \n
0000311
$
$
$

Now if you are working with "file2.txt" in Mac OS, then you'd want to use MacPerl for processing, and I'd assume it takes care of EOL characters. I have no experience with any Mac system though, so don't quote me on that.

On the other hand, if you want to work in RedHat Linux, then you may want to ensure that the EOL characters are "\n" only, before running any of those shell scripts.

You mentioned that those files exist as ".xlsx" files, i.e. MS Excel 2007 or higher. In that case, saving them as "tab delimited files" should be pretty straightforward.

tyler_durden

Last edited by durden_tyler; 09-14-2011 at 11:46 PM..
# 16  
Old 09-15-2011
to ahamed101 and tyler_durden

Hi Ahamed and tyler_durden

I think you are right that there is definitely a problem with the files.

Thank you for doing that analysis - way over my beginner unix head!

I performed the following:
Code:
$ grep -c chr01 file2.txt 
1
$ grep -c chr01 file1.txt 
0

And there should be ~7000 in each file...

So this has turned into a much worse problem than I thought. I'm guessing that all of the scripts described earlier in this thread will work properly once I work out how to convert my .xlsx files into tab-delimited text or .csv without all of the data being joined into one line.

The examples I provided (file1.txt and file2.txt) are only about 10% of the size of the actual files (I couldn't upload bigger files). I generated these tab-delimited files with the Save As feature in Microsoft Excel. I have no idea why it would write all of the data into one line.

I cannot Save As .xls or .xml (which would then be easy to convert to .txt or .csv) because there are too many lines (~90000) in the original file. I also cannot open the .xlsx on RedHat because it only recognises it as a zip file!

I feel stuck.
# 17  
Old 09-15-2011
You may want to clean up the files you posted and then try the scripts on them.

I do hope you have Perl on your RedHat Linux box.
If not, then the following suggestion won't work.
Otherwise, do this:

(1) Download the files "file1.txt" and "file2.txt" you attached in your post, to your RedHat Linux system. Put them in a new/freshly created directory.

(2) Back them up first, using the following commands, which will create copies of the two files:

Code:
cp file1.txt file1.txt.orig
cp file2.txt file2.txt.orig

(3) Now clean up file1.txt using the following commands:

Code:
perl -lne 'BEGIN {$x = chr(0); $y=chr(254); $z=chr(255)} s/$x//g; s/$y//g; s/$z//g; s/\r//g; print' file1.txt > file1.txt.new
mv file1.txt.new file1.txt

The Perl one-liner shown above strips all bytes with the values 0, 254 and 255 (these are not printable ASCII characters). Then it removes all "\r" characters as well. Hopefully those were the only offending characters. The output is redirected to "file1.txt.new", which is then renamed back to "file1.txt".

So, we should be left with a "file1.txt" that has the Linux EOL character "\n" and no non-printable characters.
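
If iconv is installed, an alternative to stripping bytes by hand is to convert the encoding properly. This is only a sketch, and it rests on the assumption that file1.txt really is UTF-16 with a byte order mark, which is what the leading 377 376 bytes suggest. The name utf16_to_unix and the sample file are hypothetical.

```shell
# Convert UTF-16 (endianness taken from the byte order mark) to UTF-8,
# then delete carriage returns so that only "\n" remains as EOL.
utf16_to_unix() {
    iconv -f UTF-16 -t UTF-8 "$1" | tr -d '\r'
}

# Hypothetical sample: byte order mark, "chr", then "\r\n",
# with a NUL byte after every character, as in the octal dump of file1.txt.
sample16=$(mktemp)
printf '\377\376c\000h\000r\000\r\000\n\000' > "$sample16"

utf16_to_unix "$sample16" > "$sample16.new"
od -c "$sample16.new"
```

For the real file that would be utf16_to_unix file1.txt > file1.txt.new, followed by the same mv as in step (3). Unlike deleting bytes 0, 254 and 255 wherever they occur, this should also keep any non-ASCII characters intact.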

(4) Next, clean up file2.txt using the following command:

Code:
perl -plne 's/\r/\n/g' file2.txt > file2.txt.new
mv file2.txt.new file2.txt

This one simply replaces every "\r" character with "\n", which is the EOL character for Linux.
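
The same replacement can be done without Perl, in case it isn't installed. Here is a sketch using tr; cr_to_lf and the sample file are made-up names, and the sample just imitates two records of file2.txt.

```shell
# Replace every carriage return with a line feed.
cr_to_lf() {
    tr '\r' '\n' < "$1"
}

# Hypothetical sample: a header and one data record, each terminated by "\r".
mac_sample=$(mktemp)
printf 'chr_name\tchr_start\rchr01\t14930\r' > "$mac_sample"

cr_to_lf "$mac_sample" > "$mac_sample.new"

# Count the lines in the converted copy.
wc -l < "$mac_sample.new"
```

For the real file that would be cr_to_lf file2.txt > file2.txt.new, followed by the same mv as above.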

If everything has worked fine till now, then you should be left with the following files in your directory:

Code:
file1.txt  <== cleansed file
file1.txt.orig <== original corrupted file
file2.txt <== cleansed file
file2.txt.orig <== original corrupted file

You may now want to go back and create the shell scripts posted earlier and test those. They will process the files "file1.txt" and "file2.txt" and create a new file in the current directory.
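
Before running those scripts, a quick sanity check on the cleansed files may save some confusion. The sketch below reports the line count, the number of lines containing "chr01" (the same grep used earlier in the thread), and how many NUL or "\r" bytes are left over. check_file and the sample data are hypothetical and only illustrate the idea.

```shell
# Report line count, number of lines matching chr01, and leftover NUL/CR bytes.
check_file() {
    f=$1
    lines=$(wc -l < "$f" | tr -d ' ')
    hits=$(grep -c 'chr01' "$f" || true)   # grep -c prints 0 but exits 1 on no match
    bad=$(tr -cd '\000\r' < "$f" | wc -c | tr -d ' ')
    echo "$f: lines=$lines chr01_lines=$hits leftover_NUL_or_CR=$bad"
}

# Hypothetical cleansed sample: a header plus two data records with "\n" EOL.
clean=$(mktemp)
printf 'chr_name\tchr_start\nchr01\t14930\nchr02\t500\n' > "$clean"

check_file "$clean"
```

On the real data that would be check_file file1.txt and check_file file2.txt; a line count of 0 or 1, or a non-zero leftover count, would mean the cleanup did not work as hoped.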

tyler_durden