Help with generating a script

09-14-2011

Ok, I think ahamed's observation is true. The files are not true ASCII text files.

So I downloaded both files to my Windows machine and used Cygwin Bash to poke around. Here's what I see:

Code:

$
$
$ # print the first line of file1.txt
$
$ head -1 file1.txt
▒▒c h r _ n a m e        c h r _ s t a r t       c h r _ e n d   r e f _ b a s e         a l t _ b a s e         h o m _ h e t   s n p _ q u a l i t y   t o t _ dqa
$
$
$

The first character looks quite unusual, and there's a space between each character e.g. "c" + space + "h" + space + "r" instead of "chr".
The octal dump of the first line shows this:

Code:

$
$ # octal dump of the first line of file1.txt
$
$ head -1 file1.txt | od -bc
0000000 377 376 143 000 150 000 162 000 137 000 156 000 141 000 155 000
        377 376   c  \0   h  \0   r  \0   _  \0   n  \0   a  \0   m  \0
0000020 145 000 011 000 143 000 150 000 162 000 137 000 163 000 164 000
          e  \0  \t  \0   c  \0   h  \0   r  \0   _  \0   s  \0   t  \0
0000040 141 000 162 000 164 000 011 000 143 000 150 000 162 000 137 000
          a  \0   r  \0   t  \0  \t  \0   c  \0   h  \0   r  \0   _  \0
0000060 145 000 156 000 144 000 011 000 162 000 145 000 146 000 137 000
          e  \0   n  \0   d  \0  \t  \0   r  \0   e  \0   f  \0   _  \0
0000100 142 000 141 000 163 000 145 000 011 000 141 000 154 000 164 000
          b  \0   a  \0   s  \0   e  \0  \t  \0   a  \0   l  \0   t  \0
0000120 137 000 142 000 141 000 163 000 145 000 011 000 150 000 157 000
          _  \0   b  \0   a  \0   s  \0   e  \0  \t  \0   h  \0   o  \0
0000140 155 000 137 000 150 000 145 000 164 000 011 000 163 000 156 000
          m  \0   _  \0   h  \0   e  \0   t  \0  \t  \0   s  \0   n  \0
0000160 160 000 137 000 161 000 165 000 141 000 154 000 151 000 164 000
          p  \0   _  \0   q  \0   u  \0   a  \0   l  \0   i  \0   t  \0
0000200 171 000 011 000 164 000 157 000 164 000 137 000 144 000 145 000
          y  \0  \t  \0   t  \0   o  \0   t  \0   _  \0   d  \0   e  \0
0000220 160 000 164 000 150 000 011 000 141 000 154 000 164 000 137 000
          p  \0   t  \0   h  \0  \t  \0   a  \0   l  \0   t  \0   _  \0
0000240 144 000 145 000 160 000 164 000 150 000 011 000 144 000 142 000
          d  \0   e  \0   p  \0   t  \0   h  \0  \t  \0   d  \0   b  \0
0000260 123 000 116 000 120 000 011 000 144 000 142 000 123 000 116 000
          S  \0   N  \0   P  \0  \t  \0   d  \0   b  \0   S  \0   N  \0
0000300 120 000 061 000 063 000 061 000 011 000 162 000 145 000 147 000
          P  \0   1  \0   3  \0   1  \0  \t  \0   r  \0   e  \0   g  \0
0000320 151 000 157 000 156 000 011 000 147 000 145 000 156 000 145 000
          i  \0   o  \0   n  \0  \t  \0   g  \0   e  \0   n  \0   e  \0
0000340 011 000 143 000 150 000 141 000 156 000 147 000 145 000 011 000
         \t  \0   c  \0   h  \0   a  \0   n  \0   g  \0   e  \0  \t  \0
0000360 141 000 156 000 156 000 157 000 164 000 141 000 164 000 151 000
          a  \0   n  \0   n  \0   o  \0   t  \0   a  \0   t  \0   i  \0
0000400 157 000 156 000 011 000 144 000 142 000 123 000 116 000 120 000
          o  \0   n  \0  \t  \0   d  \0   b  \0   S  \0   N  \0   P  \0
0000420 061 000 063 000 062 000 011 000 061 000 060 000 060 000 060 000
          1  \0   3  \0   2  \0  \t  \0   1  \0   0  \0   0  \0   0  \0
0000440 147 000 145 000 156 000 157 000 155 000 145 000 163 000 011 000
          g  \0   e  \0   n  \0   o  \0   m  \0   e  \0   s  \0  \t  \0
0000460 141 000 154 000 154 000 145 000 154 000 145 000 040 000 146 000
          a  \0   l  \0   l  \0   e  \0   l  \0   e  \0      \0   f  \0
0000500 162 000 145 000 161 000 015 000 012
          r  \0   e  \0   q  \0  \r  \0  \n
0000511
$
$

So the first two bytes are octal 377 and 376, i.e. decimal 255 and 254 (hex FF FE). That pair is the byte-order mark of UTF-16 little-endian text. Also, each character is followed by a byte with value 0, i.e. chr(0), which is the second byte of each two-byte UTF-16 character. It is seen as "\0" in the octal dump above.
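
The check for those leading bytes can be scripted. The following is only a sketch: `detect_bom` is a made-up helper name, and it looks for just the three common byte-order marks in the first three bytes of its input.

```shell
# Hypothetical helper: name the byte-order mark (if any) at the start of stdin.
detect_bom() {
    # Dump the first three bytes as hex and squeeze out the whitespace od adds.
    sig=$(od -An -tx1 -N3 | tr -d ' \t\n')
    case "$sig" in
        efbbbf*) echo "UTF-8" ;;
        fffe*)   echo "UTF-16LE" ;;
        feff*)   echo "UTF-16BE" ;;
        *)       echo "none" ;;
    esac
}

# The first bytes from the dump above: 377 376 c \0 h \0
printf '\377\376c\000h\000' | detect_bom
```

On a real file it would be called as `detect_bom < file1.txt`.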

The newline or End-of-Line (EOL) character should be "\r\n" for Windows and "\n" for Unix/Linux (classic Mac OS used "\r", while Mac OS X uses "\n"). In file1.txt the line ends in the bytes "\r", "\0", "\n" (octal 015 000 012 at the end of the dump above) rather than a plain "\r\n" or "\n", and the "\0" bytes are scattered through every line, which would confuse awk or Perl.

The other file - "file2.txt" appears to have "\r" characters as EOL.

Code:

$
$
$ # does "file2.txt" have any "\n" characters?
$
$ cat file2.txt | perl -lne '$count = s/\n//g; print "Number of \\n characters = $count"'
Number of \n characters =
$
$
$ # does "file2.txt" have any "\r" characters?
$
$ cat file2.txt | perl -lne '$count = s/\r//g; print "Number of \\r characters = $count"'
Number of \r characters = 13421
$
$
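
One caveat about the check above: `perl -lne` reads the input line by line, and `-l` chomps the trailing "\n" before the substitution runs, so the "\n" count it prints is never a whole-file total. A count that does not depend on line splitting can be sketched with `tr` and `wc`. `count_byte` is a made-up helper name, and the sample data below only stands in for file2.txt.

```shell
# Hypothetical helper: count occurrences of one byte (given as a tr escape) on stdin.
count_byte() {
    # tr -cd deletes every byte except the one named; wc -c counts what is left.
    tr -cd "$1" | wc -c | tr -d ' '
}

# Made-up sample: three tab-separated records, each ended by "\r" only.
printf 'a\tb\rc\td\re\tf\r' | count_byte '\r'   # number of CR bytes
printf 'a\tb\rc\td\re\tf\r' | count_byte '\n'   # number of LF bytes
```

Against the real file this would be `count_byte '\r' < file2.txt` and `count_byte '\n' < file2.txt`.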

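The "\r" character is the "Carriage Return" character (from the good ol' days of the typewriter); when printed to a terminal it moves the cursor back to the start of the line, so the text that follows overwrites what was already printed. Since there is no "\n" between the records of file2.txt, Unix tools treat it as "one single line". The octal dump shows the difference clearly.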

Code:

$
$
$ # what's the first occurrence of "\r" in file2.txt?
$
$ perl -lne 'print index($_, "\r")' file2.txt
162
$
$ # and the second?
$
$ perl -lne 'print index($_, "\r", 163)' file2.txt
264
$
$ # print the first 160 characters of file2.txt
$
$ perl -lne 'print substr($_,0,160)' file2.txt
chr_name        chr_start       chr_end ref_base        alt_base        hom_het snp_quality     tot_depth       alt_depth       dbSNP   dbSNP131        region  geallele fres
$
$ # looks good, but print the first 200 characters now
$
$ perl -lne 'print substr($_,0,200)' file2.txt
chr01ame14930   14930tarA       Ghr_end het_base137     65t_base33      som_het snp_quality     tot_depth       alt_depth       dbSNP   dbSNP131        region  geallele freq
$
$
$ # doesn't look good because "\r" started overwriting the characters printed already
$ # od -bc shows it better; notice the "\r" (octal 015) at offset 0000242 below
$
$ perl -lne 'print substr($_,0,200)' file2.txt | od -bc
0000000 143 150 162 137 156 141 155 145 011 143 150 162 137 163 164 141
          c   h   r   _   n   a   m   e  \t   c   h   r   _   s   t   a
0000020 162 164 011 143 150 162 137 145 156 144 011 162 145 146 137 142
          r   t  \t   c   h   r   _   e   n   d  \t   r   e   f   _   b
0000040 141 163 145 011 141 154 164 137 142 141 163 145 011 150 157 155
          a   s   e  \t   a   l   t   _   b   a   s   e  \t   h   o   m
0000060 137 150 145 164 011 163 156 160 137 161 165 141 154 151 164 171
          _   h   e   t  \t   s   n   p   _   q   u   a   l   i   t   y
0000100 011 164 157 164 137 144 145 160 164 150 011 141 154 164 137 144
         \t   t   o   t   _   d   e   p   t   h  \t   a   l   t   _   d
0000120 145 160 164 150 011 144 142 123 116 120 011 144 142 123 116 120
          e   p   t   h  \t   d   b   S   N   P  \t   d   b   S   N   P
0000140 061 063 061 011 162 145 147 151 157 156 011 147 145 156 145 011
          1   3   1  \t   r   e   g   i   o   n  \t   g   e   n   e  \t
0000160 143 150 141 156 147 145 011 141 156 156 157 164 141 164 151 157
          c   h   a   n   g   e  \t   a   n   n   o   t   a   t   i   o
0000200 156 011 144 142 123 116 120 061 063 062 011 061 060 060 060 147
          n  \t   d   b   S   N   P   1   3   2  \t   1   0   0   0   g
0000220 145 156 157 155 145 163 011 141 154 154 145 154 145 040 146 162
          e   n   o   m   e   s  \t   a   l   l   e   l   e       f   r
0000240 145 161 015 143 150 162 060 061 011 061 064 071 063 060 011 061
          e   q  \r   c   h   r   0   1  \t   1   4   9   3   0  \t   1
0000260 064 071 063 060 011 101 011 107 011 150 145 164 011 061 063 067
          4   9   3   0  \t   A  \t   G  \t   h   e   t  \t   1   3   7
0000300 011 066 065 011 063 063 011 163 012
         \t   6   5  \t   3   3  \t   s  \n
0000311
$
$
$

Now if you are working with "file2.txt" in Mac OS, then you'd want to use MacPerl for processing, and I'd assume it takes care of EOL characters. I have no experience with any Mac system though, so don't quote me on that.

On the other hand, if you want to work in RedHat Linux, then you may want to ensure that the EOL characters are "\n" only, before running any of those shell scripts.
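
A minimal sketch of such an EOL clean-up using only `tr`, for single-byte text like file2.txt (not for the two-byte file1.txt). `to_unix_eol` is a hypothetical name, and the rule it applies (drop "\r" when the file already contains "\n", otherwise turn "\r" into "\n") is an assumption that fits Windows and classic Mac files but not files that mix the two styles.

```shell
# Hypothetical helper: rewrite a file so that every line ends in "\n" only.
# Usage: to_unix_eol INPUT OUTPUT
to_unix_eol() {
    # The substitution is left unquoted so that any padding from wc is dropped.
    if [ $(tr -cd '\n' < "$1" | wc -c) -gt 0 ]; then
        # LF present: treat the input as "\n" or "\r\n" text and drop the CRs.
        tr -d '\r' < "$1" > "$2"
    else
        # No LF at all: assume CR-only line endings and translate CR to LF.
        tr '\r' '\n' < "$1" > "$2"
    fi
}
```

For example, `to_unix_eol file2.txt file2.unix.txt` would leave the original untouched and write the converted copy next to it.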

You mentioned that those files exist as ".xlsx" files, i.e. MS Excel 2007 or higher. In that case, saving them as "tab delimited files" should be pretty straightforward.

tyler_durden

Last edited by durden_tyler; 09-14-2011 at 11:46 PM..

09-15-2011

to ahamed101 and tyler_durden

Hi Ahamed and tyler_durden

I think you are right that there is definitely a problem with the files.

Thank you for doing that analysis - way over my beginner unix head!

I performed the following:

Code:

$ grep -c chr01 file2.txt 
1
$ grep -c chr01 file1.txt 
0

And there should be ~7000 in each file...
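
Both counts are consistent with the dumps earlier in the thread. `grep -c` counts matching lines, not matches, so a file whose records are separated only by "\r" is one long line and gives at most 1. In file1.txt every letter is followed by a "\0" byte, so the contiguous string "chr01" never occurs. A small sketch with made-up sample data (not the real files):

```shell
# Three made-up records separated by "\r" only, as in file2.txt.
sample='chr01\t14930\rchr01\t15211\rchr02\t100\r'

# grep sees a single line, so the count is 1 however many records match.
printf "$sample" | grep -c chr01

# After translating "\r" to "\n" each record is its own line.
printf "$sample" | tr '\r' '\n' | grep -c chr01

# With a "\0" after every character, as in file1.txt, the pattern never matches.
# grep exits non-zero when nothing matches, hence the "|| true".
printf 'c\000h\000r\0000\0001\000' | grep -c chr01 || true
```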

So this has turned into a much worse problem than I thought. I'm guessing that all of the scripts described above in this thread will work properly once I work out how to correctly turn my .xlsx into tab-delimited text or .csv without all of the data being joined into one line.

So the examples I provided (file1.txt and file2.txt) are only about 10% of the size of the actual files (I couldn't upload bigger files). I generated these modified tab-delimited files using Microsoft Excel and using the Save As feature. I have no idea why it would write all of the data into one line.

I cannot Save As an .xls or .xml (which would then be easy to convert to a .txt or .csv) because there are too many lines (~90000) in the original file. I also cannot open the .xlsx on RedHat because it only recognises this as a zip file!

I feel stuck

kellywilliams

09-15-2011

You may want to clean up the files you posted and then try the scripts on them.

I do hope you have Perl in your RedHat Linux box.
If not, then the following suggestion won't work.
Otherwise, do this:

(1) Download the files "file1.txt" and "file2.txt" you attached in your post, to your RedHat Linux system. Put them in a new/freshly created directory.

(2) Back them up first, using the following commands, which will create copies of the two files:

Code:

cp file1.txt file1.txt.orig
cp file2.txt file2.txt.orig

(3) Now clean up file1.txt using the following commands:

Code:

perl -lne 'BEGIN {$x = chr(0); $y=chr(254); $z=chr(255)} s/$x//g; s/$y//g; s/$z//g; s/\r//g; print' file1.txt > file1.txt.new
mv file1.txt.new file1.txt

The Perl one-liner shown above strips all bytes with the values 0, 254 and 255. And then it removes all "\r" characters as well. Hopefully those were the only offending characters. The output is redirected to "file1.txt.new", which is then renamed back to "file1.txt".

So, we should be left with a "file1.txt" that has the Linux EOL character "\n" and no non-printable characters.

(4) Next, clean up file2.txt using the following command:

Code:

perl -plne 's/\r/\n/g' file2.txt > file2.txt.new
mv file2.txt.new file2.txt

This one simply replaces every "\r" character with "\n", which is the EOL character for Linux.

If everything has worked fine till now, then you should be left with the following files in your directory:

Code:

file1.txt  <== cleansed file
file1.txt.orig <== original corrupted file
file2.txt <== cleansed file
file2.txt.orig <== original corrupted file

You may now want to go back and create the shell scripts posted earlier and test those. They will process the files "file1.txt" and "file2.txt" and create a new file in the current directory.
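
The 377 376 prefix (hex FF FE) and the "\0" after every character in file1.txt are consistent with UTF-16 little-endian text, so a converter is an alternative to stripping bytes by value as in step (3). A minimal sketch, assuming `iconv` is installed, that the file really is UTF-16 with a byte-order mark, and that `iconv -f UTF-16` consumes that mark (as glibc's does); `utf16_to_unix` is a hypothetical name.

```shell
# Hypothetical helper: convert UTF-16 text (with BOM) on stdin to UTF-8 with "\n" line endings.
utf16_to_unix() {
    # iconv decodes the two-byte characters; tr then drops the "\r" of each "\r\n" pair.
    iconv -f UTF-16 -t UTF-8 | tr -d '\r'
}

# A shortened header in the same layout as the od dump: BOM, "chr", CR, LF.
printf '\377\376c\000h\000r\000\r\000\n\000' | utf16_to_unix
```

It would be run as `utf16_to_unix < file1.txt.orig > file1.txt`. Unlike deleting bytes 0, 254 and 255, this keeps any non-ASCII characters in the data intact.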

tyler_durden
