awk incorrect format


 
# 1  
Old 10-29-2018

I was wondering whether anyone has any idea what is happening here. I'm using simple code to compare two tab-delimited files based on their column 1 values. If the column 1 value of file1 exists in file2, I want to print the column 4 value from file2 in column 3 of file1. Here are the steps:



First, I have to produce file1 by concatenating columns 4 and 5 of the input file:


INPUT FILE:


Code:
1    52828739    rs12044739    C    T
1    52835713    rs72899818    T    C
1    52836736    rs10157619    G    A
1    52844478    rs6684941    A    G

Using this simple code, I reorder the columns of the INPUT file and concatenate columns 4 and 5:
Code:
awk -F '\t' ' {print $3,$1,$2,$4$5}' OFS="\t" INPUT > file1

This produces the following file1 which looks fine:
Code:
rs12044739    1    52828739    CT
rs72899818    1    52835713    TC
rs10157619    1    52836736    GA
rs6684941    1    52844478    AG

This is file2:


Code:
rs12044739    1    52828739    CC
rs72899818    1    52835713    TC
rs10157619    1    52836736    GG
rs6684941    1    52844478    AA


Using the code:


Code:
awk 'NR == FNR {REP [$1] = $4; next} $1 in REP {$3 = REP[$1]} 1' OFS="\t" file2 file1 > RESULTS

Yields the following output file, RESULTS, which so far seems to look fine:


Code:
rs12044739    1    CC    CT
rs72899818    1    TC    TC
rs10157619    1    GG    GA
rs6684941    1    AA    AG

For final processing we need to print all rows for which column3 = column4. So I used this simple code:
Code:
awk -F '\t' '{ if ($3 = $4) print $0}'  OFS="\t" RESULTS > RESULTS2

Thus RESULTS2 should look like this:


Code:
rs72899818    1    TC    TC

Instead what I get is this:


Code:
rs72899818    1    TC    

TC

Any ideas as to what is causing the column4 value to print to the following row?

Last edited by Geneanalyst; 10-29-2018 at 05:01 PM.. Reason: forgot a step
# 2  
Old 10-29-2018
I note that none of the data you posted looks like it came from strictly tab-separated files. I suspect rogue spaces and/or carriage returns in your data.

From your example file1 and file2 I get
Code:
rs12044739    1    52828739    CT
rs72899818      1       TT      TC
rs10157619    1    52836736    GA
rs6684941    1    52844478    AG
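
One way to check for the rogue characters suspected above is to look at the raw bytes. This is only a sketch: the sample file and the variable names (sample, cr_lines, bad_fields) are made up for illustration and are not the poster's data.

```shell
# Hypothetical sample: one line with a trailing carriage return, one clean line.
sample=$(mktemp)
printf 'rs72899818\t1\tTC\tTC\r\n' >  "$sample"
printf 'rs10157619\t1\tGG\tGA\n'   >> "$sample"

# od -c prints every byte; a carriage return shows up as \r and a tab as \t.
od -c "$sample"

# Count the lines that end in a carriage return.
cr_lines=$(awk '/\r$/ { n++ } END { print n + 0 }' "$sample")
echo "lines ending in CR: $cr_lines"

# Count the lines that do not have exactly 4 tab-separated fields.
bad_fields=$(awk -F '\t' 'NF != 4 { n++ } END { print n + 0 }' "$sample")
echo "lines without exactly 4 fields: $bad_fields"

rm -f "$sample"
```

A non-zero count in either check suggests the file is not the clean tab-separated UNIX text that the awk code expects.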

# 3  
Old 10-29-2018
Sorry, I forgot to add a step. I have edited my post to reflect this....
# 4  
Old 10-29-2018
How about this? It assumes that any invisible surprises like the ones Corona688 mentioned have been taken care of, and it is untested (I'm on a Windows machine), so bear with me:


Code:
awk -F "\t" 'NR == FNR {REP[$3] = $4$5; next} REP[$1] == $4 {$3 = REP[$1]; gsub (/[^ACGT]/, "", $3); print}' OFS="\t" INPUT file2


BTW, if ($3 = $4) should read if ($3 == $4), but this would not explain the additional line feed that you complain about.
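
To make the = versus == point concrete, here is a small sketch on two made-up rows shaped like RESULTS. The variable names (rows, assigned, compared) are only for illustration.

```shell
# Hypothetical two-row sample in the shape of RESULTS (tab-separated).
rows=$(printf 'rs12044739\t1\tCC\tCT\nrs72899818\t1\tTC\tTC\n')

# Assignment: $3 = $4 overwrites column 3 and is true for any non-empty
# non-numeric $4, so both rows are printed, with column 3 replaced.
assigned=$(printf '%s\n' "$rows" |
    awk -F '\t' -v OFS='\t' '{ if ($3 = $4) print $0 }')

# Comparison: $3 == $4 leaves the record untouched and keeps only the
# rows where the two columns are equal.
compared=$(printf '%s\n' "$rows" | awk -F '\t' '$3 == $4')

printf 'with =:\n%s\n\nwith ==:\n%s\n' "$assigned" "$compared"
```

With = the first row comes out as "CT CT" in the last two columns even though the input had "CC CT". With == only the rs72899818 row is printed.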
# 5  
Old 10-29-2018

Thanks Rudi, but that didn't produce the desired output RESULTS2. BTW, is there a quick way to clean up a file that may have hidden spaces and other such things, to make it a clean tab-delimited file?
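
A possible cleanup, sketched on a made-up messy line (the file and variable names are hypothetical). It assumes that no field is meant to contain a space and that there are no empty fields, because awk's default splitting would break up the former and swallow the latter.

```shell
# Hypothetical messy line: mixed spaces and tabs, a trailing blank, CRLF ending.
messy=$(mktemp)
printf 'rs72899818   1\t 52835713  TC \r\n' > "$messy"

# sub() drops a trailing carriage return. Assigning $1 = $1 then forces awk
# to rebuild the record: default field splitting ignores leading and trailing
# blanks and treats any run of spaces/tabs as one separator, and the fields
# are rejoined with OFS (a tab here).
clean=$(awk -v OFS='\t' '{ sub(/\r$/, ""); $1 = $1; print }' "$messy")

# Show the bytes of the cleaned line: single tabs, no \r, no stray spaces.
printf '%s\n' "$clean" | od -c

rm -f "$messy"
```

To clean a real file, the same awk command can read it and redirect the output to a new file.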
# 6  
Old 10-29-2018
What Rudi and Corona are saying: UNIX text files use different carriage control (line-ending) characters from Windows text files, such as tab-delimited Excel output.

To clean up Windows files, use the dos2unix command.

Carriage control for
UNIX: ASCII 10, written "\n", called a newline character.
Windows: ASCII 13 and ASCII 10, written "\r\n", called return and newline.

awk will misbehave on Windows text files. Most decent editors let you change UNIX <-> Windows at will. The UNIX dos2unix command does what you need when the file got onto the Linux box with bad carriage control. unix2dos goes the other way for you. Windows does not like UNIX carriage control, either. Tit for tat, I guess.
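
A small sketch of the effect on the column comparison from this thread, using a made-up one-line file (dos_file, unix_file, before and after are hypothetical names). It uses tr instead of dos2unix only because dos2unix is not installed everywhere.

```shell
# Hypothetical one-line sample saved with Windows (CRLF) line endings.
dos_file=$(mktemp)
unix_file=$(mktemp)
printf 'rs72899818\t1\tTC\tTC\r\n' > "$dos_file"

# Before cleanup the last field is "TC\r", so it is not equal to "TC".
before=$(awk -F '\t' '$3 == $4 { n++ } END { print n + 0 }' "$dos_file")

# tr -d '\r' deletes every carriage return. Where dos2unix is installed,
# running: dos2unix "$dos_file"  converts the file in place instead.
tr -d '\r' < "$dos_file" > "$unix_file"

after=$(awk -F '\t' '$3 == $4 { n++ } END { print n + 0 }' "$unix_file")
echo "matching rows before: $before, after: $after"

rm -f "$dos_file" "$unix_file"
```

The row only matches once the carriage return is gone, because the "\r" otherwise stays attached to the last field of every line.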
# 7  
Old 10-29-2018

That did the trick. One of the text files sent to me must have been processed on a Windows machine...