Column comparison between two files: moved from another post

10-04-2010

Thanks, everyone

(Danmero, Bartus11, Pravin27 (the Awk folks)!!! Kurumi (Ruby Wrangler!!) ygemici (Sed-lovers)!!...)

@Danmero, although my problem is solved, I am intrigued that your code doesn't work for my sample files (the files I posted first, i.e. Freq.txt and Pval.txt). Just so I understand better, could you please explain the code? I had made my sample files quickly in Windows Notepad. I then remade them using the vi editor and still had the same problems running your code. I tried to debug it myself but never figured it out. I do realize it must be something very mundane. Your code works fine for my actual files though!!

@kurumi....I have some problems running your Ruby script, but I will get back to you after I do some debugging myself.

@ygemici..Your sed script also gave me some problems, but again...I will get back to you after I try some tweaking myself.

csn

---------- Post updated 10-04-10 at 10:52 AM ---------- Previous update was 10-03-10 at 05:03 PM ----------

I think I figured out what the awk one liner under discussion does....

Code:

awk 'NR==FNR{a[$4]=$5;next}a[$1]{print $0"\t"a[$1]}' file1 file2

We set NR = FNR i.e., the current count (ordinal number) of record of 1st input file and second input file are same. Then when we say

Quote:

{a[$4]=$5}a[$1]

we are capturing the elements of field5 (i.e., of file1, $5), provided field 4 (i.e. file1;$4) matches the first field of file 2 ($1), in an associative array a[$1] ( the 'next' does not let the program do anything else with file1).

We then print all fields in the file 2 and of course separated by a tab the associative array that contains the elements of $5 (from file 1).

Quote:

{print $0"\t"a[$1]}

Since we set NR==FNR the array only has as many lines as in file2 (I think this statement of mine is wrong but I am not sure)

Even as a biologist I am beginning to get interested in the nitty-gritty of programming:

I think I like it. I hope to be able to do the same with the sed and Ruby scripts, but that is for another day. Even a simple awk one-liner has squeezed the maximum out of me.

So much power packed in one little statement.

please feel free to correct my understanding of this.

have a good day
csn

Last edited by cs_novice; 10-03-2010 at 07:15 PM..

cs_novice

10-04-2010

You got most of it right, except the "NR==FNR" part. It does not set either of these variables; it compares them. The comparison is true only while the 1st file is being processed, because NR keeps counting across all input while FNR is reset to 1 at the start of each file. This lets us build the associative array "a" from the contents of the 1st file and then look up values from the 2nd file in it.
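To see the comparison at work, here is a small sketch that prints NR and FNR for every record of two throwaway files. The file names f1.txt and f2.txt and their contents are made up for illustration; they are not the files from this thread.

```shell
# Create two tiny sample files (hypothetical names and contents).
tmp=$(mktemp -d)
printf 'a\nb\n' > "$tmp/f1.txt"
printf 'x\ny\nz\n' > "$tmp/f2.txt"

# NR counts records across all input; FNR restarts at 1 for each file.
# NR == FNR is therefore true only while the first file is being read.
out=$(awk '{ print $0, NR, FNR, (NR == FNR ? "first-file" : "second-file") }' \
      "$tmp/f1.txt" "$tmp/f2.txt")

printf '%s\n' "$out"
rm -rf "$tmp"
```

One caveat: if the first file is empty, NR==FNR is also true while the second file is read, so the idiom assumes the first file has at least one line.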

bartus11

10-04-2010

Column comparison between two files

Code:

# Join files x and y on their first column; matches are appended to "mtch".
while read -r a d _            # key and 2nd field of each line in y
do
    while read -r b c _        # key and 2nd field of each line in x
    do
        # String comparison: -eq only works when both keys are integers
        if [ "$a" = "$b" ]
        then
            echo "$a $c $d" >> mtch
            echo "matching"
        else
            echo "not matching"
        fi
    done < x
done < y

Last edited by Scott; 10-04-2010 at 02:52 PM.. Reason: Code tags

lnviyyapu

10-06-2010

Quote:

Originally Posted by bartus11

Thanks...I now understand this a lot better!! These are all logic statements...it is either true or false!!
will keep updating as I get better at this.
csn

cs_novice

10-10-2010

a bug(?) in the awk one liner for column comparison

Hello Friends
I have been using this awk one liner

Code:

           awk 'NR==FNR{a[$4]=$5}a[$1]{print $0"\t"a[$1]}' Gene_Count.txt Pval.txt

to compare field 4 of the file Gene_Count.txt to field 1 of Pval.txt, extract field 5 of Gene_Count.txt, and print it alongside all columns of Pval.txt.

I know that I have already discussed the files quite a bit; however, for the sake of completeness I have included the slightly modified files to illustrate the problem.
Gene_Count.txt

Code:

CHR    START    END    Transc_ID    READ_COUNT    BASES_COV
    
chr1      268430147    268436813    ID=GRMZM2G015073_T01      362   4027
chr1      16776238      16779559    ID=GRMZM2G445588_T01      0     0
chr1      109050562     109054042    ID=GRMZM2G356344_T01      85    123
chr1      243260011     243280610    ID=GRMZM2G044740_T01      77    1480
chr1      260039640     260047849    ID=GRMZM2G420436_T01      13    1447
chr1      15724186      15728999    ID=GRMZM2G119852_T01      1032    1906
chr1      19922021      19924137    ID=AC166636.1_FGT010      3    89

Pval.txt (note this now also includes the ID that has a zero count in field 5 of Gene_Count.txt)

Code:

Transc_ID    DP    Pval.cross
ID=GRMZM2G015073_T01    23.6044288292005    0.0206790394438121
ID=GRMZM2G445588_T01    2.42080832941224    0.566356492613311
ID=GRMZM2G356344_T01    31.0575268969536    0.489032543538082
ID=GRMZM2G044740_T01    8.33858514064342    0.125869127182036
ID=GRMZM2G420436_T01    4.08274762082918    0.0214579269824967
ID=GRMZM2G119852_T01    59.7782287606723    0.0372160593886689
ID=AC166636.1_FGT010    1.18004103601881    0.0180008630009030
ID=GRMZM2G100242_T02    61.4167813736184    0.0142003131557532
ID=GRMZM2G180458_T01    19.7051930517752    0.0643166007561127

On using the awk command given above I get this output:

Pval_count

Code:

Transc_ID    DP    Pval.cross    READ_COUNT
ID=GRMZM2G015073_T01    23.6044288292005    0.0206790394438121    362
ID=GRMZM2G356344_T01    31.0575268969536    0.489032543538082    85
ID=GRMZM2G044740_T01    8.33858514064342    0.125869127182036    77
ID=GRMZM2G420436_T01    4.08274762082918    0.0214579269824967    13
ID=GRMZM2G119852_T01    59.7782287606723    0.0372160593886689    1032
ID=AC166636.1_FGT010    1.18004103601881    0.0180008630009030    3

This is great until I noticed that row #2 of Gene_Count.txt, whose field 5 ($5) is '0', is thrown out:

Code:

chr1      16776238      16779559    ID=GRMZM2G445588_T01      0     0

I have noticed that all records where field 5 of 'Gene_Count.txt' is '0' are thrown out, although this seems to defy logic (at least as far as I understand). I need the records even if the field 5 value is '0'.

Could anyone please help me with this?

thanks
CSN

cs_novice

10-10-2010

Try this one:

Code:

awk 'NR==FNR{a[$4]=$5_}a[$1]{print $0"\t"a[$1]}' Gene_Count.txt Pval.txt
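As far as I can tell, the reason this helps is that a[$1] used as a pattern is tested for truth, and awk treats a stored field that looks like the number 0 as false, so rows whose READ_COUNT is 0 never print. Appending the unset variable _ (an empty string) turns the stored value into the string "0", and any non-empty string counts as true. The sketch below compares three forms on cut-down stand-ins for the two files; the file names count.txt and pval.txt and the shortened IDs are made up for illustration. The third form, which tests key membership with "in" instead of testing the stored value, is an alternative I am suggesting, not something from the post above.

```shell
tmp=$(mktemp -d)
# Stand-in for Gene_Count.txt: key in field 4, count in field 5.
printf 'chr1 1 2 ID=A 362 4027\nchr1 3 4 ID=B 0 0\n' > "$tmp/count.txt"
# Stand-in for Pval.txt: key in field 1; ID=C has no count at all.
printf 'ID=A 23.6 0.02\nID=B 2.4 0.56\nID=C 61.4 0.01\n' > "$tmp/pval.txt"

# 1. Test the stored value: a count of 0 is false, so ID=B is dropped.
by_value=$(awk 'NR==FNR{a[$4]=$5;next} a[$1]{print $0"\t"a[$1]}' \
           "$tmp/count.txt" "$tmp/pval.txt")

# 2. Store the count as a string: "0" is non-empty, hence true.
by_string=$(awk 'NR==FNR{a[$4]=$5 _;next} a[$1]{print $0"\t"a[$1]}' \
            "$tmp/count.txt" "$tmp/pval.txt")

# 3. Test membership: ($1 in a) is true for every key from the first file,
#    whatever value is stored, and it does not create new array entries.
by_key=$(awk 'NR==FNR{a[$4]=$5;next} ($1 in a){print $0"\t"a[$1]}' \
         "$tmp/count.txt" "$tmp/pval.txt")

printf 'by value:\n%s\n\nby string:\n%s\n\nby key:\n%s\n' \
       "$by_value" "$by_string" "$by_key"
rm -rf "$tmp"
```

In all three forms ID=C is left out, because it never appears in the first file.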

danmero
