Join not working properly


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Join not working properly
# 1  
Old 12-13-2013
Join not working properly

I want to join two files , with file 1 col 3 and file 2 col 1 as key.
The join command is erratic for some reason. File 2 is a master file having all the names, and file 1 has some values. I want to add the names from fil2 in file 1. If I use the original master file, some output is missing.

For example Medtr1g004990 exists in the master file but does not come in the output.

However, if just use a truncated master file that has 5 records including Medtr1g004990, the output seems to be correct.

I have used sorted files also , same problem. Please help solve this, if join doesnt work,please let me know if some similar command to join would.

I have attached the original master file and pasted the truncated one.



File 1
Code:
# more Medtr1g006600.exp
XLOC_000005     XLOC_000005     Medtr1g004990   chr1:35909-40554        q1      q2      OK      0.520378        6.91484 3.73206 6.85797 5e-05   0.000126299     yes     down
XLOC_000006     XLOC_000006     Medtr1g006490   chr1:44429-46280        q1      q2      OK      16.1083 122.606 2.92814 10.2969 5e-05   0.000126299     yes     down
XLOC_000008     XLOC_000008     Medtr1g006600   chr1:51360-54977        q1      q2      OK      6.94505 3.84361 -0.853525       -2.49824        0.0001  0.000244358     yes     up
XLOC_000010     XLOC_000010     Medtr1g006660   chr1:70777-71741        q1      q2      OK      1.15476 2.47771 1.10142 2.07776 0.0045  0.00841718      yes     down
XLOC_000014     XLOC_000014     Medtr1g006975   chr1:129007-136403      q1      q2      OK      0.389401        0.166262        -1.2278 -2.00092        0.0017  0.00343409      yes     up

File 2 (truncated from attached file)
Code:
# more Medtr1g006600.annot
Medtr1g004990   casein kinase
Medtr1g006490   major intrinsic protein %28MIP%29 family transporter
Medtr1g006590   tonoplast intrinsic protein
Medtr1g006600   exostosin family protein
Medtr1g006605   hypothetical protein
Medtr1g006660   AP2 domain class transcription factor

Command and output with master file

Code:
 # join -a1 -1 3 -2 1  Medtr1g006600.exp mt4.genenames.txt
Medtr1g004990 XLOC_000005 XLOC_000005 chr1:35909-40554 q1 q2 OK 0.520378 6.91484 3.73206 6.85797 5e-05 0.000126299 yes down
Medtr1g006490 XLOC_000006 XLOC_000006 chr1:44429-46280 q1 q2 OK 16.1083 122.606 2.92814 10.2969 5e-05 0.000126299 yes down
Medtr1g006600 XLOC_000008 XLOC_000008 chr1:51360-54977 q1 q2 OK 6.94505 3.84361 -0.853525 -2.49824 0.0001 0.000244358 yes up exostosin family protein
Medtr1g006660 XLOC_000010 XLOC_000010 chr1:70777-71741 q1 q2 OK 1.15476 2.47771 1.10142 2.07776 0.0045 0.00841718 yes down AP2 domain class transcription factor
Medtr1g006975 XLOC_000014 XLOC_000014 chr1:129007-136403 q1 q2 OK 0.389401 0.166262 -1.2278 -2.00092 0.0017 0.00343409 yes up disease resistance protein %28CC-NBS-LRR class%29 family protein

Command and output with truncated file

Code:
# join -a1 -1 3 -2 1  Medtr1g006600.exp Medtr1g006600.annot 
Medtr1g004990 XLOC_000005 XLOC_000005 chr1:35909-40554 q1 q2 OK 0.520378 6.91484 3.73206 6.85797 5e-05 0.000126299 yes down casein kinase
Medtr1g006490 XLOC_000006 XLOC_000006 chr1:44429-46280 q1 q2 OK 16.1083 122.606 2.92814 10.2969 5e-05 0.000126299 yes down major intrinsic protein %28MIP%29 family transporter
Medtr1g006600 XLOC_000008 XLOC_000008 chr1:51360-54977 q1 q2 OK 6.94505 3.84361 -0.853525 -2.49824 0.0001 0.000244358 yes up exostosin family protein
Medtr1g006660 XLOC_000010 XLOC_000010 chr1:70777-71741 q1 q2 OK 1.15476 2.47771 1.10142 2.07776 0.0045 0.00841718 yes down AP2 domain class transcription factor
Medtr1g006975 XLOC_000014 XLOC_000014 chr1:129007-136403 q1 q2 OK 0.389401 0.166262 -1.2278 -2.00092 0.0017 0.00343409 yes up

# 2  
Old 12-13-2013
Using your original files, I get this:
Code:
join -a1 -1 3 -2 1  file /tmp/mt4.genenames.txt
Medtr1g004990 XLOC_000005 XLOC_000005 chr1:35909-40554 q1 q2 OK 0.520378 6.91484 3.73206 6.85797 5e-05 0.000126299 yes down
Medtr1g006490 XLOC_000006 XLOC_000006 chr1:44429-46280 q1 q2 OK 16.1083 122.606 2.92814 10.2969 5e-05 0.000126299 yes down
join: /tmp/mt4.genenames.txt:6: is not sorted: Medtr1g004980    hypothetical protein
Medtr1g006600 XLOC_000008 XLOC_000008 chr1:51360-54977 q1 q2 OK 6.94505 3.84361 -0.853525 -2.49824 0.0001 0.000244358 yes up exostosin family protein
etc.

After sort /tmp/mt4.genenames.txt >/tmp/MED, I get
Code:
join -a1 -1 3 -2 1  file /tmp/MED 
Medtr1g004990 XLOC_000005 XLOC_000005 chr1:35909-40554 q1 q2 OK 0.520378 6.91484 3.73206 6.85797 5e-05 0.000126299 yes down casein kinase
Medtr1g006490 XLOC_000006 XLOC_000006 chr1:44429-46280 q1 q2 OK 16.1083 122.606 2.92814 10.2969 5e-05 0.000126299 yes down major intrinsic protein %28MIP%29 family transporter
Medtr1g006600 XLOC_000008 XLOC_000008 chr1:51360-54977 q1 q2 OK 6.94505 3.84361 -0.853525 -2.49824 0.0001 0.000244358 yes up exostosin family protein
Medtr1g006660 XLOC_000010 XLOC_000010 chr1:70777-71741 q1 q2 OK 1.15476 2.47771 1.10142 2.07776 0.0045 0.00841718 yes down AP2 domain class transcription factor
Medtr1g006975 XLOC_000014 XLOC_000014 chr1:129007-136403 q1 q2 OK 0.389401 0.166262 -1.2278 -2.00092 0.0017 0.00343409 yes up disease resistance protein %28CC-NBS-LRR class%29 family protein

, which is close to what you need, I believe.
This User Gave Thanks to RudiC For This Post:
# 3  
Old 12-13-2013
Thanks, sorting solves the problem. I was previously sorting with -k3,3 which doesnt work but -k3 works.
Actually another problem is with some of the data.
Some of the key field in file 1 are concatenated by a comma

For example

Code:
XLOC_017038     XLOC_017038     Medtr3g056615,Medtr3g056620     chr3:22631993-22636677  q1      q2      OK      0.588602        0.125583        -2.22866        -2.09056        0.0099  0.0372138       yes     up

which actually has Medtr3g056615 and Medtr3g056620 , is it possible to look up both names from the master file and output names separated by a '"," ?

The corresponding entries in the master files are
Code:
Medtr3g056615   NB-ARC domain disease resistance protein
Medtr3g056620   NB-ARC domain disease resistance protein

So the output for these lines should be

Code:
Medtr3g056615,Medtr3g056620 XLOC_017038 XLOC_017038 chr3:22631993-22636677 q1 q2 OK 0.588602 0.125583 -2.22866 -2.09056 0.0099 0.0372138 yes up NB-ARC domain disease resistance protein,NB-ARC domain disease resistance protein

# 4  
Old 12-13-2013
Try this - you may want to add some formatting, but in principle, it works:
Code:
awk     'NR==FNR        {T[$1]=($1="")$0; next}
                        {C=split ($3,I,",")
                         n=NF+1
                         for (i=1; i<=C; i++) $n=$n($n?",":"")T[I[i]]}
         1
        ' /tmp/mt4.genenames.txt file1
XLOC_000005 XLOC_000005 Medtr1g004990 chr1:35909-40554 q1 q2 OK 0.520378 6.91484 3.73206 6.85797 5e-05 0.000126299 yes down  casein kinase
XLOC_000006 XLOC_000006 Medtr1g006490 chr1:44429-46280 q1 q2 OK 16.1083 122.606 2.92814 10.2969 5e-05 0.000126299 yes down  major intrinsic protein %28MIP%29 family transporter
XLOC_000008 XLOC_000008 Medtr1g006600 chr1:51360-54977 q1 q2 OK 6.94505 3.84361 -0.853525 -2.49824 0.0001 0.000244358 yes up  exostosin family protein
XLOC_000010 XLOC_000010 Medtr1g006660 chr1:70777-71741 q1 q2 OK 1.15476 2.47771 1.10142 2.07776 0.0045 0.00841718 yes down  AP2 domain class transcription factor
XLOC_000014 XLOC_000014 Medtr1g006975 chr1:129007-136403 q1 q2 OK 0.389401 0.166262 -1.2278 -2.00092 0.0017 0.00343409 yes up  disease resistance protein %28CC-NBS-LRR class%29 family protein
XLOC_017038 XLOC_017038 Medtr3g056615,Medtr3g056620 chr3:22631993-22636677 q1 q2 OK 0.588602 0.125583 -2.22866 -2.09056 0.0099 0.0372138 yes up  NB-ARC domain disease resistance protein, NB-ARC domain disease resistance protein

This User Gave Thanks to RudiC For This Post:
# 5  
Old 12-16-2013
Ok, I used the code removing a $n from line 4,
Code:
awk     'NR==FNR        {T[$1]=($1="")$0; next}                         
                                        {C=split ($3,I,",")                         
 n=NF+1                          for (i=1; i<=C; i++) $n=($n?",":"")T[I[i]]}       
   1        
 ' /tmp/mt4.genenames.txt file1

The output doesnt pick up anything from genenames, it is the same as the input.

Code:
#   awk     'NR==FNR        {T[$1]=($1="")$0; next}
>                           {
>                           C=split ($3,I,",")
>                        n = NF+1
>                            for (i=1; i<=C; i++) $n=($n?",":"")T[I[i]]}
>            1
>         ' mt4.genenames.sorted.txt file1 | head
XLOC_048267 XLOC_048267 Medtr0001s0120 scaffold0001:49523-50287 q1 q2 OK 1.84679 0 -inf nan 5e-05 0.000126299 yes up
XLOC_048258 XLOC_048258 Medtr0001s0190 scaffold0001:64218-69070 q1 q2 OK 3.61301 0 -inf nan 5e-05 0.000126299 yes up
XLOC_048271 XLOC_048271 Medtr0001s0430 scaffold0001:188781-195705 q1 q2 OK 9.48928 3.76233 -1.33467 -4.06109 5e-05 0.000126299 yes up
XLOC_048260 XLOC_048260 Medtr0001s0490 scaffold0001:397820-401891 q1 q2 OK 10.303 4.14216 -1.31462 -3.77643 5e-05 0.000126299 yes up
XLOC_048263 XLOC_048263 Medtr0001s0570 scaffold0001:430832-437206 q1 q2 OK 7.21593 4.49932 -0.681479 -2.09584 0.00025 0.00057981 yes up
XLOC_048274 XLOC_048274 Medtr0001s0660 scaffold0001:502646-507542 q1 q2 OK 17.1506 2.11969 -3.01634 -7.77595 5e-05 0.000126299 yes up
XLOC_048321 XLOC_048321 Medtr0002s0270 scaffold0002:187389-188523 q1 q2 OK 0.232771 4.20249 4.17426 5.6741 0.0001 0.000244358 yes down
XLOC_048337 XLOC_048337 Medtr0002s0900 scaffold0002:341925-343403 q1 q2 OK 69.3725 28.7422 -1.27119 -4.90762 5e-05 0.000126299 yes up
XLOC_048340 XLOC_048340 Medtr0002s1060 scaffold0002:400363-401875 q1 q2 OK 1.527 160.669 6.71725 14.7088 5e-05 0.000126299 yes down
XLOC_048308 XLOC_048308 Medtr0002s1200 scaffold0002:454553-457418 q1 q2 OK 11.0499 0.148847 -6.21406 -6.09651 0.00165 0.00334041 yes up

Something is missing in the code in this line, after the second $n
Code:
for (i=1; i<=C; i++) $n=$n($n?",":"")T[I[i]]}


Last edited by ritakadm; 12-16-2013 at 12:14 PM..
# 6  
Old 12-16-2013
Don't remove anything, it needs the code as presented. Nothing is missing in that line which uses a conditional assignment; it appends the T[I[i]] array elements to the new last field $n.
Test the code with a small subset of both files; come back with error msgs on failure.
# 7  
Old 12-16-2013
The error is the following with any subset of data, or for that matter any filename that i provide, it is not getting to the files at all.

Code:
awk: cmd. line:3: (FILENAME=file1 FNR=1) fatal: function `n' not defined

with code

Code:
awk     'NR==FNR {T[$1]=($1="")$0; next}
                        {C=split ($3,I,",")
                         n= $NF+1
                         for (i=1; i<=C; i++) $n=$n($n?",":"")T[I[i]]}
         1
        ' mt4.genenames.sorted.txt file1

Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Expansion not working properly

I'm using an Ubuntu machine and expansion is not working properly. What would cause this? Do I need to check for any particular bash packages? $ ipcs -m | grep $USER | awk '{printf "%s ",$2}' $ ipcs -m | grep UNF | awk '{printf "%s ",$2}' 294912 1048577 425986 688131 786436 1245189... (14 Replies)
Discussion started by: cokedude
14 Replies

2. Shell Programming and Scripting

Why is sort not working properly here ?

Platform: RHEL 5.4 In the below text file I have strings like following. $ cat /tmp/mytextfile.txt DISK1 DISK10 DISK101 DISK102 DISK103 DISK104 DISK105 DISK106 DISK107 DISK108 DISK109 DISK110 DISK111 DISK112 DISK113 DISK114 (8 Replies)
Discussion started by: kraljic
8 Replies

3. UNIX for Dummies Questions & Answers

~c is not working properly with -r option

Hi There, --------- file1 ------- ~c asd@ac.com -------------- Now i am using below command cat file1|mailx -s " testing" -r " My Name" abc@tech.com (3 Replies)
Discussion started by: Tapan Sharma
3 Replies

4. Linux

rexec not working properly

Hi, I am trying to enable rexec to automate certain tasks(it has to be rexec, not ssh or any other due to the system environment), so after switching to linux, I followed the certain instructions that were laid out in the web. My operating system is fedora 17, so I first installed the... (1 Reply)
Discussion started by: wringer
1 Replies

5. Shell Programming and Scripting

mailx not working properly

I am using mailx command in my script to attach a file and send an email. I need to attach a csv file and send email to a mail id - I am using uuencode output.csv output.csv | mailx -s "test mail" xyz@abc.com This will send a mail with scrambled text in body. am i missing something ?... (4 Replies)
Discussion started by: Sriranga
4 Replies

6. UNIX for Advanced & Expert Users

Sendmail is not working properly

Hi All, Can any one help me to solve the issue. The Issue is, i have started the sendmail service on my RHEL 4 update 6 box, I am able to send the mail from my box to almost all of the Email Id's except few. Exampe, test mail. . Output is :the message is sent. now if I send the... (2 Replies)
Discussion started by: akhtar.bhat
2 Replies

7. Shell Programming and Scripting

\n not working properly

Hi all, I'm trying to generate a series of txt files starting from a plain csv file part of my code: #!/bin/ksh INSTALLDIR=/Users/ME/Installdir CSV=CSV.csv TMP=/tmp/$(basename $0).txt tr -s "\r" "\n" < /$INSTALLDIR/$CSV > $TMP function Makefiles { printf '%24s:%30s\n' "sometext"... (1 Reply)
Discussion started by: Jive Spector
1 Replies

8. HP-UX

FC card not working properly

Hi I've a problem with Hp-ux 11.11 9000/800/rp3440 system. Already the software for driver & its patch are loaded for HBA Fibrechannel card, but still the fibrechannel card is showing the status "Unclaimed" . What will be reason for this? How to get the status "Claimed" ? Pl. help me out.... (4 Replies)
Discussion started by: Mike1234
4 Replies

9. Programming

y is this not working properly?

#include <stdio.h> #include <sys/types.h> #include <string.h> #include <sys/stat.h> #include <unistd.h> struct stat s; main() { char c; if (fork()==0) { system("clear"); do { printf("myAI\\>§ "); scanf("%s",c); if(stat(c,&s)>-1) {... (3 Replies)
Discussion started by: C|[anti-trust]
3 Replies

10. UNIX for Dummies Questions & Answers

Keyboard not working properly...

Hello Again, Those that have noticed my earlier posts will know that I have succesfully installed Solaris 8 onto my pc. I haven't been able to get x-server working (i think it doesn't like my video card) though I've been able to log into root (with a bit of help from unix forums :o ) and have... (2 Replies)
Discussion started by: timresh
2 Replies
Login or Register to Ask a Question