simple join for multiple files and produce 3 outputs


 
# 1  
Old 08-28-2010

Code:
sh script file1 filea fileb filec ................filez. >>output1 & output2 &output3

Quote:
output1 contains the lines that are common to all of the files.
output2 contains the lines that are not in file1 but are in all of the other files.
output3 contains the lines that are in file1 and in none of the other files.
file1
Code:
z10     1873    1920    z_number1_E59
z10     2042    2090    z_number2_E59
Z22     2476    2560    z_number3_E59
Z22     2838    2915    z_number4_E59
z1      1873    1920    z_number1_E60
z1      2042    2090    z_number2_E60
z1      2032    2041    z_number2_E20

filea
Code:
z10     1873    1920    z_number1_E59
z10     2042    2090    z_number2_E59
Z22     2476    2560    z_number3_E59
Z22     2838    2915    z_number4_E59
z1      1863    1872    z_number1_E60
z1      2032    2041    z_number2_E60
z1      2032    2041    z_number2_E10

fileb
Code:
z10     1863    1872    z_number1_E59
z10     2032    2041    z_number2_E59
Z22     2476    2560    z_number3_E59
Z22     2838    2915    z_number4_E59
z1      1873    1920    z_number1_E60
z1      2042    2090    z_number2_E60
z1      2032    2041    z_number2_E10

filec
Code:
z10     1873    1920    z_number1_E59
z10     2042    2090    z_number2_E59
Z22     2476    2560    z_number3_E59
Z22     2838    2915    z_number4_E59
z1      1863    1872    z_number1_E60
z1      2032    2041    z_number2_E60
z1      2032    2041    z_number2_E10

output1
Code:
Z22     2476    2560    z_number3_E59
Z22     2838    2915    z_number4_E59

output2
Code:
z1      2032    2041    z_number2_E10

output3
Code:
z1      2032    2041    z_number2_E20


Last edited by stateperl; 08-28-2010 at 05:06 AM.. Reason: Trying to change post name to three outputs
# 2  
Old 08-28-2010
Try this; the variable nfiles holds the number of input files:
Code:
awk -v nfiles="4" 'NR==FNR{a[$0]++;next}
$0 in a {a[$0]++; next}
{b[$0]++}
END{
  for(i in a){
    if(a[i]==nfiles) {
      print i > "output1"
    }
    else if(a[i]==1) {
        print i > "output3"
    }
  }
  for(i in b){
    if(b[i]==nfiles-1) {
        print i > "output2"
    }
  }
}' file1 filea fileb filec
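
Rather than editing nfiles by hand whenever the file list changes, the count could be taken from the argument list. This is only a minimal sketch, assuming every operand after the program text is an input file (no var=value operands); count_inputs is a hypothetical helper name, not part of the script above.

```shell
# Hypothetical helper: print how many operands awk was given.
# ARGC counts the command name plus the operands, hence the "- 1".
# The program has only a BEGIN block, so the files are never opened.
count_inputs() {
    awk 'BEGIN { print ARGC - 1 }' "$@"
}

# Possible use with the command above (assumption: same file list):
#   awk -v nfiles="$(count_inputs file1 filea fileb filec)" '...' file1 filea fileb filec
```

Under the same assumption, a `BEGIN { nfiles = ARGC - 1 }` block inside the program itself could replace the -v assignment.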

# 3  
Old 08-29-2010
Hi

Thank you, it works well. But it throws an error when I run it on 17 files:

Quote:
awk: cmd. line:1: (FILENAME=/file14.bed FNR=250923) fatal: cannot open file `/filej0' for reading (No such file or directory)

Last edited by stateperl; 08-29-2010 at 02:31 AM..
# 4  
Old 08-29-2010
Quote:
Originally Posted by stateperl
Thank you, it works well. But it throws an error when I run it on 17 files:
Code:
awk: cmd. line:1: (FILENAME=/file14.bed FNR=250923) fatal: cannot open file `/filej0' for reading (No such file or directory)

Does the file /filej0 exist in your directory?
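
An error like the one quoted above means awk was handed a name it could not open, and it only finds out after reading the earlier files. It may help to check every input before the run starts. A rough sketch, assuming a POSIX shell; check_inputs is a hypothetical helper, not something from the original script.

```shell
# Hypothetical helper: report every argument that is not a readable file.
# Returns 0 when all inputs are readable, 1 otherwise.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            printf 'cannot read: %s\n' "$f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Possible use: only start awk when every input is readable.
#   check_inputs file1 filea fileb filec && awk -v nfiles="4" '...' file1 filea fileb filec
```

A mistyped name on the command line then shows up immediately instead of after several hundred thousand records.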
# 5  
Old 08-29-2010
Could you please explain your code? I'm confused: how did you find the lines common to all files?
# 6  
Old 08-29-2010
Quote:
Originally Posted by gvj
Could you please explain your code? I'm confused: how did you find the lines common to all files?
Explanation:
Code:
$0 in a {a[$0]++; next}		# For a line already seen in file1, count one more occurrence

    if(a[i]==nfiles) {		# If the count for this line equals nfiles (4 here),
      print i > "output1"	# the line exists in all 4 files
    }

# 7  
Old 08-29-2010
Quote:
Originally Posted by gvj
Could you please explain your code? I'm confused: how did you find the lines common to all files?
I've taken the liberty of adding some explanation to Franklin52's code:

Code:
awk -v nfiles="4" '
NR==FNR{a[$0]++;next}           # NR is the current record number, counted from the start of the programme
                                # FNR is the current record number, counted from the start of the current file
                                # FNR == NR therefore holds only while file 1 is being read
                                # so this statement captures all records from file 1 in the hash named 'a'
                                # the index is the whole input record and the value is a count of occurrences
                                # next causes the next record to be read and the programme to loop to the top

$0 in a {a[$0]++; next}         # this statement is reached only for records read from files 2...n
                                # if the current record was also seen in the first file, increment the counter
                                # maintained in a. next causes the next record to be read.

{b[$0]++}                       # this statement is executed when a record from files 2...n is encountered
                                # and the record was not seen in file 1. A second hash is used to
                                # track all records that weren't in file 1

END{                            # this section of code is driven after the last record is read from file n
   for(i in a){                 # for every record seen in file 1...
      if(a[i]==nfiles) {        # if the record was seen in all files (count in a matches number of files)
         print i > "output1"    # print to the list for seen in all files
      }
      else if(a[i]==1) {        # if the record was only seen in the first file print to output list
         print i > "output3"     # of records just in file 1
      }
  }

  for(i in b){                  # for every record that wasn't seen in the  first file, but was seen
     if(b[i]==nfiles-1) {       # in all other files, print to that list
        print i > "output2"
     }
 }
}' file1 filea fileb filec

The only potential problem with this code is that duplicate lines can yield a false positive. For example, if one file contains a line from file1 twice and exactly one other input file is missing that line, its count still reaches nfiles and it is written to output1. Related combinations of duplicates and 'holes' fall into the same trap. If this is a concern, an easy solution is to run 'sort -u' on each of the input files first to remove all duplicate records.
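
If sorting every input is not convenient, duplicates could also be ignored inside awk. What follows is a sketch of a different technique from the code above, not a drop-in copy of it: each distinct line is counted at most once per file. It assumes no input file is empty (an empty file never triggers FNR == 1). The names three_outputs, fidx and seen are hypothetical.

```shell
# Sketch: like the script above, but duplicates inside one file count once.
# Assumption: no input file is empty. Results go to output1..output3 in the
# current directory, as in the original.
three_outputs() {
    awk -v nfiles="$#" '
    FNR == 1 { fidx++; split("", seen) }   # new file: bump index, clear per-file set
    seen[$0]++ { next }                    # line already counted for this file
    fidx == 1 { a[$0] = 1; next }          # first sighting in file1
    $0 in a { a[$0]++; next }              # in file1 and also in this file
    { b[$0]++ }                            # in this file but not in file1
    END {
        for (i in a) {
            if (a[i] == nfiles) {
                print i > "output1"        # present in every file
            } else if (a[i] == 1) {
                print i > "output3"        # present only in file1
            }
        }
        for (i in b) {
            if (b[i] == nfiles - 1) {
                print i > "output2"        # missing from file1, present in all others
            }
        }
    }' "$@"
}

# Possible use: three_outputs file1 filea fileb filec
```

Because the seen set is cleared at the start of each file, the extra memory is bounded by the distinct lines of the largest single file rather than by all files together.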