Filter lines common in two files


 
# 1  
Filter lines common in two files

Thanks everyone. I got that problem solved.

I need help with one more thing. (Yes, UNIX definitely seems fun and useful, and I WILL eventually learn it for myself. But I am on a different project now and don't really have time to go through all the basics, so I would really appreciate some help.)

I've got two data files.
One has: ID lat lon data1 data2. For example:
Code:
1001   23.5  -6.45  3.2  14.68
1002   48.2  -35.6  8.5  21.67
1003   -5.6   23.6   3.5  3.56
...
...

And the other has: ID data3 data4 data5
For example:
Code:
1001   C   16   US
1002   D    32   US
1004   E    13   US
...
...


There are approximately 2500 IDs, but neither file has all of them: file1 is missing a few, and so is file2, and the missing IDs are not necessarily the same in both files.

Now I want to make a new file with: ID lat lon data4, but ONLY for the IDs that are common to both files.
So, for the above files it would be:

Code:
1001   23.5  -6.45  16
1002   48.2  -35.6  32
...
...

I searched the forums and found similar problems solved with awk, but I could not understand the scripts well enough to adapt one into my own solution.
Thanks.
# 2  
see below ... assumes removal of column labels (first row) ... see man comm and man sort ...

steps in the process ...
1. grab and sort IDs from 1st column ($1) of file 1 (pool) and send to temp file (pool.1t)
2. grab and sort IDs from 1st column ($1) of file 2 (pool4) and send to temp file (pool.4t)
3. for each ID common to both files comm -12 file1 file2
> a. grab lat from lat column ($2) in file 1 (pool)
> b. grab lon from lon column ($3) in file 1 (pool)
> c. grab data4 from data4 column ($3) in file 2 (pool4)
> d. echo out values in correct order
Code:
root@debiangeek:/tmp# cat pool
123 a b c f
456 a b c d 
789 a b c e
root@debiangeek:/tmp# cat pool4
789 a 25 c e
234 d 35 e c
456 a 57 c d 
123 a 66 c f
root@debiangeek:/tmp# awk -F" " '{print $1}' pool | sort -nu > pool.1t
root@debiangeek:/tmp# cat pool.1t
123
456
789
root@debiangeek:/tmp# awk -F" " '{print $1}' pool4 | sort -nu > pool.4t
root@debiangeek:/tmp# cat pool.4t
123
234
456
789
root@debiangeek:/tmp# for i in $(comm -12 pool.1t pool.4t)
> do
>     lat=$(awk -v id="$i" '$1 == id {print $2}' pool)
>     lon=$(awk -v id="$i" '$1 == id {print $3}' pool)
>     dat4=$(awk -v id="$i" '$1 == id {print $3}' pool4)
>     echo "$i $lat $lon $dat4"
> done | tee /tmp/file1
123 a b 66
456 a b 57
789 a b 25
root@debiangeek:/tmp# cat /tmp/file1
123 a b 66
456 a b 57
789 a b 25
root@debiangeek:/tmp#
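
As an aside (my addition, not part of the post above): join(1) can do the same merge in a single pass, provided both files are sorted on the ID field. A minimal sketch using the same sample pool/pool4 data:

```shell
# Sketch using join(1) instead of comm plus per-ID awk passes
# (an alternative approach, not the original poster's method).
# Recreate the sample files from the session above:
printf '%s\n' '123 a b c f' '456 a b c d' '789 a b c e' > pool
printf '%s\n' '789 a 25 c e' '234 d 35 e c' '456 a 57 c d' '123 a 66 c f' > pool4
# join needs both inputs sorted on the join field (the ID column):
sort -k1,1 pool  > pool.s
sort -k1,1 pool4 > pool4.s
# -o picks the output fields: ID, lat, lon from file 1; data4 from file 2.
# Unpaired IDs (like 234) are dropped automatically.
join -o 1.1,1.2,1.3,2.3 pool.s pool4.s
# -> 123 a b 66
#    456 a b 57
#    789 a b 25
```

With join, the temporary ID lists and the per-ID loop disappear; the trade-off is that both files must be sorted first.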

# 3  
Quote:
Originally Posted by latsyrc
I've got two data files. [...] Now I want to make a new file with: ID lat lon data4 ; but ONLY for the IDs that are common in both. [...]
You really should start a new thread when you have a new problem... It makes it a lot easier for people who read this thread later to figure out which problem the later messages are trying to address.

Assuming that the above two input files are named datfile3 and datfile4, respectively, and that you want the output stored in a file named output, the following simple awk script seems a little more direct than Just Ice's proposal:
Code:
awk '
FNR == NR {
	d4[$1] = $3
	next
}
($1 in d4) {
	printf("%s   %s  %s  %s\n", $1, $2, $3, d4[$1])
}' datfile4 datfile3 > output

When you run this script with the above sample input files, the contents of output will be:
Code:
1001   23.5  -6.45  16
1002   48.2  -35.6  32

which matches the spacing you requested.

As I mentioned before, if you want to run this on a Solaris system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of /usr/bin/awk or /bin/awk.
# 4  
paste is one easy solution

Extract each required column into its own file with awk, then join those files with the paste command. paste lets you use any separator you like; in this example I used a semicolon. (Note that paste pairs lines by position, not by ID, so this only gives a correct result when both files contain the same IDs in the same order.)

Paste command usage:

paste -d ";" file1 file2 file3 .....

Code:
[goksel@gokcell cozum]$ cat >file1
1001   23.5  -6.45  3.2  14.68
1002   48.2  -35.6  8.5  21.67
1003   -5.6   23.6   3.5  3.56
^Z
[1]+  Stopped                 cat > file1
[goksel@gokcell cozum]$ cat >file2
1001   C   16   US
1002   D    32   US
1004   E    13   US
^Z
[2]+  Stopped                 cat > file2
[goksel@gokcell cozum]$ cat file1
1001   23.5  -6.45  3.2  14.68
1002   48.2  -35.6  8.5  21.67
1003   -5.6   23.6   3.5  3.56

awk '{print $1}' file1 >f1l1
awk '{print $2}' file1 >f1l2
awk '{print $3}' file1 >f1l3
awk '{print $3}' file2 >f2l3

paste -d ";" f1l1 f1l2 f1l3 f2l3 >result

cat result 
1001;23.5;-6.45;16
1002;48.2;-35.6;32
1003;-5.6;23.6;13
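
One caveat worth spelling out (my addition, not part of the post above): because paste pairs lines by position rather than by ID, the 1003 row in the result above has actually been paired with the 13 that belongs to ID 1004. A quick guard before pasting is to compare the ID columns first; this is a hypothetical check, sketched assuming a bash-like shell with process substitution:

```shell
# Recreate the two sample files:
printf '%s\n' \
  '1001   23.5  -6.45  3.2  14.68' \
  '1002   48.2  -35.6  8.5  21.67' \
  '1003   -5.6   23.6   3.5  3.56' > file1
printf '%s\n' \
  '1001   C   16   US' \
  '1002   D    32   US' \
  '1004   E    13   US' > file2
# paste matches row N of file1 with row N of file2, so the ID columns
# must be identical for the pasted result to be correct.
if cmp -s <(awk '{print $1}' file1) <(awk '{print $1}' file2); then
    echo "IDs line up - paste is safe"
else
    echo "IDs differ - paste would mis-pair rows"
fi
# -> IDs differ - paste would mis-pair rows
```

When the IDs differ, an ID-keyed approach (comm/join or the awk lookup shown earlier in the thread) is needed instead.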

Regards,
Goksel Yangin
Computer Engineer

Moderator's Comments:
edit by bakunin: repaired some broken CODE-tags.

Last edited by bakunin; 07-10-2013 at 06:02 AM..
# 5  
Thank you everyone! I learned a thing or two from each post.

@Don_Cragun,
Thanks again. Your code gave me perfect results. However, I still don't entirely understand it. Could you please briefly explain this:

Code:
awk '
FNR == NR {
	d4[$1] = $3
	next
}
($1 in d4) {
	printf("%s   %s  %s  %s\n", $1, $2, $3, d4[$1])
}' datfile4 datfile3 > output

# 6  
Moderator's Comments:
As this is a new problem, I split the original thread and moved the posts dealing with the new problem here.

As Don Cragun already said: please open a new thread for every separate problem. Thank you.


bakunin
# 7  
Code:
awk '
# 1st file fields: ID data3 data4 data5
# 2nd file fields: ID lat lon data1 data2
FNR == NR {             # If this line is from the 1st file...
        d4[$1] = $3     # d4[ID] = data4 associated with ID
        next            # Skip to next input line
}
($1 in d4) {            # If the ID on this line (from the 2nd file) was in the
                        # 1st file...
        printf("%s   %s  %s  %s\n", $1, $2, $3, d4[$1])
                        # Print ID, lat, and lon from the 2nd file and data4
                        # from the 1st file with 3 spaces between ID and lat,
                        # and 2 spaces between other fields.
}' datfile4 datfile3 > output   # 1st file is named datfile4, 2nd file is named
                                # datfile3; save the output in a file named
                                # output.
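
For anyone following along, the commented script above can be tried verbatim on the sample data from the first post (with the file names as Don Cragun assumed them):

```shell
# Recreate the sample input files from post #1:
cat > datfile3 <<'EOF'
1001   23.5  -6.45  3.2  14.68
1002   48.2  -35.6  8.5  21.67
1003   -5.6   23.6   3.5  3.56
EOF
cat > datfile4 <<'EOF'
1001   C   16   US
1002   D    32   US
1004   E    13   US
EOF
# Run the script exactly as posted: datfile4 is read first (FNR == NR),
# filling d4[]; datfile3 is then filtered against it.
awk '
FNR == NR { d4[$1] = $3; next }
($1 in d4) { printf("%s   %s  %s  %s\n", $1, $2, $3, d4[$1]) }
' datfile4 datfile3 > output
cat output
# -> 1001   23.5  -6.45  16
#    1002   48.2  -35.6  32
```

IDs 1003 and 1004 are dropped, since each appears in only one of the two files.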

 
