Filter lines common in two files


 
# 1  
Filter lines common in two files

Thanks everyone. I got that problem solved.

I need help with one more thing. (Yes, UNIX definitely seems fun and useful, and I WILL eventually learn it for myself. But I am on a different project now and don't really have time to go through all the basics, so I would really appreciate some help.)

I've got two data files.
One has: ID lat lon data1 data2. For example:
Code:
1001   23.5  -6.45  3.2  14.68
1002   48.2  -35.6  8.5  21.67
1003   -5.6   23.6   3.5  3.56
...
...

And the other has: ID data3 data4 data5
For example:
Code:
1001   C   16   US
1002   D    32   US
1004   E    13   US
...
...


There are approximately 2500 IDs, but neither file has all of them: file1 is missing a few, and so is file2, and the missing IDs are not necessarily the same in both files.

Now I want to make a new file with: ID lat lon data4, but ONLY for the IDs that are common to both files.
So, for the above files it would be:

Code:
1001   23.5  -6.45  16
1002   48.2  -35.6  32
...
...

I searched the forums and found similar problems solved with awk, but I could not understand the scripts well enough to adapt one into my own solution.
Thanks.
# 2  
see below ... assumes removal of column labels (first row) ... see man comm and man sort ...

steps in the process ...
1. grab and sort IDs from 1st column ($1) of file 1 (pool) and send to temp file (pool.1t)
2. grab and sort IDs from 1st column ($1) of file 2 (pool4) and send to temp file (pool.4t)
3. for each ID common to both files comm -12 file1 file2
> a. grab lat from lat column ($2) in file 1 (pool)
> b. grab lon from lon column ($3) in file 1 (pool)
> c. grab data4 from data4 column ($3) in file 2 (pool4)
> d. echo out values in correct order
Code:
root@debiangeek:/tmp# cat pool
123 a b c f
456 a b c d 
789 a b c e
root@debiangeek:/tmp# cat pool4
789 a 25 c e
234 d 35 e c
456 a 57 c d 
123 a 66 c f
root@debiangeek:/tmp# awk -F" " '{print $1}' pool | sort -nu > pool.1t
root@debiangeek:/tmp# cat pool.1t
123
456
789
root@debiangeek:/tmp# awk -F" " '{print $1}' pool4 | sort -nu > pool.4t
root@debiangeek:/tmp# cat pool.4t
123
234
456
789
root@debiangeek:/tmp# for i in $(comm -12 pool.1t pool.4t)
> do
>     lat=$(awk -v id="$i" '$1 == id {print $2}' pool)
>     lon=$(awk -v id="$i" '$1 == id {print $3}' pool)
>     dat4=$(awk -v id="$i" '$1 == id {print $3}' pool4)
>     echo "$i $lat $lon $dat4"
> done | tee /tmp/file1
123 a b 66
456 a b 57
789 a b 25
root@debiangeek:/tmp# cat /tmp/file1
123 a b 66
456 a b 57
789 a b 25
root@debiangeek:/tmp#
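
As an aside (my addition, not part of the post above): join(1) can do the same merge in a single pass, provided both files are sorted on the ID field. A minimal sketch using the same sample pool/pool4 data:

```shell
# Sketch using join(1) instead of comm plus per-ID awk passes
# (an alternative approach, not the original poster's method).
# Recreate the sample files from the session above:
printf '%s\n' '123 a b c f' '456 a b c d' '789 a b c e' > pool
printf '%s\n' '789 a 25 c e' '234 d 35 e c' '456 a 57 c d' '123 a 66 c f' > pool4
# join needs both inputs sorted on the join field (the ID column):
sort -k1,1 pool  > pool.s
sort -k1,1 pool4 > pool4.s
# -o picks the output fields: ID, lat, lon from file 1; data4 from file 2.
# Unpaired IDs (like 234) are dropped automatically.
join -o 1.1,1.2,1.3,2.3 pool.s pool4.s
# -> 123 a b 66
#    456 a b 57
#    789 a b 25
```

With join, the temporary ID lists and the per-ID loop disappear; the trade-off is that both files must be sorted first.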

# 3  
Quote:
Originally Posted by latsyrc
I've got two data files. [...] Now I want to make a new file with: ID lat lon data4 ; but ONLY for the IDs that are common in both. [...]
You really should start a new thread when you have a new problem... It makes it a lot easier for people who read this thread later to figure out which problem the later messages are trying to address.

Assuming that the above two input files are named datfile3 and datfile4, respectively, and that you want the output stored in a file named output, the following simple awk script seems a little more direct than Just Ice's proposal:
Code:
awk '
FNR == NR {
	d4[$1] = $3
	next
}
($1 in d4) {
	printf("%s   %s  %s  %s\n", $1, $2, $3, d4[$1])
}' datfile4 datfile3 > output

When you run this script with the above sample input files, the contents of output will be:
Code:
1001   23.5  -6.45  16
1002   48.2  -35.6  32

which matches the spacing you requested.

As I mentioned before, if you want to run this on a Solaris system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of /usr/bin/awk or /bin/awk.
# 4  
paste is one easy solution

Extract each required column into its own file with awk, then join those files with the paste command. paste lets you use any separator you like; in this example I used a semicolon. (Note that paste pairs lines by position, not by ID, so this only gives a correct result when both files contain the same IDs in the same order.)

Paste command usage:

paste -d ";" file1 file2 file3 .....

Code:
[goksel@gokcell cozum]$ cat >file1
1001   23.5  -6.45  3.2  14.68
1002   48.2  -35.6  8.5  21.67
1003   -5.6   23.6   3.5  3.56
^Z
[1]+  Stopped                 cat > file1
[goksel@gokcell cozum]$ cat >file2
1001   C   16   US
1002   D    32   US
1004   E    13   US
^Z
[2]+  Stopped                 cat > file2
[goksel@gokcell cozum]$ cat file1
1001   23.5  -6.45  3.2  14.68
1002   48.2  -35.6  8.5  21.67
1003   -5.6   23.6   3.5  3.56

awk '{print $1}' file1 >f1l1
awk '{print $2}' file1 >f1l2
awk '{print $3}' file1 >f1l3
awk '{print $3}' file2 >f2l3

paste -d ";" f1l1 f1l2 f1l3 f2l3 >result

cat result 
1001;23.5;-6.45;16
1002;48.2;-35.6;32
1003;-5.6;23.6;13
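
One caveat worth spelling out (my addition, not part of the post above): because paste pairs lines by position rather than by ID, the 1003 row in the result above has actually been paired with the 13 that belongs to ID 1004. A quick guard before pasting is to compare the ID columns first; this is a hypothetical check, sketched assuming a bash-like shell with process substitution:

```shell
# Recreate the two sample files:
printf '%s\n' \
  '1001   23.5  -6.45  3.2  14.68' \
  '1002   48.2  -35.6  8.5  21.67' \
  '1003   -5.6   23.6   3.5  3.56' > file1
printf '%s\n' \
  '1001   C   16   US' \
  '1002   D    32   US' \
  '1004   E    13   US' > file2
# paste matches row N of file1 with row N of file2, so the ID columns
# must be identical for the pasted result to be correct.
if cmp -s <(awk '{print $1}' file1) <(awk '{print $1}' file2); then
    echo "IDs line up - paste is safe"
else
    echo "IDs differ - paste would mis-pair rows"
fi
# -> IDs differ - paste would mis-pair rows
```

When the IDs differ, an ID-keyed approach (comm/join or the awk lookup shown earlier in the thread) is needed instead.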

Regards,
Goksel Yangin
Computer Engineer

Moderator's Comments:
edit by bakunin: repaired some broken CODE-tags.

Last edited by bakunin; 07-10-2013 at 06:02 AM..
# 5  
Thank you everyone! I learned a thing or two from each post.

@Don_Cragun,
Thanks again. Your code gave me perfect results. However, I still don't entirely understand it. Could you please briefly explain this:

Code:
awk '
FNR == NR {
	d4[$1] = $3
	next
}
($1 in d4) {
	printf("%s   %s  %s  %s\n", $1, $2, $3, d4[$1])
}' datfile4 datfile3 > output

# 6  
Moderator's Comments:
As this is a new problem, I split the original thread and moved the posts dealing with the new problem here.

As Don Cragun already said: please open a new thread for every separate problem. Thank you.


bakunin
# 7  
Code:
awk '
# 1st file fields: ID data3 data4 data5
# 2nd file fields: ID lat lon data1 data2
FNR == NR {             # If this line is from the 1st file...
        d4[$1] = $3     # d4[ID] = data4 associated with ID
        next            # Skip to next input line
}
($1 in d4) {            # If the ID on this line (from the 2nd file) was in the
                        # 1st file...
        printf("%s   %s  %s  %s\n", $1, $2, $3, d4[$1])
                        # Print ID, lat, and lon from the 2nd file and data4
                        # from the 1st file with 3 spaces between ID and lat,
                        # and 2 spaces between other fields.
}' datfile4 datfile3 > output   # 1st file is named datfile4, 2nd file is named
                                # datfile3; save the output in a file named
                                # output.
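
For anyone following along, the commented script above can be tried verbatim on the sample data from the first post (with the file names as Don Cragun assumed them):

```shell
# Recreate the sample input files from post #1:
cat > datfile3 <<'EOF'
1001   23.5  -6.45  3.2  14.68
1002   48.2  -35.6  8.5  21.67
1003   -5.6   23.6   3.5  3.56
EOF
cat > datfile4 <<'EOF'
1001   C   16   US
1002   D    32   US
1004   E    13   US
EOF
# Run the script exactly as posted: datfile4 is read first (FNR == NR),
# filling d4[]; datfile3 is then filtered against it.
awk '
FNR == NR { d4[$1] = $3; next }
($1 in d4) { printf("%s   %s  %s  %s\n", $1, $2, $3, d4[$1]) }
' datfile4 datfile3 > output
cat output
# -> 1001   23.5  -6.45  16
#    1002   48.2  -35.6  32
```

IDs 1003 and 1004 are dropped, since each appears in only one of the two files.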

 
