Merge two files with different lengths

04-19-2013

Hi there,

I have two very long files like:

file1: two fields

Code:

1 123
1 125
1 234
2 123
2 234
2 300
2 312
3 10
3 215
4 56
...

file2: 6 fields

Code:

1 123 0 1 0 0
1 126 2 1 0 0
2 123 0 1 0 1
2 138 1 1 1 1
2 300 0 1 2 3
2 311 2 4 6 0
3 120 3 4 1 0
3 215 1 1 2 1
3 216 0 2 1 5
3 345 8 0 1 0
3 357 0 1 1 1
3 500 2 1 0 1
4 17  6 1 0 2
4 70  0 1 0 1
...

The numbers of lines in file1 and file2 are not equal.

I want to get an output file like

file3: 6 fields, where the first two fields are exactly the same as the first two fields in file1. For example, the file1 line "1 123" has a match in file2, "1 123 0 1 0 0", so the whole file2 line "1 123 0 1 0 0" is printed to file3. If a line in file1 has no match in file2, e.g. "1 125", then "1 125 0 0 0 0" is printed to file3.

Code:

1 123 0 1 0 0
1 125 0 0 0 0 
1 234 0 0 0 0
2 123 0 1 0 1
2 234 0 0 0 0
2 300 0 1 2 3
2 312 0 0 0 0
3 10  0 0 0 0
3 215 1 1 2 1
4 56  0 0 0 0
...

I am wondering whether this can be done using awk, join, or any other tool on Linux. Since the files are very large, I really want it to be fast. Thanks a lot!

Note: Field 2 in both file1 and file2 contains only numeric values, but field 1 in both files may contain other characters too. Both files are sorted on these two fields. Neither file contains duplicate keys, so the following situation will not happen:

Code:

1 123
1 123
...

Also, lines in file2 that have no match in file1 do not need to be considered. For example, "1 126 2 1 0 0" has no match in file1, so it should not be added to file3.

Last edited by ClaraW; 04-26-2013 at 08:38 PM..

ClaraW

04-19-2013

Try something like:

Code:

awk '{p=$0; getline<f} $1 FS $2 != p { $0 = p " 0 0 0 0" }1' f=file2 file1 > file3

Scrutinizer

04-20-2013

Where are your file2 lines 1 126 and 2 138 in your output file? Does your specification imply that lines in file2 that don't exist in file1 are to be suppressed? What happens when two or more consecutive lines in file1 are missing in file2?

While Scrutinizer's proposal may be very fast, it's difficult to get the two files back in sync if more than one line is missing in either.

After adding lines 1 127 and 1 128 to file1, try this one:

Code:

sort file1 file2 |
awk     'tmp == $1 FS $2        {tmp = ""}
         tmp && 
           tmp != $1 FS $2      {print tmp, "0 0 0 0"}
         NF == 2                {tmp = $0; next} 
         NF == 6                {print; tmp = ""}
         END                    {if (tmp) print tmp, "0 0 0 0"}
        '
1 123 0 1 0 0
1 125 0 0 0 0
1 126 2 1 0 0
1 127 0 0 0 0
1 128 0 0 0 0
1 234 2 3 0 0
2 123 0 1 0 1
2 138 1 1 1 1
2 234 0 0 0 0

RudiC

04-20-2013

Thanks, RudiC, I may have misread the requirement. If so, this adaptation might fix this:

Code:

awk '{p=$1 FS $2; while( getline<f && $1 FS $2 < p) } $1 FS $2 != p { $0 = p " 0 0 0 0" }1' f=file2 file1

or

Code:

awk '{p=$1; q=$2; while(getline<f && ( $1<p || $2<q ) )} $1!=p || $2!=q { $0 = p FS q FS "0 0 0 0" }1' f=file2 file1
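
One thing to watch with read-ahead loops of this kind: a file2 line that was read while looking for one file1 key may turn out to be the match for the next file1 key, so it has to be kept between records rather than read again. A minimal sketch of that idea follows. It is not one of the commands posted above; the file names demo_keys, demo_data and demo_merged and the helper functions advance() and behind() are made up for illustration, and it assumes field 1 is ordered the way awk's `<` compares it while field 2 is ordered numerically.

```shell
# Small stand-in inputs so the snippet runs on its own.
printf '%s\n' '2 234' '2 300' '2 312' > demo_keys
printf '%s\n' '2 138 1 1 1 1' '2 300 0 1 2 3' '2 311 2 4 6 0' > demo_data

awk -v data=demo_data '
# Buffer the next data line in dline and its key in k1/k2.
# have becomes 0 once the data file is exhausted.
function advance(    f) {
        have = ((getline dline < data) > 0)
        if (have) { split(dline, f, " "); k1 = f[1]; k2 = f[2] }
}
# True while the buffered data key sorts before the current key line.
function behind() {
        return have && (k1 < $1 || (k1 == $1 && k2 + 0 < $2 + 0))
}
BEGIN { advance() }
{
        # Skip data lines that have no partner in the key file.
        while (behind()) advance()
        # The buffered line is only replaced by advance(), so it is still
        # available when the next key line is processed.
        if (have && k1 == $1 && k2 + 0 == $2 + 0) print dline
        else print $1, $2, 0, 0, 0, 0
}' demo_keys > demo_merged
```

With these inputs the buffered line "2 300 0 1 2 3" is read while handling key "2 234" and is still there when key "2 300" arrives. Comparing field 2 with `+ 0` forces a numeric comparison, which matters when the numbers have different lengths (for example 99 and 123).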

Last edited by Scrutinizer; 04-20-2013 at 07:04 AM..

Scrutinizer

04-22-2013

Thank you so much, Scrutinizer and RudiC! And yes, RudiC, such lines are to be suppressed. I only want the information for the lines in file1: if a line exists in file2, then add fields 3, 4, 5 and 6 from file2 to the file1 line; if a line in file1 is not in file2, then add "0 0 0 0" for fields 3, 4, 5 and 6. If a line in file2 does not exist in file1, just ignore it.
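
The rule described above can be sketched with an awk lookup table. This is only a minimal sketch, not one of the solutions posted in this thread: it assumes file2 fits in memory, which may not hold for very large files, and the demo_file1, demo_file2 and demo_file3 names and the few sample lines are stand-ins so the snippet runs on its own.

```shell
# Stand-in inputs: a few lines from the samples in this thread.
printf '%s\n' '1 123' '1 125' '1 234' '2 123' > demo_file1
printf '%s\n' '1 123 0 1 0 0' '1 126 2 1 0 0' '2 123 0 1 0 1' > demo_file2

awk '
# First file on the command line (demo_file2): remember each whole line
# under its two-field key.
NR == FNR { line[$1 FS $2] = $0; next }
# Second file (demo_file1): print the remembered line if there is one,
# otherwise the key followed by four zeros.
{
        key = $1 FS $2
        if (key in line) print line[key]
        else print key, 0, 0, 0, 0
}' demo_file2 demo_file1 > demo_file3
```

Lines that exist only in demo_file2 (here "1 126 2 1 0 0") are never printed because output is driven by demo_file1 alone. The approach does not depend on either file being sorted; the price is holding all of file2 in memory.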

ClaraW

04-22-2013

Here are two awk scripts that I think do what you need. Use the 1st script if the field separator in either file is a mixture of spaces and tabs. Use the 2nd script if the field separator in both files is always a single space. In both cases, if you're using a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk:

Code:

echo 1st form:
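awk '
# Read the next file1 line into key1; stop the script when file1 is exhausted.
function gf1() {
        if((getline file1line < file1) != 1) {
                done = 1        # tell END there is nothing left to print
                exit(0)
        }
        split(file1line, field, /[ \t]+/)
        key1 = field[1] FS field[2]
        return(1)
}
# Print zero-filled lines for file1 keys that sort before this file2 key.
{       while(key1 < $1 FS $2) {
                if(key1) print key1 " 0 0 0 0"
                gf1()
        }
}
# Keys match: print the file1 key followed by fields 3 to 6 from file2.
key1 == $1 FS $2 {
        print key1, $3, $4, $5, $6
        gf1()
}
# file2 is exhausted: zero-fill whatever is left in file1.
# (exit in a main rule still runs END, hence the done flag.)
END {   while(!done) {
                print key1 " 0 0 0 0"
                gf1()
        }
}' file1=file1 file2 > file3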

echo 2nd form:

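awk '
# Read the next file1 line straight into key1; stop when file1 is exhausted.
function gf1() {
        if((getline key1 < file1) != 1) {
                done = 1        # tell END there is nothing left to print
                exit(0)
        }
        return(1)
}
# Print zero-filled lines for file1 keys that sort before this file2 key.
{       while(key1 < $1 FS $2) {
                if(key1) print key1 " 0 0 0 0"
                gf1()
        }
}
# Keys match: print the file2 line unchanged.
key1 == $1 FS $2 {
        print
        gf1()
}
# file2 is exhausted: zero-fill whatever is left in file1.
# (exit in a main rule still runs END, hence the done flag.)
END {   while(!done) {
                print key1 " 0 0 0 0"
                gf1()
        }
}' file1=file1 file2 > file4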

Note that the 2nd form sends output to file4 instead of file3 (so you can run both forms and compare the output).
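
One way to do that comparison is with cmp. The sketch below uses two made-up stand-in files, demo_out1 and demo_out2, in place of file3 and file4 so that it runs on its own.

```shell
# Stand-ins for the two result files.
printf '%s\n' '1 123 0 1 0 0' '1 125 0 0 0 0' > demo_out1
printf '%s\n' '1 123 0 1 0 0' '1 125 0 0 0 0' > demo_out2

# cmp -s prints nothing and only sets the exit status:
# 0 when the files are byte-for-byte identical.
if cmp -s demo_out1 demo_out2; then
        result=identical
else
        result=different
        diff demo_out1 demo_out2        # show the lines that differ
fi
echo "$result"
```

If the input uses a single space as separator throughout, the two forms would be expected to produce identical files; a difference usually points at tabs or repeated spaces in one of the inputs.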

Don Cragun

04-26-2013

Hi all,

I think the problem may be that I did not state my question clearly. I have edited the original post to make it easier to understand.

Thanks,

Lu

ClaraW
