Merge two files with different lengths

04-26-2013


Code:

$ cat file1
1 123
1 125
1 234
2 123
2 234
2 300
2 312
3 10
3 215
4 56

Code:

$ cat file2
1 123 0 1 0 0
1 126 2 1 0 0
2 123 0 1 0 1
2 138 1 1 1 1
2 300 0 1 2 3
2 311 2 4 6 0
3 120 3 4 1 0
3 215 1 1 2 1
3 216 0 2 1 5
3 345 8 0 1 0
3 357 0 1 1 1
3 500 2 1 0 1
4 17  6 1 0 2
4 70  0 1 0 1

Code:

$ cat test.awk
NF == 2 {
  if (looking == 1) { print a, b, "0 0 0 0" }
  else { looking = 1 }
  a = $1; b = $2
  }
NF != 2 && looking == 1 {
  if (a == $1 && b == $2) { print }
  else { print a, b, "0 0 0 0" }
  looking = 0
  }

Code:

$ sort file1 file2 | awk -f test.awk
1 123 0 1 0 0
1 125 0 0 0 0
1 234 0 0 0 0
2 123 0 1 0 1
2 234 0 0 0 0
2 300 0 1 2 3
2 312 0 0 0 0
3 10 0 0 0 0
3 215 1 1 2 1
4 56 0 0 0 0

hanson44


04-26-2013


Combining both files into a single stream is a very clever approach, hanson44. Great idea. However, the implementation needs a bit more work. If the last line of the sorted stream belongs to file1, its "a b 0 0 0 0" line is not generated.
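The edge case can be reproduced with a minimal sketch. The two tiny input files below are made up for illustration (they are not the OP's data) and are chosen so that the last line of the sorted stream comes from file1; the awk rules are the ones from test.awk above, without an END block.

```shell
# Hypothetical inputs: "4 80" exists only in file1 and sorts last.
tmp=$(mktemp -d)
printf '1 123\n4 80\n' > "$tmp/file1"
printf '1 123 0 1 0 0\n' > "$tmp/file2"

# Same rules as test.awk, with no END block.
out=$(LC_ALL=C sort "$tmp/file1" "$tmp/file2" | awk '
NF == 2 {
  if (looking == 1) { print a, b, "0 0 0 0" }
  else { looking = 1 }
  a = $1; b = $2
}
NF != 2 && looking == 1 {
  if (a == $1 && b == $2) { print }
  else { print a, b, "0 0 0 0" }
  looking = 0
}')

# Only the matched line is printed; "4 80 0 0 0 0" never appears.
printf '%s\n' "$out"
rm -rf "$tmp"
```

The pending key "4 80" is still held in a and b when input runs out, and nothing flushes it.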

Regards,
Alister

alister


04-26-2013


Quote:

If the last line of the sorted stream belongs to file1

Ah, the special case not considered.

Try again...

Code:

$ cat file1
1 123
1 125
1 234
2 123
2 234
2 300
2 312
3 10
3 215
4 56
4 80

Code:

$ cat file2
1 123 0 1 0 0
1 126 2 1 0 0
2 123 0 1 0 1
2 138 1 1 1 1
2 300 0 1 2 3
2 311 2 4 6 0
3 120 3 4 1 0
3 215 1 1 2 1
3 216 0 2 1 5
3 345 8 0 1 0
3 357 0 1 1 1
3 500 2 1 0 1
4 17  6 1 0 2
4 70  0 1 0 1

Code:

$ cat test.awk
NF == 2 {
  if (looking == 1) { print a, b, "0 0 0 0" }
  else { looking = 1 }
  a = $1; b = $2
  }
NF != 2 && looking == 1 {
  if (a == $1 && b == $2) { print }
  else { print a, b, "0 0 0 0" }
  looking = 0
  }
END {
  if (looking == 1) { print a, b, "0 0 0 0" }
  }

Code:

$ sort file1 file2 | awk -f test.awk
1 123 0 1 0 0
1 125 0 0 0 0
1 234 0 0 0 0
2 123 0 1 0 1
2 234 0 0 0 0
2 300 0 1 2 3
2 312 0 0 0 0
3 10 0 0 0 0
3 215 1 1 2 1
4 56 0 0 0 0
4 80 0 0 0 0

hanson44


04-26-2013


Assuming that a - (hyphen) does not occur in either of the first two fields:

Code:

awk '{print $1"-"$2 "\t" $0}' file1 | sort -k1,1 > file1.tmp
awk '{print $1"-"$2 "\t" $0}' file2 | sort -k1,1 > file2.tmp
join -e 0 -a 1 -o 1.2,1.3,2.4,2.5,2.6,2.7 file1.tmp file2.tmp > file3

Note: join without the -t option requires sort to use -b. However, since file?.tmp files can never have whitespace before the first field (because awk's default field splitting precludes whitespace in the field variables, $1 and $2, which constitute the new first field), I am able to dispense with sort's -b.
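As a worked run of the pipeline above, here is a sketch on a cut-down, made-up subset of the sample data. The temporary directory and the LC_ALL=C settings are assumptions added for the example, so that sort and join agree on collation; they are not part of the original commands.

```shell
# Hypothetical reduced inputs in a scratch directory.
tmp=$(mktemp -d)
printf '1 123\n1 125\n2 300\n' > "$tmp/file1"
printf '1 123 0 1 0 0\n1 126 2 1 0 0\n2 300 0 1 2 3\n' > "$tmp/file2"

# Prepend a combined "a-b" key to every line, then sort on that key.
awk '{print $1"-"$2 "\t" $0}' "$tmp/file1" | LC_ALL=C sort -k1,1 > "$tmp/file1.tmp"
awk '{print $1"-"$2 "\t" $0}' "$tmp/file2" | LC_ALL=C sort -k1,1 > "$tmp/file2.tmp"

# -a 1 keeps unpaired file1 lines; -e 0 fills their missing file2 fields.
merged=$(LC_ALL=C join -e 0 -a 1 -o 1.2,1.3,2.4,2.5,2.6,2.7 \
    "$tmp/file1.tmp" "$tmp/file2.tmp")
printf '%s\n' "$merged"
rm -rf "$tmp"
```

With these inputs the key 1-125 has no partner in file2, so it comes out as "1 125 0 0 0 0", while the file2-only key 1-126 is dropped.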

Regards,
Alister

alister


04-27-2013


Nothing has been said about the sizes of these files. If they are large, the sort could be relatively expensive. Here is a slightly modified version of one of the awk scripts I proposed earlier (correcting a bug that duplicated the final line of output in some cases), with additional comments. It avoids the need to sort, since the OP has stated that both input files are already sorted, and it produces the desired output except for three lines. Unlike the requested output, the following script doesn't add a trailing space to the output line:

Code:

1 125 0 0 0 0

and it produces the output lines:

Code:

3 10 0 0 0 0
    and
4 56 0 0 0 0

instead of the output lines:

Code:

3 10  0 0 0 0
    and
4 56  0 0 0 0

(note the extra spaces before the "0 0 0 0").

The awk script is:

Code:

awk '
# gf1() -  Get a line from file1.
# Description:
#       Set key1 to the 1st two fields from that line separated by a single
#       space.  It is assumed that the input fields are separated by a
#       combination of one or more spaces and tabs.
# Exit Code:
#       0 EOF or error reading from file1.  key1 is cleared first so that the
#       END action does not print the last matched key a second time.
# Return Value:
#       1 Successful completion.
function gf1() {
        if((getline file1line < file1) != 1) {
                key1 = ""
                exit(0)
        }
        split(file1line, field, /[ \t]+/)
        key1 = field[1] FS field[2]
        return(1)
}
{       while(key1 < $1 FS $2) {
                # We are here because either this is the 1st line from file2
                # and we have not read a line from file1 yet or the line from
                # file1 does not have a match in file2 and we are looking at a
                # key from file2 that is greater than the key from the current
                # line in file1...
                if(key1) {
                        # Create a new line for an unmatched key from file1.
                        print key1 " 0 0 0 0"
                        key1 = ""
                }
                gf1()   # Get another line from file1.
        }
}
key1 == $1 FS $2 {
        # We have a matching key, print the line from file2 and get a new line
        # from both files.
        print key1, $3, $4, $5, $6
        gf1()
}
END {   while(key1) {
                # Create new lines for any remaining unmatched keys from file1.
                print key1 " 0 0 0 0"
                gf1()   # We will exit this loop when we hit EOF on file1.
        }
}' file1=file1 file2 > file3

As always, if you're using a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk.
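A note on what "sorted" has to mean for the script above: the keys it compares (key1 and $1 FS $2) are built by concatenation, so awk compares them as strings, not as numbers. The inputs therefore presumably need to be in string order, as a plain sort would leave them, rather than in numeric order. A minimal sketch of the difference, using made-up keys:

```shell
# Concatenated keys are strings: "1 99" does not sort before "1 123".
string_cmp=$(awk 'BEGIN { a = "1" FS "99"; b = "1" FS "123"; print (a < b) }')

# Plain numeric constants compare numerically.
number_cmp=$(awk 'BEGIN { print (99 < 123) }')

printf 'string: %s, numeric: %s\n' "$string_cmp" "$number_cmp"
```

The sample files in this thread are consistent with both orderings, so the distinction only shows up with keys of differing widths.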

Hope this helps,
Don

Don Cragun

