Merge two files with different lengths


 
# 8  
Old 04-26-2013
Code:
$ cat file1
1 123
1 125
1 234
2 123
2 234
2 300
2 312
3 10
3 215
4 56

Code:
$ cat file2
1 123 0 1 0 0
1 126 2 1 0 0
2 123 0 1 0 1
2 138 1 1 1 1
2 300 0 1 2 3
2 311 2 4 6 0
3 120 3 4 1 0
3 215 1 1 2 1
3 216 0 2 1 5
3 345 8 0 1 0
3 357 0 1 1 1
3 500 2 1 0 1
4 17  6 1 0 2
4 70  0 1 0 1

Code:
$ cat test.awk
NF == 2 {
  if (looking == 1) { print a, b, "0 0 0 0" }
  else { looking = 1 }
  a = $1; b = $2
  }
NF != 2 && looking == 1 {
  if (a == $1 && b == $2) { print }
  else { print a, b, "0 0 0 0" }
  looking = 0
  }

Code:
$ sort file1 file2 | awk -f test.awk
1 123 0 1 0 0
1 125 0 0 0 0
1 234 0 0 0 0
2 123 0 1 0 1
2 234 0 0 0 0
2 300 0 1 2 3
2 312 0 0 0 0
3 10 0 0 0 0
3 215 1 1 2 1
4 56 0 0 0 0

# 9  
Old 04-26-2013
Combining both files into a single stream is a very clever approach, hanson44. Great idea. However, the implementation needs a bit more work. If the last line of the sorted stream belongs to file1, its "a b 0 0 0 0" line is not generated.
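The case described above can be reproduced with a small sketch. The two-line inputs, the temporary directory, and the use of LC_ALL=C (to pin sort's collation order) are assumptions made for the illustration; the awk script is the one posted in #8. The key "4 80" sorts last in the combined stream, so no later line ever triggers its print.

```shell
# Hypothetical miniature inputs in a scratch directory (names are placeholders).
dir=$(mktemp -d)
printf '%s\n' '1 123' '4 80' > "$dir/file1"
printf '%s\n' '1 123 0 1 0 0' '4 70 0 1 0 1' > "$dir/file2"

# The script as first posted, without an END rule.
cat > "$dir/test.awk" <<'EOF'
NF == 2 {
  if (looking == 1) { print a, b, "0 0 0 0" }
  else { looking = 1 }
  a = $1; b = $2
  }
NF != 2 && looking == 1 {
  if (a == $1 && b == $2) { print }
  else { print a, b, "0 0 0 0" }
  looking = 0
  }
EOF

# "4 80" is the final line of the sorted stream, so it is stored in a and b
# but never printed.
LC_ALL=C sort "$dir/file1" "$dir/file2" | awk -f "$dir/test.awk" > "$dir/out"
cat "$dir/out"
```

With these inputs only the line "1 123 0 1 0 0" comes out; the expected "4 80 0 0 0 0" is missing.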

Regards,
Alister
# 10  
Old 04-26-2013
Quote:
If the last line of the sorted stream belongs to file1
Ah, the special case I had not considered. Trying again:

Code:
$ cat file1
1 123
1 125
1 234
2 123
2 234
2 300
2 312
3 10
3 215
4 56
4 80

Code:
$ cat file2
1 123 0 1 0 0
1 126 2 1 0 0
2 123 0 1 0 1
2 138 1 1 1 1
2 300 0 1 2 3
2 311 2 4 6 0
3 120 3 4 1 0
3 215 1 1 2 1
3 216 0 2 1 5
3 345 8 0 1 0
3 357 0 1 1 1
3 500 2 1 0 1
4 17  6 1 0 2
4 70  0 1 0 1

Code:
$ cat test.awk
NF == 2 {
  if (looking == 1) { print a, b, "0 0 0 0" }
  else { looking = 1 }
  a = $1; b = $2
  }
NF != 2 && looking == 1 {
  if (a == $1 && b == $2) { print }
  else { print a, b, "0 0 0 0" }
  looking = 0
  }
END {
  if (looking == 1) { print a, b, "0 0 0 0" }
  }

Code:
$ sort file1 file2 | awk -f test.awk
1 123 0 1 0 0
1 125 0 0 0 0
1 234 0 0 0 0
2 123 0 1 0 1
2 234 0 0 0 0
2 300 0 1 2 3
2 312 0 0 0 0
3 10 0 0 0 0
3 215 1 1 2 1
4 56 0 0 0 0
4 80 0 0 0 0

# 11  
Old 04-26-2013
Assuming that a - (hyphen) does not occur in either of the first two fields:
Code:
awk '{print $1"-"$2 "\t" $0}' file1 | sort -k1,1 > file1.tmp
awk '{print $1"-"$2 "\t" $0}' file2 | sort -k1,1 > file2.tmp
join -e 0 -a 1 -o 1.2,1.3,2.4,2.5,2.6,2.7 file1.tmp file2.tmp > file3

Note: join without the -t option requires sort to use -b. However, since file?.tmp files can never have whitespace before the first field (because awk's default field splitting precludes whitespace in the field variables, $1 and $2, which constitute the new first field), I am able to dispense with sort's -b.
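For anyone who wants to try the pipeline on a small sample first, here is a sketch. The three-line inputs and the scratch directory are assumptions for the illustration, and LC_ALL=C is set only so that sort and join are guaranteed to agree on collation order.

```shell
# Assumption: pin the locale so sort and join use the same ordering.
export LC_ALL=C
dir=$(mktemp -d)
printf '%s\n' '1 123' '1 125' '3 10' > "$dir/file1"
printf '%s\n' '1 123 0 1 0 0' '1 126 2 1 0 0' '3 120 3 4 1 0' > "$dir/file2"

# Prepend a combined "field1-field2" key, then sort on that key.
awk '{print $1"-"$2 "\t" $0}' "$dir/file1" | sort -k1,1 > "$dir/file1.tmp"
awk '{print $1"-"$2 "\t" $0}' "$dir/file2" | sort -k1,1 > "$dir/file2.tmp"

# -a 1 keeps unpaired lines from file1; -e 0 fills their missing fields.
join -e 0 -a 1 -o 1.2,1.3,2.4,2.5,2.6,2.7 \
    "$dir/file1.tmp" "$dir/file2.tmp" > "$dir/file3"
cat "$dir/file3"
```

On this sample, file3 holds "1 123 0 1 0 0", "1 125 0 0 0 0" and "3 10 0 0 0 0": every key from file1 appears once, and keys found only in file2 are dropped.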

Regards,
Alister
# 12  
Old 04-27-2013
Nothing has been said about the sizes of these files. If they are large, the sort could be relatively expensive. Here is a slightly modified version of one of the awk scripts I proposed earlier (correcting a bug that duplicated the final line of output in some cases), with more comments. It avoids the need to sort, since the OP has stated that both input files are already sorted, and it produces the desired output except for three lines. The script doesn't add a trailing space to the output line:
Code:
1 125 0 0 0 0

and it produces the output lines:
Code:
3 10 0 0 0 0
    and
4 56 0 0 0 0

instead of the output lines:
Code:
3 10  0 0 0 0
    and
4 56  0 0 0 0

(note the extra spaces before the "0 0 0 0").

The awk script is:
Code:
awk '
# gf1() -  Get a line from file1.
# Description:
#       Set key1 to the 1st two fields from that line separated by a single
#       space.  It is assumed that the input fields are separated by a
#       combination of one or more spaces and tabs.
# Exit Code:
#       0 EOF or error reading from file1.
# Return Value:
#       1 Successful completion.
function gf1() {
        if((getline file1line < file1) != 1) exit(0)
        split(file1line, field, /[ \t]+/)
        key1 = field[1] FS field[2]
        return(1)
}
{       while(key1 < $1 FS $2) {
                # We are here because either this is the 1st line from file2
                # and we have not read a line from file1 yet or the line from
                # file1 does not have a match in file2 and we are looking at a
                # key from file2 that is greater than the key from the current
                # line in file1...
                if(key1) {
                        # Create a new line for an unmatched key from file1.
                        print key1 " 0 0 0 0"
                        key1 = ""
                }
                gf1()   # Get another line from file1.
        }
}
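key1 == $1 FS $2 {
        # We have a matching key, print the line from file2 and get a new line
        # from both files.  Clear key1 first so the END rule does not print
        # this key again if file1 is already at EOF.
        print key1, $3, $4, $5, $6
        key1 = ""
        gf1()
}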
END {   while(key1) {
                # Create new lines for any remaining unmatched keys from file1.
                print key1 " 0 0 0 0"
                gf1()   # We will exit this loop when we hit EOF on file1.
        }
}' file1=file1 file2 > file3

As always, if you're using a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk.
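If holding file2 in memory is acceptable, a different technique from the merge above is an array lookup, which needs neither a sort nor sorted input. This is only a sketch under that assumption; the array name row, the variable key, and the sample inputs are made up for the illustration, and the NR == FNR test assumes file2 is not empty.

```shell
# Hypothetical sample inputs in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' '1 123' '1 125' '3 215' > "$dir/file1"
printf '%s\n' '1 123 0 1 0 0' '3 215 1 1 2 1' '3 216 0 2 1 5' > "$dir/file2"

awk '
NR == FNR {     # First operand (file2): remember fields 3-6 under the key.
        row[$1 FS $2] = $3 FS $4 FS $5 FS $6
        next
}
{               # Second operand (file1): look up each key, defaulting to zeros.
        key = $1 FS $2
        print key, ((key in row) ? row[key] : "0 0 0 0")
}' "$dir/file2" "$dir/file1" > "$dir/file3"
cat "$dir/file3"
```

On this sample, file3 holds "1 123 0 1 0 0", "1 125 0 0 0 0" and "3 215 1 1 2 1". The output follows the order of file1, and memory use grows with the size of file2, which is the trade-off against the streaming merge.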

Hope this helps,
Don