Merge two files with different lengths


 
# 1  
Old 04-19-2013

Hi there,

I have two very long files like:

file1: two fields

Code:
1 123 
1 125
1 234
2 123
2 234
2 300
2 312
3 10
3 215
4 56
...

file2: 6 fields

Code:
1 123 0 1 0 0
1 126 2 1 0 0
2 123 0 1 0 1
2 138 1 1 1 1
2 300 0 1 2 3
2 311 2 4 6 0
3 120 3 4 1 0
3 215 1 1 2 1
3 216 0 2 1 5
3 345 8 0 1 0
3 357 0 1 1 1
3 500 2 1 0 1
4 17  6 1 0 2
4 70  0 1 0 1
...

The numbers of lines in file1 and file2 are not equal.

I want to get an output file like

file3: 6 fields, where the first two fields are exactly the first two fields of file1. For example, the file1 line "1 123" has a match in file2, "1 123 0 1 0 0", so the whole file2 line "1 123 0 1 0 0" is printed to file3. If a line in file1 has no match in file2, e.g. "1 125", then "1 125 0 0 0 0" is printed to file3.

Code:
1 123 0 1 0 0
1 125 0 0 0 0 
1 234 0 0 0 0
2 123 0 1 0 1
2 234 0 0 0 0
2 300 0 1 2 3
2 312 0 0 0 0
3 10  0 0 0 0
3 215 1 1 2 1
4 56  0 0 0 0
...

Can this be done with awk, join, or any other tool on Linux? Since the files are very large, I really want it to be fast. Thanks a lot!

Note: Field 2 in both file1 and file2 contains only numeric values, but field 1 in both files may contain letters too. Both files are sorted on these two fields. Neither file contains duplicate keys, so the following situation will not occur:

Code:
1 123
1 123
...

Also, lines in file2 that have no match in file1 can be ignored. For example, "1 126 2 1 0 0" has no match in file1, so it should not be added to file3.

Last edited by ClaraW; 04-26-2013 at 08:38 PM..
# 2  
Old 04-19-2013
Try something like:
Code:
awk '{p=$0; getline<f} $1 FS $2 != p { $0 = p " 0 0 0 0" }1' f=file2 file1 > file3

This User Gave Thanks to Scrutinizer For This Post:
# 3  
Old 04-20-2013
Where are file2's lines 1 126 and 2 138 in your output file? Does your specification imply that lines in file2 that don't exist in file1 are to be suppressed? What happens when two or more consecutive lines in file1 are missing from file2?

While Scrutinizer's proposal may be very fast, it reads the two files in lockstep, so it's difficult to get them back in sync if more than one line is missing in either.

After adding lines 1 127 and 1 128 to file1, try this one:
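
To see the loss of sync concretely, here is a small reproduction that runs the one-liner from post #2 on the first four lines of each sample file from post #1. The output name lockstep.out is made up for this illustration, and the snippet overwrites file1 and file2, so it is best run in a scratch directory. file2 has an extra line, 1 126, that pushes its later lines out of step with file1:

```shell
# First four lines of the sample files from post #1.
printf '%s\n' '1 123' '1 125' '1 234' '2 123' > file1
printf '%s\n' '1 123 0 1 0 0' '1 126 2 1 0 0' '2 123 0 1 0 1' '2 138 1 1 1 1' > file2

# The lockstep one-liner: exactly one file2 line is read per file1 line.
awk '{p=$0; getline<f} $1 FS $2 != p { $0 = p " 0 0 0 0" }1' f=file2 file1 > lockstep.out

cat lockstep.out
```

The last output line is 2 123 0 0 0 0 even though file2 contains 2 123 0 1 0 1, because that file2 line was already consumed while file1 was still at 1 234.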
Code:
sort file1 file2 |
awk     'tmp == $1 FS $2        {tmp = ""}
         tmp && 
           tmp != $1 FS $2      {print tmp, "0 0 0 0"}
         NF == 2                {tmp = $0; next} 
         NF == 6                {print; tmp = ""}
         END                    {if (tmp) print tmp, "0 0 0 0"}
        '
1 123 0 1 0 0
1 125 0 0 0 0
1 126 2 1 0 0
1 127 0 0 0 0
1 128 0 0 0 0
1 234 2 3 0 0
2 123 0 1 0 1
2 138 1 1 1 1
2 234 0 0 0 0

These 2 Users Gave Thanks to RudiC For This Post:
# 4  
Old 04-20-2013
Thanks, RudiC, I may have misread the requirement. If so, this adaptation might fix this:
Code:
awk '{p=$1 FS $2; while((getline<f)>0 && $1 FS $2 < p); } $1 FS $2 != p { $0 = p " 0 0 0 0" }1' f=file2 file1

or
Code:
awk '{p=$1; q=$2; while((getline<f)>0 && ( $1<p || ($1==p && $2<q) ) ); } $1!=p || $2!=q { $0 = p FS q FS "0 0 0 0" }1' f=file2 file1
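
One caveat with reading file2 via getline for every file1 record: a file2 line that sorts after the current file1 key is dropped before the next file1 key gets a chance to match it. A possible variation, sketched below, buffers that look-ahead line instead. The names read2, before, line2, k1, k2 and eof are made up for this sketch. It assumes both files are sorted the same way, with field 1 in the order awk's own < comparison gives and field 2 in numeric order. It overwrites file1, file2 and file3, so try it in a scratch directory:

```shell
# Sample data from post #1.
cat > file1 <<'EOF'
1 123
1 125
1 234
2 123
2 234
2 300
2 312
3 10
3 215
4 56
EOF
cat > file2 <<'EOF'
1 123 0 1 0 0
1 126 2 1 0 0
2 123 0 1 0 1
2 138 1 1 1 1
2 300 0 1 2 3
2 311 2 4 6 0
3 120 3 4 1 0
3 215 1 1 2 1
3 216 0 2 1 5
3 345 8 0 1 0
3 357 0 1 1 1
3 500 2 1 0 1
4 17  6 1 0 2
4 70  0 1 0 1
EOF

awk -v f=file2 '
# Buffer the next file2 line in line2 and its key in k1, k2; set eof at the end.
function read2(   a) {
    if ((getline line2 < f) > 0) { split(line2, a); k1 = a[1]; k2 = a[2] }
    else eof = 1
}
# True while the buffered file2 key sorts before the current file1 key.
function before() {
    return k1 < $1 || (k1 == $1 && k2 + 0 < $2 + 0)
}
BEGIN { read2() }
{
    while (!eof && before()) read2()      # skip lines that exist only in file2
    if (!eof && k1 == $1 && k2 + 0 == $2 + 0)
        print line2                       # match: emit the buffered file2 line
    else
        print $1, $2, 0, 0, 0, 0          # no match: pad with zeros
}' file1 > file3

cat file3
```

Because the buffered line is only replaced once its key is known to be smaller than the current file1 key, each file is read exactly once and neither has to fit in memory.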


Last edited by Scrutinizer; 04-20-2013 at 07:04 AM..
This User Gave Thanks to Scrutinizer For This Post:
# 5  
Old 04-22-2013
Thank you so much, Scrutinizer and RudiC! And yes, RudiC, such lines are to be suppressed. I only want the information for the lines in file1: if a line exists in file2, then add fields 3, 4, 5 and 6 from file2 to the file1 line; if a line in file1 is not in file2, then add "0 0 0 0" for fields 3, 4, 5 and 6. But if a line in file2 does not exist in file1, then just ignore it.
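
The requirement as restated here can also be sketched as a plain lookup table keyed on the first two fields. This is only an illustration, not one of the solutions posted in the thread: it assumes file2 fits in memory (which may not hold for very large files) and that file2 is not empty. The array name row is made up, and the snippet overwrites file1, file2 and file3, so use a scratch directory:

```shell
# A few lines of the sample data from post #1.
printf '%s\n' '1 123' '1 125' '2 123' '2 234' > file1
printf '%s\n' '1 123 0 1 0 0' '1 126 2 1 0 0' '2 123 0 1 0 1' '2 138 1 1 1 1' > file2

awk '
NR == FNR { row[$1 FS $2] = $0; next }      # first operand (file2): store each line by key
{
    key = $1 FS $2
    if (key in row) print row[key]          # key found: emit the stored file2 line
    else            print key, 0, 0, 0, 0   # key missing: pad with zeros
}' file2 file1 > file3

cat file3
```

Since the output is driven by file1 alone, lines that exist only in file2 (such as 1 126 and 2 138) never reach file3, and neither file needs to be sorted.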
# 6  
Old 04-22-2013
Here are two awk scripts that I think do what you need. Use the 1st script if the field separator in either file is a mixture of spaces and tabs. Use the 2nd script if the field separator in both files is always a single space. In both cases, if you're using a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk:
Code:
echo 1st form:
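awk '
# Read the next line of file1 into key1; at end of file1, flag it and stop.
function gf1() {
        if((getline file1line < file1) != 1) {
                eof1 = 1
                exit(0)         # from a main rule, END still runs afterwards
        }
        split(file1line, field, /[ \t]+/)
        key1 = field[1] FS field[2]
        return(1)
}
# file1 keys that sort before the current file2 key have no match.
{       while(key1 < $1 FS $2) {
                if(key1) print key1 " 0 0 0 0"
                gf1()
        }
}
key1 == $1 FS $2 {
        print key1, $3, $4, $5, $6
        gf1()
}
# file2 is exhausted: zero-pad whatever is left of file1.
END {   while(!eof1) {
                if(key1 != "") print key1 " 0 0 0 0"
                gf1()
        }
}' file1=file1 file2 > file3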

echo 2nd form:

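awk '
# Read the next line of file1 into key1; at end of file1, flag it and stop.
function gf1() {
        if((getline key1 < file1) != 1) {
                eof1 = 1
                exit(0)         # from a main rule, END still runs afterwards
        }
        return(1)
}
# file1 keys that sort before the current file2 key have no match.
{       while(key1 < $1 FS $2) {
                if(key1) print key1 " 0 0 0 0"
                gf1()
        }
}
key1 == $1 FS $2 {
        print
        gf1()
}
# file2 is exhausted: zero-pad whatever is left of file1.
END {   while(!eof1) {
                if(key1 != "") print key1 " 0 0 0 0"
                gf1()
        }
}' file1=file1 file2 > file4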

Note that the 2nd form sends output to file4 instead of file3 (so you can run both forms and compare the output).
This User Gave Thanks to Don Cragun For This Post:
# 7  
Old 04-26-2013
Hi all,

I think the problem may be that I did not state my question clearly. I've edited the original post to make it easier to understand.

Thanks,

Lu