need help using "join"

02-02-2008

Dear experts,

I urgently need to know how to join these 2 files that match.

I have one file which looks like
and another file that looks like
I desperately need to know how to match the 2 files using the second "-" as a field separator. I've been trying all day and cannot figure it out. Please help!

02-02-2008
Two things:

I've never used join, so you're probably going to get a better answer from someone else.

I didn't find the problem statement very clear, and looking at the data didn't help much. It appears that there isn't any commonality between the data. For example, are the first entries from the two files supposed to correlate?
1-0-0060122450000 0060122000550

Are the 0060122 numbers what you're trying to join on? If so, what is the desired output you're after?

Anyway, I experimented with join a bit and found the results could easily be sent through awk to simply print the fields you need. Maybe you should approach it like that.

If you post a more detailed description of what you're after it may help.
02-02-2008
Sorry, I was not precise.
The two files are huge, over 2 million records each, but the format is the same. The only difference is that one file has the additional "1-0-" or "1-1-" or "1-3-" prefix.

I'm doing it the long way: cutting characters 1-4 off, then using paste and join again later.

I'd appreciate it if you could let me know how to do it using awk. Thanks!

02-02-2008
Well, my assumptions about what you were trying to do were bad, so my awk solution didn't pan out. However, I gave it a shot with a really short Python script, and I think I may have what you need.

To be honest, you still didn't give me a clear idea of what you wanted your output to look like, so here's what I assumed. If I'm wrong, then sorry, this is my last shot.

Using the first four lines of your data, I think you want this output:
0060122450000 2000550
0060122450001 2000632
0060122450002 2001374
0060122450006 2004006

The first number is from file_a with the first four chars chopped off. The second number is the last seven chars of the matching line from file_b.

If this is correct, then here's a super simple Python script to get that for you:

script name:

# open the data files (text mode, so each line is an ordinary string)
fa = open('file_a', 'r')
fb = open('file_b', 'r')

# Go through the files line by line stripping out just the parts
# that you want to keep. Also strip the newlines from the end.

for line in fa:
    # drop the 4-char prefix ("1-0-", "1-1-", ...) from the file_a line
    bita = line[4:].strip('\n')
    tmpb = fb.readline().strip('\n')
    # You could add a check here to ensure tmpb matches bita. You'd have
    # to do some additional chopping though. With millions of records,
    # I'd do it.
    # keep only the last 7 chars of the file_b line
    bitb = tmpb[-7:]
    print(bita + ' ' + bitb)

# close the files
fa.close()
fb.close()

Run the script like this:
shellPrompt$ python

The script makes the assumption that your data files are matched up correctly, with the entries matching position-wise all the way through. If they're not, then this won't work without modifications. Your data will be wrong if the values are shifted.
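
If the files turn out not to line up position-wise, one option is to pair records by key instead. Below is a minimal sketch of that idea, not a tested solution for your data. It assumes (the thread doesn't confirm this) that every record is whitespace-separated with the key in the first field, and that file_a records only differ by a prefix ending at the second "-". The names strip_prefix and join_by_key and the sample records are made up for illustration.

```python
def strip_prefix(line):
    """Drop everything up to and including the second '-' (e.g. '1-0-')."""
    parts = line.rstrip('\n').split('-', 2)
    # Lines without two dashes are returned unchanged (minus the newline).
    return parts[2] if len(parts) == 3 else line.rstrip('\n')


def join_by_key(lines_a, lines_b):
    """Pair records by key rather than by position.

    Assumed layout: key in the first whitespace-separated field;
    file_a records carry an extra prefix such as '1-0-' before the key.
    """
    # Index file_b by key (first record wins if a key repeats).
    index_b = {}
    for line in lines_b:
        fields = line.split()
        if fields and fields[0] not in index_b:
            index_b[fields[0]] = fields[1:]

    joined = []
    for line in lines_a:
        fields = strip_prefix(line).split()
        # Records with no partner in file_b are skipped, like join's default.
        if fields and fields[0] in index_b:
            joined.append(' '.join(fields + index_b[fields[0]]))
    return joined


if __name__ == '__main__':
    # Made-up sample records, only to show the shape of the result.
    a = ['1-0-1001 apple\n', '1-3-1003 cherry\n', '1-1-1002 banana\n']
    b = ['1002 yellow\n', '1001 red\n']
    for row in join_by_key(a, b):
        print(row)
```

Since this builds a dictionary from file_b, neither file has to be sorted, at the cost of holding one file's records in memory.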

And finally, there are probably more elegant python or shell techniques of doing this, but this works.

Good luck.

Last edited by H2OBoodle; 02-02-2008 at 09:18 AM.. Reason: give more info on how to run the script, warn on data corruption if data is not aligned.
02-02-2008

Could you please make clear what you want in the output file? For the join command to run successfully, the join field must be the same length in both files, and both files must be sorted on it in the same order: sort -n if the field is digits, sort -d if it is alphabetic. Note that join compares the field as text, so sort -n is only safe when the numbers are zero-padded to the same length. For (slightly) faster execution, keep the join field as the first field and use

join -1 1 -2 1 -t<delimiter, if you have used any> -o 1.1,1.2,1.3 file1 file2 > output_file

And your job is done.
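
To show why the sort order matters, here is a rough Python sketch of a merge-style join over two inputs that are already sorted on the key. It only illustrates the general idea and is not how join itself is implemented. The function name merge_join and the sample data are made up, and it assumes each key appears at most once per input.

```python
def merge_join(rows_a, rows_b):
    """Join two lists of (key, value) pairs that are already sorted by key.

    Assumes keys are unique within each list and compare as plain strings.
    """
    out = []
    i = j = 0
    while i < len(rows_a) and j < len(rows_b):
        key_a, val_a = rows_a[i]
        key_b, val_b = rows_b[j]
        if key_a == key_b:
            out.append((key_a, val_a, val_b))
            i += 1
            j += 1
        elif key_a < key_b:
            i += 1  # key_a is behind; on sorted input it has no partner
        else:
            j += 1  # key_b is behind; on sorted input it has no partner
    return out


# Sorted input: every common key is found.
a = [('0001', 'x'), ('0002', 'y'), ('0010', 'z')]
b = [('0002', 'p'), ('0010', 'q')]
print(merge_join(a, b))

# Unsorted first input: the match for '0002' is silently lost.
print(merge_join([('0010', 'z'), ('0002', 'y')], b))
```

Each input is read once from top to bottom, which is why a record that is out of order never gets a second chance to match.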

02-06-2008
Thanks H2OBoodle and mahesh. Both solutions work really well! Thanks again.
