Problem with Join Command


 
# 8  
Old 08-15-2016
As you can see, it works on the copies of the files you posted. So there must be an inherent difference in the original files you are working on. Reduce the files to just the lines in question, try again, and if it still doesn't work, post the output of
Code:
od -tx1c file[12]

.
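To illustrate what such a dump reveals, here is a minimal sketch. The file name sample.tab and its single line are made up for the demonstration; the line deliberately contains a tab separator and a DOS line ending.

```shell
# Hypothetical one-line sample: key, <tab>, date, then a DOS <CR><LF> line ending.
dir=$(mktemp -d)
printf '01635163349\t11/24/2009\r\n' > "$dir/sample.tab"

# Dump every byte in hex with the printable form underneath.
# A tab shows up as 09 / \t and a carriage return as 0d / \r.
dump=$(od -tx1c "$dir/sample.tab")
printf '%s\n' "$dump"

rm -r "$dir"
```

If \r shows up in the dump of a real file, the last field on each line ends in a carriage return and will not compare equal to the same value in a file without one.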
# 9  
Old 08-15-2016
Quote:
Originally Posted by Varshha
I posted the reply yesterday but I am not sure why it is not reflecting. So here it is again :

The files were originally tab delimited but I made them comma delimited to help me with the join command. I am now trying again with tab delimited files. I also tried some additions in my join command and this is what I gave :

Code:
join -t"  "  -a 2 -a 1 -e 'NULL' -o '0,1.1,1.2,2.1,2.2' File1 File2 | head -100

I am getting a result out of this which unfortunately means that UNIX is not finding a common key for the files to join and it is surprising because there ARE common values between the files. This is how the sample of the result looks like :

Code:
01635158332	09/09/2016 01635158332	09/09/2016 NULL NULL NULL
01635163349	11/24/2009 01635163349	11/24/2009 NULL NULL NULL
16.11	01635163339 NULL NULL 16.11	01635163339 NULL
16.11	01635163349 NULL NULL 16.11	01635163349 NULL


As you can see above, 01635163349 is a common key between File 1 that has dates and file 2 that has the cost. So ideally the result should be

Code:
01635163349  11/24/2009  16.11

The command
Code:
join -1 1 -2 1 File 1 File 2

does not give me any result as in no output on the console at all.

This is how file 1 looks:

Code:
00033492482     04/11/2006
00033492682     07/14/2009
00033492702     02/09/2010
00076848302     08/10/2010
00881123792     11/07/2000
01130162424     06/12/2007
01130164254     01/29/2008
01130165543     05/16/2011
01130168864     07/14/2009
01635163349     11/24/2009

File 2:

Code:
0.00    03139822826
0.00    49246820001
0.00    7621830148
0.00    822004599003
0.11    73379268872
0.64    67119603398
0.65    67261704102
16.11   01635163349


Can there be any other way to achieve an inner join between these files?
You have now shown us 3 different input file formats (tab separated fields, <space><comma><space> separated fields, and <space><comma> separated fields). You have shown us commands using <space>, <comma>, and <tab> as the field separator. And it isn't clear which separators have been used in the files those commands are processing.

More importantly, you have said that your file names are File 1, File1, file 1, and file1. Since none of your commands have quoted the filenames being passed as arguments, many of them are asking various utilities to work on files named File or file and 1 and 2 (which presumably result in non-existent file diagnostics that you haven't shown us). The name of a file is case sensitive and having a <space> in a filename requires special handling in LOTS of ways that are being ignored in all of your command lines.
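As a small sketch of the quoting point, assume two throwaway files whose names really do contain a space (the names and contents here are invented for the example). Each name has to be quoted so that join receives it as one argument:

```shell
# Hypothetical files named "File 1" and "File 2" (note the space in each name).
dir=$(mktemp -d)
printf 'a 1\n' > "$dir/File 1"
printf 'a 2\n' > "$dir/File 2"

# Quoted: join sees exactly two operands.
# Unquoted, the shell would split them into four words: File, 1, File, 2.
joined=$(join "$dir/File 1" "$dir/File 2")
printf '%s\n' "$joined"

rm -r "$dir"
```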

Then, it is also important to understand that in an awk script, $0 is the contents of the current input line, $2 is the contents of the 2nd field in the current input line, and a command like:
Code:
awk 'NR=FNR{check[$0];next} $2 in check' File2 File1

is never going to work unless
Code:
File2

contains lines that exactly match the 2nd field of a line in File1 (which is not true for any of your sample input file pairs).
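Note also that NR=FNR in that script is an assignment, not a comparison. It is always true, so every line of both files is stored and skipped and nothing is ever printed. A hedged sketch of what an awk inner join could look like instead, assuming File2 holds cost<tab>key and File1 holds key<tab>date as in the samples (the two-line files below are a cut-down, made-up subset):

```shell
# Hypothetical cut-down inputs in the layout of the samples.
dir=$(mktemp -d)
printf '%s\t%s\n' 01130168864 07/14/2009 01635163349 11/24/2009 > "$dir/File1"
printf '%s\t%s\n' 0.65 67261704102 16.11 01635163349 > "$dir/File2"

# While reading the first file (File2), remember the cost under its key (field 2).
# While reading the second file (File1), print lines whose key (field 1) was seen.
matched=$(awk 'NR == FNR { cost[$2] = $1; next }
               $1 in cost { print $1, $2, cost[$1] }' "$dir/File2" "$dir/File1")
printf '%s\n' "$matched"

rm -r "$dir"
```

Unlike join, this does not need either file to be sorted, but it does hold all of File2's keys in memory.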

And, the command line:
Code:
cat File2 | while read line; do  grep $line File1; done

will only work correctly if there are no <space> or <tab> characters on any line in File2 AND you are trying to find complete lines from File2 that match a subset of a line from File1.
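For comparison, a sketch of a more robust form of that loop, assuming a hypothetical file named keys that holds one key per line. Quoting "$line" keeps embedded blanks intact, read -r keeps backslashes, and grep -F treats the key as a fixed string rather than a regular expression:

```shell
# Hypothetical inputs: a list of keys and a cut-down File1 (key<tab>date).
dir=$(mktemp -d)
printf '%s\n' 01635163349 > "$dir/keys"
printf '%s\t%s\n' 01130168864 07/14/2009 01635163349 11/24/2009 > "$dir/File1"

# Read each key unmodified and search for it as a literal string.
found=$(while IFS= read -r line; do
            grep -F -- "$line" "$dir/File1"
        done < "$dir/keys")
printf '%s\n' "$found"

rm -r "$dir"
```

This still matches the key anywhere on the line, not just in the first field, and it runs grep once per key, so it is only a sketch for small inputs.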

And, the command line:
Code:
join file 2 file1

should give you a diagnostic similar to:
Code:
usage: join [-a file no | -v file no ] [-e string] [-1 field] [-2 field]
            [-o list] [-t char] file1 file2

not the no output that you say you get.

If you keep giving us inconsistent data and don't show us what your command lines and/or the output you get from them really are, you make it impossible for us to help you.

Saying things like:
Quote:
Came out as a typo .... but this is not working either
Doesn't help us. Show us the exact diagnostic that was produced!

Saying things like:
Quote:
These files are being sent by the source. There are many other columns in these files. I have manipulated them to remove the unrequired columns and the header using AWK and SED.
Doesn't give us any indication as to whether or not we are working on UNIX format text files after you have manipulated the files sent by the source. If, after you have manipulated them, the files are still DOS format text files, there is a good chance that fields are not matching because DOS text file <carriage-return> line terminators are keeping fields from matching, or that output sent to your terminal is being obscured by parts of output lines overwriting text already sent to your screen.
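One way to check for and remove such carriage returns, sketched with a made-up one-line file (dos.tab and unix.tab are names chosen only for this example):

```shell
# Hypothetical DOS-format input: the line ends in <CR><LF>.
dir=$(mktemp -d)
printf '01635163349\t11/24/2009\r\n' > "$dir/dos.tab"

cr=$(printf '\r')                        # a literal carriage return
with_cr=$(grep -c "$cr" "$dir/dos.tab")  # number of lines containing a <CR>

# Delete every carriage return to get a UNIX format text file.
tr -d '\r' < "$dir/dos.tab" > "$dir/unix.tab"
after=$(grep -c "$cr" "$dir/unix.tab" || true)  # grep exits non-zero when the count is 0

printf 'lines with CR before: %s, after: %s\n' "$with_cr" "$after"
rm -r "$dir"
```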

Please give us clear answers to the questions we have asked. We are asking for information that will allow us to help you. We are not asking you to do extra work for the fun of it.

Please help us help you!
# 10  
Old 08-15-2016
Sorry for the confusion that I am creating. Let me start from the beginning. These files are tab delimited files. Because there is a confusion with the file names, I will henceforth use the original file names --dlya0908.tab (which I was referring as File1) and tgpr.tab (which I was referring as File2)

I do not have information on how the source team creates the files. They are FTPed from an external server.

So, dlya0908.tab looks like this :

Code:
00033492482     04/11/2006
00033492682     07/14/2009
00033492702     02/09/2010
00076848302     08/10/2010
00881123792     11/07/2000
01130162424     06/12/2007
01130164254     01/29/2008
01130165543     05/16/2011
01130168864     07/14/2009
01635163349     11/24/2009

and tgpr.tab looks like this :

Code:
0.00    03139822826
0.00    49246820001
0.00    7621830148
0.00    822004599003
0.11    73379268872
0.64    67119603398
0.65    67261704102
16.11   01635163349

I am trying to join these files like this :

Code:
join -t"  "  -11 -22 dlya0908.tab tgpr.tab

The above command does not give me any result.

When I give

Code:
join -a 1 -a 2 -e "NULL" -o'0,1.1,2.2' dlya0908.tab tgpr.tab

I get

Code:
0.00 NULL 03139822826
0.00 NULL 49246820001
0.00 NULL 7621830148
0.00 NULL 822004599003
0.11 NULL 73379268872
0.64 NULL 67119603398
0.65 NULL 67261704102
00033492482 00033492482 NULL
00033492682 00033492682 NULL
00033492702 00033492702 NULL
00076848302 00076848302 NULL
00881123792 00881123792 NULL
01130162424 01130162424 NULL
01130164254 01130164254 NULL
01130165543 01130165543 NULL
01130168864 01130168864 NULL
01635163349 01635163349 NULL
16.11 NULL 01635163349

This is wrong because there is a common key here -- 01635163349.

So the output I am looking for is :

Code:
01635163349  11/24/2009  16.11

I am looking for a way to inner join these files. tgpr.tab is a full dump file while dlya0908.tab is a daily file.

I hope the information I have given is helpful this time.
# 11  
Old 08-15-2016
Files being joined by the join utility must be ordered in the collating sequence of sort -b on the fields on which they are being joined.

The 2nd field in tgpr.tab is NOT in sorted order. And, with the sample data you showed us in post #10, every line in dlya0908.tab sorts before the 1st line in tgpr.tab.
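The ordering can be checked with sort -c, which exits non-zero when its input is not already sorted on the given key. A minimal sketch, using three lines taken from the tgpr.tab sample and assuming the C locale for both the check and the later join:

```shell
# Hypothetical cut-down tgpr.tab: cost<tab>key, in the original (unsorted) order.
dir=$(mktemp -d)
printf '%s\t%s\n' 0.00 822004599003 0.11 73379268872 16.11 01635163349 > "$dir/tgpr.tab"

# Check field 2 only; -b ignores the leading blank that separates the fields.
if LC_ALL=C sort -c -b -k2,2 "$dir/tgpr.tab" 2>/dev/null; then
    before=sorted
else
    before=unsorted
fi

# Sort on field 2 and check again.
LC_ALL=C sort -b -k2,2 "$dir/tgpr.tab" > "$dir/tgpr.sorted"
if LC_ALL=C sort -c -b -k2,2 "$dir/tgpr.sorted" 2>/dev/null; then
    after_sort=sorted
else
    after_sort=unsorted
fi

printf 'before: %s, after: %s\n' "$before" "$after_sort"
rm -r "$dir"
```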
# 12  
Old 08-15-2016
Quote:
Originally Posted by Varshha
.
.
.
I hope the information I have given is helpful this time.
Sorry, no. Nothing new. How about the octal dump?

Quote:
join -t" " -11 -22 dlya0908.tab tgpr.tab
With a <TAB> (\t, 0x09) character following the -t option, this will print the desired result unless field 1 in dlya0908.tab can't be joined with field 2 in tgpr.tab because of (obviously non-printing) differences.


Quote:
join -a 1 -a 2 -e "NULL" -o'0,1.1,2.2' dlya0908.tab tgpr.tab
will compare field 1 in dlya0908.tab to field 1 in tgpr.tab and won't find identical entries, of course.
This User Gave Thanks to RudiC For This Post:
# 13  
Old 08-15-2016
If dlya0908.tab is in sorted order by field 1, and tgpr.tab is not changing while your script is running, you might want to try:
Code:
sort -b -k2,2 tgpr.tab|join -1 1 -2 2 -a 1 -a 2 -e "NULL" -o'0,1.1,2.2' dlya0908.tab -

which, with the data shown in post #10, produces the output:
Code:
00033492482 00033492482 NULL
00033492682 00033492682 NULL
00033492702 00033492702 NULL
00076848302 00076848302 NULL
00881123792 00881123792 NULL
01130162424 01130162424 NULL
01130164254 01130164254 NULL
01130165543 01130165543 NULL
01130168864 01130168864 NULL
01635163349 01635163349 01635163349
03139822826 NULL 03139822826
49246820001 NULL 49246820001
67119603398 NULL 67119603398
67261704102 NULL 67261704102
73379268872 NULL 73379268872
7621830148 NULL 7621830148
822004599003 NULL 822004599003

Or, with just:
Code:
sort -b -k2,2 tgpr.tab|join -1 1 -2 2 -e "NULL" -o'0,1.1,2.2' dlya0908.tab -

and those same input files, you get the output:
Code:
01635163349 01635163349 01635163349

PS: Note, however, that this only works if your input files actually have tab separated fields. The sample files you have provided in this thread use sequences of spaces as field separators (not tabs).
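To get the key, date, and cost layout asked for in post #10, the output list can name field 2 of the first file and field 1 of the second. A sketch under the assumption that the real files are tab separated; the cut-down files below are made up from the sample data, and LC_ALL=C is an assumption so that sort and join agree on the collating sequence:

```shell
# Hypothetical cut-down inputs with real tab separators.
dir=$(mktemp -d)
tab=$(printf '\t')
printf '%s\t%s\n' 01130168864 07/14/2009 01635163349 11/24/2009 > "$dir/dlya0908.tab"
printf '%s\t%s\n' 0.00 822004599003 0.11 73379268872 16.11 01635163349 > "$dir/tgpr.tab"

# Sort tgpr.tab on its key (field 2), then inner join it with dlya0908.tab,
# printing the join field, the date (1.2), and the cost (2.1).
result=$(LC_ALL=C sort -t "$tab" -k2,2 "$dir/tgpr.tab" |
         LC_ALL=C join -t "$tab" -1 1 -2 2 -o '0,1.2,2.1' "$dir/dlya0908.tab" -)
printf '%s\n' "$result"

rm -r "$dir"
```

With -t set to a tab, the output fields are tab separated as well. dlya0908.tab is assumed to be sorted on field 1 already, as it is in the sample.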

Last edited by Don Cragun; 08-15-2016 at 08:03 AM.. Reason: Add PS.
This User Gave Thanks to Don Cragun For This Post:
# 14  
Old 08-15-2016
Right! So it was the second column that was creating trouble. It is working fine now. Thank You for your time Don and RudiC!