Help required on Length based lookup


 
# 1  
Old 12-18-2014

Hi,

I have two files: abc.txt has approximately 28k records and bcd.txt has approximately 112k records. The records in both files vary in length.

I am trying to look up the abc.txt records in bcd.txt based on length. I can successfully pick out the bcd.txt records that match an abc.txt record, but I am unable to fetch the records that do not match.

abc.txt
Code:
191
120
122
123
1234
1245
123456
1890

bcd.txt

Code:
120
1201
1203
121
122
1224
12345
123
199

I want the matches and mismatches for each file reported as below:

Code:
abc.txt matches with bcd.txt
120
1201
1203
122
1224
123
12345
abc.txt not matches with bcd.txt
191
1890
1245
 
bcd.txt not matches with abc.txt
199

Below is the script I tried for matching the records. It takes almost 5 hours, and I am still unable to find the mismatched records for either file.

Code:
awk -F"," 'BEGIN{OFS=","}
{
  if (NR==FNR) {            # first file (abc.txt): store every record
    a[FNR]=$0; max=FNR; next
  }
  if (FNR==1) print $0;     # the first line of bcd.txt is always printed
  for (i=1; i<=max; i++) {  # compare against every stored abc.txt record
    tmp = a[i];
    len = length(tmp);
    if (substr($1,1,len) == tmp) { print $0; }
  } # End for
}' abc.txt bcd.txt > abc_matches_bcd.txt;

Please help me with this; it will save a lot of manual work at my end.

Regards,
Ram
# 2  
Old 12-18-2014
Hello rramkrishnas,

The following may help you.
Code:
awk 'BEGIN{print "abc.txt matches with bcd.txt"} FNR==NR{X[$1]=$1;next} ($1 in X){print $1;delete X[$1];next} {Y[$1]=$1} END{print "abc.txt not matches with bcd.txt"; for(u in X){print X[u]};print "bcd.txt not matches with abc.txt"; for(v in Y){print Y[v]}}' abc.txt bcd.txt

Output will be as follows.
Code:
abc.txt matches with bcd.txt
120
122
123
abc.txt not matches with bcd.txt
1890
1234
123456
1245
191
bcd.txt not matches with abc.txt
1201
1203
12345
121
1224
199

EDIT: Adding a multi-line form of the same solution.
Code:
awk 'BEGIN{print "abc.txt matches with bcd.txt"}
     FNR==NR{X[$1]=$1;next}          # abc.txt: remember every key
     ($1 in X){print $1;             # bcd.txt key also in abc.txt: report it
               delete X[$1];
               next}
     {Y[$1]=$1}                      # bcd.txt key with no exact match in abc.txt
     END{
        print "abc.txt not matches with bcd.txt";
        for(u in X){
                        print X[u]
                   };
        print "bcd.txt not matches with abc.txt";
        for(v in Y){
                        print Y[v]
                   }
        }' abc.txt bcd.txt

Thanks,
R. Singh

Last edited by RavinderSingh13; 12-18-2014 at 07:55 AM.. Reason: Added non one liner form for solution
# 3  
Old 12-18-2014
With some wild guessing I presume that you want to match entries based on the smallest common substring. But some questions remain:
Will abc always have the smallest substring or could that be in bcd as well?
Will the smallest substring always precede the longer ones?
Where is the 121 entry from bcd in the outputs? Where is 123456 from abc?
# 4  
Old 12-18-2014
Code:
awk 'NR==FNR{a[$1]++;next}
{if(a[$1]){m[$1]++}else{notM[$1]++}delete a[$1]} 
END {print "matching"; for (i in m) {print i}
print "abc not match with bcd"; for ( j in a) {print j;}
print "bcd not match with abc"; for (k in notM) { print k}}' abc.txt bcd.txt
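
For anyone who wants to cross-check an exact-match result like the one above, here is a minimal sketch using sort and comm. It assumes one key per line and compares whole lines only, so it does not handle the prefix case (120 against 1201). The scratch directory and the .sorted file names are my own choices for illustration; the sample data is copied from post #1.

```shell
#!/bin/sh
# Sample data from post #1, written to a scratch directory so the sketch
# runs on its own. Point the sort commands at the real files instead.
dir=$(mktemp -d)
printf '%s\n' 191 120 122 123 1234 1245 123456 1890 > "$dir/abc.txt"
printf '%s\n' 120 1201 1203 121 122 1224 12345 123 199 > "$dir/bcd.txt"

# comm needs both inputs sorted in the same collation order, hence LC_ALL=C.
LC_ALL=C sort -u "$dir/abc.txt" > "$dir/abc.sorted"
LC_ALL=C sort -u "$dir/bcd.txt" > "$dir/bcd.sorted"

both=$(LC_ALL=C comm -12 "$dir/abc.sorted" "$dir/bcd.sorted")      # in both files
only_abc=$(LC_ALL=C comm -23 "$dir/abc.sorted" "$dir/bcd.sorted")  # only in abc.txt
only_bcd=$(LC_ALL=C comm -13 "$dir/abc.sorted" "$dir/bcd.sorted")  # only in bcd.txt

printf 'in both files\n%s\n' "$both"
printf 'only in abc.txt\n%s\n' "$only_abc"
printf 'only in bcd.txt\n%s\n' "$only_bcd"
rm -r "$dir"
```

On the sample data this gives the same three groups as the output listed in post #2, in sorted order.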

# 5  
Old 12-19-2014
Dear RudiC,

With some wild guessing I presume that you want to match entries based on the smallest common substring. But some questions remain:

Below are my answers to your questions.
Will abc always have the smallest substring or could that be in bcd as well?

The abc.txt records will have the smallest substring, and the same string will also appear in bcd.txt.

Will the smallest substring always precede the longer ones?
Yes, and it will be present in bcd.txt.

Where is the 121 entry from bcd in the outputs? Where 123456 from abc?
121 is an extra entry in bcd.txt. 123456 is present in bcd.txt; however, the record 123 is present in abc.txt, hence 123456 should be treated as a match.

---------- Post updated 12-19-14 at 11:37 AM ---------- Previous update was 12-18-14 at 05:40 PM ----------

Can anyone please help me with this?
# 6  
Old 12-19-2014
Your requirements still are far from clear. Why do 1201, 1203 and 1224 from bcd show up in the "match" result? Where is 1234 from abc?
# 7  
Old 12-19-2014
Dear RudiC,

If you look at my abc.txt, it has a record with the value 120, whereas bcd.txt has the values 1201 and 1203. Since 120 from abc.txt matches the first 3 digits of 1201 and 1203 in bcd.txt, both are matches. Likewise for 1224.

I hope my requirement is clear now.

Regards,
Ram
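
Going only by the rule described in post #7 (a bcd.txt record matches when some abc.txt record equals its leading characters), the lookup could be sketched as below. This is one possible reading, not a confirmed answer to the requirement. The script in post #1 compares every bcd.txt record against every abc.txt record, which is on the order of 3 billion substr comparisons for 28k and 112k records. The sketch instead stores the abc.txt keys in an array and, for each bcd.txt record, tests one prefix per distinct key length found in abc.txt. Assumptions: the key is the first whitespace-separated field (add -F"," for comma-separated data), and the array names key, order, lens, matched and missed are made up for this example.

```shell
#!/bin/sh
# Sample data from post #1, written to a scratch directory so the sketch
# runs on its own. Replace the two paths with the real files.
dir=$(mktemp -d)
printf '%s\n' 191 120 122 123 1234 1245 123456 1890 > "$dir/abc.txt"
printf '%s\n' 120 1201 1203 121 122 1224 12345 123 199 > "$dir/bcd.txt"

result=$(awk '
NR == FNR {                        # first file (abc.txt)
    if ($1 != "" && !($1 in key)) {
        key[$1] = 0                # hit counter for this abc.txt key
        order[++n] = $1            # remember input order for the report
        lens[length($1)] = 1       # distinct key lengths seen in abc.txt
    }
    next
}
$1 != "" {                         # second file (bcd.txt)
    hit = 0
    for (l in lens) {              # one lookup per distinct key length
        if (l + 0 > length($1)) continue
        p = substr($1, 1, l + 0)   # leading characters of the bcd.txt record
        if (p in key) { key[p]++; hit = 1 }
    }
    if (hit) matched[++m] = $1
    else     missed[++u] = $1
}
END {
    print "abc.txt matches with bcd.txt"
    for (i = 1; i <= m; i++) print matched[i]
    print "abc.txt not matches with bcd.txt"
    for (i = 1; i <= n; i++) if (key[order[i]] == 0) print order[i]
    print "bcd.txt not matches with abc.txt"
    for (i = 1; i <= u; i++) print missed[i]
}' "$dir/abc.txt" "$dir/bcd.txt")

printf '%s\n' "$result"
rm -r "$dir"
```

On the sample data the first list comes out as the seven bcd.txt records wanted in post #1. Under this rule, however, 123456 is reported as an abc.txt record without a match and 121 as a bcd.txt record without a match, which differs from the desired output in post #1, so the rule may still need adjusting once the requirement is settled.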