use python or awk to match names 'with error tolerance'


 
# 1  
Old 01-19-2009

I am facing what I think is a very challenging problem, and I have no idea how to deal with it.
Suppose I have two CSV files:

A.csv
Toyota Camry,1998,blue
Honda Civic,1999,blue

B.csv
Toyota Inc. Camry, 2000km
Honda Corp Civic,1500km

I want to generate C.csv:
Toyota Camry,1998,blue,2000km
Honda Civic,1999,blue,1500km

The hardest part of the task is that the match needs error tolerance to cope with variations in the company name:
1. extra spaces
2. extra dots
3. suffixes such as Inc. or Corp.

Is this mission impossible?
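
The three variations listed above can all be handled by normalizing each name before comparing it. The following is only a minimal sketch of that idea in Python; the function name `normalize_name` and the `SUFFIXES` set are assumptions made for illustration, and the suffix list covers just the two examples from the question.

```python
# Assumed suffix list (lowercase); extend it to fit the real data.
SUFFIXES = {"inc", "corp"}

def normalize_name(name):
    """Return a comparison key: no dots, no company suffixes, single spaces."""
    # Turning dots into spaces and calling split() also collapses extra spaces.
    words = name.replace(".", " ").split()
    kept = [w for w in words if w.lower() not in SUFFIXES]
    return " ".join(kept)

print(normalize_name("Toyota  Inc. Camry"))  # Toyota Camry
print(normalize_name("Honda Corp Civic"))    # Honda Civic
```

Once both files are reduced to the same key, the join itself is an ordinary dictionary lookup.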
# 2  
Old 01-19-2009
Code:
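#!/usr/bin/perl
use strict;
use warnings;

my %hash;

# Index A.csv by name (first field), keeping the whole line.
open my $fa, '<', 'A.csv' or die "A.csv: $!";
while (<$fa>) {
	chomp;
	my ($name) = split /,/;
	$hash{$name} = $_;
}
close $fa;

# Normalize each name in B.csv, then append the rest of its line.
open my $fb, '<', 'B.csv' or die "B.csv: $!";
while (<$fb>) {
	chomp;
	my ($name, $rest) = split /,/, $_, 2;
	$name =~ s/\b(?:Inc|Corp)\b\.?//gi;   # drop company suffixes
	$name =~ s/\.//g;                     # drop stray dots
	$name =~ s/\s+/ /g;                   # collapse extra spaces
	$name =~ s/^ | $//g;                  # trim both ends
	$rest =~ s/^\s+//;
	$hash{$name} .= ",$rest" if exists $hash{$name};
}
close $fb;

# Hash order is arbitrary, so the output lines are unordered.
print "$_\n" for values %hash;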

# 3  
Old 01-19-2009
Lemme give it a try:

Code:
# Matches on the first word of the name only (e.g. "Toyota").
while IFS= read -r x; do
  first=${x%% *}
  printf '%s,' "$x"
  grep "^$first" B.csv | awk -F, '{print $NF}' | sed 's/^ *//;s/ *$//'
done < A.csv

# 4  
Old 02-26-2009
I don't know Perl. Could you please do it in Python or SAS?
# 5  
Old 02-27-2009
Code:
import re

f1, f2 = 'A.csv', 'B.csv'
sep = ','
excl = {'Inc', 'Corp'}  # company suffixes to ignore

# Map each name in A.csv to the rest of its line.
ah = {}
with open(f1) as a:
    for line in a:
        if not line.strip():
            continue  # skip blank lines
        name, rest = line.strip().split(sep, 1)
        ah[name] = rest

with open(f2) as b:
    for line in b:
        if not line.strip():
            continue
        name, rest = line.strip().split(sep, 1)
        # Drop dots, then suffix words; split()/join() also collapses extra spaces.
        n = re.sub(r"[.,]", "", name)
        s = " ".join(w for w in n.split() if w not in excl)
        if s in ah:
            print(sep.join([s, ah[s], rest]))
        else:
            print("Could not match", s, "with", f1)

Output:
Code:
C:\Projects\Python>type A.csv
Toyota Camry,1998,blue
Honda Civic,1999,blue

C:\Projects\Python>type B.csv
Toyota  Inc. Camry, 2000km
Honda Corp.     Civic,1500km

C:\Projects\Python>match.py
Toyota Camry,1998,blue, 2000km
Honda Civic,1999,blue,1500km
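
The lookup above still needs the cleaned name to match exactly, so a typo such as "Toyta" would not be found. One hedged option for that case is the standard library's `difflib.get_close_matches`; the helper name `closest_name` and the 0.8 cutoff below are assumptions picked for illustration, not tuned values.

```python
import difflib

def closest_name(name, known_names, cutoff=0.8):
    """Return the most similar known name, or None if nothing clears the cutoff."""
    hits = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    return hits[0] if hits else None

known = ["Toyota Camry", "Honda Civic"]
print(closest_name("Toyta Camry", known))  # Toyota Camry
print(closest_name("Mazda 3", known))      # None
```

A lower cutoff tolerates more errors but raises the risk of joining the wrong rows, so it is worth reviewing the fuzzy matches by hand.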

# 6  
Old 02-27-2009
Code:
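nawk 'BEGIN{FS=","}
NR==FNR { _[$1]=$0; next }        # first file: index lines by name
{
  gsub(/(Inc|Corp)\.?/,"",$1)     # drop company suffixes
  gsub(/\./,"",$1)                # drop stray dots
  gsub(/ +/," ",$1)               # collapse extra spaces
  sub(/^ +/,"",$2)
  if($1 in _) _[$1]=_[$1] "," $2
}
END{
  for(i in _)
  print _[i]
}' A.csv B.csv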

# 7  
Old 03-01-2009
Thanks a lot for the replies. Is it also possible to use a manual translation table?

Suppose the files are now:
A.csv
Toyota Camry,1998,blue
Honda Civic,1999,blue
Acura Inf,2000,yellow

B.csv
Toyota Inc. Camry, 2000km
Honda Corp Civic,1500km
HondaUSA Inf, 2000, 2300km

I want to generate C.csv:
Toyota Camry,1998,blue,2000km
Honda Civic,1999,blue,1500km
HondaUSA Inf,2000,yellow,2300km

How can I set up a translation table that says, for example, that Acura translates to HondaUSA?
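
One possible approach, sketched in Python under stated assumptions: keep the translation table as a small hand-maintained dictionary and apply it word by word when building the lookup key. The names `TRANSLATE`, `SUFFIXES`, `clean` and `merge` are hypothetical, and the sketch assumes the value to append is always the last field of each B.csv line, as in the sample data.

```python
# Hypothetical hand-maintained table: name used in B.csv -> name used in A.csv.
TRANSLATE = {"HondaUSA": "Acura"}
SUFFIXES = {"Inc", "Corp"}

def clean(name):
    """Split a name into words, dropping dots, extra spaces and suffixes."""
    return [w for w in name.replace(".", " ").split() if w not in SUFFIXES]

def merge(a_lines, b_lines):
    """Join B rows onto A rows; output keeps B's cleaned name, as requested."""
    a_rows = {}
    for line in a_lines:
        name, rest = line.strip().split(",", 1)
        a_rows[name] = rest
    merged = []
    for line in b_lines:
        fields = [f.strip() for f in line.split(",")]
        words = clean(fields[0])
        # Translate each word only for the lookup key, not for the output.
        key = " ".join(TRANSLATE.get(w, w) for w in words)
        if key in a_rows:
            merged.append(",".join([" ".join(words), a_rows[key], fields[-1]]))
    return merged

a = ["Toyota Camry,1998,blue", "Honda Civic,1999,blue", "Acura Inf,2000,yellow"]
b = ["Toyota Inc. Camry, 2000km", "Honda Corp Civic,1500km",
     "HondaUSA Inf, 2000, 2300km"]
for row in merge(a, b):
    print(row)
```

The table could equally be loaded from a third two-column CSV file, so that new aliases can be added without touching the script.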