CSV joining and checking multiple files

08-22-2016

Registered User

57, 3

Join Date: Jan 2016

Last Activity: 19 September 2019, 10:01 AM EDT

Posts: 57

Thanks Given: 17

Thanked 3 Times in 2 Posts

Hello,

For our work we use several scripts to gather and combine data for our webshop. Until now we did not have any problems, but for a couple of days we have noticed some mismatches between imports.

It happened that several barcodes were matched even though they belong to completely different products. Of course the scripts aren't checking for this yet, so we need to upgrade them to check for it and give us a list of records to check or update.

The supplier sends us a CSV file with data as shown below:

supplier_clean.csv

Code:

ean;pps_reference;stock;price;sku;mpn;manufacturer
4260010852693;1043154;84;743.42;P00000172;70100118555;Fujifilm
4960999575285;273189;9400;141.80;P00009067;2768B016;Canon
0013803092899,4960999575292;27433196;44;44.94;P00022338;2768B017;Canon
8715946388540;2944686;1030;47.76;P00000878;C13S042167;Epson
0088698115763,3141725001174;3654125;20;54.80;P00004251;C1825A;Hewlett Packard

This file is being joined to another file with the following code, more on this here:

joining.sh

Code:

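#!/bin/sh
# Usage: joining.sh website.csv supplier.csv output.csv

awk  '
BEGIN           {FS = OFS = ";"
                 print "ean;sku;pps_reference;mpn;stock;price;manufacturer;supplier_code"
                }
                # Strip spaces from the barcode field of every record.
                {gsub (/ /, "", $1)
                }
                # First file (website): map each barcode to its SKU.
NR == FNR       {for (n = split($1, T, ","); n > 0; n--) S[T[n]] = $2
                 next
                }
                # Second file (supplier): on the first known barcode, insert
                # the SKU as a new second field and print the record.
                {for (n = split($1, T, ","); n > 0; n--)
                     if (T[n] in S) {$2 = S[T[n]] OFS $2
                                     print
                                     next
                                    }
                }
' "$1" "$2" > "$3"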

This script gets called as follows:

Code:

join_prijslijst.sh website_clean.csv supplier_clean.csv results.csv

The website_clean has the following data (short example)

website_clean.csv

Code:

Barcode;Sku;Manufacturer
0696720480781,4000567150589;P00002801;Braun Photo Technik
4000461043031;P00002800;Dörr
4000461034213,4000461037818;P00002799;Dörr
0891257001526,8912570015266;P00002634;Gary Fong
0891257001106;P00002633;Gary Fong
0887111646026;P00002632;HP
0887111515629;P00002631;HP

The problem is that the manufacturer check has to happen during the join, and to make matters worse some suppliers use different names for the same manufacturer (for example HP, Hewlett Packard, etc.).

My idea is to have another file that lists all the possible names of each manufacturer (see below for an example). The manufacturer from website_clean is looked up in that file and the resulting names are compared against the manufacturer in supplier_clean.csv. If the correct name is in there, the script continues as normal; if not, it writes the line to a separate file which we can then check manually. In this separate file I need both lines, though, so we can check which would be the correct name/product.

manufacturer_check.csv

Code:

manufacturer
HP,Hewlett Packard, HP INC.
Canon
Fujifilm
Epson
WD,Western Digital
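
A minimal sketch of how such an alias file could be turned into a lookup table with awk. It assumes one manufacturer per line with alternative spellings separated by commas, and it compares names case-insensitively after trimming spaces (note the leading space before "HP INC." in the sample). The helper name norm() and the "spelling;group" output layout are made up for illustration and are not part of the existing scripts.

```shell
#!/bin/sh
# Sketch: turn manufacturer_check.csv into a "spelling;group" lookup.
# Assumption: the first spelling on each line names the group.

workdir=$(mktemp -d)
cat > "$workdir/manufacturer_check.csv" <<'EOF'
manufacturer
HP,Hewlett Packard, HP INC.
Canon
Fujifilm
Epson
WD,Western Digital
EOF

# norm() is a hypothetical helper: trim outer blanks and lower-case a name.
lookup=$(awk '
function norm(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return tolower(s) }
FNR == 1 { next }                      # skip the "manufacturer" header
{
    n = split($0, A, ",")
    canon = norm(A[1])                 # first spelling names the group
    for (i = 1; i <= n; i++) print norm(A[i]) ";" canon
}
' "$workdir/manufacturer_check.csv")

printf '%s\n' "$lookup"
```

Two names then count as the same manufacturer when they map to the same group, so "HP" and "Hewlett Packard" would be reconciled while a name missing from the file would not.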

I hope this makes clear what needs to happen. If not, let me know and I will try to explain it better.

SDohmen

08-22-2016

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

If I remember correctly, EAN are unique except for a certain range set aside for any shop's internal coding, so what do you mean by "several barcodes where matched"? Please give us examples of data sets where the identification went wrong. And, your path forward is not too clear to me. Please explain in more detail. Why should the supplier's name help if barcodes are falsely read?

RudiC

08-22-2016

They should be indeed, but unfortunately different manufacturers can somehow use the same code. For example EAN 7636490074196, which is a Seagate 2TB SSD but also a LaCie external HDD.

The problem is not in the joining itself but in the fact that the same code is used multiple times.

I am not sure if the explanation will be sufficient, but I will try.

The script we use for joining on barcodes needs to be changed so that it checks the manufacturer from both files against a third file in which all the different spellings are listed. This third file exists purely so that manufacturers like HP, WD etc. are recognized instead of being rejected each time.

To describe it in steps:

1. Each line from the supplier file gets matched against the website file. This gives us the complete line.
2. Now the manufacturer from the website file gets checked against the manufacturer file to find out how that manufacturer can be written by different suppliers.
3. The line that matches from those two has to be checked against the manufacturer from the supplier.
4. If these are the same, it continues with the normal loop. If there is a difference, the record needs to be written to a new file which will be picked up for manual checking.
5. This manual file needs to contain the full line from both the website and the supplier so we can check whether it is a spelling error or an EAN error.

I hope this clarifies it a bit.

Last edited by rbatte1; 09-14-2016 at 05:53 AM.. Reason: Converted text numbered list to formatted number-list

SDohmen

08-22-2016

No, it doesn't. Please try again, using input data for demonstration.

RudiC

08-23-2016

Sorry for that but it is a very confusing bit indeed.

website_clean.csv

Code:

Barcode;Sku;Manufacturer
4260010852693,4000567150589;P00002801;Fujifilm
4960999575285;P00002800;Canon
4000461034213,4000461037818;P00002799;Dörr
0891257001526,8912570015266;P00002634;Gary Fong
0891257001106;P00002633;Gary Fong
0887111646026;P00002632;HP
0088698115763;P00002631;HP

supplier_clean.csv

Code:

ean;pps_reference;stock;price;mpn;manufacturer
4260010852693;1043154;84;743.42;70100118555;Fujifilm
4960999575285;273189;9400;141.80;2768B016;Canon
4000461034213,4960999575292;27433196;44;44.94;2768B017;Canon
8715946388540;2944686;1030;47.76;C13S042167;Epson
0088698115763,3141725001174;3654125;20;54.80;C1825A;Hewlett Packard

manufacturer_compare.csv

Code:

manufacturer
HP,Hewlett Packard, HP INC.
Canon
Fujifilm
Epson
WD,Western Digital

Above are 3 files, of which the first 2 are the important ones for the joining part.

When I run the script with ./join_prijslijst.sh website_clean.csv supplier_clean.csv output.csv, I get the following output:

Code:

ean;sku;pps_reference;mpn;stock;price;manufacturer;supplier_code
4260010852693;P00002801;1043154;84;743.42;70100118555;Fujifilm
4960999575285;P00002800;273189;9400;141.80;2768B016;Canon
4000461034213,4960999575292;P00002799;27433196;44;44.94;2768B017;Canon
0088698115763,3141725001174;P00002631;3654125;20;54.80;C1825A;Hewlett Packard

You would say that this is correct and fine, but the problem is the following line:

Code:

4000461034213,4000461037818;P00002799;Dörr

In the website file the manufacturer is Dörr but in the supplier file it is Canon, so the join is wrong and most likely a wrong price etc. also gets added to this product.

What we want is to have this split up into 2 files, of which the first is as follows:
New_output.csv

Code:

ean;sku;pps_reference;mpn;stock;price;manufacturer;supplier_code
4260010852693;P00002801;1043154;84;743.42;70100118555;Fujifilm
4960999575285;P00002800;273189;9400;141.80;2768B016;Canon
0088698115763,3141725001174;P00002631;3654125;20;54.80;C1825A;Hewlett Packard

and another file with the following contents:
wrong_match.csv

Code:

4000461034213,4000461037818;P00002799;Dörr;4000461034213,4960999575292;27433196;44;44.94;2768B017;Canon

Then we can check whether it is the same manufacturer or not. If, for example, the manufacturer were HP Inc instead of HP in our system, we can just add that to the manufacturer_compare.csv file so it gets recognized the next time.
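
One possible way to get there, shown only as a sketch and not as a finished solution: the existing join is kept, but the manufacturer from the website line is remembered per barcode and compared with the supplier's manufacturer before printing. It assumes that names compare case-insensitively after trimming, that the manufacturer is the last field of the supplier file, and that the files are passed in the order aliases, website, supplier. The helpers norm() and same() and the variables good and bad are made-up names.

```shell
#!/bin/sh
# Sketch only: join on barcode, then verify the manufacturer via the alias file.

workdir=$(mktemp -d)

cat > "$workdir/manufacturer_compare.csv" <<'EOF'
manufacturer
HP,Hewlett Packard, HP INC.
Canon
Fujifilm
Epson
WD,Western Digital
EOF

cat > "$workdir/website_clean.csv" <<'EOF'
Barcode;Sku;Manufacturer
4260010852693,4000567150589;P00002801;Fujifilm
4960999575285;P00002800;Canon
4000461034213,4000461037818;P00002799;Dörr
0891257001526,8912570015266;P00002634;Gary Fong
0891257001106;P00002633;Gary Fong
0887111646026;P00002632;HP
0088698115763;P00002631;HP
EOF

cat > "$workdir/supplier_clean.csv" <<'EOF'
ean;pps_reference;stock;price;mpn;manufacturer
4260010852693;1043154;84;743.42;70100118555;Fujifilm
4960999575285;273189;9400;141.80;2768B016;Canon
4000461034213,4960999575292;27433196;44;44.94;2768B017;Canon
8715946388540;2944686;1030;47.76;C13S042167;Epson
0088698115763,3141725001174;3654125;20;54.80;C1825A;Hewlett Packard
EOF

awk -v good="$workdir/New_output.csv" -v bad="$workdir/wrong_match.csv" '
# norm() and same() are hypothetical helpers, not part of the original script.
function norm(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return tolower(s) }
# Two names agree when they are equal or listed on the same alias line.
function same(a, b) {
    a = norm(a); b = norm(b)
    return a == b || ((a in G) && (b in G) && G[a] == G[b])
}
BEGIN { FS = OFS = ";" }
FNR == 1 {                             # header line of each input file
    nfile++
    # Header text is copied verbatim from the original script.
    if (nfile == 3) print "ean;sku;pps_reference;mpn;stock;price;manufacturer;supplier_code" > good
    next
}
nfile == 1 {                           # alias file: one group per line
    n = split($0, A, ",")
    for (i = 1; i <= n; i++) G[norm(A[i])] = FNR
    next
}
{ gsub(/ /, "", $1) }                  # as in the original: strip spaces
nfile == 2 {                           # website: remember SKU, maker, full line
    for (n = split($1, T, ","); n > 0; n--) {
        SKU[T[n]] = $2; MAKER[T[n]] = $3; LINE[T[n]] = $0
    }
    next
}
{                                      # supplier: join, then check the maker
    for (n = split($1, T, ","); n > 0; n--) if (T[n] in SKU) {
        b = T[n]
        if (same(MAKER[b], $NF)) {
            $2 = SKU[b] OFS $2
            print > good
        } else {
            print LINE[b], $0 > bad
        }
        next
    }
}
' "$workdir/manufacturer_compare.csv" "$workdir/website_clean.csv" "$workdir/supplier_clean.csv"

cat "$workdir/New_output.csv" "$workdir/wrong_match.csv"
```

With the three sample files this should produce the New_output.csv and wrong_match.csv shown above: the HP / Hewlett Packard record is accepted through the alias line, while the Dörr / Canon pair is written to wrong_match.csv with the website line first and the supplier line second.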

I hope this clears it up a bit.

SDohmen

08-23-2016

Let me paraphrase this: when creating the output file, compare the supplier field from website_clean.csv with the supplier from supplier_clean.csv. If identical, fine, print the record. If they can be reconciled via manufacturer_compare.csv, fine, print it, BUT: which supplier?
If they can't be reconciled, print to wrong_match.csv for later evaluation.

RudiC

08-23-2016

Use the Manufacturer field instead of the supplier field and you are spot on.

The supplier field is something we add after the joining etc. has been done. It can be ignored.

SDohmen
