awk assistance - Comparing 2 csv files

 
# 1  
Old 07-04-2018

Hello all,

I have searched high and low for a solution to this; many came really close, but none is quite what I'm after.

I have 2 files. One contains GUIDs, for example:

Code:
8121E002-96FE-4C9C-BC5A-6AFF20DACECD
84468F30-F3B7-418B-81F0-0908E80792BF

A second file contains a path to each GUID, for example:

Code:
"test-data","TEST/DATA/84468F30-F3B7-418B-81F0-0908E80792BF.pdf"
"test-data","TEST/DATA/8121E002-96FE-4C9C-BC5A-6AFF20DACECD.pdf"

I need a 3rd csv file created, like this:

Code:
84468F30-F3B7-418B-81F0-0908E80792BF, "TEST/DATA/84468F30-F3B7-418B-81F0-0908E80792BF.pdf"
8121E002-96FE-4C9C-BC5A-6AFF20DACECD, "TEST/DATA/8121E002-96FE-4C9C-BC5A-6AFF20DACECD.pdf"

The closest I have come is the following:

Code:
awk -F "[,]" 'NR==FNR{q=$1;$1="";A[q]=$0;;next} ($2 in A) {print FILENAME, $2}' test.csv *.csv

(Note: there are several csv files to search through, hence *.csv)

I believe the issue has something to do with the $2 in A, as what I need is not really an "in" but more of a "like" operator.
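
To illustrate the suspicion above: awk's in operator only tests whether a string is exactly a key of the array; it does not do substring or "like" matching. A minimal sketch, using one GUID and one path from the sample data (the variable names guid, path and result are just for illustration):

```shell
# "in" needs an exact key; index() finds a substring.
result=$(awk 'BEGIN {
    guid = "8121E002-96FE-4C9C-BC5A-6AFF20DACECD"
    path = "\"TEST/DATA/8121E002-96FE-4C9C-BC5A-6AFF20DACECD.pdf\""
    A[guid]        # merely referencing the element creates the key
    print ((path in A) ? "in: match" : "in: no match")
    print (index(path, guid) ? "index: match" : "index: no match")
}')
printf '%s\n' "$result"
```

So the whole quoted path is never a key of A, even though the GUID is contained in it.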

Any assistance would be a massive help, thank you!!

- Tirm


Moderator's Comments:
Please use CODE tags (for data as well) as required by forum rules!

Last edited by RudiC; 07-04-2018 at 02:52 PM.. Reason: Added CODE tags.
# 2  
Old 07-04-2018
Hi,
Maybe something like this:
Code:
$ cat /tmp/a.csv
8121E002-96FE-4C9C-BC5A-6AFF20DACECD
84468F30-F3B7-418B-81F0-0908E80792BF
$ cat /tmp/b.csv 
"test-data","TEST/DATA/84468F30-F3B7-418B-81F0-0908E80792BF.pdf"
"test-data","TEST/DATA/8121E002-96FE-4C9C-BC5A-6AFF20DACECD.pdf"
$ awk -F "," 'BEGIN{OFS=FS}NR==FNR{q=$1;$1="";A[q]=$0;;next} {q=$2;gsub(/.*\/|\..*/,"",$2)} ($2 in A) {print $2,q}' /tmp/a.csv /tmp/b.csv 
84468F30-F3B7-418B-81F0-0908E80792BF,"TEST/DATA/84468F30-F3B7-418B-81F0-0908E80792BF.pdf"
8121E002-96FE-4C9C-BC5A-6AFF20DACECD,"TEST/DATA/8121E002-96FE-4C9C-BC5A-6AFF20DACECD.pdf"

Regards.
# 3  
Old 07-04-2018
Please take the utmost care when specifying your request, especially constraints like "exactly (no more and no less than) the GUIDs in file1", and give people a chance to understand what test.csv is and what *.csv stands for.
Why not, for example, simply
Code:
awk -F, '{TMP = $2; gsub (/^.*\/|\..*$/, "", TMP); print TMP, $2}' OFS=, file2
84468F30-F3B7-418B-81F0-0908E80792BF,"TEST/DATA/84468F30-F3B7-418B-81F0-0908E80792BF.pdf"
8121E002-96FE-4C9C-BC5A-6AFF20DACECD,"TEST/DATA/8121E002-96FE-4C9C-BC5A-6AFF20DACECD.pdf"

I guess because you want to filter file2 by the values in file1. How about
Code:
awk -F "[,]" 'NR==FNR{A[$1]; next} { for (a in A) {if ($2 ~ a) print a, $2}}' OFS=, file[12]
84468F30-F3B7-418B-81F0-0908E80792BF,"TEST/DATA/84468F30-F3B7-418B-81F0-0908E80792BF.pdf"
8121E002-96FE-4C9C-BC5A-6AFF20DACECD,"TEST/DATA/8121E002-96FE-4C9C-BC5A-6AFF20DACECD.pdf"

# 4  
Old 07-04-2018
Thank you for the replies. Let me add some more clarity to the request, as suggested, as I'm not sure the awk suggestions are working.

First, here is the csv that contains the list of GUIDs:

Code:
$ cat test.csv
8121E002-96FE-4C9C-BC5A-6AFF20DACECD
84468F30-F3B7-418B-81F0-0908E80792BF
8121E002-96FE-4C9C-BC5A-6AFF20DACECD
1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944

Now, here is an example of the csv that contains the data I'm trying to retrieve:

Code:
"test","data/content/FN0/FN0/FN0/170535BB-A28D-42C4-92ED-767BB1469C8D%7BC0348F5A-0000-C624-9710-5ED2E8AA2B14%7D0"
"test","data/content/FN0/FN0/FN0/17ECDCFA-AF30-4C82-A156-99F941739352%7B1373E3F5-D5BE-475B-900F-B73ECB05C6AB%7D0"
"test","data/content/FN0/FN0/FN0/182FECB0-ADBF-4F27-9DD6-CE5508872AA5%7BC88C4971-F16A-4C28-9E14-07D0AD4E3C79%7D0"
"test","data/content/FN0/FN0/FN0/194D4F16-CD21-46EF-A584-8C378FBAD55F%7B439BC63F-C291-479D-BFEC-121BC86E3988%7D0"
"test","data/content/FN0/FN0/FN0/1AD46B75-8357-421D-A072-64872C6C763C%7BF299763B-D507-4303-A819-00BD0C60AAA5%7D0"
"test","data/content/FN0/FN0/FN0/1B810336-4EA7-49E3-8325-69487AC0CE95%7BD2F0EDEA-1486-451C-A09C-9AC39D582BBF%7D0"
"test","data/content/FN0/FN0/FN0/1BA93974-FFCC-4BE2-AFBE-11B92D579D4B%7B805AFF56-0000-CC14-9F56-C335802C5C15%7D0"
"test","data/content/FN0/FN0/FN0/1BBC8C9A-725B-428D-AC5A-9C0129C80F82%7BE515FB66-D51E-4057-97C8-94CDEC52F83A%7D0"
"test","data/content/FN0/FN0/FN0/1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944%7B496DDE3B-6102-4744-80FA-C30D64D91815%7D0"


As you can see, 1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944 is in both the first csv and the 2nd csv. In the last line of the 2nd csv, it is embedded in a double GUID. I want the output to be:

Code:
1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944, "data/content/FN0/FN0/FN0/1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944%7B496DDE3B-6102-4744-80FA-C30D64D91815%7D0"

I tried the following suggestion:

Code:
awk -F "," 'BEGIN{OFS=FS}NR==FNR{q=$1;$1="";A[q]=$0;;next} {q=$2;gsub(/.*\/|\..*/,"",$2)} ($2 in A) {print $2,q}' test.csv file2.csv

But it did not find the above GUID.
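
A likely reason, sketched on the matching line from the listing above: the gsub strips everything up to the last slash and everything from the first dot onward, which leaves a bare GUID only for names like GUID.pdf. Here the file name has no dot and carries a second, URL-encoded GUID plus the closing quote, so what remains is not an exact key in A (the variable names line and leftover are just for illustration):

```shell
# Show what the gsub from the suggestion leaves in $2 for the new data.
line='"test","data/content/FN0/FN0/FN0/1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944%7B496DDE3B-6102-4744-80FA-C30D64D91815%7D0"'
leftover=$(printf '%s\n' "$line" | awk -F, '{gsub(/.*\/|\..*/, "", $2); print $2}')
printf '%s\n' "$leftover"
```

The printed string still contains both GUIDs and the trailing double quote, so ($2 in A) is false for it.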

RudiC, for your suggestion, where do I pass in the first filename?

Hopefully that helps!

Regards

Tirm

Last edited by Don Cragun; 07-04-2018 at 10:15 PM.. Reason: Add missing CODE tags, again.
# 5  
Old 07-05-2018
Quote:
Originally Posted by tirmUK
I want the output to be:

Code:
1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944, "data/content/FN0/FN0/FN0/1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944%7B496DDE3B-6102-4744-80FA-C30D64D91815%7D0"

Which is, except for the space (wherever that comes from), exactly what my second suggestion would yield, provided the input files are given in the correct order. The shell will expand file[12] to file1 file2.


Quote:
RudiC, for your suggestion, where do I pass in the first filename?
Now, there are two ways to order two file names: file1 file2 or file2 file1. Which did you test? Only one will yield any (THE) output.
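
To make the ordering point concrete, here is a small sketch that runs the second suggestion both ways on the sample data from post #1. The temporary directory and the variable names (tmp, prog, right, wrong) are assumptions for the demo only. The NR==FNR block is true only while the first file named on the command line is being read, so that file must be the GUID list:

```shell
tmp=$(mktemp -d)
printf '%s\n' 8121E002-96FE-4C9C-BC5A-6AFF20DACECD \
              84468F30-F3B7-418B-81F0-0908E80792BF > "$tmp/file1"
printf '%s\n' '"test-data","TEST/DATA/84468F30-F3B7-418B-81F0-0908E80792BF.pdf"' \
              '"test-data","TEST/DATA/8121E002-96FE-4C9C-BC5A-6AFF20DACECD.pdf"' > "$tmp/file2"

# First file fills array A; every later line is tested against all keys.
prog='NR==FNR {A[$1]; next} {for (a in A) if ($2 ~ a) print a, $2}'

right=$(awk -F, -v OFS=, "$prog" "$tmp/file1" "$tmp/file2")   # GUID list first
wrong=$(awk -F, -v OFS=, "$prog" "$tmp/file2" "$tmp/file1")   # swapped order

printf 'file1 file2:\n%s\n' "$right"
printf 'file2 file1:\n%s\n' "$wrong"
rm -rf "$tmp"
```

With the GUID list first, both paths are printed; with the order swapped, A holds the first column of file2 and nothing is printed.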
# 6  
Old 07-05-2018
Hi RudiC,

That works great, thank you. I have run a few tests on smaller files, and it's fine.

I have 2 further queries that extend this. First, the lookup file contains 1.87 million rows, and the CSV file to check it against contains 400,000 rows (there are around 40 csv files to check).

Running it takes a very long time, so I'm wondering if that might change the command at all, or if it just needs to run overnight.

The second query is: can I use the awk command above, but search a directory instead of a csv? If we take the example above:

Code:
data/content/FN0/FN0/FN0/1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944%7B496DDE3B-6102-4744-80FA-C30D64D91815%7D0

Can I instead add a find command? So read the first line of the csv and get:

Code:
1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944

Then search

Code:
data/content/*

recursively, and find the file name that contains it?

Thank you so much for your help so far, very kind!

Regards

Tirm
# 7  
Old 07-05-2018
2 million lines is quite something, and you have to read them all into memory. Checking them against 40 * 0.4E6 (16 million) lines will take its time. I'm not sure whether the system will already be swapping with data volumes like those. Try cutting the 2 million into halves or quarters.
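
One possible way to cut the run time, offered as a sketch and not tested on data of that size: the loop from post #3 tries every GUID against every line, which is roughly 1.87 million x 16 million, or about 3 x 10^13 regex matches. If the GUIDs always have the canonical 8-4-4-4-12 hex shape and the same letter case in both files (an assumption based only on the samples shown), each line can instead be scanned for GUID-shaped substrings, and each one looked up with a single exact-key test. The tiny input files and the names tmp, re, rest and result exist only for this demo:

```shell
tmp=$(mktemp -d)
printf '%s\n' 8121E002-96FE-4C9C-BC5A-6AFF20DACECD \
              1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944 > "$tmp/test.csv"
printf '%s\n' \
  '"test","data/content/FN0/FN0/FN0/1BBC8C9A-725B-428D-AC5A-9C0129C80F82%7BE515FB66-D51E-4057-97C8-94CDEC52F83A%7D0"' \
  '"test","data/content/FN0/FN0/FN0/1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944%7B496DDE3B-6102-4744-80FA-C30D64D91815%7D0"' \
  > "$tmp/file2.csv"

result=$(awk -F, -v OFS=, '
BEGIN {
    # Build an 8-4-4-4-12 hex pattern without interval expressions,
    # which some older awks do not support.
    h = "[0-9A-Fa-f]"; h4 = h h h h
    re = h4 h4 "-" h4 "-" h4 "-" h4 "-" h4 h4 h4
}
NR == FNR { A[$1]; next }            # first file: remember the GUIDs
{
    rest = $2
    while (match(rest, re)) {        # walk over each GUID-shaped substring
        guid = substr(rest, RSTART, RLENGTH)
        if (guid in A) { print guid, $2; break }
        rest = substr(rest, RSTART + RLENGTH)
    }
}' "$tmp/test.csv" "$tmp/file2.csv")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The lookup list still has to fit in memory, but each input line now costs a handful of array lookups instead of a pass over the whole list. The remaining csv files can be appended after file2.csv on the same command line.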


Not sure if I understand your second query. Your usage of "csv" is not quite consistent or clear to me.
awk needs files to operate upon, not directories. If you open, read, and close 40 (or so) files for every single line read from the (lookup / test / csv / GUID) file, you'll thrash your file system. Not clever.
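
If the goal is to match the GUID list against file names on disk, one approach that avoids reopening anything per GUID is to let find list the tree once and pipe that list into awk as its second input ("-" stands for standard input). This is only a sketch: the demo directory tree, the names tmp and found, and the assumption that the GUID is the first 36 characters of the file name are taken from the single example path in post #6 and may not hold for the real data. File names containing newlines would also break it.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/data/content/FN0"
# Two empty demo files named like the sample paths.
: > "$tmp/data/content/FN0/1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944%7B496DDE3B-6102-4744-80FA-C30D64D91815%7D0"
: > "$tmp/data/content/FN0/17ECDCFA-AF30-4C82-A156-99F941739352%7B1373E3F5-D5BE-475B-900F-B73ECB05C6AB%7D0"
printf '%s\n' 1BCE1E40-D1BE-4DC1-8A0C-9EB236F56944 > "$tmp/test.csv"

found=$(cd "$tmp" && find data/content -type f | awk -v OFS=, '
NR == FNR { A[$1]; next }              # test.csv: remember the GUIDs
{
    n = split($0, part, "/")           # part[n] is the bare file name
    guid = substr(part[n], 1, 36)      # assumed: name starts with the GUID
    if (guid in A) print guid, "\"" $0 "\""
}' test.csv -)

printf '%s\n' "$found"
rm -rf "$tmp"
```

The directory tree is walked exactly once, and each path costs one array lookup.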

Last edited by RudiC; 07-18-2018 at 05:51 AM.. Reason: Corrected an unclear formulation.