Reconciling two CSV files using shell scripting


 
# 1  
Old 11-12-2019

I have two CSV files, file1 and file2, as below:

Code:
File 1:
Key, Value1, Value2, Value3, Value4,......value
A,50,100,50,40,....,100

File 2:
Key, value1,Value2,Value3,Value 4 so on...
A,50,80,45,50.....

Now I want to check whether each key from file 1 is present in file 2. If it is, I want to create a new file with the following headers and data:

Code:
Key, diff columns, f1_value1,f2_value1,value1_diff, f1_value2,f2_value2, value2_diff......
A,"Value2,Value3,Value4",50,50,0,100,80,20,50,45,5 and so on.

My files have more than 50k lines and around 60 columns.

Can someone suggest how to achieve this? I am new to shell.
# 2  
Old 11-12-2019
You can easily do this with PHP. For example, you can read the CSV files into arrays and compare the arrays:


Quick examples to get you started (untested, for example only):

Code:
<?php
$csvFile1 = file('../somefile.csv');
$csvarray1 = [];
 foreach ($csvFile1 as $line) {
     $csvarray1[] = str_getcsv($line);
 }

$csvFile2 = file('../someotherfile.csv');
$csvarray2 = [];
 foreach ($csvFile2 as $line) {
     $csvarray2[] = str_getcsv($line);
 }

Then you can use one or two of the myriad PHP array functions to check for the existence of keys, differences between arrays, etc. See, for example, these functions:

Code:
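<?php
// Names only; each function needs arguments (see the PHP manual).
array_diff();         // values of the first array that are missing from the others
array_keys();         // all keys of an array
array_diff_ukey();    // key-based difference, using a callback to compare keys
array_key_exists();   // whether a given key exists in an array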

Then, once you have the new array you want, you can simply convert that temporary array back to CSV, for example:

Code:
<?php
fputcsv ();

In a nutshell, it is easy to process CSV files with PHP, whether from a script, directly from the command line, or interactively from the command line: convert the CSV files to arrays, do the array operations, and convert back to a CSV file.

So personally I would do this in PHP and not use shell scripts because PHP is built to do this kind of processing easily.

OBTW, these days I tend to quickly prototype and test my PHP ideas interactively in the shell as follows:


Code:
php -a

Then in the shell in interactive mode, I test and debug logic quickly and easily.

This is how I process CSV files. You can also easily do this same type of CSV processing in Python, BTW.

Others may have more "shell script-like" approaches for you which do not use PHP or Python; I am only describing how I approach these types of issues in CSV, JSON or other standard file formats. Since most of my work touches the Internet somehow (web servers), and those servers are mostly PHP based, I like to stick to code I can reuse and debug together, so that is why I tend to use PHP over Python. Actually, if my apps were not mostly PHP based, I would use Python more.
# 3  
Old 11-13-2019
How about
Code:
awk -F, '
FNR == 1        {FCNT++
                }
FNR > 1         {KEYS[$1]
                 for (i=2; i<=NF; i++) W[$1,FCNT,i] = $i
                }
END             {printf "Key, diff columns"
                 for (i=1; i<NF; i++) printf ",f1_value%d,f2_value%d,value%d_diff", i, i, i
                 printf RS
                 for (k in KEYS)        {for (i=NF; i>1; i--)   {N1 = W[k,1,i]
                                                                 N2 = W[k,2,i]
                                                                 D1 = N2 - N1
                                                                 OUT = sprintf (",%s,%s,%s", N1, N2, D1) OUT
                                                                 if (D1) COLS = ",Value" i-1 COLS
                                                                }
                                         print k, "\"" substr (COLS,2) "\"", substr (OUT, 2)
                                         OUT = COLS = ""
                                        }
                }
' OFS=, SUBSEP=, file[12]
Key, diff columns,f1_value1,f2_value1,value1_diff,f1_value2,f2_value2,value2_diff,f1_value3,f2_value3,value3_diff,f1_value4,f2_value4,value4_diff
A,"Value2,Value3,Value4",50,50,0,100,80,-20,50,45,-5,40,50.....,10


Last edited by RudiC; 11-13-2019 at 05:20 AM..
# 4  
Old 11-14-2019
Thanks RudiC,

Could you please explain the code? I could not understand it completely.
# 5  
Old 11-14-2019
Code:
awk -F, '
FNR == 1        {FCNT++                                                                                         # inc file counter with every new file
                }
FNR > 1         {KEYS[$1]                                                                                       # keep $1 in an array; overwrite duplicates
                 for (i=2; i<=NF; i++) W[$1,FCNT,i] = $i                                                        # keep fields in array indexed by key, file No., field No.
                }
END             {printf "Key, diff columns"                                                                     # start printing header
                 for (i=1; i<NF; i++) printf ",f1_value%d,f2_value%d,value%d_diff", i, i, i                     # complete header line for all fields
                 printf RS
                 for (k in KEYS)        {for (i=NF; i>1; i--)   {N1 = W[k,1,i]                                  # for all keys, for all fields, get values
                                                                 N2 = W[k,2,i]                                  # for both files,
                                                                 D1 = N2 - N1                                   # and calc difference
                                                                 OUT = sprintf (",%s,%s,%s", N1, N2, D1) OUT    # collect all those in temp var OUT
                                                                 if (D1) COLS = ",Value" i-1 COLS               # if diff exist, collect fields in temp var COLS
                                                                }
                                         print k, "\"" substr (COLS,2) "\"", substr (OUT, 2)                    # print all those, cutting off leading comma
                                         OUT = COLS = ""                                                        # reset temp vars
                                        }
                }
' OFS=, file[12]                                                                                               # OFS=, is needed: print joins its arguments with OFS; SUBSEP= was a relic from development
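
To see how the array W in the script above is indexed, here is a minimal sketch. It assumes two tiny sample files created in a temporary directory (the file contents and the shell variables tmp and result are made up for this illustration). It stores every field under key, file number and field number, then looks up one field in both files:

```shell
# Hypothetical sample data, created only for this illustration.
tmp=$(mktemp -d)
printf 'Key,Value1,Value2\nA,50,100\n' > "$tmp/file1"
printf 'Key,Value1,Value2\nA,50,80\n'  > "$tmp/file2"

result=$(awk -F, '
FNR == 1 { FCNT++ }                                          # FNR restarts at 1 for every input file
FNR > 1  { for (i = 2; i <= NF; i++) W[$1, FCNT, i] = $i }   # index: key, file No., field No.
END      { # field 3 (Value2) of key A in file 1, in file 2, and file 1 minus file 2
           print W["A", 1, 3], W["A", 2, 3], W["A", 1, 3] - W["A", 2, 3]
         }
' OFS=, "$tmp/file1" "$tmp/file2")

echo "$result"
rm -rf "$tmp"
```

The difference here is taken as file 1 minus file 2, which matches the sign in the sample output of the original question; the script above computes N2 - N1, so its differences have the opposite sign.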

# 6  
Old 11-14-2019
Try:

Code:
awk '
  NR==FNR {                                         # When reading the first file (then NR is equal to FNR)
    A[$1]=$0                                        # Store the first file in array A with key $1
    next
  } 

  FNR==1 {                                          # On the first line of the second file
    split($0,Header)                                # Split the header labels in array "Header"
    $1=$1 OFS "diff columns"                        # Create the first 2 field headers
    for(i=2; i<=NF; i++)
      $i=sprintf("f1_%s,f2_%s,%s_diff",$i, $i, $i)  # Create the rest of the field headers
    print                                           # Print the field headers
  } 

  FNR>1 {                                           # Processing the content of file 2
    diffs=""                                        # Set the differences to ""
    if($1 in A) {                                   # if the key in $1 of file2 also occurs in file1
      split(A[$1], F)                               # Split the corresponding line of file 1 into Fields in array F
      for(i=2; i<=NF; i++) {                        # For field 2 until the last field
        if($i!=F[i])                                # if there is a value difference for that field
          diffs=diffs (diffs?OFS:"") Header[i]      # Add the corresponding header label to the differences
        $i=F[i] OFS $i OFS (F[i]-$i)                # Prepend the value of file1 and append the difference file1 val - file2 val
      } 
      $1=$1 OFS "\"" diffs "\""                     # When all differences found, append them to field 1
      print                                         # print the result
    }
  }
' FS=', *' OFS=, file1 file2                        # set FS to a comma with spaces, set OFS to a comma and read file 1 and file2

Code:
Key,diff columns,f1_value1,f2_value1,value1_diff,f1_Value2,f2_Value2,Value2_diff,f1_Value3,f2_Value3,Value3_diff,f1_Value4,f2_Value4,Value4_diff,...
A,"Value2,Value3,Value4",50,50,0,100,80,20,50,45,5,40,50,-10,...
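
The NR==FNR idiom used above can also be tried on its own to answer just the first part of the question (is a key from file 1 present in file 2?). The following is only a sketch under assumptions: the sample files, the array name seen and the wording of the report are invented for the illustration.

```shell
# Hypothetical sample data, created only for this illustration.
tmp=$(mktemp -d)
printf 'Key, Value1\nA, 50\nB, 60\n' > "$tmp/file1"
printf 'Key, Value1\nA, 50\nC, 70\n' > "$tmp/file2"

report=$(awk '
  NR == FNR { if (FNR > 1) seen[$1]; next }   # first file: remember every key, skip the header
  FNR > 1   { print $1, (($1 in seen) ? "in both files" : "only in file2") }
' FS=', *' OFS=': ' "$tmp/file1" "$tmp/file2")

echo "$report"
rm -rf "$tmp"
```

With this sample data the report lists A as present in both files and C as present only in file2. Key B, which exists only in file1, is not reported, because only the lines of the second file are printed.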


Last edited by Scrutinizer; 11-14-2019 at 06:38 PM..
# 7  
Old 11-15-2019
I tried implementing this, but I am getting an error saying it cannot read the input file.
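
The thread does not show the exact command or error text, so the cause is not known. One possibility is that the names given to awk (file1 and file2 in the examples above) do not match the real file names or the current directory. A small sketch for checking that before running awk; check_inputs is a hypothetical helper name, not a standard command:

```shell
# check_inputs is a hypothetical helper: it reports every argument
# that is not a readable file and returns non-zero if any is missing.
check_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "cannot read: $f (current directory: $PWD)" >&2
            status=1
        fi
    done
    return "$status"
}

# Replace file1 and file2 with the real paths of your CSV files.
if check_inputs file1 file2; then
    echo "inputs look readable; run the awk command with these same names"
fi
```

If a name is reported as unreadable, pass the full path of the CSV file to awk, or change to the directory that contains the files first.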