Reconciling two CSV files using shell scripting

11-12-2019

I have two CSV files, file1 and file2, as shown below:

Code:

File 1:
Key, Value1, Value2, Value3, Value4,......value
A,50,100,50,40,....,100

File 2:
Key, value1,Value2,Value3,Value 4 so on...
A,50,80,45,50.....

Now I want to check whether each key from file 1 is present in file 2. If it is, I want to create a new file with the following headers and data:

Code:

Key, diff columns, f1_value1,f2_value1,value1_diff, f1_value2,f2_value2, value2_diff......
A,"Value2,Value3,Value4",50,50,0,100,80,20,50,45,5 and so on.

My files have more than 50k lines and around 60 columns.

Can someone suggest how to achieve this? I am new to shell scripting.

hustler

11-12-2019

You can easily do this with PHP. For example, you can read the CSV files into arrays and compare the arrays:

Quick examples to get you started (untested, for example only):

Code:

<?php
$csvFile1 = file('../somefile.csv');
$csvarray1 = [];
foreach ($csvFile1 as $line) {
    $csvarray1[] = str_getcsv($line);
}

$csvFile2 = file('../someotherfile.csv');
$csvarray2 = [];
foreach ($csvFile2 as $line) {
    $csvarray2[] = str_getcsv($line);
}

Then you can use one or two of PHP's many array functions to check for the existence of keys, differences between arrays, and so on. See, for example, these functions:

Code:

<?php
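// Function names only; each one needs arguments in real code.
array_diff();         // values of the first array that are missing from the others
array_keys();         // all keys of an array
array_diff_ukey();    // key difference, compared with a user-supplied callback
array_key_exists();   // whether a given key exists in an array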

Then, once your new array looks the way you want, you can simply convert it back to CSV, for example:

Code:

<?php
fputcsv ();

In a nutshell, it is easy to process CSV files with PHP, whether from a script, directly from the command line, or interactively from the command line: convert the CSV files to arrays, do the array operations, and convert the result back to a CSV file.

So personally I would do this in PHP and not use shell scripts because PHP is built to do this kind of processing easily.

OBTW, these days I tend to quickly prototype and test my PHP ideas interactively in the shell as follows:

Code:

php -a

Then in the shell in interactive mode, I test and debug logic quickly and easily.

This is how I process CSV files. You can also easily do this same type of CSV processing in Python, BTW.

Others may have more "shell script-like" approaches for you that do not use PHP or Python; I am only describing how I approach these types of issues with CSV, JSON or other standard file formats. Since most of my work touches the Internet somehow (web servers), and those servers are mostly PHP based, I like to stick to code I can reuse and debug together, which is why I tend to use PHP over Python. Actually, if my apps were not mostly PHP based, I would use Python more.

Neo

11-13-2019

How about

Code:

awk -F, '
FNR == 1        {FCNT++
                }
FNR > 1         {KEYS[$1]
                 for (i=2; i<=NF; i++) W[$1,FCNT,i] = $i
                }
END             {printf "Key, diff columns"
                 for (i=1; i<NF; i++) printf ",f1_value%d,f2_value%d,value%d_diff", i, i, i
                 printf RS
                 for (k in KEYS)        {for (i=NF; i>1; i--)   {N1 = W[k,1,i]
                                                                 N2 = W[k,2,i]
                                                                 D1 = N2 - N1
                                                                 OUT = sprintf (",%s,%s,%s", N1, N2, D1) OUT
                                                                 if (D1) COLS = ",Value" i-1 COLS
                                                                }
                                         print k, "\"" substr (COLS,2) "\"", substr (OUT, 2)
                                         OUT = COLS = ""
                                        }
                }
' OFS=, SUBSEP=, file[12]
Key, diff columns,f1_value1,f2_value1,value1_diff,f1_value2,f2_value2,value2_diff,f1_value3,f2_value3,value3_diff,f1_value4,f2_value4,value4_diff
A,"Value2,Value3,Value4",50,50,0,100,80,-20,50,45,-5,40,50.....,10

Last edited by RudiC; 11-13-2019 at 05:20 AM..

RudiC

11-14-2019

Thanks RudiC,

Could you please explain the code? I could not understand it completely.

hustler

11-14-2019

Code:

awk -F, '
FNR == 1        {FCNT++                                                                                         # inc file counter with every new file
                }
FNR > 1         {KEYS[$1]                                                                                       # keep $1 in an array; overwrite duplicates
                 for (i=2; i<=NF; i++) W[$1,FCNT,i] = $i                                                        # keep fields in array indexed by key, file No., field No.
                }
END             {printf "Key, diff columns"                                                                     # start printing header
                 for (i=1; i<NF; i++) printf ",f1_value%d,f2_value%d,value%d_diff", i, i, i                     # complete header line for all fields
                 printf RS
                 for (k in KEYS)        {for (i=NF; i>1; i--)   {N1 = W[k,1,i]                                  # for all keys, for all fields, get values
                                                                 N2 = W[k,2,i]                                  # for both files,
                                                                 D1 = N2 - N1                                   # and calc difference
                                                                 OUT = sprintf (",%s,%s,%s", N1, N2, D1) OUT    # collect all those in temp var OUT
                                                                 if (D1) COLS = ",Value" i-1 COLS               # if diff exist, collect fields in temp var COLS
                                                                }
                                         print k, "\"" substr (COLS,2) "\"", substr (OUT, 2)                    # print all those, cutting off leading comma
                                         OUT = COLS = ""                                                        # reset temp vars
                                        }
                }
' OFS=, file[12]                                                                                                # OFS=, is needed for the comma-separated print; SUBSEP=, was a relic from development
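
For readers who find the `FNR == 1 {FCNT++}` line and the three-part array index hard to follow, here is a minimal sketch of just that idiom. The function name `show_file_counter` and the sample data are made up for illustration, it looks only at the first value column, and it is not a replacement for the script above.

```shell
#!/bin/sh
# Hypothetical helper: show how a per-file counter and a three-part array
# index keep the values of two files apart. $1 and $2 are CSV file paths.
show_file_counter() {
    awk -F, '
    FNR == 1 {FCNT++; next}                              # first line of each file: count the file, skip the header
             {KEYS[$1]                                   # remember every key seen in either file
              for (i = 2; i <= NF; i++) W[$1, FCNT, i] = $i
             }
    END      {for (k in KEYS) {
                  print k ": file1=" W[k, 1, 2] " file2=" W[k, 2, 2] " diff=" (W[k, 2, 2] - W[k, 1, 2])
              }
             }
    ' "$1" "$2"
}

# Made-up sample data in a temporary directory
tmpdir=$(mktemp -d)
printf 'Key,Value1\nA,50\n' > "$tmpdir/file1"
printf 'Key,Value1\nA,55\n' > "$tmpdir/file2"
show_file_counter "$tmpdir/file1" "$tmpdir/file2"   # prints: A: file1=50 file2=55 diff=5
rm -r "$tmpdir"
```

Because `FNR` restarts at 1 for every input file, `FCNT` ends up as 1 while the first file is read and 2 while the second is read, so the same key and field number land in different array slots.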

RudiC

11-14-2019

Try:

Code:

awk '
  NR==FNR {                                         # When reading the first file (then NR is equal to FNR)
    A[$1]=$0                                        # Store the first file in array A with key $1
    next
  } 

  FNR==1 {                                          # On the first line of the second file
    split($0,Header)                                # Split the header labels in array "Header"
    $1=$1 OFS "diff columns"                        # Create the first 2 field headers
    for(i=2; i<=NF; i++)
      $i=sprintf("f1_%s,f2_%s,%s_diff",$i, $i, $i)  # Create the rest of the field headers
    print                                           # Print the field headers
  } 

  FNR>1 {                                           # Processing the content of file 2
    diffs=""                                        # Set the differences to ""
    if($1 in A) {                                   # if the key in $1 of file2 also occurs in file1
      split(A[$1], F)                               # Split the corresponding line of file 1 into Fields in array F
      for(i=2; i<=NF; i++) {                        # For field 2 until the last field
        if($i!=F[i])                                # if there is a value difference for that field
          diffs=diffs (diffs?OFS:"") Header[i]      # Add the corresponding header label to the differences
        $i=F[i] OFS $i OFS (F[i]-$i)                # Prepend the value of file1 and append the difference (file1 value - file2 value)
      } 
      $1=$1 OFS "\"" diffs "\""                     # When all differences found, append them to field 1
      print                                         # print the result
    }
  }
' FS=', *' OFS=, file1 file2                        # set FS to a comma with spaces, set OFS to a comma and read file 1 and file2

Code:

Key,diff columns,f1_value1,f2_value1,value1_diff,f1_Value2,f2_Value2,Value2_diff,f1_Value3,f2_Value3,Value3_diff,f1_Value4,f2_Value4,Value4_diff,...
A,"Value2,Value3,Value4",50,50,0,100,80,20,50,45,5,40,50,-10,...
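
The `FS=', *'` setting matters because the sample headers have a space after some of the commas. As a small illustration (the helper name `second_field` and the hard-coded header line are assumptions for this demo only), compare splitting on a plain comma with splitting on a comma followed by optional spaces:

```shell
#!/bin/sh
# Hypothetical helper: print field 2 of a sample header line in brackets,
# splitting on the field separator passed as $1.
second_field() {
    printf 'Key, Value1, Value2\n' | awk -F"$1" '{print "[" $2 "]"}'
}

second_field ','      # prints: [ Value1]   (the leading space stays in the field)
second_field ', *'    # prints: [Value1]    (comma and any following spaces are consumed)
```

A field separator longer than one character is treated by awk as a regular expression, which is why `', *'` matches a comma plus zero or more spaces.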

Last edited by Scrutinizer; 11-14-2019 at 06:38 PM..

Scrutinizer

11-15-2019

I tried implementing this, but I am getting an error saying that the input file cannot be read.

hustler
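
The exact error text is not shown, but awk typically reports a message like this when a file name on its command line does not match a readable file in the current directory. Note that `file1`, `file2` and `file[12]` in the answers above are literally the names used in the question; real files with other names (for example with a `.csv` extension) have to be substituted. A minimal sketch for checking the inputs first, with `check_inputs`, `file1.csv` and `file2.csv` as placeholder names:

```shell
#!/bin/sh
# Hypothetical helper: confirm that every input file exists and is readable
# before handing the names to awk. Returns non-zero on the first problem.
check_inputs() {
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f (current directory: $(pwd))" >&2
            return 1
        elif [ ! -r "$f" ]; then
            echo "not readable: $f" >&2
            return 1
        fi
    done
}

# file1.csv and file2.csv are placeholder names; substitute the real paths.
if check_inputs file1.csv file2.csv; then
    awk 'END {print NR " lines read in total"}' file1.csv file2.csv
fi
```

If the check reports a missing file, running `ls` in the same directory usually shows whether the name or the working directory is the problem.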
