Problem in comparing 2 files string by string

08-07-2011

Registered User

8, 0

Join Date: Jun 2011

Last Activity: 20 August 2011, 1:24 PM EDT

Location: Pune, India

Posts: 8

Thanks Given: 2

Thanked 0 Times in 0 Posts

Problem in comparing 2 files string by string

Hi Champs,
I am a newbie to unix world, and I am trying to built a script which seems to be far tough to be done alone by me.....

" I am having a raw csv file which contains around 50 fields..."
From that file I have to grep 2 fields "A" and "B"....field "A" is to be aligned vertically while field "B" is to be aligned horizontally...

Both the fields may have around 10-20 different numbers...so that I would be getting two arrays of numbers...vertical for field A and horizontal for field B....
#########CODE#########

Code:

#!/bin/bash
cat raw_file.csv | awk -F "," '{ print $A}' | sort | uniq -c | sort -k 2g 

cat raw_file.csv | awk -F "," '{ print $A}' | sort | uniq -c | sort -k 2g

#############
This code is giving me the unique counts for both fields in sorted form....
now I have made 2 files 1.txt and 2.txt having sorted array of fields A and B....
##############CODE###########

Code:

cat raw_file.csv | awk -F "," '{ print $A}' | sort | uniq -c | sort -k 2g | awk '{print $2}' > 1.txt

cat raw_file.csv | awk -F "," '{ print $B}' | sort | uniq -c | sort -k 2g | awk '{printf "%5s", $2}' > 2.txt

paste 1.txt 2.txt > 3.txt

##################
here I got the sorted aligned 2 arrays in file 3.txt.....
upto here is fine..

I need the individual count for each string of hoz array wrt each string of vertical array......
I want the output like....
##########

Code:

          B1 B2 B3 B4
A1     
A2   
A3
A4

###############
(vertical array is also having some blank values....)

Please help.....!!!!

jitendra.pat04

View Public Profile for jitendra.pat04

Find all posts by jitendra.pat04

08-07-2011

Registered User

628, 174

Join Date: Oct 2010

Last Activity: 2 December 2017, 5:58 AM EST

Location: Madrid, Spain

Posts: 628

Thanks Given: 8

Thanked 174 Times in 171 Posts

Hi,

An example of your input file would be useful to try to get a solution. At least for me.

Regards,
Birei

birei

View Public Profile for birei

Find all posts by birei

08-07-2011

Registered User

1,466, 512

Join Date: Jul 2010

Last Activity: 7 April 2014, 3:02 PM EDT

Location: earth>US>UTC-5

Posts: 1,466

Thanks Given: 110

Thanked 512 Times in 491 Posts

If I am reading your post correctly, what you have is a file like this:

Code:

a,b,c,d,e,f,g,h
g,b,c,d,f,a,i,j
c,e,d,f,h,a,b,a
h,b,j,a,b,c,d,f
a,b,c,e,a,b,c

And for any two colums (2 and 3 lets say) you want to know for each value in column two, the number of times a value in column three appeared with that value. For instance, the value 'b' appears in column 2 with the value 'c' three times and the value j once. So for the above matrix, the desired output would be:

Code:

     COLA/COLB    c   d   j
      b           3   0   1
      e           0   1   0

which shows that column 2 (your A column) had two values (b and e) and column 3 (your B column) had three values (c, d, and j) with the listed combinations of counts.

If this makes sense, and is what you desire, then I think all you need is a simple awk to process your csv file directly:

Code:

#!/usr/bin/env ksh

awk -F "," -v c1=${1:-1} -v c2=${2:-2} '
    {
        if( !seen_1[$(c1)]++ )
            order_1[o1idx++] = $(c1);

        if( !seen_2[$(c2)]++ )
            order_2[o2idx++] = $(c2);

        count[$(c1),$(c2)]++;
    }

    END {
        printf( "%15s ", "COLA/COLB" );
        for( j=0; j < o2idx; j++ )              # header
            printf( "%15s ", order_2[j] );
        printf( "\n" );

        for( i = 0; i < o1idx; i++ )            # print matrix
        {
            printf( "%15s ", order_1[i] );
            for( j=0; j < o2idx; j++ )
                printf( "%15d ", count[order_1[i],order_2[j]] );

            printf( "\n" );
        }
    }
'

exit

The script takes colA and colB from the command line and reads the csv file from stdin. If you put this into foo.ksh, the command line to look at columns 5 and 10 would be:

Code:

foo.ksh 5 10 <file.csv

If no colums are given on the command line, the defaults are columns 1 and 2.

If this isn't what you had in mind, please clarify and add some sample input and desired output based on the sample.

agama

View Public Profile for agama

Find all posts by agama

08-09-2011

Registered User

8, 0

Join Date: Jun 2011

Last Activity: 20 August 2011, 1:24 PM EDT

Location: Pune, India

Posts: 8

Thanks Given: 2

Thanked 0 Times in 0 Posts

Thanx a lot Agama

Thanks very much friend,
Basically my problem is solved upto 80%....but here are some more modifications that are needed in the o/p....
1) my csv file is very big and having around 50 fields out of which for field 39 there are 4 values say 1,2,3,4....
for each value of 39th field I have to manipulate the file like such that field 21 comes in vertical and field 26 comes in horizontal.....this 26th field is having many entries upto 15-20......for example....
cat jitendra.csv | awk -F "|" '{if ($39==3) print $0}'.....now for this manipulated file I have to work accordingly for field 21 & 26.......

2) the o/p for 21 and 26 field is not in sorted manner,,, actually I need it in sorted manner......

3) some of the values in field 21(vertical) are blank......so for that blank values I would like to print a "blank"....but no entry in hoz array is blank....

4) also the o/p is not aligned as pasted below.....

####################
COLA/COLB 9008 8739 8822
0 3281 874 389
255 123 18 11
1 3 2 0
133 6 0 0
254 120 6 3
19 240 7 17
137 110 13 3
34 22 0 0
17 4 0 0
###################

..................................................
Actually friend I had to do a lot of modifications more n this file......aside to the above queries.....

can u just help me in understanding that code....because I am a newbie and half of the code is not understandable by me.....
so that I could make some modifications by myself.....

............Any help wud be highly appreciated....thanx in advance.....

Last edited by jitendra.pat04; 08-09-2011 at 04:44 PM..

jitendra.pat04

View Public Profile for jitendra.pat04

Find all posts by jitendra.pat04

08-09-2011

Registered User

1,466, 512

Join Date: Jul 2010

Last Activity: 7 April 2014, 3:02 PM EDT

Location: earth>US>UTC-5

Posts: 1,466

Thanks Given: 110

Thanked 512 Times in 491 Posts

Here's the code with comments in hopes that you can figure out what is going on. I added a sort (some awk versions don't have a built-in sort function, so I wrote one that shouldn't be too inefficient based on the number of rows/columns you've indicated.

Code:

#!/usr/bin/env ksh

awk -F "," -v c1=${1:-1} -v c2=${2:-2} '
    function sort_order(  what, l,       i, j )
    {
        for( i = 0; i < l; i++ )        # big pass across array
        {
            big = 0;                      # set first entry as the largest value
            for( j = 1; j < l-i; j++ )       # small pass (up to length - i)
                if( what[j] > what[big] )  # if this value larger, mark it 
                    big = j;
            if( big != j-1 )              # if largest is not the last checked
            {
                hold = what[j-1];      # swap the last for the largest
                what[j-1] = what[big];
                what[big] = hold;
            }
        }

    }

    {
        if( $(c2) == "" )                   # if value in c2 is missing, we dont count
            next;

        if( $(c1) == "" )
            $(c1) = "BLANK";                # easy eycatcher for empty field

        if( !seen_1[$(c1)]++ )
            order_1[o1idx++] = $(c1);       # order each c1 was observed

        if( !seen_2[$(c2)]++ )
            order_2[o2idx++] = $(c2);       # order each c2 was observed

        count[$(c1),$(c2)]++;               # count the number of times the pair (c1,c2) appear together
    }

    END {
        sort_order( order_1, length( order_1 ) );       # sort values from c1, and c2
        sort_order( order_2, length( order_2 ) );

        printf( "%15s ", "COLA/COLB" );         # print the header line using the order_2 list
        for( j=0; j < o2idx; j++ )
            printf( "%15s ", order_2[j] );      # %15s will align columns based on a width of 15
        printf( "\n" );                         # new line for first row of matrix

        for( i = 0; i < o1idx; i++ )            # print matrix -- for each row (c1 value)
        {
            printf( "%15s ", order_1[i] );      # print the c1 value (again width of 15)
            for( j=0; j < o2idx; j++ )          # for each column (c2 values)
                printf( "%15d ", count[order_1[i],order_2[j]] );    # print each (width 15 again)

            printf( "\n" );                     # end the row by printing a newline
        }
    }
'

exit

As for it not lining up correctly, that seems odd. Each field is printed with a width of 15 characters (%15d or %15s) and provided that none of the data is more than 15 characters wide it should space out nicely. You might need to reduce the setting (from 15 to 8 or something like that) in order to suit your needs. The programme also prints the word "BLANK" for empty data (in column A) and does not count anything if column B is missing. I assumed that missing columns were adjacent comas (,,).

For testing I ran this small set of data:

Code:

6000,1000,100,5000,3000,100,1000,1000,2000,4000
2000,7000,4000,,8000,9000,3000,100,1000,2000
4000,6000,6000,9000,100,7000,8000,7000,9000,8000
4000,4000,5000,,8000,1000,6000,6000,5000,9000
9000,5000,8000,100,6000,4000,8000,8000,9000,7000
7000,1000,100,7000,8000,4000,6000,2000,5000,6000
100,5000,6000,3000,5000,5000,100,4000,4000,6000
7000,9000,4000,6000,4000,8000,8000,6000,6000,9000
9000,1000,4000,,5000,1000,1000,2000,9000,8000
7000,1000,5000,1,5000,3000,100,3000,4000,1000

to ensure sorting and blanks worked. With the command test_script 4 5 <test_data the following was the output:

Code:

      COLA/COLB             100            3000            4000            5000            6000            8000 
              1               0               0               0               1               0               0 
            100               0               0               0               0               1               0 
           3000               0               0               0               1               0               0 
           5000               0               1               0               0               0               0 
           6000               0               0               1               0               0               0 
           7000               0               0               0               0               0               1 
           9000               1               0               0               0               0               0 
          BLANK               0               0               0               1               0               2

The columns are 15 characters wide, and the BLANK value will sort out last if the data in the input is all numeric. You can replace "BLANK" with " " if you just want a space in the output. Unfortunately, I cannot say what might be happening if your columns are misaligned as it works for me in my test environment.

Hope this has been of some value.

This User Gave Thanks to agama For This Post:

agama

View Public Profile for agama

Find all posts by agama

08-10-2011

Registered User

8, 0

Join Date: Jun 2011

Last Activity: 20 August 2011, 1:24 PM EDT

Location: Pune, India

Posts: 8

Thanks Given: 2

Thanked 0 Times in 0 Posts

Thanx a lot again Agama

Now it seems more accurate,,
a few issues are left again.....

I have attached for reference...my file..............

1) I have to cat my file and put filter for $49....as this field had 4 values.....1,2,3,4..........and then I have to process the above entire code for each value respectively.....
so can I use this code like......

cat scripts_test.txt | awk -F "," '{if ($49==1) print $0}' | CODE........????

2) now the field that I want in vertical is $31....and in hoz is $36................also for $31==255....we have to check wheather $13==0 or $13!=0....
if $13!=0.....then count of value for which $31==255....should be added to the count of value for $31==0......for each entry of $36....
else the value shuld be the same for $31==255..............

3) As I have to automate this script in crontab....so I cannot print in command line "test.sh 31 36 <script.test.txt"

can I directly insert my fields as ......
awk -F "," -v c1=${31:-31} -v c2=${36:-36} '....and run the script as
./test_sh.........which will read the data from the file script_test.txt......????

Please suggest.....!!!!!!!!!!!!

Lots of thanx in advance friend....!!!!

Last edited by jitendra.pat04; 08-20-2011 at 05:50 AM..

jitendra.pat04

View Public Profile for jitendra.pat04

Find all posts by jitendra.pat04

08-10-2011

Registered User

1,466, 512

Join Date: Jul 2010

Last Activity: 7 April 2014, 3:02 PM EDT

Location: earth>US>UTC-5

Posts: 1,466

Thanks Given: 110

Thanked 512 Times in 491 Posts

Glad it's working! To answer your questions first:

Quote:

can I directly insert my fields as ......
awk -F "," -v c1=${31:-31} -v c2=${36:-36} '....and run the script as
./test_sh.........which will read the data from the file script_test.txt......????

Yes, but you will need to redirect your file in like this:

Code:

./test_sh <script_test.txt

Quote:

1) I have to cat my file and put filter for $49....as this field had 4 values.....1,2,3,4..........and then I have to process the above entire code for each value respectively.....
so can I use this code like......

cat scripts_test.txt | awk -F "," '{if ($49==1) print $0}' | CODE........????

Yes, but you don't need cat. Awk can read the file directly so something like this:

Code:

awk -F "," '{if ($49==1) print $0}' scripts_test.txt| CODE........????

However, you can have the programme that generates the matrix make a single pass across your input data and generate all 4 (or however many different values $49 has) matrices. Given that, the script below will print one matrix for each unique value in $49. Thus, there is no need to run the preprocess script that prints only records where $49 == 1 etc.

It will take from the command line, in this order, the column to print vertically, the column to print horizontally, the column which identifies the matrix. If no parms are given, then the defaults are 31 (vert) 36 (horiz) and 49 (id column).

Finally, in the special case where c1 has the value of 255, and $13 is non-zero, the count will be updated as though the value of c1 was zero.

Code:

#!/usr/bin/env ksh

# $1 == c1 == vertical
# $2 == c2 == horizontal
# $3 == matrix id col (midc)
#
awk -F "," -v ct=$ct -v c1=${1:-31} -v c2=${2:-36} -v midc=${3:-49} '
    function sort_order(  what, l,  i, j )
    {
        for( i = 0; i < l; i++ )
        {
            big = 0;
            for( j = 1; j < l-i; j++ )
                if( what[j] > what[big] )
                    big = j;
            if( big != j-1 )
            {
                hold = what[j-1];
                what[j-1] = what[big];
                what[big] = hold;
            }
        }
    }

    {
        mid = $(midc);                      # pick up the matrix id
        if( ! mseen[mid]++ )
            order_m[moidx++] = mid;         # capture for print at end

        if( $(c2) == "" )                   # if value in c2 is missing, we dont count
            next;

        if( $(c1) == "" )
            $(c1) = "BLANK";                # easy eycatcher for empty field

        if( !seen_1[$(c1)]++ )
            order_1[o1idx++] = $(c1);       # order each c1 was observed

        if( !seen_2[$(c2)]++ )
            order_2[o2idx++] = $(c2);       # order each c2 was observed

        if( $(c1) == 255 && $13 != 0  )     # special case: c1 is 255 and $13 is non-zero, count as though c1 was 0
            $(c1) = 0;

        count[mid,$(c1),$(c2)]++;           # count the number of times the pair (c1,c2) appear together
    }

    function print_matrix( mid,     j, i )
    {
        printf( "%15s ", "COLA/COLB" );         # print the header line using the order_2 list
        for( j=0; j < o2idx; j++ )
            printf( "%15s ", order_2[j] );      # %15s will align columns based on a width of 15
        printf( "\n" );                         # new line for first row of matrix

        for( i = 0; i < o1idx; i++ )            # print matrix -- for each row (c1 value)
        {
            printf( "%15s ", order_1[i] );      # print the c1 value (again width of 15)
            for( j=0; j < o2idx; j++ )          # for each column (c2 values)
                printf( "%15d ", count[mid,order_1[i],order_2[j]] );    # print each (width 15 again)

            printf( "\n" );                     # end the row by printing a newline
        }
    }

    END {
        sort_order( order_1, length( order_1 ) );       # sort values from c1, and c2
        sort_order( order_2, length( order_2 ) );
        sort_order( order_m, length( order_m ) );

        for( i = 0; i < moidx; i++ )
        {
            printf( "\nMatrix: %s\n", order_m[i] );     # header for the matrix
            print_matrix( order_m[i] );
        }
    }
'

exit

Assuming you save this file in /home/userx/bin/gen_matrix.ksh and your test data is always in /usr2/data/matrix_input.csv, then you can schedule your job in cron with this command:

Code:

/home/userx/bin/gen_matrix.ksh /usr2/data/matrix_input.csv >/tmp/matrix.out

It will always write the output to the same file. If your input file will be different each day (maybe having the date), or you want your output file to be diffent each day (again with the date), then you will need to construct your input and output filenames in the script and use them on the awk command. Your cron command then just consists of the name of the script.

Examples of this:

Code:

#!/usr/bin/env ksh

date=$( date "+%Y%m%d" )   # date in yyyymmdd (mmdd sorts in order where ddmm does not)
input_file=/usr2/data/matrix_input_$date.csv   # build your file names 
output_file=/usr2/output/matrix_out.$date

# run the awk programme using your filenames on the last line
awk -F "," -v ct=$ct -v c1=${1:-31} -v c2=${2:-36} -v midc=${3:-49} '


#### body of awk programme ######


' $input_file >$output_file    # read from, write to, filenames you created

Hope this gets you further along.

Last edited by agama; 08-10-2011 at 10:08 PM.. Reason: typo

agama

View Public Profile for agama

Find all posts by agama

Shell Programming and Scripting

Problem in comparing 2 files string by string

10 More Discussions You Might Find Interesting

1. UNIX for Advanced & Expert Users

Help comparing string, please

Discussion started by: systemoper

2. Shell Programming and Scripting

Grep string in files and list file names that contain the string

Discussion started by: apenkov

3. Shell Programming and Scripting

How to append a string by comparing another string?

Discussion started by: jagdishrout

4. Shell Programming and Scripting

grep exact string from files and write to filename when string present in file

Discussion started by: JC_1

5. UNIX for Dummies Questions & Answers

Comparing a String variable with a string literal in a Debian shell script

Discussion started by: daveu7

6. Shell Programming and Scripting

Parsing a long string string problem for procmail

Discussion started by: cwiggler

7. Shell Programming and Scripting

Problem comparing String using IF stmt

Discussion started by: balesh

8. Shell Programming and Scripting

problem in comparing numeric with string

Discussion started by: naren_0101bits

9. Shell Programming and Scripting

Extracting a string from one file and searching the same string in other files

Discussion started by: mohancrr

10. Shell Programming and Scripting

sed problem - replacement string should be same length as matching string.

Discussion started by: amangeles