Problem in comparing 2 files string by string


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Problem in comparing 2 files string by string
# 1  
Old 08-07-2011
Problem in comparing 2 files string by string

Hi Champs,
I am a newbie to unix world, and I am trying to built a script which seems to be far tough to be done alone by me.....

" I am having a raw csv file which contains around 50 fields..."
From that file I have to grep 2 fields "A" and "B"....field "A" is to be aligned vertically while field "B" is to be aligned horizontally...

Both the fields may have around 10-20 different numbers...so that I would be getting two arrays of numbers...vertical for field A and horizontal for field B....
#########CODE#########
Code:
#!/bin/bash
cat raw_file.csv | awk -F "," '{ print $A}' | sort | uniq -c | sort -k 2g 

cat raw_file.csv | awk -F "," '{ print $A}' | sort | uniq -c | sort -k 2g

#############
This code is giving me the unique counts for both fields in sorted form....
now I have made 2 files 1.txt and 2.txt having sorted array of fields A and B....
##############CODE###########
Code:
cat raw_file.csv | awk -F "," '{ print $A}' | sort | uniq -c | sort -k 2g | awk '{print $2}' > 1.txt

cat raw_file.csv | awk -F "," '{ print $B}' | sort | uniq -c | sort -k 2g | awk '{printf "%5s", $2}' > 2.txt

paste 1.txt 2.txt > 3.txt

##################
here I got the sorted aligned 2 arrays in file 3.txt.....
upto here is fine..

I need the individual count for each string of hoz array wrt each string of vertical array......
I want the output like....
##########
Code:
          B1 B2 B3 B4
A1     
A2   
A3
A4

###############
(vertical array is also having some blank values....)

Please help.....!!!!
# 2  
Old 08-07-2011
Hi,

An example of your input file would be useful to try to get a solution. At least for me.

Regards,
Birei
# 3  
Old 08-07-2011
If I am reading your post correctly, what you have is a file like this:

Code:
a,b,c,d,e,f,g,h
g,b,c,d,f,a,i,j
c,e,d,f,h,a,b,a
h,b,j,a,b,c,d,f
a,b,c,e,a,b,c

And for any two colums (2 and 3 lets say) you want to know for each value in column two, the number of times a value in column three appeared with that value. For instance, the value 'b' appears in column 2 with the value 'c' three times and the value j once. So for the above matrix, the desired output would be:

Code:
     COLA/COLB    c   d   j
      b           3   0   1
      e           0   1   0

which shows that column 2 (your A column) had two values (b and e) and column 3 (your B column) had three values (c, d, and j) with the listed combinations of counts.

If this makes sense, and is what you desire, then I think all you need is a simple awk to process your csv file directly:

Code:
#!/usr/bin/env ksh

awk -F "," -v c1=${1:-1} -v c2=${2:-2} '
    {
        if( !seen_1[$(c1)]++ )
            order_1[o1idx++] = $(c1);

        if( !seen_2[$(c2)]++ )
            order_2[o2idx++] = $(c2);

        count[$(c1),$(c2)]++;
    }

    END {
        printf( "%15s ", "COLA/COLB" );
        for( j=0; j < o2idx; j++ )              # header
            printf( "%15s ", order_2[j] );
        printf( "\n" );

        for( i = 0; i < o1idx; i++ )            # print matrix
        {
            printf( "%15s ", order_1[i] );
            for( j=0; j < o2idx; j++ )
                printf( "%15d ", count[order_1[i],order_2[j]] );

            printf( "\n" );
        }
    }
'

exit

The script takes colA and colB from the command line and reads the csv file from stdin. If you put this into foo.ksh, the command line to look at columns 5 and 10 would be:

Code:
foo.ksh 5 10 <file.csv

If no colums are given on the command line, the defaults are columns 1 and 2.

If this isn't what you had in mind, please clarify and add some sample input and desired output based on the sample.
# 4  
Old 08-09-2011
Thanx a lot Agama

Thanks very much friend,
Basically my problem is solved upto 80%....but here are some more modifications that are needed in the o/p....
1) my csv file is very big and having around 50 fields out of which for field 39 there are 4 values say 1,2,3,4....
for each value of 39th field I have to manipulate the file like such that field 21 comes in vertical and field 26 comes in horizontal.....this 26th field is having many entries upto 15-20......for example....
cat jitendra.csv | awk -F "|" '{if ($39==3) print $0}'.....now for this manipulated file I have to work accordingly for field 21 & 26.......

2) the o/p for 21 and 26 field is not in sorted manner,,, actually I need it in sorted manner......

3) some of the values in field 21(vertical) are blank......so for that blank values I would like to print a "blank"....but no entry in hoz array is blank....

4) also the o/p is not aligned as pasted below.....

####################
COLA/COLB 9008 8739 8822
0 3281 874 389
255 123 18 11
1 3 2 0
133 6 0 0
254 120 6 3
19 240 7 17
137 110 13 3
34 22 0 0
17 4 0 0
###################

..................................................
Actually friend I had to do a lot of modifications more n this file......aside to the above queries.....

can u just help me in understanding that code....because I am a newbie and half of the code is not understandable by me.....
so that I could make some modifications by myself.....

............Any help wud be highly appreciated....thanx in advance.....

Last edited by jitendra.pat04; 08-09-2011 at 04:44 PM..
# 5  
Old 08-09-2011
Here's the code with comments in hopes that you can figure out what is going on. I added a sort (some awk versions don't have a built-in sort function, so I wrote one that shouldn't be too inefficient based on the number of rows/columns you've indicated.

Code:
#!/usr/bin/env ksh

awk -F "," -v c1=${1:-1} -v c2=${2:-2} '
    function sort_order(  what, l,       i, j )
    {
        for( i = 0; i < l; i++ )        # big pass across array
        {
            big = 0;                      # set first entry as the largest value
            for( j = 1; j < l-i; j++ )       # small pass (up to length - i)
                if( what[j] > what[big] )  # if this value larger, mark it 
                    big = j;
            if( big != j-1 )              # if largest is not the last checked
            {
                hold = what[j-1];      # swap the last for the largest
                what[j-1] = what[big];
                what[big] = hold;
            }
        }

    }

    {
        if( $(c2) == "" )                   # if value in c2 is missing, we dont count
            next;

        if( $(c1) == "" )
            $(c1) = "BLANK";                # easy eycatcher for empty field

        if( !seen_1[$(c1)]++ )
            order_1[o1idx++] = $(c1);       # order each c1 was observed

        if( !seen_2[$(c2)]++ )
            order_2[o2idx++] = $(c2);       # order each c2 was observed

        count[$(c1),$(c2)]++;               # count the number of times the pair (c1,c2) appear together
    }

    END {
        sort_order( order_1, length( order_1 ) );       # sort values from c1, and c2
        sort_order( order_2, length( order_2 ) );

        printf( "%15s ", "COLA/COLB" );         # print the header line using the order_2 list
        for( j=0; j < o2idx; j++ )
            printf( "%15s ", order_2[j] );      # %15s will align columns based on a width of 15
        printf( "\n" );                         # new line for first row of matrix

        for( i = 0; i < o1idx; i++ )            # print matrix -- for each row (c1 value)
        {
            printf( "%15s ", order_1[i] );      # print the c1 value (again width of 15)
            for( j=0; j < o2idx; j++ )          # for each column (c2 values)
                printf( "%15d ", count[order_1[i],order_2[j]] );    # print each (width 15 again)

            printf( "\n" );                     # end the row by printing a newline
        }
    }
'

exit

As for it not lining up correctly, that seems odd. Each field is printed with a width of 15 characters (%15d or %15s) and provided that none of the data is more than 15 characters wide it should space out nicely. You might need to reduce the setting (from 15 to 8 or something like that) in order to suit your needs. The programme also prints the word "BLANK" for empty data (in column A) and does not count anything if column B is missing. I assumed that missing columns were adjacent comas (,,).

For testing I ran this small set of data:

Code:
6000,1000,100,5000,3000,100,1000,1000,2000,4000
2000,7000,4000,,8000,9000,3000,100,1000,2000
4000,6000,6000,9000,100,7000,8000,7000,9000,8000
4000,4000,5000,,8000,1000,6000,6000,5000,9000
9000,5000,8000,100,6000,4000,8000,8000,9000,7000
7000,1000,100,7000,8000,4000,6000,2000,5000,6000
100,5000,6000,3000,5000,5000,100,4000,4000,6000
7000,9000,4000,6000,4000,8000,8000,6000,6000,9000
9000,1000,4000,,5000,1000,1000,2000,9000,8000
7000,1000,5000,1,5000,3000,100,3000,4000,1000

to ensure sorting and blanks worked. With the command test_script 4 5 <test_data the following was the output:

Code:
      COLA/COLB             100            3000            4000            5000            6000            8000 
              1               0               0               0               1               0               0 
            100               0               0               0               0               1               0 
           3000               0               0               0               1               0               0 
           5000               0               1               0               0               0               0 
           6000               0               0               1               0               0               0 
           7000               0               0               0               0               0               1 
           9000               1               0               0               0               0               0 
          BLANK               0               0               0               1               0               2

The columns are 15 characters wide, and the BLANK value will sort out last if the data in the input is all numeric. You can replace "BLANK" with " " if you just want a space in the output. Unfortunately, I cannot say what might be happening if your columns are misaligned as it works for me in my test environment.

Hope this has been of some value.
This User Gave Thanks to agama For This Post:
# 6  
Old 08-10-2011
Thanx a lot again Agama

Now it seems more accurate,,
a few issues are left again.....

I have attached for reference...my file..............

1) I have to cat my file and put filter for $49....as this field had 4 values.....1,2,3,4..........and then I have to process the above entire code for each value respectively.....
so can I use this code like......

cat scripts_test.txt | awk -F "," '{if ($49==1) print $0}' | CODE........????

2) now the field that I want in vertical is $31....and in hoz is $36................also for $31==255....we have to check wheather $13==0 or $13!=0....
if $13!=0.....then count of value for which $31==255....should be added to the count of value for $31==0......for each entry of $36....
else the value shuld be the same for $31==255..............

3) As I have to automate this script in crontab....so I cannot print in command line "test.sh 31 36 <script.test.txt"

can I directly insert my fields as ......
awk -F "," -v c1=${31:-31} -v c2=${36:-36} '....and run the script as
./test_sh.........which will read the data from the file script_test.txt......????

Please suggest.....!!!!!!!!!!!!

Lots of thanx in advance friend....!!!!

Last edited by jitendra.pat04; 08-20-2011 at 05:50 AM..
# 7  
Old 08-10-2011
Glad it's working! To answer your questions first:
Quote:
can I directly insert my fields as ......
awk -F "," -v c1=${31:-31} -v c2=${36:-36} '....and run the script as
./test_sh.........which will read the data from the file script_test.txt......????
Yes, but you will need to redirect your file in like this:
Code:
./test_sh <script_test.txt

Quote:
1) I have to cat my file and put filter for $49....as this field had 4 values.....1,2,3,4..........and then I have to process the above entire code for each value respectively.....
so can I use this code like......

cat scripts_test.txt | awk -F "," '{if ($49==1) print $0}' | CODE........????
Yes, but you don't need cat. Awk can read the file directly so something like this:

Code:
awk -F "," '{if ($49==1) print $0}' scripts_test.txt| CODE........????

However, you can have the programme that generates the matrix make a single pass across your input data and generate all 4 (or however many different values $49 has) matrices. Given that, the script below will print one matrix for each unique value in $49. Thus, there is no need to run the preprocess script that prints only records where $49 == 1 etc.

It will take from the command line, in this order, the column to print vertically, the column to print horizontally, the column which identifies the matrix. If no parms are given, then the defaults are 31 (vert) 36 (horiz) and 49 (id column).

Finally, in the special case where c1 has the value of 255, and $13 is non-zero, the count will be updated as though the value of c1 was zero.

Code:
#!/usr/bin/env ksh

# $1 == c1 == vertical
# $2 == c2 == horizontal
# $3 == matrix id col (midc)
#
awk -F "," -v ct=$ct -v c1=${1:-31} -v c2=${2:-36} -v midc=${3:-49} '
    function sort_order(  what, l,  i, j )
    {
        for( i = 0; i < l; i++ )
        {
            big = 0;
            for( j = 1; j < l-i; j++ )
                if( what[j] > what[big] )
                    big = j;
            if( big != j-1 )
            {
                hold = what[j-1];
                what[j-1] = what[big];
                what[big] = hold;
            }
        }
    }

    {
        mid = $(midc);                      # pick up the matrix id
        if( ! mseen[mid]++ )
            order_m[moidx++] = mid;         # capture for print at end

        if( $(c2) == "" )                   # if value in c2 is missing, we dont count
            next;

        if( $(c1) == "" )
            $(c1) = "BLANK";                # easy eycatcher for empty field

        if( !seen_1[$(c1)]++ )
            order_1[o1idx++] = $(c1);       # order each c1 was observed

        if( !seen_2[$(c2)]++ )
            order_2[o2idx++] = $(c2);       # order each c2 was observed

        if( $(c1) == 255 && $13 != 0  )     # special case: c1 is 255 and $13 is non-zero, count as though c1 was 0
            $(c1) = 0;

        count[mid,$(c1),$(c2)]++;           # count the number of times the pair (c1,c2) appear together
    }

    function print_matrix( mid,     j, i )
    {
        printf( "%15s ", "COLA/COLB" );         # print the header line using the order_2 list
        for( j=0; j < o2idx; j++ )
            printf( "%15s ", order_2[j] );      # %15s will align columns based on a width of 15
        printf( "\n" );                         # new line for first row of matrix

        for( i = 0; i < o1idx; i++ )            # print matrix -- for each row (c1 value)
        {
            printf( "%15s ", order_1[i] );      # print the c1 value (again width of 15)
            for( j=0; j < o2idx; j++ )          # for each column (c2 values)
                printf( "%15d ", count[mid,order_1[i],order_2[j]] );    # print each (width 15 again)

            printf( "\n" );                     # end the row by printing a newline
        }
    }

    END {
        sort_order( order_1, length( order_1 ) );       # sort values from c1, and c2
        sort_order( order_2, length( order_2 ) );
        sort_order( order_m, length( order_m ) );

        for( i = 0; i < moidx; i++ )
        {
            printf( "\nMatrix: %s\n", order_m[i] );     # header for the matrix
            print_matrix( order_m[i] );
        }
    }
'

exit

Assuming you save this file in /home/userx/bin/gen_matrix.ksh and your test data is always in /usr2/data/matrix_input.csv, then you can schedule your job in cron with this command:

Code:
/home/userx/bin/gen_matrix.ksh /usr2/data/matrix_input.csv >/tmp/matrix.out

It will always write the output to the same file. If your input file will be different each day (maybe having the date), or you want your output file to be diffent each day (again with the date), then you will need to construct your input and output filenames in the script and use them on the awk command. Your cron command then just consists of the name of the script.

Examples of this:
Code:
#!/usr/bin/env ksh

date=$( date "+%Y%m%d" )   # date in yyyymmdd (mmdd sorts in order where ddmm does not)
input_file=/usr2/data/matrix_input_$date.csv   # build your file names 
output_file=/usr2/output/matrix_out.$date

# run the awk programme using your filenames on the last line
awk -F "," -v ct=$ct -v c1=${1:-31} -v c2=${2:-36} -v midc=${3:-49} '


#### body of awk programme ######


' $input_file >$output_file    # read from, write to, filenames you created

Hope this gets you further along.

Last edited by agama; 08-10-2011 at 10:08 PM.. Reason: typo
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Advanced & Expert Users

Help comparing string, please

Good morning, I need compare this string. if || || ; then But this line not work, somebody can say me what is the error. Thank you for advanced. (5 Replies)
Discussion started by: systemoper
5 Replies

2. Shell Programming and Scripting

Grep string in files and list file names that contain the string

Hi, I have a list of zipped files. I want to grep for a string in all files and get a list of file names that contain the string. But without unzipping them before that, more like using something like gzcat. My OS is: SunOS test 5.10 Generic_142900-13 sun4u sparc SUNW,SPARC-Enterprise (8 Replies)
Discussion started by: apenkov
8 Replies

3. Shell Programming and Scripting

How to append a string by comparing another string?

Hi , I have one file like BUD,BDL BUDCAR BUD,BDL BUDLAMP ABC,CDF,KLT ABISKAR ABC,CDF,KLT CORNEL ABC,CDF,KLT KANNAD JKL,HNM,KTY,KJY JAGAN JKL,HNM,KTY,KJY HOUSE JKL,HNM,KTY,KJY KATAK JKL,HNM,KTY,KJY KOLKA The o/p should be like BUD,BDL BUDCAR,BUDLAMP ABC,CDF,KLT... (4 Replies)
Discussion started by: jagdishrout
4 Replies

4. Shell Programming and Scripting

grep exact string from files and write to filename when string present in file

I am attempting to grep an exact string from a series of files within a directory and append that output to the filename when it is present in the file. I've been after this all day with no luck. Thanks for your help in advance :wall:. (4 Replies)
Discussion started by: JC_1
4 Replies

5. UNIX for Dummies Questions & Answers

Comparing a String variable with a string literal in a Debian shell script

Hi All, I am trying to to compare a string variable with a string literal inside a loop but keep getting the ./testifstructure.sh: line 6: #!/bin/sh BOOK_LIST="BOOK1 BOOK2" for BOOK in ${BOOK_LIST} do if then echo '1' else echo '2' fi done Please use next... (1 Reply)
Discussion started by: daveu7
1 Replies

6. Shell Programming and Scripting

Parsing a long string string problem for procmail

Hi everyone, I am working on fetchmail + procmail to filter mails and I am having problem with parsing a long line in the body of the email. Could anyone help me construct a reg exp for this string below. It needs to match exactly as this string. GetRyt... (4 Replies)
Discussion started by: cwiggler
4 Replies

7. Shell Programming and Scripting

Problem comparing String using IF stmt

Hi frnds Im facing an issues while trying to compare string using IF stmt, my code is: chkMsgName=`Service Fee Detail` if then if then if then echo "Valid File Ready for processing" fi fi ... (5 Replies)
Discussion started by: balesh
5 Replies

8. Shell Programming and Scripting

problem in comparing numeric with string

Hi all, I am having a problem in comparing numeric value with string. I have a variable in my script which gets the value dynamically. It can be a numeric value or a string. I have to do separate task based on its value numeric or sting variable VARIABLE. I grep FILE_COUNT and obtained... (7 Replies)
Discussion started by: naren_0101bits
7 Replies

9. Shell Programming and Scripting

Extracting a string from one file and searching the same string in other files

Hi, Need to extract a string from one file and search the same in other files. Ex: I have file1 of hundred lines with no delimiters not even space. I have 3 more files. I should get 1 to 10 characters say substring from each line of file1 and search that string in rest of the files and get... (1 Reply)
Discussion started by: mohancrr
1 Replies

10. Shell Programming and Scripting

sed problem - replacement string should be same length as matching string.

Hi guys, I hope you can help me with my problem. I have a text file that contains lines like this: 78 ANGELO -809.05 79 ANGELO2 -5,000.06 I need to find all occurences of amounts that are negative and replace them with x's 78 ANGELO xxxxxxx 79... (4 Replies)
Discussion started by: amangeles
4 Replies
Login or Register to Ask a Question