awk filter by columns of file csv


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting awk filter by columns of file csv
# 1  
Old 09-29-2013
awk filter by columns of file csv

Hi,

I would like extract some lines from file csv using awk , below the example:
I have the file test.csv with in content below.

Code:
FLUSSO;COD;DATA_LAV;ESITO
ULL;78;17/09/2013;OL
ULL;45;05/09/2013;Apertura
NP;45;13/09/2013;Riallineamento
ULLNP;78;17/09/2013;OL
NPG;14;12/09/2013;AperturaTK
NPG;14;12/09/2013;Controllo
NPNG;14;12/09/2013;AperturaTK
NP;14;12/09/2013;Controllo

I would like to have in new file :
Code:
ULL;78;17/09/2013;OL
NPNG;14;12/09/2013;AperturaTK

I tried with :

Code:
sort -n -k 2 -k 3 -k 4  test.csv > test1.csv
awk 'BEGIN {FS=OFS=";";} {  if (a[$2]++ > 1 && a[$3]++ > 1 && a[$4]++ > 1  ){   print $0";ROLE 3";    } }  ' test1.csv > test_end.csv;

but unsuccess Smilie Please can you help me ? Thanks in advance

Last edited by Don Cragun; 09-29-2013 at 05:40 PM.. Reason: Add sample input and output CODE tags.
# 2  
Old 09-29-2013
What are the rules by which you are selecting the rows?
# 3  
Old 09-29-2013
What exactly are your output criteria? The first of every set of matching fields 2, 3 & 4?
# 4  
Old 09-29-2013
I agree with what bartus11 and CarloM have said. It is not at all clear why you selected the two lines you would like to have in test_end.csv.

The awk script that you have will print any line in which the 2nd field has appeared 3 or more times, the 3rd field has appeared 3 or more times, and the 4th field has appeared 3 or more times adding a 5th field containing ROLE 3 to the end of those lines. Since your sample input file doesn't have any 4th field value that appears more than 2 times, there is no output.

If we go back a step to your sort statement, since there is no -t option specifying anything other than the default field separator, your input file only contains 1 field. You are sorting the 2nd field to the end of the line as a numeric value as your primary sort key, the 3rd field to the end of the line as a numeric value as your secondary sort key, the 4th field to the end of the line as a numeric value as your tertiary sort key, and finally (since the 2nd, 3rd, and 4th fields on all of your input lines are empty, the only key that matters) the entire line sorted alphabetically. If you added a -t ";" option and option argument, you would be sorting on the 2nd field, the day of month portion of the 3rd field, and the (usually missing) digit string at the start of the 4th field, and (again) the entire line.

Since your desired output is:
Code:
ULL;78;17/09/2013;OL
NPNG;14;12/09/2013;AperturaTK

rather than:
Code:
NPG;14;12/09/2013;Controllo;ROLE 3
NPNG;14;12/09/2013;AperturaTK;ROLE 3
ULLNP;78;17/09/2013;OL;ROLE 3

(which would have been 2nd and later occurrences of identical contents of the concatenation of fields 2, 3 and 4 with your new field added), I have no idea what you're trying to do.

Please give us a clear statement of what logic is to be used to determine what is supposed to be produced as a result of evaluating your input file.

Last edited by Don Cragun; 09-29-2013 at 07:03 PM.. Reason: Fix option letter typo and clarify needed info.
# 5  
Old 09-30-2013
Thankyou Don Cragun , I'm sorry for my error.
The statment condition are :
if field 2 > 0 && field 3 > 0 && field 4 > 0 put line into a newfile.csv.
In fact in the example that I posted there are 2 lines that well done satisfaction condition.
Thanks again,

---------- Post updated at 03:07 AM ---------- Previous update was at 03:05 AM ----------

I forgot.... probably the sort command it's no necessary for result
# 6  
Old 09-30-2013
Quote:
Originally Posted by giankan
Thankyou Don Cragun , I'm sorry for my error.
The statment condition are :
if field 2 > 0 && field 3 > 0 && field 4 > 0 put line into a newfile.csv.
In fact in the example that I posted there are 2 lines that well done satisfaction condition.
Thanks again,

---------- Post updated at 03:07 AM ---------- Previous update was at 03:05 AM ----------

I forgot.... probably the sort command it's no necessary for result
Let's go back to your first post, where you specified a sample input file:
Code:
FLUSSO;COD;DATA_LAV;ESITO
ULL;78;17/09/2013;OL
ULL;45;05/09/2013;Apertura
NP;45;13/09/2013;Riallineamento
ULLNP;78;17/09/2013;OL
NPG;14;12/09/2013;AperturaTK
NPG;14;12/09/2013;Controllo
NPNG;14;12/09/2013;AperturaTK
NP;14;12/09/2013;Controllo

and a desired output file:
Code:
ULL;78;17/09/2013;OL
NPNG;14;12/09/2013;AperturaTK

In awk, $2 > 0 is true for every input line you have except the header line.
In awk, $3 > 0 is true for every input line you have except the header line.
And, in awk, $4 > 0 is false for every input line you have including the header line.
So, your criteria for determining lines to be printed does not even come close to matching the output you say you want. (Note also that there is a huge difference between:
  1. the string in field 2 treated as a string of decimal digits and converted to an integer is greater than zero, or the string in field 3 or 4 collates higher than the string "0" (as stated above as field x > 0), and
  2. the number of occurrences the strings in fields 2, 3, and 4 seen so far in any line's field 2, 3 or 4 are all more than 2 (as implemented in your sample awk code as a[$2]++ > 1 && a[$3]++ > 1 && a[$4]++ > 1).

My best guess based on the output you say you want and some guess work based on the script you're using is that you want to print all but one line for each set of lines where the awk expression $2";"$3";"$4 expands to the same string. But, if that is the case, why isn't there supposed to a line in your output corresponding to the two lines shown in red in your sample input? If that is not what you're trying to do, please try again to clearly explain what criteria is used to determine if a line is to be printed!

Your desired output shows that you chose the 1st line containing 78;17/09/2013;OL, but you chose the 2nd line containing 14;12/09/2013;AperturaTK. (This would be true whether you sorted the input using the sort command you provided or just processed the input without sorting it.)

Does the following simple awk script do what you want?:
Code:
awk '
BEGIN { FS = OFS = ";" }
a[$2,$3,$4]++ { print $0, "ROLE 3" }
' test.csv

If you want to run this on a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk.

With your sample input, the script above produces the output:
Code:
ULLNP;78;17/09/2013;OL;ROLE 3
NPNG;14;12/09/2013;AperturaTK;ROLE 3
NP;14;12/09/2013;Controllo;ROLE 3

# 7  
Old 09-30-2013
Short cut processing of
Code:
if (a[$2]++ > 1 && a[$3]++ > 1 && a[$4]++ > 1

will circumvent the evaluation of a[$3] and/or a[$4] should either of the earlier comparisons fail. Try a modification of your script:
Code:
awk     'BEGIN          {FS=OFS=";"}
                        {a[$2]++;a[$3]++;a[$4]++
                         if (a[$2] > 1 && a[$3] > 1 && a[$4] > 1){print $0";ROLE 3";}
                        }
        ' file
ULLNP;78;17/09/2013;OL;ROLE 3
NPNG;14;12/09/2013;AperturaTK;ROLE 3
NP;14;12/09/2013;Controllo;ROLE 3

which is - funny enough - identical to Don Cragun's result. Sorting will in almost any case modify that result, as the order in which the lines fulfill the conditions will be scrambled.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Filter using awk in CSV files

Hello Gentlemen, Finding difficulties to play with my Input files:confused: . Your guidance will certainly help as always. After converting to csv file from XLSM file, I am getting some extra ""(double quote) characters which I want to terminate inside shell script and process it further. ... (6 Replies)
Discussion started by: pradyumnajpn10
6 Replies

2. Shell Programming and Scripting

Add 8 columns at the end of .csv file using awk

Hello all, I have a .csv file of 16 columns consists of bunch of numbers. 6.45E+01 1.17E+01 8.10E+04 8.21E+01 8.50E+00 1.20E+01 1.02E+01 1.88E+01 1.86E+04 3.53E+03 1.09E+07 3.82E+04 2.09E+03 3.57E+03 2.98E+03 3.93E+03 6.34E+01 3.23E+01 9.24E+04 ... (5 Replies)
Discussion started by: Zam_1234
5 Replies

3. Shell Programming and Scripting

Match columns from two csv files and update field in one of the csv file

Hi, I have a file of csv data, which looks like this: file1: 1AA,LGV_PONCEY_LES_ATHEE,1,\N,1,00020460E1,0,\N,\N,\N,\N,2,00.22335321,0.00466628 2BB,LES_POUGES_ASF,\N,200,200,00006298G1,0,\N,\N,\N,\N,1,00.30887539,0.00050312... (10 Replies)
Discussion started by: djoseph
10 Replies

4. Linux

Filter a .CSV file based on the 5th column values

I have a .CSV file with the below format: "column 1","column 2","column 3","column 4","column 5","column 6","column 7","column 8","column 9","column 10 "12310","42324564756","a simple string with a , comma","string with or, without commas","string 1","USD","12","70%","08/01/2013",""... (2 Replies)
Discussion started by: dhruuv369
2 Replies

5. Shell Programming and Scripting

Need help with awk statement to break nth column in csv file into 3 separate columns

Hello Members, I have a csv file in the format below. Need help with awk statement to break nth column into 3 separate columns and export the changes to new file. input file --> file.csv cat file.csv|less "product/fruit/mango","location/asia/india","type/alphonso" need output in... (2 Replies)
Discussion started by: awk-admirer
2 Replies

6. Shell Programming and Scripting

Deleting all the fields(columns) from a .csv file if all rows in that columns are blanks

Hi Friends, I have come across some files where some of the columns don not have data. Key, Data1,Data2,Data3,Data4,Data5 A,5,6,,10,, A,3,4,,3,, B,1,,4,5,, B,2,,3,4,, If we see the above data on Data5 column do not have any row got filled. So remove only that column(Here Data5) and... (4 Replies)
Discussion started by: ks_reddy
4 Replies

7. Shell Programming and Scripting

awk filter by occurence of at least two columns

Hi, Using AWK script I want to pick those rows that has AT LEAST TWO columns EACH has a count >=3. i.e. two conditions: at least two columns, each of which has a count at least 3. There must be a simple way to do this job, but could not figure it out by myself. Input file (thousand of... (3 Replies)
Discussion started by: yifangt
3 Replies

8. UNIX for Advanced & Expert Users

Help in Deleting columns and Renaming Mutliple columns in a .Csv File

Hi All, i have a .Csv file in the below format startTime, endTime, delta, gName, rName, rNumber, m2239max, m2239min, m2239avg, m100016509avg, m100019240max, metric3min, m100019240avg, propValues 11-Mar-2012 00:00:00, 11-Mar-2012 00:05:00, 300.0, vma3550a, a-1_CPU Index<1>, 200237463, 0.0,... (9 Replies)
Discussion started by: mahi_mayu069
9 Replies

9. Shell Programming and Scripting

AWK : Add columns in the end of csv file

Hi everybody, I need some help please I have a csv file named masterFile1.csv header1,header2,header3 value1,value2,value3 value4,value5,value6 I am trying to add new columns in the end of the csv to have a new csv file named masterFile2.csv like this :... (3 Replies)
Discussion started by: villebonnais
3 Replies

10. Shell Programming and Scripting

validation of data using filter (awk or other that works...) in csv files

Hello People I have the following file.csv: date,string,float,number,boolean 20080303,abc,1.5,123,Y 20080304,abc,1.2,345,N 20080229,nvh,1.4,098,Y 20080319,ugy,1.9,586,N 20080315,gyh,2.4,345,Y 20080316,erf,3.1,932,N I need to filter the date field where I have a data bigger than I... (1 Reply)
Discussion started by: Rafael.Buria
1 Replies
Login or Register to Ask a Question