awk filter by columns of file csv

09-29-2013

Hi,

I would like to extract some lines from a CSV file using awk. Here is an example:
I have the file test.csv with the content below.

Code:

FLUSSO;COD;DATA_LAV;ESITO
ULL;78;17/09/2013;OL
ULL;45;05/09/2013;Apertura
NP;45;13/09/2013;Riallineamento
ULLNP;78;17/09/2013;OL
NPG;14;12/09/2013;AperturaTK
NPG;14;12/09/2013;Controllo
NPNG;14;12/09/2013;AperturaTK
NP;14;12/09/2013;Controllo

I would like to have this in a new file:

Code:

ULL;78;17/09/2013;OL
NPNG;14;12/09/2013;AperturaTK

I tried with:

Code:

sort -n -k 2 -k 3 -k 4  test.csv > test1.csv
awk 'BEGIN {FS=OFS=";";} {  if (a[$2]++ > 1 && a[$3]++ > 1 && a[$4]++ > 1  ){   print $0";ROLE 3";    } }  ' test1.csv > test_end.csv;

but without success.

Can you please help me? Thanks in advance.

Last edited by Don Cragun; 09-29-2013 at 05:40 PM.. Reason: Add sample input and output CODE tags.

giankan


09-29-2013

What are the rules by which you are selecting the rows?

bartus11


09-29-2013

What exactly are your output criteria? The first of every set of matching fields 2, 3 & 4?

CarloM


09-29-2013

I agree with what bartus11 and CarloM have said. It is not at all clear why you selected the two lines you would like to have in test_end.csv.

The awk script that you have will print any line in which the 2nd field has appeared 3 or more times, the 3rd field has appeared 3 or more times, and the 4th field has appeared 3 or more times, adding a 5th field containing ROLE 3 to the end of those lines. Since your sample input file doesn't have any 4th field value that appears more than 2 times, there is no output.
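As a rough illustration of the counting described above, the sketch below applies only the field-2 test to the sample data lines. The array name seen is made up for the example, and the header line is left out for brevity. Because seen[$2]++ yields the count from before the current line, the test first succeeds on the third line carrying a given value.

```shell
# Sample data lines from the first post (header omitted for brevity).
input='ULL;78;17/09/2013;OL
ULL;45;05/09/2013;Apertura
NP;45;13/09/2013;Riallineamento
ULLNP;78;17/09/2013;OL
NPG;14;12/09/2013;AperturaTK
NPG;14;12/09/2013;Controllo
NPNG;14;12/09/2013;AperturaTK
NP;14;12/09/2013;Controllo'

# "seen" is a hypothetical array name. seen[$2]++ returns the old count,
# so "> 1" is true only from the third occurrence of a field-2 value on.
third_or_later=$(printf '%s\n' "$input" | awk -F';' 'seen[$2]++ > 1')

printf '%s\n' "$third_or_later"
```

Only the value 14 occurs more than twice in field 2, so only its third and fourth lines are selected.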

If we go back a step to your sort statement: since there is no -t option specifying anything other than the default field separator, each line of your input file contains only 1 field. You are sorting the 2nd field to the end of the line as a numeric value as your primary sort key, the 3rd field to the end of the line as a numeric value as your secondary sort key, the 4th field to the end of the line as a numeric value as your tertiary sort key, and finally the entire line sorted alphabetically. Since the 2nd, 3rd, and 4th fields on all of your input lines are empty, that last key is the only one that matters.

If you added a -t ";" option and option argument, you would be sorting on the 2nd field, the day of month portion of the 3rd field, and the (usually missing) digit string at the start of the 4th field, and (again) the entire line.
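For reference, a sort call that does split on ";" and limits each key to a single field might look like the sketch below. The choice of keys (field 2 numeric, fields 3 and 4 as plain strings) and the use of LC_ALL=C are assumptions made for the example, not something the original post asked for.

```shell
# Sample input from the first post, including the header line.
input='FLUSSO;COD;DATA_LAV;ESITO
ULL;78;17/09/2013;OL
ULL;45;05/09/2013;Apertura
NP;45;13/09/2013;Riallineamento
ULLNP;78;17/09/2013;OL
NPG;14;12/09/2013;AperturaTK
NPG;14;12/09/2013;Controllo
NPNG;14;12/09/2013;AperturaTK
NP;14;12/09/2013;Controllo'

# -t ';'   use ";" as the field separator
# -k2,2n   primary key: field 2 only, compared numerically
# -k3,3    secondary key: field 3 only, compared as a string
#          (dd/mm/yyyy strings do not sort chronologically)
# -k4,4    tertiary key: field 4 only, compared as a string
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort -t ';' -k2,2n -k3,3 -k4,4)

printf '%s\n' "$sorted"
```

The header's COD has no leading digits, so it counts as 0 for the numeric key and sorts ahead of the data lines.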

Since your desired output is:

Code:

ULL;78;17/09/2013;OL
NPNG;14;12/09/2013;AperturaTK

rather than:

Code:

NPG;14;12/09/2013;Controllo;ROLE 3
NPNG;14;12/09/2013;AperturaTK;ROLE 3
ULLNP;78;17/09/2013;OL;ROLE 3

(which would have been 2nd and later occurrences of identical contents of the concatenation of fields 2, 3 and 4 with your new field added), I have no idea what you're trying to do.

Please give us a clear statement of what logic is to be used to determine what is supposed to be produced as a result of evaluating your input file.

Last edited by Don Cragun; 09-29-2013 at 07:03 PM.. Reason: Fix option letter typo and clarify needed info.

Don Cragun


09-30-2013

Thank you Don Cragun, I'm sorry for my error.
The conditions are:
if field 2 > 0 && field 3 > 0 && field 4 > 0, put the line into newfile.csv.
In fact, in the example that I posted there are 2 lines that satisfy the condition.
Thanks again,

---------- Post updated at 03:07 AM ---------- Previous update was at 03:05 AM ----------

I forgot... the sort command is probably not necessary for the result.

giankan


09-30-2013

Let's go back to your first post, where you specified a sample input file:

Code:

FLUSSO;COD;DATA_LAV;ESITO
ULL;78;17/09/2013;OL
ULL;45;05/09/2013;Apertura
NP;45;13/09/2013;Riallineamento
ULLNP;78;17/09/2013;OL
NPG;14;12/09/2013;AperturaTK
NPG;14;12/09/2013;Controllo
NPNG;14;12/09/2013;AperturaTK
NP;14;12/09/2013;Controllo

and a desired output file:

Code:

ULL;78;17/09/2013;OL
NPNG;14;12/09/2013;AperturaTK

In awk, $2 > 0 is a numeric comparison on your data lines, because 78, 45 and 14 look like numbers, and it is true for all of them.
Fields 3 and 4 (and COD in the header line) do not look like numbers, so for them $x > 0 becomes a string comparison against the string "0". In the C locale, every one of those strings collates after "0", so those tests are true as well.
So, taken literally, your criteria select every line in the file, including the header, and do not come close to matching the output you say you want. Note also that there is a huge difference between:

the string in field 2 treated as a string of decimal digits and converted to a number being greater than zero, or the string in field 3 or 4 collating higher than the string "0" (as stated above as field x > 0), and
the number of occurrences of the strings in fields 2, 3, and 4 seen so far in any line's field 2, 3 or 4 all being more than 2 (as implemented in your sample awk code as a[$2]++ > 1 && a[$3]++ > 1 && a[$4]++ > 1).
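One way to check how awk actually treats the condition as literally stated (field x > 0) is to run it on the sample and count the lines that pass. The sketch below assumes the C locale, forced with LC_ALL=C. A field that does not look like a number is compared as a string against "0", so in that locale I would expect all nine lines, header included, to pass.

```shell
# Sample input from the first post, including the header line.
input='FLUSSO;COD;DATA_LAV;ESITO
ULL;78;17/09/2013;OL
ULL;45;05/09/2013;Apertura
NP;45;13/09/2013;Riallineamento
ULLNP;78;17/09/2013;OL
NPG;14;12/09/2013;AperturaTK
NPG;14;12/09/2013;Controllo
NPNG;14;12/09/2013;AperturaTK
NP;14;12/09/2013;Controllo'

# Count the lines for which the stated condition holds.
# "78" looks numeric, so $2 > 0 is a numeric test on data lines;
# "COD", "17/09/2013" and "OL" do not, so they are compared as strings.
passed=$(printf '%s\n' "$input" |
    LC_ALL=C awk -F';' '$2 > 0 && $3 > 0 && $4 > 0 { n++ } END { print n + 0 }')

echo "lines passing: $passed"
```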

My best guess, based on the output you say you want and some guesswork based on the script you're using, is that you want to print all but one line for each set of lines where the awk expression $2";"$3";"$4 expands to the same string. But, if that is the case, why isn't there supposed to be a line in your output corresponding to the two Controllo lines (NPG;14;12/09/2013;Controllo and NP;14;12/09/2013;Controllo) in your sample input? If that is not what you're trying to do, please try again to clearly explain what criteria are used to determine if a line is to be printed!

Your desired output shows that you chose the 1st line containing 78;17/09/2013;OL, but you chose the 2nd line containing 14;12/09/2013;AperturaTK. (This would be true whether you sorted the input using the sort command you provided or just processed the input without sorting it.)

Does the following simple awk script do what you want?:

Code:

awk '
BEGIN { FS = OFS = ";" }
a[$2,$3,$4]++ { print $0, "ROLE 3" }
' test.csv

If you want to run this on a Solaris/SunOS system, use /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk instead of awk.

With your sample input, the script above produces the output:

Code:

ULLNP;78;17/09/2013;OL;ROLE 3
NPNG;14;12/09/2013;AperturaTK;ROLE 3
NP;14;12/09/2013;Controllo;ROLE 3
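For comparison, if the intent were instead to keep the first line of each combination of fields 2, 3 and 4 and drop the later ones, negating the same counter would do it. This is only a sketch of that alternative reading: the array name seen and the NR > 1 test that skips the header are assumptions for the example.

```shell
# Sample input from the first post, including the header line.
input='FLUSSO;COD;DATA_LAV;ESITO
ULL;78;17/09/2013;OL
ULL;45;05/09/2013;Apertura
NP;45;13/09/2013;Riallineamento
ULLNP;78;17/09/2013;OL
NPG;14;12/09/2013;AperturaTK
NPG;14;12/09/2013;Controllo
NPNG;14;12/09/2013;AperturaTK
NP;14;12/09/2013;Controllo'

# NR > 1 skips the header. !seen[$2,$3,$4]++ is true only the first time
# a given combination of fields 2, 3 and 4 is encountered.
first_only=$(printf '%s\n' "$input" | awk -F';' 'NR > 1 && !seen[$2,$3,$4]++')

printf '%s\n' "$first_only"
```

With the sample input this keeps five lines, one per distinct combination, which also does not match the two lines requested in the first post.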

Don Cragun


09-30-2013

Short-circuit evaluation of

Code:

if (a[$2]++ > 1 && a[$3]++ > 1 && a[$4]++ > 1)

will skip the evaluation (and therefore the increment) of a[$3] and/or a[$4] whenever one of the earlier comparisons fails. Try a modification of your script:

Code:

awk     'BEGIN          {FS=OFS=";"}
                        {a[$2]++;a[$3]++;a[$4]++
                         if (a[$2] > 1 && a[$3] > 1 && a[$4] > 1){print $0";ROLE 3";}
                        }
        ' file
ULLNP;78;17/09/2013;OL;ROLE 3
NPNG;14;12/09/2013;AperturaTK;ROLE 3
NP;14;12/09/2013;Controllo;ROLE 3

which is, funnily enough, identical to Don Cragun's result. Sorting the input first will almost always change that result, because it changes the order in which the lines come to fulfill the conditions.
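The effect of short-circuit evaluation on the counters can be seen in isolation with a small sketch. The variable names a, b, c, d and both are invented for the example. Each loop runs three times: the first uses increments inside the && chain, the second increments first and tests afterwards.

```shell
result=$(awk 'BEGIN {
    # Increments inside the && chain: b++ runs only when a++ > 1 is true.
    for (i = 1; i <= 3; i++) both = (a++ > 1 && b++ > 1)

    # Increments done up front: c and d are always bumped together.
    for (i = 1; i <= 3; i++) { c++; d++; both = (c > 1 && d > 1) }

    print a, b, c, d
}')

echo "$result"   # prints: 3 1 3 3
```

In the first loop b is incremented only on the third pass, once a++ > 1 finally holds, so it lags behind a.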

RudiC
