Xargs, awk, match, if greater - as a one-liner


 
# 8  
Old 08-10-2016
This is what I get from your sample file above:
Code:
awk 'match($0, /PVALUE=[0-9.]*/) && substr($0, RSTART+7, RLENGTH-7)+0 > 0.05' *.txt
chrX	110000	NRHITS=10;PVALUE=0.6
chrX	120000	NRHITS=18;PVALUE=0.2

# 9  
Old 08-10-2016
You are right. The command works, but how can I use it with xargs? I have multiple files to process and I want separate output files for each.
# 10  
Old 08-10-2016
xargs won't help you here. It will put the file names found onto the command line as parameters to awk, just like the shell does in the above proposal. In either case, awk works on that single input stream, writing ALL results to stdout. If you want the output split by input file name, you need to redirect within awk.
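A minimal sketch of redirecting within awk, reusing the one-liner from post #8 (the demo file name and the `.fail` suffix are just assumptions for illustration):

```shell
# Hypothetical input mirroring the thread's data
printf 'chrX\t110000\tNRHITS=10;PVALUE=0.6\nchrX\t10000\tNRHITS=35;PVALUE=0.04\n' > demo.txt

# Each matching line is written to <its-input-file>.fail; awk's own
# redirection opens one output file per distinct FILENAME.
awk 'match($0, /PVALUE=[0-9.]*/) && substr($0, RSTART+7, RLENGTH-7)+0 > 0.05 {
        print > (FILENAME ".fail")
}' demo.txt

cat demo.txt.fail
```

With several input files on the command line, the same invocation produces one `.fail` file per input, no xargs needed.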
This User Gave Thanks to RudiC For This Post:
# 11  
Old 08-10-2016
The 1st post in this thread explicitly requested that every field in your input files be searched for PVALUE=number. But the sample data provided never shows more than one such string on an input line and, on lines that do have something matching that pattern, it always appears in the last field on the line. And we have no indication of whether or not the sample data provided in post #5 in this thread is representative of the actual data that needs to be processed. From the code samples posted, it appears that the submitter wants one output file produced for each input file that contains matched lines. The submitter also seems to want 15 copies of awk running in parallel (which only makes sense if those 15 awk commands won't be thrashing CPU and/or disk accesses).

Assuming that parallel processing won't really help much here (and might actually slow down processing), avoiding xargs completely, and assuming that an input line may contain more than one of the patterns above; I would try something more like:
Code:
awk -F';' '
FNR == 1 {
	if(of != "")
		close(of)
	of = FILENAME ".fail"
}
{	for(i = 1; i <= NF; i++)
		if($i ~ /^PVALUE=/ && (substr($i, 8) + 0) > .05) {
			print > of
			next
		}
}' *.txt

Note that this doesn't create an output file for every input file; it only creates an output file if one or more lines in the corresponding input file meet your criteria.

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
This User Gave Thanks to Don Cragun For This Post:
# 12  
Old 08-10-2016
Thanks, Don.
Parallel processing is not an issue here as I am doing it on a cluster.

The string to match invariably appears in the third column, but the order of the variables NRHITS and PVALUE within the third column might vary.

While the code does not write separate output files for each input, I am wondering if a combination of xargs and sed can help. If so, how?
Thanks

Quote:
Originally Posted by Don Cragun
(the full awk proposal from post #11, quoted above)
---------- Post updated at 01:35 PM ---------- Previous update was at 01:12 PM ----------

I received a suggestion to use perl -ne and tried the following command.

Code:
ls *.txt | xargs -I {} sh -c "perl -ne 'if ($_ =~ m/PVALUE=(\d+)/) {if ($1 >= 0.05){print $_}}' {} >{}.fail"

But I get an error:

Code:
syntax error at -e line 1, near "( =~"
syntax error at -e line 1, near "}}"
Execution of -e aborted due to compilation errors.

Any suggestions to correct the above command?

Last edited by ts89490; 08-10-2016 at 03:24 PM.
# 13  
Old 08-10-2016
Quote:
Originally Posted by ts89490
Thanks, Don.
Parallel processing is not an issue here as I am doing it on a cluster.

The string to match invariably appears in the third column, but the order of the variables NRHITS and PVALUE within the third column might vary.

While the code does not write separate output files, I am wondering if a combination of xargs and sed can help. If so, how?
Thanks
OK. I completely misunderstood your example. I thought your field separator was <semicolon>, but now I'm guessing that <tab> is your field separator, and <semicolon> is a subfield separator in your third field.

And you are wrong. The code I suggested produces a separate output file for each input file that contains lines that meet your criteria.

Using your updated description (but assuming that no <semicolon> characters appear anywhere in the 1st two fields in your input files AND assuming that a single <tab> character separates the first three fields), my code adjusted for your new description of the problem is:
Code:
awk -F'[\t;]' '
FNR == 1 {
	if(of != "")
		close(of)
	of = FILENAME ".fail"
}
{	for(i = 3; i <= NF; i++)
		if($i ~ /^PVALUE=/ && (substr($i, 8) + 0) > .05) {
			print > of
			next
		}
}' *.txt

And, with the following input files:
file1.txt:
Code:
#row1	
#row2	
#row3	
#row4	
#row5	
CHR	POS	INFO
chrX	10000	NRHITS=35;PVALUE=0.04
chrX	109000	NRHITS=6;PVALUE=
chrX	110000	NRHITS=10;PVALUE=0.6
chrX	120000	NRHITS=18;PVALUE=0.2
chrX	130000	NRHITS=39;PVALUE=0.035

file2.txt:
Code:
#row1	
#row2	
#row3	
#row4	
#row5	
CHR	POS	INFO
chrX	10000	PVALUE=0.04;NRHITS=35
chrX	109000	PVALUE=;NRHITS=6
chrX	110000	PVALUE=0.6;NRHITS=10
chrX	120000	PVALUE=0.2;NRHITS=18
chrX	130000	PVALUE=0.035;NRHITS=39

file3.txt:
Code:
#row1	
#row2	
#row3	
#row4	
#row5	
CHR	POS	INFO
chrX	10000	EXTRA=1;NRHITS=35;PVALUE=0.04
chrX	109000	NRHITS=6;PVALUE=;EXTRA=2
chrX	110000	NRHITS=10;EXTRA=3;PVALUE=0.6
chrX	120000	EXTRA=4;NRHITS=18;PVALUE=0.2
chrX	130000	NRHITS=39;PVALUE=0.035;EXTRA=5

file4.txt:
Code:
#row1	
#row2	
#row3	
#row4	
#row5	
CHR	POS	INFO
chrX	10000	NRHITS=35;PVALUE=0.04
chrX	109000	NRHITS=6;PVALUE=
chrX	110000	NRHITS=10;PVALUE=0.006
chrX	120000	NRHITS=18;PVALUE=0.02
chrX	130000	NRHITS=39;PVALUE=0.035

It produces the output files:
file1.txt.fail:
Code:
chrX	110000	NRHITS=10;PVALUE=0.6
chrX	120000	NRHITS=18;PVALUE=0.2

file2.txt.fail:
Code:
chrX	110000	PVALUE=0.6;NRHITS=10
chrX	120000	PVALUE=0.2;NRHITS=18

file3.txt.fail:
Code:
chrX	110000	NRHITS=10;EXTRA=3;PVALUE=0.6
chrX	120000	EXTRA=4;NRHITS=18;PVALUE=0.2

Note that there is no file4.txt.fail file because no line in file4.txt meets your criteria.
This User Gave Thanks to Don Cragun For This Post:
# 14  
Old 08-10-2016
WHY do you insist on xargs? You have received several proposals that work entirely without it, although they may be somewhat off target as the target is not THAT clear. perl, sed, awk - they all will do what (we think) you need on an input stream of the desired file names.
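That said, if xargs really is a hard requirement (say, to fan out one process per file on the cluster), the per-file redirection can happen in an inner shell. A sketch, assuming hypothetical demo file names; passing the file name as "$1" keeps it out of the awk program text:

```shell
# Hypothetical demo inputs
printf 'chrX\t110000\tNRHITS=10;PVALUE=0.6\n' > x1.txt
printf 'chrX\t10000\tNRHITS=35;PVALUE=0.04\n' > x2.txt

# One awk per file name; each inner shell redirects to its own .fail file.
printf '%s\n' x1.txt x2.txt | xargs -I{} sh -c '
    awk "match(\$0, /PVALUE=[0-9.]*/) &&
         substr(\$0, RSTART+7, RLENGTH-7)+0 > 0.05" "$1" > "$1.fail"
' sh {}

cat x1.txt.fail
```

GNU xargs can add -P to run several of these in parallel; note, though, that unlike the pure-awk proposals above, this creates a (possibly empty) .fail file for every input.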
This User Gave Thanks to RudiC For This Post: