Select lines where at least x columns above threshold value


 
# 1  
Old 03-14-2013

I have a file with 20 columns. I'd like to retain only the lines for which the values in at least x columns, looking only at columns 6-20, are above a threshold.

For example, I'd like to retain only the lines in the file below that have at least 8 columns (again, looking only at columns 6-20) with a value of at least 0.75. I would also like to be able to modify the code easily, so that I can experiment with the minimum number of columns (8 in this case) as well as the threshold (0.75).

File:
Code:
s_20331    822    1    1.000    5.0    0.00000000    0.14395044    0.00000000    0.00000000    0.00000000    0.20102041    0.00000000    0.00000000    0.00000000    0.28091837    0.11224490    0.03571429    0.00000000    0.00000000    0.00000000
s_20416    154    1    1.000    5.0    0.00000000    1.00000000    0.66666667    0.40000000    0.30216165    1.00000000    0.66666667    0.45142857    0.35714286    0.11111111    0.32659933    0.55245256    0.17424242    0.32832080    0.10345717
s_20476    114    1    1.000    5.0    0.00000000    1.00000000    0.42857143    0.85100619    1.00000000    1.00000000    0.42857143    0.86996904    1.00000000    0.25000000    0.13039843    0.00000000    0.19697069    0.25000000    0.10607391
s_20477    162    1    1.000    6.0    0.20987654    0.79423868    0.81481481    0.78395062    0.77777778    1.00000000    1.00000000    1.00000000    1.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000

Output:
Code:
s_20477    162    1    1.000    6.0    0.20987654    0.79423868    0.81481481    0.78395062    0.77777778    1.00000000    1.00000000    1.00000000    1.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000    0.00000000

I'm a novice, and all I have so far is an awk command that applies a threshold to an individual column, piped into another awk command that screens the next column, and so on. This is obviously inelegant, and it also cannot allow some columns to remain below the threshold, since a line must pass every stage of the pipeline to be printed.
Code:
awk '{if($6>=0.75)print;}' | awk '{if($7>=0.9)print;}' | awk '{if($8>=0.9)print;}'  | awk '{if($9>=0.9)print;}' [...etc]

Moderator's Comments:
Mod Comment code tags also for data files

Last edited by Scrutinizer; 03-14-2013 at 04:01 PM.. Reason: additional code tags
# 2  
Old 03-14-2013
You could try something like:
Code:
#!/bin/ksh
# SYNOPSIS:
# colcheck [file [first_column [last_column [threshold [pass_count]]]]]
# DESCRIPTION:
# Print all lines in the file named by "file" (default file is input) in which
# at least "pass_count" (default value 8) values in columns "first_column"
# (default value 6) through "last_column" (default value 20) are greater than or
# equal to "threshold" (default value 0.75).
file=${1:-input}
fc=${2:-6}
lc=${3:-20}
threshold=${4:-0.75}
pass_count=${5:-8}
awk -v f="$fc" -v l="$lc" -v t="$threshold" -v p="$pass_count" '
{       c = p
        for(i = f; i <= l && c; i++) if($i >= t) c--
        if(c == 0) print
}' "$file"

If you are using a Solaris/SunOS system, use /usr/xpg4/bin/awk or nawk instead of awk.
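One way to handle that choice inside a script is sketched below. The variable name AWK is only an illustration, and the check assumes that `uname -s` reports SunOS on the systems in question.

```shell
# Pick an awk implementation; AWK is a hypothetical variable name.
AWK=awk
if [ "$(uname -s)" = "SunOS" ]; then
    if [ -x /usr/xpg4/bin/awk ]; then
        AWK=/usr/xpg4/bin/awk    # POSIX-conforming awk on Solaris
    else
        AWK=nawk                 # fall back to nawk
    fi
fi

# Quick sanity check: print the second field of a one-line input.
printf '1 2 3\n' | "$AWK" '{ print $2 }'
```

The script above could then call "$AWK" wherever it currently calls awk.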

I use the Korn shell, but this should also work with any other shell that accepts Bourne shell syntax (such as bash).
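As a rough illustration of how the parameters interact, here is the same awk logic wrapped in a shell function and run against a small made-up sample. The function form, the sample data, and the column range 3-8 are assumptions for the demonstration only, not part of the original script.

```shell
# Hypothetical sample: columns 1-2 are labels, columns 3-8 are values.
sample=$(mktemp)
cat > "$sample" <<'EOF'
a x 0.9 0.8 0.1 0.0 0.75 0.2
b y 0.1 0.2 0.3 0.9 0.10 0.0
c z 1.0 1.0 1.0 1.0 1.00 1.0
EOF

# usage: colcheck file first_column last_column threshold pass_count
colcheck() {
    awk -v f="$2" -v l="$3" -v t="$4" -v p="$5" '
    {   c = p                       # passes still needed for this line
        for (i = f; i <= l && c > 0; i++) if ($i >= t) c--
        if (c == 0) print
    }' "$1"
}

colcheck "$sample" 3 8 0.75 3    # prints lines a and c
colcheck "$sample" 3 8 0.75 4    # prints line c only
colcheck "$sample" 3 8 0.95 1    # prints line c only
rm -f "$sample"
```

Line a has exactly three values of at least 0.75 (0.9, 0.8 and 0.75), so it passes with a pass count of 3 but not 4.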

Last edited by Don Cragun; 03-14-2013 at 03:41 PM.. Reason: Fix typo in a comment.
# 3  
Old 03-14-2013
Try also:
Code:
awk '{count=0; for (col=6; col<=20; col++) ($col >= .75) ? count++ : 0; if (count>=8) print}' infile

# 4  
Old 03-14-2013
In your sample code, you don't have identical thresholds for the columns, but in your spec, you do. I'll assume the latter, as it's easier for a start.
For playing around, it might be best to have all parameters as variables:
Code:
$ awk '{cnt=0; for (i=FST; i<=LST; i++) cnt+=($i>=THR)} cnt>=MIN' FST=6 LST=20 THR=0.75 MIN=8 file
s_20477    162    1    1.000    6.0    0.20987654    0.79423868    0.81481481 etc . . .

or, shamelessly stealing Don Cragun's ideas, this should do as well:
Code:
$ awk '{cnt=MIN; for (i=FST; i<=LST && cnt; i++) cnt-=($i>=THR)} !cnt' FST=6 LST=20 THR=0.75 MIN=8 file
s_20477    162    1    1.000    6.0    0.20987654    0.79423868    0.81481481    0.78395062 etc . . .

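If you want exactly MIN columns to pass the threshold test, remove the && cnt in the for (...). Without it the loop scans every column, so cnt drops below zero when more than MIN columns pass, and !cnt is true only when exactly MIN did.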
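A small sketch of the difference, using made-up data (two label columns followed by four value columns) and >= so that a value equal to THR counts, in line with the "at least 0.75" wording of the question:

```shell
# Hypothetical sample: r1 has 2 passing columns, r2 has 3, r3 has 1.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
r1 x 0.9 0.8 0.1 0.2
r2 x 0.9 0.8 0.9 0.2
r3 x 0.1 0.2 0.3 0.9
EOF

# At least MIN passing columns: "&& cnt" stops the loop once cnt reaches 0.
at_least=$(awk '{cnt=MIN; for (i=FST; i<=LST && cnt; i++) cnt-=($i>=THR)} !cnt' \
    FST=3 LST=6 THR=0.75 MIN=2 "$tmp")

# Exactly MIN passing columns: every column is scanned, so cnt goes
# negative when more than MIN columns pass and !cnt is then false.
exactly=$(awk '{cnt=MIN; for (i=FST; i<=LST; i++) cnt-=($i>=THR)} !cnt' \
    FST=3 LST=6 THR=0.75 MIN=2 "$tmp")

rm -f "$tmp"
printf 'at least 2:\n%s\n' "$at_least"    # r1 and r2
printf 'exactly 2:\n%s\n' "$exactly"      # r1 only
```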

Last edited by RudiC; 03-14-2013 at 06:47 PM..