Processing result file based on a minimal value


 
# 1  
Old 10-12-2011

After the great awk solution to my last problem (that saved me days of work) I thought I would try again.

I now have a result file with two identifier columns followed by one data column per sample, tab-delimited (the number of samples varies between experiments, so a real file may have more columns than shown here):
infile:
Code:
ID	Name	 172A_005.txt	172B_005.txt	172C_005.txt	172D_005.txt
0.1.4 	 Bacteroidetes 	0.538113889725821	0.61304400533398	0.818303617437906	0.5287288235991
0.1.4.2 	 Bacteroidia 	0.536908707442001	0.611710510364893	0.817891373801917	0.527188721715437
0.1.4.2.1.9 	 RF16 	0.00150647785477553	0.00521275306097709	0.00628671544883026	0.00154010188366307
0.1.4.2.1.11 	 S24-7 	0.000753238927387767	0.0018184022305734	0.002258064516129	0.00201397938632863
0.1.4.4 	 Cytophagia 	0.000602591141910214	0	0	0
0.1.12 	 Cyanobacteria 	0.00361554685146128	0.0147896714753303	0.00927548180974956	0.0129131619476365
0.1.12.5 	 SubsectionIII 	0.00361554685146128	0.0129712692447569	0.00793568999278574	0.00225091813766141
0.1.12.5.1 	 Crinalium 	0.00361554685146128	0.0129712692447569	0.00783262908378852	0.00213244876199502

I want to keep the header line and delete every line that does not have at least one value at or above a given threshold, in this case 0.005. The result would look like this:

outfile:
Code:
ID	Name	 172A_005.txt	172B_005.txt	172C_005.txt	172D_005.txt
0.1.4 	 Bacteroidetes 	0.538113889725821	0.61304400533398	0.818303617437906	0.5287288235991
0.1.4.2 	 Bacteroidia 	0.536908707442001	0.611710510364893	0.817891373801917	0.527188721715437
0.1.4.2.1.9 	 RF16 	0.00150647785477553	0.00521275306097709	0.00628671544883026	0.00154010188366307
0.1.12 	 Cyanobacteria 	0.00361554685146128	0.0147896714753303	0.00927548180974956	0.0129131619476365
0.1.12.5 	 SubsectionIII 	0.00361554685146128	0.0129712692447569	0.00793568999278574	0.00225091813766141
0.1.12.5.1 	 Crinalium 	0.00361554685146128	0.0129712692447569	0.00783262908378852	0.00213244876199502

Any help appreciated!
# 2  
Old 10-12-2011
Code:
$ awk -v T=0.005 '{ I=0; for (N=3; (I==0)&&(N<=NF); N++) if($N >= T) I=1; } I || (NR==1)' < data

ID      Name     172A_005.txt   172B_005.txt    172C_005.txt    172D_005.txt
0.1.4    Bacteroidetes  0.538113889725821       0.61304400533398        0.818303617437906       0.5287288235991
0.1.4.2          Bacteroidia    0.536908707442001       0.611710510364893      0.817891373801917        0.527188721715437
0.1.4.2.1.9      RF16   0.00150647785477553     0.00521275306097709     0.00628671544883026     0.00154010188366307
0.1.12   Cyanobacteria  0.00361554685146128     0.0147896714753303      0.00927548180974956     0.0129131619476365
0.1.12.5         SubsectionIII  0.00361554685146128     0.0129712692447569     0.00793568999278574      0.00225091813766141
0.1.12.5.1       Crinalium      0.00361554685146128     0.0129712692447569     0.00783262908378852      0.00213244876199502

$


Last edited by Corona688; 10-12-2011 at 07:58 PM.. Reason: typoes
This User Gave Thanks to Corona688 For This Post:
# 3  
Old 10-12-2011
Code:
awk -v T=0.005 '{s=0;for (i=3;i<=NF;i++) if ($i>=T) s++} s||(NR==1)' infile
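
For anyone who wants to see what the one-liners above are doing, here is a spelled-out sketch of the same filter. It assumes the file really is tab-delimited (hence `-F'\t'`); the variable names `sample` and `filtered` and the shortened values are made up for illustration.

```shell
# Tiny tab-separated sample in the thread's layout (values shortened).
sample=$(printf '%s\t%s\t%s\t%s\n' \
    ID           Name           S1      S2 \
    0.1.4        Bacteroidetes  0.538   0.613 \
    0.1.4.4      Cytophagia     0.0006  0 \
    0.1.4.2.1.9  RF16           0.0015  0.0052)

# Keep the header, plus any row where at least one data column (3..NF) is >= T.
filtered=$(printf '%s\n' "$sample" | awk -F'\t' -v T=0.005 '
    NR == 1 { print; next }            # header passes through untouched
    {
        for (i = 3; i <= NF; i++)
            if ($i + 0 >= T + 0) {     # +0 forces a numeric comparison
                print
                next                   # one hit is enough, skip the other columns
            }
    }
')
printf '%s\n' "$filtered"
```

With this sample the Cytophagia row is dropped because neither of its values reaches 0.005, while RF16 is kept on the strength of its second column.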

This User Gave Thanks to rdcwayx For This Post:
# 4  
Old 10-13-2011
Some further processing?

When I used a real world result file I came up with a couple of other problems. The file now has the right data but the classification produces extraneous data in the form of duplicated "unclassified" entries, and the sort order is out of whack.

Code:
ID	Name 	MA1_01.txt	MA_01.txt	MB_01.txt	MC_01.txt
0.1.44.6.26.2.17	uncultured	0.008904934	0.010950786	0.02002887	0.009102323	0.001050788
0.1.44.6.26.2.17.1	unclassified	0.008904934	0.010950786	0.02002887	0.009102323	0.001050788
0.1.44.6.26.2.17.1.1	unclassified	0.008904934	0.010950786	0.02002887	0.009102323	0.001050788
0.1.44.6.26.2.17.1.1.1	unclassified	0.008904934	0.010950786	0.02002887	0.009102323	0.001050788
0.1.44.6.26.2.8	Luteimonas	0.048856799	0.046508632	0.045110069	0.050219711	0.006654991
0.1.44.6.26.2.8.1	unclassified	0.048856799	0.046508632	0.045110069	0.050219711	0.006654991
0.1.44.6.26.2.8.1.1	unclassified	0.048856799	0.046508632	0.045110069	0.050219711	0.006654991
0.1.44.6.26.2.8.1.1.1	unclassified	0.048856799	0.046508632	0.045110069	0.050219711	0.006654991
0.1.44.6.8	Chromatiales	0.005535499	0.007729967	0.001082642	0.005963591	0.001050788
0.1.44.6.8.1	Chromatiaceae	0.005294826	0.007214635	0	0.005649718	0.001050788
0.1.44.6.8.1.8	Rheinheimera	0.004572804	0.006441639	0	0.005021971	0
0.1.44.6.8.1.8.1	unclassified	0.004572804	0.006441639	0	0.005021971	0
0.1.44.6.8.1.8.1.1	unclassified	0.004572804	0.006441639	0	0.005021971	0
0.1.44.6.8.1.8.1.1.1	unclassified	0.004572804	0.006441639	0	0.005021971	0
0.1.44.6.9	Enterobacteriales	0.007220217	0.012625612	0.018585348	0	0.122942207
0.1.44.6.9.1	Enterobacteriaceae	0.007220217	0.012625612	0.018585348	0	0.122942207
0.1.44.6.9.1.10	Enteric_Bacteria_cluster	0.006738869	0.011852615	0.017683147	0	0.116287215
0.1.44.6.9.1.10.1	Brenneria	0.004091456	0.005024478	0.008841573	0	0.070753065
0.1.44.6.9.1.10.1.1	unclassified	0.004091456	0.005024478	0.008841573	0	0.070753065
0.1.44.6.9.1.10.1.1.1	unclassified	0.004091456	0.005024478	0.008841573	0	0.070753065

I'd like to remove any lines that contain "unclassified" and sort on the first column according to the hierarchy of the numbers, so that, for example, 0.1.44.6.8 and the entries below it come before 0.1.44.6.26.
# 5  
Old 10-13-2011
YMMV:
Code:
sed '/unclassified/d'  foz.txt | sort -nr -k1.1

# 6  
Old 10-13-2011
Almost

That clued me in and I used
Code:
sed '/unclassified/d'  result_B.txt > 1.txt

to get rid of the lines with unclassified.
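
One caveat with the sed pattern is that it matches "unclassified" anywhere on the line. A rough alternative, assuming tab-delimited input with the name in column 2, is to test only that column in awk; the variable names `rows` and `kept` are just for this illustration.

```shell
rows=$(printf '%s\t%s\t%s\n' \
    0.1.44.6.8.1.8      Rheinheimera  0.004572804 \
    0.1.44.6.8.1.8.1    unclassified  0.004572804 \
    0.1.44.6.8.1.8.1.1  unclassified  0.004572804)

# Drop a row only when column 2 (Name) is exactly "unclassified",
# allowing for stray spaces around the name.
kept=$(printf '%s\n' "$rows" | awk -F'\t' '$2 !~ /^ *unclassified *$/')
printf '%s\n' "$kept"
```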

The sort is still a problem. I can't seem to get the hierarchical order correct.

How would 0.1.2.1 be sorted before 0.1.11.1?
# 7  
Old 10-13-2011
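A numeric sort won't work when there are several dots in the key: sort -n only reads the leading "0.1" as a number and falls back to comparing the rest as text.

If you turned 1.2.3.4 into 001.002.003.004, a plain text sort would put it in the right order, and then you could turn it back.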
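
A quick way to see both halves of that, using the two IDs from the question (a throwaway sketch; `ids`, `plain` and `padded` are illustrative names, and the three-digit width assumes no component exceeds 999):

```shell
ids=$(printf '%s\n' 0.1.11.1 0.1.2.1)

# sort -n reads only the leading "0.1" as a number, so both keys tie and
# the tie is broken bytewise: "0.1.1..." lands before "0.1.2...".
plain=$(printf '%s\n' "$ids" | LC_ALL=C sort -n)

# Pad each dot-separated component to three digits, then sort as text.
padded=$(printf '%s\n' "$ids" |
    awk -F. '{
        out = ""
        for (i = 1; i <= NF; i++)
            out = out (i > 1 ? "." : "") sprintf("%03d", $i)
        print out
    }' | LC_ALL=C sort)

printf 'sort -n:\n%s\npadded:\n%s\n' "$plain" "$padded"
```

The numeric sort leaves 0.1.11.1 first, while the padded keys come out with 000.001.002.001 ahead of 000.001.011.001.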

Working on it.

---------- Post updated at 12:32 PM ---------- Previous update was at 12:23 PM ----------

Code:
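$ cat gsort.awk

# Zero-pad every component of the dotted ID in column 1 (1.2.3 -> 001.002.003)
# so that a plain text sort orders it hierarchically. Assumes components < 1000.
BEGIN { FS=OFS="\t" }
{
        split($1, A, ".");
        $1="";  P="";
        for(N=1; length(A[N]); N++) { $1=$1 P sprintf("%03d", A[N]); P="."; }
} NR>1  # print every line except the header

$ cat gunsort.awk

# Undo the padding: 001.002.003 -> 1.2.3
BEGIN { FS=OFS="\t" }
{       split($1, A, ".");
        $1="";  P="";
        for(N=1; length(A[N]); N++) { $1=$1 P (A[N]+0); P="."; }
} 1

# Put header line in the outfile, since gsort.awk drops it to keep it out of the sort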
$ head -n 1 < infile > outfile

# Translate 1.2.3 into 001.002.003 with gsort.awk, sort it, 
# convert 001.002.003 back to 1.2.3 with gunsort, and append it to outfile
$ awk -f gsort.awk < infile | sort | awk -f gunsort.awk >> outfile
$ cat outfile

ID      Name    MA1_01.txt      MA_01.txt       MB_01.txt       MC_01.txt
0.1.44.6.8      Chromatiales    0.005535499     0.007729967     0.001082642    0.005963591      0.001050788
0.1.44.6.8.1    Chromatiaceae   0.005294826     0.007214635     0       0.005649718     0.001050788
0.1.44.6.8.1.8  Rheinheimera    0.004572804     0.006441639     0       0.005021971     0
0.1.44.6.8.1.8.1        unclassified    0.004572804     0.006441639     0      0.005021971      0
0.1.44.6.8.1.8.1.1      unclassified    0.004572804     0.006441639     0      0.005021971      0
0.1.44.6.8.1.8.1.1.1    unclassified    0.004572804     0.006441639     0      0.005021971      0
0.1.44.6.9      Enterobacteriales       0.007220217     0.012625612     0.018585348     0       0.122942207
0.1.44.6.9.1    Enterobacteriaceae      0.007220217     0.012625612     0.018585348     0       0.122942207
0.1.44.6.9.1.10 Enteric_Bacteria_cluster        0.006738869     0.011852615    0.017683147      0       0.116287215
0.1.44.6.9.1.10.1       Brenneria       0.004091456     0.005024478     0.008841573     0       0.070753065
0.1.44.6.9.1.10.1.1     unclassified    0.004091456     0.005024478     0.008841573     0       0.070753065
0.1.44.6.9.1.10.1.1.1   unclassified    0.004091456     0.005024478     0.008841573     0       0.070753065
0.1.44.6.26.2.8 Luteimonas      0.048856799     0.046508632     0.045110069    0.050219711      0.006654991
0.1.44.6.26.2.8.1       unclassified    0.048856799     0.046508632     0.045110069     0.050219711     0.006654991
0.1.44.6.26.2.8.1.1     unclassified    0.048856799     0.046508632     0.045110069     0.050219711     0.006654991
0.1.44.6.26.2.8.1.1.1   unclassified    0.048856799     0.046508632     0.045110069     0.050219711     0.006654991
0.1.44.6.26.2.17        uncultured      0.008904934     0.010950786     0.02002887      0.009102323     0.001050788
0.1.44.6.26.2.17.1      unclassified    0.008904934     0.010950786     0.02002887      0.009102323     0.001050788
0.1.44.6.26.2.17.1.1    unclassified    0.008904934     0.010950786     0.02002887      0.009102323     0.001050788
0.1.44.6.26.2.17.1.1.1  unclassified    0.008904934     0.010950786     0.02002887      0.009102323     0.001050788
$
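
If GNU sort is available, its version-sort key type (`V`) may do the same job without the pad/unpad round trip. This is only a sketch under that assumption (`-V` is a GNU extension, not POSIX), and it also drops the "unclassified" rows in the same pass; `data`, `tab` and `sorted` are illustrative names and the sample is cut down to one value column.

```shell
# Shortened sample in the thread's layout: ID, Name, then one value per sample.
data=$(printf '%s\t%s\t%s\n' \
    ID                  Name           MA1_01.txt \
    0.1.44.6.26.2.17    uncultured     0.008904934 \
    0.1.44.6.26.2.17.1  unclassified   0.008904934 \
    0.1.44.6.26.2.8     Luteimonas     0.048856799 \
    0.1.44.6.8          Chromatiales   0.005535499 \
    0.1.44.6.8.1        Chromatiaceae  0.005294826)

tab=$(printf '\t')
sorted=$(printf '%s\n' "$data" | {
    IFS= read -r header               # hold the header aside so it is not sorted
    printf '%s\n' "$header"
    # Remove "unclassified" rows, then version-sort on column 1 only.
    awk -F'\t' '$2 !~ /unclassified/' | sort -t "$tab" -k1,1V
})
printf '%s\n' "$sorted"
```

Restricting the key to column 1 with `-t` and `-k1,1V` matters here, because the data columns contain dots as well and would otherwise take part in the comparison.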

This User Gave Thanks to Corona688 For This Post: