Creating a new percentage summary file


 
# 1  
Old 06-13-2012

Hello Forumites.

You guys really helped me out in the past with manipulating some files with awk commands. Now, the output from the analysis program has changed and I would like to rework the data.

The output now looks something like the file below: the top row contains 5 initial columns, followed by one column per sample analysed (however many there are). The file is tab delimited. I would like to convert the sequence counts to fractions, using the Bacteria count in each sample as the denominator, keep only the rows that reach a certain threshold (say 0.005, i.e. half a percent) in at least one sample, and write the result to a new file.

Code:
taxlevel	 rankID	 taxon	 daughterlevels	 total	D1	D13	D17	D19	
0	0	Root	1	167944	3323	4018	4704	3634	
1	0.1	Bacteria	25	167944	3323	4018	4704	3634

5	0.1.7.4.1.18	Prevotellaceae	3	38447	923	1198	1727	1267
6	0.1.7.4.1.18.2	Prevotella	1	24834	658	915	1235	734
7	0.1.7.4.1.18.2.1	unclassified	0	24834	658	915	1235	734
6	0.1.7.4.1.18.3	Xylanibacter	1	756	3	2	41	3
7	0.1.7.4.1.18.3.1	unclassified	0	756	3	2	3	0
6	0.1.7.4.1.18.5	uncultured	1	12857	262	281	451	533	
7	0.1.7.4.1.18.5.1	unclassified	0	12857	262	281	451	533	
5	0.1.7.4.1.19	RF16	1	2196	14	39	77	58

Would become something like (calc errors possible, I did it by hand):

Code:
taxlevel	 rankID	 taxon	 daughterlevels	 total	D1	D13	D17	D19	
0	0	Root	1	167944	3323	4018	4704	3634	
1	0.1	Bacteria	25	167944	3323	4018	4704	3634

5	0.1.7.4.1.18	Prevotellaceae	3	0.2289	0.2778	0.2982	0.3671	0.3487
6	0.1.7.4.1.18.2	Prevotella	1	0.1479	0.1980	0.2277	0.2625	0.2020
7	0.1.7.4.1.18.2.1	unclassified	0	0.1479	0.1980	0.2277	0.2625	0.2020
6	0.1.7.4.1.18.3	Xylanibacter	1	0.0045	0.0009	0.0005	0.0087	0.0008
6	0.1.7.4.1.18.5	uncultured	1	0.0766	0.0788	0.0699	0.0959	0.1467
7	0.1.7.4.1.18.5.1	unclassified	0	0.0766	0.0788	0.0699	0.0959	0.1467
5	0.1.7.4.1.19	RF16	1	0.0131	0.0042	0.0097	0.0164	0.0160

Xylanibacter would stay in the table, as sample D17 is above the threshold, but 0.1.7.4.1.18.3.1 unclassified would not.
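
The keep/drop decision for those two rows can be checked in isolation. This is only a sketch: the function name `row_max_check` is made up for illustration, the denominators are the Bacteria counts for D1, D13, D17 and D19 typed in by hand, and the input is reduced to the taxon name plus the four sample counts.

```shell
# Largest per-sample fraction of each row, compared against the 0.005 threshold.
# Assumption: denominators are hard-coded from the Bacteria row of the sample file.
row_max_check() {
    awk -v thr=0.005 '
    BEGIN { split("3323 4018 4704 3634", den, " ") }
    {
        max = 0
        for (i = 2; i <= NF; i++) {
            frac = $i / den[i - 1]          # count / Bacteria count, same sample
            if (frac > max) max = frac
        }
        printf "%s\tmax=%.4f\t%s\n", $1, max, (max >= thr ? "keep" : "drop")
    }'
}

printf '%s\n' \
    'Xylanibacter 3 2 41 3' \
    'unclassified 3 2 3 0' | row_max_check
# Xylanibacter   max=0.0087   keep
# unclassified   max=0.0009   drop
```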

Any ideas greatly appreciated!
# 2  
Old 06-13-2012
Confused...

Are you trying to figure percents across the row? Or down the rows?
If across a row, does this give you an approach?

Code:
$ echo 12 13 14 | awk '{ss=$1+$2+$3; s1=$1/ss; s2=$2/ss; s3=$3/ss; print s1,s2,s3}'
0.307692 0.333333 0.358974

Hard to understand what you said, especially since the columns do not line up on the display.
# 3  
Old 06-13-2012
Yes across the rows

The display is indeed poor, is this better?

So for Prevotellaceae, D1 the result would be =923/3323
Xylanibacter sample D17 would be =41/4704
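
As a quick sanity check of those two values (a throwaway sketch, with the hypothetical function name `check_fractions`; the numbers are copied from the table):

```shell
# Each count is divided by the Bacteria count of the same sample column.
check_fractions() {
    awk 'BEGIN {
        printf "Prevotellaceae D1: %.4f\n", 923 / 3323   # 0.2778
        printf "Xylanibacter D17: %.4f\n", 41 / 4704     # 0.0087
    }'
}

check_fractions
```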

Code:
taxlevel  rankID            taxon           daughterlevels  total   D1    D13   D17   D19
0         0                 Root            1               167944  3323  4018  4704  3634
1         0.1               Bacteria        25              167944  3323  4018  4704  3634
5         0.1.7.4.1.18      Prevotellaceae  3               38447   923   1198  1727  1267
6         0.1.7.4.1.18.2    Prevotella      1               24834   658   915   1235  734
7         0.1.7.4.1.18.2.1  unclassified    0               24834   658   915   1235  734
6         0.1.7.4.1.18.3    Xylanibacter    1               756     3     2     41    3
7         0.1.7.4.1.18.3.1  unclassified    0               756     3     2     3     0
6         0.1.7.4.1.18.5    uncultured      1               12857   262   281   451   533
7         0.1.7.4.1.18.5.1  unclassified    0               12857   262   281   451   533
5         0.1.7.4.1.19      RF16            1               2196    14    39    77    58

# 4  
Old 06-13-2012
What about something like this? It doesn't do the whole job, but it shows the approach.

Notes:
My 2nd command is only there to show the stored value.
I guessed that the valid data lines have 0.1. in them.

Code:
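# Store the D1 count from the Bacteria row (column 6) as the denominator
$ d1=$(grep Bacteria sample14.txt | cut -f6)

$ echo "$d1"
3323

# Print taxon, D1 count and D1 fraction for every data line
$ grep "0\.1\." sample14.txt | awk -v d1="$d1" '{p1=$6/d1; print $3,$6,p1}'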
Prevotellaceae 923 0.277761
Prevotella 658 0.198014
unclassified 658 0.198014
Xylanibacter 3 0.000902799
unclassified 3 0.000902799
uncultured 262 0.0788444
unclassified 262 0.0788444
RF16 14 0.00421306

# 5  
Old 06-13-2012
Yes, the data lines have a rankID starting with 0.1 in the second column. However, I need this to work on files with different numbers of samples (more or fewer columns) and to include a filter that drops low-abundance rows, i.e. rows where no sample reaches 0.005.
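
One way to cover both requirements is to let awk read the file twice: the first pass stores the Bacteria counts for every column from `total` onwards, and the second pass divides each row by them and prints the row only if a sample column reaches the threshold. The following is a minimal sketch, not a tested solution for the real data. It assumes the header is on line 1, the taxon name is in column 3, the samples start at column 6, and the Root and Bacteria rows are passed through unchanged as in the hand-made example. The function name `pct_summary` is made up for illustration.

```shell
# pct_summary FILE [THRESHOLD]
# Prints the header, Root and Bacteria rows as they are; every other row has
# columns 5..NF replaced by count / Bacteria count and is kept only if at
# least one sample column (6..NF) is >= THRESHOLD (default 0.005).
pct_summary() {
    awk -F'\t' -v OFS='\t' -v thr="${2:-0.005}" '
    { sub(/[ \t\r]+$/, "") }          # drop trailing tabs/spaces; fields are re-split
    NF == 0 { next }                  # skip blank lines
    NR == FNR {                       # pass 1: remember the Bacteria counts
        if ($3 == "Bacteria") for (i = 5; i <= NF; i++) den[i] = $i
        next
    }
    FNR == 1 || $3 == "Root" || $3 == "Bacteria" { print; next }
    {                                 # pass 2: convert and filter the data rows
        keep = 0
        for (i = 5; i <= NF; i++) {
            frac = (den[i] > 0) ? $i / den[i] : 0
            if (i >= 6 && frac >= thr) keep = 1   # only sample columns decide
            $i = sprintf("%.4f", frac)
        }
        if (keep) print
    }' "$1" "$1"
}
```

With the file name used earlier in the thread, it could be called as `pct_summary sample14.txt 0.005 > summary.txt`. Because the loops run up to NF, the number of sample columns does not need to be known in advance.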