Creating a new percentage summary file

06-13-2012


Hello Forumites.

You guys really helped me out in the past with manipulating some files with awk commands. Now, the output from the analysis program has changed and I would like to rework the data.

The output now looks something like the file below: the top row contains 5 initial columns, followed by one column for each sample analysed. The file is tab delimited. I would like to convert the counts to proportions of sequences, using the Bacteria count in each sample as the denominator, keep only the rows that reach a certain threshold (say 0.005, i.e. half a percent) in at least one sample, and write the result to a new file.

Code:

taxlevel	 rankID	 taxon	 daughterlevels	 total	D1	D13	D17	D19	
0	0	Root	1	167944	3323	4018	4704	3634	
1	0.1	Bacteria	25	167944	3323	4018	4704	3634

5	0.1.7.4.1.18	Prevotellaceae	3	38447	923	1198	1727	1267
6	0.1.7.4.1.18.2	Prevotella	1	24834	658	915	1235	734
7	0.1.7.4.1.18.2.1	unclassified	0	24834	658	915	1235	734
6	0.1.7.4.1.18.3	Xylanibacter	1	756	3	2	41	3
7	0.1.7.4.1.18.3.1	unclassified	0	756	3	2	3	0
6	0.1.7.4.1.18.5	uncultured	1	12857	262	281	451	533	
7	0.1.7.4.1.18.5.1	unclassified	0	12857	262	281	451	533	
5	0.1.7.4.1.19	RF16	1	2196	14	39	77	58

Would become something like (calc errors possible, I did it by hand):

Code:

taxlevel	 rankID	 taxon	 daughterlevels	 total	D1	D13	D17	D19	
0	0	Root	1	167944	3323	4018	4704	3634	
1	0.1	Bacteria	25	167944	3323	4018	4704	3634

5	0.1.7.4.1.18	Prevotellaceae	3	0.2289	0.2777	0.2982	0.3671	0.3486
6	0.1.7.4.1.18.2	Prevotella	1	0.1478	0.1980	0.2277	0.2625	0.2020
7	0.1.7.4.1.18.2.1	unclassified	0	0.1478	0.1980	0.2277	0.2625	0.2020
6	0.1.7.4.1.18.3	Xylanibacter	1	0.00450	0.0000	0.0000	0.0087	0.0000
6	0.1.7.4.1.18.5	uncultured	1	0.0765	.07884	.0699	.0959	0.1467	
7	0.1.7.4.1.18.5.1	unclassified	0	0.0765	.07884	.0699	.0959	0.1467
5	0.1.7.4.1.19	RF16	1	0.0131	0.0042	0.0097	0.0164	0.0159

Xylanibacter would stay in the table, as sample D17 is above the threshold, but 0.1.7.4.1.18.3.1 unclassified would not.
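To make the keep/drop rule concrete, here is a minimal sketch that checks those two rows by hand. The helper name `keep_row` is hypothetical; the counts and the Bacteria denominators are copied from the sample table above, and the 0.005 cut-off is the threshold mentioned in the post.

```shell
# keep_row COUNTS DENOMS
# Prints "keep" if any count divided by its denominator reaches 0.005,
# otherwise "drop". Both arguments are space-separated lists of equal length.
keep_row() {
    awk -v counts="$1" -v denoms="$2" -v min=0.005 'BEGIN {
        n = split(counts, c, " ")
        split(denoms, d, " ")
        verdict = "drop"
        for (i = 1; i <= n; i++)
            if (d[i] > 0 && c[i] / d[i] >= min) verdict = "keep"
        print verdict
    }'
}

# Denominators are the Bacteria counts for D1, D13, D17 and D19.
keep_row "3 2 41 3" "3323 4018 4704 3634"   # Xylanibacter: 41/4704 is about 0.0087
keep_row "3 2 3 0"  "3323 4018 4704 3634"   # its unclassified child: all below 0.005
```

The first call prints keep and the second prints drop, which matches the description above.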

Any ideas greatly appreciated!

fozrun


06-13-2012


Confused...

Are you trying to figure percents across the row? Or down the rows?
If across a row, does this give you an approach?

Code:

$ echo 12 13 14 | awk '{ss=$1+$2+$3; s1=$1/ss; s2=$2/ss; s3=$3/ss; print s1,s2,s3}'
0.307692 0.333333 0.358974

Hard to understand what you said, especially since the columns do not line up on the display.
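The same across-the-row idea can be written so that it does not depend on there being exactly three fields. This is only a sketch: the function name `row_fractions` is made up for illustration, and it divides by the row sum as in the one-liner above, not by the Bacteria counts.

```shell
# row_fractions: read whitespace-separated numbers on stdin and print each
# field as a fraction of its row total, for any number of fields.
row_fractions() {
    awk '{
        sum = 0
        for (i = 1; i <= NF; i++) sum += $i
        if (sum == 0) { print; next }          # nothing to divide by
        for (i = 1; i <= NF; i++)
            printf "%s%.6f", (i > 1 ? " " : ""), $i / sum
        print ""                               # end the output line
    }'
}

echo 12 13 14 | row_fractions
```

For the input 12 13 14 this prints the same three values as the fixed three-field version.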

joeyg


06-13-2012


Yes across the rows

The display is indeed poor. Is this better?

So for Prevotellaceae, sample D1, the result would be 923/3323.
For Xylanibacter, sample D17, it would be 41/4704.

Code:

taxlevel  rankID            taxon           daughterlevels  total   D1    D13   D17   D19
0         0                 Root            1               167944  3323  4018  4704  3634
1         0.1               Bacteria        25              167944  3323  4018  4704  3634
5         0.1.7.4.1.18      Prevotellaceae  3               38447   923   1198  1727  1267
6         0.1.7.4.1.18.2    Prevotella      1               24834   658   915   1235  734
7         0.1.7.4.1.18.2.1  unclassified    0               24834   658   915   1235  734
6         0.1.7.4.1.18.3    Xylanibacter    1               756     3     2     41    3
7         0.1.7.4.1.18.3.1  unclassified    0               756     3     2     3     0
6         0.1.7.4.1.18.5    uncultured      1               12857   262   281   451   533
7         0.1.7.4.1.18.5.1  unclassified    0               12857   262   281   451   533
5         0.1.7.4.1.19      RF16            1               2196    14    39    77    58


fozrun


06-13-2012


What about something like this? It does not do the whole job.

Notes:
My 2nd command is only there to show the stored value.
I guessed that the valid data lines have 0.1. in them.

Code:

$ d1=$(grep Bacteria sample14.txt | cut -f6)

$ echo "$d1"
3323

$ grep "0\.1\." sample14.txt | awk -v d1="$d1" '{p1=$6/d1; print $3,$6,p1}'
Prevotellaceae 923 0.277761
Prevotella 658 0.198014
unclassified 658 0.198014
Xylanibacter 3 0.000902799
unclassified 3 0.000902799
uncultured 262 0.0788444
unclassified 262 0.0788444
RF16 14 0.00421306

joeyg


06-13-2012


Yes, the data lines would start with 0.1 in the second column. However, I need this to work over files with different numbers of samples (more or fewer columns) and to incorporate a filter that gets rid of low-abundance rows (those where no sample reaches 0.005).

fozrun
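One possible way to cover both requirements (any number of sample columns, plus the threshold filter) is a two-pass awk over the same file. This is a sketch under assumptions rather than a tested solution for the real data: the function name `percent_summary` is hypothetical, the denominators are taken from the row whose taxon is exactly Bacteria, the total column is divided as well (as in the hand-made example), values are printed with four decimals, and the header, Root and Bacteria rows are passed through unchanged.

```shell
# percent_summary FILE [THRESHOLD]
# FILE is tab delimited: 4 descriptive columns, a total column, then one
# column per sample. THRESHOLD defaults to 0.005.
percent_summary() {
    awk -F'\t' -v OFS='\t' -v min="${2:-0.005}" '
        # Pass 1: remember the Bacteria counts (total and every sample).
        NR == FNR {
            if ($3 == "Bacteria")
                for (i = 5; i <= NF; i++) denom[i] = $i
            next
        }
        # Pass 2: header, Root and Bacteria rows are printed as they are.
        FNR == 1 || $3 == "Root" || $3 == "Bacteria" { print; next }
        {
            keep = 0
            line = $1 OFS $2 OFS $3 OFS $4
            for (i = 5; i <= NF; i++) {
                if ($i == "") continue            # trailing tab gives an empty field
                frac = (denom[i] > 0) ? $i / denom[i] : 0
                if (i > 5 && frac >= min) keep = 1   # only sample columns decide
                line = line OFS sprintf("%.4f", frac)
            }
            if (keep) print line                  # blank or low-abundance rows are dropped
        }
    ' "$1" "$1"
}
```

It could be called as percent_summary sample14.txt 0.005 > summary.txt (summary.txt is just an example name). Reading the file twice keeps the script short, since the Bacteria row does not have to come before the rows that use it.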
