Average columns based on header name Post: 302969068

Post 302969068 by jacobs.smith on Thursday 17th of March 2016 03:28:46 PM

03-17-2016

Quote:

Originally Posted by Don Cragun

I do not understand what output you are trying to produce.

For each input gene line do you want:

one output line with the gene number and 10 averages from all 13 samples,
thirteen output lines with the gene number and the 10 averages from one sample on each line, or
one output line with the gene number and 130 averages where each set of 10 averages comes from one sample?

Can you show us the exact output you're hoping to produce from the data provided in your sample input for genes 1 and 2?

Was the data for gene 3 in your sample truncated, or will some inputs have missing fields that should be treated as zero values when computing the averages?

Hi Don,

Thank you for your response. These are good questions, and I am glad that you replied.

Thank you so much.

Here are the answers to your questions.

I want "one output line with the gene number and 130 averages where each set of 10 averages comes from one sample"

Since the data is very large, I will use the small example below, assuming I have only one sample.

Code:

cat input_example1
Gene 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14% 15% 16% 17% 18% 19% 20%....30%....40%....50%....60%....70%....80%....90%..100%
Gene 1 2 3 4 5 4 3 2 1 1 22.5 2 3.5 3 3 3 4 5 6 7...1..2..3..5..9..11..134.33...0.1

Code:

cat output_example1
Gene Average_10% Average_20% ..........Average_100%
Gene 2.6 5.9..............x

Here, 2.6 is the average of the ten values from the 1 in the 2nd column through the 1 in the 11th column, and 5.9 is the average of the ten values from 22.5 (12th column) through 7 (21st column). Please note that if the header of the 11th column were any value greater than 10.0%, such as 10.1% or 10.8%, then the first average would cover only the columns up to the 10th.
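A minimal sketch of this averaging rule for a single sample, written as a shell function around awk. It rests on one assumption that is an interpretation of the description rather than something stated outright: a column whose header is p% is counted in the bin ceil(p/10), so 10% or 10.0% closes the first bin and 10.1% already belongs to the 20% bin. The function name `bin_averages` and the `NA` placeholder for an empty bin are hypothetical.

```shell
# Sketch: average one sample into 10%-wide bins chosen from the header row.
# Assumption: a column labelled p% belongs to bin ceil(p/10).
# bin_averages and the NA placeholder are illustrative names, not from the thread.
bin_averages() {
    awk '
    NR == 1 {                           # header row: map each column to a bin
        for (i = 2; i <= NF; i++) {
            pct = $i
            sub(/%$/, "", pct)          # strip the trailing percent sign
            pct += 0                    # force a numeric value
            b = int(pct / 10)
            if (b * 10 < pct) b++       # round up to the enclosing bin
            bin[i] = b
            if (b > nbins) nbins = b
        }
        hdr = $1
        for (b = 1; b <= nbins; b++) hdr = hdr " Average_" (b * 10) "%"
        print hdr
        next
    }
    {                                   # data row: sum and count per bin
        for (b = 1; b <= nbins; b++) { sum[b] = 0; cnt[b] = 0 }
        for (i = 2; i <= NF; i++) { sum[bin[i]] += $i; cnt[bin[i]]++ }
        out = $1
        for (b = 1; b <= nbins; b++)
            out = out " " (cnt[b] ? sum[b] / cnt[b] : "NA")
        print out
    }' "$@"
}
```

Run on the first 20 columns of the example above (the elided columns left out), this should reproduce the 2.6 and 5.9 figures. It reads a file given as an argument, or standard input when no file is given.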

The main input file, from the link I gave in my earlier post, has the 1% to 100% values for 13 samples and only two rows. The output file will have the Gene column plus 130 average values. Does that make sense?

There will be no gene 3 in an input file. Each file will always have 2 rows (the first line is the header and the second line holds the values to average), with the 13 samples divided into varying percentages. There will be no missing fields: every column header has a value associated with it in the input file.
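A rough sketch of how several samples in one two-row file could be handled in a single awk pass. Two things here are assumptions and not confirmed by the file layout: that each sample's header percentages are strictly increasing, so a header that does not increase marks the start of the next sample, and that a column labelled p% falls in bin ceil(p/10). The function name `multi_sample_averages` and the `NA` placeholder for an empty bin are hypothetical.

```shell
# Sketch: 10 averages per sample, all samples side by side in one header row.
# Assumptions: percentages rise strictly within a sample and start above 0%,
# so a header that fails to rise begins the next sample.
multi_sample_averages() {
    awk '
    NR == 1 {                             # header row: assign a slot per column
        s = 0; prev = 101
        for (i = 2; i <= NF; i++) {
            pct = $i
            sub(/%$/, "", pct)            # strip the trailing percent sign
            pct += 0                      # force a numeric value
            if (pct <= prev) s++          # percentage fell back: next sample
            prev = pct
            b = int(pct / 10)
            if (b * 10 < pct) b++         # bin ceil(pct/10)
            slot[i] = (s - 1) * 10 + b    # bin index across all samples
        }
        nslots = s * 10
        next
    }
    {                                     # data row: sum and count per slot
        for (k = 1; k <= nslots; k++) { sum[k] = 0; cnt[k] = 0 }
        for (i = 2; i <= NF; i++) { sum[slot[i]] += $i; cnt[slot[i]]++ }
        out = $1
        for (k = 1; k <= nslots; k++)
            out = out " " (cnt[k] ? sum[k] / cnt[k] : "NA")
        print out
    }' "$@"
}
```

With 13 samples this prints the gene label followed by 13 x 10 = 130 fields, which matches the output size described above. It does not print a header line.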

Also, I have a batch of about 4000 files. I will give the script a folder and a file extension, and it should read each file and apply the averaging conditions, distinguishing an exact 10.0% header (which is included in the average) from any header greater than 10.0%, such as 10.1%, 10.2%, or 10.3% (in which case this column and the column before it are not considered).
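The batch part could be sketched as a small wrapper that runs some per-file command on every file with a given extension. Everything named here is hypothetical: the function `process_folder`, the `.avg` suffix for the result files, and the choice to write each result next to its input.

```shell
# Sketch: run a per-file command on every file in a folder that has the
# given extension. process_folder and the .avg suffix are illustrative.
process_folder() {
    dir=$1; ext=$2; shift 2              # remaining arguments: command to run
    for f in "$dir"/*."$ext"; do
        [ -f "$f" ] || continue          # skip when the glob matches nothing
        "$@" "$f" > "${f%."$ext"}.avg"   # e.g. sample1.txt -> sample1.avg
    done
}
```

A call would look like `process_folder /path/to/folder txt my_averager`, where `my_averager` stands for whichever per-file averaging command is settled on. The `for` loop expands the glob inside the shell, so a few thousand files should not run into argument-length limits.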

Please ask as many questions as you need, and I will be glad to answer.

As for what I have tried so far: I have been reading the headers, printing each sample's set of columns into a separate file, doing the computation, putting the result back, and then moving on to the next sample. This seems to be very time-consuming.

Your time and understanding are highly appreciated.

Thanks in advance

jacobs.smith
