Mean score value by ID over a defined genomic region


 
# 1  
Old 09-14-2015

Hi,

I would like to know how I can get the mean score value per ID over a defined genomic region. Here is an example:

file1
Code:
12 100 103 id1
12 110 112 id1
12 200 203 id2

file2
Code:
12 100 101 1
12 101 102 0.8
12 102 103 0.7
12 110 111 2.5
12 111 112 2.8
12 200 201 10.1
12 201 202 12.0
12 202 203 11.7

Desired output file
Code:
id1 1.56
id2 11.3


Thanks in advance
Moderator's Comments:
Mod Comment Please use CODE tags (not ICODE tags) for multiline displays.

Last edited by Don Cragun; 09-14-2015 at 10:13 PM.. Reason: Fix CODE tags.
# 2  
Old 09-14-2015
Please explain how file1 and file2 are connected. Could there be lines in file2 that must not be counted?
How many decimal places do you need in the output?

Last edited by RudiC; 09-14-2015 at 06:23 AM..
# 3  
Old 09-14-2015
If a genomic region in file2 overlaps any of the genomic regions in file1, its score should be included in the average for that region's ID.

file1 fields explanation:
chromosome startPosition endPosition ID

file2 fields explanation:
chromosome startPosition endPosition score


One decimal place is enough. Thank you.
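
As a sanity check on the rule above, the expected means for the sample files in post #1 can be reproduced by hand. This is only a small arithmetic sketch; the function name `sample_means` is made up for illustration, and the grouping of scores assumes every file2 interval shown in post #1 falls inside exactly one file1 region.

```shell
# Hypothetical helper: recompute the sample means from post #1 by hand.
sample_means() {
    # id1: five file2 intervals fall inside its two regions (100-103 and 110-112)
    awk 'BEGIN { printf "id1 %.2f\n", (1 + 0.8 + 0.7 + 2.5 + 2.8) / 5 }'
    # id2: three file2 intervals fall inside its region (200-203)
    awk 'BEGIN { printf "id2 %.2f\n", (10.1 + 12.0 + 11.7) / 3 }'
}

sample_means
# prints:
# id1 1.56
# id2 11.27
```

Rounded to one decimal place, these become 1.6 and 11.3.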
# 4  
Old 09-14-2015
Try
Code:
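awk '
# First file (file1): remember start, end and ID of each region
NR == FNR {
    P1[NR] = $2
    P2[NR] = $3
    ID[NR] = $4
    next
}
# Second file (file2): add the score to a region that contains the interval
{
    for (i in ID)
        if ($2 >= P1[i] && $3 <= P2[i]) {
            SUM[ID[i]] += $4
            CNT[ID[i]]++
            break
        }
}
# Print the mean score per ID with one decimal place (order is unspecified)
END {
    for (s in SUM) printf "%s %.1f\n", s, SUM[s] / CNT[s]
}
' file1 file2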
id1 1.6
id2 11.3
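
The script above compares only the start and end positions. A possible variant, offered as a sketch rather than a tested solution for real data, also requires the chromosome column to match, uses a true overlap test as described in post #3, and prints the IDs in the order they first appear in file1. The function name `mean_by_id` and the temporary file names are assumptions made for this example. It also assumes half-open intervals (two intervals that merely touch do not overlap) and, like the original, counts each file2 line for at most one region.

```shell
# Hypothetical helper: mean_by_id REGIONS_FILE SCORES_FILE
mean_by_id() {
    awk '
    # Regions file: chromosome, start, end, ID
    NR == FNR {
        n++
        chr[n] = $1; start[n] = $2; stop[n] = $3; id[n] = $4
        # Remember each ID once, in order of first appearance
        if (!($4 in seen)) { seen[$4] = 1; order[++m] = $4 }
        next
    }
    # Scores file: chromosome, start, end, score
    {
        for (i = 1; i <= n; i++)
            # Same chromosome and overlapping half-open intervals
            if ($1 == chr[i] && $2 < stop[i] && $3 > start[i]) {
                sum[id[i]] += $4
                cnt[id[i]]++
                break    # count each score line for one region only
            }
    }
    END {
        for (j = 1; j <= m; j++)
            if (cnt[order[j]] > 0)
                printf "%s %.1f\n", order[j], sum[order[j]] / cnt[order[j]]
    }
    ' "$1" "$2"
}

# Usage with the sample data from post #1, written to a temporary directory
tmp=$(mktemp -d)
printf '%s\n' '12 100 103 id1' '12 110 112 id1' '12 200 203 id2' > "$tmp/file1"
printf '%s\n' '12 100 101 1' '12 101 102 0.8' '12 102 103 0.7' \
    '12 110 111 2.5' '12 111 112 2.8' \
    '12 200 201 10.1' '12 201 202 12.0' '12 202 203 11.7' > "$tmp/file2"
mean_by_id "$tmp/file1" "$tmp/file2"
# prints:
# id1 1.6
# id2 11.3
rm -rf "$tmp"
```

IDs with no overlapping score line are skipped rather than printed with a division by zero.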

# 5  
Old 09-14-2015
Thank you, but I only get the mean scores in the output, not the IDs.
# 6  
Old 09-14-2015
What do you get when you run it on the sample files from post #1?
# 7  
Old 09-15-2015
Then the output is:

1.6
id2 11.3
 