Help with processing coordinates in a file.


 
# 1  
Old 02-06-2019

I have a variation table (variation.txt), which is a very big file. The first column is the chromosome number and the second column is the position of the variation. I have a second file, annotation.txt, which lists 37,000 genes (1st column), their chromosome number (2nd column), and their start and end coordinates (3rd and 4th columns), followed by some details.

I have to assign the variations (based on chromosome number and position) to the genes. First, the **chromosome number** must match in both files; if it does, the coordinate of the variation must lie within the start and end positions of the gene (inclusive). I have made an attempt that prints the matching entries of the variation table. I also want to prepend the gene ID from the annotation table as the first column of the output. How do I do that?

variation.txt

Code:
SL3.0ch02     702679    C     A     -     -     -     -     -     -     -     -    
SL3.0ch01     711131    A     G     -     -     -     -     -     -     -     -
SL3.0ch00     715124    G     A     -     -     -     -     -     -     -     -
SL3.0ch00     719289    C     T     -     -     -     -     -     -     -     -
SL3.0ch00     720926    A     C     -     -     -     -     -     -     -     -
SL3.0ch00     723860    A     C     Solyc00g005060.1     CDS     NONSYNONYMOUS     W/G     52    0    novel     DELETERIOUS (*WARNING! Low confidence)
SL3.0ch00     723867    A     C     Solyc00g005060.1     CDS     SYNONYMOUS     G/G     49    1    novel     TOLERATED
SL3.0ch00     723903    T     C     Solyc00g005060.1     CDS     SYNONYMOUS     G/G     37    1    novel     TOLERATED

annotation.txt

Code:
Solyc00g005000.3.1    SL3.0ch02    702600    702900    +    Eukaryotic aspartyl protease family protein
Solyc00g005040.3.1    SL3.0ch01    715100    715200    +    Potassium channel
Solyc00g005050.3.1    SL3.0ch00    715150    715300    -    UPF0664 stress-induced protein C29B12.11c
Solyc00g005060.1.1    SL3.0ch00    723741    724013    -    LOW QUALITY:Cyclin/Brf1-like TBP-binding protein
Solyc00g005080.2.1    SL3.0ch00    723800    723900    -    LOW QUALITY:Protein Ycf2
Solyc00g005084.1.1    SL3.0ch05    809593    813633    +    UDP-Glycosyltransferase superfamily protein
Solyc00g005090.1.1    SL3.0ch07    1061632    1061916    -    LOW QUALITY:DYNAMIN-like 1B
Solyc00g005092.1.1    SL3.0ch01    1127794    1144385    +    Serine/threonine phosphatase-like protein
Solyc00g005094.1.1    SL3.0ch00    1144958    1146952    -    Glucose-6-phosphate 1-dehydrogenase 3, chloroplastic
Solyc00g005096.1.1    SL3.0ch00    1734562    1736567    +    RWP-RK domain-containing protein

awk script:

Code:
awk '
NR==FNR {
    a[$2][$3 " " $4]=$0
    next
}
($1 in a){
    for(i in a[$1])
        if(split(i,t)&&$2>=t[1]&&$2<=t[2])
            print
}' annotation.txt variation.txt

Desired output:

Code:
Solyc00g005060.1.1    SL3.0ch02    702679    C    A    -    -    -    -    -    -    -    -
Solyc00g005060.1.1    SL3.0ch00    723860    A    C    Solyc00g005060.1    CDS    NONSYNONYMOUS    W/G    52    0    novel    DELETERIOUS (*WARNING! Low confidence)
Solyc00g005080.2.1    SL3.0ch00    723860    A    C    Solyc00g005060.1    CDS    NONSYNONYMOUS    W/G    52    0    novel    DELETERIOUS (*WARNING! Low confidence)
Solyc00g005060.1.1    SL3.0ch00    723867    A    C    Solyc00g005060.1    CDS    SYNONYMOUS    G/G    49    1    novel    TOLERATED
Solyc00g005080.2.1    SL3.0ch00    723867    A    C    Solyc00g005060.1    CDS    SYNONYMOUS    G/G    49    1    novel    TOLERATED
Solyc00g005060.1.1    SL3.0ch00    723903    T    C    Solyc00g005060.1    CDS    SYNONYMOUS    G/G    37    1    novel    TOLERATED

Current output:

Code:
SL3.0ch02   702679  C   A   -   -   -   -   -   -   -   -
SL3.0ch00   723860  A   C   Solyc00g005060.1    CDS     NONSYNONYMOUS   W/G     52  0   novel   DELETERIOUS (*WARNING! Low confidence)
SL3.0ch00   723860  A   C   Solyc00g005060.1    CDS     NONSYNONYMOUS   W/G     52  0   novel   DELETERIOUS (*WARNING! Low confidence)
SL3.0ch00   723867  A   C   Solyc00g005060.1    CDS     SYNONYMOUS  G/G     49  1   novel   TOLERATED
SL3.0ch00   723867  A   C   Solyc00g005060.1    CDS     SYNONYMOUS  G/G     49  1   novel   TOLERATED
SL3.0ch00   723903  T   C   Solyc00g005060.1    CDS     SYNONYMOUS  G/G     37  1   novel   TOLERATED

# 2  
Old 02-06-2019
Please explain how this line
Code:
Solyc00g005060.1.1    SL3.0ch02    702679    C    A    -    -    -    -    -    -    -    -

made it into the desired output. There is no connection between Solyc00g005060.1.1 and SL3.0ch02 in your annotation file; SL3.0ch02 is paired with Solyc00g005000.3.1 instead. And why isn't Solyc00g005050.3.1 / SL3.0ch00 mentioned? Shouldn't that match the third variation line?


However, please try this and report back:
Code:
awk '
NR==FNR         {a[$2 FS $3 FS $4] = $1         # key: chromosome start end; value: gene ID
                 next
                }
                {for (i in a)                   # test every gene against this variation line
                   if (split (i,t) && $1==t[1] && $2>=t[2] && $2<=t[3])  print a[i], $0
                }
' annotation.txt variation.txt
Solyc00g005000.3.1 SL3.0ch02     702679    C     A     -     -     -     -     -     -     -     -
Solyc00g005060.1.1 SL3.0ch00     723860    A     C     Solyc00g005060.1     CDS     NONSYNONYMOUS     W/G     52    0    novel     DELETERIOUS (*WARNING! Low confidence)
Solyc00g005080.2.1 SL3.0ch00     723860    A     C     Solyc00g005060.1     CDS     NONSYNONYMOUS     W/G     52    0    novel     DELETERIOUS (*WARNING! Low confidence)
Solyc00g005060.1.1 SL3.0ch00     723867    A     C     Solyc00g005060.1     CDS     SYNONYMOUS     G/G     49    1    novel     TOLERATED
Solyc00g005080.2.1 SL3.0ch00     723867    A     C     Solyc00g005060.1     CDS     SYNONYMOUS     G/G     49    1    novel     TOLERATED
Solyc00g005060.1.1 SL3.0ch00     723903    T     C     Solyc00g005060.1     CDS     SYNONYMOUS     G/G     37    1    novel     TOLERATED

The script compares the chromosome as well as the position, so its results match the desired output except for the gene ID on the first line, plus, mayhap, the field separators (which weren't specified, btw). The third variation line (SL3.0ch00, 715124) gets no gene because it lies just below the start of Solyc00g005050.3.1 (715150).
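
Since the variation file is described as very big and the annotation holds about 37,000 genes, looping over every gene for every variation line may get slow. A possible alternative is to group the genes by chromosome while reading annotation.txt, so that each variation line is compared only with the genes on its own chromosome. The following is a minimal sketch under the assumption that both files are whitespace-separated as in the samples; the function name assign_genes and the array names are illustrative and not from the thread. It uses only POSIX awk features, so GNU awk's arrays of arrays are not needed.

```shell
# assign_genes ANNOTATION VARIATION
# Prefix each variation line with the ID of every gene that contains it.
# Hypothetical helper; assumes whitespace-separated fields as in the samples.
assign_genes() {
    awk '
    # First file (annotation): group genes by chromosome.
    NR == FNR {
        k = ++n[$2]             # running count of genes on this chromosome
        id[$2, k]    = $1       # gene ID
        start[$2, k] = $3 + 0   # force numeric coordinates
        stop[$2, k]  = $4 + 0
        next
    }
    # Second file (variation): test only genes on the same chromosome.
    ($1 in n) {
        pos = $2 + 0
        for (k = 1; k <= n[$1]; k++)
            if (pos >= start[$1, k] && pos <= stop[$1, k])
                print id[$1, k], $0
    }
    ' "$1" "$2"
}
```

It would be called as `assign_genes annotation.txt variation.txt`. For each variation line, matching genes are printed in the order in which they appear in the annotation file, and a variation that falls inside several overlapping genes is printed once per gene.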