Map snps into a ref gene file


 
Thread Tools Search this Thread
Top Forums UNIX for Advanced & Expert Users Map snps into a ref gene file
# 1  
Old 01-19-2017
Map snps into a ref gene file

I have the following data set about the snps ID txt file

Code:
   POS ID	
    	78599583	rs987435
    	33395779	rs345783
    	189807684	rs955894
    	33907909	rs6088791
    	75664046	rs11180435
    	218890658	rs17571465
    	127630276	rs17011450
    	90919465	rs6919430

and a gene reference file, txt file

Code:
 genename	name	chrom	strand	txstart	txend
    CDK1	NM_001786	chr10	+	62208217	62224616
    CALB2	NM_001740	chr16	+	69950116	69981843
    STK38	NM_007271	chr6	-	36569637	36623271
    YWHAE	NM_006761	chr17	-	1194583	1250306
    SYT1	NM_005639	chr12	+	77782579	78369919
    ARHGAP22	NM_001347736	chr10	-	49452323	49534316
    PRMT2	NM_001535	chr21	+	46879934	46909464
    CELSR3	NM_001407	chr3	-	48648899	48675352

I'm trying to match the genes with the SNps using snps location, so include the snps that has

POS >= txstart and POS<= txend

for example I want a data set that has the following columns

Code:
genename   SNPID   chrom   position   txstart   txend


Last edited by Don Cragun; 01-19-2017 at 09:48 PM.. Reason: Add more CODE and ICODE tags.
# 2  
Old 01-19-2017
And what output are you trying to get from the two sample input files you provided?

What happens if there is no ID in the 1st file that appears in a range specified by the 2nd file?

What happens if there is more than one ID in the 1st file that fits in a range specified by a single line in the 2nd file?

What happens if there is no range in the 2nd file for a position specified in the 1st file?

What have you tried to solve this problem on your own?
# 3  
Old 01-19-2017
I'm expecting that a gene might have more than snpID,
and there might be genes that don't have snps it will be NA
and there might be one snpID for pre one gene

Code:
awk 'FNR==1 {next} FILENAME=="pre_snpinfo_tumor.txt" {k++; POS[k]=$2; ID[k]=$2;} \  
                   FILENAME=="refFlat.txt" {i++; \
                                     if(POS[i]>=$5 && POS[i]<=$6) \
                                          print $1, ID[i], $3, POS[i], $5, $6} \
    ' pre_snpinfo_tumor.txt  refFlat.txt

but there is an error can you help please
# 4  
Old 01-19-2017
One might think that something more like:
Code:
awk '
FNR==1 {next}
FNR == NR {
        POS[++k]=$1
        ID[k]=$2
        next
}
{       for(i = 1; i <= k; i++)
                if(POS[i]>=$5 && POS[i]<=$6)
                        print $1, ID[i], $3, POS[i], $5, $6
}' pre_snpinfo_tumor.txt  refFlat.txt

would work, but since absolutely none of the positions specified in your 1st sample input file are in any of the ranges specified by your 2nd sample input file, no output is produced. I guess that is to be expected because I asked you what output you wanted your script to produce from your sample input files and you didn't give an answer to that question.

If this doesn't work for your real data, you might consider giving us some sample input that you think should produce some output and actually show us what output you are trying to produce from those inputs.

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
# 5  
Old 01-19-2017
the output is file2 which is the gene info and add to it the SNPID

**
Code:
 names seqnames**** start****** end**  GENEID 
* rs3753344**** chr1** 1142150** 1142150** ** TNFRSF18******
* rs3753344**** chr1** 1142150** 1142150 **** NA
 rs12191877**** chr6* 31252925* 31252925  HLA-B******* 
** rs881375**** chr9  123652898 123652898 *** NA


Last edited by marwah; 01-19-2017 at 11:00 PM..
# 6  
Old 01-19-2017
One last time: Please show us exactly what output you want your code to produce when given the input files your provided in post #1 in this thread. If you are unwilling to do that, I'll close the thread.
# 7  
Old 01-19-2017
NO please I have added the data I want to see as an output

the output is file2 which is the gene info and add to it the SNPID

**
Code:
 names   seqnames     start    end      GENEID 
* rs3753344    chr1    1142150    1142150   TNFRSF18
* rs3753344**** chr1** 1142150** 1142150 **** NA
 rs12191877**** chr6* 31252925* 31252925  HLA-B******* 
** rs881375**** chr9  123652898 123652898 *** NA

I don't know where the stars came from but this is the data without the stars
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Snps annotation

I have the following Snps data CHROM POS ID chr7 78599583 rs987435 chr15 33395779 rs987436 chr1 189807684 rs987437 chr20 33907909 rs987438 chr12 75664046 rs987439 and the following gene data genename name chrom strand txstart txend... (8 Replies)
Discussion started by: marwah
8 Replies

2. Shell Programming and Scripting

awk to average target and gene

I am trying to modify the awk below to include the gene name ($5) for each target and can not seem to do so. Also, I'm not sure the calculation is right (average of all targets that are the same is $4 using the values in $7)? Thank you :). awk '{if((NR>1)&&($4!=last)){printf("%s\t%f\t%s\n",... (1 Reply)
Discussion started by: cmccabe
1 Replies

3. Shell Programming and Scripting

Extract a string between 2 ref string from a file

Hi, May i ask if someone share some command for extracting a string between 2 ref string in a txt file My objective: i had a file with multiple lines and wants only to extract the string "watch?v=IbkAXOmEHpY" or "watch?v=<11 random character>", when i used "grep 'watch?=*' i got a results per... (4 Replies)
Discussion started by: jao_madn
4 Replies

4. UNIX for Dummies Questions & Answers

Breaking a fasta formatted file into multiple files containing each gene separately

Hey, I've been trying to break a massive fasta formatted file into files containing each gene separately. Could anyone help me? I've tried to use the following code but i've recieved errors every time: for i in *.rtf.out do awk '/^>/{f=++d".fasta"} {print > $i.out}' $i done (1 Reply)
Discussion started by: Ann Mc Cartney
1 Replies

5. Shell Programming and Scripting

File merging using first column as the ref

I had two files 1.txt 2.txt. I want a 3rd file(o/p) 3.txt like below based on the common elements from the first coulmns of 1.txt and 2.txt. 1.txt 11 12 13 14 15 16 17 18 19 20 21 2.txt (6 Replies)
Discussion started by: p_sai_ias
6 Replies

6. UNIX for Advanced & Expert Users

cannot find map file

Hi, all: My writed network device driver works fine when the transmitted file is under several MegaBytes, but above this size, especially dozens of or hundreds of MegaBytes, the kernel panic ocurres! I check the kern.log and find this error : 522 Nov 14 19:35:32 liklstar-server kerneNov 14... (2 Replies)
Discussion started by: liklstar
2 Replies

7. Shell Programming and Scripting

Script to search and extract the gene sub-location from gff file.

Hi, my problem is that I have two files. File no. 1 is a gff text file (say gi1) that has gene information like : ******************** gene 39389788..39395643 /gene="RPSA" /note="Derived by automated computational analysis using ... (2 Replies)
Discussion started by: reena2305
2 Replies

8. Shell Programming and Scripting

Append file from ref file AWK

FILE1 abc:xxx:abc:123:wer:AAA:12 csf:xxx:123:aeg:sar:BBB:13 asq:yer:321:wsa:qqq:CCC:14 FILE2 AAA:12:SET1:R1 AAA:12:SSS1:RR1 AAA:11:SET4:R3 BBB:13:SET2:R2 OUTPUT abc:xxx:abc:123:wer:AAA:12:SET1:R1:SSS1:RR1 csf:xxx:123:aeg:sar:BBB:13:SET2:R2::... (4 Replies)
Discussion started by: greycells
4 Replies

9. Shell Programming and Scripting

File merging using first column as the ref

I had two files 1.txt 2.txt. I want a 3rd file(o/p) 3.txt like below (using awk) 1.txt 11 a1 12 a2 13 a3 14 a4 15 a5 16 a6 17 a7 18 a8 19 a9 20 a10 2.txt 14 b1 15 b2 16 b3 (8 Replies)
Discussion started by: p_sai_ias
8 Replies

10. Shell Programming and Scripting

Reading a path (including ref to shell variable) from file

Hi! 1. I have a parameter file containing path to log files. For this example both paths are the same, one is stated directly and the second using env variables. /oracle/admin/orcl/bdump/:atlas:trc:N ${ORACLE_BASE}/admin/${ORACLE_SID}/bdump/:${ORACLE_SID}:trc:N 2. I try to parse the path... (1 Reply)
Discussion started by: lojzev
1 Replies
Login or Register to Ask a Question