Map snps into a ref gene file

01-19-2017

I have the following data set of SNP positions and IDs (a text file):

Code:

   POS ID	
    	78599583	rs987435
    	33395779	rs345783
    	189807684	rs955894
    	33907909	rs6088791
    	75664046	rs11180435
    	218890658	rs17571465
    	127630276	rs17011450
    	90919465	rs6919430

and a gene reference file (also a text file):

Code:

 genename	name	chrom	strand	txstart	txend
    CDK1	NM_001786	chr10	+	62208217	62224616
    CALB2	NM_001740	chr16	+	69950116	69981843
    STK38	NM_007271	chr6	-	36569637	36623271
    YWHAE	NM_006761	chr17	-	1194583	1250306
    SYT1	NM_005639	chr12	+	77782579	78369919
    ARHGAP22	NM_001347736	chr10	-	49452323	49534316
    PRMT2	NM_001535	chr21	+	46879934	46909464
    CELSR3	NM_001407	chr3	-	48648899	48675352

I'm trying to match the SNPs to the genes by position, i.e. to keep each SNP that satisfies

POS >= txstart and POS <= txend

for some gene. For example, I want a data set that has the following columns:

Code:

genename   SNPID   chrom   position   txstart   txend

marwah

01-19-2017

And what output are you trying to get from the two sample input files you provided?

What happens if there is no ID in the 1st file that appears in a range specified by the 2nd file?

What happens if there is more than one ID in the 1st file that fits in a range specified by a single line in the 2nd file?

What happens if there is no range in the 2nd file for a position specified in the 1st file?

What have you tried to solve this problem on your own?

Don Cragun

01-19-2017

I'm expecting that a gene might have more than one SNP ID,
there might be genes that don't have any SNPs (those should show NA),
and there might be just one SNP ID per gene.

Code:

awk 'FNR==1 {next} FILENAME=="pre_snpinfo_tumor.txt" {k++; POS[k]=$2; ID[k]=$2;} \  
                   FILENAME=="refFlat.txt" {i++; \
                                     if(POS[i]>=$5 && POS[i]<=$6) \
                                          print $1, ID[i], $3, POS[i], $5, $6} \
    ' pre_snpinfo_tumor.txt  refFlat.txt

but it gives an error. Can you help, please?

marwah

01-19-2017

One might think that something more like:

Code:

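awk '
FNR==1 {next}                   # skip the header line of each file
FNR == NR {                     # 1st file: save every position and ID
        POS[++k]=$1
        ID[k]=$2
        next
}
{       for(i = 1; i <= k; i++) # 2nd file: check every saved SNP against this range
                if(POS[i]>=$5 && POS[i]<=$6)
                        print $1, ID[i], $3, POS[i], $5, $6
}' pre_snpinfo_tumor.txt  refFlat.txt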

would work, but since absolutely none of the positions specified in your 1st sample input file are in any of the ranges specified by your 2nd sample input file, no output is produced. I guess that is to be expected because I asked you what output you wanted your script to produce from your sample input files and you didn't give an answer to that question.

If this doesn't work for your real data, you might consider giving us some sample input that you think should produce some output and actually show us what output you are trying to produce from those inputs.
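To see the script above produce a match, here is a minimal sketch run against data that does overlap. The first SNP row (78000000, rsEXAMPLE1) is made up so that it lands inside the SYT1 range. The temporary file names and the `map_snps` wrapper are also only for illustration and are not part of the original files.

```shell
#!/bin/sh
# Hypothetical sample data: the first SNP row is invented so that it falls
# inside the SYT1 range; the remaining rows are taken from post #1.
tmp=$(mktemp -d)
printf 'POS\tID\n78000000\trsEXAMPLE1\n33395779\trs345783\n' > "$tmp/snps.txt"
printf 'genename\tname\tchrom\tstrand\ttxstart\ttxend\n' > "$tmp/genes.txt"
printf 'SYT1\tNM_005639\tchr12\t+\t77782579\t78369919\n' >> "$tmp/genes.txt"
printf 'STK38\tNM_007271\tchr6\t-\t36569637\t36623271\n' >> "$tmp/genes.txt"

# map_snps SNPFILE GENEFILE (illustrative wrapper around the awk script)
map_snps() {
    awk '
    FNR == 1 { next }            # skip the header line of each file
    FNR == NR {                  # first file: remember every SNP
        POS[++k] = $1
        ID[k] = $2
        next
    }
    {                            # second file: test every SNP against this gene
        for (i = 1; i <= k; i++)
            if (POS[i] >= $5 && POS[i] <= $6)
                print $1, ID[i], $3, POS[i], $5, $6
    }' "$1" "$2"
}

result=$(map_snps "$tmp/snps.txt" "$tmp/genes.txt")
echo "$result"
rm -rf "$tmp"
```

With this made-up data the sketch prints one line: `SYT1 rsEXAMPLE1 chr12 78000000 77782579 78369919`. Note that it compares positions only. The SNP file in post #1 has no chromosome column, so a position on one chromosome can match a gene range on another.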

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

Don Cragun

01-19-2017

The output should be file 2 (the gene info) with the SNP ID added to it:

**

Code:

 names seqnames**** start****** end**  GENEID 
* rs3753344**** chr1** 1142150** 1142150** ** TNFRSF18******
* rs3753344**** chr1** 1142150** 1142150 **** NA
 rs12191877**** chr6* 31252925* 31252925  HLA-B******* 
** rs881375**** chr9  123652898 123652898 *** NA

marwah

01-19-2017

One last time: Please show us exactly what output you want your code to produce when given the input files you provided in post #1 in this thread. If you are unwilling to do that, I'll close the thread.

Don Cragun

01-19-2017

No, please! I have added the data I want to see as output.

The output should be file 2 (the gene info) with the SNP ID added to it:

Code:

names       seqnames  start      end        GENEID
rs3753344   chr1      1142150    1142150    TNFRSF18
rs3753344   chr1      1142150    1142150    NA
rs12191877  chr6      31252925   31252925   HLA-B
rs881375    chr9      123652898  123652898  NA

I don't know where the stars came from, but this is the data without the stars.
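For output in that shape (one line per SNP, with NA when no gene covers it), a rough sketch might look like the following. It assumes the SNP file also carries a chromosome column, laid out as ID, chrom, POS. That is an assumption, since the SNP file in post #1 has only POS and ID. The rsDEMO IDs and positions are invented for the demonstration, the `annotate_snps` name is illustrative, and the gene rows are copied from post #1.

```shell
#!/bin/sh
tmp=$(mktemp -d)

# Gene rows copied from post #1 (genename name chrom strand txstart txend).
printf 'genename\tname\tchrom\tstrand\ttxstart\ttxend\n' > "$tmp/genes.txt"
printf 'SYT1\tNM_005639\tchr12\t+\t77782579\t78369919\n' >> "$tmp/genes.txt"
printf 'STK38\tNM_007271\tchr6\t-\t36569637\t36623271\n' >> "$tmp/genes.txt"

# ASSUMED SNP layout (ID chrom POS); these rows are invented for the demo.
printf 'ID\tchrom\tPOS\n' > "$tmp/snps.txt"
printf 'rsDEMO1\tchr12\t78000000\n' >> "$tmp/snps.txt"
printf 'rsDEMO2\tchr6\t78000000\n' >> "$tmp/snps.txt"

# annotate_snps GENEFILE SNPFILE (illustrative name)
annotate_snps() {
    awk '
    BEGIN { print "names", "seqnames", "start", "end", "GENEID" }
    FNR == 1 { next }                 # skip the header line of each file
    FNR == NR {                       # gene file: store each range
        n++
        gene[n] = $1; chr[n] = $3; lo[n] = $5; hi[n] = $6
        next
    }
    {                                 # SNP file: one line per matching gene
        hits = 0
        for (i = 1; i <= n; i++)
            if ($2 == chr[i] && $3 >= lo[i] && $3 <= hi[i]) {
                print $1, $2, $3, $3, gene[i]
                hits++
            }
        if (hits == 0)                # no gene covers this SNP
            print $1, $2, $3, $3, "NA"
    }' "$1" "$2"
}

annotated=$(annotate_snps "$tmp/genes.txt" "$tmp/snps.txt")
echo "$annotated"
rm -rf "$tmp"
```

The gene file is read first here so that the output follows the order of the SNP file. The two demo SNPs share a position but sit on different chromosomes, so only rsDEMO1 is assigned to SYT1 and rsDEMO2 gets NA. A SNP that falls inside several gene ranges is printed once per gene.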

marwah
