Help generating a script for next-generation sequencing data

11-23-2011


I am not sure if this is entirely possible, but I want to compare the data in a particular column across several .txt files and have a new file generated. I am a biologist with limited Unix knowledge. There are currently no programs written for this type of analysis.

First I would like to define the output file:

Code:

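echo "Please enter family number"
read FALS
# Braces are needed: $FALS_affected_variants would be read as one variable name
OUT="${FALS}_affected_variants.txt"
echo "The output file is $OUT"

echo "Please enter number of affected samples in family"
read NAFF   # NAFF is a placeholder name for the number of affected samples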

I want to use the number entered here as the number of .txt files to compare, because it will vary every time. For example, if I entered 4, then:

Code:

echo Please enter affected sample 1
read AFF1
echo Please enter affected sample 2
read AFF2
echo Please enter affected sample 3
read AFF3
echo Please enter affected sample 4
read AFF4
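
Since the number of samples varies, the fixed prompts above could be replaced by a loop. The following is only a minimal sketch: the names read_samples, NAFF and FILES are made up for illustration, and it assumes that sample names contain no spaces and that each sample's file is named <sample>.txt.

```shell
# read_samples: prompt for a count, then for that many sample names.
# Builds FILES as a space-separated list "name1.txt name2.txt ...".
# Hypothetical helper; assumes sample names contain no spaces.
read_samples() {
    FILES=""
    echo "Please enter number of affected samples in family"
    read -r NAFF
    i=1
    while [ "$i" -le "$NAFF" ]; do
        echo "Please enter affected sample $i"
        read -r name
        FILES="$FILES $name.txt"
        i=$((i + 1))
    done
    FILES=${FILES# }   # drop the leading space
}
```

After calling read_samples, the unquoted $FILES can be passed to a command that accepts a list of files, which is why names with spaces are not supported in this sketch.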

Now I want to compare the .txt files from these samples directly: $AFF1.txt $AFF2.txt $AFF3.txt $AFF4.txt
If the same value in column 15 (not a ".", but an actual annotation such as NM_123456) appears in two or more of the .txt files (anywhere in the file), I want the entire line written to a new .txt file,
OUT="${FALS}_affected_variants.txt", with a new column appended (a 19th column in the file) giving the number (an integer) of .txt files that contain this value, under the column heading "subjects".
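
One way this first step could be approached is with awk. This is a sketch under several assumptions, not a tested pipeline: the files are tab-separated, every file starts with a header line, each file is counted at most once per column-15 value, and the line kept in the output is the first one seen (in the order the files are given). The function name affected_variants is invented for this example.

```shell
# affected_variants FILE... : print lines whose column 15 value occurs in
# two or more of the given files, plus a 19th "subjects" column.
# Hypothetical helper; assumes tab-separated input with a header line.
affected_variants() {
    awk '
        BEGIN { FS = OFS = "\t" }
        FNR == 1 { if (NR == 1) header = $0; next }   # keep first header, skip the others
        $15 == "." || $15 == "" { next }               # ignore empty annotations
        !((FILENAME, $15) in seen) {                   # count each file only once per value
            seen[FILENAME, $15] = 1
            if (!($15 in count)) { order[++n] = $15; line[$15] = $0 }
            count[$15]++
        }
        END {
            print header, "subjects"
            for (i = 1; i <= n; i++)
                if (count[order[i]] >= 2) print line[order[i]], count[order[i]]
        }
    ' "$@"
}
```

It might be called as: affected_variants "$AFF1.txt" "$AFF2.txt" "$AFF3.txt" > "${FALS}_affected_variants.txt". Note that the comparison is an exact string match on column 15, including any trailing comma.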

Next, I would like to compare this $FALS_affected_variants.txt file to another list of .txt files (control files). All of these control files will be in their own directory, e.g. /home/user/NGS/controls, and there will probably be ~10 .txt files.

I want to compare the data in column 15 (not the ".", but an actual annotation such as PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T) in each line (one line at a time) of the $FALS_affected_variants.txt file to the "control" .txt files. If the data in column 15 from $FALS_affected_variants.txt is present in ANY of the "control" .txt files, I want to add an extra column to $FALS_affected_variants.txt (a 20th column with the heading in_controls) containing the word "yes"; if it is NOT present in any of the "control" .txt files, the word "no" should go in column 20. Or, if it is easier, generate a new output file $FALS_affected_variants_with_control_data.txt with the same 19 columns as the original $FALS_affected_variants.txt plus a new 20th column "in_controls" containing "yes" or "no".
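
The control comparison could be sketched in awk along the following lines. The assumptions are the same as before (tab-separated files, one header line each, exact string match on column 15), and the function name annotate_controls is invented for this example. It writes a new file rather than editing the variants file in place.

```shell
# annotate_controls VARIANTS CONTROL... : print VARIANTS with a 20th
# "in_controls" column that is "yes" when the column 15 value occurs in
# any control file, otherwise "no".
# Hypothetical helper; assumes tab-separated input with a header line.
annotate_controls() {
    variants=$1
    shift
    # Control files are read first, the variants file last.
    awk -v variants="$variants" '
        BEGIN { FS = OFS = "\t" }
        FILENAME != variants {                       # control file: remember annotations
            if (FNR > 1 && $15 != "." && $15 != "") ctrl[$15] = 1
            next
        }
        FNR == 1 { print $0, "in_controls"; next }   # header of the variants file
        { print $0, (($15 in ctrl) ? "yes" : "no") }
    ' "$@" "$variants"
}
```

A possible call, using the directory mentioned above, would be: annotate_controls "${FALS}_affected_variants.txt" /home/user/NGS/controls/*.txt > "${FALS}_affected_variants_with_control_data.txt"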

Here is an example of the files and what I want

AFF1:

Code:

chr_name	chr_start	chr_end	ref_base	alt_base	hom_het	snp_quality	tot_depth	alt_depth	dbSNP	dbSNP131	region	gene	change	annotation	dbSNP132	1000genomes	allele freq
chr01	11298631	11298631	G	A	het	85	38	20	.	.	exonic	MTOR	synonymous SNV	MTOR:NM_004958:exon12:c.C1830T:p.F610F,	.	.	.
chr01	11589817	11589817	C	T	het	54	15	9	.	.	intronic	PTCHD2	.	.	.	.	.
chr01	12908349	12908349	-	T	het	1835	128	54	.	.	intronic	HNRNPCL1	.	.	.	.	.
chr01	12921386	12921386	G	A	het	228	170	75	.	.	exonic	PRAMEF2	nonsynonymous SNV	PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T,	.	1000g2010nov_all	0.008
chr01	12939546	12939546	-	G	het	1535	157	50	.	.	exonic	PRAMEF4	frameshift insertion	PRAMEF4:NM_001009611:exon4:c.1256_1257insC:p.R419fs,	.	.	.
chr01	12939568	12939568	C	G	het	48	159	52	.	.	exonic	PRAMEF4	nonsynonymous SNV	PRAMEF4:NM_001009611:exon4:c.G1234C:p.V412L,	.	.	.
chr01	12954490	12954490	A	G	het	128	74	35	.	.	exonic	PRAMEF10	nonsynonymous SNV	PRAMEF10:NM_001039361:exon3:c.T793C:p.C265R,	.	1000g2010nov_all	0.065

AFF2:

Code:

chr_name	chr_start	chr_end	ref_base	alt_base	hom_het	snp_quality	tot_depth	alt_depth	dbSNP	dbSNP131	region	gene	change	annotation	dbSNP132	1000genomes	allele freq
chr01	12834987	12834987	-	T	het	683	53	22	.	.	UTR5	PRAMEF12	.	.	.	.	.
chr01	12908349	12908349	-	T	het	1943	153	61	.	.	intronic	HNRNPCL1	.	.	.	.	.
chr01	12921386	12921386	G	A	het	228	234	119	.	.	exonic	PRAMEF2	nonsynonymous SNV	PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T,	.	1000g2010nov_all	0.008
chr01	12939546	12939546	-	G	het	3903	397	118	.	.	exonic	PRAMEF4	frameshift insertion	PRAMEF4:NM_001009611:exon4:c.1256_1257insC:p.R419fs,	.	.	.
chr01	12939568	12939568	C	G	het	55	344	120	.	.	exonic	PRAMEF4	nonsynonymous SNV	PRAMEF4:NM_001009611:exon4:c.G1234C:p.V412L,	.	.	.

AFF3:

Code:

chr_name	chr_start	chr_end	ref_base	alt_base	hom_het	snp_quality	tot_depth	alt_depth	dbSNP	dbSNP131	region	gene	change	annotation	dbSNP132	1000genomes	allele freq
chr01	12041977	12041977	-	T	het	64	6	3	snp131	rs34602102	intronic	MFN2	.	.	.	.	.
chr01	12267373	12267373	G	T	het	22	6	2	.	.	UTR3	TNFRSF1B	.	.	.	.	.
chr01	12268023	12268023	-	AA	het	1278	46	1	.	.	UTR3	TNFRSF1B	.	.	.	.	.
chr01	12368706	12368706	T	-	het	83	15	3	.	.	intronic	VPS13D	.	.	.	.	.
chr01	12677725	12677725	A	-	het	157	29	5	.	.	UTR5	DHRS3	.	.	.	.	.
chr01	12908349	12908349	-	T	het	841	57	24	.	.	intronic	HNRNPCL1	.	.	.	.	.
chr01	12921386	12921386	G	A	het	228	170	75	.	.	exonic	PRAMEF2	nonsynonymous SNV	PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T,	.	1000g2010nov_all	0.008
chr01	12939546	12939546	-	G	het	1535	157	50	.	.	exonic	PRAMEF4	frameshift insertion	PRAMEF4:NM_001009611:exon4:c.1256_1257insC:p.R419fs,	.	.	.

Control1:

Code:

chr_name	chr_start	chr_end	ref_base	alt_base	hom_het	snp_quality	tot_depth	alt_depth	dbSNP	dbSNP131	region	gene	change	annotation	dbSNP132	1000genomes	allele freq
chr01	12035057	12035057	-	G	het	20	33	1	.	.	UTR3	PLOD1	.	.	.	.	.
chr01	12254746	12254746	-	A	het	128	10	3	.	.	intronic	TNFRSF1B	.	.	.	.	.
chr01	12268023	12268023	-	TGAA	hom	1661	24	22	.	.	UTR3	TNFRSF1B	.	.	.	.	.
chr01	12677725	12677725	A	-	het	50	10	2	.	.	UTR5	DHRS3	.	.	.	.	.
chr01	12908349	12908349	-	T	het	643	53	21	.	.	intronic	HNRNPCL1	.	.	.	.	.
chr01	12921386	12921386	G	A	het	228	170	75	.	.	exonic	PRAMEF2	nonsynonymous SNV	PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T,	.	1000g2010nov_all	0.008

So the first output file (comparing column 15 in AFF1, AFF2 and AFF3) would look like this:

$FALS_affected_variants.txt

Code:

chr_name	chr_start	chr_end	ref_base	alt_base	hom_het	snp_quality	tot_depth	alt_depth	dbSNP	dbSNP131	region	gene	change	annotation	dbSNP132	1000genomes	allele freq	subjects
chr01	12921386	12921386	G	A	het	228	170	75	.	.	exonic	PRAMEF2	nonsynonymous SNV	PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T,	.	1000g2010nov_all	0.008	3
chr01	12939546	12939546	-	G	het	1535	157	50	.	.	exonic	PRAMEF4	frameshift insertion	PRAMEF4:NM_001009611:exon4:c.1256_1257insC:p.R419fs,	.	.	.	3
chr01	12939568	12939568	C	G	het	48	159	52	.	.	exonic	PRAMEF4	nonsynonymous SNV	PRAMEF4:NM_001009611:exon4:c.G1234C:p.V412L,	.	.	.	2

Then, I would like to compare this file to control.txt files (here I am only using 1 control file)

I would like the new file to be as follows

Code:

chr_name	chr_start	chr_end	ref_base	alt_base	hom_het	snp_quality	tot_depth	alt_depth	dbSNP	dbSNP131	region	gene	change	annotation	dbSNP132	1000genomes	allele freq	subjects	in_controls
chr01	12921386	12921386	G	A	het	228	170	75	.	.	exonic	PRAMEF2	nonsynonymous SNV	PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T,	.	1000g2010nov_all	0.008	3	yes
chr01	12939546	12939546	-	G	het	1535	157	50	.	.	exonic	PRAMEF4	frameshift insertion	PRAMEF4:NM_001009611:exon4:c.1256_1257insC:p.R419fs,	.	.	.	3	no
chr01	12939568	12939568	C	G	het	48	159	52	.	.	exonic	PRAMEF4	nonsynonymous SNV	PRAMEF4:NM_001009611:exon4:c.G1234C:p.V412L,	.	.	.	2	no

Is this possible, and can anyone help me out?

Many thanks,

Kelly

kellywilliams


11-30-2011


Kelly,

Can the (non-".") value in the 15th column occur more than once in a file? And if so, does each occurrence increment the frequency of that value? If so, then you might find something like:

Code:

awk '
    BEGIN { FS = OFS = "\t"; } # tab is the column separator?
    FNR == 1 { next; }                   # skip the header line of each file
    $15 != "." { N[$15]++; }             # count every occurrence of each non-"." value
    END { for (p in N) { print p, N[p]; } }
' file1 file2... > outputfile

might help get you started.

m.d.ludwig
