Help generating a script for next-generation sequencing data


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Help generating a script for next-generation sequencing data
# 1  
Old 11-23-2011
Help generating a script for next-generation sequencing data

I am not sure if this is entirely possible, but I want to compare data in a particular column in several .txt files and have a new file generated. I am a biologist with limited unix knowledge. There are currently no programs written for this type of analysis.

First I would like to define the output file:
Code:
echo Please enter family number
read FALS
OUT=”$FALS_affected_variants.txt” (is this correct?)
echo “The output file is $OUT”

echo Please enter number of affected samples in family
read ?

I want use this number that is inputted for the number of .txt files I want to compare because this will vary every time, so for example if I entered 4, then:
Code:
echo Please enter affected sample 1
read AFF1
echo Please enter affected sample 2
read AFF2
echo Please enter affected sample 3
read AFF3
echo Please enter affected sample 4
read AFF4

Now, I want to compare the .txt files from these samples. So I want to directly compare $AFF1.txt $AFF2.txt $AFF3.txt $AFF4.txt
If the same data in column 15 (not a ".", but if there is something written i.e. NM_123456) is in two or more .txt files (anywhere in the entire file), I want this entire line outputted to a new .txt file
OUT=”$FALS_affected_variants.txt” with a new column added on (so a 19th column in the file) with how many (integer) .txt files this data is present in, and a heading of that column with “subjects”

Next, I would like to compare this $FALS_affected_variants.txt file to another list of .txt files (control files). All of these control files will be be in their own directory e.g. /home/user/NGS/controls and there will probably be ~10 .txt files

I want to compare the data in column 15 (not the ".", but if there is something written i.e. PRAMEF2:NM_023014:exon4:c.G1177ASmilie.A393T) in each line (one line at a time) in the $FALS_affected_variants.txt file to the “control” .txt files. If the data in column 15 from $FALS_affected_variants.txt is present in ANY of the “control” .txt files, I want to add an extra column to $FALS_affected_variants.txt (a 20th column with heading in_controls) with the word “yes”, or if it is NOT present in any of the “control” .txt files, the word “no” added to column 20. Or, if it is easier, generate a new output file $FALS_affected_variants_with_control_data.txt with the same 19 columns from the original $FALS_affected_variants.txt with a new 20th column "in_controls" with "yes" or "no"

Here is an example of the files and what I want

AFF1:
Code:
chr_name	chr_start	chr_end	ref_base	alt_base	hom_het	snp_quality	tot_depth	alt_depth	dbSNP	dbSNP131	region	gene	change	annotation	dbSNP132	1000genomes	allele freq
chr01	11298631	11298631	G	A	het	85	38	20	.	.	exonic	MTOR	synonymous SNV	MTOR:NM_004958:exon12:c.C1830T:p.F610F,	.	.	.
chr01	11589817	11589817	C	T	het	54	15	9	.	.	intronic	PTCHD2	.	.	.	.	.
chr01	12908349	12908349	-	T	het	1835	128	54	.	.	intronic	HNRNPCL1	.	.	.	.	.
chr01	12921386	12921386	G	A	het	228	170	75	.	.	exonic	PRAMEF2	nonsynonymous SNV	PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T,	.	1000g2010nov_all	0.008
chr01	12939546	12939546	-	G	het	1535	157	50	.	.	exonic	PRAMEF4	frameshift insertion	PRAMEF4:NM_001009611:exon4:c.1256_1257insC:p.R419fs,	.	.	.
chr01	12939568	12939568	C	G	het	48	159	52	.	.	exonic	PRAMEF4	nonsynonymous SNV	PRAMEF4:NM_001009611:exon4:c.G1234C:p.V412L,	.	.	.
chr01	12954490	12954490	A	G	het	128	74	35	.	.	exonic	PRAMEF10	nonsynonymous SNV	PRAMEF10:NM_001039361:exon3:c.T793C:p.C265R,	.	1000g2010nov_all	0.065

AFF2:
Code:
chr_name	chr_start	chr_end	ref_base	alt_base	hom_het	snp_quality	tot_depth	alt_depth	dbSNP	dbSNP131	region	gene	change	annotation	dbSNP132	1000genomes	allele freq
chr01	12834987	12834987	-	T	het	683	53	22	.	.	UTR5	PRAMEF12	.	.	.	.	.
chr01	12908349	12908349	-	T	het	1943	153	61	.	.	intronic	HNRNPCL1	.	.	.	.	.
chr01	12921386	12921386	G	A	het	228	234	119	.	.	exonic	PRAMEF2	nonsynonymous SNV	PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T,	.	1000g2010nov_all	0.008
chr01	12939546	12939546	-	G	het	3903	397	118	.	.	exonic	PRAMEF4	frameshift insertion	PRAMEF4:NM_001009611:exon4:c.1256_1257insC:p.R419fs,	.	.	.
chr01	12939568	12939568	C	G	het	55	344	120	.	.	exonic	PRAMEF4	nonsynonymous SNV	PRAMEF4:NM_001009611:exon4:c.G1234C:p.V412L,	.	.	.

AFF3:
Code:
chr_name	chr_start	chr_end	ref_base	alt_base	hom_het	snp_quality	tot_depth	alt_depth	dbSNP	dbSNP131	region	gene	change	annotation	dbSNP132	1000genomes	allele freq
chr01	12041977	12041977	-	T	het	64	6	3	snp131	rs34602102	intronic	MFN2	.	.	.	.	.
chr01	12267373	12267373	G	T	het	22	6	2	.	.	UTR3	TNFRSF1B	.	.	.	.	.
chr01	12268023	12268023	-	AA	het	1278	46	1	.	.	UTR3	TNFRSF1B	.	.	.	.	.
chr01	12368706	12368706	T	-	het	83	15	3	.	.	intronic	VPS13D	.	.	.	.	.
chr01	12677725	12677725	A	-	het	157	29	5	.	.	UTR5	DHRS3	.	.	.	.	.
chr01	12908349	12908349	-	T	het	841	57	24	.	.	intronic	HNRNPCL1	.	.	.	.	.
chr01	12921386	12921386	G	A	het	228	170	75	.	.	exonic	PRAMEF2	nonsynonymous SNV	PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T,	.	1000g2010nov_all	0.008
chr01	12939546	12939546	-	G	het	1535	157	50	.	.	exonic	PRAMEF4	frameshift insertion	PRAMEF4:NM_001009611:exon4:c.1256_1257insC:p.R419fs,	.	.	.

Control1:
Code:
chr_name	chr_start	chr_end	ref_base	alt_base	hom_het	snp_quality	tot_depth	alt_depth	dbSNP	dbSNP131	region	gene	change	annotation	dbSNP132	1000genomes	allele freq
chr01	12035057	12035057	-	G	het	20	33	1	.	.	UTR3	PLOD1	.	.	.	.	.
chr01	12254746	12254746	-	A	het	128	10	3	.	.	intronic	TNFRSF1B	.	.	.	.	.
chr01	12268023	12268023	-	TGAA	hom	1661	24	22	.	.	UTR3	TNFRSF1B	.	.	.	.	.
chr01	12677725	12677725	A	-	het	50	10	2	.	.	UTR5	DHRS3	.	.	.	.	.
chr01	12908349	12908349	-	T	het	643	53	21	.	.	intronic	HNRNPCL1	.	.	.	.	.
chr01	12921386	12921386	G	A	het	228	170	75	.	.	exonic	PRAMEF2	nonsynonymous SNV	PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T,	.	1000g2010nov_all	0.008

So the first output file (comparing column 15 in AFF1, AFF2 and AFF3) would look like this:

$FALS_affected_variants.txt
Code:
chr_name	chr_start	chr_end	ref_base	alt_base	hom_het	snp_quality	tot_depth	alt_depth	dbSNP	dbSNP131	region	gene	change	annotation	dbSNP132	1000genomes	allele freq	subjects
chr01	12921386	12921386	G	A	het	228	170	75	.	.	exonic	PRAMEF2	nonsynonymous SNV	PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T,	.	1000g2010nov_all	0.008	3
chr01	12939546	12939546	-	G	het	1535	157	50	.	.	exonic	PRAMEF4	frameshift insertion	PRAMEF4:NM_001009611:exon4:c.1256_1257insC:p.R419fs,	.	.	.	3
chr01	12939568	12939568	C	G	het	48	159	52	.	.	exonic	PRAMEF4	nonsynonymous SNV	PRAMEF4:NM_001009611:exon4:c.G1234C:p.V412L,	.	.	.	2

Then, I would like to compare this file to control.txt files (here I am only using 1 control file)

I would like the new file to be as follows
Code:
chr_name	chr_start	chr_end	ref_base	alt_base	hom_het	snp_quality	tot_depth	alt_depth	dbSNP	dbSNP131	region	gene	change	annotation	dbSNP132	1000genomes	allele freq	subjects	in_controls
chr01	12921386	12921386	G	A	het	228	170	75	.	.	exonic	PRAMEF2	nonsynonymous SNV	PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T,	.	1000g2010nov_all	0.008	3	yes
chr01	12939546	12939546	-	G	het	1535	157	50	.	.	exonic	PRAMEF4	frameshift insertion	PRAMEF4:NM_001009611:exon4:c.1256_1257insC:p.R419fs,	.	.	.	3	no
chr01	12939568	12939568	C	G	het	48	159	52	.	.	exonic	PRAMEF4	nonsynonymous SNV	PRAMEF4:NM_001009611:exon4:c.G1234C:p.V412L,	.	.	.	2	no

Is this possible? And can anyone help me out.

Many thanks,

Kelly
# 2  
Old 11-30-2011
Kelly,

Can the (non-".") value of the 15th column of data occur more than once in a file. And if so, does that increment the frequency of that value? If so, then you might find something like:
Code:
awk '
    BEGIN { FS = OFS = "\t"; } # tab is the column separator?
    FNR == 1 { next; }
    { N[$(15)]++; }
    END { for (p in N) { print n, N[p]; }
' file1 file2... > outputfile

might help get you started.
Login or Register to Ask a Question

Previous Thread | Next Thread

8 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Is there a way to handle commas inside the data when generating a csv file from shell script?

I am extracting data via sql query and some of the data has commas. Output File must be csv and I cannot update the data in the db (as it is used by other application). Example table FavoriteThings Person VARCHAR2(25), Favorite VARCHAR2(100) Sample Data Greta rain drop on... (12 Replies)
Discussion started by: patk625
12 Replies

2. Shell Programming and Scripting

Generating summary data (use awk?)

I have a data file similar to this (but many millions of lines long). You can assume that it is totally unsorted but has no duplicate rows. Date ,Tool_Type ,Tool_ID ,Time_Used 3/13/2014,Screwdriver,Screwdriver02, 6 3/13/2014,Screwdriver,Screwdriver02,20... (2 Replies)
Discussion started by: Michael Stora
2 Replies

3. Shell Programming and Scripting

Generating CSV from Column data

Hi List, I have a chunk of data like so: User Account Control: User Account Control: User Account Control: User Account Control: Disabled User Account Control: User Account Control: User Account Control: Disabled User Account Control: User Account Control: ... (3 Replies)
Discussion started by: landossa
3 Replies

4. UNIX for Dummies Questions & Answers

Generating 512MB file with dd using random data

Hello. Could anyone help me with my little annoying problem? I have to generate a 512 MB file made up with random data using DD. After some internet digging I found out that the command is: dd if=/dev/urandom of=/exemple/file bs=512MB After running this command the... (2 Replies)
Discussion started by: razolo13
2 Replies

5. Shell Programming and Scripting

Sliding window for sequencing data

Hi! I have some sequencing data that I have aligned using maq software Now, I have data that looks like this each line is a 'tag' chr1 10001 chr1 10002 chr1 10005 chr1 10007 chr1 10008 chr1 10008 chr1 10008 chr1 10019 chr1 10019 chr1 10020 What I really want to find out is how... (1 Reply)
Discussion started by: biobio
1 Replies

6. Shell Programming and Scripting

generating reports based on time field of network data

hi i have data extracted in the following format ranging around 300000 to 800000 records in a text file , the format is of network data . No. Time Source Destination Protocol 1 1998-06-05 17:20:23.569905 HP_61:aa:c9 HP_61:aa:c9 ... (1 Reply)
Discussion started by: renukaprasadb
1 Replies

7. Virtualization and Cloud Computing

Cloud Enabling Computing for the Next Generation Data Center

Hear how the changing needs of massive scale-out computing is driving a transfomation in technology and learn how HP is supporting this new evolution of the web. More... (1 Reply)
Discussion started by: Linux Bot
1 Replies

8. Shell Programming and Scripting

generating data for 1 hour

Hi Folks, The reqirement is that i need to generate 1 hr file with a time interval of five minutes.. For ex: my i/p is 0000-0000 and desired o/p is 0000-0005 0005-0010 0010-0015 0015-0020 0020-0025 0025-0030 0030-0035 0040-0045 0050-0055 0055-0100 Script neede urgent ... (0 Replies)
Discussion started by: aajan
0 Replies
Login or Register to Ask a Question