Help generating a script for next-generation sequencing data
I am not sure if this is entirely possible, but I want to compare data in a particular column in several .txt files and have a new file generated. I am a biologist with limited unix knowledge. There are currently no programs written for this type of analysis.
First I would like to define the output file:
Code:
echo Please enter family number
read FALS
OUT=”$FALS_affected_variants.txt” (is this correct?)
echo “The output file is $OUT”
echo Please enter number of affected samples in family
read ?
I want use this number that is inputted for the number of .txt files I want to compare because this will vary every time, so for example if I entered 4, then:
Code:
echo Please enter affected sample 1
read AFF1
echo Please enter affected sample 2
read AFF2
echo Please enter affected sample 3
read AFF3
echo Please enter affected sample 4
read AFF4
Now, I want to compare the .txt files from these samples. So I want to directly compare $AFF1.txt $AFF2.txt $AFF3.txt $AFF4.txt
If the same data in column 15 (not a ".", but if there is something written i.e. NM_123456) is in two or more .txt files (anywhere in the entire file), I want this entire line outputted to a new .txt file
OUT=”$FALS_affected_variants.txt” with a new column added on (so a 19th column in the file) with how many (integer) .txt files this data is present in, and a heading of that column with “subjects”
Next, I would like to compare this $FALS_affected_variants.txt file to another list of .txt files (control files). All of these control files will be be in their own directory e.g. /home/user/NGS/controls and there will probably be ~10 .txt files
I want to compare the data in column 15 (not the ".", but if there is something written i.e. PRAMEF2:NM_023014:exon4:c.G1177A.A393T) in each line (one line at a time) in the $FALS_affected_variants.txt file to the “control” .txt files. If the data in column 15 from $FALS_affected_variants.txt is present in ANY of the “control” .txt files, I want to add an extra column to $FALS_affected_variants.txt (a 20th column with heading in_controls) with the word “yes”, or if it is NOT present in any of the “control” .txt files, the word “no” added to column 20. Or, if it is easier, generate a new output file $FALS_affected_variants_with_control_data.txt with the same 19 columns from the original $FALS_affected_variants.txt with a new 20th column "in_controls" with "yes" or "no"
Here is an example of the files and what I want
AFF1:
Code:
chr_name chr_start chr_end ref_base alt_base hom_het snp_quality tot_depth alt_depth dbSNP dbSNP131 region gene change annotation dbSNP132 1000genomes allele freq
chr01 11298631 11298631 G A het 85 38 20 . . exonic MTOR synonymous SNV MTOR:NM_004958:exon12:c.C1830T:p.F610F, . . .
chr01 11589817 11589817 C T het 54 15 9 . . intronic PTCHD2 . . . . .
chr01 12908349 12908349 - T het 1835 128 54 . . intronic HNRNPCL1 . . . . .
chr01 12921386 12921386 G A het 228 170 75 . . exonic PRAMEF2 nonsynonymous SNV PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T, . 1000g2010nov_all 0.008
chr01 12939546 12939546 - G het 1535 157 50 . . exonic PRAMEF4 frameshift insertion PRAMEF4:NM_001009611:exon4:c.1256_1257insC:p.R419fs, . . .
chr01 12939568 12939568 C G het 48 159 52 . . exonic PRAMEF4 nonsynonymous SNV PRAMEF4:NM_001009611:exon4:c.G1234C:p.V412L, . . .
chr01 12954490 12954490 A G het 128 74 35 . . exonic PRAMEF10 nonsynonymous SNV PRAMEF10:NM_001039361:exon3:c.T793C:p.C265R, . 1000g2010nov_all 0.065
AFF2:
Code:
chr_name chr_start chr_end ref_base alt_base hom_het snp_quality tot_depth alt_depth dbSNP dbSNP131 region gene change annotation dbSNP132 1000genomes allele freq
chr01 12834987 12834987 - T het 683 53 22 . . UTR5 PRAMEF12 . . . . .
chr01 12908349 12908349 - T het 1943 153 61 . . intronic HNRNPCL1 . . . . .
chr01 12921386 12921386 G A het 228 234 119 . . exonic PRAMEF2 nonsynonymous SNV PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T, . 1000g2010nov_all 0.008
chr01 12939546 12939546 - G het 3903 397 118 . . exonic PRAMEF4 frameshift insertion PRAMEF4:NM_001009611:exon4:c.1256_1257insC:p.R419fs, . . .
chr01 12939568 12939568 C G het 55 344 120 . . exonic PRAMEF4 nonsynonymous SNV PRAMEF4:NM_001009611:exon4:c.G1234C:p.V412L, . . .
AFF3:
Code:
chr_name chr_start chr_end ref_base alt_base hom_het snp_quality tot_depth alt_depth dbSNP dbSNP131 region gene change annotation dbSNP132 1000genomes allele freq
chr01 12041977 12041977 - T het 64 6 3 snp131 rs34602102 intronic MFN2 . . . . .
chr01 12267373 12267373 G T het 22 6 2 . . UTR3 TNFRSF1B . . . . .
chr01 12268023 12268023 - AA het 1278 46 1 . . UTR3 TNFRSF1B . . . . .
chr01 12368706 12368706 T - het 83 15 3 . . intronic VPS13D . . . . .
chr01 12677725 12677725 A - het 157 29 5 . . UTR5 DHRS3 . . . . .
chr01 12908349 12908349 - T het 841 57 24 . . intronic HNRNPCL1 . . . . .
chr01 12921386 12921386 G A het 228 170 75 . . exonic PRAMEF2 nonsynonymous SNV PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T, . 1000g2010nov_all 0.008
chr01 12939546 12939546 - G het 1535 157 50 . . exonic PRAMEF4 frameshift insertion PRAMEF4:NM_001009611:exon4:c.1256_1257insC:p.R419fs, . . .
Control1:
Code:
chr_name chr_start chr_end ref_base alt_base hom_het snp_quality tot_depth alt_depth dbSNP dbSNP131 region gene change annotation dbSNP132 1000genomes allele freq
chr01 12035057 12035057 - G het 20 33 1 . . UTR3 PLOD1 . . . . .
chr01 12254746 12254746 - A het 128 10 3 . . intronic TNFRSF1B . . . . .
chr01 12268023 12268023 - TGAA hom 1661 24 22 . . UTR3 TNFRSF1B . . . . .
chr01 12677725 12677725 A - het 50 10 2 . . UTR5 DHRS3 . . . . .
chr01 12908349 12908349 - T het 643 53 21 . . intronic HNRNPCL1 . . . . .
chr01 12921386 12921386 G A het 228 170 75 . . exonic PRAMEF2 nonsynonymous SNV PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T, . 1000g2010nov_all 0.008
So the first output file (comparing column 15 in AFF1, AFF2 and AFF3) would look like this:
$FALS_affected_variants.txt
Code:
chr_name chr_start chr_end ref_base alt_base hom_het snp_quality tot_depth alt_depth dbSNP dbSNP131 region gene change annotation dbSNP132 1000genomes allele freq subjects
chr01 12921386 12921386 G A het 228 170 75 . . exonic PRAMEF2 nonsynonymous SNV PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T, . 1000g2010nov_all 0.008 3
chr01 12939546 12939546 - G het 1535 157 50 . . exonic PRAMEF4 frameshift insertion PRAMEF4:NM_001009611:exon4:c.1256_1257insC:p.R419fs, . . . 3
chr01 12939568 12939568 C G het 48 159 52 . . exonic PRAMEF4 nonsynonymous SNV PRAMEF4:NM_001009611:exon4:c.G1234C:p.V412L, . . . 2
Then, I would like to compare this file to control.txt files (here I am only using 1 control file)
I would like the new file to be as follows
Code:
chr_name chr_start chr_end ref_base alt_base hom_het snp_quality tot_depth alt_depth dbSNP dbSNP131 region gene change annotation dbSNP132 1000genomes allele freq subjects in_controls
chr01 12921386 12921386 G A het 228 170 75 . . exonic PRAMEF2 nonsynonymous SNV PRAMEF2:NM_023014:exon4:c.G1177A:p.A393T, . 1000g2010nov_all 0.008 3 yes
chr01 12939546 12939546 - G het 1535 157 50 . . exonic PRAMEF4 frameshift insertion PRAMEF4:NM_001009611:exon4:c.1256_1257insC:p.R419fs, . . . 3 no
chr01 12939568 12939568 C G het 48 159 52 . . exonic PRAMEF4 nonsynonymous SNV PRAMEF4:NM_001009611:exon4:c.G1234C:p.V412L, . . . 2 no
Can the (non-".") value of the 15th column of data occur more than once in a file. And if so, does that increment the frequency of that value? If so, then you might find something like:
Code:
awk '
BEGIN { FS = OFS = "\t"; } # tab is the column separator?
FNR == 1 { next; }
{ N[$(15)]++; }
END { for (p in N) { print n, N[p]; }
' file1file2... > outputfile
I am extracting data via sql query and some of the data has commas. Output File must be csv and I cannot update the data in the db (as it is used by other application).
Example
table FavoriteThings
Person VARCHAR2(25),
Favorite VARCHAR2(100)
Sample Data
Greta rain drop on... (12 Replies)
I have a data file similar to this (but many millions of lines long). You can assume that it is totally unsorted but has no duplicate rows.
Date ,Tool_Type ,Tool_ID ,Time_Used
3/13/2014,Screwdriver,Screwdriver02, 6
3/13/2014,Screwdriver,Screwdriver02,20... (2 Replies)
Hi List,
I have a chunk of data like so:
User Account Control:
User Account Control:
User Account Control:
User Account Control:
Disabled
User Account Control:
User Account Control:
User Account Control:
Disabled
User Account Control:
User Account Control:
... (3 Replies)
Hello.
Could anyone help me with my little annoying problem?
I have to generate a 512 MB file made up with random data using DD. After some internet digging I found out that the command is:
dd if=/dev/urandom of=/exemple/file bs=512MB
After running this command the... (2 Replies)
Hi!
I have some sequencing data that I have aligned using maq software
Now, I have data that looks like this each line is a 'tag'
chr1 10001
chr1 10002
chr1 10005
chr1 10007
chr1 10008
chr1 10008
chr1 10008
chr1 10019
chr1 10019
chr1 10020
What I really want to find out is how... (1 Reply)
hi i have data extracted in the following format ranging around 300000 to 800000 records in a text file , the format is of network data .
No. Time Source Destination Protocol
1 1998-06-05 17:20:23.569905 HP_61:aa:c9 HP_61:aa:c9 ... (1 Reply)
Hear how the changing needs of massive scale-out computing is driving a transfomation in technology and learn how HP is supporting this new evolution of the web.
More... (1 Reply)
Hi Folks,
The reqirement is that i need to generate 1 hr file with a time interval of five minutes..
For ex:
my i/p is
0000-0000
and desired o/p is
0000-0005
0005-0010
0010-0015
0015-0020
0020-0025
0025-0030
0030-0035
0040-0045
0050-0055
0055-0100
Script neede urgent
... (0 Replies)