awk to compare each file in two directories by storing in variable
In the bash below I am trying to read each file from a specific directory into a variable, REF or VAL, and then use those variables in an awk to compare each matching file from REF and VAL. The filenames in REF are different from those in VAL, but they share a common ID up to the _. I know the awk portion works, as I can run files through it manually. However, I am not sure if I am doing the variable part correctly. Thank you.
The two variables are:
REF comes from /home/cmccabe/Desktop/comparison/reference/10bp and is 3 files:
Code:
S1234_ref.txt
A5678_ref.txt
T1111_ref.txt
VAL comes from /home/cmccabe/Desktop/comparison/validation/files and is 3 files:
Code:
S1234_panel.vcf
A5678_panel.vcf
T1111_panel.vcf
So, S1234_ref.txt would be REF and S1234_panel.vcf would be VAL, then those two files would be compared. I think I am close but not too sure about the variables.
Code:
REF=/home/cmccabe/Desktop/comparison/reference/10bp
VAL=/home/cmccabe/Desktop/comparison/validation/files
awk -F'\t' -v OFS='\t' 'FNR==1 { next }
FNR == NR { file1[$2,$4,$5] = $2 FS $4 FS $5 }
FNR != NR { file2[$2,$4,$5] = $2 FS $4 FS $5 }
END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
print "Missing in Reference but found in IDP:"; for (k in file2) if (!(k in file1)) print file2[k]
print "Missing in IDP but found in Reference:"; for (k in file1) if (!(k in file2)) print file1[k]
}'$REF $VAL > /home/cmccabe/Desktop/comparison/ref_val/concordance.txt
Last edited by cmccabe; 09-30-2016 at 04:01 PM..
Reason: added details
Also, to get the above Input_file you could use paste var_file ref_file > Input_file
Then you could read its values in a loop and put your awk inside it.
while read -r file1 file2
do
# note: ">" rewrites concordance.txt on every pass; use ">>" to keep the results of all pairs
awk -F'\t' -v OFS='\t' 'FNR==1 { next }
FNR == NR { file1[$2,$4,$5] = $2 FS $4 FS $5 }
FNR != NR { file2[$2,$4,$5] = $2 FS $4 FS $5 }
END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
print "Missing in Reference but found in IDP:"; for (k in file2) if (!(k in file1)) print file2[k]
print "Missing in IDP but found in Reference:"; for (k in file1) if (!(k in file2)) print file1[k]
}' "/home/cmccabe/Desktop/comparison/reference/10bp/$file1" "/home/cmccabe/Desktop/comparison/validation/files/$file2" > /home/cmccabe/Desktop/comparison/ref_val/concordance.txt
done < "/home/cmccabe/Desktop/comparison/ref_val/out"
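The two-column list that the loop above reads could also be generated from the directories themselves rather than typed by hand. What follows is only a sketch under assumptions: make_pairs is a hypothetical helper name, filenames are assumed to contain no whitespace, and each ID (the text before the first _) is assumed to have at most one file of interest in the validation directory.

```bash
#!/usr/bin/env bash
# make_pairs (hypothetical helper): print one "ref_name val_name" line per
# sample, matching files on the ID that precedes the first underscore,
# e.g. S1234_ref.txt <-> S1234_panel.vcf.
make_pairs() {
    local ref_dir="$1" val_dir="$2" ref val id
    for ref in "$ref_dir"/*_*; do
        [ -f "$ref" ] || continue        # glob matched nothing, or not a file
        id=${ref##*/}                    # strip the directory part
        id=${id%%_*}                     # keep the text before the first "_"
        for val in "$val_dir/${id}"_*; do
            [ -f "$val" ] || continue
            printf '%s %s\n' "${ref##*/}" "${val##*/}"
            break                        # assumption: first match is the one wanted
        done
    done
}
```

Something like make_pairs /home/cmccabe/Desktop/comparison/reference/10bp /home/cmccabe/Desktop/comparison/validation/files > /home/cmccabe/Desktop/comparison/ref_val/out would then write the list in the order the loop expects (reference name first, validation name second).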
I give the full path to each:
Code:
file1
file2
out
Thank you.
Hello cmccabe,
As we have requested multiple times, kindly give complete details about what is happening and what is not. Though you have used the code I suggested in post #2, you haven't mentioned whether that approach worked. Because you haven't shown us what is in those Input_files (how an Input_file looks with sample data), it is difficult to tell whether your command will work. Please give us the complete details of your requirements, with a sample Input_file, the expected output, and any error messages you get when running the suggestions or your own code. I hope this helps.
Thanks,
R. Singh
I am not sure what you mean by create an input file. I created the input with the filenames of REF in $1 and the filenames of VAL in $2. I then gave the full path to the location of each filename in $1 and the full path to each filename in $2. After giving the full paths to file1, file2, and input, I ran the awk; it ran, but nothing happened and no output was produced. I was trying to use your code but just misunderstood it. I hope this helps, and thank you.
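When a loop like this runs but produces nothing, one quick thing to rule out is a list entry that does not point at a real, non-empty file. The sketch below is only an illustration: check_pairs is a hypothetical helper, and it assumes the list holds two whitespace-separated full paths per line, as described above.

```bash
#!/usr/bin/env bash
# check_pairs (hypothetical helper): read "file1 file2" lines from a list and
# report every entry that is missing or empty. Returns non-zero if any is bad.
check_pairs() {
    local list="$1" f f1 f2 bad=0
    while read -r f1 f2; do
        [ -n "$f1" ] || continue              # ignore blank lines
        for f in "$f1" "$f2"; do
            if [ ! -s "$f" ]; then            # -s: exists and has size > 0
                echo "missing or empty: $f" >&2
                bad=1
            fi
        done
    done < "$list"
    return "$bad"
}
```

Running check_pairs /home/cmccabe/Desktop/comparison/ref_val/out before the awk loop would show whether the paths in the list resolve at all.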
@RudiC and @RavinderSingh13, thank you both for all of your help.
It looks like the script reads all the vcf files from REF and puts them in a variable FN. How do the txt files from VAL get used by the awk? The awk looks at each REF file and compares it to each VAL file, looking for what's common and what's different. If a difference is found, it identifies which file the missing data came from. The awk portion works on individual files, but I have over 500 to compare, so a loop would help; however, that is what I need help with.
REF: there are 250 files, all located at /home/cmccabe/Desktop/comparison/reference/10bp
Code:
F13_ref_FP_10bp.txt
H19_ref_FP_10bp.txt
Data structure in REF
Code:
Chr Start End Ref Alt Func.refGene Gene.refGene Coverage Score A(#F,#R) C(#F,#R) G(#F,#R) T(#F,#R) Ins(#F,#R) Del(#F,#R) SNP Mutation Frequency Sanger
12 52200340 52200340 A C exonic SCN8A 4129 28.3 1560;1672 413;453 0;0 0;0 0;2 31;0 c.[5070A>C]+[=] 20.97
2 51254914 51254914 C T exonic NRXN1 1562 25.5 0;0 536;218 0;0 574;234 0;0 0;0 c.[498G>A]+[=] 51.73
X 67433722 67433722 C T exonic OPHN1 2747 25.6 0;0 46;37 0;0 1211;1443 1;8 5;5 c.[579G>A]+[579G>A] 96.61
VAL: there are 250 files, all located at /home/cmccabe/Desktop/comparison/validation/files
Desired output (an example, not using these files, that compares a REF file to a VAL file and finds what's in common, what's different, and where the difference comes from; it also includes some additional data from another script):
Code:
Match:
Chr Start Ref Alt Func.refGene Gene.refGene Quality Reads Zygosity Phred
chr15 68521889 C T exonic CLN6 GOOD 50 het 4
chr7 147183143 A G intronic CNTNAP2 GOOD 382 het 22
chr2 167099158 A G exonic SCN9A GOOD 210 hom 55
Missing in Reference but found in IDP:
Chr Start Ref Alt Func.refGene Gene.refGene Quality Reads Zygosity Phred
chr2 51666313 T C intergenic NRXN1,NONE GOOD 108 het 7
chr2 166903445 T C exonic SCN1A GOOD 400 het 28
Missing in IDP but found in Reference:
Chr Start Ref Alt Func.refGene Gene.refGene Mutation Call Coverage Score Mutant Allele Frequency A(#F,#R) C(#F,#R) G(#F,#R) T(#F,#R) ins(#F,#R) del(#F,#R) SNP db_ref Region
2 166210776 C T exonic SCN2A c.[2994C>T]+[=] 3095 23.1 24.56 0:0 1158:1177 0;0 457;303 1;0 0;0 No low coverage
7 148106478 - GT intronic CNTNAP2 c.3716-5_3716-4insGT 4168 28.6 51.01 0;0 0;1 0;0 2199;1967 1129;997 0;1 rs60451214 No low
I hope this helps, and I apologize for the long post, but I think these are all the details. Thank you.
Last edited by cmccabe; 10-01-2016 at 10:49 AM..
Reason: added details
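For the many-files case, one possible approach is to loop over the REF directory, derive the ID from each filename, look up the VAL file with the same ID, and run the awk from the first post once per pair. This is a sketch, not a tested solution for the real data: compare_all and the per-sample output name ${id}_concordance.txt are hypothetical, the files are assumed to be tab-separated with a single header line, and the key columns ($2, $4, $5) are taken from the original awk.

```bash
#!/usr/bin/env bash
# compare_all (hypothetical wrapper): run the awk from the thread once per
# REF/VAL pair and write one report per sample ID into out_dir.
compare_all() {
    local ref_dir="$1" val_dir="$2" out_dir="$3" ref cand val id
    mkdir -p "$out_dir"
    for ref in "$ref_dir"/*_*.txt; do
        [ -f "$ref" ] || continue
        id=${ref##*/}; id=${id%%_*}          # ID = text before the first "_"
        val=
        for cand in "$val_dir/${id}"_*; do   # first VAL file sharing that ID
            [ -f "$cand" ] && { val=$cand; break; }
        done
        [ -n "$val" ] || { echo "no VAL file for $id, skipping" >&2; continue; }
        awk -F'\t' '
            FNR == 1  { next }               # skip the header line of each file
            FNR == NR { ref[$2,$4,$5] = $2 FS $4 FS $5; next }
                      { val[$2,$4,$5] = $2 FS $4 FS $5 }
            END {
                print "Match:"
                for (k in ref) if (k in val) print ref[k]
                print "Missing in Reference but found in IDP:"
                for (k in val) if (!(k in ref)) print val[k]
                print "Missing in IDP but found in Reference:"
                for (k in ref) if (!(k in val)) print ref[k]
            }' "$ref" "$val" > "$out_dir/${id}_concordance.txt"
    done
}
```

It could be called as compare_all /home/cmccabe/Desktop/comparison/reference/10bp /home/cmccabe/Desktop/comparison/validation/files /home/cmccabe/Desktop/comparison/ref_val. Writing one file per ID avoids each pass overwriting the previous result. Note that awk does not guarantee the order of for (k in ...) traversal, so lines within each section may appear in any order.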