Check ID in a file matches to the name of the file


 
# 1  
Old 06-07-2017

I have a number of tab-delimited text files in my directory named 1.vcf, 2.vcf, etc. Each file has a header of 120-130 rows starting with "#"; it looks like this:

Code:
...
##contig=<ID=GL000194.1,length=191469,assembly=hg19>
##contig=<ID=GL000225.1,length=211173,assembly=hg19>
##contig=<ID=GL000192.1,length=547496,assembly=hg19>
##contig=<ID=vcontig,length=337,assembly=hg19>
##reference=human_hg19.fasta
##source=SelectVariants
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  1
1       12012010        rs1000002       A       C       14325.14        .       AC=1;AF=0.500;AN=2;BaseQRankSum=-13...

As these files are created by an automated pipeline, I want to introduce an ID check to see whether each file name (1.vcf, 2.vcf, ...) corresponds to the ID inside the file.
The ID is always present in the last line of the header, after 'FORMAT'.
The files are always named according to the ID.
I have been doing this manually so far; is there a way to script it?

Last edited by nans; 06-07-2017 at 02:06 PM..
# 2  
Old 06-07-2017
You mean the ID should match the file name with the "extension" .vcf stripped off? Or any other extension? What should happen if the two match? What if they don't?

And, get rid of the DOS line terminators in your text files you wish to process on *nix...
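
A quick way to find out whether a file really has DOS (CRLF) line terminators, and to strip them, might look like the sketch below. The helper name `has_crlf` and the sample files in a temporary directory are made up for illustration; only standard awk and tr are assumed.

```shell
# has_crlf FILE: exit 0 if any line of FILE ends in a carriage return,
# exit 1 otherwise.
has_crlf() {
    awk '/\r$/ { found = 1; exit } END { exit !found }' "$1"
}

# Throwaway sample files (hypothetical names), one per line-ending style.
demo_dir=$(mktemp -d)
printf '#CHROM\tFORMAT\t1\r\n' > "$demo_dir/dos.vcf"
printf '#CHROM\tFORMAT\t1\n'   > "$demo_dir/unix.vcf"

for f in "$demo_dir"/*.vcf; do
    if has_crlf "$f"; then
        echo "${f##*/}: DOS line endings"
    else
        echo "${f##*/}: Unix line endings"
    fi
done

# Strip the carriage returns into a new file rather than editing in place.
tr -d '\r' < "$demo_dir/dos.vcf" > "$demo_dir/dos.fixed.vcf"
```

Writing the cleaned copy to a separate file keeps the original around until you have confirmed the result.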
# 3  
Old 06-07-2017
I just need the name before the .vcf extension (<ids>.vcf) to match the <ids> inside the file. The extension will always be .vcf. If the IDs don't match, the result should be "false", and I will know there has been an ID mix-up somewhere in the pipeline. This is done on Linux.
# 4  
Old 06-07-2017
Try:
Code:
awk '$(NF-1)=="FORMAT" && $NF".vcf" != FILENAME{print FILENAME":" $NF;nextfile}' *.vcf

Or, if your .vcf files might be in DOS text file format:
Code:
awk '{sub(/\r$/,"")}$(NF-1)=="FORMAT" && $NF".vcf" != FILENAME{print FILENAME":" $NF;nextfile}' *.vcf

That should work with awk (or gawk) on a Linux system. If you want to try this on a system where awk doesn't support the nextfile statement, you can remove the ;nextfile from the script and it should work just as well, but will run a little bit slower.

If someone wants to try this on a Solaris/SunOS system, change awk to nawk or /usr/xpg4/bin/awk (and remove the ;nextfile).
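
To see what the command reports before pointing it at real data, it can be run against a couple of throwaway files. The sketch below is only an illustration: the function name `check_vcf_ids` and the sample files are invented here, `nextfile` is left out so it should also run on awks that lack it, and an `NF > 1` guard is added so that empty lines cannot trigger a reference to field -1.

```shell
# check_vcf_ids DIR: for every .vcf in DIR, print "file:ID" when the sample ID
# (last field of the line whose second-to-last field is FORMAT) differs from
# the file name without its .vcf extension. Runs in a subshell so the cd
# does not affect the caller.
check_vcf_ids() (
    cd "$1" || exit 1
    awk '{ sub(/\r$/, "") }   # tolerate DOS line endings
         NF > 1 && $(NF-1) == "FORMAT" && $NF ".vcf" != FILENAME {
             print FILENAME ":" $NF
         }' *.vcf
)

# Hypothetical sample data: 1.vcf carries ID 1, 2.vcf carries ID 44.
demo_dir=$(mktemp -d)
printf '##source=demo\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t1\n1\t100\trs1\tA\tC\t50\t.\tAC=1\tGT\t0/1\n' > "$demo_dir/1.vcf"
printf '##source=demo\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t44\n' > "$demo_dir/2.vcf"

mismatches=$(check_vcf_ids "$demo_dir")
echo "$mismatches"
```

With these two sample files, only 2.vcf is reported, as `2.vcf:44`. An empty result means every file name agreed with its ID.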
# 5  
Old 06-07-2017
Try also
Code:
awk '/#.*FORMAT/ {exit 1 - ($NF == substr (FILENAME, 1, index(FILENAME, ".")-1))}' 1.vcf
echo $?
0
awk '/#.*FORMAT/ {exit 1 - ($NF == substr (FILENAME, 1, index(FILENAME, ".")-1))}' 2.vcf
echo $?
1

or, shamelessly stealing from Don Cragun's post,
Code:
awk '/#.*FORMAT/ {exit 1 - ($NF ".vcf" == FILENAME)}' 2.vcf
echo $?
1

# 6  
Old 06-07-2017
Thank you both. Don Cragun's code works for me.

@Rudi, how is it checking whether the ID matches? For example, I tried the code on this file, 2.vcf, which looks like this:

Code:
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      44
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3

I expected the code to report that the ID 44 in the file does not match the file name 2.vcf, but it only returns a value of 1.
VCF (variant call format) is just a tab-delimited text file.
# 7  
Old 06-07-2017
Hi nans,
RudiC's code and my code are intended to do different things.

My code processes all of the .vcf files in the current working directory and, for each file in which the filename and the ID do not match, prints the name of the file and the ID found in it.

RudiC's code processes one file at a time. If the filename and the ID in that file match, the exit code will be 0; if the filename and the ID do not match, the exit code will be 1. No output is printed either way; you just use the exit code of the script as a test to determine whether or not that file meets your expectations.
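
A per-file exit code like this is typically consumed with `if` in a shell loop. The sketch below is one possible way to do that, not RudiC's exact code: the names `id_matches` and `make_vcf` and the sample files are invented for illustration. It also anchors on the first field being `#CHROM` rather than the pattern /#.*FORMAT/, because as far as I can tell that pattern also matches `##FORMAT=<...>` meta lines like the ones in the sample from post #6, which would make awk exit at the first such line. A file with no `#CHROM` line is treated as a mismatch, and the comparison is forced to be a string comparison.

```shell
# id_matches FILE: exit 0 if the last field of the "#CHROM ..." header line
# equals FILE's base name without ".vcf"; exit 1 on a mismatch or when no
# such header line is found.
id_matches() {
    awk -v want="$(basename "$1" .vcf)" '
        { sub(/\r$/, "") }                     # tolerate DOS line endings
        $1 == "#CHROM" { found = 1; ok = (($NF "") == (want "")); exit }
        END { exit !(found && ok) }
    ' "$1"
}

# make_vcf PATH ID: write a tiny header-only sample file (illustration only).
make_vcf() {
    printf '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n' > "$1"
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t%s\n' "$2" >> "$1"
}

demo_dir=$(mktemp -d)
make_vcf "$demo_dir/44.vcf" 44   # name and ID agree
make_vcf "$demo_dir/2.vcf" 44    # name and ID disagree

for f in "$demo_dir"/*.vcf; do
    if id_matches "$f"; then
        echo "${f##*/}: OK"
    else
        echo "${f##*/}: MISMATCH"
    fi
done
```

Because the function prints nothing itself, the caller decides what to do on a mismatch: log it, move the file aside, or abort the pipeline.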