awk to match and apply conditions to matching files in directories


 
# 1  
Old 10-15-2016

I am trying to merge the awk below, which compares two files looking for a match in $2 and then prints the line if two conditions are met.

awk
Code:
 awk 'FNR==NR{A[$2]=$0;next} ($2 in A){if($10>30 && $11>49){print A[$2]}}' F113.txt F113_tvc.bed
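
For readers unfamiliar with the idiom, here is a minimal sketch of what the one-liner does. The two-row sample files, the temporary directory, and the variable name result are made up for illustration, and whitespace-separated input is assumed. FNR==NR is true only while the first file is being read, so A[$2] stores each line of the .txt file keyed on column 2; the second block therefore runs only on the .bed file.

```shell
#!/bin/sh
# Illustrative data only; the real files have more columns and rows.
tmp=$(mktemp -d)
printf '2 100 keep-me\n2 200 drop-me\n' > "$tmp/F113.txt"
# Columns 10 and 11 of the .bed file carry the values being tested.
printf 'chr2 100 a b c d e f g 54 100\nchr2 200 a b c d e f g 10 100\n' > "$tmp/F113_tvc.bed"

result=$(awk '
    FNR == NR { A[$2] = $0; next }                     # first file: remember line by $2
    ($2 in A) && $10 > 30 && $11 > 49 { print A[$2] }  # second file: test, then print
' "$tmp/F113.txt" "$tmp/F113_tvc.bed")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Only the row whose key is 100 passes both tests; the row with key 200 matches on $2 but fails the $10 test, so it is not printed.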

This code was improved and provided by @RavinderSingh13, thank you very much. I have ~500 files to process, so I want to take every .txt file in /home/cmccabe/Desktop/comparison/missing and compare it to the .bed file in /home/cmccabe/Desktop/comparison/test_tvc that has the same prefix. The paired filenames in the two directories share a common prefix:

So if there are three files, the three .txt files in /home/cmccabe/Desktop/comparison/missing will look like:
Code:
F113.txt
H123.txt
S111.txt

and the three .bed files in /home/cmccabe/Desktop/comparison/test_tvc will look like:
Code:
F113_tvc.bed
H123_tvc.bed
S111_tvc.bed

So F113.txt would be compared to F113_tvc.bed; the matching prefix is F113.
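
The pairing rule can be sketched with plain parameter expansion. The function name bed_for is hypothetical, and the _tvc.bed suffix is taken from the listing above.

```shell
#!/bin/sh
# Derive the .bed name that belongs to a .txt name by stripping the
# extension and appending the _tvc.bed suffix.
bed_for() {
    prefix=${1%.txt}              # F113.txt -> F113
    printf '%s_tvc.bed\n' "$prefix"
}

for txt in F113.txt H123.txt S111.txt
do
    printf '%s -> %s\n' "$txt" "$(bed_for "$txt")"
done
```

Note that ${1%%_*} would not work for these names, because the .txt names contain no underscore and the expansion would leave the extension in place.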

If a match between the $2 values in each file is made and both conditions ($10>30 && $11>49) are met, then the matching line from the .txt file is printed in the output under Match in both files and meet criteria:. If no match is found or the criteria are not met, then the line from the .txt file is printed in the output under Missing in comparison:.

The below code provided by @Don Cragun works great but since my data has changed a bit I made some updates to it:

Code:
# (code that works perfectly)
IAm=${0##*/}

InDir1='/home/cmccabe/Desktop/comparison/reference/10bp'
InDir2='/home/cmccabe/Desktop/comparison/validation/files'
OutDir='/home/cmccabe/Desktop/comparison/ref_val'

cd "$InDir1"
for file1 in *.txt
do    # Grab file prefix.
    p=${file1%%_*}

    # Find matching file2.
    file2=$(printf '%s' "$InDir2/$p"_*.vcf)
    if [ ! -f "$file2" ]
    then    printf '%s: No single file matching %s found.\n' "$IAm" \
            "$file1" >&2
        continue
    fi

    # Create matching output filename.
    out=${file2##*/}
    out=${out%.vcf}_comparison.txt

    printf '%s\t%s\t%s\n' "$InDir1/$file1" "$file2" "$OutDir/$out"
done | awk '
BEGIN {    FS = OFS = "\t"
}
{    in1 = $1
    in2 = $2
    out = $3
    print "Reading from " in1
    while((getline < in1) == 1)
        f1[$2 OFS $4 OFS $5]
    close(in1)
    print "Reading from " in2
    while((getline < in2) == 1)
        f2[$2 OFS $4 OFS $5]
    close(in2)
    print "Writing to " out
    print "Match:" > out
    for(k in f1)
        if(k in f2) {
            print k > out
            delete f1[k]
            delete f2[k]
        }
    print "Missing in Reference but found in IDP:" > out
    for(k in f2) {
        print k > out
        delete f2[k]
    }
    print "Missing in IDP but found in Reference:" > out
    for(k in f1) {
        print k > out
        delete f1[k]
    }
    close(out)
    print "***"
}'
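
The awk half of the script above receives one line per file pair on standard input and opens the named files itself with getline. A minimal sketch of that pattern follows; the file names a.txt and b.txt and the variable summary are invented for illustration. Unlike the script above, this sketch reads into a variable (getline line < file), which leaves $0 and the fields untouched; the plain getline < file form used above replaces $0, which is why the script copies $1, $2 and $3 into variables first.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf 'x\t1\ny\t2\n' > "$tmp/a.txt"
printf 'x\t1\n'        > "$tmp/b.txt"

# Each input line names two files; awk opens and reads them itself.
summary=$(printf '%s\t%s\n' "$tmp/a.txt" "$tmp/b.txt" | awk '
    BEGIN { FS = "\t" }
    {
        in1 = $1; in2 = $2          # save the names before reading anything else
        n1 = n2 = 0
        while ((getline line < in1) > 0) n1++
        close(in1)                  # close so the next pair starts fresh
        while ((getline line < in2) > 0) n2++
        close(in2)
        print n1 " lines, " n2 " lines"
    }')

echo "$summary"
rm -rf "$tmp"
```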

Updated version, which does not run (my changes are marked with --):

Code:
IAm=${0##*/}

InDir1='/home/cmccabe/Desktop/comparison/missing'   -- updated path to .txt files
InDir2='/home/cmccabe/Desktop/comparison/test_tvc'  -- updated path to .bed files
OutDir='/home/cmccabe/Desktop/comparison/final'  -- updated path to output

cd "$InDir1"
for file1 in *.txt
do    # Grab file prefix.
    p=${file1%%_*}

    # Find matching file2.
    file2=$(printf '%s' "$InDir2/$p"_*.bed)  -- updated extension
    if [ ! -f "$file2" ]
    then    printf '%s: No single file matching %s found.\n' "$IAm" \
            "$file1" >&2
        continue
    fi

    # Create matching output filename.
    out=${file2##*/}
    out=${out%.vcf}_final.txt  -- updated output

    printf '%s\t%s\t%s\n' "$InDir1/$file1" "$file2" "$OutDir/$out"
done | awk '
BEGIN {    FS = OFS = "\t"
}
{  in1 = $1
    in2 = $2
    out = $3
    print "Reading from " in1
    while((getline < in1) == 1)
        f1[$2]  -- updated to look for each $2 in the .txt file
    close(in1)
    print "Reading from " in2
    while((getline < in2) == 1)
        f2[$2] -- updated to look for each $2  from the .txt file in the matching .bed file
    close(in2)
    print "Writing to " out
    print "Match in both files and meet criteria:" > out
    for(k in f1)
        if(k in f2) {
            print k > out
            delete f1[k]
            delete f2[k]
        }
    print "Missing in comparison:" > out
    for(k in f2) {
        print k > out
        delete f2[k]
    }
    close(out)
    print "***"
}'

I am not sure how to perform the two if statements on the matching $2 values. Below are two sample input files as well as the desired output.

file1 (F113.txt)
Code:
Missing in IDP but found in Reference:
2   166848646   G   A   exonic  SCN1A   68  13  16;20   0;0 17;15   0;0 0;0 0;0     c.[5139C>T]+[=] 52.94
2   166245888   G   A   exonic  SCN1A   68  13  16;20   0;0 17;15   0;0 0;0 0;0     c.[5500G>T]+[=] 32

file2 (F113.bed)
Code:
Chrom    Position    Gene Sym    Target ID    Type    Zygosity    Genotype    Ref    Variant    Var Freq    Qual    Coverage    Ref Cov    Var Cov
chr2    166245425   SCN2A   AMPL5155065355  SNP Het C/T C   T   54  100   50    23  27
chr2    166848646   SCN1A   AMPL1543060606  SNP Het        G/A   G  A   52.9411764706   100 68  32  36

desired output
Code:
Match in both files and meet criteria:
2   166848646   G   A   exonic  SCN1A   68  13  16;20   0;0 17;15   0;0 0;0 0;0     c.[5139C>T]+[=] 52.94
Missing in comparison:
2   166245888   G   A   exonic  SCN1A   68  13  16;20   0;0 17;15   0;0 0;0 0;0     c.[5500G>T]+[=] 32

I hope I have included enough information, and thank you :).
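
One way the two conditions and the two output sections could be combined is sketched below, using the sample rows from this post. It rests on several assumptions: the first line of each file is a header and is skipped, fields are split on whitespace, and each $2 value occurs only once in the .txt file. The arrays hit and order are illustrative names.

```shell
#!/bin/sh
tmp=$(mktemp -d)
cat > "$tmp/F113.txt" <<'EOF'
Missing in IDP but found in Reference:
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5139C>T]+[=] 52.94
2 166245888 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5500G>T]+[=] 32
EOF
cat > "$tmp/F113_tvc.bed" <<'EOF'
Chrom Position Gene Sym Target ID Type Zygosity Genotype Ref Variant Var Freq Qual Coverage Ref Cov Var Cov
chr2 166245425 SCN2A AMPL5155065355 SNP Het C/T C T 54 100 50 23 27
chr2 166848646 SCN1A AMPL1543060606 SNP Het G/A G A 52.9411764706 100 68 32 36
EOF

result=$(awk '
    FNR == 1  { next }                              # assumption: line 1 is a header
    FNR == NR { A[$2] = $0; order[++n] = $2; next } # .txt: store line, keep input order
    ($2 in A) && $10 > 30 && $11 > 49 { hit[$2] }   # .bed: key matched and passed both tests
    END {
        print "Match in both files and meet criteria:"
        for (i = 1; i <= n; i++) if (order[i] in hit) print A[order[i]]
        print "Missing in comparison:"
        for (i = 1; i <= n; i++) if (!(order[i] in hit)) print A[order[i]]
    }
' "$tmp/F113.txt" "$tmp/F113_tvc.bed")

printf '%s\n' "$result"
rm -rf "$tmp"
```

The order array is there because awk does not guarantee any particular order for for (k in A), so the lines are replayed in the order they were read.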
# 2  
Old 10-15-2016
Hello cmccabe,

Could you please try the following and let me know how it goes? I haven't tested it at all.
Code:
for file in "/home/cmccabe/Desktop/comparison/missing/*.txt"
do
	file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}"
	if [[ -f file1 ]]
	then
		 awk 'FNR==NR{A[$2]=$0;next} ($2 in A){if($10>30 && $11>49){print A[$2] >> "Match_in_both_files_and_meet_criteria";delete A[$2]}} END{for(i in A){print A[i] >> "out_no_match_found_values"}}'  $file $file1
	fi
done

Thanks,
R. Singh
# 3  
Old 10-15-2016
Code:
for file in "/home/cmccabe/Desktop/comparison/missing/*.txt"
do
    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}"
    if [[ -f file1 ]]
    then
         awk 'FNR==NR{A[$2]=$0;next} ($2 in A){if($10>30 && $11>49){print A[$2] >> "Match_in_both_files_and_meet_criteria";delete A[$2]}} END{for(i in A){print A[i] >> "out_no_match_found_values"}}'  $file $file1 > /home/cmccabe/Desktop/comparison/final/${file}_final.txt
    fi
done

The redirection at the end of the awk line (> /home/cmccabe/Desktop/comparison/final/${file}_final.txt) was added to store the results of each comparison in the /home/cmccabe/Desktop/comparison/final directory.

Here is the error I get. Thank you :).

Code:
awk: cmd. line:1: fatal: cannot open file `/home/cmccabe/Desktop/comparison/test_tvc//home/cmccabe/Desktop/comparison/missing/*' for reading (No such file or directory)
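
The path in the message is consistent with two details of the loop. Because the pattern is inside double quotes, the shell does not expand it in the for statement, so the loop runs once with file set to the literal string ending in /*.txt. Then ${file%%.txt} strips only the extension, leaving the whole missing path to be appended to the test_tvc path. The sketch below shows the difference in a temporary directory; the counters quoted and unquoted are illustrative.

```shell
#!/bin/sh
tmp=$(mktemp -d)
mkdir "$tmp/missing"
touch "$tmp/missing/F113.txt" "$tmp/missing/H123.txt"

# Quoted: the pattern is not expanded, so the loop body runs once
# with the literal pattern as its value.
quoted=0
for file in "$tmp/missing/*.txt"; do quoted=$((quoted + 1)); done

# Unquoted wildcard (directory part still quoted): one pass per file.
unquoted=0
for file in "$tmp"/missing/*.txt
do
    unquoted=$((unquoted + 1))
    name=${file##*/}          # drop the directory part: F113.txt
    echo "${name%.txt}"       # drop the extension:      F113
done

echo "quoted passes: $quoted, unquoted passes: $unquoted"
rm -rf "$tmp"
```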

# 4  
Old 10-15-2016
Hello cmccabe,

Could you please try the following and let me know if this helps.
Code:
cd /home/cmccabe/Desktop/comparison/missing 
for file in *.txt
do
    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}.bed"
    if [[ -f $file1 ]]
    then
         awk 'FNR==NR{A[$2]=$0;next} ($2 in A){if($10>30 && $11>49){print A[$2] >> "Match_in_both_files_and_meet_criteria";delete A[$2]}} END{for(i in A){print A[i] >> "out_no_match_found_values"}}'  $file $file1
    fi
done

Also, I am not sure why you are redirecting the awk command's output into a file. If that is what you need, then you should use the following command instead.
Code:
cd /home/cmccabe/Desktop/comparison/missing 
for file in *.txt
do
    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}.bed"
    if [[ -f $file1 ]]
    then
         awk 'FNR==NR{A[$2]=$0;Q=FILENAME;next} ($2 in A){if($10>30 && $11>49){print A[$2] >> "Match_in_both_files_and_meet_criteria";print "Match found in both the files named " Q " and " FILENAME " is: " A[$2];delete A[$2]}} END{print "NON-matched lines between file named "Q " and " FILENAME " are: ";for(i in A){print A[i] >> "out_no_match_found_values";print A[i]}}'  $file $file1 > Output_final_file.txt
    fi
done

Again, I haven't tested it at all, so it may need a bit of tweaking. Kindly check it and let me know how it goes.

Thanks,
R. Singh

Last edited by RavinderSingh13; 10-16-2016 at 05:34 AM.. Reason: thanks to greet_sed for letting me know the missing $ in file1 variable.
# 5  
Old 10-15-2016
I used the below:

Code:
cd /home/cmccabe/Desktop/comparison/missing
for file in *.txt
do
    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}"
    if [[ -f file1 ]]
    then
         awk 'FNR==NR{A[$2]=$0;Q=FILENAME;next} ($2 in A){if($10>30 && $11>49){print A[$2] >> "Match_in_both_files_and_meet_criteria";print "Match found in both the files named " Q " and " FILENAME " is: " A[$2];delete A[$2]}} END{print "NON-matched lines between file named "Q " and " FILENAME " are: ";for(i in A){print A[i] >> "out_no_match_found_values";print A[i]}}'  $file $file1 > /home/cmccabe/Desktop/comparison/final/Output_final_file.txt
    fi
done

The output of the awk is stored for each pair of files that are compared, as that match/difference is important to know. The command does run, but no output file is created for each pair. So if 3 files are compared, say:

From /home/cmccabe/Desktop/comparison/missing, the file F113.txt is compared to the /home/cmccabe/Desktop/comparison/test_tvc file F113_tvc.bed; the matches and differences are then stored in /home/cmccabe/Desktop/comparison/final in a file called F113_final.txt

From /home/cmccabe/Desktop/comparison/missing, the file H123.txt is compared to the /home/cmccabe/Desktop/comparison/test_tvc file H123_tvc.bed; the matches and differences are then stored in /home/cmccabe/Desktop/comparison/final in a file called H123_final.txt

From /home/cmccabe/Desktop/comparison/missing, the file S111.txt is compared to the /home/cmccabe/Desktop/comparison/test_tvc file S111_tvc.bed; the matches and differences are then stored in /home/cmccabe/Desktop/comparison/final in a file called S111_final.txt

I hope this helps, and thank you very much :).
# 6  
Old 10-15-2016
Hello cmccabe,

Could you please try the following and let me know if this helps you.
Code:
cd /home/cmccabe/Desktop/comparison/missing
for file in *.txt
do
    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}.bed"
    if [[ -f $file1 ]]
    then
         awk 'FNR==NR{A[$2]=$0;Q=FILENAME;next} ($2 in A){if($10>30 && $11>49){print A[$2] >> "/path/to/file/Match_in_both_files_and_meet_criteria";print "Match found in both the files named " Q " and " FILENAME " is: " A[$2];delete A[$2]}} END{print "NON-matched lines between file named "Q " and " FILENAME " are: ";for(i in A){print A[i] >> "/path/to/file/out_no_match_found_values";print A[i]}}'  $file $file1 > /home/cmccabe/Desktop/comparison/final/Output_final_file.txt
    fi
done

Also, the files out_no_match_found_values and Match_in_both_files_and_meet_criteria could not be seen in /home/cmccabe/Desktop/comparison/missing because no complete path was given for them. If you need them in any other path, please use an absolute path, e.g. /path/to/file/out_no_match_found_values, for these output files and then it should work. Let me know how it goes.
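
A file name given to print >> inside awk is resolved relative to the directory awk was started in unless it is absolute. One way to avoid hard-coding the directory is to pass it in with -v, as in this minimal sketch. The variable names outdir, out and lines are illustrative, and a temporary directory stands in for the real one.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf 'a 1\nb 2\n' > "$tmp/in.txt"

# outdir is handed to awk with -v; awk joins it with a file name so the
# result does not depend on the current working directory.
awk -v outdir="$tmp" '
    { out = outdir "/out_no_match_found_values"; print $0 >> out }
    END { close(out) }
' "$tmp/in.txt"

lines=$(wc -l < "$tmp/out_no_match_found_values" | tr -d ' ')
echo "wrote $lines lines to $tmp/out_no_match_found_values"
rm -rf "$tmp"
```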

Thanks,
R. Singh

Last edited by RavinderSingh13; 10-16-2016 at 05:34 AM.. Reason: thanks to greet_sed for letting me know the missing $ in file1 variable.
# 7  
Old 10-15-2016
Here is what I have:

Code:
cd /home/cmccabe/Desktop/comparison/missing
for file in *.txt
do
    file1="/home/cmccabe/Desktop/comparison/test_tvc/${file%%.txt}.bed"
    if [[ -f file1 ]]
    then
         awk 'FNR==NR{A[$2]=$0;Q=FILENAME;next} ($2 in A){if($10>30 && $11>49){print A[$2] >> "/home/cmccabe/Desktop/comparison/missing/Match_in_both_files_and_meet_criteria";print "Match found in both the files named " Q " and " FILENAME " is: " A[$2];delete A[$2]}} END{print "NON-matched lines between file named "Q " and " FILENAME " are: ";for(i in A){print A[i] >> "/home/cmccabe/Desktop/comparison/missing/out_no_match_found_values";print A[i]}}'  $file $file1 > /home/cmccabe/Desktop/comparison/final/Output_final_file.txt
    fi
done

The code does run, but no output files are created in /home/cmccabe/Desktop/comparison/final.

Thank you very much :).
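
Three details in the script above would each contribute to the missing output, so they are worth checking. [[ -f file1 ]] tests for a file literally named file1 (the $ is missing), so the awk command is probably never reached. ${file%%.txt}.bed produces F113.bed, whereas the listing in the first post shows F113_tvc.bed. And every pair is redirected to the same Output_final_file.txt, so each pass would overwrite the previous one. The sketch below addresses all three under stated assumptions: temporary directories stand in for the real paths, the awk is reduced to the matching rule only, and line 1 of each file is treated as a header.

```shell
#!/bin/sh
# Stand-ins for the real directories under /home/cmccabe/Desktop/comparison.
base=$(mktemp -d)
mkdir "$base/missing" "$base/test_tvc" "$base/final"
printf 'hdr\n2 100 x\n' > "$base/missing/F113.txt"
printf 'hdr\nchr2 100 a b c d e f g 54 100\n' > "$base/test_tvc/F113_tvc.bed"
printf 'hdr\n2 300 y\n' > "$base/missing/H123.txt"   # no matching .bed on purpose

cd "$base/missing" || exit 1
for file in *.txt
do
    prefix=${file%.txt}
    bed="$base/test_tvc/${prefix}_tvc.bed"
    if [ ! -f "$bed" ]        # note the $: test the variable, not the word "bed"
    then
        echo "no .bed file for $file" >&2
        continue
    fi
    # One output file per pair, named after the shared prefix.
    awk 'FNR == 1  { next }
         FNR == NR { A[$2] = $0; next }
         ($2 in A) && $10 > 30 && $11 > 49 { print A[$2] }
    ' "$file" "$bed" > "$base/final/${prefix}_final.txt"
done
cd / || exit 1

created=$(ls "$base/final")
content=$(cat "$base/final/F113_final.txt")
echo "created: $created"
rm -rf "$base"
```

Only F113_final.txt is created here, because H123.txt has no partner in the stand-in test_tvc directory and is reported on standard error instead.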