Match patterns between two files and extract certain range of strings


 
# 1  
Old 12-09-2019

Hi,

I need help matching patterns between two different files and extracting a region of the matching strings.

inputfile1.fa
Code:
>l-WR24-1:1
GCCGGCGTCGCGGTTGCTCGCGCTCTGGGCGCTGGCGGCTGTGGCTCTACCCGGCTCCGG
GGCGGAGGGCGACGGCGGGTGGTGAGCGGCCCGGGAGGGGCCGGGCGGTGGGGTCACGTG
CGGCGGGCGGGGCGCGGGCTGACCCAGCTGTGCCCGCAGGCGCCCGGGCGGGCCGGGGGC
CGTGGCGGAGGAGGAGCGCTGCACGGTGGAGCGTCGGGCCGACCTCACCTACGCGGAGTT
CTACCACAAAGTGGACTTGCCCTTCCAGGAGTATGTGGAGCAGCTGCTGCACCCCCAGGA
CCCCACCTCCCTGGGCAATGACACCCTGTACTTCTTCGGGGACAACAACTTCACCGAGTG
GGCCTCTCTCTTTCGGCACTACTCCCCACCCCCATTTGGCCTGCTGGGAACCGCTCCAG
>l-ZF385A-2:1
CAGAATGTGGGTGAGGGTGGCGCCTATGAGGCTGAGCTTCGGGTCACCGCCCCTCCAGAG
GCTGAGTACTCAGGACTCGTCAGACACCCAGGGGTGAGATGAGACTCTCGAGTGGGATTT
GGGAGGATACCCCTCTAGAGGGGACACCAAAACCTGACCAGTGCCCACCCCATCTCCAGA
ACTTCTCCAGCCTGAGCTGTGACTACTTTGCCGTGAACCAGAGCCGCCTGCTGGTGTGTG
ACCTGGGCAACCCCATGAAGGCAGGAGCCAGTCTGTGGGGTGGCCTTCGGTTTACAGTCC
CTCATCTCCGGGACACTAAGAAAACCATCCAGTTTGACTTCCAGATCCTCAGGTAGGGAG
TGAGTGTGTCTAGGCTGGGGCTGAGCTGGGGACGGAAGGGAGGGCTGGGCGCCATTCTCA
CTGGCTGCACTCCAGCACCTCAGTCTTGCCTCCATCCCACAGCAAGAATCTCAACAACTC
GCAAAGCGACGTGGTTTCCTTTCGGCTCTCCGTGGAGGCTCAGGCCCAGGTCACCCTGAA
CGGGTCAGTGCCAGGCAAAATGGGGTCT
>l-YJC-1:1
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACG
CGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGT
CAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCAGGCACAGCATCCCCACGGG
CCTCCACGCCAACCTGTCCGAGGGCCGCCCCGTGGGTCCGGCCCGCCGTGGCGCCTCATC
GCTGCTCGGCCCGG

inputfile2.txt
Code:
l-WR24-1:1	1	71
l-WR73-7:4	28	506
l-WR86-1:1	140	138
l-YJC-1:1	1	161
l-YJC-1:1	1	165
l-ZFP57-11:1	1	991
l-ZF320-1:5	5	6031
l-ZMYND10-2:1	5	253
l-ZF329-4:1	151	5704
l-ZF708-1:1	195	3744
l-ZF843-3:1	14	1053
l-ZF385A-2:1	33	105
l-ZF843-3:2	4	235

The output file should look like the one below. It should contain the strings extracted from inputfile1.fa according to the start and stop positions (columns 2 and 3) given in inputfile2.txt above.


Code:
>l-WR24-1:1	1	71
GCCGGCGTCGCGGTTGCTCGCGCTCTGGGCGCTGGCGGCTGTGGCTCTACCCGGCTCCGG
GGCGGAGGGCG
>l-ZF385A-2:1	33	105
TGAGCTTCGGGTCACCGCCCCTCCAGAGGCTGAGTACTCAGGACTCGTCAGACACCCAGG
GGTGAGATGAGAC
>l-YJC-1:1	1	161
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACG
CGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGT
CAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCA
>l-YJC-1:1	1	165
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACG
CGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGT
CAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCAGGCA

I have tried to match these two files using bioinformatics tools like seqtk and bedtools, but they did not give me what I want, and some of their input requirements don't fit my data. I am still learning awk and would like to do this in awk. Below is one of my attempts to match the two files; it failed, and I do not know how to extract a given region of characters.

Code:
awk 'FNR == NR{A[$1]; next}/^\>/{f = (substr($1,2) in A) ? 1 : 0}f' inputfile2.txt inputfile1.fa

I would appreciate your kind help. Thanks.

Last edited by bunny_merah19; 12-09-2019 at 10:57 AM..
# 2  
Old 12-09-2019
A bit verbose, but a possible starting point.
Run it as awk -f bunny.awk inputfile2.txt inputfile1.fa, where bunny.awk is:
Code:
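# Print every requested range for the record just read, then reset the state.
function printRec(   i, t) {
   for (i in s) {
      split(i, t, OFS)                    # key is "id start end"; t[1] is the id
      if (f == ">" t[1])
        print ">" i ORS substr(a[f], s[i], e[i]-s[i]+1)
   }
   f = ""
   split("", a)                           # clear the accumulated sequence
}
# First file (inputfile2.txt): remember start and end for each "id start end" line.
FNR==NR {
   idx = $1 OFS $2 OFS $3
   s[idx] = $2
   e[idx] = $3
   next
}
# Second file (inputfile1.fa): a new header closes the previous record.
/^>/ && f { printRec() }
f    { a[f] = (f in a) ? a[f] $1 : $1 }   # append sequence lines to the current record
/^>/ { f = $1 }                           # remember the current header
END  { printRec() }                       # flush the last record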

results in:
Code:
>l-WR24-1:1 1 71
GCCGGCGTCGCGGTTGCTCGCGCTCTGGGCGCTGGCGGCTGTGGCTCTACCCGGCTCCGGGGCGGAGGGCG
>l-ZF385A-2:1 33 105
TGAGCTTCGGGTCACCGCCCCTCCAGAGGCTGAGTACTCAGGACTCGTCAGACACCCAGGGGTGAGATGAGAC
>l-YJC-1:1 1 161
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCA
>l-YJC-1:1 1 165
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCAGGCA
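
The extraction step in the script above comes down to one substr() call. A minimal sketch of just that arithmetic, assuming 1-based, inclusive start and end positions as in inputfile2.txt (extract_range is a made-up helper name, not part of bunny.awk):

```shell
# Hypothetical helper: print the START..END slice (1-based, inclusive) of a string.
extract_range() {
    # $1 = sequence, $2 = start, $3 = end
    awk -v seq="$1" -v s="$2" -v e="$3" 'BEGIN {
        # substr() takes a start position and a LENGTH, not an end position,
        # so the inclusive range s..e has length e - s + 1.
        print substr(seq, s, e - s + 1)
    }'
}

extract_range "ACGTACGTAC" 3 6   # prints GTAC
```

Leaving out the "+ 1" would silently drop the last base of every range, which is easy to miss on long sequences.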


Last edited by vgersh99; 12-09-2019 at 12:51 PM..
# 3  
Old 12-09-2019
Try also
Code:
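awk '
NR==FNR         {PAT[$1,$2,$3]                  # first file: store id, start, end as one key (joined by SUBSEP)
                 next
                }
                {IX  = $1                       # second file: one record per ">" entry, $1 is the id
                 L1  = length ($1) + 1          # offset that skips "id|" in the rebuilt record
                 $1 = $1 "|"                    # rebuild $0 with OFS="": id, "|", then the joined sequence
                 $0 = $0
                 for (p in PAT) {split (p, T)
                                 if (IX == T[1]) print RS p ORS substr ($0, T[2]+L1, T[3]-T[2]+1)
                                }
                }
' SUBSEP="\t" inputfile2.txt   RS=">"  OFS="" inputfile1.fa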

>l-WR24-1:1    1    71
GCCGGCGTCGCGGTTGCTCGCGCTCTGGGCGCTGGCGGCTGTGGCTCTACCCGGCTCCGGGGCGGAGGGCG
>l-ZF385A-2:1    33    105
TGAGCTTCGGGTCACCGCCCCTCCAGAGGCTGAGTACTCAGGACTCGTCAGACACCCAGGGGTGAGATGAGAC
>l-YJC-1:1    1    161
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCA
>l-YJC-1:1    1    165
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCAGGCA
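
The one-liner above relies on RS=">" so that awk reads each FASTA entry as a single record. A small sketch of that mechanism on its own, assuming header lines contain only the ID with no spaces (fasta_lengths is a made-up name used for illustration):

```shell
# Hypothetical helper: print "id length" for each entry of FASTA text on stdin.
fasta_lengths() {
    awk 'BEGIN { RS = ">" }
    NF {                                         # skip the empty record before the first ">"
        seq = ""
        for (i = 2; i <= NF; i++) seq = seq $i   # fields 2..NF are the sequence lines
        print $1, length(seq)
    }'
}

printf '>id1\nACGT\nAC\n>id2\nGG\n' | fasta_lengths
# prints:
# id1 6
# id2 2
```

With the default FS, newlines inside a record count as field separators, which is why the wrapped sequence lines show up as separate fields that can be glued back together.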

If you really need the output lines wrapped at 60 characters, replace the if statement with
Code:
                 if (IX == T[1])    {print RS p
                                     TMP = substr ($0, T[2]+L1, T[3]-T[2]+1)
                                     PTR = 1
                                     while (PTR <= length (TMP))   {print substr (TMP, PTR, 60)
                                                                    PTR += 60
                                                                   }
                                    }
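
The wrapping loop can also be tried in isolation. A minimal sketch, with wrap_seq as a made-up helper name and the width passed as a parameter instead of the fixed 60; the loop bound is "<=" so that a trailing chunk of exactly one character is still printed:

```shell
# Hypothetical helper: print a string in chunks of at most WIDTH characters.
wrap_seq() {
    # $1 = sequence, $2 = width (60 in the thread)
    awk -v seq="$1" -v w="$2" 'BEGIN {
        # p walks over the chunk start positions: 1, 1+w, 1+2w, ...
        for (p = 1; p <= length(seq); p += w)
            print substr(seq, p, w)
    }'
}

wrap_seq "ACGTACG" 3
# prints:
# ACG
# TAC
# G
```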


Last edited by RudiC; 12-09-2019 at 03:20 PM..
# 4  
Old 12-09-2019
if you need to "fold" the lines at 60 char width, pipe the output to fold:
Code:
awk .... | fold -w 60
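
A quick way to see what fold does, using a made-up two-line input and a width of 4 instead of 60 so the effect is visible:

```shell
# The header is exactly 4 characters, so only the sequence line gets wrapped.
printf '>hdr\nACGTACGTAC\n' | fold -w 4
# prints:
# >hdr
# ACGT
# ACGT
# AC
```

Note that fold wraps every line longer than the width, so a header line longer than 60 characters would be split as well. The headers in this thread are short enough for that not to matter.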

# 5  
Old 12-09-2019
Hi vgersh99,

Your code works like a charm. Thanks a million.

--- Post updated at 03:59 AM ---

Hi RudiC,

Your code works great on my real data too. Thanks a million.

--- Post updated at 04:43 AM ---

Quote:
Originally Posted by vgersh99
if you need to "fold" the lines at 60 char width, pipe the output to fold:
Code:
awk .... | fold -w 60

Ok, got it... thanks