Match patterns between two files and extract certain range of strings


Login or Register for Dates, Times and to Reply

 
Thread Tools Search this Thread
Top Forums UNIX for Beginners Questions & Answers Match patterns between two files and extract certain range of strings
# 1  
Match patterns between two files and extract certain range of strings

Hi,

I need help to match patterns from between two different files and extract region of strings.

inputfile1.fa
Code:
>l-WR24-1:1
GCCGGCGTCGCGGTTGCTCGCGCTCTGGGCGCTGGCGGCTGTGGCTCTACCCGGCTCCGG
GGCGGAGGGCGACGGCGGGTGGTGAGCGGCCCGGGAGGGGCCGGGCGGTGGGGTCACGTG
CGGCGGGCGGGGCGCGGGCTGACCCAGCTGTGCCCGCAGGCGCCCGGGCGGGCCGGGGGC
CGTGGCGGAGGAGGAGCGCTGCACGGTGGAGCGTCGGGCCGACCTCACCTACGCGGAGTT
CTACCACAAAGTGGACTTGCCCTTCCAGGAGTATGTGGAGCAGCTGCTGCACCCCCAGGA
CCCCACCTCCCTGGGCAATGACACCCTGTACTTCTTCGGGGACAACAACTTCACCGAGTG
GGCCTCTCTCTTTCGGCACTACTCCCCACCCCCATTTGGCCTGCTGGGAACCGCTCCAG
>l-ZF385A-2:1
CAGAATGTGGGTGAGGGTGGCGCCTATGAGGCTGAGCTTCGGGTCACCGCCCCTCCAGAG
GCTGAGTACTCAGGACTCGTCAGACACCCAGGGGTGAGATGAGACTCTCGAGTGGGATTT
GGGAGGATACCCCTCTAGAGGGGACACCAAAACCTGACCAGTGCCCACCCCATCTCCAGA
ACTTCTCCAGCCTGAGCTGTGACTACTTTGCCGTGAACCAGAGCCGCCTGCTGGTGTGTG
ACCTGGGCAACCCCATGAAGGCAGGAGCCAGTCTGTGGGGTGGCCTTCGGTTTACAGTCC
CTCATCTCCGGGACACTAAGAAAACCATCCAGTTTGACTTCCAGATCCTCAGGTAGGGAG
TGAGTGTGTCTAGGCTGGGGCTGAGCTGGGGACGGAAGGGAGGGCTGGGCGCCATTCTCA
CTGGCTGCACTCCAGCACCTCAGTCTTGCCTCCATCCCACAGCAAGAATCTCAACAACTC
GCAAAGCGACGTGGTTTCCTTTCGGCTCTCCGTGGAGGCTCAGGCCCAGGTCACCCTGAA
CGGGTCAGTGCCAGGCAAAATGGGGTCT
>l-YJC-1:1
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACG
CGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGT
CAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCAGGCACAGCATCCCCACGGG
CCTCCACGCCAACCTGTCCGAGGGCCGCCCCGTGGGTCCGGCCCGCCGTGGCGCCTCATC
GCTGCTCGGCCCGG

inputfile2.txt
Code:
l-WR24-1:1	1	71
l-WR73-7:4	28	506
l-WR86-1:1	140	138
l-YJC-1:1	1	161
l-YJC-1:1	1	165
l-ZFP57-11:1	1	991
l-ZF320-1:5	5	6031
l-ZMYND10-2:1	5	253
l-ZF329-4:1	151	5704
l-ZF708-1:1	195	3744
l-ZF843-3:1	14	1053
l-ZF385A-2:1	33	105
l-ZF843-3:2	4	235

The output file as below. It should contain the strings extracted from inputfile1.fa according to start and stop numbers (in blue) indicated in inputfile2.txt as shown above.


Code:
>l-WR24-1:1	1	71
GCCGGCGTCGCGGTTGCTCGCGCTCTGGGCGCTGGCGGCTGTGGCTCTACCCGGCTCCGG
GGCGGAGGGCG
>l-ZF385A-2:1	33	105
TGAGCTTCGGGTCACCGCCCCTCCAGAGGCTGAGTACTCAGGACTCGTCAGACACCCAGG
GGTGAGATGAGAC
>l-YJC-1:1	1	161
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACG
CGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGT
CAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCA
>l-YJC-1:1	1	165
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACG
CGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGT
CAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCAGGCA

I have tried to match these two files using bioinformatics tools like seqtk and bedtools but they did not give me what I want and some of their data requirements just don't fit mine. I am still learning awk and I want to do this in awk. Below are one of the codes that I did to match the two files but failed and I do not know how to extract certain regions of characters.

Code:
awk 'FNR == NR{A[$1]; next}/^\>/{f = (substr($1,2) in A) ? 1 : 0}f' inputfile2.txt inputfile1.fa

Would appreciate your kind help. thanks

Last edited by bunny_merah19; 12-09-2019 at 10:57 AM..
# 2  
a bit verbose, but a possible starter.
awk -f bunny.awk inputfile2.txt inputfile1.fa where bunny.awk is:
Code:
function printRec() {
   #print a[f], s[f], e[f]
   for ( i in s) {
      split(i,t, OFS)
      if (f == ">" t[1])
        print ">" i ORS substr(a[f],s[i],e[i]-s[i]+1)
   }
   f=""
   split("",a)
}
FNR==NR {
   idx=$1 OFS $2 OFS $3
   s[idx]=$2
   e[idx]=$3
   next
}
/>/ && f {
   printRec()
}
f { a[f]=(f in a)?a[f] $1:$1 }
/^>/ { f=$1 }
END { printRec() }

results in:
Code:
>l-WR24-1:1 1 71
GCCGGCGTCGCGGTTGCTCGCGCTCTGGGCGCTGGCGGCTGTGGCTCTACCCGGCTCCGGGGCGGAGGGCG
>l-ZF385A-2:1 33 105
TGAGCTTCGGGTCACCGCCCCTCCAGAGGCTGAGTACTCAGGACTCGTCAGACACCCAGGGGTGAGATGAGAC
>l-YJC-1:1 1 161
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCA
>l-YJC-1:1 1 165
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCAGGCA


Last edited by vgersh99; 12-09-2019 at 12:51 PM..
This User Gave Thanks to vgersh99 For This Post:
# 3  
Try also
Code:
awk '
NR==FNR         {PAT[$1,$2,$3]
                 next
                }
                {IX  = $1
                 L1  = length ($1) + 1
                 $1 = $1 "|"
                 $0 = $0
                 for (p in PAT) {split (p, T)
                                 if (IX == T[1]) print RS p ORS substr ($0, T[2]+L1, T[3]-T[2]+1)
                                }
                }
' SUBSEP="\t" inputfile2.txt   RS=">"  OFS="" inputfile1.fa

>l-WR24-1:1    1    71
GCCGGCGTCGCGGTTGCTCGCGCTCTGGGCGCTGGCGGCTGTGGCTCTACCCGGCTCCGGGGCGGAGGGCG
>l-ZF385A-2:1    33    105
TGAGCTTCGGGTCACCGCCCCTCCAGAGGCTGAGTACTCAGGACTCGTCAGACACCCAGGGGTGAGATGAGAC
>l-YJC-1:1    1    161
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCA
 >l-YJC-1:1    1    165
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCAGGCA

If you really really need the output lines 60 chars in length, use
Code:
                 if (IX == T[1])    {print RS p
                                     TMP = substr ($0, T[2]+L1, T[3]-T[2]+1)
                                     PTR = 1
                                     while (PTR < length (TMP))    {print substr (TMP, PTR, 60)
                                                                    PTR += 60
                                                                   }
                                    }


Last edited by RudiC; 12-09-2019 at 03:20 PM..
These 3 Users Gave Thanks to RudiC For This Post:
# 4  
if you need to "fold" the lines at 60 char width, pipe the output to fold:
Code:
awk .... | fold -w 60

These 2 Users Gave Thanks to vgersh99 For This Post:
# 5  
Quote:
Originally Posted by vgersh99
a bit verbose, but a possible starter.
awk -f bunny.awk inputfile2.txt inputfile1.fa where bunny.awk is:
Code:
function printRec() {
   #print a[f], s[f], e[f]
   for ( i in s) {
      split(i,t, OFS)
      if (f == ">" t[1])
        print ">" i ORS substr(a[f],s[i],e[i]-s[i]+1)
   }
   f=""
   split("",a)
}
FNR==NR {
   idx=$1 OFS $2 OFS $3
   s[idx]=$2
   e[idx]=$3
   next
}
/>/ && f {
   printRec()
}
f { a[f]=(f in a)?a[f] $1:$1 }
/^>/ { f=$1 }
END { printRec() }

results in:
Code:
>l-WR24-1:1 1 71
GCCGGCGTCGCGGTTGCTCGCGCTCTGGGCGCTGGCGGCTGTGGCTCTACCCGGCTCCGGGGCGGAGGGCG
>l-ZF385A-2:1 33 105
TGAGCTTCGGGTCACCGCCCCTCCAGAGGCTGAGTACTCAGGACTCGTCAGACACCCAGGGGTGAGATGAGAC
>l-YJC-1:1 1 161
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCA
>l-YJC-1:1 1 165
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCAGGCA

Hi vgersh99,

Your codes work like charm. Thanks a million. Smilie

--- Post updated at 03:59 AM ---

Quote:
Originally Posted by RudiC
Try also
Code:
awk '
NR==FNR         {PAT[$1,$2,$3]
                 next
                }
                {IX  = $1
                 L1  = length ($1) + 1
                 $1 = $1 "|"
                 $0 = $0
                 for (p in PAT) {split (p, T)
                                 if (IX == T[1]) print RS p ORS substr ($0, T[2]+L1, T[3]-T[2]+1)
                                }
                }
' SUBSEP="\t" inputfile2.txt   RS=">"  OFS="" inputfile1.fa

>l-WR24-1:1    1    71
GCCGGCGTCGCGGTTGCTCGCGCTCTGGGCGCTGGCGGCTGTGGCTCTACCCGGCTCCGGGGCGGAGGGCG
>l-ZF385A-2:1    33    105
TGAGCTTCGGGTCACCGCCCCTCCAGAGGCTGAGTACTCAGGACTCGTCAGACACCCAGGGGTGAGATGAGAC
>l-YJC-1:1    1    161
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCA
 >l-YJC-1:1    1    165
GTCCCGCCCTCGCATGCGCCTGGTGGTCACCGCGGACGACTTTGGTTACTGCCCGCGACGCGATGAGGGTATCGTGGAGGCCTTTCTGGCCGGGGCTGTGACCAGCGTGTCCCTGCTGGTCAACGGTGCGGCCACGGAGAGCGCGGCGGAGCTGGCCCGCAGGCA

If you really really need the output lines 60 chars in length, use
Code:
                 if (IX == T[1])    {print RS p
                                     TMP = substr ($0, T[2]+L1, T[3]-T[2]+1)
                                     PTR = 1
                                     while (PTR < length (TMP))    {print substr (TMP, PTR, 60)
                                                                    PTR += 60
                                                                   }
                                    }

Hi RudiC,

Your codes work great on my real data too. Thanks a million Smilie

--- Post updated at 04:43 AM ---

Quote:
Originally Posted by vgersh99
if you need to "fold" the lines at 60 char width, pipe the output to fold:
Code:
awk .... | fold -w 60

Ok, got it... thanks Smilie
Login or Register for Dates, Times and to Reply

Previous Thread | Next Thread
Thread Tools Search this Thread
Search this Thread:
Advanced Search

Test Your Knowledge in Computers #762
Difficulty: Medium
The L2 cache, and higher-level caches, are never shared between the CPU cores.
True or False?

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Match to range in files

I am trying to create a script that will use the position in column A ($1) in 48850.txt and search for it in columns B ($2) in gene.txt. Then when it finds a match it copies the text in column A ($1) and places it in column C ($3) of 48850.txt. I have attached the files. Thank you :). The... (2 Replies)
Discussion started by: cmccabe
2 Replies

2. Shell Programming and Scripting

Match strings in 2 different files

Hi, i am trying to match strings from 2 different files based on position like below:- file1 (tab delimited) f07270 lololol fff u12730 gggddd dddkkrr mmm file2 (not tab delimited) %f07270 APSLH bl%alalalalallaadsdsfdfdfdgsgfss %g13450 GDIDFLRIP%ILITEAPPRKgsfgsgsf %d08880... (11 Replies)
Discussion started by: redse171
11 Replies

3. Shell Programming and Scripting

Extract multiple occurance of strings between 2 patterns

I need to extract multiple occurance strings between 2 different patterns in given line. For e.g. in below as input ------------------------------------------------------------------------------------- mike(hussey) AND mike(donald) AND mike(ryan) AND mike(johnson)... (8 Replies)
Discussion started by: sameermohite
8 Replies

4. Shell Programming and Scripting

awk extract strings matching multiple patterns

Hi, I wasn't quite sure how to title this one! Here goes: I have some already partially parsed log files, which I now need to extract info from. Because of the way they are originally and the fact they have been partially processed already, I can't make any assumptions on the number of... (8 Replies)
Discussion started by: chrissycc
8 Replies

5. Shell Programming and Scripting

Using AWK to match CSV files with duplicate patterns

Dear awk users, I am trying to use awk to match records across two moderately large CSV files. File1 is a pattern file with 173,200 lines, many of which are repeated. The order in which these lines are displayed is important, and I would like to preserve it. File2 is a data file with 456,000... (3 Replies)
Discussion started by: isuewing
3 Replies

6. Shell Programming and Scripting

How to extract information from two files with data range

Hi, I want to make a query about extracting data from two files that both have data ranges. the data that i want to extract; when there is matching between file1 column 2 is equal to file2 column2 , and file1 column 3 and column 4 is within the range of file2 columns 3 and 4. I would like rows... (1 Reply)
Discussion started by: houkto
1 Replies

7. Shell Programming and Scripting

Extract patterns and copy them in different files

Hi All, I have a file which looks like this: Name1;A01 Name2;A01.047 Name3;A01.047.025 Newname1;B01 NewName2;B01.056.32 NewName3;B04.09.43 NewNewName1;C01.03 NewNewName2;C01.034.44As you can see, in the file there is some name and followed by the name is some identifier. These... (5 Replies)
Discussion started by: shoaibjameel123
5 Replies

8. Shell Programming and Scripting

Find files that do not match specific patterns

Hi all, I have been searching online to find the answer for getting a list of files that do not match certain criteria but have been unsuccessful. I have a directory that has many jpg files. What I need to do is get a list of the files that do not match both of the following patterns (I have... (21 Replies)
Discussion started by: nikos-koutax
21 Replies

9. Shell Programming and Scripting

script to match patterns in 2 different files.

I am new to shell scripting and need some help. I googled, but couldn't find a similar scenario. Basically, I need to rename a datafile. This is the scenario - I have a file, readonly.txt that has 2 columns - file# and name. I have another file,missing_files.txt that has id and name. Both the... (3 Replies)
Discussion started by: mathews
3 Replies

10. Shell Programming and Scripting

print range between two patterns if it contains a pattern within the range

I want to print between the range two patterns if a particular pattern is present in between the two patterns. I am new to Unix. Any help would be greatly appreciated. e.g. Pattern1 Bombay Calcutta Delhi Pattern2 Pattern1 Patna Madras Gwalior Delhi Pattern2 Pattern1... (2 Replies)
Discussion started by: joyan321
2 Replies

Featured Tech Videos