How to extract the partial matching strings among two files?


 
# 1  
Old 06-11-2019
How to extract FASTA sequences based on partially matching strings from another file?

I have two files, as shown below.
File 1:
Code:
>Contig_152_415 [51615 - 50833] (REVERSE SENSE) 
>Contig_152_420 [50829 - 50215] (REVERSE SENSE) 
>Contig_152_472 [46116 - 45550] (REVERSE SENSE) 
>Contig_152_484 [44618 - 44079] (REVERSE SENSE)

File 2:
Code:
>Contig_152:49081-49929
ATCGAGCAGCGCCGCGTGCGGTGCACCCTTGTGCAGATCGGGAGTAACCACGCGCACGGC
GGAACGACGGCCGGGAGAGGTGGGTTTGAACTTCATCAATGGCATGGGATGGACCTCAGG
CCTTGGCCG
>Contig_152:50833-51615
CGGAGTAGCTTCGATCAGCGTGACCGGTACCGAGCGACCGTCTTCAGTGAAGACGCGGCT
CATACCAGCCTTGCGGCCCACGAAGCCCAACGAATATTTCTTCGTCATGGTCGTAGTCCT
CAGGTCAGCTTGATCTGGACGTCGACGCCAGCCGCGAGTTCGAGCTTCATCAGCGCGTCC
ACGGTCTTGTCGTTCGGGTCGACGATATCGAGCACACGCTTGTGCGTGCGGGTTTCGTAT
TGG
>Contig_152:50215-50829
TGCCAGCCACTCCTCGACCTTCTTGACCGCGTCGGCGGTGATCACGACCGTATCGGCCCC
GACCAGAGCGACCGGATCCAGACCCTGCACGTCACGCACCTGCACATACGGCAGGTTGCG
AGCGGACAGATACAGGTGCTCGGAAGCCTCTTCGGTGACGATCAGCGGGCGCTTGCCCAC
>Contig_152:45550-46116
GTTACGGAACGGGAACTTGAACGCTGCCAGCAGCGCCTTCGCTTCCGCATCCGTCTTGGC
GGTGGTGGTGATGGCGATATCCATACCGCGGATCGCGTCGACGGCGTCGAAGTCGATTTC
>Contig_152:44079-44618
AGCCTTCTTGGCTTCCTTGCGAATGATGACTTCACCGGCGTACTTCACACCCTTGCCCTT
GTAGGGCTCCGGCGGACGGAAACCGCGAATCTTGGCGGCAACTTCGCCGACGCGCTGCTT
>Contig_65:14454-14897
GCCCTCCACCGTCAAACCCATGCTGCGGGCAGAACCCGCAATCGTACGCACCGCCGCTTC
CAGCTCGGCTGCCGTCAAATCGGCTTCCTTGACCTTGGCGATCTCTTCCAGCTGCTTGCG
>Contig_65:12254-12652
CTTGATTTCGACAGTCGCGCCAGCTTCGGTCAGTTCCTTCTTGATCTTCTCGGCTTCGTC
CTTCGAAGCGCCTTCCTTCAGCATGCCACCGGCTTCGGTCAGATCCTTGGCTTCCTTCAG
ACCCAGGCCGGTCACGGCGCGGACGGCCTTAATGACGCCGACCTTGTTGGTGCCACAGTT
GGTCAGCACCACGTTGAACTCGGTCTGCTCTTCAACCACGGCGGCCGGGCCAGCTGCAGC
AGCAGCGACCGGGGCAGCGGCGGAGACGCCGAACTTTTCTTCGATGGCCTTGACCAGCTC
CATCACTTCCATCAGGGACTTCTCGGCGATGGCGTCGACGATCTGTTCGTTGGTAAGGGA
CATTGTGATTACCTTTAAATGATTTTCTGGATTGGGGTT
>Contig_152:46437-46805
CAGCACTTCGGGAGCGAGCGAGACGATCTTCATGAACTTCTCGGAACGCAGCTCGCGCGT
CACCGGCCCGAAGATGCGGGTCCCGATCGGCTCCTGCTTATTGTTCAACAGCACAGCGGC
>Contig_152:47286-47711
GCGCACCGTCCGGGTCACGAAAGTGGTGGTCACCGAGAGCTTGGCAGCGGCAAGGCGGAA
CGCCTCACGTGCGGTCTCTTCGGGGATTCCCTCGATTTCATAGATCATGCGGCCCGGCTG
CTCGCT
>Contig_83:12952-13500
GGGCTCACCCCTAGCTCTTGCAAGATGGATTGGGGTACAATATGCGGCTTGCTGGCCAGC
CTTGCGTCTGGCCCTCCTACGAACCTCGAGCAATCGCCGTGCGGCGACTGCCGGTCCAAT
AAGGCGGAG
>Contig_98:34509-34868
TGCCGCCAGCGCGCCCTTTGCCTTCTCGGCCAGCGCGGCAAAGCCGGCGGCGTCGTGCAC
GGCGATATCGGCCAGCACCTTACGGTCCAGGGTGATGCCAGCCTTGAGCAGGCCATTCAT
>Contig_49:5824-6093
GGAGAACCAGTCATGGCACATAAAAAGGGCGTAGGTTCCTCGCGCAACGGTCGCGATTCC
AACCCGAAGTACCTCGGCGTGAAGATCTTCGGTGGCCAGGCCATCGACGCCGGCAACATC
>Contig_65:7816-11976
GTCTTCCTGCAGGAATTCGCGATAGGAATCCACCTGGATGGCAAGCAGGAACGGCACTTC
GAGGATCGAGCGCTGCTTACCGAAGTCCTTGCGGATACGCTTTTTTTCGGTGAACGAATA
AGACGTCATGAGGTCTTCACC
>Contig_1:95028-96590
GGAGGACGTAGCCGGCCAGGTCGTCGCGGATGTTGCGGCGGGTACCGCCGGCATCGATAT
AGGCCAGGTCGGCCAGGCGCGCATGGCTGGGGCCGGAATTGTAGATGCGCAACGCGCGCT
GGCTGCCGTGCCCGGCCAGGCTGGCATGCAGCGCCGGTTGCGCTGCGCGGGCATGGCTGG
GCTGTGCAAACACGGGGGTGGAATAACGCAGCAGGTCGGTCTCGTTGCCGATCGCGTGCT
GGG
>Contig_152:47705-48556
GATGTCCTCACCACGCTTGCCGATCACCACACCCGGACGGGCGGTGTGAATGGTCACGCG
GGCGGTTTTGGCAGGGCGCTCGATCAAGATCTTGCTGACACCTGCCTGCGCGAGCTTCTT
>Contig_152:42930-43502
GCCCTTGACCGTCTTGCTGACGCGATTGACCGCGACCAGCTTTTCGATCATGCCATCGTC
GACTTTCTCTTCGCGGTTACGGTCGCGATCACGACCCCGCGGTGCACGTTCTTCTGCCAT
CTTGATTCCTTGATTGATTGAGTATGTACGGCT
>Contig_152:51495-51989
TCCGCCCAAAAACTGAGGCAGCCCGGTAACCCGGCCTGCCCAGACGGAAAAGTATAATGC
GCAACAAGAGCACGG
>Contig_152:39834-40226
GACGCGACGCTTCTTCGGCGGACGGCACCCGTTGTGCGGGATTGGCGTCACGTCGATGAT
GTTGGTGATCTTGTAGCCCACGTTGTTCAACGAACGCACGGCCGACTCACGGCCCGGACC
>Contig_152:40237-40797
CTTCCTGATCGCCTTGCGCGGACCCTTGCGGGTGCGGGCGTTGGTACGGGTGCGCTGACC
ACGCAGCGGAAGACCACGGCGATGACGCAGACCGCGATAGCAGCCCAGGTCCATCAGTCG
>Contig_152:48805-49077
CCTGCCCGACTTCTTGTCGCCACCGTGACCCTTGAAGGTCCGGGTGACGGCAAATTCGCC
GAGCTTGTGGCCGACCATATTCTCGTTGACGAGCACCGGAATGTGGTTCTTGCCGTTATG
>Contig_1:93980-94864
GGCCAGGCCGGCCTGCTTCATCACTTCGGCGGCGTAGTCTTCCACCACCTTCTCGATGCC
TTCGCCAACGGCCAGGCGCTGGAAGCCGATCACTTCGGCGCCGGCGGCCTTGACTGCCTG

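The header lines in file 1 and the FASTA headers in file 2 partially match: both carry the contig name and the same pair of coordinates, although file 1 lists the coordinates in the opposite order (e.g. [51615 - 50833] in file 1 versus 50833-51615 in file 2). I need to extract the FASTA sequences from file 2 whose headers correspond to the entries in file 1.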
The expected output:
Code:
>Contig_152:50833-51615
CGGAGTAGCTTCGATCAGCGTGACCGGTACCGAGCGACCGTCTTCAGTGAAGACGCGGCT
CATACCAGCCTTGCGGCCCACGAAGCCCAACGAATATTTCTTCGTCATGGTCGTAGTCCT
CAGGTCAGCTTGATCTGGACGTCGACGCCAGCCGCGAGTTCGAGCTTCATCAGCGCGTCC
ACGGTCTTGTCGTTCGGGTCGACGATATCGAGCACACGCTTGTGCGTGCGGGTTTCGTAT
TGG
>Contig_152:50215-50829
TGCCAGCCACTCCTCGACCTTCTTGACCGCGTCGGCGGTGATCACGACCGTATCGGCCCC
GACCAGAGCGACCGGATCCAGACCCTGCACGTCACGCACCTGCACATACGGCAGGTTGCG
AGCGGACAGATACAGGTGCTCGGAAGCCTCTTCGGTGACGATCAGCGGGCGCTTGCCCAC
>Contig_152:45550-46116
GTTACGGAACGGGAACTTGAACGCTGCCAGCAGCGCCTTCGCTTCCGCATCCGTCTTGGC
GGTGGTGGTGATGGCGATATCCATACCGCGGATCGCGTCGACGGCGTCGAAGTCGATTT
>Contig_152:44079-44618
AGCCTTCTTGGCTTCCTTGCGAATGATGACTTCACCGGCGTACTTCACACCCTTGCCCTT
GTAGGGCTCCGGCGGACGGAAACCGCGAATCTTGGCGGCAACTTCGCCGACGCGCTGCTT

I have tried the following commands, but none of them works:
Code:
grep -Fwf file1 file2 
sed -n file1 file2
awk 'NR==FNR {a[$1]++; next} $1 file1 file2


Last edited by Neo; 06-11-2019 at 03:26 AM..
# 2  
Old 06-11-2019
No surprise "it is not working": none of your attempts addresses your problem, even where they are syntactically correct. Try this instead:



Code:
awk 'FNR == NR {SRCH[$3 "-" $2]; next} $2 in SRCH {print ">" $0}' FS="[]- []*" file1 RS=">" ORS="" FS="[:\n]" file2
>Contig_152:50833-51615
CGGAGTAGCTTCGATCAGCGTGACCGGTACCGAGCGACCGTCTTCAGTGAAGACGCGGCT
CATACCAGCCTTGCGGCCCACGAAGCCCAACGAATATTTCTTCGTCATGGTCGTAGTCCT
CAGGTCAGCTTGATCTGGACGTCGACGCCAGCCGCGAGTTCGAGCTTCATCAGCGCGTCC
ACGGTCTTGTCGTTCGGGTCGACGATATCGAGCACACGCTTGTGCGTGCGGGTTTCGTAT
TGG
>Contig_152:50215-50829
TGCCAGCCACTCCTCGACCTTCTTGACCGCGTCGGCGGTGATCACGACCGTATCGGCCCC
GACCAGAGCGACCGGATCCAGACCCTGCACGTCACGCACCTGCACATACGGCAGGTTGCG
AGCGGACAGATACAGGTGCTCGGAAGCCTCTTCGGTGACGATCAGCGGGCGCTTGCCCAC
>Contig_152:45550-46116
GTTACGGAACGGGAACTTGAACGCTGCCAGCAGCGCCTTCGCTTCCGCATCCGTCTTGGC
GGTGGTGGTGATGGCGATATCCATACCGCGGATCGCGTCGACGGCGTCGAAGTCGATTTC
>Contig_152:44079-44618
AGCCTTCTTGGCTTCCTTGCGAATGATGACTTCACCGGCGTACTTCACACCCTTGCCCTT
GTAGGGCTCCGGCGGACGGAAACCGCGAATCTTGGCGGCAACTTCGCCGACGCGCTGCTT

Your desired output for "45550-46116" is missing a trailing "C".
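
Note that the lookup above is keyed on the coordinate pair alone. If the same range could turn up on more than one contig, it may be safer to compare the contig name as well. The following is only a sketch under a few assumptions taken from the sample data: file 1 headers look like ">NAME_N [A - B] ...", the trailing "_N" is not part of the contig name, and file 2 always lists the smaller coordinate first. The function name extract_by_header is made up for illustration.

```shell
# Hypothetical helper: extract_by_header FILE1 FILE2
# Prints the FASTA records of FILE2 whose header matches an entry of FILE1.
extract_by_header() {
    awk '
        # First file: turn ">Contig_152_415 [51615 - 50833] ..." into the
        # key "Contig_152:50833-51615".
        FNR == NR {
            name = substr($1, 2)             # drop the leading ">"
            sub(/_[0-9]+$/, "", name)        # drop the trailing "_415" (assumed not part of the name)
            a = $2; gsub(/[^0-9]/, "", a)    # "[51615" -> "51615"
            b = $4; gsub(/[^0-9]/, "", b)    # "50833]" -> "50833"
            lo = (a + 0 < b + 0) ? a : b     # assumption: file 2 puts the smaller value first
            hi = (a + 0 < b + 0) ? b : a
            want[name ":" lo "-" hi]
            next
        }
        # Second file: each header line switches printing on or off,
        # and the sequence lines that follow inherit that state.
        /^>/ { keep = (substr($0, 2) in want) }
        keep
    ' "$1" "$2"
}
```

Called as "extract_by_header file1 file2 > matched.fa", it reads file 2 line by line, so it does not depend on how a particular awk treats RS. It does rely on file 1 being non-empty, otherwise the FNR == NR test also fires for file 2.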

Last edited by RudiC; 06-13-2019 at 05:02 AM..
# 3  
Old 06-11-2019
Code:
$ awk -F'\\]|\\[|-|:' -v RS='\n>' ' NR == FNR { gsub(" ",""); arr[$3]=1;next} arr[$2]{print ">" $0} ' f1 f2
>Contig_152:50833-51615
CGGAGTAGCTTCGATCAGCGTGACCGGTACCGAGCGACCGTCTTCAGTGAAGACGCGGCT
CATACCAGCCTTGCGGCCCACGAAGCCCAACGAATATTTCTTCGTCATGGTCGTAGTCCT
CAGGTCAGCTTGATCTGGACGTCGACGCCAGCCGCGAGTTCGAGCTTCATCAGCGCGTCC
ACGGTCTTGTCGTTCGGGTCGACGATATCGAGCACACGCTTGTGCGTGCGGGTTTCGTAT
TGG
>Contig_152:50215-50829
TGCCAGCCACTCCTCGACCTTCTTGACCGCGTCGGCGGTGATCACGACCGTATCGGCCCC
GACCAGAGCGACCGGATCCAGACCCTGCACGTCACGCACCTGCACATACGGCAGGTTGCG
AGCGGACAGATACAGGTGCTCGGAAGCCTCTTCGGTGACGATCAGCGGGCGCTTGCCCAC
>Contig_152:45550-46116
GTTACGGAACGGGAACTTGAACGCTGCCAGCAGCGCCTTCGCTTCCGCATCCGTCTTGGC
GGTGGTGGTGATGGCGATATCCATACCGCGGATCGCGTCGACGGCGTCGAAGTCGATTTC
>Contig_152:44079-44618
AGCCTTCTTGGCTTCCTTGCGAATGATGACTTCACCGGCGTACTTCACACCCTTGCCCTT
GTAGGGCTCCGGCGGACGGAAACCGCGAATCTTGGCGGCAACTTCGCCGACGCGCTGCTT
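
Two things to keep in mind with this one: a multi-character RS such as '\n>' is not guaranteed by POSIX awk (gawk and mawk accept it), and the lookup uses the start coordinate only. A quick way to check which headers should be selected is to rewrite file 1 into the header format of file 2 and grep for the result. This is a rough sketch, assuming file 1 always lists the larger coordinate first (as in the REVERSE SENSE sample) and that file 2 headers have no trailing whitespace; matching_headers is a made-up name.

```shell
# Hypothetical helper: matching_headers FILE1 FILE2
# Lists the headers of FILE2 that correspond to an entry in FILE1.
matching_headers() {
    pats=$(mktemp) || return 1
    # ">Contig_152_415 [51615 - 50833] ..." -> ">Contig_152:50833-51615"
    # (drops the trailing "_415" and swaps the two coordinates)
    sed -n 's/^>\(.*\)_[0-9][0-9]* \[\([0-9]*\) - \([0-9]*\)].*/>\1:\3-\2/p' "$1" > "$pats"
    # -F fixed strings, -x whole-line match, -f read the patterns from a file
    grep -Fxf "$pats" "$2"
    rm -f "$pats"
}
```

This prints header lines only, in the order they appear in file 2. That is enough to verify the selection, but an awk pass is still needed to pull out the sequence lines that follow each header.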
