Extraction of sequences from files


 
# 1  
Old 08-14-2015

Hey! I have two files. file1 (ids.txt) looks like this:

Code:
>gi|546473186|gb|AWWX01630222.1|
>gi|546473233|gb|AWWX01630175.1|
>gi|546473323|gb|AWWX01630097.1|
>gi|546474044|gb|AWWX01629456.1|
>gi|546474165|gb|AWWX01629352.1|

file2 (sequences.fasta) looks like this:
Code:
>gi|546473233|gb|AWWX01630175.1|
cgtcgctgacgtcgacctcgtcgtgctcgctcggctcgcaagctcgaccgtcga
>gi|546473323|gb|AWWX01630097.1|
ctgctgcacgtcgagctgcacgtgctgacgttgcggcgctgcagctgacg
>gi|546474044|gb|AWWX01629456.1|
ctgggccvgtgctgacgtcgacgtcgacgttcgctgaccgtcgtcgacgtc
>gi|546786044|gb|AWWX01629456.1|
ctgggccvgtgctgccgtgcgtgcgtcgacgttcgctgaccgtcgtcgacgtc
>gi|5464740789|gb|AWWX01629456.1|
ctgggccvgtgctgacgtcgacgtcgacgtttgggttttttcgtcgacgtc

I want to extract only those entries from file2 whose IDs also appear in file1, and write the result to a new file.

The output should look like this:
Code:
>gi|546473233|gb|AWWX01630175.1|
cgtcgctgacgtcgacctcgtcgtgctcgctcggctcgcaagctcgaccgtcga
>gi|546473323|gb|AWWX01630097.1|
ctgctgcacgtcgagctgcacgtgctgacgttgcggcgctgcagctgacg
>gi|546474044|gb|AWWX01629456.1|
ctgggccvgtgctgacgtcgacgtcgacgttcgctgaccgtcgtcgacgtc

# 2  
Old 08-14-2015
Any attempt from your side? Mayhap based on your other thread dealing with a similar problem?

---------- Post updated at 14:35 ---------- Previous update was at 14:31 ----------

Howsoever, try
Code:
awk 'FNR==NR {T[$1];next} $1 in T {P=NR+1} NR<=P' file1 file2
>gi|546473233|gb|AWWX01630175.1|
cgtcgctgacgtcgacctcgtcgtgctcgctcggctcgcaagctcgaccgtcga
>gi|546473323|gb|AWWX01630097.1|
ctgctgcacgtcgagctgcacgtgctgacgttgcggcgctgcagctgacg
>gi|546474044|gb|AWWX01629456.1|
ctgggccvgtgctgacgtcgacgtcgacgttcgctgaccgtcgtcgacgtc
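
For readers who find the one-liner dense, here is a minimal sketch that spreads the same awk program over several commented lines. The sample files and the IDs id1 to id4 are invented for the demonstration, and the comments are one reading of the command rather than RudiC's own explanation.

```shell
#!/bin/sh
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

# Hypothetical sample data: two wanted IDs, four single-line FASTA records.
printf '%s\n' '>id2' '>id3' > file1
printf '%s\n' '>id1' 'aaaa' '>id2' 'cccc' '>id3' 'gggg' '>id4' 'tttt' > file2

awk '
    # While reading the first file (FNR equals NR), store each ID line as an array key.
    FNR == NR { T[$1]; next }
    # In the second file: when a header is a stored ID, remember the next line number.
    $1 in T   { P = NR + 1 }
    # Print the current line as long as it is not past the remembered line number.
    NR <= P
' file1 file2 > out.txt

cat out.txt
```

With this sample data only the id2 and id3 records are printed. Note that the approach prints exactly one line after each matching header, so it assumes every sequence sits on a single line.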

# 3  
Old 08-15-2015
no output is generated by this script...
# 4  
Old 08-15-2015
If the input files are as you described, and you used the command RudiC suggested in post #2, you should get the output he listed in that same post.

If you're running this script on a Solaris/SunOS system, change awk in his suggestion to /usr/xpg4/bin/awk. (Since you say that there was no output, this should not be your problem.)

If your input files are in DOS format (with <carriage-return><linefeed> character pair line terminators instead of the normal <newline> character line terminators expected by UNIX and Linux system utilities) or have extraneous spaces and/or tabs at the end of input lines, change RudiC's suggestion to:
Code:
awk '{sub("[[:space:]]*\r*$","")} FNR==NR {T[$1];next} $1 in T {P=NR+1} NR<=P' file1 file2

# 5  
Old 08-15-2015
The Fasta format allows sequences to be wrapped across lines (so one sequence can span more than one line), so I still think it is better to rely not on line numbers but on the > sign as the record separator. Try:
Code:
awk 'NR==FNR && NR>1{A[$1]; next} $1 in A{print RS $0}' RS=\> ORS= FS='\n' file1 file2

Another thing in the Fasta format is that the label line may contain spaces, so either $0 should be used, rather than $1, or (as is the case in my suggestion) FS should be set to '\n', so that $1 contains the label line.
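
The two points above can be checked with a small sketch. It runs the record-based command on made-up data (the file names ids.txt, wrapped.fasta and picked.fasta and the seq1 to seq3 records are assumptions for illustration) in which the wanted record has a space in its label and a sequence wrapped over two lines.

```shell
#!/bin/sh
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

# Hypothetical ID list: one label that contains spaces.
printf '%s\n' '>seq2 some description' > ids.txt

# Hypothetical FASTA file: seq2 is wrapped across two sequence lines.
printf '%s\n' '>seq1' 'acgt' \
              '>seq2 some description' 'cccc' 'gggg' \
              '>seq3' 'tttt' > wrapped.fasta

# RS=">" makes each FASTA entry one record; FS="\n" makes $1 the whole label line.
# ORS is empty because every record already ends with its own newline.
awk 'NR==FNR && NR>1{A[$1]; next} $1 in A{print RS $0}' \
    RS=\> ORS= FS='\n' ids.txt wrapped.fasta > picked.fasta

cat picked.fasta
```

With this sample data the seq2 header is printed together with both of its sequence lines. A line-based command that prints only the header and the one line after it would stop after cccc.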


--
In addition to Don's remarks: both suggestions are written for Unix-format input. If both input files are in DOS format, then they should happen to work as well. If only one of the input files is in DOS format, then they will fail. Files can be converted from DOS to Unix format by using:
Code:
tr -d '\r' < file > newfile
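
Before converting, it may help to check whether a file contains carriage returns at all. The following is a minimal sketch on made-up files (dos.txt, unix.txt, dos_fixed.txt and report.txt are hypothetical names); it counts the lines that contain a carriage return and then applies the tr command from above.

```shell
#!/bin/sh
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1

# Hypothetical samples: the same two IDs with DOS (CRLF) and Unix (LF) line endings.
printf '>id1\r\n>id2\r\n' > dos.txt
printf '>id1\n>id2\n' > unix.txt

# A literal carriage return, obtained in a way that works in any POSIX shell.
cr=$(printf '\r')

for f in dos.txt unix.txt; do
    # grep -c prints how many lines contain at least one carriage return.
    n=$(grep -c "$cr" "$f")
    echo "$f: $n line(s) with a carriage return"
done > report.txt

# Strip the carriage returns and compare the result with the Unix version.
tr -d '\r' < dos.txt > dos_fixed.txt
cmp -s dos_fixed.txt unix.txt && echo "dos_fixed.txt now matches unix.txt" >> report.txt

cat report.txt
```

A non-zero count for only one of the two input files would point to the mixed-format situation described above.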


Last edited by Scrutinizer; 08-15-2015 at 03:39 PM.. Reason: Typo, thanks Aia
# 6  
Old 08-17-2015
it is showing cannot open file1

---------- Post updated at 12:23 AM ---------- Previous update was at 12:18 AM ----------

Code:
csm@csm-HP-Z420:~/Desktop/sequences/sequences to extract for fold$ awk 'FNR==NR {T[$1];next} $1 in T {P=NR+1} NR<=P' only_ids.txt sequences.fasta  >hkm


this creates only empty hkm output file
# 7  
Old 08-17-2015
Quote:
Originally Posted by harpreetmanku04
it is showing cannot open file1

---------- Post updated at 12:23 AM ---------- Previous update was at 12:18 AM ----------

Code:
csm@csm-HP-Z420:~/Desktop/sequences/sequences to extract for fold$ awk 'FNR==NR {T[$1];next} $1 in T {P=NR+1} NR<=P' only_ids.txt sequences.fasta  >hkm


this creates only empty hkm output file
You started out by telling us you had file1 and file2 and confusingly gave alternative names ids.txt and sequences.fasta. Now you tell us the command above says there is no file1 when it doesn't reference file1???

When sitting in the directory where you ran the above command, please show us the output of the command:
Code:
ls -l file1 file2 ids.txt only_ids.txt sequences.fasta hkm
