Full Discussion: Sequence extraction
Post 302951309 by Scrutinizer on Wednesday 5th of August 2015 03:39:48 AM
Maybe because FS is set to '\t' and there appear to be no TABs in your second sample file, so the fields will not match. You are also setting the index to the entire record minus its first character, which is probably not what you want. You probably meant to use $1 here, but that would not be a sure way to do it either, because in the FASTA format the identifier is allowed to contain spaces. Finally, the FASTA file is word wrapped, so you need to take out the newlines rather than use getline to read only the second line.
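A quick way to check the field separator problem is to print NF for a sample line. This is only a sketch: the coordinate line `seq1 5 12` is a made-up example, since the actual file contents are not shown in this post.

```shell
# Hypothetical coordinate line: identifier, start, end, separated by spaces.
line='seq1 5 12'

# With FS set to a TAB the whole line stays in $1, so $2 and $3 are empty.
printf '%s\n' "$line" | awk -F'\t' '{print NF}'

# With the default FS (runs of blanks) the line splits into three fields.
printf '%s\n' "$line" | awk '{print NF}'
```

The first command prints 1 and the second prints 3, which is why a lookup on $1 cannot match when the file has no TABs.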

The best way to do that is to use ">" as the record separator and "\n" as the field separator. With OFS set to the empty string, assigning a value to one of the fields makes awk rebuild the record, so all newlines are replaced by empty strings, which effectively removes the word wrap. The sequence then becomes one continuous string, which makes it suitable for substring selection.
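The unwrapping step on its own might look like the sketch below. The file contents and the names `fasta`, `id` and `unwrapped` are made up for illustration, and the `NF` guard is an addition here to skip the empty record that precedes the first ">".

```shell
# Hypothetical wrapped FASTA file, written to a temporary location.
fasta=$(mktemp)
printf '>seq1\nACGTAC\nGTTTGA\n>seq2\nGGGCCC\nAAATTT\n' > "$fasta"

# RS=">" makes each FASTA entry one record; FS="\n" makes each line a field.
# The empty text before the first ">" has no fields, so the NF pattern skips it.
# Assigning to $1 rebuilds $0 with OFS (empty), which joins the sequence lines.
unwrapped=$(awk 'NF{id=$1; $1=x; print id ": " $0}' RS=\> FS='\n' OFS= "$fasta")

printf '%s\n' "$unwrapped"
rm -f "$fasta"
```

For this sample the two lines printed are `seq1: ACGTACGTTTGA` and `seq2: GGGCCCAAATTT`, so each sequence is a single string that substr() can cut.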

Using your file order, we would get something like this.
Code:
awk 'NR==FNR{i=$1; $1=x; A[i]=$0; next} $1 in A{print ">" $1 ORS substr(A[$1], $2, $3-$2+1)}' RS=\> FS='\n' OFS= file1 FS=" " RS="\n" file2


If we read the files the other way around, then it becomes more memory efficient:
Code:
awk 'NR==FNR{S[$1]=$2; E[$1]=$3; next} $1 in S{i=$1; $1=x; print RS i FS substr($0,S[i],E[i]-S[i]+1)}' file2 RS=\> FS='\n' OFS= file1
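As a rough check, the second variant can be run on small made-up files. Everything in the sample data (names, sequences, coordinates) is hypothetical, and the sketch assumes single-word FASTA identifiers, because the whole header line is used as the lookup key.

```shell
# Hypothetical sample data; names, sequences and coordinates are made up.
dir=$(mktemp -d)
printf '>seq1\nACGTAC\nGTTTGA\n>seq2\nGGGCCC\nAAATTT\n' > "$dir/file1"
printf 'seq1 3 8\nseq2 5 9\n' > "$dir/file2"

# Coordinates (file2) are read first with the default separators, then
# each FASTA record (file1) is unwrapped and cut as soon as it is read.
result=$(awk 'NR==FNR{S[$1]=$2; E[$1]=$3; next}
  $1 in S{i=$1; $1=x; print RS i FS substr($0,S[i],E[i]-S[i]+1)}
' "$dir/file2" RS=\> FS='\n' OFS= "$dir/file1")

printf '%s\n' "$result"
rm -rf "$dir"
```

With this input the output is `>seq1`, `GTACGT`, `>seq2`, `CCAAA` on four lines: positions 3 to 8 of the unwrapped seq1 and positions 5 to 9 of the unwrapped seq2. Only the coordinates are held in memory, not the sequences.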


Last edited by Scrutinizer; 08-05-2015 at 04:52 AM..
 
