Extraction of upstream and downstream regions from long sequence file Post: 302951415

Post 302951415 by Don Cragun on Thursday 6th of August 2015 03:57:18 AM

And make sure that the output you show us includes the (exact) output you want produced for at least the following file2 input lines:

Code:

subject id	 s. start	 s. end
gi|546669925|gb|AWWX01450616.1|	282	305
gi|546669842|gb|AWWX01450698.1|	1523	1542
gi|546669842|gb|AWWX01450698.1|	1641	1660
gi|546669842|gb|AWWX01450698.1|	2484	2503
gi|546669842|gb|AWWX01450698.1|	2720	2739
gi|546669842|gb|AWWX01450698.1|	2725	2744
gi|546669977|gb|AWWX01450566.1|	2822	2842

Note that the 1st line in this file2 is related to one entry in file1, the next 5 lines are related to another entry in file1, and the last line is related to an entry that is not found in file1.

For the 5 lines related to the string gi|546669842|gb|AWWX01450698.1|, which of the following is the output supposed to be?

1. 5 separate sets of output, one for each of those lines; OR
2. 1 combined set of output that duplicates some of the output (due to overlapping ranges); OR
3. 1 combined set of output containing the non-overlapping regions of the requested ranges: 1423 through 1760, 2384 through 2603, and 2620 through 2844 (where the start and end points have been extended 100 characters in each direction and the five input regions in file2, some of which then overlap, have been combined into three non-overlapping output regions).
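To show how the combined ranges 1423 through 1760, 2384 through 2603, and 2620 through 2844 come about, here is a minimal Python sketch. It assumes 1-based inclusive positions, a flank of 100 characters on each side, and that only ranges which actually overlap after extension are merged. The function name merge_extended is hypothetical and is not part of any script in this thread.

```python
def merge_extended(ranges, flank=100):
    """Extend each (start, end) range by `flank` on both sides, then merge overlaps."""
    # Clamp the start at 1 (assumed 1-based coordinates). The end is not
    # clamped because the sequence length from file1 is not known here.
    extended = sorted((max(1, start - flank), end + flank) for start, end in ranges)
    merged = []
    for start, end in extended:
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: grow it instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


# The five file2 ranges for gi|546669842|gb|AWWX01450698.1|
ranges = [(1523, 1542), (1641, 1660), (2484, 2503), (2720, 2739), (2725, 2744)]
print(merge_extended(ranges))
# [(1423, 1760), (2384, 2603), (2620, 2844)]
```

Whether regions that merely touch (rather than overlap) should also be merged, and whether the end should be clamped to the sequence length, depends on the answer to the question above.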

And, for the last entry in file2, there is no entry in your sample file1. Is anything supposed to appear in the output for this case? If so, what?

And, just for the record, the number of characters specified by the range 282 through 305 is 24 characters; not 23. (If you don't see why that is true, take the simpler example where the range 282 through 282 is 1 character; not 0.)
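The same count can be checked with a short Python sketch, assuming both end points are included in the range. The helper name inclusive_length is hypothetical.

```python
def inclusive_length(start, end):
    """Number of characters in a range that includes both start and end."""
    return end - start + 1


print(inclusive_length(282, 305))  # 24
print(inclusive_length(282, 282))  # 1
```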

Last edited by Don Cragun; 08-06-2015 at 06:24 AM.. Reason: Fix typos in counts.
