Extraction of upstream and downstream regions from long sequence file

08-08-2015

You can save Don's suggestion to a file with a name of your liking, for example: /some_dir/fasta_extract

Then do the following to make it executable:

Code:

chmod +x /some_dir/fasta_extract

And then you should be able to run it like this:

Code:

/some_dir/fasta_extract /some_other_dir/result.ods /some_other_dir/smalldata.fasta

If all files are in the same directory, and you are also in that same directory, then you can use:

Code:

./fasta_extract result.ods smalldata.fasta

And if the input files actually have these names, then you can run it as:

Code:

./fasta_extract

Since these are the default names that are used in the script.

With all these commands you can use redirection to put the data in a new file:

Code:

command > newfile

Last edited by Scrutinizer; 08-08-2015 at 04:14 AM..

Scrutinizer

08-08-2015

In addition to what Scrutinizer already said, note that the script and both input data files must be in UNIX text file format (with a single <newline> character as the line terminator), not Windows format (with <carriage-return><newline> character pairs as the line terminator), and not text produced by a word processor such as Microsoft Word.
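As a rough illustration of the point above, here is one way to check a file for Windows line endings and make a UNIX-format copy. The file names are hypothetical and the sample file is created on the spot only so the snippet runs on its own; `tr -d '\r'` is just one of several ways to strip carriage returns.

```shell
#!/bin/sh
# Hypothetical file names; the sample file exists only for this demo.
tmp=$(mktemp -d)
printf 'seqA\t3\t5\r\nseqA\t7\t8\r\n' > "$tmp/result_dos.txt"   # Windows-style line endings

# A literal carriage return to use in the grep pattern.
cr=$(printf '\r')

# Count lines ending in a carriage return; non-zero means Windows format.
# (grep exits non-zero when the count is 0, hence the "|| true".)
before=$(grep -c "$cr\$" "$tmp/result_dos.txt" || true)

# Strip every carriage return and write a UNIX-format copy.
tr -d '\r' < "$tmp/result_dos.txt" > "$tmp/result_unix.txt"
after=$(grep -c "$cr\$" "$tmp/result_unix.txt" || true)

echo "lines with CR before: $before, after: $after"
rm -rf "$tmp"
```

The same `tr` command can be applied to the script itself and to both input files if any of them were saved on Windows.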

Don Cragun

08-10-2015

i am trying to run

harpreetmanku04

08-13-2015

Hello sir, I am getting good results with this script, but what if I want to extract another column from file2, followed by the seq_id column?

---------- Post updated at 05:47 AM ---------- Previous update was at 05:29 AM ----------

I mean, how can I modify this script

Code:

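awk '
# Print the header line once, before any input is read.
BEGIN           {print "query id\tsequence id\textracted region small\textracted region big upstream and downstream"
                }
# First file (result.txt): skip its header line, then store the
# start ($2) and end ($3) of every hit, keyed by sequence id ($1).
NR==FNR &&
FNR>1           {CNT[$1]++
                 S[$1,CNT[$1]]=$2
                 E[$1,CNT[$1]]=$3
                 next
                }
# FASTA records: T[1] becomes the sequence id taken from the header line.
                {split ($1, T, " ")
                }
# For ids with stored hits: blank the header field so that $0 is the bare
# sequence (OFS is empty), then print each region and the same region
# with 100 extra characters on each side.
T[1] in CNT     {i=T[1]
                 $1=x
                 for (j=1; j<=CNT[T[1]]; j++)
                        print RS i "\t" substr ($0,S[i,j],E[i,j]-S[i,j]+1) "\t" substr ($0, S[i,j]-100, E[i,j]-S[i,j]+201)
                }
' result.txt RS=\> FS='\n' OFS= 1.fasta >output_1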

to extract one more column of data, namely column no. 4, from file2, i.e. result.xls?

Last edited by Don Cragun; 08-13-2015 at 01:47 PM.. Reason: Add CODE and ICODE tags.

harpreetmanku04

08-13-2015

Please use code tags as required by forum rules!

The better the spec, the better the solution, as you certainly learned. With what you show us (i.e. no input nor output sample), I'd propose to save the new column in an array (as you do with the other fields) when reading result.xls (or .txt, unclear to me), and then print it in the for loop together with the other relevant fields.
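A minimal sketch of that idea, under assumptions the thread does not confirm: the first file is taken to hold sequence id, start, end and the wanted extra value in columns 1 to 4, and the array name `X` and all sample data are made up for illustration. The flanking-region column of the original script is left out to keep the example short.

```shell
#!/bin/sh
# Hypothetical sample inputs, invented for illustration only.
tmp=$(mktemp -d)
printf 'seq_id\tstart\tend\textra\n' >  "$tmp/result.txt"
printf 'seqA\t3\t5\tq1\n'            >> "$tmp/result.txt"
printf 'seqA\t7\t8\tq2\n'            >> "$tmp/result.txt"
printf '>seqA demo\nACGTAC\nGTACGT\n>seqB demo\nTTTTTT\n' > "$tmp/1.fasta"

out=$(awk '
# First file: store start, end and the extra column for every hit.
NR==FNR && FNR>1 { CNT[$1]++
                   S[$1,CNT[$1]]=$2
                   E[$1,CNT[$1]]=$3
                   X[$1,CNT[$1]]=$4     # new: remember column 4
                   next
                 }
# FASTA records: T[1] is the sequence id from the header line.
                 { split($1, T, " ") }
# Print the saved column together with the id and the extracted region.
T[1] in CNT      { i=T[1]
                   $1=x                 # drop the header; $0 is now the bare sequence
                   for (j=1; j<=CNT[i]; j++)
                       print X[i,j] "\t" i "\t" substr($0, S[i,j], E[i,j]-S[i,j]+1)
                 }
' "$tmp/result.txt" RS=\> FS='\n' OFS= "$tmp/1.fasta")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With the sample data above this prints two tab-separated lines, `q1 seqA GTA` and `q2 seqA GT`. If the extra value lives in a different column of the real file, only the `$4` in the first rule needs to change.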

RudiC

08-14-2015

Code:

query_id  subject id                       s. start  s. end
3453      gi|546669925|gb|AWWX01450616.1|  282       305
5676      gi|546671471|gb|AWWX01449637.1|  771       790
8765      gi|546669842|gb|AWWX01450698.1|  1523      1542
6578      gi|546669842|gb|AWWX01450698.1|  1644      1660
9087      gi|546671514|gb|AWWX01449617.1|  1926      1948

For example, I want to extract the query id along with the subject id from this xls file.

harpreetmanku04

08-14-2015

RudiC, Scrutinizer, and I helped you with awk scripts that did what you requested with input files in the formats you specified. Now you want more output (in an unspecified output format) from input files with a different format.

If you would like to show us the modifications you have made to the code we supplied to meet your new requirements AND show us the output you're trying to produce now AND explain what you are unable to figure out that needs to be done to achieve your new goal, we'll be happy to help you.

If you just want to keep changing your requirements, make us guess at what output you want, and continue treating us as your unpaid programming staff while showing no interest in learning from the sample code we have provided, then there is very little incentive for us to continue helping someone who does not seem to be interested in learning how to solve his own problems.

Don Cragun
