Extraction of upstream and downstream regions from long sequence file


 
# 15  
Old 08-08-2015
You can save Don's suggestion to a file with a name of your liking, for example: /some_dir/fasta_extract

Then do the following to make it executable:
Code:
chmod +x /some_dir/fasta_extract

And then you should be able to run it like this:
Code:
/some_dir/fasta_extract /some_other_dir/result.ods /some_other_dir/smalldata.fasta

If all files are in the same directory, and you are also in that same directory, then you can use:
Code:
./fasta_extract result.ods smalldata.fasta

And if the input files actually have these names, then you can run it as:
Code:
./fasta_extract

Since these are the default names that are used in the script.

With all these commands you can use redirection to put the data in a new file:
Code:
command > newfile


Last edited by Scrutinizer; 08-08-2015 at 04:14 AM..
# 16  
Old 08-08-2015
In addition to what Scrutinizer already said, note that the script and both of the input data files must be in UNIX text file format (with a single <newline> character as the line terminator), not Windows format (with <carriage-return><newline> character pairs as the line terminator), and not text produced by some text formatting tool like Microsoft Word.
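One way to check a file for carriage returns and strip them is sketched below. The file names demo_dos.fasta and demo_unix.fasta are made up for illustration, and tr -d '\r' is only one common approach; a tool such as dos2unix, where installed, should do the same job.

```shell
# Create a small sample file with Windows (CRLF) line endings; the name is hypothetical.
printf '>seq1\r\nACGTACGT\r\n' > demo_dos.fasta

# Count the lines that contain a carriage return; non-zero suggests DOS format.
cr=$(printf '\r')
dos_lines=$(grep -c "$cr" demo_dos.fasta)
echo "lines with CR before: $dos_lines"

# Strip every carriage return and write a UNIX-format copy.
tr -d '\r' < demo_dos.fasta > demo_unix.fasta

# Count again on the converted copy (grep -c exits non-zero on no match, hence || true).
unix_lines=$(grep -c "$cr" demo_unix.fasta || true)
echo "lines with CR after: $unix_lines"
```

The same tr command can be applied to the script itself and to both input files if any of them were edited on Windows.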
# 17  
Old 08-10-2015
I am trying to run it.
# 18  
Old 08-13-2015
Hello sir, I am getting good results with this script, but what if I want to extract another column from file2, printed after the seq_id column?

---------- Post updated at 05:47 AM ---------- Previous update was at 05:29 AM ----------

I mean, how can I modify this script

Code:
awk '
BEGIN           {print "query id\tsequence id\textracted region small\textracted region big upstream and downstream"
                }
NR==FNR &&
FNR>1           {CNT[$1]++
                 S[$1,CNT[$1]]=$2
                 E[$1,CNT[$1]]=$3
                 next
                }
                {split ($1, T, " ")
                }
T[1] in CNT     {i=T[1]
                 $1=x
                 for (j=1; j<=CNT[T[1]]; j++)
                        print RS i "\t" substr ($0,S[i,j],E[i,j]-S[i,j]+1) "\t" substr ($0, S[i,j]-100, E[i,j]-S[i,j]+201)
                }
' result.txt RS=\> FS='\n' OFS= 1.fasta >output_1

to extract one more column of data, namely column no. 4 of file2 (i.e. result.xls)?

Last edited by Don Cragun; 08-13-2015 at 01:47 PM.. Reason: Add CODE and ICODE tags.
# 19  
Old 08-13-2015
Please use code tags as required by forum rules!

The better the spec, the better the solution, as you certainly learned. With what you show us (i.e. no input nor output sample), I'd propose to save the new column in an array (as you do with the other fields) when reading result.xls (or .txt, unclear to me), and then print it in the for loop together with the other relevant fields.
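As a rough sketch of that idea, and only under assumptions: the sample files below (demo_result.txt, demo.fasta) are made up, with the extra value assumed to sit in column 4 after id, start and end. The array X is a hypothetical name for the new column's store, and the output layout (id, extra column, extracted region) is a guess at what is wanted.

```shell
# Hypothetical sample inputs, made up for illustration only.
# demo_result.txt: header line, then id, start, end, and an extra 4th column.
printf 'id\tstart\tend\textra\nseqA\t3\t6\tfoo\nseqA\t8\t9\tbar\n' > demo_result.txt
# demo.fasta: one record whose sequence is split over two lines.
printf '>seqA some description\nAACCGGTTAA\nCCGGTT\n' > demo.fasta

awk '
NR==FNR && FNR>1 {              # first file: remember start, end and column 4 per id
        CNT[$1]++
        S[$1,CNT[$1]]=$2
        E[$1,CNT[$1]]=$3
        X[$1,CNT[$1]]=$4        # the new column, stored like the other fields
        next
}
{       split($1, T, " ") }     # second file: T[1] is the id on the header line
T[1] in CNT {
        i=T[1]
        $1=x                    # drop the header; OFS= joins the sequence lines
        for (j=1; j<=CNT[i]; j++)
                print i "\t" X[i,j] "\t" substr($0, S[i,j], E[i,j]-S[i,j]+1)
}
' demo_result.txt RS=\> FS='\n' OFS= demo.fasta > demo_output

cat demo_output
```

If the extra value lives in a different column, or the columns are in another order, only the field numbers in the first action should need to change.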
# 20  
Old 08-14-2015
Code:
query_id  subject id                       s. start  s. end
3453      gi|546669925|gb|AWWX01450616.1|  282       305
5676      gi|546671471|gb|AWWX01449637.1|  771       790
8765      gi|546669842|gb|AWWX01450698.1|  1523      1542
6578      gi|546669842|gb|AWWX01450698.1|  1644      1660
9087      gi|546671514|gb|AWWX01449617.1|  1926      1948


That is, I want to extract the query id along with the subject id from this xls file.
# 21  
Old 08-14-2015
RudiC, Scrutinizer, and I helped you with awk scripts that did what you requested with input files in the formats you specified. Now you want more output (in an unspecified output format) from input files with a different format.

If you would like to show us the modifications you have made to the code we supplied to meet your new requirements AND show us the output you're trying to produce now AND explain what you are unable to figure out that needs to be done to achieve your new goal, we'll be happy to help you.

If you just want to keep changing your requirements, make us guess at what output you want, and have us continue acting as your unpaid programming staff while you show no interest in learning from the sample code we have provided, then there is very little incentive for us to continue helping someone who does not seem to be interested in learning how to solve his own problems.