Extraction of upstream and downstream regions from long sequence file
Hello, I am posting my query again with modified input files.
My query is:
I have two input files, file1 and file2.
file1 is smalldata.fasta
file2 is result.ods
output:
I want to extract a region, for example 282-305, from the sequence gi|546669925|gb|AWWX01450616.1| in file1, i.e. smalldata.fasta.
i.e. the output should be like
a small string of 23 characters (305-282=23).
Moreover, I also want to extract the 100 characters before position 282 and the 100 characters after position 305,
i.e. the result should be like
a string of 100+23+100 characters, i.e. 223 characters long.
The result should be written to a separate file, not to either of the two input files.
I shall be thankful if your script works for these two files, i.e. file1=smalldata.fasta
file2=result.ods
Thank you
Last edited by Scrutinizer; 08-06-2015 at 04:14 AM..
Reason: CODE tags
And make sure that the output you show us includes the (exact) output you want produced for at least the following file2 input lines:
Note that the 1st line in this file2 is related to one entry from file1, the next 5 lines from this file2 are related to another entry from file1, and the last line from file2 is related to an entry that is not found in file1.
For the 5 lines related to the string gi|546669842|gb|AWWX01450698.1|, which of the following do you want?
1. Five separate sets of output, one per line.
2. One combined set of output that duplicates some of the output (due to overlapping ranges).
3. One combined set of output containing the non-overlapping regions of the requested ranges: 1423 through 1760, 2384 through 2603, and 2620 through 2844 (where the start and stop points have been extended 100 characters in each direction and the five overlapping input regions in file2 have been combined into three non-overlapping output regions).
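To make the third option concrete, here is a minimal Python sketch of extending ranges by a flank and merging the ones that overlap. The function name `merge_ranges` and the sample coordinates are hypothetical (they are not the values from file2), and the upper end is not clamped because the sequence length is not known at this step.

```python
def merge_ranges(ranges, flank=100):
    """Extend each inclusive (start, end) range by `flank` and merge overlaps."""
    # Extend both ends; positions are 1-based, so never go below 1.
    extended = sorted((max(start - flank, 1), end + flank) for start, end in ranges)
    merged = []
    for start, end in extended:
        if merged and start <= merged[-1][1]:
            # Shares at least one position with the previous range: grow it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

# Hypothetical input ranges, for illustration only.
sample = [(500, 520), (510, 560), (900, 950)]
result = merge_ranges(sample)
print(result)
```

With the first option (one set of output per line), no merging is needed and each range is handled on its own.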
And, for the last entry in file2, there is no entry in your sample file1. Is anything supposed to appear in the output for this case? If so, what?
And, just for the record, the number of characters specified by the range 282 through 305 is 24 characters; not 23. (If you don't see why that is true, take the simpler example where the range 282 through 282 is 1 character; not 0.)
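The counting can be checked with a short Python sketch. The helper `extract_region` and the 500-character test sequence are assumptions made for illustration; positions are treated as 1-based and inclusive, as in the range 282 through 305.

```python
def extract_region(seq, start, end, flank=0):
    """Return positions start..end (1-based, inclusive) plus up to `flank` characters per side."""
    lo = max(start - flank, 1)         # do not run past the first character
    hi = min(end + flank, len(seq))    # do not run past the last character
    return seq[lo - 1:hi]              # convert to Python's 0-based, end-exclusive slice

# Hypothetical 500-character sequence, for illustration only.
seq = "ACGT" * 125
core = extract_region(seq, 282, 305)
wide = extract_region(seq, 282, 305, flank=100)
print(len(core), len(wide))
```

So the region with 100 characters on each side is 100+24+100 characters long, as long as the sequence has at least 100 characters on both sides of the range.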
Last edited by Don Cragun; 08-06-2015 at 06:24 AM..
Reason: Fix typos in counts.
Sir, that is an Excel file; how can I post it here?
However, it is roughly like
---------- Post updated at 03:30 AM ---------- Previous update was at 03:01 AM ----------
Cragun sir, yes, I want to generate 5 sets of output, one for each gi|546669842|gb|AWWX01450698.1| entry. Although the ID occurs multiple times, the positions are different, so there should be 5 lines in the result file corresponding to the gi|546669842|gb|AWWX01450698.1| entry.
And the last entry is present in file1; see entry no. 4.
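A rough Python sketch of the behaviour described here (one output line per file2 row, even when the same ID repeats) could look like the following. It assumes file2 has been exported from the spreadsheet to plain text with three whitespace-separated columns (ID, start, end); that layout, the function names and the tiny sample data are all hypothetical.

```python
import io

def read_fasta(handle):
    """Map each header's first word (without '>') to its sequence joined across lines."""
    parts, name = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {key: "".join(chunks) for key, chunks in parts.items()}

def extract_rows(seqs, rows, flank=100):
    """Yield (id, start, end, region) per row; region is None when the id is not in seqs."""
    for line in rows:
        fields = line.split()
        # Skip blank lines and header rows whose positions are not numbers.
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        seq_id, start, end = fields[0], int(fields[1]), int(fields[2])
        seq = seqs.get(seq_id)
        if seq is None:
            yield seq_id, start, end, None
            continue
        lo = max(start - flank, 1)
        hi = min(end + flank, len(seq))
        yield seq_id, start, end, seq[lo - 1:hi]

# Hypothetical stand-ins for file1 and the text export of file2.
fasta = io.StringIO(">seqA demo\nACGTACGTAC\nGGGGTTTTCC\n")
table = io.StringIO("id start end\nseqA 5 8\nseqA 10 12\nmissing 1 3\n")
results = list(extract_rows(read_fasta(fasta), table, flank=2))
for seq_id, start, end, region in results:
    print(seq_id, start, end, region if region is not None else "NOT FOUND")
```

For the real data, the two `io.StringIO` objects would be replaced by files opened with `open(...)`, and the printed lines would be written to a third, separate output file.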
---------- Post updated at 04:27 AM ---------- Previous update was at 03:30 AM ----------
Is there still any problem, sir?
---------- Post updated 08-07-15 at 12:23 AM ---------- Previous update was 08-06-15 at 04:27 AM ----------
I want to extract a specific region of interest from a big file. I have only the start position, end position and sequence ID. My query is:
My file1 is this:
>GL3482.1
GAACTTGAGATCCGGGGA
GCAGTGGATCTCCACCAG
CGGCCAGAACTGGTGCAC
CTCCAGGCCAGCCTCGTC
CTGCGTGTC
>GL3550.1... (14 Replies)
Hi,
I have a complete genome FASTA file and a list of subsequence regions
in the format:
4353..5633
6795..9354
1034..14456
I want a script which can mask these regions in a single complete genome FASTA file with the letter N.
Kindly help (2 Replies)
Old skool UNIX and Linux geek here, but newbie to the world of DNS and bind. I've recently been tasked with replacing our DNS infrastructure, currently on Windows, with a RHEL based solution. And I assume that means using bind, which I've not used before. Here's my question:
Suppose our company... (3 Replies)
Hi I have 2 files; usually the end position in the file1 is the start position in the file2 and the end position in file2 will be the start position in file1 (flanks)
file1
Id start end
aaa1 0 3000070
aaa1 3095270 3095341
aaa1 3100822 3100894
aaa1 ... (1 Reply)
FILE_ID extraction from file name and save it in CSV file after looping through each folders
My files are located on a UNIX server. I want to extract file_id and file_name from each file and save them in a CSV file. How do I do that?
I have folders in unix environment, directory structure is... (15 Replies)
Hi all,
I have a file like this
ID 3BP5L_HUMAN Reviewed; 393 AA.
AC Q7L8J4; Q96FI5; Q9BQH8; Q9C0E3;
DT 05-FEB-2008, integrated into UniProtKB/Swiss-Prot.
DT 05-JUL-2004, sequence version 1.
DT 05-SEP-2012, entry version 71.
FT COILED 59 140 ... (1 Reply)
Hi, I have a file1 of many long sequences, each preceded by a unique header line. file2 is a 3-column list: header name, start position, end position. I'd like to extract the sequence regions of file1 specified in file2.
Based on a post elsewhere, I found the code:
awk... (2 Replies)
Hi everyone,
I have a large text file containing DNA sequences in fasta format as follows:
>someseq
GAACTTGAGATCCGGGGAGCAGTGGATCTC
CACCAGCGGCCAGAACTGGTGCACCTCCAG
GCCAGCCTCGTCCTGCGTGTC
>another seq
GGCATTTTTGTGTAATTTTTGGCTGGATGAGGT
GACATTTTCATTACTACCATTTTGGAGTACA
>seq3450... (4 Replies)
Hi all,
I am having difficulty solving the following problem.
mydata:
StartPoint EndPoint
22 55
2222 2230
33 66
44 58
222 240
11 25
22 60
33 45
The union of above... (2 Replies)
Hi,
I'm working hard on SQL and I came across a hurdle I'm hoping you can help me out with.
I have two tables
table1
headers: chrom start end name score strand
11 9720685 9720721 U0 0 +
21 9721043 9721079 U0 0 -
1 9721093 9721129 U0 0 +
20 ... (2 Replies)