Scanning alignment and "extracting" blocks


 
# 1  
Old 02-09-2011
Scanning alignment and "extracting" blocks

I have been trying to work out how to approach this problem, but I cannot find a way to do it, so I finally decided to ask. I have a large number of sequences of different lengths, aligned in the following format:
Quote:
>Sequence ID1
TAGATGTGCCCGTGGGTTTCTAGATGTGCCCGTGGGTTTC
>Sequence ID2
TTGATGTCGTGGGTTTCCCGTAGATGTGCCCGTGGGTT
>Sequence ID3
TTGATGTGCCAGTTTCCCGTTAGATGTGCCCGTGGGTTTC
>Sequence ID4
TTGATGTGTCCCGTCGACACTAGATGTGCCCGTGGG
>Sequence ID5
TTGATTCCCGTCGACACCGGTAGATGTGCCCGTGGGTTTC
Now, what I need is a 'window' of, let's say, 10 characters that slides along the entire alignment in steps of, let's say, 5 characters, writing each window to its own consecutively numbered file: Block1, Block2, etc. Thus, I will end up with the following files:
Block1=
Quote:
>Sequence ID1
TAGATGTGCC
>Sequence ID2
TTGATGTCGT
>Sequence ID3
TTGATGTGCC
>Sequence ID4
TTGATGTGTC
>Sequence ID5
TTGATTCCCG
Block2=
Quote:
>Sequence ID1
GTGCCCGTGG
>Sequence ID2
GTCGTGGGTT
>Sequence ID3
GTGCCAGTTT
>Sequence ID4
GTGTCCCGTC
>Sequence ID5
TCCCGTCGAC
Block3=
Quote:
>Sequence ID1
CGTGGGTTTC
>Sequence ID2
GGGTTTCCCG
>Sequence ID3
AGTTTCCCGT
>Sequence ID4
CCGTCGACAC
>Sequence ID5
TCGACACCGG
And so on. Most probably the last block will not have a full window of 10 characters, and that is perfectly fine.
Any help with this problem will be greatly appreciated!
# 2  
Old 02-10-2011
Which awk implementation are you using?

This should work, although it may run out of open files on long sequences, since every block file stays open:

Code:
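awk '
# Header line: remember it and restart the block counter.
/^>/ { t = $0; c = 0; next }
# Sequence line: slide a window of mx characters in steps of wn,
# writing each window (with its header) to block1, block2, ...
{
  for (i = 1; i <= length($0); i += wn)
    print t RS substr($0, i, mx) > ("block" ++c)
}' mx=10 wn=5 infile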

Let me know if it doesn't work.
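If the one-liner above does hit the open-files limit, one way around it is to hold the sequences in memory and write the blocks one at a time, closing each file before opening the next. This is only a sketch under that assumption: the file name sample.fa, the arrays id and seq, and the variables maxlen and f are illustrative names that do not appear elsewhere in this thread, and the sample input is just the first two sequences from post #1.

```shell
# Sample input: the first two sequences from post #1 (file name is illustrative).
cat > sample.fa <<'EOF'
>Sequence ID1
TAGATGTGCCCGTGGGTTTCTAGATGTGCCCGTGGGTTTC
>Sequence ID2
TTGATGTCGTGGGTTTCCCGTAGATGTGCCCGTGGGTT
EOF

awk -v mx=10 -v wn=5 '
/^>/ { id[++n] = $0; next }                 # header line: start a new sequence
{
  seq[n] = seq[n] $0                         # sequence line: append to the current sequence
  if (length(seq[n]) > maxlen) maxlen = length(seq[n])
}
END {
  for (i = 1; i <= maxlen; i += wn) {        # one output file per window position
    f = "block" ++c
    for (k = 1; k <= n; k++)
      if (i <= length(seq[k]))               # skip sequences shorter than this offset
        print id[k] RS substr(seq[k], i, mx) > f
    close(f)                                 # only one block file is open at a time
  }
}' sample.fa
```

Unlike the one-liner, this version also joins sequences that are wrapped over several lines before windowing, and it needs enough memory to hold the whole alignment.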
This User Gave Thanks to radoulov For This Post:
# 3  
Old 02-10-2011
You're the man!

Thanks a lot!
It worked like a charm!
# 4  
Old 02-11-2011
Little problem

The script does not produce the expected result when the infile looks like this:
Quote:
10 120
Pat3324 aagtggtaag ttcgtgggga gactgcttac taccaaataa gatttgccca
Pat 1234 cttccgatgt accggtcgca gctctggata gaagccagct ccctttgagt
Pat Aqt12 gctcttaaat ctcagaaaac ggtacgtcgc gagggcgtcg gtgaaccccg
Pat-ARl gccagatgga gtgaggaaat ttgagcgcgc gcgtgaacgt cagacctcgt
Pat 222 attttacgag cggtggaggc aggatcgccg tgcgcctgtt cagaacgata
Pat ARQ caccaagtgt gggtgaatac cactgacttg gagactcagt tccgaatctt
PatAA12 tgactggggt gtaagaaact atatcgtgac gttgcgcaat ttgataaacg
2345 taggcacagc ctcaaaagct cttacattta cgaaaccggt atgcatcagt
John Smith aatgagatat caatactcca acgaatgaac ccgatgttgt gtattcaggc
Rabbit gactttgatg ggtacaggtc gacagtccgt actcatagat cgccttcgcc
gtcattgggg acggtggtgt tatgtgccag gggttcgcac tatgggccca
ccccgctcgg catgtataga aacctccggt gtatctaaag tgtgattttg
gaacactatc ccgtaccgat ctgtttaaac gggttgattt ccctaccgac
cctaggcata ccctctaccg atttaactgt taagatagta gacaattaac
cataagcgtg agcgcttcgt atattaagca tgagtcaaaa tctatattgc
gctaagcgca atctatgcac atgggggctc cgtatagagt cgtgcagacg
acgactgacg ctgcgttata agttgtattc gttatatgac agcttagtag
atgtattagt aacacgggaa gaacgcaacg tcggctccta atcgatagca
gtgcttagac tcgcgcaccg cacgtctttc ccaatattga cgcatactgt
tacacaaggg cgtctactgc taaccaatgg acgggtgggc cttaagacgt
aaaatatcgt tcgccctatt
aaggcgagag gggggtctag
cccaaatact gagatgtact
tcctccagct gatttagtgc
ttctgaattc agaaatcctg
cggtaagggc atatttagag
aacataaaaa cgaataatgc
tacactccat atatcggttc
gttaggccca ttgatcatcg
tcttgcttag ccaaagtgcg
I know the problem arises because each sequence is not stored on one continuous line but is split across several lines. How can I get around that?
Thanks!
PS. The first number at the very top is the number of sequences in the file (10), and the second is the number of characters in each sequence, in this case 120.

Last edited by Xterra; 02-11-2011 at 06:18 PM..
# 5  
Old 02-11-2011
Better yet

I am uploading an example of the infile and the first two block files.
Thanks once again!

Last edited by Xterra; 02-11-2011 at 04:40 PM..
# 6  
Old 02-11-2011
Sorry wrong files!

These are the actual files!
# 7  
Old 02-12-2011
The last column of sequence data in block1 is the same as the first column of block2. Please clarify the desired output.
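Pending that clarification, one possible way to handle the wrapped layout from post #4 is to rebuild each sequence in memory first and only then slide the window. The sketch below rests on several assumptions that the thread does not confirm: the window and step keep the meaning they had in post #1, the output headers are written as ">name" as in post #1, sequence data consists only of lowercase a, c, g and t, and no sequence name ends in a word made up of those letters. The file name sample.txt and the variables nseq, name, seq, maxlen and f are illustrative, and the sample input is an abbreviated version of the data in post #4.

```shell
# Abbreviated sample in the layout of post #4: 3 sequences of 30 characters.
cat > sample.txt <<'EOF'
3 30
Pat3324 aagtggtaag ttcgtgggga
Pat 1234 cttccgatgt accggtcgca
John Smith aatgagatat caatactcca
gtcattgggg
ccccgctcgg
gaacactatc
EOF

awk -v mx=10 -v wn=5 '
NR == 1 { nseq = $1 + 0; next }          # header: sequence count, sequence length
NF == 0 { next }                          # skip blank separator lines, if any
{
  k = n % nseq                            # which sequence this line continues
  line = $0
  if (n < nseq && match(line, /( [acgt]+)+$/)) {
    # First section only: split the name from the trailing run of base groups.
    name[k] = substr(line, 1, RSTART - 1)
    line = substr(line, RSTART)
  }
  n++
  gsub(/ /, "", line)                     # drop the spaces between groups
  seq[k] = seq[k] line
  if (length(seq[k]) > maxlen) maxlen = length(seq[k])
}
END {
  for (i = 1; i <= maxlen; i += wn) {     # one output file per window position
    f = "block" ++c
    for (k = 0; k < nseq; k++)
      if (i <= length(seq[k]))
        print ">" name[k] RS substr(seq[k], i, mx) > f
    close(f)
  }
}' sample.txt
```

With a window of 10 and a step of 5, consecutive blocks overlap by 5 characters; setting wn equal to mx would give blocks that do not overlap, if that is what the attached files are meant to show.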