Scanning alignment and "extracting" blocks


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Scanning alignment and "extracting" blocks
# 8  
Old 02-12-2011
gocoogs

I am uploading another example so you can see what I am trying to accomplish. Radoulov example works wonders when the sequences are linear and continuous, however, in many cases I am getting this other format and then the script does not produced the desired output.
In this last example I am generating blocks of 60 characters in length and then I move the 'window' 10 characters and generate the second block, so on and so forth.
Once again, the numbers at the very top of the file indicate the number of sequences in the file (in this case 5) and the length of the sequences (210 for the input file and 60 for the output files = the window size).
Thanks alot

Last edited by Xterra; 02-12-2011 at 10:29 PM..
# 9  
Old 02-12-2011
Here's a start:
Code:
cat Infile.txt | awk 'NR == 1 {numseq = $1; len= $2; blocksize=40; RS=""; FS="\n"; for (i=1;i<=numseq;i++) sqx[i]=""; next;}
NR != 1 {for (i=1;i<=numseq; i++) sqx[i]=sqx[i] $i;} END {for (i=1;i<=numseq;i++) print (sqx[i])}' | sed 's/[ ]\{13\}//g'

Pat3324      aagtggtaag ttcgtgggga gactgcttac taccaaataa gatttgccca gtcattgggg acggtggtgt tatgtgccag gggttcgcac tatgggccca aaaatatcgt tcgccctatt 
Pat 1234     cttccgatgt accggtcgca gctctggata gaagccagct ccctttgagt ccccgctcgg catgtataga aacctccggt gtatctaaag tgtgattttg aaggcgagag gggggtctag 
Pat Aqt12    gctcttaaat ctcagaaaac ggtacgtcgc gagggcgtcg gtgaaccccg gaacactatc ccgtaccgat ctgtttaaac gggttgattt ccctaccgac cccaaatact gagatgtact 
Pat-ARl      gccagatgga gtgaggaaat ttgagcgcgc gcgtgaacgt cagacctcgt cctaggcata ccctctaccg atttaactgt taagatagta gacaattaac tcctccagct gatttagtgc 
Pat 222      attttacgag cggtggaggc aggatcgccg tgcgcctgtt cagaacgata cataagcgtg agcgcttcgt atattaagca tgagtcaaaa tctatattgc ttctgaattc agaaatcctg 
Pat ARQ      caccaagtgt gggtgaatac cactgacttg gagactcagt tccgaatctt gctaagcgca atctatgcac atgggggctc cgtatagagt cgtgcagacg cggtaagggc atatttagag 
PatAA12      tgactggggt gtaagaaact atatcgtgac gttgcgcaat ttgataaacg acgactgacg ctgcgttata agttgtattc gttatatgac agcttagtag aacataaaaa cgaataatgc 
2345         taggcacagc ctcaaaagct cttacattta cgaaaccggt atgcatcagt atgtattagt aacacgggaa gaacgcaacg tcggctccta atcgatagca tacactccat atatcggttc 
John Smith   aatgagatat caatactcca acgaatgaac ccgatgttgt gtattcaggc gtgcttagac tcgcgcaccg cacgtctttc ccaatattga cgcatactgt gttaggccca ttgatcatcg 
Rabbit       gactttgatg ggtacaggtc gacagtccgt actcatagat cgccttcgcc tacacaaggg cgtctactgc taaccaatgg acgggtgggc cttaagacgt tcttgcttag ccaaagtgcg


Last edited by Scott; 02-12-2011 at 05:24 AM.. Reason: Please use code tags
# 10  
Old 02-12-2011
gocoocs

Your script does not produce the output I need.
# 11  
Old 02-12-2011
It's not easy to come up with a flexible solution.
If you're using GNU awk, you may try something like this:

Code:
awk --re-interval 'END {
  c = m > c ? m : c
  for (i = 1; i <= c; i++) {
  n = 0
  for (j = 1; j <= length(d[p[i]]); j += wn) {
    r = substr(d[p[i]], j, mx)
    gsub(/.{10}/, "& ", r)
    print p[i], r > ("block" ++n)      
      }    
    }
  }  
NR > 1 {
  if (!/^  / && NF) { 
    t = substr($0, 1, 12)
    sub(t, x)
    p[++c] = t
    }
  NF || f = 1; 
  if (f) {
    m = c; c = NF ? ++c : NF
    }
  gsub(/ /, x)
  d[p[c]] = d[p[c]] ? d[p[c]] $0 : $0 
  }' mx=40 wn=30 infile.txt

# 12  
Old 02-12-2011
This is what I am getting
Quote:
awk: not an option: --re-interval
Now I have been able to homogenize the format of the input file:
Quote:
12 70
GADEN5572 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTGCTTGTAAATATTAAT
GAJFA4268 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAMLT1199 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTGCTTGTAAATATTAAT
GAOCA1250 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAOCA2020 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAOCA2031 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAOCA2081 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAOCA2085 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAOCA2096 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAOCA2102 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAOCA2121 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAOCA2138 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
I still kinda like your previous code better. Now I just need to be able to add the number of sequences and the length at the very beggining of the file.
Code:
awk '{t=$1; c = x}{for (i = 1; i <= length; i += wn)print t FS"" substr($2, i, mx) > ("block" ++c)}' mx=40 wn=40 infile.txt

I have been trying to modify it to accomplish that with no success though. The first output file should look like this:
Quote:
12 40
GADEN5572 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAJFA4268 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAMLT1199 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA1250 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2020 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2031 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2081 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2085 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2096 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2102 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2121 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2138 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA

Last edited by Xterra; 02-13-2011 at 01:17 AM..
# 13  
Old 02-13-2011
Consider that the code depends entirely on the format of the input data.
Given your last sample, something like this might work:

Code:
awk 'NR > 1{ 
  c = x
  for (i = 1; i <= length($2); i += wn) 
    print $1, substr($2, i, mx) > ("block" ++c)
    
  }' mx=40 wn=40 infile

There is a big difference between the last example and the previous one,
at least, as far as the awk code is concerned.

And, by the way, what operating system are you using?
# 14  
Old 02-13-2011
The very first line is missing!

I need the number of sequences and the length at the very top of ALL the output (block) files.
This is the desire output for the first outfile (block1):
Quote:
12 40
GADEN5572 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAJFA4268 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAMLT1199 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA1250 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2020 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2031 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2081 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2085 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2096 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2102 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2121 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2138 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
The output file I am getting using your code
Code:
awk 'NR > 1{ 
  c = x
  for (i = 1; i <= length($2); i += wn) 
    print $1, substr($2, i, mx) > ("block" ++c)
 
  }' mx=40 wn=40 infile

is missing that very first line
Quote:
GADEN5572 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAJFA4268 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAMLT1199 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA1250 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2020 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2031 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2081 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2085 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2096 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2102 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2121 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2138 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
I have to include that first line otherwise the file won't be recognized by me second application.
Thanks!
PS. I am using Linux RedHat

Last edited by Xterra; 02-13-2011 at 11:53 AM..
Login or Register to Ask a Question

Previous Thread | Next Thread

9 More Discussions You Might Find Interesting

1. AIX

Apache 2.4 directory cannot display "Last modified" "Size" "Description"

Hi 2 all, i have had AIX 7.2 :/# /usr/IBMAHS/bin/apachectl -v Server version: Apache/2.4.12 (Unix) Server built: May 25 2015 04:58:27 :/#:/# /usr/IBMAHS/bin/apachectl -M Loaded Modules: core_module (static) so_module (static) http_module (static) mpm_worker_module (static) ... (3 Replies)
Discussion started by: penchev
3 Replies

2. Shell Programming and Scripting

Bash script - Print an ascii file using specific font "Latin Modern Mono 12" "regular" "9"

Hello. System : opensuse leap 42.3 I have a bash script that build a text file. I would like the last command doing : print_cmd -o page-left=43 -o page-right=22 -o page-top=28 -o page-bottom=43 -o font=LatinModernMono12:regular:9 some_file.txt where : print_cmd ::= some printing... (1 Reply)
Discussion started by: jcdole
1 Replies

3. UNIX for Dummies Questions & Answers

Extracting Parts of String "#" vs "%"

Hello, I have a question regarding extracting parts of a string and the meaning of # and % in the syntax. I created an example below. # filename=/first/second/third/fourth # # echo $filename /first/second/third/fourth # # echo "${filename##*/}" fourth # # echo "${filename%/*}"... (3 Replies)
Discussion started by: shah9250
3 Replies

4. UNIX for Dummies Questions & Answers

Using "mailx" command to read "to" and "cc" email addreses from input file

How to use "mailx" command to do e-mail reading the input file containing email address, where column 1 has name and column 2 containing “To” e-mail address and column 3 contains “cc” e-mail address to include with same email. Sample input file, email.txt Below is an sample code where... (2 Replies)
Discussion started by: asjaiswal
2 Replies

5. UNIX for Dummies Questions & Answers

find/xargs/*grep: find multi-line empty "try-catch" blocks - eg, missing ; not in a commented block

How can I recursively find all files in a directory and print out the file and first line number of any text blocks that match the below cases? This would seem to involve find, xargs, *grep, regex, etc. In summary, I want to find so-called empty "try-catch blocks" that do not contain code... (0 Replies)
Discussion started by: lifechamp
0 Replies

6. Shell Programming and Scripting

how to use "cut" or "awk" or "sed" to remove a string

logs: "/home/abc/public_html/index.php" "/home/abc/public_html/index.php" "/home/xyz/public_html/index.php" "/home/xyz/public_html/index.php" "/home/xyz/public_html/index.php" how to use "cut" or "awk" or "sed" to get the following result: abc abc xyz xyz xyz (8 Replies)
Discussion started by: timmywong
8 Replies

7. Shell Programming and Scripting

awk command to replace ";" with "|" and ""|" at diferent places in line of file

Hi, I have line in input file as below: 3G_CENTRAL;INDONESIA_(M)_TELKOMSEL;SPECIAL_WORLD_GRP_7_FA_2_TELKOMSEL My expected output for line in the file must be : "1-Radon1-cMOC_deg"|"LDIndex"|"3G_CENTRAL|INDONESIA_(M)_TELKOMSEL"|LAST|"SPECIAL_WORLD_GRP_7_FA_2_TELKOMSEL" Can someone... (7 Replies)
Discussion started by: shis100
7 Replies

8. Shell Programming and Scripting

Extracting string between "_" and "."

Hi, I got several files with this format 1.1.1.1_fa0_1.html or 1.1.1.1_vl100.html and I need just the fa0_1 or the vl100 string. I managed to extract from the vl100 with baseline 1.1.1.1_vl100.html .html | awk -F"_" '{print $NF}' but obviously that command gets only "1" in the fa0_1... (4 Replies)
Discussion started by: warorgyman
4 Replies

9. UNIX for Dummies Questions & Answers

Explain the line "mn_code=`env|grep "..mn"|awk -F"=" '{print $2}'`"

Hi Friends, Can any of you explain me about the below line of code? mn_code=`env|grep "..mn"|awk -F"=" '{print $2}'` Im not able to understand, what exactly it is doing :confused: Any help would be useful for me. Lokesha (4 Replies)
Discussion started by: Lokesha
4 Replies
Login or Register to Ask a Question