Scanning alignment and "extracting" blocks

02-12-2011


gocoogs

I am uploading another example so you can see what I am trying to accomplish. Radoulov's example works wonders when the sequences are linear and continuous; however, in many cases I get this other format, and then the script does not produce the desired output.
In this last example I am generating blocks of 60 characters in length, then I move the 'window' 10 characters and generate the second block, and so on.
Once again, the numbers at the very top of the file indicate the number of sequences in the file (in this case 5) and the length of the sequences (210 for the input file and 60 for the output files, which is the window size).
Thanks a lot
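
To make the window arithmetic concrete, here is a rough sketch (not a script from this thread) that lists the columns each block would cover for a 210-character sequence, a 60-character window and a 10-character step. It assumes only full-width windows are wanted; the post does not say what should happen to a shorter tail.

```shell
#!/bin/sh
# Sketch: which columns does each block cover?
# len, mx and wn are the numbers from the post (210, 60, 10).
windows=$(awk -v len=210 -v mx=60 -v wn=10 'BEGIN {
  # assumption: keep only windows that are the full mx characters wide
  for (i = 1; i + mx - 1 <= len; i += wn)
    printf "block%d: columns %d-%d\n", ++n, i, i + mx - 1
  print n, "full-width blocks"
}')
printf '%s\n' "$windows"
```

With these numbers it reports 16 blocks, the first covering columns 1-60 and the last covering columns 151-210.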

Last edited by Xterra; 02-12-2011 at 10:29 PM..

Xterra


02-12-2011


Here's a start:

Code:

awk 'NR == 1 { numseq = $1; len = $2; blocksize = 40; RS = ""; FS = "\n"; next }  # header: sequence count and length
     { for (i = 1; i <= numseq; i++) sqx[i] = sqx[i] $i }   # line i of each paragraph extends sequence i
     END { for (i = 1; i <= numseq; i++) print sqx[i] }' Infile.txt |
  sed 's/[ ]\{13\}//g'   # remove each run of 13 spaces

Pat3324      aagtggtaag ttcgtgggga gactgcttac taccaaataa gatttgccca gtcattgggg acggtggtgt tatgtgccag gggttcgcac tatgggccca aaaatatcgt tcgccctatt 
Pat 1234     cttccgatgt accggtcgca gctctggata gaagccagct ccctttgagt ccccgctcgg catgtataga aacctccggt gtatctaaag tgtgattttg aaggcgagag gggggtctag 
Pat Aqt12    gctcttaaat ctcagaaaac ggtacgtcgc gagggcgtcg gtgaaccccg gaacactatc ccgtaccgat ctgtttaaac gggttgattt ccctaccgac cccaaatact gagatgtact 
Pat-ARl      gccagatgga gtgaggaaat ttgagcgcgc gcgtgaacgt cagacctcgt cctaggcata ccctctaccg atttaactgt taagatagta gacaattaac tcctccagct gatttagtgc 
Pat 222      attttacgag cggtggaggc aggatcgccg tgcgcctgtt cagaacgata cataagcgtg agcgcttcgt atattaagca tgagtcaaaa tctatattgc ttctgaattc agaaatcctg 
Pat ARQ      caccaagtgt gggtgaatac cactgacttg gagactcagt tccgaatctt gctaagcgca atctatgcac atgggggctc cgtatagagt cgtgcagacg cggtaagggc atatttagag 
PatAA12      tgactggggt gtaagaaact atatcgtgac gttgcgcaat ttgataaacg acgactgacg ctgcgttata agttgtattc gttatatgac agcttagtag aacataaaaa cgaataatgc 
2345         taggcacagc ctcaaaagct cttacattta cgaaaccggt atgcatcagt atgtattagt aacacgggaa gaacgcaacg tcggctccta atcgatagca tacactccat atatcggttc 
John Smith   aatgagatat caatactcca acgaatgaac ccgatgttgt gtattcaggc gtgcttagac tcgcgcaccg cacgtctttc ccaatattga cgcatactgt gttaggccca ttgatcatcg 
Rabbit       gactttgatg ggtacaggtc gacagtccgt actcatagat cgccttcgcc tacacaaggg cgtctactgc taaccaatgg acgggtgggc cttaagacgt tcttgcttag ccaaagtgcg

Last edited by Scott; 02-12-2011 at 05:24 AM.. Reason: Please use code tags

gocoogs


02-12-2011


gocoogs

Your script does not produce the output I need.

Xterra


02-12-2011


It's not easy to come up with a flexible solution.
If you're using GNU awk, you may try something like this:

Code:

awk --re-interval 'END {
  c = m > c ? m : c
  for (i = 1; i <= c; i++) {
    n = 0
    for (j = 1; j <= length(d[p[i]]); j += wn) {
      r = substr(d[p[i]], j, mx)
      gsub(/.{10}/, "& ", r)
      print p[i], r > ("block" ++n)
    }
  }
}
NR > 1 {
  if (!/^  / && NF) {
    t = substr($0, 1, 12)
    sub(t, x)
    p[++c] = t
  }
  NF || f = 1;
  if (f) {
    m = c; c = NF ? ++c : NF
  }
  gsub(/ /, x)
  d[p[c]] = d[p[c]] ? d[p[c]] $0 : $0
}' mx=40 wn=30 infile.txt

radoulov


02-12-2011


This is what I am getting:

Quote:

awk: not an option: --re-interval
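
That message suggests the awk being run here is not GNU awk (an assumption; the thread does not say which awk it is). The only regular expression in the script that uses an interval is the `gsub(/.{10}/, "& ", r)` call, which groups each block in tens. A sketch of a replacement that avoids interval expressions, using a made-up helper called `group10`, could look like this:

```shell
#!/bin/sh
# Sketch: put a space after every 10 characters without the {10} interval.
# group10 is a hypothetical helper name, not part of the script in the thread.
grouped=$(printf '%s\n' 'aagtggtaagttcgtggggagactgcttac' | awk '
function group10(s,    out, i, chunk) {
  out = ""
  for (i = 1; i <= length(s); i += 10) {
    chunk = substr(s, i, 10)
    # a space follows full chunks only, as gsub(/.{10}/, "& ", s) would do
    out = out chunk (length(chunk) == 10 ? " " : "")
  }
  return out
}
{ print group10($0) }')
printf '%s\n' "$grouped"
```

Inside the script, the line `gsub(/.{10}/, "& ", r)` would then become `r = group10(r)`, and the `--re-interval` option could be dropped.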

Now I have been able to homogenize the format of the input file:

Quote:

12 70
GADEN5572 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTGCTTGTAAATATTAAT
GAJFA4268 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAMLT1199 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTGCTTGTAAATATTAAT
GAOCA1250 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAOCA2020 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAOCA2031 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAOCA2081 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAOCA2085 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAOCA2096 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAOCA2102 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAOCA2121 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT
GAOCA2138 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTATTTCCCTTTGTTTTACTTGTAAATATTAAT

I still kinda like your previous code better. Now I just need to be able to add the number of sequences and the length at the very beginning of the file.

Code:

awk '{t=$1; c = x}{for (i = 1; i <= length; i += wn)print t FS"" substr($2, i, mx) > ("block" ++c)}' mx=40 wn=40 infile.txt

I have been trying to modify it to accomplish that, but with no success. The first output file should look like this:

Quote:

12 40
GADEN5572 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAJFA4268 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAMLT1199 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA1250 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2020 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2031 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2081 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2085 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2096 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2102 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2121 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2138 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA

Last edited by Xterra; 02-13-2011 at 01:17 AM..

Xterra


02-13-2011


Bear in mind that the code depends entirely on the format of the input data.
Given your last sample, something like this might work:

Code:

awk 'NR > 1{ 
  c = x
  for (i = 1; i <= length($2); i += wn) 
    print $1, substr($2, i, mx) > ("block" ++c)
    
  }' mx=40 wn=40 infile

There is a big difference between the last example and the previous one,
at least, as far as the awk code is concerned.

And, by the way, what operating system are you using?

radoulov


02-13-2011


The very first line is missing!

I need the number of sequences and the length at the very top of ALL the output (block) files.
This is the desired output for the first outfile (block1):

Quote:

12 40
GADEN5572 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAJFA4268 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAMLT1199 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA1250 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2020 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2031 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2081 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2085 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2096 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2102 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2121 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2138 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA

The output file I am getting using your code

Code:

awk 'NR > 1{ 
  c = x
  for (i = 1; i <= length($2); i += wn) 
    print $1, substr($2, i, mx) > ("block" ++c)
 
  }' mx=40 wn=40 infile

is missing that very first line

Quote:

GADEN5572 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAJFA4268 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAMLT1199 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA1250 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2020 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2031 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2081 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2085 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2096 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2102 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2121 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA
GAOCA2138 AGGCTATAGGCTAAATTTCCCTTTCCCTGTTCCTTCCCTA

I have to include that first line, otherwise the file won't be recognized by my second application.
Thanks!
P.S. I am using Red Hat Linux.
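
One possible way to get that header line into every block file, offered as a sketch rather than a tested fix for the real data: remember the sequence count from the input header, and while handling the first sequence write the count and the block width as the first line of each block file. The file name `infile`, the tiny made-up sequences and the 8-character window are only there to keep the example short. The width is taken from the actual substring, so a shorter final block gets its own length, which is an assumption about what the second application expects.

```shell
#!/bin/sh
# Sketch only: assumes the homogenized layout shown earlier
# (header line "count length", then one "ID SEQUENCE" pair per line).
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Made-up input: 3 sequences of 20 characters (not real data).
cat > infile <<'EOF'
3 20
seqA AAAAACCCCCGGGGGTTTTT
seqB AAAAACCCCCGGGGGTTTTA
seqC AAAAACCCCCGGGGGTTTTC
EOF

awk -v mx=8 -v wn=8 '
NR == 1 { nseq = $1; next }       # keep the sequence count from the input header
{
  c = 0
  for (i = 1; i <= length($2); i += wn) {
    s = substr($2, i, mx)
    c++
    out = "block" c
    if (NR == 2)                  # first sequence: each block file starts with its header
      print nseq, length(s) > out
    print $1, s > out
  }
}' infile

cat block1
```

For the 70-character sample above, the same awk program would be run with `-v mx=40 -v wn=40`, which should give block1 a first line of `12 40`.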

Last edited by Xterra; 02-13-2011 at 11:53 AM..

Xterra
