awk code to reconstruct sequence from alignment


 
# 1  
Old 07-22-2013
awk code to reconstruct sequence from alignment

Hi Everyone,

I need help building one long 'Sbjct' string from the input below, taking the 'Sbjct' lines in increasing order of their starting number (e.g. 26325115, 33716368, 33769033, 34869860, etc.).
The individual 'Sbjct' strings should be separated by 'NNNN', like this:

(Sbjct:26325115-26325094)NNNN(Sbjct:33716368-33716347)NNNN(Sbjct:33769033-33769073)NNNN(Sbjct:34869860-34869889)

The output expected is shown in 'Example Output'.

--------- Example Input (just a small segment of the whole file)-----------

Score = 44.1 bits (22), Expect = 0.30
Identities = 28/30 (93%)
Strand = Plus / Plus

Query: 1684 atcaaaatgaccaaaatatttcattaaaaa 1713
|||||||||| |||||| ||||||||||||
Sbjct: 34869860 atcaaaatgaacaaaatgtttcattaaaaa 34869889

Score = 44.1 bits (22), Expect = 0.30
Identities = 22/22 (100%)
Strand = Plus / Minus


Query: 1758 ttagggtttagagttaaggggt 1779
||||||||||||||||||||||
Sbjct: 26325115 ttagggtttagagttaaggggt 26325094


Score = 44.1 bits (22), Expect = 0.30
Identities = 22/22 (100%)
Strand = Plus / Minus


Query: 1687 aaaatgaccaaaatatttcatt 1708
||||||||||||||||||||||
Sbjct: 33716368 aaaatgaccaaaatatttcatt 33716347


Score = 44.1 bits (22), Expect = 0.30
Identities = 38/42 (90%), Gaps = 1/42 (2%)
Strand = Plus / Plus


Query: 1734 ccctagggttaactaattcaaaccttagggtttagagttaag 1775
||||||| ||||||||| |||| ||||||||||||||||||
Sbjct: 33769033 ccctaggattaactaatctaaac-ttagggtttagagttaag 33769073

----------------------

---------------------- Example Output -----------
Whole Sbjct string
ttagggtttagagttaaggggtNNNNaaaatgaccaaaatatttcattNNNNccctaggattaactaatctaaac-ttagggtttagagttaagNNNNatcaaaatgaacaaaatgtttcattaaaaa

Thanks for your help.
# 2  
Old 07-22-2013
Could you please add code tags?
# 3  
Old 07-22-2013
I need help building one long 'Sbjct' string from the input below, taking the 'Sbjct' lines in increasing order of their starting number (e.g. 26325115, 33716368, 33769033, 34869860, etc.).
The individual 'Sbjct' strings should be separated by 'NNNN', like this:
Code:
(Sbjct:26325115-26325094)NNNN(Sbjct:33716368-33716347)NNNN(Sbjct:33769033-33769073)NNNN(Sbjct:34869860-34869889)

The output expected is shown in 'Example Output'.
Code:
--------- Example Input (just a small segment of the whole file)-----------

Score = 44.1 bits (22), Expect = 0.30
Identities = 28/30 (93%)
Strand = Plus / Plus

Query: 1684 atcaaaatgaccaaaatatttcattaaaaa 1713
|||||||||| |||||| ||||||||||||
Sbjct: 34869860 atcaaaatgaacaaaatgtttcattaaaaa 34869889

Score = 44.1 bits (22), Expect = 0.30
Identities = 22/22 (100%)
Strand = Plus / Minus


Query: 1758 ttagggtttagagttaaggggt 1779
||||||||||||||||||||||
Sbjct: 26325115 ttagggtttagagttaaggggt 26325094


Score = 44.1 bits (22), Expect = 0.30
Identities = 22/22 (100%)
Strand = Plus / Minus


Query: 1687 aaaatgaccaaaatatttcatt 1708
||||||||||||||||||||||
Sbjct: 33716368 aaaatgaccaaaatatttcatt 33716347


Score = 44.1 bits (22), Expect = 0.30
Identities = 38/42 (90%), Gaps = 1/42 (2%)
Strand = Plus / Plus


Query: 1734 ccctagggttaactaattcaaaccttagggtttagagttaag 1775
||||||| ||||||||| |||| ||||||||||||||||||
Sbjct: 33769033 ccctaggattaactaatctaaac-ttagggtttagagttaag 33769073

Code:
---------------------- Example Output -----------
Whole Sbjct string
ttagggtttagagttaaggggtNNNNaaaatgaccaaaatatttcattNNNNccctaggattaactaatctaaac-ttagggtttagagttaagNNNNatcaaaatgaacaaaatgtttcattaaaaa

Thanks for your help.
# 4  
Old 07-22-2013
Fahmida, for future reference, note that you could have edited the 1st message in this thread to add CODE tags instead of duplicating it in a new message.
# 5  
Old 07-22-2013
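If you have gawk (4.0 or later for the numeric sort order):
Code:
gawk '
        # Remember each Sbjct sequence ($3) under its start coordinate ($2)
        /Sbjct/ {
                A[$2] = $3
        }
        END {
                # Sort the start coordinates numerically; the default string
                # order would misplace coordinates with different digit counts
                n = asorti ( A, D, "@ind_num_asc" )
                for ( i = 1; i <= n; i++ )
                        s = s ? s "NNNN" A[D[i]] : A[D[i]]
                print s
        }
' file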

Output:
Code:
ttagggtttagagttaaggggtNNNNaaaatgaccaaaatatttcattNNNNccctaggattaactaatctaaac-ttagggtttagagttaagNNNNatcaaaatgaacaaaatgtttcattaaaaa
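
For awks without asorti(), a minimal sketch of the same idea hands the ordering to sort, which compares the start coordinates as numbers rather than as strings (this matters once the coordinates differ in digit count, e.g. 3294707 vs 11140709). The function name join_sbjct is made up for illustration, and the sketch assumes every alignment line has the form "Sbjct: start sequence end".

```shell
# Hypothetical helper; reads a BLAST text report from file arguments or stdin.
join_sbjct() {
  awk '$1 == "Sbjct:" { print $2 "\t" $3 }' "$@" |   # start coordinate, sequence
    sort -k1,1n |                                    # numeric order of the start coordinate
    awk -F "\t" '{ s = s (NR > 1 ? "NNNN" : "") $2 } END { print s }'
}
```

Run as `join_sbjct file`, it should print the joined string on one line, like the gawk version.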

# 6  
Old 07-22-2013
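With regular awk, try (file2 is the alignment input, file1 holds the template line shown below):
Code:
awk '
  # First file (file2, the alignment input): store each sequence under "start-end"
  NR==FNR{
    if(/Sbjct/)A[$2 "-" $4]=$3
    next
  }
  # Second file (file1, the template): every even field is a "start-end" key
  {
    for(i=2; i<=NF; i+=2) $i=A[$i]
  }
  1
' file2 FS='\\(Sbjct:|\\)' OFS= file1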

Code:
$ cat file1
(Sbjct:26325115-26325094)NNNN(Sbjct:33716368-33716347)NNNN(Sbjct:33769033-33769073)NNNN(Sbjct:34869860-34869889)

Output:
Code:
ttagggtttagagttaaggggtNNNNaaaatgaccaaaatatttcattNNNNccctaggattaactaatctaaac-ttagggtttagagttaagNNNNatcaaaatgaacaaaatgtttcattaaaaa

# 7  
Old 07-23-2013
Dear Yoda,

Thanks for your reply. Your code works: it adds 'NNNN' after each Sbjct line. It would be great if the code could be modified to add 'NNNN' only at the end of the Sbjct lines of a whole block (each block starts with 'Score =...' and may contain many Sbjct lines).
For example, there are three blocks below. The Sbjct starting points of the blocks are 11561582, 3294707 and 11140709.

Code:
Score =  139 bits (70), Expect = 7e-30
 Identities = 103/114 (90%)
 Strand = Plus / Plus
Query: 2040     atctagagtattatggtctttttacatattaaatgaaacattttgatcattttcctactt 2099
                |||||| |||||| |||| |||||| ||||||||||||||||||| |||||||||| |||
Sbjct: 11561582 atctagggtattagggtccttttacctattaaatgaaacattttggtcattttcctcctt 11561641
                                                                     
Query: 2100     gtggtatatttttgtgatcaaaatttgaaaatagtatatttaggagaattgccc 2153
                ||||| |||||||||||||||||  ||||||| || ||||||||||||||||||
Sbjct: 11561642 gtggtctatttttgtgatcaaaacatgaaaatggtctatttaggagaattgccc 11561695

Score =  139 bits (70), Expect = 7e-30
 Identities = 103/114 (90%)
 Strand = Plus / Minus
                                                                         
Query: 2040    atctagagtattatggtctttttacatattaaatgaaacattttgatcattttcctactt 2099
               |||||| |||||| |||| |||||| ||||||||||||||||||| ||||||| || || 
Sbjct: 3294707 atctagggtattagggtcattttacctattaaatgaaacattttggtcatttttctcctc 3294648
                                                                
Query: 2100    gtggtatatttttgtgatcaaaatttgaaaatagtatatttaggagaattgccc 2153
               ||||||||||||||||||||||| |||||||| || ||||||||||||||||||
Sbjct: 3294647 gtggtatatttttgtgatcaaaacttgaaaattgtctatttaggagaattgccc 3294594

 Score =  139 bits (70), Expect = 7e-30
 Identities = 103/114 (90%)
 Strand = Plus / Minus
                                                                       
Query: 2040     atctagagtattatggtctttttacatattaaatgaaacattttgatcattttcctactt 2099
                |||||| |||||| | ||||||||  ||||||||||||||||||| |||||||||| |||
Sbjct: 11140709 atctagggtattaggatctttttagctattaaatgaaacattttggtcattttcctcctt 11140650
                                                                  
Query: 2100     gtggtatatttttgtgatcaaaatttgaaaatagtatatttaggagaattgccc 2153
                ||||| ||||||||||| ||||| || |||||||||||||||||||||||||||
Sbjct: 11140649 gtggtttatttttgtgaccaaaacttaaaaatagtatatttaggagaattgccc 11140596

So,
1) Sort the blocks by the starting point of their Sbjct, i.e. 3294707, 11140709, 11561582.
2) Concatenate the Sbjct strings of each block, then join the blocks in the order of (1), separated by 'NNNN'.

Expected Output is:
Code:
gtggtatatttttgtgatcaaaacttgaaaattgtctatttaggagaattgccc
atctagggtattagggtcattttacctattaaatgaaacattttggtcatttttctcctcNNNN
gtggtttatttttgtgaccaaaacttaaaaatagtatatttaggagaattgcccatctagggta
ttaggatctttttagctattaaatgaaacattttggtcattttcctccttNNNNatctagggtatt
agggtccttttacctattaaatgaaacattttggtcattttcctccttgtggtctatttttgtgatca
aaacatgaaaatggtctatttaggagaattgccc
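
One possible way to get this per-block behaviour, as a sketch only: tag every Sbjct line with its block, let sort order the blocks and the lines inside each block, and add 'NNNN' only where the block changes. The function name join_sbjct_blocks is made up for illustration. The sketch assumes a block begins at each line containing 'Score =', that a block is ordered by the start of its first Sbjct line (as in step 1 above), and that lines inside a block are joined in increasing start order, which is what the expected output suggests.

```shell
# Hypothetical helper; reads a BLAST text report from file arguments or stdin.
join_sbjct_blocks() {
  awk '
    /Score =/ { block++; first = "" }          # a new alignment block starts here
    $1 == "Sbjct:" {
      if (first == "") first = $2              # block key: start of its first Sbjct line
      print first "\t" block "\t" $2 "\t" $3   # key, block number, start, sequence
    }
  ' "$@" |
    sort -k1,1n -k2,2n -k3,3n |                # blocks by key, then lines by start
    awk -F "\t" '
      NR > 1 && $2 != prev { s = s "NNNN" }    # separator only between blocks
      { s = s $4; prev = $2 }
      END { print s }
    '
}
```

Run as `join_sbjct_blocks file`. The result is printed on a single line, without the line wraps shown in the expected output above. When every block holds a single Sbjct line, this reduces to the behaviour of the first request in the thread.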

 