awk reformat file


 
# 1  
Old 12-05-2013

Hello:
I used a Perl one-liner to reformat a FASTA file so that the sequence lines of each record are joined into a single line.

infile.fasta
Code:
>YAL069W-1.334 Putative promoter
CCACACCACACCCACACACC 
ACACCACACCCACACACACA
ACAGCCCTAATCTAACCC 
>YAL068C-7235.2170 Putative ABC sequence
TACGAGAATAATTT 
ACGTAAATGAAGTT
TATATATAAA 
>gi|31044174|gb|AY143560.1| Tintinnopsis
GAAACTGCGAATGGCTCATTAAAA
TAATTCTAGAGCTAATACATGCTG
AGCATCTGCTATTGTGGTGACTCATAGT
>gi|31044185|gb|AY143571.1|  
ATTACCCAATCCT 
GGGCACCACCAG

outfile.fasta
Code:
>YAL069W-1.334 Putative promoter
CCACACCACACCCACACACCACACCACACCCACACACACAACAGCCCTAATCTAACCC
>YAL068C-7235.2170 Putative ABC sequence
TACGAGAATAATTTACGTAAATGAAGTTTATATATAAA
>gi|31044174|gb|AY143560.1| Tintinnopsis
GAAACTGCGAATGGCTCATTAAAATAATTCTAGAGCTAATACATGCTGAGCATCTGCTATTGTGGTGACTCATAGT
>gi|31044185|gb|AY143571.1| 
ATTACCCAATCCTGGGCACCACCAG

Code:
perl -e 'while (<>) { if (!/^>/) {chomp; print} else{ print "\n",$_; }}' infile.fasta > outfile.fasta

This reminds me of an old post:
Code:
awk 'BEGIN{RS=">"} NR>1 {sub("\n","\t"); gsub("\n",""); print RS$0}' infile.fasta > outfile.tab

That is one step away from what I need, but I could not figure out how to adapt the awk script.
Any help, please? Thanks a lot!
# 2  
Old 12-05-2013
Is this what you want?
Code:
awk 'BEGIN{RS=">"} NR>1 {sub("\n","\t"); gsub("\n",""); sub ("\t","\n"); print RS$0}' file
>YAL069W-1.334 Putative promoter
CCACACCACACCCACACACC ACACCACACCCACACACACAACAGCCCTAATCTAACCC 
>YAL068C-7235.2170 Putative ABC sequence
TACGAGAATAATTT ACGTAAATGAAGTTTATATATAAA 
>gi|31044174|gb|AY143560.1| Tintinnopsis
GAAACTGCGAATGGCTCATTAAAATAATTCTAGAGCTAATACATGCTGAGCATCTGCTATTGTGGTGACTCATAGT
>gi|31044185|gb|AY143571.1|  
ATTACCCAATCCT GGGCACCACCAG
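The blanks left in the joined sequences come from trailing spaces on some input lines. One possible adjustment, a sketch rather than part of the posted answer, is to strip blanks that sit directly before a newline prior to the other substitutions. The file names trailing.fasta and trailing.out are made up for the demo, and the script still relies on the tab placeholder, so it assumes headers contain no tab.

```shell
# Hypothetical sample with trailing blanks on some sequence lines
printf '>seq1 demo header\nCCAC \nACAC\nACAG \n>seq2\nTACG \nAAA\n' > trailing.fasta

awk 'BEGIN { RS = ">" }
NR > 1 {
    gsub(/[ \t]+\n/, "\n")   # added step: drop blanks that sit just before a newline
    sub("\n", "\t")          # first newline -> tab placeholder
    gsub("\n", "")           # delete the remaining newlines
    sub("\t", "\n")          # placeholder -> newline again
    print RS $0              # put the ">" back in front of the header
}' trailing.fasta > trailing.out

cat trailing.out
```

Blanks in the middle of a sequence line would survive this version; it only targets the trailing ones seen in the sample.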

This User Gave Thanks to RudiC For This Post:
# 3  
Old 12-05-2013
Yes, except for the spaces within the sequence rows, which are probably due to trailing spaces in my original file.
Could you explain the sub() and gsub() calls in the script? I know what these functions do in general, but I am not sure how they work together here.
My assumptions:
sub("\n","\t") replaces the first newline with a tab?
gsub("\n","") removes all the remaining newlines within each record?
sub("\t","\n") then turns the tab back into a newline? (What if there is already a tab in the header line, before the DNA sequence?)
I am nervous about space and tab characters in the header lines (i.e. the lines starting with ">"), since a header line may contain tabs.
Thanks!

Last edited by yifangt; 12-05-2013 at 02:00 PM..
# 4  
Old 12-05-2013
He is setting RS to >, so awk reads in blocks delimited by > instead of blocks delimited by \n. This means the first "line", as far as awk is concerned, will look like this:

Code:
YAL069W-1.334 Putative promoter
CCACACCACACCCACACACC 
ACACCACACCCACACACACA
ACAGCCCTAATCTAACCC

The first sub() matches only the first \n it finds and changes it to a tab. It does this so that this position can be found again later (instead of being removed like the rest).

Code:
YAL069W-1.334 Putative promoter        CCACACCACACCCACACACC 
ACACCACACCCACACACACA
ACAGCCCTAATCTAACCC

The gsub() matches all further newlines, deleting them:

Code:
YAL069W-1.334 Putative promoter        CCACACCACACCCACACACCACACCACACCCACACACACAACAGCCCTAATCTAACCC

The final sub() turns the tab back into a newline:

Code:
YAL069W-1.334 Putative promoter
CCACACCACACCCACACACCACACCACACCCACACACACAACAGCCCTAATCTAACCC

...then the program prints it, sticking RS -- which is > -- onto the front first.

Code:
>YAL069W-1.334 Putative promoter
CCACACCACACCCACACACCACACCACACCCACACACACAACAGCCCTAATCTAACCC

This should work nicely if you are using GNU awk on Linux, but awk on other systems may have a record-size limitation of 1 or 2 kilobytes.
This User Gave Thanks to Corona688 For This Post:
# 5  
Old 12-05-2013
Thanks Corona688!
Just to confirm: awk processes one record at a time, split on RS if RS is specified and otherwise split by row, right?
You submitted your answer while I was still writing my reply. My concern is that a tab may be embedded in the first line of each record. I was thinking of treating the first row as the first field ($1) and the rest as the second ($2).
Thanks!
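The idea of treating the header as $1 can be sketched by also setting FS to a newline, so that each line of a record becomes a field: $1 is the header and $2 through $NF are the sequence lines. No tab placeholder is needed, so a tab inside the header is harmless. This is an illustrative sketch, not a script posted in the thread; the file names tabbed.fasta and tabbed.out are made up, and it assumes no header contains a ">" after its first character.

```shell
# Hypothetical sample: the first header contains a tab
printf '>id1\theader with a tab\nCCAC \nACAC\n>id2 plain header\nTACG\nAAA \n' > tabbed.fasta

awk 'BEGIN { RS = ">"; FS = "\n" }
NR > 1 {
    seq = ""
    for (i = 2; i <= NF; i++) {      # fields 2..NF are the sequence lines
        line = $i
        gsub(/[ \t\r]/, "", line)    # strip stray blanks from this line only
        seq = seq line
    }
    print ">" $1                     # $1 is the header, tabs and spaces intact
    print seq
}' tabbed.fasta > tabbed.out

cat tabbed.out
```

Because the whitespace cleanup is applied only to fields 2 and up, the header is printed exactly as it was read.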
# 6  
Old 12-05-2013
Try:

Code:
$ awk '!/^>/{gsub(/[[:space:]]/,x)}{printf /^>/ ? NR == 1 ? $0 RS : RS $0 RS : $0 }END{printf RS}' file

Code:
>YAL069W-1.334 Putative promoter
CCACACCACACCCACACACCACACCACACCCACACACACAACAGCCCTAATCTAACCC
>YAL068C-7235.2170 Putative ABC sequence
TACGAGAATAATTTACGTAAATGAAGTTTATATATAAA
>gi|31044174|gb|AY143560.1| Tintinnopsis
GAAACTGCGAATGGCTCATTAAAATAATTCTAGAGCTAATACATGCTGAGCATCTGCTATTGTGGTGACTCATAGT
>gi|31044185|gb|AY143571.1|  
ATTACCCAATCCTGGGCACCACCAG
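One caveat with passing data directly as the printf format: a "%" in a header would be read as a format specifier. A more defensive line-by-line sketch, written here as an assumption-labelled variant and not as the poster's code, uses an explicit "%s" format and a flag (the hypothetical variable pending) to decide when a sequence line still needs its closing newline. The file names percent.fasta and percent.out are made up.

```shell
# Hypothetical sample: the first header contains a literal percent sign
printf '>id1 100%% GC\nCCAC \nACAC\n>id2\nTT\nGG \n' > percent.fasta

awk '
/^>/ { if (pending) print ""       # finish the previous joined sequence line
       pending = 0
       print                       # header is printed unchanged
       next }
     { gsub(/[ \t\r]/, "")         # remove blanks from the sequence line
       printf "%s", $0             # explicit format, so data is never a format string
       pending = 1 }
END  { if (pending) print "" }     # final newline only if a sequence was written
' percent.fasta > percent.out

cat percent.out
```

Since it reads the file with the default RS, every record is a single input line, so it does not depend on awk being able to hold a whole FASTA entry in one record.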

---edit----
Code:
$ awk '{printf /^>/? NR == 1 ? $0 RS : RS $0 RS : gsub(/[[:space:]]/,x) $0 }END{printf RS}' file   # BUG
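The likely reason this shortened variant is flagged as a bug: gsub() does not return the modified string, it returns the number of substitutions it made. Writing gsub(...) next to $0 therefore concatenates that count with the line. A minimal sketch with made-up input shows the return value:

```shell
# Hypothetical one-line input with a single trailing space
result=$(printf 'CCAC \n' | awk '{ n = gsub(/[ \t]/, ""); print n ":" $0 }')
echo "$result"    # prints 1:CCAC  (one substitution, then the cleaned line)
```

Calling gsub() as its own statement, as in the first version of the command, avoids the stray number.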


Last edited by Akshay Hegde; 12-05-2013 at 03:38 PM.. Reason: to simplify more..
This User Gave Thanks to Akshay Hegde For This Post:
# 7  
Old 12-05-2013
You could say it always works in records. RS just happens to be set to \n by default, so "record" usually means "lines".
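A small sketch makes the difference visible by counting records (NR) on the same file with the default RS and with RS set to ">". The file name rs.fasta and the variable names are made up for the demo.

```shell
# Hypothetical five-line FASTA file with two entries
printf '>a\nAC\nGT\n>b\nTT\n' > rs.fasta

lines=$(awk 'END { print NR }' rs.fasta)                       # default RS: one record per line
blocks=$(awk 'BEGIN { RS = ">" } END { print NR }' rs.fasta)   # RS=">": one record per ">" block

echo "$lines records with the default RS, $blocks records with RS set to >"
```

With RS set to ">" the count is 3 rather than 2, because the text before the very first ">" forms an empty first record. That empty record is what the NR>1 condition in the scripts above skips.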
This User Gave Thanks to Corona688 For This Post: