Reformat Header of Variable Length


 
# 1  
Old 04-11-2016
Dear Forum,

I am struggling with reformatting headers in protein sequence files. In the input file, each header (a line starting with @) contains a unique ID followed by barcode (bc) information (a, b, c, d, f, g). The header is of variable length, and some records have missing or extra barcodes.

I would like to reformat the barcode information by removing fields c1 and d1 if present. I would also like to shorten records with missing barcodes (e.g. S006) if consecutive barcodes are missing.

I tried something in awk, but it got rather complicated in order to deal with all possible cases.

Thanks for considering my question!


Code:
awk -F "," '{
   if(NF == 8 && $0 ~ "c1:" && $0 ~ "d1:")
    print $1","$2","$3","$5","$7","$8;
    ...
   else
   print $0;
}
'

Input:
Code:
@S001;bc=a:GGT,b:GGT,c:TTG,c1:TTT,d:ACA,d1:AAA,f:TCC,g:TGA;
AWTVM...
@S002;bc=a:GGT,b:GTT,c:ATG,c1:TTT,d:ACA,d1:AAA,f:TCC,g:TGA;
AWTVM...
@S003;bc=a:GGT,b:GTT,c:TTG,d:AGA,d1:AAA,f:TGG,g:TGG;
AWTVM...
@S004;bc=a:GGT,b:GTT,c:ATG,c1:TTT,d:ACA,f:TGG,g:AGG;
AWTVM...
@S005;bc=a:GGT,b:TGT,c:AGG,c1:TTT,d1:AAA,f:TCC;
AWTVM...
@S006;bc=a:GGT,b:TGT,c1:TTT,d:ACA,d1:AAA,f:TCC;
AWTVM...
@S007;bc=a:GGT,b:TGT,c:ATA,d:ACA,f:TCC;
AWTVM...

Output:
Code:
@S001;bc=a:GGT,b:GGT,c:TTG,d:ACA,f:TCC,g:TGA;
AWTVM...
@S002;bc=a:GGT,b:GTT,c:ATG,d:ACA,f:TCC,g:TGA;
AWTVM...
@S003;bc=a:GGT,b:GTT,c:TTG,d:AGA,f:TGG,g:TGG;
AWTVM...
@S004;bc=a:GGT,b:GTT,c:ATG,d:ACA,f:TGG,g:AGG;
AWTVM...
@S005;bc=a:GGT,b:TGT,c:AGG;
AWTVM...
@S006;bc=a:GGT,b:TGT;
AWTVM...
@S007;bc=a:GGT,b:TGT,c:ATA,d:ACA,f:TCC;
AWTVM...

# 2  
Old 04-11-2016
I'm not sure your S006 output is correct as there should be a d: field in there. Try
Code:
awk '
/^@/    {for (i=NF; i>1; i--)   {if ($i ~ /^[cd]1/) sub ($i FS, _)
                                 if ($i ~ /c:/) KP = i
                                }
         if (0 == gsub (/,([cd]1|d)/, "&"))     {NF = KP
                                                 $NF = $NF ";"
                                                }
        }
1
' FS="," OFS="," file
@S001;bc=a:GGT,b:GGT,c:TTG,d:ACA,f:TCC,g:TGA;
AWTVM...
@S002;bc=a:GGT,b:GTT,c:ATG,d:ACA,f:TCC,g:TGA;
AWTVM...
@S003;bc=a:GGT,b:GTT,c:TTG,d:AGA,f:TGG,g:TGG;
AWTVM...
@S004;bc=a:GGT,b:GTT,c:ATG,d:ACA,f:TGG,g:AGG;
AWTVM...
@S005;bc=a:GGT,b:TGT,c:AGG;
AWTVM...
@S006;bc=a:GGT,b:TGT,d:ACA,f:TCC;
AWTVM...
@S007;bc=a:GGT,b:TGT,c:ATA,d:ACA,f:TCC;
AWTVM...

That NF = trick may not be available in all awk versions out there; you may then need to delete the fields one by one.
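A minimal sketch of that fallback, assuming a POSIX awk: instead of assigning to NF, the record is rebuilt from its first k fields and assigned back to $0. The names truncate_header and keep_first are made up for this illustration and are not part of the script above.

```shell
# Truncate header lines to their first k comma-separated fields without
# assigning to NF. "truncate_header" and "keep_first" are hypothetical names.
truncate_header() {
  awk -F, -v OFS=, -v k="$1" '
    # Rebuild the record from fields 1..n; assigning $0 re-splits it.
    function keep_first(n,    i, out) {
      if (n > NF) n = NF
      out = $1
      for (i = 2; i <= n; i++) out = out OFS $i
      $0 = out
    }
    /^@/ {
      keep_first(k)
      sub(/;?$/, ";")      # make sure the header still ends in ";"
    }
    1
  '
}

printf '%s\n' '@S005;bc=a:GGT,b:TGT,c:AGG,f:TCC;' 'AWTVM...' | truncate_header 3
```

With k set to 3, the sample header comes out as @S005;bc=a:GGT,b:TGT,c:AGG; and the sequence line is passed through unchanged. In the script above, keep_first(KP) would take the place of the NF = KP assignment.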
This User Gave Thanks to RudiC For This Post:
# 3  
Old 04-11-2016
Another approach.. Try:
Code:
awk -F\; '
  BEGIN {
    split("bc=a: b: c: d: f: g:",L," ")
  }
  !/^@/{
    print
    next
  }
  {
    gsub(/[cd]1:[^,]+,/,x,$2)
    s=$1 FS
    n=split($2,F,",")
    for(i=1; i<=n; i++) {
      if(F[i]!~"^" L[i])
        break
      if(F[i]~/^[cd]1:/)
        continue
      s=s F[i] ","
    } 
    sub(/,$/,";",s)
    print s
  }
' file

Output:
Code:
@S001;bc=a:GGT,b:GGT,c:TTG,d:ACA,f:TCC,g:TGA;
AWTVM...
@S002;bc=a:GGT,b:GTT,c:ATG,d:ACA,f:TCC,g:TGA;
AWTVM...
@S003;bc=a:GGT,b:GTT,c:TTG,d:AGA,f:TGG,g:TGG;
AWTVM...
@S004;bc=a:GGT,b:GTT,c:ATG,d:ACA,f:TGG,g:AGG;
AWTVM...
@S005;bc=a:GGT,b:TGT,c:AGG;
AWTVM...
@S006;bc=a:GGT,b:TGT;
AWTVM...
@S007;bc=a:GGT,b:TGT,c:ATA,d:ACA,f:TCC;
AWTVM...


Last edited by Scrutinizer; 04-11-2016 at 03:33 PM..
# 4  
Old 04-11-2016
Thanks for the help.

The output for @S006 is correct and is what I need in the end. It is an incomplete record with an information gap: barcode c is missing.

Code:
@S006;bc=a:GGT,b:TGT,c1:TTT,d:ACA,d1:AAA,f:TCC;
AWTVM...

and if I reformat the rest it would look like

Code:
@S006;bc=a:GGT,b:TGT,d:ACA,f:TCC;
AWTVM...

but I need to shorten inconsistent barcode information, i.e. cut the header at the gap:

Code:
@S006;bc=a:GGT,b:TGT;
AWTVM...

Sorry for not explaining myself clearly.

---------- Post updated at 07:26 PM ---------- Previous update was at 07:23 PM ----------

Scrutinizer,

thanks a lot for the truly elegant solution to my problem.
# 5  
Old 04-11-2016
Thank you, you're welcome. I made an adaptation so that it should also work if label a: is missing. Try:

Code:
awk -F\; '
  BEGIN {
    split("a b c d f g", L, " ")
  }
  !/^@/{
    print
    next
  }
  {
    gsub(/[cd]1:[^,]+,|^bc=/,x,$2)
    s=$1 FS "bc="
    n=split($2,F,",")
    for(i=1; i<=n; i++) {
      if(F[i]!~"^" L[i] ":")
        break
      if(F[i]~/^[cd]1:/)
        continue
      s=s F[i] ","
    } 
    sub(/,?$/,";",s)
    print s
  }
' file

So that with this input:
Code:
@S008;bc=b:TGT,c:ATA,d:ACA,f:TCC;
AWTVM...

it would produce:
Code:
@S008;bc=;
AWTVM...
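To see why S008 collapses to an empty barcode list, the label-order check from the loop can be looked at in isolation. The sketch below counts how many leading fields carry the expected labels a, b, c, d, f, g before the first mismatch. The name matching_prefix is hypothetical, and the input is assumed to be the barcode list with the bc= prefix and the c1/d1 fields already stripped.

```shell
# Count how many leading barcode fields follow the expected label order.
# "matching_prefix" is a hypothetical helper, not part of the scripts above.
matching_prefix() {
  awk -F, '
    BEGIN { split("a b c d f g", L, " ") }
    {
      n = 0
      for (i = 1; i <= NF; i++) {
        if ($i !~ "^" L[i] ":") break   # first label out of order: stop
        n++
      }
      print n
    }'
}

printf '%s\n' 'b:TGT,c:ATA,d:ACA,f:TCC' | matching_prefix   # label a: missing
printf '%s\n' 'a:GGT,b:TGT,d:ACA,f:TCC' | matching_prefix   # label c: missing
```

The first call prints 0, so nothing is appended after bc= for S008. The second prints 2, which matches the S006 result of keeping only a: and b:.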


Last edited by Scrutinizer; 04-11-2016 at 03:56 PM..