Reformat text table


 
# 1  
Old 01-09-2011
Reformat text table

Hello,

I have a challenge here to reformat a text table. The original table follows:
Code:
Item01: m1, m2, m3: A; m4, m5, m6: B; m7, m8: C; m9 m10: D
Item02: m1, m9, m10: A; m7, m5, m6: C; m2, m3, m4, m8: D
Item03: m1, m6, m7: A; m2: B; m3, m4: C; m5 m8 m9 m10: D
.
.
.

Please note:
1) m1 through m10 all show up in every row, always!
2) m1~m10 may be in random order;
3) A, B, C, D may NOT all show up in each row;

I want the table reformatted to:
Code:
        m1   m2   m3   m4   m5   m6   m7   m8   m9   m10
Item01  A    A    A    B    B    B    C    C    D    D
Item02  A    D    D    D    C    C    C    D    A    A
Item03  A    B    C    C    D    A    A    D    D    D

.
.
.

That is, with a header as the first row (m1 ~ m10) and the Item names as the first column. The reformatted table is a two-dimensional structure that is much easier to read.

I have been struggling with this in Perl on my own for a long time and could not figure it out. Any help is highly appreciated. Thanks in advance!

Yifangt

Last edited by Scott; 01-09-2011 at 03:29 AM.. Reason: Code tags
# 2  
Old 01-09-2011
Try this:
Code:
awk -F'[ \t:;,]*' '{split($0,T); for(i=NF;i>=2;i--)if (T[i]~/m[0-9]/){sub(/m/,x,T[i]);$(T[i]+1)=c}else c=T[i]; NF=11}1 ' OFS="\t" infile

Code:
Item01  A       A       A       B       B       B       C       C       D       D
Item02  A       D       D       D       C       C       C       D       A       A
Item03  A       B       C       C       D       A       A       D       D       D


Last edited by Scrutinizer; 01-09-2011 at 06:14 AM..
# 3  
Old 01-09-2011
Thanks Scrutinizer!

This is amazing, but too complicated for me. Could you explain it? I can only follow part of your code.

Actually, my data is much bigger than the sample; in my first post I left out the header row and some of the columns. I was thinking of using Perl to parse it and combine all rows with the same SNP name into one row.

1) Each row starts with the SNP name, which can be repeated up to 4 times (always in neighbouring rows); some appear only once. The output should be a single combined row for each SNP;
2) If the 1st column is the same, then the 2nd, 4th and 5th are the same too (same SNP), i.e. the same SNP is spread over different rows. This is the biggest difference from my first post;
3) There are 96 variants for each SNP. A variant not listed for a specific SNP means the SNP is missing for that variant, and it should be labeled as - or NA to keep the output format consistent;

Sorry for not posting the raw data first; I was trying a Perl script using a hash. I am a geneticist fond of programming. Anyway, thank you if you can have a look at this again.
Code:
SNP-name    chromosome-polymorphic-sequence-Species-variants    Locus-(if mapped-to-locus)    Chromosomal-map-location
BKN000000001    1    C    RRS-7;RRS-10;Knox-10;Knox-18;Rmx-A02;Rmx-A180;Pna-17;Pna-10;Eden-1;Eden-2;Lov-1;Lov-5;Fab-2;Fab-4;Bil-5;Bil-7;Var2-1;Var2-6;Spr1-2;Spr1-6;Omo2-1;Omo2-3;Ull2-5;Ull2-3;Zdr-1;Zdr-6;Bor-1;Bor-4;Pu2-7;Pu2-23;Lp2-2;Lp2-6;HR5;HR-10;NFA-8;NFA-10;Sq-1;Sq-8;CIBC5;CIBC17;Tamm-2;Tamm-27;KZ9;Goettingen-7;Goettingen-22;Rennes-1;Rennes-11;Uod-1;Uod-7;Cvi-0;Lz-0;Ei-2;Gu-0;Ler-1;Nd-1;C24;CS22491;Wei-0;Ws-0;Yo-0;Col-0;An-1;Br-0;Est-1;Ag-0;Gy-0;Ra-0;Bay-0;Ga-0;Mrk-0;Mz-0;Wt-5;Kas-1;Ct-1;Mr-0;Tsu-1;Mt-0;Nok-3;Wa-1;Fei-0;Se-0;Ts-1;Ts-5;Pro-0;Ll-0;Kondara;Shahdara;Sorbo;Kin-0;Ms-0;Bur-0;Edi-0;Oy-0;Ws-2    AT1G01280    112482
BKN000000001    1    T    KZ1    AT1G01280    112482
BKN000000002    1    G    RRS-7;RRS-10;Knox-10;Knox-18;Rmx-A02;Rmx-A180;Pna-17;Pna-10;Eden-1;Eden-2;Lov-1;Lov-5;Fab-2;Fab-4;Bil-5;Bil-7;Var2-1;Var2-6;Spr1-2;Spr1-6;Omo2-1;Omo2-3;Ull2-5;Ull2-3;Zdr-1;Zdr-6;Bor-1;Bor-4;Pu2-7;Pu2-23;Lp2-2;Lp2-6;HR5;HR-10;NFA-8;NFA-10;Sq-1;Sq-8;CIBC5;CIBC17;Tamm-2;Tamm-27;KZ1;KZ9;Goettingen-7;Goettingen-22;Rennes-1;Rennes-11;Uod-1;Uod-7;Cvi-0;Lz-0;Ei-2;Gu-0;Ler-1;Nd-1;C24;CS22491;Wei-0;Ws-0;Yo-0;Col-0;An-1;Br-0;Est-1;Ag-0;Gy-0;Ra-0;Bay-0;Ga-0;Mrk-0;Mz-0;Wt-5;Kas-1;Ct-1;Mr-0;Tsu-1;Mt-0;Nok-3;Wa-1;Fei-0;Se-0;Ts-1;Ts-5;Pro-0;Ll-0;Shahdara;Kin-0;Ms-0;Bur-0;Edi-0;Oy-0;Ws-2    AT1G01280    112561
BKN000000002    1    A    Kondara;Sorbo    AT1G01280    112561
BKN000000003    1    A    RRS-7;RRS-10;Knox-10;Knox-18;Rmx-A02;Rmx-A180;Pna-10;Eden-1;Eden-2;Lov-1;Lov-5;Fab-2;Fab-4;Bil-5;Bil-7;Var2-1;Var2-6;Spr1-2;Spr1-6;Omo2-1;Omo2-3;Ull2-5;Ull2-3;Zdr-1;Zdr-6;Bor-1;Bor-4;Pu2-7;Pu2-23;Lp2-2;Lp2-6;Sq-8;CIBC5;CIBC17;Tamm-2;Tamm-27;KZ1;KZ9;Goettingen-7;Goettingen-22;Uod-1;Uod-7;Cvi-0;Ei-2;Gu-0;Ler-1;Nd-1;C24;CS22491;Wei-0;Ws-0;Yo-0;Col-0;An-1;Est-1;Gy-0;Ra-0;Bay-0;Ga-0;Mrk-0;Wt-5;Kas-1;Ct-1;Mr-0;Tsu-1;Mt-0;Nok-3;Wa-1;Se-0;Ts-1;Ts-5;Pro-0;Ll-0;Kondara;Shahdara;Sorbo;Kin-0;Ms-0;Bur-0;Edi-0;Oy-0;Ws-2    AT1G01280    112771
BKN000000003    1    G    Pna-17;HR5;HR-10;NFA-8;NFA-10;Sq-1;Rennes-1;Rennes-11;Lz-0;Br-0;Ag-0;Mz-0;Fei-0    AT1G01280    112771
.
.
.

Thanks again!

Yifangt

Last edited by yifangt; 01-09-2011 at 11:02 AM.. Reason: Code tags
# 4  
Old 01-09-2011
Hi yifangt, you are welcome. Here is an explanation:

awk -F'[ \t:;,]*'
    Use zero or more repetitions of the characters in square brackets as field separators.
split($0,T)
    Split the record $0 into array T, using FS as the field separator. This effectively creates a copy of $1 to $NF, so that $1 to $NF can be reused for output.
for(i=NF;i>=2;i--)
    Read backwards from the last field number to the 2nd.
if (T[i]~/m[0-9]/)
    If the array copy of field number "i" contains "m" followed by a digit,
{sub(/m/,x,T[i])
    remove the letter m from that field (x is an unset variable, i.e. an empty string).
$(T[i]+1)=c
    Store the character held in variable c in the field whose number is T[i] + 1. For example, if T[i] contains 4, then store it in $5.
else c=T[i]
    If the array copy of field number "i" does not contain m followed by a digit, it must be a new value, which gets stored in variable c.
NF=11
    Cut off fields $12 through $NF, so that 11 fields remain.
1
    Print every record.
OFS="\t"
    Use tab as the output field separator.
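
For anyone who finds the one-liner dense, here is a spelled-out sketch of the same backward scan. It is not the code posted above: it collects the values in an array (row) instead of reassigning fields, uses + rather than * in the separator, and also prints the m1 ~ m10 header row asked for in the first post. The function name reformat_items, the ncol variable and the "-" placeholder for a variant that never shows up are assumptions made for illustration only.

```shell
# reformat_items is a hypothetical helper name, not something from this thread.
# It reads rows such as "Item01: m1, m2: A; m3: B" on stdin and prints a
# tab-separated grid with a header row.
reformat_items() {
  awk -F'[ \t:;,]+' -v OFS='\t' '
  BEGIN {
    ncol = 10                          # assumption: variants are m1 .. m10
    hdr = ""
    for (j = 1; j <= ncol; j++) hdr = hdr OFS "m" j
    print hdr                          # header starts with an empty first cell
  }
  {
    n = split($0, T)                   # T[1] is the item, the rest are names and values
    split("", row)                     # clear the row collected for the previous line
    val = ""
    for (i = n; i >= 2; i--) {         # scan backwards, so a value is seen before its names
      if (T[i] ~ /^m[0-9]+$/)
        row[substr(T[i], 2) + 0] = val # m7 goes to column 7
      else if (T[i] != "")
        val = T[i]                     # A, B, C or D for the names that precede it
    }
    line = T[1]
    for (j = 1; j <= ncol; j++)
      line = line OFS ((j in row) ? row[j] : "-")
    print line
  }'
}
```

It would be called as reformat_items < infile. Building the line from an array avoids relying on assignments to NF, at the cost of a few more lines.
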
With your actual raw data what is the required output?

S.
# 5  
Old 01-09-2011
In your data sample:
Quote:
Originally Posted by yifangt
Code:
SNP-name    chromosome-polymorphic-sequence-Species-variants    Locus-(if mapped-to-locus)    Chromosomal-map-location
BKN000000001    1    C    RRS-7;RRS-10;...;Bur-0;Edi-0;Oy-0;Ws-2    AT1G01280    112482
BKN000000001    1    T    KZ1    AT1G01280    112482
BKN000000002    1    G    RRS-7;RRS-10;...;Oy-0;Ws-2    AT1G01280    112561
BKN000000002    1    A    Kondara;Sorbo    AT1G01280    112561
BKN000000003    1    A    RRS-7;RRS-10;...;Bur-0;Edi-0;Oy-0;Ws-2    AT1G01280    112771
BKN000000003    1    G    Pna-17;HR5;HR-10;...;Mz-0;Fei-0    AT1G01280    112771
.
.
.

There are six columns of data but four column headers.
Are the first three data columns the "SNP-Name"?
And the last two the "value", the 'A', 'B', 'C', 'D' in your example?

---------- Post updated at 10:51 AM ---------- Previous update was at 10:35 AM ----------

My initial implementation to generate a CSV file:
Code:
use strict;
use warnings;

$\ = "\n";
$, = '';

my %H;
my %D;

<>; # toss the header

while (<>) {
    chomp;
    # six whitespace-separated columns per data row
    my ($snpname, $snpidx, $acgt, $chromosomelist, $locus, $location) = split;

    unless (defined $location) {
        print STDERR $ARGV, '(', $., '): malformed entry - ', $_;
        next;
    }

#> adjust these as required to get a proper "label" and "value"

    $snpname .= '-' . $snpidx;
    $locus   .= '(' . $location . ')';

    # %H collects every variant name seen; %D maps label -> variant -> value
    foreach my $c (split /;/, $chromosomelist) {
        $H{$c}++;
        $D{$snpname}->{$c} = $locus;
    }
}

sub csv {
    local $, = ',';
    print map { defined $_ ? '"' . $_ . '"' : '"NA"' } @_;
}

my @H = sort keys %H;

csv '', @H;

foreach my $snpname (sort keys %D) {
    my $X = $D{$snpname};
    csv $snpname, map { $X->{$_} } @H;
}


Last edited by m.d.ludwig; 01-09-2011 at 11:52 AM.. Reason: missing ">" in "while (<>)"
# 6  
Old 01-09-2011
re: reformat text table

mixed up reply, deleted!

Last edited by yifangt; 01-09-2011 at 11:02 PM.. Reason: Had hard time to format the text.
# 7  
Old 01-09-2011
Much of the elegance of Scrutinizer's solution came from the fact that the variants in your test data were numbered with the column number required (and each data set was on 1 line).

I've created a solution that works with your modified spec. For simplicity, I've put the column variant names in an external file (var), 1 per line:
Code:
$ head var
Ag-0
An-1
Bay-0
Bil-5
Bil-7
Bor-1
Bor-4
Br-0
Bur-0
C24
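
If the var file does not exist yet, something along these lines could build it from the data file itself. This is only a sketch under a few assumptions: the function name list_variants is made up, the variants sit in column 4 separated by semicolons as in the sample, and the first line of the data file is the header. Sorting in byte order (LC_ALL=C) is a guess at how the list above was ordered; the script below takes its column order from whatever order the file has.

```shell
# list_variants is a hypothetical helper name. It prints every distinct
# variant name found in column 4, one per line, sorted in byte order.
list_variants() {
  awk 'FNR > 1 {                       # skip the header line of each file
         n = split($4, V, ";")
         for (i = 1; i <= n; i++)
           if (V[i] != "") print V[i]
       }' "$@" | LC_ALL=C sort -u
}
```

It would be called as list_variants datafile > var before running the script.
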

Code:
$ cat process_snp 
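awk '
# First file (var): map each variant name to a 0-based column index
# and print it as part of the header line.
BEGIN { printf "SNP chromosome" }
NR == FNR { vars[$0]=NR-1; var_cnt=NR; printf " %s", $0; next }
# First line of the data file: finish the header, skip the data header.
FNR == 1 { printf " Locus location\n"; next}
# A new SNP name starts: print the row collected for the previous SNP.
SNP && SNP != $1 {
  printf "%s %s",SNP,ch;
  for(i=0;i<var_cnt;i++) {
     printf " %s", (T[i]=="" ? "-" : T[i]);
     T[i]="";
  }
  printf " %s %s\n", lo, ln
}
# Every data line: record the allele ($3) for each variant listed in $4.
{
  SNP=$1; ch=$2; lo=$5; ln=$6
  split($4,V,";");
  for(var in V) T[vars[V[var]]]=$3;
}
# Print the row for the last SNP.
END {
  printf "%s %s",SNP,ch;
  for(i=0;i<var_cnt;i++) printf " %s", (T[i]=="" ? "-" : T[i])
  printf " %s %s\n", lo, ln
}' var datafile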


Output is as follows:

Code:
$ ./process_snp
SNP chromosome Ag-0 An-1 Bay-0 Bil-5 Bil-7 Bor-1 Bor-4 Br-0 Bur-0 C24 CIBC17 CIBC5 CS22491 Col-0 Ct-1 Cvi-0 Eden-1 Eden-2 Edi-0 Ei-2 Est-1 Fab-2 Fab-4 Fei-0 Ga-0 Goettingen-22 Goettingen-7 Gu-0 Gy-0 HR-10 HR5 KZ1 KZ9 Kas-1 Kin-0 Knox-10 Knox-18 Kondara Ler-1 Ll-0 Lov-1 Lov-5 Lp2-2 Lp2-6 Lz-0 Mr-0 Mrk-0 Ms-0 Mt-0 Mz-0 NFA-10 NFA-8 Nd-1 Nok-3 Omo2-1 Omo2-3 Oy-0 Pna-10 Pna-17 Pro-0 Pu2-23 Pu2-7 RRS-10 RRS-7 Ra-0 Rennes-1 Rennes-11 Rmx-A02 Rmx-A180 Se-0 Shahdara Sorbo Spr1-2 Spr1-6 Sq-1 Sq-8 Tamm-2 Tamm-27 Ts-1 Ts-5 Tsu-1 Ull2-3 Ull2-5 Uod-1 Uod-7 Var2-1 Var2-6 Wa-1 Wei-0 Ws-0 Ws-2 Wt-5 Yo-0 Zdr-1 Zdr-6 Locus location
BKN000000001 1 C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C T C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C AT1G01280 112482
BKN000000002 1 G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G A G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G A G G G G G G G G G G G G G G G G G G G G G G G AT1G01280 112561
BKN000000003 1 G A A A A A A G A A A A A A A A A A A A A A A G A A A A A G G A A A A A A A A A A A A A G A A A A G G G A A A A A A G A A A A A A G G A A A A A A A G A A A A A A A A A A A A A A A A A A A A AT1G01280 112771
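
Rows this wide are hard to check by eye, so a quick sanity check may help. The sketch below assumes space-separated output like the above and simply compares the field count of every row with that of the header; check_columns is a hypothetical name.

```shell
# check_columns is a hypothetical helper name. It reads the reformatted
# table on stdin, reports rows whose field count differs from the header,
# and returns a non-zero status if any such row was found.
check_columns() {
  awk 'NR == 1 { want = NF; next }
       NF != want {
         printf "line %d: %d fields, expected %d\n", NR, NF, want
         bad = 1
       }
       END { exit bad + 0 }'
}
```

It would be called as ./process_snp | check_columns and prints nothing when every row lines up with the header.
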


Last edited by Chubler_XL; 01-09-2011 at 08:10 PM.. Reason: Update to add locus and location fields to end