Reformat text table

01-09-2011

Registered User

564, 13

Join Date: Sep 2009

Last Activity: 26 May 2021, 8:59 AM EDT

Location: Saskatchewan, Canada

Posts: 564

Thanks Given: 376

Thanked 13 Times in 12 Posts


Hello,

I have a challenge here to reformat a text table. The original table follows:

Code:

Item01: m1, m2, m3: A; m4, m5, m6: B; m7, m8: C; m9 m10: D
Item02: m1, m9, m10: A; m7, m5, m6: C; m2, m3, m4, m8: D
Item03: m1, m6, m7: A; m2: B; m3, m4: C; m5 m8 m9 m10: D
.
.
.

Please note:
1) m1 through m10 all appear in every row, always;
2) m1 through m10 may appear in any order;
3) A, B, C and D may NOT all appear in every row.

I want the table reformatted to:

Code:

        m1   m2   m3   m4   m5   m6   m7   m8   m9   m10
Item01  A    A    A    B    B    B    C    C    D    D
Item02  A    D    D    D    C    C    C    D    A    A
Item03  A    B    C    C    D    A    A    D    D    D

.
.
.

That is, the header (m1 ~ m10) is the first row and the Item names form the first column. The reformatted table is a two-dimensional structure that is much easier to read.

I have been struggling with this in Perl on my own for a long time and could not figure it out. Any help is highly appreciated. Thanks in advance!

Yifangt

Last edited by Scott; 01-09-2011 at 03:29 AM.. Reason: Code tags

yifangt


01-09-2011

Moderator

12,296, 3,792

Join Date: Nov 2008

Last Activity: 1 January 2021, 1:47 AM EST

Location: Amsterdam

Posts: 12,296

Thanks Given: 679

Thanked 3,792 Times in 3,282 Posts

Try this:

Code:

awk -F'[ \t:;,]*' '{split($0,T); for(i=NF;i>=2;i--)if (T[i]~/m[0-9]/){sub(/m/,x,T[i]);$(T[i]+1)=c}else c=T[i]; NF=11}1 ' OFS="\t" infile

Code:

Item01  A       A       A       B       B       B       C       C       D       D
Item02  A       D       D       D       C       C       C       D       A       A
Item03  A       B       C       C       D       A       A       D       D       D

Last edited by Scrutinizer; 01-09-2011 at 06:14 AM..

Scrutinizer


01-09-2011

Thanks Scrutinizer!

This is amazing, but too complicated for me. Could you explain it, as I can only follow part of your code?

Actually my data is much bigger than the sample, and I left out the header row and some of the columns. I thought of using Perl to parse it and combine all rows with the same SNP name into one row.

1) Each row starts with the SNP name, which can be repeated up to 4 times (in neighbouring rows); some appear only once. The output is a single combined row for each SNP;
2) If the 1st column is the same, then the 2nd, 4th and 5th are the same too (same SNP), which means the same SNP is spread over different rows. This is the biggest difference from my first post;
3) There are 96 variants for each SNP. A variant not listed for a specific SNP means the SNP is missing for that variant, and it should be labeled - or NA to keep the output format consistent;

Sorry for not posting the raw data first; I was trying a Perl script using a hash. I am a geneticist who is fond of programming. Anyway, thank you if you can have a look at this again.

Code:

SNP-name    chromosome-polymorphic-sequence-Species-variants    Locus-(if mapped-to-locus)    Chromosomal-map-location
BKN000000001    1    C    RRS-7;RRS-10;Knox-10;Knox-18;Rmx-A02;Rmx-A180;Pna-17;Pna-10;Eden-1;Eden-2;Lov-1;Lov-5;Fab-2;Fab-4;Bil-5;Bil-7;Var2-1;Var2-6;Spr1-2;Spr1-6;Omo2-1;Omo2-3;Ull2-5;Ull2-3;Zdr-1;Zdr-6;Bor-1;Bor-4;Pu2-7;Pu2-23;Lp2-2;Lp2-6;HR5;HR-10;NFA-8;NFA-10;Sq-1;Sq-8;CIBC5;CIBC17;Tamm-2;Tamm-27;KZ9;Goettingen-7;Goettingen-22;Rennes-1;Rennes-11;Uod-1;Uod-7;Cvi-0;Lz-0;Ei-2;Gu-0;Ler-1;Nd-1;C24;CS22491;Wei-0;Ws-0;Yo-0;Col-0;An-1;Br-0;Est-1;Ag-0;Gy-0;Ra-0;Bay-0;Ga-0;Mrk-0;Mz-0;Wt-5;Kas-1;Ct-1;Mr-0;Tsu-1;Mt-0;Nok-3;Wa-1;Fei-0;Se-0;Ts-1;Ts-5;Pro-0;Ll-0;Kondara;Shahdara;Sorbo;Kin-0;Ms-0;Bur-0;Edi-0;Oy-0;Ws-2    AT1G01280    112482
BKN000000001    1    T    KZ1    AT1G01280    112482
BKN000000002    1    G    RRS-7;RRS-10;Knox-10;Knox-18;Rmx-A02;Rmx-A180;Pna-17;Pna-10;Eden-1;Eden-2;Lov-1;Lov-5;Fab-2;Fab-4;Bil-5;Bil-7;Var2-1;Var2-6;Spr1-2;Spr1-6;Omo2-1;Omo2-3;Ull2-5;Ull2-3;Zdr-1;Zdr-6;Bor-1;Bor-4;Pu2-7;Pu2-23;Lp2-2;Lp2-6;HR5;HR-10;NFA-8;NFA-10;Sq-1;Sq-8;CIBC5;CIBC17;Tamm-2;Tamm-27;KZ1;KZ9;Goettingen-7;Goettingen-22;Rennes-1;Rennes-11;Uod-1;Uod-7;Cvi-0;Lz-0;Ei-2;Gu-0;Ler-1;Nd-1;C24;CS22491;Wei-0;Ws-0;Yo-0;Col-0;An-1;Br-0;Est-1;Ag-0;Gy-0;Ra-0;Bay-0;Ga-0;Mrk-0;Mz-0;Wt-5;Kas-1;Ct-1;Mr-0;Tsu-1;Mt-0;Nok-3;Wa-1;Fei-0;Se-0;Ts-1;Ts-5;Pro-0;Ll-0;Shahdara;Kin-0;Ms-0;Bur-0;Edi-0;Oy-0;Ws-2    AT1G01280    112561
BKN000000002    1    A    Kondara;Sorbo    AT1G01280    112561
BKN000000003    1    A    RRS-7;RRS-10;Knox-10;Knox-18;Rmx-A02;Rmx-A180;Pna-10;Eden-1;Eden-2;Lov-1;Lov-5;Fab-2;Fab-4;Bil-5;Bil-7;Var2-1;Var2-6;Spr1-2;Spr1-6;Omo2-1;Omo2-3;Ull2-5;Ull2-3;Zdr-1;Zdr-6;Bor-1;Bor-4;Pu2-7;Pu2-23;Lp2-2;Lp2-6;Sq-8;CIBC5;CIBC17;Tamm-2;Tamm-27;KZ1;KZ9;Goettingen-7;Goettingen-22;Uod-1;Uod-7;Cvi-0;Ei-2;Gu-0;Ler-1;Nd-1;C24;CS22491;Wei-0;Ws-0;Yo-0;Col-0;An-1;Est-1;Gy-0;Ra-0;Bay-0;Ga-0;Mrk-0;Wt-5;Kas-1;Ct-1;Mr-0;Tsu-1;Mt-0;Nok-3;Wa-1;Se-0;Ts-1;Ts-5;Pro-0;Ll-0;Kondara;Shahdara;Sorbo;Kin-0;Ms-0;Bur-0;Edi-0;Oy-0;Ws-2    AT1G01280    112771
BKN000000003    1    G    Pna-17;HR5;HR-10;NFA-8;NFA-10;Sq-1;Rennes-1;Rennes-11;Lz-0;Br-0;Ag-0;Mz-0;Fei-0    AT1G01280    112771
.
.
.

Thanks again!

Yifangt

Last edited by yifangt; 01-09-2011 at 11:02 AM.. Reason: Code tags

yifangt


01-09-2011

Hi yifangt, you are welcome. Here is an explanation:

awk -F'[ \t:;,]*'	Use zero or more repetitions of the characters in square brackets as the field separator.
split($0,T)	Split the record $0 into array T, using FS as the separator. This effectively creates a copy of $1 to $NF, so that $1 to $NF can be reused for the output.
for(i=NF;i>=2;i--)	Read backwards from the last field down to the 2nd.
if (T[i]~/m[0-9]/)	If the array copy of field number "i" contains "m" followed by a digit,
{sub(/m/,x,T[i])	remove the letter m from that copy (x is an unset variable, so the replacement is the empty string).
$(T[i]+1)=c	Store the character held in variable c in field number T[i] + 1. For example, if T[i] contains 4, store it in $5.
else c=T[i]	If the array copy of field number "i" does not contain m followed by a digit, it must be a new value, which gets stored in variable c.
NF=11	Cut off fields $12 to $NF, so that 11 fields remain.
1	Print every record.
OFS="\t"	Use tab as the output field separator.
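
For readers who find the one-liner dense, the same backwards-scan idea can be spelled out over several lines. This is only a sketch, not Scrutinizer's code: the function name reformat_table, the optional column-count argument (default 10), the header row and the "-" placeholder for a column that never shows up are all assumptions added for illustration. It builds each output row in an array instead of assigning to NF.

```shell
# Hypothetical helper (not from the thread): reads rows such as
#   Item01: m1, m2, m3: A; m4: B
# on stdin and prints a tab-separated table with a header row.
# $1 is the number of m-columns (assumed default: 10).
reformat_table() {
  awk -v ncol="${1:-10}" '
    BEGIN { FS = "[ \t:;,]+"; OFS = "\t" }   # one or more separator characters

    # Header: empty corner cell followed by m1 .. m<ncol>
    NR == 1 {
      hdr = ""
      for (j = 1; j <= ncol; j++) hdr = hdr OFS "m" j
      print hdr
    }

    NF {
      # Start every column with the assumed placeholder
      for (j = 1; j <= ncol; j++) val[j] = "-"
      c = ""
      # Scan backwards: a non-m field is the value for the m-fields before it
      for (i = NF; i >= 2; i--) {
        if ($i ~ /^m[0-9]+$/) val[substr($i, 2) + 0] = c
        else if ($i != "") c = $i
      }
      row = $1
      for (j = 1; j <= ncol; j++) row = row OFS val[j]
      print row
    }
  '
}
```

It could be called as `reformat_table < infile`, or as `reformat_table 96 < infile` for a wider table. Like the one-liner, it relies on the column number being part of the name (m1, m2, ...).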

With your actual raw data what is the required output?

S.

Scrutinizer


01-09-2011

Registered User

313, 60

Join Date: Dec 2010

Last Activity: 7 December 2012, 7:50 PM EST

Location: Albany, NY

Posts: 313

Thanks Given: 15

Thanked 60 Times in 60 Posts

In your data sample:

Quote:

Originally Posted by yifangt

Code:

SNP-name    chromosome-polymorphic-sequence-Species-variants    Locus-(if mapped-to-locus)    Chromosomal-map-location
BKN000000001    1    C    RRS-7;RRS-10;...;Bur-0;Edi-0;Oy-0;Ws-2    AT1G01280    112482
BKN000000001    1    T    KZ1    AT1G01280    112482
BKN000000002    1    G    RRS-7;RRS-10;...;Oy-0;Ws-2    AT1G01280    112561
BKN000000002    1    A    Kondara;Sorbo    AT1G01280    112561
BKN000000003    1    A    RRS-7;RRS-10;...;Bur-0;Edi-0;Oy-0;Ws-2    AT1G01280    112771
BKN000000003    1    G    Pna-17;HR5;HR-10;...;Mz-0;Fei-0    AT1G01280    112771
.
.
.

There are six columns of data but four column headers.
Are the first three data columns the "SNP-Name"?
And the last two the "value", the 'A', 'B', 'C', 'D' in your example?
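
A quick way to see a mismatch like this in a data file is to tally how many whitespace-separated fields each line has. The sketch below is only an illustration; the function name field_counts is made up, and it assumes the columns are separated by spaces or tabs.

```shell
# Hypothetical helper (not from the thread): print "<field count> <number of lines>"
# for every distinct field count found in the given files (or stdin).
field_counts() {
  awk '{ n[NF]++ } END { for (k in n) print k, n[k] }' "$@" | sort -n
}
```

Running `field_counts datafile` prints one line per distinct field count, so a header with a different number of fields than the data rows stands out.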

---------- Post updated at 10:51 AM ---------- Previous update was at 10:35 AM ----------

My initial implementation to generate a CSV file:

Code:

use strict;
use warnings;

$\ = "\n";
$, = '';

my %H;
my %D;

<>; # toss the header

while (<>) {
    chomp;
    my ($snpname, $snpidx, $acgt, $chomosomelist, $locus, $location) = split;

    unless (defined $location) {
        print STDERR $ARGV, '(', $., '): malformed entry - ', $_;
        next;
    }

#> adjust these as required to get a proper "label" and "value"

    $snpname .= '-' . $snpidx;
    $locus   .= '(' . $location . ')';

    foreach my $c (split /;/, $chomosomelist) {
        $H{$c}++;
        $D{$snpname}->{$c} = $locus;
    }
}

sub csv {
    local $, = ',';
    print map { defined $_ ? '"' . $_ . '"' : '"NA"' } @_;
}

my @H = sort keys %H;

csv '', @H;

foreach my $snpname (sort keys %D) {
    my $X = $D{$snpname};
    csv $snpname, map { $X->{$_} } @H;
}

Last edited by m.d.ludwig; 01-09-2011 at 11:52 AM.. Reason: mssing ">" in "while (<>)"

m.d.ludwig


01-09-2011

re: reformat text table

mixed up reply, deleted!

Last edited by yifangt; 01-09-2011 at 11:02 PM.. Reason: Had hard time to format the text.

yifangt


01-09-2011

Moderator

3,791, 1,452

Join Date: Oct 2010

Last Activity: 1 August 2020, 1:38 AM EDT

Posts: 3,791

Thanks Given: 183

Thanked 1,452 Times in 1,302 Posts

Much of the elegance of Scrutinizer's solution came from the fact that the variants in your test data were numbered with the required column number (and each data set was on one line).

I've created a solution that works with your modified spec. For simplicity, I've put the variant names for the columns in an external file (var), one per line:

Code:

$ head var
Ag-0
An-1
Bay-0
Bil-5
Bil-7
Bor-1
Bor-4
Br-0
Bur-0
C24

Code:

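$ cat process_snp 
awk '
BEGIN { printf "SNP chromosome" }
# First file (var): map each variant name to a column index and extend the header
NR == FNR { vars[$0]=NR-1; var_cnt=NR; printf " %s", $0; next }
# First line of datafile: finish the header and skip the column titles
FNR == 1 { printf " Locus location\n"; next}
# SNP name changed: print the row collected so far and reset it
SNP && SNP != $1 {
  printf "%s %s",SNP,ch;
  for(i=0;i<var_cnt;i++) {
     printf " %s", (T[i]=="" ? "-" : T[i]);
     T[i]="";
  }
  printf " %s %s\n", lo, ln
}
# Every data line: record the base ($3) for each variant listed in $4
{
  SNP=$1; ch=$2; lo=$5; ln=$6
  split($4,V,";");
  for(var in V) T[vars[V[var]]]=$3;
}
# Flush the last SNP
END {
  printf "%s %s",SNP,ch;
  for(i=0;i<var_cnt;i++) printf " %s", (T[i]=="" ? "-" : T[i])
  printf " %s %s\n", lo, ln
}' var datafile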

Output is as follows:

Code:

$ ./process_snp
SNP chromosome Ag-0 An-1 Bay-0 Bil-5 Bil-7 Bor-1 Bor-4 Br-0 Bur-0 C24 CIBC17 CIBC5 CS22491 Col-0 Ct-1 Cvi-0 Eden-1 Eden-2 Edi-0 Ei-2 Est-1 Fab-2 Fab-4 Fei-0 Ga-0 Goettingen-22 Goettingen-7 Gu-0 Gy-0 HR-10 HR5 KZ1 KZ9 Kas-1 Kin-0 Knox-10 Knox-18 Kondara Ler-1 Ll-0 Lov-1 Lov-5 Lp2-2 Lp2-6 Lz-0 Mr-0 Mrk-0 Ms-0 Mt-0 Mz-0 NFA-10 NFA-8 Nd-1 Nok-3 Omo2-1 Omo2-3 Oy-0 Pna-10 Pna-17 Pro-0 Pu2-23 Pu2-7 RRS-10 RRS-7 Ra-0 Rennes-1 Rennes-11 Rmx-A02 Rmx-A180 Se-0 Shahdara Sorbo Spr1-2 Spr1-6 Sq-1 Sq-8 Tamm-2 Tamm-27 Ts-1 Ts-5 Tsu-1 Ull2-3 Ull2-5 Uod-1 Uod-7 Var2-1 Var2-6 Wa-1 Wei-0 Ws-0 Ws-2 Wt-5 Yo-0 Zdr-1 Zdr-6 Locus location
BKN000000001 1 C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C T C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C C AT1G01280 112482
BKN000000002 1 G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G A G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G G A G G G G G G G G G G G G G G G G G G G G G G G AT1G01280 112561
BKN000000003 1 G A A A A A A G A A A A A A A A A A A A A A A G A A A A A G G A A A A A A A A A A A A A G A A A A G G G A A A A A A G A A A A A A G G A A A A A A A G A A A A A A A A A A A A A A A A A A A A AT1G01280 112771
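
The thread does not show how the var file was created. One possible way to build it from the data file itself is sketched below; the function name list_variants is made up, and the byte-order (C locale) sort is an assumption based on the column order in the header above, which appears to follow that order.

```shell
# Hypothetical helper (not from the thread): list every variant name found in
# column 4 of the data file (header line skipped), one per line, sorted in
# C-locale byte order with duplicates removed.
list_variants() {
  awk 'FNR > 1 {
         n = split($4, V, ";")
         for (i = 1; i <= n; i++) if (V[i] != "") print V[i]
       }' "$@" |
    LC_ALL=C sort -u
}
```

With this, `list_variants datafile > var` would produce a one-name-per-line file of the kind process_snp expects. A variant that never occurs in any row of the data file will of course be missing from the list.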

Last edited by Chubler_XL; 01-09-2011 at 08:10 PM.. Reason: Update to add locus and location fields to end

Chubler_XL
