Deleting columns by list or file
Dear specialists out there, please help a poor awk newbie:
I have a very large file to process, consisting of 300000 columns and 1500 rows. About 20000 columns need to be deleted from that file, so clearly I can't do this by writing out all the columns in an awk command like $1, $x etc. I have a text file containing one column with the identifiers of the 20000 columns to be removed (corresponding to the identifiers in the header/first line of the file to process). Can anyone give me a hint how to do this with awk or a shell script? I'd appreciate any kind of help very much! Best regards, Felix
This is very similar to a post the other day, except that person wanted to keep the columns identified in the column file. This should work for you.
I created two test files. The first, column.dat, has 21 lines, each containing 300,000 space-delimited columns; the first line holds the column headers. column.dat: Code:
col_1 col_2 col_3 ... col_299998 col_299999 col_300000
1_1 1_2 1_3 ... 1_299998 1_299999 1_300000
2_1 2_2 2_3 ... 2_299998 2_299999 2_300000
...
...
20_1 20_2 20_3 ... 20_299998 20_299999 20_300000
The second, column.lst, contains the list of columns you want to delete, starting with col_300 and incrementing by 10 up to col_200290: Code:
col_300
col_310
col_320
...
...
col_200270
col_200280
col_200290
The code depends on the columns in column.lst being in the same order as the column headers in column.dat. Here's the perl code: Code:
#!/usr/bin/perl
use strict;
use warnings;

my @a_column;    # indexes of the columns to keep
my @a_delcol;    # column names to delete, in header order
my @a_allcol;    # all column names from the header line
my @a_outcol;    # names of the columns to keep
my $datcol;
my $outline;
my $line;
my $date_stamp;
my $i;
my $d;

$date_stamp = localtime time;
print "START: $date_stamp\n";

# Read the list of columns to delete.
open COLFILE, "<column.lst"
    or die "can't open file: $!";
@a_delcol = <COLFILE>;
chomp (@a_delcol);
close COLFILE
    or die "can't close file: $!";

# Read the header line of the data file.
open DATFILE, "<column.dat"
    or die "can't open file: $!";
$datcol = <DATFILE>;
chomp ($datcol);
close DATFILE
    or die "can't close file: $!";
@a_allcol = (split ' ', $datcol);

# Walk both lists in step; this relies on column.lst being in header order.
$i=0;
$d=0;
while ( $i < @a_allcol )
{
    if ( $d < @a_delcol and $a_allcol[$i] eq $a_delcol[$d] )
    {
        $d++;
    }
    else
    {
        push (@a_column, $i);
        push (@a_outcol, $a_allcol[$i]);
    }
    $i++;
}
undef @a_allcol;
undef @a_delcol;

open DATFILE, "<column.dat"
    or die "can't open file: $!";
open OUTFILE, ">column.out"
    or die "can't open file: $!";
$outline = join(" ", @a_outcol);
print OUTFILE "$outline\n";
undef @a_outcol;

# Note: this loop starts at the first line again, so the header is written a second time.
while($line = <DATFILE>)
{
    chomp($line);
    $outline = join(" ", (split ' ', $line) [@a_column]);
    print OUTFILE "$outline\n";
}
close DATFILE
    or die "can't close file: $!";
close OUTFILE
    or die "can't close file: $!";

$date_stamp = localtime time;
print "END: $date_stamp\n";
Here's the timing of the script: Code:
START: Tue Dec 1 15:00:40 2009
END: Tue Dec 1 15:00:50 2009
and the output: Code:
col_1 col_2 col_3 ... col_299 col_301 ... col_599 col_601 ... col_299999 col_300000
1_1 1_2 1_3 ... 1_299 1_301 ... 1_599 1_601 ... 1_299999 1_300000
...
...
20_1 20_2 20_3 ... 20_299 20_301 ... 20_599 20_601 ... 20_299999 20_300000
For your file, I would expect the script to take about 12 to 15 minutes to complete, depending on your system resources. Good luck.
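The 12 to 15 minute figure is presumably a linear extrapolation from the test run: 21 lines took about 10 seconds, and the real file has 1500 rows. A minimal sketch of that arithmetic follows. The helper name estimate_runtime is made up for illustration, and linear scaling with the number of lines is an assumption, not something the timing above proves.

```shell
#!/bin/sh
# estimate_runtime is a hypothetical helper, not part of the Perl script above.
# Assumption: run time grows linearly with the number of lines processed.
estimate_runtime() {
    # $1 = lines in the test file, $2 = seconds the test run took,
    # $3 = lines in the real file
    awk -v tl="$1" -v ts="$2" -v rl="$3" 'BEGIN {
        secs = ts * rl / tl
        printf "%.0f seconds (about %.1f minutes)\n", secs, secs / 60
    }'
}

# 21 test lines took 10 seconds; the real file has 1500 rows.
estimate_runtime 21 10 1500
```

This prints "714 seconds (about 11.9 minutes)", which sits at the lower end of the quoted range; slower disks or a busier machine would push it up.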
Hi jsmithstl,
wow, this code looks really cool! I'll try it out tomorrow (when I have access to our server at the institute again). Thank you very much for your answer! Today I tried out a quick solution for this problem in (g)awk, which works perfectly (just as a hint for future readers of this thread: AWK - delete or extract columns by list of identifiers). The Perl script might be a better solution with respect to performance in the future, because I'll probably have to handle files that are even (much) bigger than the current one (300000 columns, 1500 rows). So thank you again very much, I'll report back after testing. Best regards, Felix
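For readers who want an awk route, here is a minimal sketch of how deleting columns by header name might look. It is not the code from the thread mentioned above. The function name drop_columns is made up for illustration, and it assumes whitespace-delimited input, a header line in the data file, and a non-empty list file with one name per line.

```shell
#!/bin/sh
# drop_columns is a hypothetical helper name.
# Assumptions: whitespace-delimited data, header on the first line of the
# data file, and a non-empty list file with one column name per line.
drop_columns() {
    # $1 = file listing the column names to delete
    # $2 = data file whose first line holds the column names
    awk '
        # first file: remember the names to delete
        NR == FNR { drop[$1] = 1; next }

        # header line of the data file: record positions of surviving columns
        FNR == 1 {
            n = 0
            for (i = 1; i <= NF; i++)
                if (!($i in drop))
                    keep[++n] = i
        }

        # every line of the data file, header included: print kept fields only
        {
            for (j = 1; j <= n; j++)
                printf "%s%s", $(keep[j]), (j < n ? " " : "\n")
        }
    ' "$1" "$2"
}
```

With the file names used in the Perl posts, a call would look like `drop_columns column.lst column.dat > column.out`. Some awk implementations cap the number of fields per record, so a 300000-column file may need gawk.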
I changed the code to use a hash to store column.lst. This eliminates the need for column.lst to be in the same order as the column headers in column.dat. I also noticed that the column headers were being written twice to column.out and fixed that. Here's the revised code: Code:
#!/usr/bin/perl
use strict;
use warnings;

my $colfile="column.lst";
my $datfile="column.dat";
my $outfile="column.out";
my %seen;        # column names to skip
my @a_column;    # indexes of the columns to keep
my @a_allcol;    # all column names from the header line
my @a_colhdr;    # names of the columns to keep
my $datcol;
my $delcol;
my $outline;
my $line;
my $date_stamp;
my $i;

$date_stamp = localtime time;
print "START: $date_stamp\n";

# Load the names of the columns to delete into a hash.
open COLFILE, "<$colfile"
    or die "can't open file: $!";
while ($delcol = <COLFILE>)
{
    chomp($delcol);
    $seen{$delcol}++;
}
close COLFILE
    or die "can't close file: $!";

# Read the header line; DATFILE stays open so the data loop starts at line 2.
open DATFILE, "<$datfile"
    or die "can't open file: $!";
$datcol = <DATFILE>;
chomp ($datcol);
@a_allcol = (split ' ', $datcol);
$i=0;
while ( $i < @a_allcol )
{
    unless ( $seen{$a_allcol[$i]} )
    {
        push (@a_column, $i);
        push (@a_colhdr, $a_allcol[$i]);
        # Marking kept names too means a repeated header name is kept only once.
        $seen{$a_allcol[$i]}++;
    }
    $i++;
}
undef @a_allcol;

open OUTFILE, ">$outfile"
    or die "can't open file: $!";
$outline = join(" ", @a_colhdr);
print OUTFILE "$outline\n";
undef @a_colhdr;
while($line = <DATFILE>)
{
    chomp($line);
    $outline = join(" ", (split ' ', $line) [@a_column]);
    print OUTFILE "$outline\n";
}
close DATFILE
    or die "can't close file: $!";
close OUTFILE
    or die "can't close file: $!";

$date_stamp = localtime time;
print "END: $date_stamp\n";
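To try either script on something smaller than the 300,000-column test file, inputs of the same shape can be generated at any size. The following is a sketch only: make_test_files is a hypothetical helper, and it writes column.dat and column.lst into the current directory, overwriting files of those names.

```shell
#!/bin/sh
# make_test_files is a hypothetical helper; it writes column.dat and
# column.lst into the current directory, overwriting existing files.
make_test_files() {
    # $1 = number of columns, $2 = number of data rows,
    # $3 = first column to delete, $4 = step (must be positive),
    # $5 = last column to delete
    awk -v cols="$1" -v rows="$2" 'BEGIN {
        # header line: col_1 col_2 ... col_<cols>
        for (c = 1; c <= cols; c++)
            printf "col_%d%s", c, (c < cols ? " " : "\n")
        # data lines: <row>_<column>
        for (r = 1; r <= rows; r++)
            for (c = 1; c <= cols; c++)
                printf "%d_%d%s", r, c, (c < cols ? " " : "\n")
    }' > column.dat

    awk -v first="$3" -v step="$4" -v last="$5" 'BEGIN {
        # one column name per line, from first to last in increments of step
        for (c = first + 0; c <= last + 0; c += step)
            printf "col_%d\n", c
    }' > column.lst
}
```

Calling `make_test_files 300000 20 300 10 200290` should reproduce the shape of the test files described earlier in the thread; something like `make_test_files 12 3 3 3 9` is quicker for checking the output by eye.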
perl code: Code:
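use strict;
use warnings;

my %hash;    # column names to delete, read from the __DATA__ section
my @list;    # indexes of the columns to keep
while(<DATA>){
    chomp;
    $hash{$_}=1;
}
open FH,"<b.txt" or die "can't open b.txt: $!";
while(<FH>){
    chomp;
    if($.==1){
        # header line: work out which columns to keep (the header itself is not printed)
        my @tmp = split;
        for(my $i=0;$i<=$#tmp;$i++){
            if (not exists $hash{$tmp[$i]}){
                push @list, $i;
            }
        }
    }
    else{
        my @tmp = split;
        print join " ",@tmp[@list];
        print "\n";
    }
}
close FH;
__DATA__
age
weight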
input file: b.txt Code:
name hometown age sex weight
name1 shanxi 30 mail 75
name2 xinjiang 26 femail 63
output: Code:
name1 shanxi mail
name2 xinjiang femail