The UNIX and Linux Forums  



#1
12-01-2009
Registered User
Join Date: Dec 2009
Location: Munich
Posts: 2
Deleting columns by list or file

Dear specialists out there, please help a poor awk newbie:

I have a very large file to process, consisting of 300000 columns and 1500 rows. About 20000 columns need to be deleted from that file, so I clearly can't do this by writing out all the columns in an awk command like $1, $x etc.

I have a text file containing one column with the identifiers of these 20000 columns to be removed (they correspond to the identifiers in the header, i.e. the first line of the file to process).

Can anyone give me a hint how to do this with awk or a shell script?
I'd appreciate any kind of help very much!

Best regards, Felix

#2
12-01-2009
Registered User
Join Date: Jun 2006
Posts: 468
Please post your sample file, identifier file, and desired output here.

#3
12-01-2009
Registered User
Join Date: Oct 2009
Location: St. Louis, MO
Posts: 78
This is very similar to a post the other day, except that person wanted to keep the columns identified in the column file.

This should work for you:

I created two test files. The first, column.dat, has 21 lines: a header line with the column names followed by 20 data lines. Each line holds 300,000 space-delimited columns:

column.dat

Code:
col_1 col_2 col_3 ... col_299998 col_299999 col_300000
1_1 1_2 1_3 ... 1_299998 1_299999 1_300000
2_1 2_2 2_3 ... 2_299998 2_299999 2_300000
...
...
20_1 20_2 20_3 ... 20_299998 20_299999 20_300000

The second, column.lst, contains the list of columns you want to delete, starting with col_300 and incrementing by 10 up to col_200290:

Code:
col_300
col_310
col_320
...
...
col_200270
col_200280
col_200290
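
Test files like the two described above could be generated roughly as follows. This is only a sketch: the helper names make_list and make_data and their parameters are made up for illustration, and the cell format (row_column) is assumed from the listings above.

```shell
# Hypothetical helpers; names and parameters are illustrative only.

# make_list FIRST STEP LAST
# Prints one header name per line: col_FIRST, col_(FIRST+STEP), ... up to col_LAST.
make_list() {
  awk -v first="$1" -v step="$2" -v last="$3" \
    'BEGIN { for (i = first; i <= last; i += step) print "col_" i }'
}

# make_data NCOLS NROWS
# Prints a header line (col_1 ... col_NCOLS) followed by NROWS data lines
# whose fields look like ROW_COLUMN (1_1 1_2 ...), separated by single spaces.
make_data() {
  awk -v ncols="$1" -v nrows="$2" 'BEGIN {
    for (c = 1; c <= ncols; c++)
      printf "col_%d%s", c, (c < ncols ? " " : "\n")
    for (r = 1; r <= nrows; r++)
      for (c = 1; c <= ncols; c++)
        printf "%d_%d%s", r, c, (c < ncols ? " " : "\n")
  }'
}

# Example invocations (not run here because the data file gets large):
#   make_list 300 10 200290 > column.lst
#   make_data 300000 20     > column.dat
```

With the values from this post, make_list 300 10 200290 yields 20,000 names, which matches the number of columns to be deleted in the original question.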

The code depends on the columns in column.lst being listed in the same order as the column headers in column.dat.
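
Whether column.lst really follows the header order can be checked before running the script. The following is a rough sketch of such a check, not part of the Perl code; the function name list_in_header_order is invented here, and it assumes one name per line in the list file and whitespace-separated headers.

```shell
# Hypothetical helper: list_in_header_order LISTFILE DATAFILE
# Exits 0 only if every name in LISTFILE occurs in the header (first line)
# of DATAFILE and the names appear there in the same relative order.
list_in_header_order() {
  awk '
    # First file: remember the names to delete, in list order.
    NR == FNR { want[++n] = $1; next }
    # Header line of the data file: advance through the list whenever the
    # next wanted name is found; appending "" forces a string comparison.
    FNR == 1 {
      d = 1
      for (i = 1; i <= NF && d <= n; i++)
        if (($i "") == (want[d] "")) d++
      exit (d > n ? 0 : 1)
    }
  ' "$1" "$2"
}
```

Usage would be something like: list_in_header_order column.lst column.dat || echo "column.lst is not in header order".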

Here's the perl code:

Code:
#!/usr/bin/perl

use strict;

my @a_column;
my @a_delcol;
my @a_allcol;
my @a_outcol;
my $datcol;
my $outline;
my $line;
my $date_stamp;
my $i;
my $d;

$date_stamp = localtime time;
print "START:  $date_stamp\n";

open COLFILE, "<column.lst"
  or die "can't open file: $!";

$i=0;
@a_delcol = <COLFILE>;

close COLFILE
  or die "can't close file: $!";

open DATFILE, "<column.dat"
  or die "can't open file: $!";

$datcol = <DATFILE>;
chomp ($datcol);

close DATFILE
  or die "can't close file: $!";

@a_allcol = (split ' ', $datcol);

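$i=0;
$d=0;

# the lines read from column.lst still carry their newline; strip them once
chomp (@a_delcol);

# walk the header by index so that a header named "0" does not end the loop early
while ( $i < @a_allcol )
{
   # keep the column unless it matches the next name in the delete list
   if ( $d >= @a_delcol or $a_allcol[$i] ne $a_delcol[$d] )
   {
      push (@a_column, $i);
      push (@a_outcol, $a_allcol[$i]);
   }
   else
   {
      $d++;
   }
   $i++;
}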

undef @a_allcol;
undef @a_delcol;

open DATFILE, "<column.dat"
  or die "can't open file: $!";

open OUTFILE, ">column.out"
  or die "can't open file: $!";

$outline = join(" ", @a_outcol);
print OUTFILE "$outline\n";

undef @a_outcol;

while($line = <DATFILE>)
{
   chomp($line);
   $outline = join(" ", (split ' ', $line) [@a_column]);
   print OUTFILE "$outline\n";
}

close DATFILE
  or die "can't close file: $!";

close OUTFILE
  or die "can't close file: $!";

$date_stamp = localtime time;
print "END:  $date_stamp\n";


Here's the timing of the script:

Code:
START:  Tue Dec  1 15:00:40 2009
END:  Tue Dec  1 15:00:50 2009


and the output:

Code:
col_1 col_2 col_3 ... col_299 col_301 ... col_599 col_601 ... col_299999 col_300000
1_1 1_2 1_3 ... 1_299 1_301 ... 1_599 1_601 ... 1_299999 1_300000
...
...
20_1 20_2 20_3 ... 20_299 20_301 ... 20_599 20_601 ... 20_299999 20_300000

Scaling from the 10 seconds this test took for 20 data rows, I would expect your 1,500-row file to take about 12 to 15 minutes, depending on your system resources.

Good luck.

#4
12-02-2009
Registered User
Join Date: Dec 2009
Location: Munich
Posts: 2
Hi jsmithstl,

Wow, this code looks really cool! I'll try it out tomorrow, when I have access to our server at the institute again. Thank you very much for your answer!

Today I tried out a quick solution for this problem in (g)awk, which works perfectly (as a hint for future readers of this thread: AWK - delete or extract columns by list of identifiers).
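
The (g)awk solution itself is not shown in this thread. As a minimal sketch of how such an awk approach might look (not necessarily the one referred to above), assuming a non-empty list file with one identifier per line, whitespace-separated columns, and a made-up function name delete_cols:

```shell
# Hypothetical helper: delete_cols LISTFILE DATAFILE
# Prints DATAFILE without the columns whose header names appear in LISTFILE.
# Assumes LISTFILE is not empty (otherwise NR == FNR also matches DATAFILE).
delete_cols() {
  awk '
    # First file: remember every identifier that should be deleted.
    NR == FNR { drop[$1] = 1; next }
    # Header line of the data file: record the indices of the columns to keep.
    FNR == 1 {
      for (i = 1; i <= NF; i++)
        if (!($i in drop)) keep[++n] = i
    }
    # Every line of the data file (header included): print the kept columns,
    # field by field, separated by OFS and terminated by ORS.
    {
      for (j = 1; j <= n; j++)
        printf "%s%s", $(keep[j]), (j < n ? OFS : ORS)
    }
  ' "$1" "$2"
}
```

Called as delete_cols column.lst column.dat > column.out, this looks the names up in an associative array, so the order of the identifiers in the list file should not matter.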

The Perl script might be the better solution performance-wise in the future, because I'll probably have to handle files that are (much) bigger than the current one (300000 columns, 1500 rows).

So thank you again very much, I'll report again after testing.
Best regards, Felix

#5
12-04-2009
Registered User
Join Date: Oct 2009
Location: St. Louis, MO
Posts: 78
I changed the code to use a hash to store column.lst. This eliminates the need for column.lst to be in the same order as the column headers in column.dat. I also noticed that the column headers were being written twice to column.out and fixed that.

Here's the revised code:

Code:
#!/usr/bin/perl

use strict;

my $colfile="column.lst";
my $datfile="column.dat";
my $outfile="column.out";
my %seen;
my @a_column;
my @a_allcol;
my @a_colhdr;
my $datcol;
my $delcol;
my $outline;
my $line;
my $date_stamp;
my $i;

$date_stamp = localtime time;
print "START:  $date_stamp\n";

open COLFILE, "<$colfile"
  or die "can't open file: $!";

while ($delcol = <COLFILE>)
{
   chomp($delcol);
   $seen{$delcol}++;
}

close COLFILE
  or die "can't close file: $!";

open DATFILE, "<$datfile"
  or die "can't open file: $!";

$datcol = <DATFILE>;
chomp ($datcol);

@a_allcol = (split ' ', $datcol);

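$i=0;

# walk the header by index so that a header named "0" does not end the loop early
while ( $i < @a_allcol )
{
   # keep the column unless its header is listed in column.lst
   unless ( $seen{$a_allcol[$i]} )
   {
      push (@a_column, $i);
      push (@a_colhdr, $a_allcol[$i]);
      # kept headers are marked as seen too, so a repeated header name is kept only once
      $seen{$a_allcol[$i]}++;
   }
   $i++;
}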

undef @a_allcol;

open OUTFILE, ">$outfile"
  or die "can't open file: $!";

$outline = join(" ", @a_colhdr);
print OUTFILE "$outline\n";

undef @a_colhdr;

while($line = <DATFILE>)
{
   chomp($line);
   $outline = join(" ", (split ' ', $line) [@a_column]);
   print OUTFILE "$outline\n";
}

close DATFILE
  or die "can't close file: $!";

close OUTFILE
  or die "can't close file: $!";

$date_stamp = localtime time;
print "END:  $date_stamp\n";

#6
12-06-2009
Registered User
Join Date: Jun 2007
Location: Beijing China
Posts: 1,125
perl code:

Code:
use strict;
my %hash;
my @list;
while(<DATA>){
	chomp;
	$hash{$_}=1;
}
open FH,"<b.txt" or die "can't open b.txt: $!";
while(<FH>){
	chomp;
	if($.==1){
		my @tmp = split;
		for(my $i=0;$i<=$#tmp;$i++){
			if (not exists $hash{$tmp[$i]}){
				push @list, $i;
			}
		}
	}
	else{
		my @tmp = split;
		print join " ",@tmp[@list];
		print "\n";
	}
}
__DATA__
age
weight

input file:
b.txt

Code:
name hometown age sex weight
name1 shanxi 30 male 75
name2 xinjiang 26 female 63

output:

Code:
name1 shanxi male
name2 xinjiang female
