Deleting columns by list or file


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Deleting columns by list or file
# 1  
Old 12-01-2009
Deleting columns by list or file

Dear specialists out there, please help a poor awk newbie:

I have a very huge file to process consisting of 300000 columns and 1500 rows. About 20000 columns shall be deleted from that file. So it is clear, that I can't do this by writing down all the columns in an awk command like $1, $x etc.

I have a text-file containing one column with the identifiers for these 20000 columns, that shall be removed (corresponding to the identifiers in the header/first line in file to process).

Can anyone give me a hint ho to do this with awk or a shell script?
I'd appreciate any kind of help very much!

Best regards, Felix
# 2  
Old 12-01-2009
put your sample file, identifier file and desired output here please.
# 3  
Old 12-01-2009
This is very similar to a post the other day, except that person wanted to keep the columns identified in the column file.

This should work for you:

I created two test files. The first, column.dat, with 21 lines containing 300,000 columns of data delimited by spaces with a header line consisting of column headers:

column.dat
Code:
col_1 col_2 col_3 ... col_299998 col_299999 col_300000

1_1 1_2 1_3 ... 1_299998 col_299999 col_300000
2_1 2_2 2_3 ... 2_299998 col_299999 col_300000
...
...
20_1 20_2 20_3 ... 20_299998 20_299999 20_300000

The second, column.lst, contains the list of columns you want to delete, starting with col_300 and incrementing by 10 up to col_200290:
Code:
col_300
col_310
col_320
...
...
col_200270
col_200280
col_200290

The code is dependent on the columns in column.lst and the column headers in column.dat are in the same order.

Here's the perl code:
Code:
#!/usr/bin/perl

use strict;

my @a_column;
my @a_delcol;
my @a_allcol;
my @a_outcol;
my $datcol;
my $outline;
my $line;
my $date_stamp;
my $i;
my $d;

$date_stamp = localtime time;
print "START:  $date_stamp\n";

open COLFILE, "<column.lst"
  or die "can't open file: $!";

$i=0;
@a_delcol = <COLFILE>;

close COLFILE
  or die "can't close file: $!";

open DATFILE, "<column.dat"
  or die "can't open file: $!";

$datcol = <DATFILE>;
chomp ($datcol);

close DATFILE
  or die "can't close file: $!";

@a_allcol = (split ' ', $datcol);

$i=0;
$d=0;

while ( $a_allcol[$i] )
{
   chomp ($a_allcol[$i]);
   chomp ($a_delcol[$d]);

   if ( $a_allcol[$i] ne $a_delcol[$d] )
   {
      push (@a_column, $i);
      push (@a_outcol, "$a_allcol[$i]");
   }
   else
   {
      $d++;
   }
   $i++;
}

undef @a_allcol;
undef @a_delcol;

open DATFILE, "<column.dat"
  or die "can't open file: $!";

open OUTFILE, ">column.out"
  or die "can't open file: $!";

$outline = join(" ", @a_outcol);
print OUTFILE "$outline\n";

undef @a_outcol;

while($line = <DATFILE>)
{
   chomp($line);
   $outline = join(" ", (split ' ', $line) [@a_column]);
   print OUTFILE "$outline\n";
}

close DATFILE
  or die "can't close file: $!";

close OUTFILE
  or die "can't close file: $!";

$date_stamp = localtime time;
print "END:  $date_stamp\n";


Here's the timing of the script:
Code:
START:  Tue Dec  1 15:00:40 2009
    END:  Tue Dec  1 15:00:50 2009


and the output:
Code:
col_1 col_2 col_3 ... col_299 col_301 ... col_599 col_601 ... col_299999 col_300000
1_1 1_2 1_3 ... 1_299 1_301 ... 1_599 1_601 ... 1_299999 1_300000
...
...
20_1 20_2 20_3 ... 20_299 20_301 ... 20_599 20_601 ... 20_299999 20_300000

I would expect this would take about 12 to 15 minutes to complete depending on your system resources.

Good luck.
# 4  
Old 12-02-2009
Hi jsmithstl,

wow, this code looks really cool! I'll try it out tomorrow (when I have access to our server at the institute again) Thank you very much for your answer!

Today I tried out a quick solution for this problem in (g)awk, which works perfectly (just as hint for future readers of this thread: AWK - delete or extract columns by list of identifiers)

The Perl-Script might be a better solution with respect to performance issues in the future, cause I'll probably have to handle files that are even (much) bigger, than the actual (300000 columns 1500 rows) one.

So thank you again very much, I'll report again after testing.
Best regards, Felix
# 5  
Old 12-04-2009
I changed the code to use a hash to store column.lst This eliminates the need for the column.lst to be in the same order as the column headers in column.dat. I also noticed that the column headers were being written twice to column.out and fixed that.

Here's the revised code:
Code:
#!/usr/bin/perl

use strict;

my $colfile="column.lst";
my $datfile="column.dat";
my $outfile="column.out";
my %seen;
my @a_column;
my @a_allcol;
my @a_colhdr;
my $datcol;
my $delcol;
my $outline;
my $line;
my $date_stamp;
my $i;

$date_stamp = localtime time;
print "START:  $date_stamp\n";

open COLFILE, "<$colfile"
  or die "can't open file: $!";

while ($delcol = <COLFILE>)
{
   chomp($delcol);
   $seen{$delcol}++;
}

close COLFILE
  or die "can't close file: $!";

open DATFILE, "<$datfile"
  or die "can't open file: $!";

$datcol = <DATFILE>;
chomp ($datcol);

@a_allcol = (split ' ', $datcol);

$i=0;

while ( $a_allcol[$i] )
{
   chomp ($a_allcol[$i]);

   unless ( $seen{$a_allcol[$i]} )
   {
      push (@a_column, $i);
      push (@a_colhdr, "$a_allcol[$i]");
      $seen{$a_allcol[$i]}++;
   }
   $i++;
}

undef @a_allcol;

open OUTFILE, ">$outfile"
  or die "can't open file: $!";

$outline = join(" ", @a_colhdr);
print OUTFILE "$outline\n";

undef @a_colhdr;

while($line = <DATFILE>)
{
   chomp($line);
   $outline = join(" ", (split ' ', $line) [@a_column]);
   print OUTFILE "$outline\n";
}

close DATFILE
  or die "can't close file: $!";

close OUTFILE
  or die "can't close file: $!";

$date_stamp = localtime time;
print "END:  $date_stamp\n";

# 6  
Old 12-06-2009
perl code:
Code:
use strict;
my %hash;
my @list;
while(<DATA>){
	chomp;
	$hash{$_}=1;
}
open FH,"<b.txt";
while(<FH>){
	chomp;
	if($.==1){
		my @tmp = split;
		for(my $i=0;$i<=$#tmp;$i++){
			if (not exists $hash{$tmp[$i]}){
				push @list, $i;
			}
		}
	}
	else{
		my @tmp = split;
		print join " ",@tmp[@list];
		print "\n";
	}
}
__DATA__
age
weigh

input file:
b.txt
Code:
name hometown age sex weight
name1 shanxi 30 mail 75
name2 xinjiang 26 femail 63

output:

Code:
name1 shanxi mail
name2 xinjiang femail

Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Deleting a list of words from a text file

Hello, I have a list of words separated by spaces I am trying to delete from a text file, and I could not figure out what is the best way to do this. what I tried (does not work) : delete="password key number verify" arr=($delete) for i in arr { sed "s/\<${arr}\>]*//g" in.txt } >... (5 Replies)
Discussion started by: Hawk4520
5 Replies

2. Shell Programming and Scripting

Deleting all the fields(columns) from a .csv file if all rows in that columns are blanks

Hi Friends, I have come across some files where some of the columns don not have data. Key, Data1,Data2,Data3,Data4,Data5 A,5,6,,10,, A,3,4,,3,, B,1,,4,5,, B,2,,3,4,, If we see the above data on Data5 column do not have any row got filled. So remove only that column(Here Data5) and... (4 Replies)
Discussion started by: ks_reddy
4 Replies

3. Shell Programming and Scripting

Deleting specific columns

Hi group, Can you please tell how to delete specific columns from a file. I know something like awk -F, '{ print $1" "$2" "15 }' input.txt > output.txt will delete all other columns. But this is in a way to copy some particular columns. But is there any other way to select just some... (11 Replies)
Discussion started by: smitra
11 Replies

4. UNIX for Advanced & Expert Users

Help in Deleting columns and Renaming Mutliple columns in a .Csv File

Hi All, i have a .Csv file in the below format startTime, endTime, delta, gName, rName, rNumber, m2239max, m2239min, m2239avg, m100016509avg, m100019240max, metric3min, m100019240avg, propValues 11-Mar-2012 00:00:00, 11-Mar-2012 00:05:00, 300.0, vma3550a, a-1_CPU Index<1>, 200237463, 0.0,... (9 Replies)
Discussion started by: mahi_mayu069
9 Replies

5. UNIX for Dummies Questions & Answers

Deleting columns from a tab delimited text file?

I have a tab limited text file with 10000+ columns. I want to delete columns 6 through 23, how do I go about doing that? Thanks! (1 Reply)
Discussion started by: evelibertine
1 Replies

6. UNIX for Dummies Questions & Answers

Deleting columns from a space delimited text file

I have a space delimited text file with 1,000,000+ columns and 100 rows. I want to delete columns 2 through 5 (2 and 5) included from the text file. How do I do that? Thanks. (3 Replies)
Discussion started by: evelibertine
3 Replies

7. Shell Programming and Scripting

Print columns in a file by number list in another file

This should follow with my last post but I think it's better to start a new one. Now I have a list of numbers stored in pos.txt: 2 6 7 . . . n . . . And I need to extract (2n-1, 2n) columns from matrix.txt: ind1 A B C D E F G H I J K L M N O P Q R ... ind2 B C D E F G H... (3 Replies)
Discussion started by: Zoho
3 Replies

8. Shell Programming and Scripting

Deleting columns from CSV file

Hi All, Am working on perl script which should delete columns in existing CSV file. If my file is : AA,BB,CC,DD 00,11,22,33 00,55,66,77 00,99,88,21 AA,BB... are all my headers can come in any order (e.g AA,CC,BB...) and rest are values. I want to delete column CC... Can anybody help... (2 Replies)
Discussion started by: darshakraut
2 Replies

9. Shell Programming and Scripting

deleting rows & columns form a csv file

Hi , I want to delete some rows & columns from file. can someone please help me on this? Regards. (2 Replies)
Discussion started by: code19
2 Replies

10. Shell Programming and Scripting

Deleting specific columns from a file

Hi Friends, I want to delete specific columns from a file. Say my file content is as follows: "1","a","ww1",1234" "2","b","wwr3","2222" "3","c","erre","3333" Now i want to delete the column 2 and 4 from this file. That is I want the file content to be: "1","ww1" "2","wwr3"... (11 Replies)
Discussion started by: premar
11 Replies
Login or Register to Ask a Question