    #1  
Old 03-13-2010
Registered User
 

Join Date: Mar 2010
Posts: 5
Thanks: 0
Thanked 0 Times in 0 Posts
How to extract a subset from a huge dataset

Hi, All

I have a huge file (450 GB). It is tab-delimited, in the format below:


Code:
x1 A 50020 1
x1 B 50021 8
x1 C 50022 9
x1 A 50023 10
x2 D 50024 5
x2 C 50025 7
x2 F 50026 8
x2 N 50027 1
:
:

Now I want to extract a subset from this file: the rows where column 1 is x10 and column 3 is between 600000 and 30000000. I wrote the following Perl script, but it doesn't work:


Code:
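#!/usr/bin/perl
use strict;
use warnings;

my ($file1, $file2) = @ARGV; # Input file, output file

open(my $in, '<', $file1) or die "Cannot open $file1: $!";
# Open the output once, not once per matching line
open(my $out, '>>', $file2) or die "Cannot open $file2: $!";

while (my $line = <$in>)
{
  chomp($line);
  my @array = split(/\t/, $line);

  # Keep rows whose first column is x10 and whose third column is in range
  if ($array[0] eq 'x10' && $array[2] >= 600000 && $array[2] <= 26279795)
  {
    print $out "$line\n";
  }
}
close $in;
close $out or die "Cannot close $file2: $!";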

I guess the input and output files are both too big for my script to handle.

Does anyone know a good way to do this? Perl or shell scripts are preferred.

All your help will be appreciated!

Last edited by Franklin52; 03-13-2010 at 12:47 PM.. Reason: Please indent your code and use code tags!!
    #2  
Old 03-13-2010
Registered User
 

Join Date: Aug 2009
Location: istanbul not constantinapole
Posts: 269
Thanks: 15
Thanked 10 Times in 10 Posts

Code:
nawk -F'\t' '$1=="x10" && $3>=600000 && $3<=30000000' FILE > SubFILE


Last edited by EAGL€; 03-13-2010 at 11:33 AM.. Reason: didn't see that it is tab-delimited format.
    #3  
Old 03-13-2010
Registered User
 

Join Date: Mar 2010
Posts: 5
Thanks: 0
Thanked 0 Times in 0 Posts
Hi, Eagle

Thanks for your reply. I just tried your command, but it failed. It said:

-bash: nawk: command not found

It seems we don't have nawk on our server.

Do you have any other ideas? Can I just use awk?
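
One way to see which awk variants a server has is to probe the PATH. This is only a sketch: the list of candidate names (nawk, gawk, mawk, awk) is an assumption about common installations, and awk_bin is a variable name chosen here for illustration.

```shell
# Pick the first awk implementation found on PATH.
awk_bin=""
for candidate in nawk gawk mawk awk; do
    if command -v "$candidate" >/dev/null 2>&1; then
        awk_bin=$candidate
        break
    fi
done

if [ -n "$awk_bin" ]; then
    echo "using: $awk_bin"
else
    echo "no awk found on PATH" >&2
fi
```

Whatever name ends up in awk_bin can then be used in place of nawk in the command.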
    #4  
Old 03-13-2010
Moderator
 

Join Date: Feb 2007
Location: The Netherlands
Posts: 7,289
Thanks: 55
Thanked 427 Times in 408 Posts
Try awk instead, or /usr/xpg4/bin/awk on Solaris:


Code:
awk -F'\t' '$1=="x10" && $3>=600000 && $3<=30000000' FILE > SubFILE
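
It may be worth trying the filter on a few lines before running it against the full 450 GB file. The following is a minimal sketch, not a definitive version: the sample rows and temp files are made up for illustration, the variable names key, lo and hi are hypothetical, and the inclusive bounds (>= and <=) are an assumption that follows the Perl script in the first post.

```shell
# Build a tiny tab-delimited sample (hypothetical data, same 4-column layout).
sample=$(mktemp)
subset=$(mktemp)
printf 'x1\tA\t50020\t1\n'      >  "$sample"
printf 'x10\tB\t599999\t8\n'    >> "$sample"   # just below the range
printf 'x10\tC\t600000\t9\n'    >> "$sample"   # lower bound, kept
printf 'x10\tA\t1500000\t10\n'  >> "$sample"   # inside the range, kept
printf 'x100\tD\t700000\t5\n'   >> "$sample"   # different key, rejected by ==
printf 'x10\tF\t30000001\t8\n'  >> "$sample"   # just above the range

# -F'\t' splits on tabs only; the key and bounds are passed in as awk variables.
awk -F'\t' -v key=x10 -v lo=600000 -v hi=30000000 \
    '$1 == key && $3 >= lo && $3 <= hi' "$sample" > "$subset"

cat "$subset"
```

Using == on column 1 keeps the x100 row out, which a regex match such as $1~/x10/ would let through. awk reads the input one line at a time, so the size of the real file affects run time rather than memory use.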
