How to extract a subset from a huge dataset


 
# 1  
Old 03-13-2010
Hi, All

I have a huge tab-delimited file, about 450 GB in size. Its format is as follows:

Code:
x1 A 50020 1
x1 B 50021 8
x1 C 50022 9
x1 A 50023 10
x2 D 50024 5
x2 C 50025 7
x2 F 50026 8
x2 N 50027 1
:
:

Now I want to extract a subset of this file: the rows where column 1 is x10 and column 3 is between 600000 and 30000000. I wrote the following Perl script, but it doesn't work:

Code:
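#!/usr/bin/perl
use strict;
use warnings;

my ($file1, $file2) = @ARGV;   # input file, output file

open(my $in, '<', $file1) or die "Cannot open $file1: $!";
# Open the output once, instead of reopening it for every matching line
open(my $out, '>>', $file2) or die "Cannot open $file2: $!";

while (my $line = <$in>)
{
  chomp($line);
  my @array = split(/\t/, $line);

  # Column 1 must be x10 and column 3 must fall inside the range
  if ($array[0] eq 'x10' && $array[2] >= 600000 && $array[2] <= 26279795)
  {
    print $out "$line\n";
  }
}
close $in;
close $out or die "Cannot close $file2: $!";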

My guess is that the input and output files are both too big for my script to handle.

Does anyone know a good way to do this? Perl or shell scripts are preferred.

All your help will be appreciated!

Last edited by Franklin52; 03-13-2010 at 01:47 PM.. Reason: Please indent your code and use code tags!!
# 2  
Old 03-13-2010
Code:
nawk -F'\t' '$1=="x10" && $3>=600000 && $3<=30000000' FILE > SubFILE
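
Before running a filter like this on the full 450 GB file, it may be worth checking it on a few hand-made lines. The following is only a sketch: the sample rows and the file names sample.tsv, subset.tsv and subset_regex.tsv are made up for illustration, and plain awk is used so it runs where nawk is not installed. It compares an exact match on column 1 with a regex match, which also lets rows such as x100 through.

```shell
# Hypothetical sample data (tab-delimited); not taken from the real file.
printf 'x1\tA\t700000\t1\nx10\tB\t599999\t8\nx10\tC\t600000\t9\nx10\tA\t30000000\t10\nx10\tD\t30000001\t5\nx100\tC\t700000\t7\n' > sample.tsv

# Exact match on column 1, inclusive range on column 3.
awk -F'\t' '$1 == "x10" && $3 >= 600000 && $3 <= 30000000' sample.tsv > subset.tsv

# Regex match on column 1, for comparison: x100 also contains "x10".
awk -F'\t' '$1 ~ /x10/ && $3 >= 600000 && $3 <= 30000000' sample.tsv > subset_regex.tsv

# Show how many rows each variant selected.
wc -l < subset.tsv
wc -l < subset_regex.tsv
```

With this sample, the exact match should keep only the two x10 rows at the range boundaries, while the regex version picks up the x100 row as well.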


Last edited by EAGL€; 03-13-2010 at 12:33 PM.. Reason: didn't see that it is in tab-delimited format.
# 3  
Old 03-13-2010
Hi, Eagle

Thanks for your reply. I just tried your command, but it failed with:

-bash: nawk: command not found

It seems we don't have nawk on our server.

Do you have any other ideas? Can I just use awk?
# 4  
Old 03-13-2010
Try awk instead, or /usr/xpg4/bin/awk on Solaris:

Code:
awk -F'\t' '$1=="x10" && $3>=600000 && $3<=30000000' FILE > SubFILE
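
With a file this size, most of the run time goes into reading lines that can never match. If the file happens to be grouped by column 1, as the sample in the first post suggests (that is an assumption, not something confirmed in this thread), awk can stop reading as soon as it has passed the x10 block. A rough sketch, using a made-up sample file grouped.tsv:

```shell
# Hypothetical sample, grouped by column 1 except for the last row,
# which is there to show what gets lost if the grouping assumption is wrong.
printf 'x9\tA\t700000\t1\nx10\tB\t500000\t8\nx10\tC\t700000\t9\nx11\tD\t700000\t5\nx10\tE\t800000\t2\n' > grouped.tsv

awk -F'\t' '
  $1 == "x10" {            # inside the x10 block
    seen = 1
    if ($3 >= 600000 && $3 <= 30000000) print
    next
  }
  seen { exit }            # first row after the block: stop reading
' grouped.tsv > grouped_subset.tsv

cat grouped_subset.tsv
```

Only use the early exit if you are sure all x10 rows sit together. In the sample above, the stray x10 row after x11 is never read, so when in doubt stick with the plain one-line filter.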
