Shell Programming and Scripting, posted by cliffyiu on 03-13-2010
How to extract a subset from a huge dataset

Hi, All

I have a huge file of about 450 GB. It is tab-delimited, in the format below:

Code:
x1 A 50020 1
x1 B 50021 8
x1 C 50022 9
x1 A 50023 10
x2 D 50024 5
x2 C 50025 7
x2 F 50026 8
x2 N 50027 1
:
:

Now I want to extract a subset of this file: the rows where column 1 is x10 and column 3 is between 600000 and 30000000. I wrote the following Perl script, but it doesn't work:

Code:
#!/usr/bin/perl

$file1 = $ARGV[0]; # Input file
$file2 = $ARGV[1]; # Output file

open (IN, "<", $file1)   or die "Cannot open $file1: $!";
open (OUT, ">>", $file2) or die "Cannot open $file2: $!";  # open the output once, not per matching line

while ($line = <IN>)
{
  chomp($line);
  @array = split(/\t/, $line);

  # keep rows where column 1 is x10 and column 3 falls in the range
  if ($array[0] eq 'x10')
  {
    if (($array[2] >= 600000) && ($array[2] <= 26279795))
    {
      print OUT "$line\n";
    }
  }
}

close OUT;
close IN;
exit;

I suspect the input and output files are simply too big for my script to handle.

Does anyone know a good way to do this? Perl or shell scripts are preferred.
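
Would a streaming awk one-liner along the lines of the sketch below be a reasonable alternative? It reads the file line by line, so it should not need to hold the whole 450 GB in memory. bigfile.txt and subset.txt are just placeholder names, and the bounds are taken from the range described above.

Code:
# keep rows where field 1 is "x10" and field 3 is within [600000, 30000000]
awk -F'\t' '$1 == "x10" && $3 >= 600000 && $3 <= 30000000' bigfile.txt > subset.txt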

Any help will be appreciated!
