Sponsored Content
Top Forums Shell Programming and Scripting How to extract a subset from a huge dataset Post 302403613 by cliffyiu on Saturday 13th of March 2010 11:06:43 AM
Old 03-13-2010
How to extract a subset from a huge dataset

Hi, All

I have a huge file which has 450G. Its tab-delimited format is as below

Code:
x1 A 50020 1
x1 B 50021 8
x1 C 50022 9
x1 A 50023 10
x2 D 50024 5
x2 C 50025 7
x2 F 50026 8
x2 N 50027 1
:
:

Now, I want to extract a subset from this file. In this subset, column 1 is x10, column 2 is from 600000 to 30000000. I wrote the following perl script but it doesn't work:

Code:
#!/usr/bin/perl

$file1 = $ARGV[0]; # Input file
$file2 = $ARGV[1]; # Output file

open (IN, $file1);
while ($line = <IN>)
{
  chomp($line);
  @array = split(/\t/,$line);

  if ($array[0] eq 'x10')
  {
    if (($array[2] >= 600000) && ($array[2] <= 26279795))
    {
      open (OUT, ">>$file2");
      print OUT "$line\n";
      close OUT;
    }
  }
}
close IN;
exit;

I guess the input file and output file are both too big that my script can't handle it.

Anyone knows if there is any good way to do it? Perl or Shell scripts are preferred..

All your help will be appreciated!

Last edited by Franklin52; 03-13-2010 at 01:47 PM.. Reason: Please indent your code and use code tags!!
 

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Accessing Mainframe Dataset

Hi May I know is there a way to read/copy a mainframe (IBM OS/390) dataset (sequential file) into a UNIX directory? Thank you for your time. IcyGuava (4 Replies)
Discussion started by: IcyGuava
4 Replies

2. Shell Programming and Scripting

How to extract data from a huge file?

Hi, I have a huge file of bibliographic records in some standard format.I need a script to do some repeatable task as follows: 1. Needs to create folders as the strings starts with "item_*" from the input file 2. Create a file "contents" in each folders having "license.txt(tab... (5 Replies)
Discussion started by: srsahu75
5 Replies

3. Shell Programming and Scripting

How to extract a piece of information from a huge file

Hello All, I need some assistance to extract a piece of information from a huge file. The file is like this one : database information ccccccccccccccccc ccccccccccccccccc ccccccccccccccccc ccccccccccccccccc os information cccccccccccccccccc cccccccccccccccccc... (2 Replies)
Discussion started by: Marcor
2 Replies

4. Shell Programming and Scripting

Normalize a dataset with AWK

Hello everyone, i have to normalize this dataset (with 20.000 rows): 2,4,4,3,2,7,8,2,9,11,7,7,1,8,5,6 4,7,5,5,5,5,9,6,4,8,7,9,2,9,7,10 7,10,8,7,4,8,8,5,10,11,2,8,2,5,5,10 4,9,5,7,4,7,7,13,1,7,6,8,3,8,0,8,8 6,7,8,5,4,7,6,3,7,10,7,9,3,8,3,7,8 in this form:... (1 Reply)
Discussion started by: [raven]
1 Replies

5. Programming

Dataset Library for C?

I am looking for an opensource dataset library for C. Something equivalent to ADO.Net. Specifically, I am looking for the following features: 1. Create a Dataset from a file (XML or CSV). 2. Create a Dataset from a select query using an ODBC connection. 3. Load a created Dataset into a... (1 Reply)
Discussion started by: a_programmer
1 Replies

6. Solaris

flarecreate for zfs root dataset and ignore multiple dataset

Hi All, I want to write a script to create flar images on multiple servers. In non zfs filesystem I am using -X option to refer a file to exclude mounts on different servers. but on ZFS -X option is not working. I want multiple mounts to be ignore on ZFS base system during flarecreate. I... (0 Replies)
Discussion started by: uxravi
0 Replies

7. Shell Programming and Scripting

How to remove a subset of data from a large dataset based on values on one line

Hello. I was wondering if anyone could help. I have a file containing a large table in the format: marker1 marker2 marker3 marker4 position1 position2 position3 position4 genotype1 genotype2 genotype3 genotype4 with marker being a name, position a numeric... (2 Replies)
Discussion started by: davegen
2 Replies

8. UNIX for Advanced & Expert Users

How to extract subset file from dataset?

Hello I have a data set which looks like this : progeny sire dam gender 12 1 3 M 13 2 4 F 14 2 5 F 15 6 5 ... (13 Replies)
Discussion started by: sajmar
13 Replies

9. Shell Programming and Scripting

Extract few content from a huge list of files

I have a huge list of files (about 300,000) which have a pattern like this. .I 1 .U 87049087 .S Am J Emerg .M Allied Health Personnel/*; Electric Countershock/*; .T Refibrillation managed by EMT-Ds: .P ARTICLE. .W Some patients converted from ventricular fibrillation to organized... (1 Reply)
Discussion started by: shoaibjameel123
1 Replies

10. UNIX for Advanced & Expert Users

SAS dataset to CSV

Hi Guys, Is there a way to export a sas file i.e .sas7bdat file to .csv file with header and data using unix. I dont want to use SAS program instead using unix tool or unix scripting is it possible ? (25 Replies)
Discussion started by: Master_Mind
25 Replies
svm-subset(1)							   User Manuals 						     svm-subset(1)

NAME
svm-subset - a subset selection tool for LIBSVM SYNOPSIS
svm-subset [ -s method ] dataset number [ output1 ] [ output2 ] DESCRIPTION
Training large data is time consuming. Sometimes one should work on a smaller subset first. The python script subset.py randomly selects a specified number of samples. For classification data, we provide a stratified selection to ensure the same class distribution in the sub- set. OPTIONS
-s method 0 -- stratified selection (classification only) (default) 1 -- random selection output1 The subset. If output1 is omitted, the subset will be printed on the screen. output2 The rest of data. FILES
See svm-train(1) for the format of dataset EXAMPLES
svm-subset heart_scale 100 file1 file2 From heart_scale 100 samples are randomly selected and stored in file1. All remaining instances are stored in file2. BUGS
Please report bugs to the Debian BTS. AUTHOR
Chih-Chung Chang, Chih-Jen Lin <cjlin@csie.ntu.edu.tw>, Chen-Tse Tsai <ctse.tsai@gmail.com> (packaging) SEE ALSO
svm-train(1), svm-predict(1) Linux DEC 2009 svm-subset(1)
All times are GMT -4. The time now is 04:45 AM.
Unix & Linux Forums Content Copyright 1993-2022. All Rights Reserved.
Privacy Policy