Extract data from large file 80+ million records

06-02-2009



Hello,

I have a file with more than 120 million records (35 GB in size). I need to extract the relevant records from it based on a parameter and write them to a separate output file.

What would be the best and fastest way to extract the new file?

Sample file format:
++++++7777jjjjjjj0000000000 (header record)
2098 POCG 0000 KKKK
2097 KOLL 0F00 KLLL
2095 LKJH 0L99 L0IU
.
.
.
.

********66666666666**** (trailer record)

Now suppose I enter 2098 as the key (the first field is the key). All records with 2098 as the first field should then be written to a new file.

**********************************************

I tried grep, but it took a long time: nearly 45 minutes to produce the output file.
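For reference, the extraction described above can be sketched as a single streaming pass that tests only the start of each line. This is a minimal sketch, not a measured improvement over grep: the function name `extract_records`, the buffer size, and the assumption that the key starts in column 1 and is followed by a single space are all illustrative and not taken from the real file. On a 35 GB input the run time may well be dominated by disk reads, whichever tool is used.

```python
def extract_records(src_path, dst_path, key, bufsize=1 << 20):
    """Copy every line whose first space-delimited field equals `key`.

    Assumption: the key starts in column 1 and is followed by one space,
    as in the sample records ("2098 POCG 0000 KKKK").
    Returns the number of lines written.
    """
    # Work on bytes so no time is spent decoding 35 GB of text.
    prefix = key.encode("ascii") + b" "
    matched = 0
    # Large buffers reduce the number of read/write system calls.
    with open(src_path, "rb", buffering=bufsize) as src, \
         open(dst_path, "wb", buffering=bufsize) as dst:
        for line in src:
            # Anchored test: only the beginning of the line is compared,
            # so a "2098" in a later field does not match, and the
            # header and trailer records are skipped as well.
            if line.startswith(prefix):
                dst.write(line)
                matched += 1
    return matched
```

A call such as `extract_records("input.dat", "key_2098.dat", "2098")` (both file names hypothetical) reads the input once and never holds more than one line plus the buffers in memory. If the real records use a different delimiter or a fixed-width key, the `prefix` value would need to change accordingly.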

learner16s


