How to extract data from a huge file? Post: 302159541


Post 302159541 by srsahu75 on Friday 18th of January 2008 01:21:01 AM

01-18-2008


How to extract data from a huge file?

Hi,
I have a huge file of bibliographic records in a standard format. I need a script to perform the following repeatable task:

1. Create a folder for each line in the input file that starts with "item_", named after that line (for example, item_3908).
2. Create a file named "contents" in each folder, containing the string "license.txt(tab \t)bundle:LICENSE", that is, "license.txt" and "bundle:LICENSE" separated by a tab character.
3. Create a file named "dublin_core.xml" in each "item_*" folder, containing the text found in the input file under the corresponding "item_*" line. The text to extract starts with the string <dublin_core schema="dc"> and ends with </dublin_core>.

The following are sample records from the file:

item_3908
<dublin_core schema="dc">
<dcvalue element="contributor" qualifier="author">Fernandes, A.A.</dcvalue>
<dcvalue element="contributor" qualifier="author">Sarma, Y.V.B.</dcvalue>
<dcvalue element="title" qualifier="none">Directional spectrum of ocean waves</dcvalue>
<dcvalue element="date" qualifier="issued">2000</dcvalue>
<dcvalue element="publisher" qualifier="none">GET PUB</dcvalue>
<dcvalue element="identifier" qualifier="citation">Ocean Eng., Vol.27; 345-363p.</dcvalue>
</dublin_core>
/eprints/Ocean_Eng_27_345.pdf
item_3911
<dublin_core schema="dc">
<dcvalue element="contributor" qualifier="author">Phatarpekar, P.V.</dcvalue>
<dcvalue element="title" qualifier="none">A comparative study on growth performance</dcvalue>
<dcvalue element="identifier" qualifier="citation">Aquaculture, Vol.181; 141-155p.</dcvalue>
<dcvalue element="type" qualifier="none">Journal Article</dcvalue>
<dcvalue element="language" qualifier="iso">en</dcvalue>
<dcvalue element="subject" qualifier="none">polyculture</dcvalue>
</dublin_core>
/eprints/Aquaculture_181_141.pdf
item_3921
<dublin_core schema="dc">
<dcvalue element="contributor" qualifier="author">Rao, B.R.</dcvalue>
<dcvalue element="contributor" qualifier="author">Veerayya, M.</dcvalue>
<dcvalue element="title" qualifier="none">Influence of marginal highs on the accumulation</dcvalue>
<dcvalue element="description" qualifier="abstract">Twenty five surficial sediment samples were</dcvalue>
</dublin_core>
/eprints/Deep-Sea_Res_(II)_47_303.pdf

Thanks & Regards

srsahu75
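
One possible approach is a single awk pass over the file, sketched below. It assumes each record starts with a line consisting only of the item name (letters, digits and underscores after "item_", no trailing spaces or carriage returns) and that the <dublin_core schema="dc"> and </dublin_core> tags each start a line, as in the sample. The function name split_items and its optional output-directory argument are made up for illustration. The path line after </dublin_core> (for example /eprints/Ocean_Eng_27_345.pdf) is skipped, since the post does not say what should happen to it.

```shell
#!/bin/sh
# split_items INPUT [OUTDIR]
# Hypothetical helper: writes one folder per item_* line found in INPUT.
split_items() {
    input=$1
    outdir=${2:-.}
    awk -v outdir="$outdir" '
        # A line such as "item_3908" starts a new record.
        /^item_[0-9A-Za-z_]+$/ {
            # Close the previous XML file in case its end tag was missing.
            if (xml != "") close(xml)
            dir = outdir "/" $0
            system("mkdir -p \"" dir "\"")
            # Step 2: the "contents" file, with a real tab between the parts.
            contents = dir "/contents"
            printf "license.txt\tbundle:LICENSE\n" > contents
            close(contents)
            xml = dir "/dublin_core.xml"
            inside = 0
            next
        }
        # Step 3: copy everything from the opening tag to the closing tag.
        /^<dublin_core schema="dc">/ && xml != "" { inside = 1 }
        inside { print > xml }
        /^<\/dublin_core>/ { if (inside) close(xml); inside = 0 }
    ' "$input"
}
```

With the function sourced into a shell, a call such as `split_items records.txt output` (both names are placeholders) would create output/item_3908, output/item_3911 and so on. Each output file is closed as soon as its record ends, so the number of open files stays small however many records there are. The mkdir call starts one extra process per record, which may be slow on a very large file; creating the folders in a separate first pass is an alternative.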


