01-18-2008
How to extract data from a huge file?
Hi,
I have a huge file of bibliographic records in a standard format. I need a script to do the following repeatable task:
1. Create a folder for each string in the input file that starts with "item_*".
2. Create a file "contents" in each folder, containing the string "license.txt(tab \t)bundle:LICENSE".
3. Create a file "dublin_core.xml" in the respective "item_*" folder, containing the text extracted from the input file under that "item_*" string. The extracted text starts with the string <dublin_core schema="dc"> and ends with </dublin_core>.
The following are sample records from the file:
item_3908
<dublin_core schema="dc">
<dcvalue element="contributor" qualifier="author">Fernandes, A.A.</dcvalue>
<dcvalue element="contributor" qualifier="author">Sarma, Y.V.B.</dcvalue>
<dcvalue element="title" qualifier="none">Directional spectrum of ocean waves</dcvalue>
<dcvalue element="date" qualifier="issued">2000</dcvalue>
<dcvalue element="publisher" qualifier="none">GET PUB</dcvalue>
<dcvalue element="identifier" qualifier="citation">Ocean Eng., Vol.27; 345-363p.</dcvalue>
</dublin_core>
/eprints/Ocean_Eng_27_345.pdf
item_3911
<dublin_core schema="dc">
<dcvalue element="contributor" qualifier="author">Phatarpekar, P.V.</dcvalue>
<dcvalue element="title" qualifier="none">A comparative study on growth performance</dcvalue>
<dcvalue element="identifier" qualifier="citation">Aquaculture, Vol.181; 141-155p.</dcvalue>
<dcvalue element="type" qualifier="none">Journal Article</dcvalue>
<dcvalue element="language" qualifier="iso">en</dcvalue>
<dcvalue element="subject" qualifier="none">polyculture</dcvalue>
</dublin_core>
/eprints/Aquaculture_181_141.pdf
item_3921
<dublin_core schema="dc">
<dcvalue element="contributor" qualifier="author">Rao, B.R.</dcvalue>
<dcvalue element="contributor" qualifier="author">Veerayya, M.</dcvalue>
<dcvalue element="title" qualifier="none">Influence of marginal highs on the accumulation</dcvalue>
<dcvalue element="description" qualifier="abstract">Twenty five surficial sediment samples were</dcvalue>
</dublin_core>
/eprints/Deep-Sea_Res_(II)_47_303.pdf
Thanks & Regards
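One possible way to do the three steps above is sketched below in Python. It is only a sketch under some assumptions: the function name split_records and the out_root parameter are made up for illustration, the input is assumed to be UTF-8 text, and every record is assumed to look like the samples (an "item_*" line, then the <dublin_core> block, then a path line that is ignored). The file is read one line at a time, so its size should not matter much.

```python
import os


def split_records(input_path, out_root="."):
    """Stream the input once and write one folder per item_* record.

    Hypothetical helper: the name and parameters are illustrative only.
    Returns the number of item_* records seen.
    """
    item_dir = None   # folder of the record currently being read
    xml_out = None    # open handle for that record's dublin_core.xml
    count = 0
    # Assumption: the input is UTF-8; undecodable bytes are replaced.
    with open(input_path, encoding="utf-8", errors="replace") as src:
        for line in src:
            stripped = line.strip()
            if xml_out is not None:
                # Inside a <dublin_core> block: copy lines through the closing tag.
                xml_out.write(line)
                if stripped == "</dublin_core>":
                    xml_out.close()
                    xml_out = None
            elif stripped.startswith("item_"):
                # Step 1: a line starting with "item_" names a new folder.
                item_dir = os.path.join(out_root, stripped)
                os.makedirs(item_dir, exist_ok=True)
                # Step 2: "contents" holds a single tab-separated line.
                contents_path = os.path.join(item_dir, "contents")
                with open(contents_path, "w", encoding="utf-8") as f:
                    f.write("license.txt\tbundle:LICENSE\n")
                count += 1
            elif stripped.startswith("<dublin_core") and item_dir is not None:
                # Step 3: start copying at the opening tag.
                xml_path = os.path.join(item_dir, "dublin_core.xml")
                xml_out = open(xml_path, "w", encoding="utf-8")
                xml_out.write(line)
            # Any other line (e.g. the /eprints/... path) is skipped.
    if xml_out is not None:
        # Input ended inside a block; close the partial file.
        xml_out.close()
    return count
```

Calling split_records("records.txt", "out") (both names are placeholders) would create out/item_3908, out/item_3911 and so on, each with its own "contents" and "dublin_core.xml". Lines outside the <dublin_core> block, such as the /eprints/... path, are not copied anywhere in this sketch.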