01-18-2008
How to extract data from a huge file?
Hi,
I have a huge file of bibliographic records in a standard format. I need a script to perform the following repeatable tasks:
1. Create a folder for each string in the input file that starts with "item_" (for example, item_3908).
2. Create a file named "contents" in each folder, containing the string "license.txt(tab \t)bundle:LICENSE".
3. Create a file named "dublin_core.xml" in each "item_*" folder, containing the text that follows that "item_*" string in the input file. The text to extract starts with <dublin_core schema="dc"> and ends with </dublin_core>.
The following are sample records from the file:
item_3908
<dublin_core schema="dc">
<dcvalue element="contributor" qualifier="author">Fernandes, A.A.</dcvalue>
<dcvalue element="contributor" qualifier="author">Sarma, Y.V.B.</dcvalue>
<dcvalue element="title" qualifier="none">Directional spectrum of ocean waves</dcvalue>
<dcvalue element="date" qualifier="issued">2000</dcvalue>
<dcvalue element="publisher" qualifier="none">GET PUB</dcvalue>
<dcvalue element="identifier" qualifier="citation">Ocean Eng., Vol.27; 345-363p.</dcvalue>
</dublin_core>
/eprints/Ocean_Eng_27_345.pdf
item_3911
<dublin_core schema="dc">
<dcvalue element="contributor" qualifier="author">Phatarpekar, P.V.</dcvalue>
<dcvalue element="title" qualifier="none">A comparative study on growth performance</dcvalue>
<dcvalue element="identifier" qualifier="citation">Aquaculture, Vol.181; 141-155p.</dcvalue>
<dcvalue element="type" qualifier="none">Journal Article</dcvalue>
<dcvalue element="language" qualifier="iso">en</dcvalue>
<dcvalue element="subject" qualifier="none">polyculture</dcvalue>
</dublin_core>
/eprints/Aquaculture_181_141.pdf
item_3921
<dublin_core schema="dc">
<dcvalue element="contributor" qualifier="author">Rao, B.R.</dcvalue>
<dcvalue element="contributor" qualifier="author">Veerayya, M.</dcvalue>
<dcvalue element="title" qualifier="none">Influence of marginal highs on the accumulation</dcvalue>
<dcvalue element="description" qualifier="abstract">Twenty five surficial sediment samples were</dcvalue>
</dublin_core>
/eprints/Deep-Sea_Res_(II)_47_303.pdf
Thanks & Regards
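One possible way to do this is sketched below in Python. It is only a sketch under stated assumptions, not a tested solution for the real file: it assumes each "item_*" marker sits alone on its own line, that the opening and closing dublin_core tags each start a line as in the sample, that "(tab \t)" means a literal tab character, and that the file is UTF-8. The function name split_records and the out_dir parameter are made up for illustration. The input is read one line at a time, so the file size should not matter, and the "/eprints/..." line after each record is skipped because the three tasks do not mention it.

```python
import os
import re
import sys

# Assumption: a record starts with a line holding only "item_<something>".
ITEM_RE = re.compile(r"^item_\S+$")


def split_records(input_path, out_dir="."):
    """Stream input_path and create one item_* folder per record in out_dir.

    Returns the number of item folders created or refreshed.
    """
    current_dir = None  # folder of the record currently being read
    xml_out = None      # open file while inside <dublin_core> ... </dublin_core>
    count = 0
    with open(input_path, encoding="utf-8") as src:
        for line in src:  # line-by-line, so the whole file is never in memory
            stripped = line.strip()
            if xml_out is not None:
                # Inside the XML block: copy lines until the closing tag.
                xml_out.write(line)
                if stripped == "</dublin_core>":
                    xml_out.close()
                    xml_out = None
            elif ITEM_RE.match(stripped):
                # Task 1: folder named after the item_* string.
                current_dir = os.path.join(out_dir, stripped)
                os.makedirs(current_dir, exist_ok=True)
                # Task 2: "contents" file with a literal tab (assumed).
                with open(os.path.join(current_dir, "contents"), "w",
                          encoding="utf-8") as f:
                    f.write("license.txt\tbundle:LICENSE\n")
                count += 1
            elif current_dir and stripped.startswith("<dublin_core"):
                # Task 3: start copying the XML block into dublin_core.xml.
                xml_out = open(os.path.join(current_dir, "dublin_core.xml"),
                               "w", encoding="utf-8")
                xml_out.write(line)
            # Any other line (e.g. the /eprints/... path) is ignored.
    if xml_out is not None:  # input ended before </dublin_core>
        xml_out.close()
    return count


if __name__ == "__main__" and len(sys.argv) >= 2:
    target = sys.argv[2] if len(sys.argv) > 2 else "."
    print(f"wrote {split_records(sys.argv[1], target)} item folders")
```

If this is saved as, say, split_items.py (a hypothetical name), it could be run as "python3 split_items.py records.txt outdir". It is worth trying it on a small slice of the real file first, since a record whose tags are laid out differently from the sample would not be picked up.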