01-29-2009
Removing lines from large files: quickest method?
Hi
I have some files that can contain anything up to 100k lines, e.g. file100k.
I have another file called file5k, and I need to produce filec, which will contain everything in file100k minus whatever matches in file5k.
i.e.
File100k contains
1FP
2FP
3FP
File5k contains
2FP
I would normally do a grep pattern search in a for loop or something similar, writing the entire contents of file100k to filec except for anything found in file5k.
The problem is that searching 100k entries five thousand times takes some time with normal Unix tools (it can take 10-15 minutes for one of these 100k files), and I am wondering whether there is a faster way to do this, maybe with a Perl command or something.
Hope I am making sense. Can you help out?
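For reference, one way this is often approached is to hand grep the whole of file5k as a pattern list, so file100k is read once instead of once per pattern. The sketch below assumes that each entry in file5k should be treated as a literal string and must match a whole line of file100k; the function name remove_listed_lines is hypothetical and only there for illustration.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption, not from the post):
# write every line of $1 that is NOT an exact match for some line of $2 into $3.
#   -F  treat each pattern as a fixed string, not a regular expression
#   -x  match whole lines only, so "2FP" does not also remove "12FP"
#   -v  invert the match: print the lines that do not match
#   -f  read the patterns from a file, one per line
remove_listed_lines() {
    # grep exits with 1 when it prints no lines; only a status above 1 is an error.
    grep -F -x -v -f "$2" "$1" > "$3" || [ $? -le 1 ]
}

# Usage with the file names from the post:
# remove_listed_lines file100k file5k filec
```

If the entries in file5k are meant to match anywhere within a line rather than the whole line, dropping -x would be the likely adjustment; that depends on the data, which the post does not fully describe.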