07-16-2004
Huge (repeated Entry) text files
Somebody HELP!
I have a huge plain-text log file (76298035 bytes).
It's a log file of IMEIs and IMSIs that I get from my EIR node.
Here is what the contents of the file look like:
000000,
1 33016382000913 652020100423994
1 33016382002353 652020100430743
1 33017035101003 652020100441736
....
....
....
235800,
1 35725620987678 652020100545862
The problem is that much of the file's size comes from repeated entries (repeated lines that are not consecutive).
I have tried this code to eliminate the repeated entries:
cat myfile | sed -n 'G; s/\n/&&/; /^\([ -~]*\n\).*\n\1/d; s/\n//; h; P' | tee mynewfile | wc -l
but it takes forever and stops midway, at 024000 instead of 235800.
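One approach that usually copes better with a file this size is to let awk remember each line it has already printed. The sed command above appends every earlier unique line to the hold space and rescans it for each new line, which may be why it slows down and stalls. The following is only a sketch: it assumes the set of distinct lines fits in memory, and the sample data and the `sample` and `deduped` names are made up for illustration.

```shell
# Hypothetical sample shaped like the log in the post:
# a timestamp line followed by records, with non-consecutive repeats.
sample=$(mktemp)
cat > "$sample" <<'EOF'
000000,
1 33016382000913 652020100423994
1 33016382002353 652020100430743
1 33016382000913 652020100423994
000100,
1 33016382002353 652020100430743
1 35725620987678 652020100545862
EOF

# Print a line only the first time it is seen.
# seen[] is an awk associative array keyed by the whole line ($0).
deduped=$(awk '!seen[$0]++' "$sample")

printf '%s\n' "$deduped"
rm -f "$sample"
```

For the file in the post this would be run as awk '!seen[$0]++' myfile > mynewfile. The first occurrence of every line is kept and the original order is preserved. On older Solaris systems nawk or /usr/xpg4/bin/awk may be needed in place of awk.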
uniq(1) User Commands uniq(1)
NAME
uniq - report or filter out repeated lines in a file
SYNOPSIS
uniq [-c | -d | -u] [-f fields] [-s chars] [input_file [output_file]]
uniq [-c | -d | -u] [-n] [+m] [input_file [output_file]]
DESCRIPTION
The uniq utility will read an input file comparing adjacent lines, and write one copy of each input line on the output.
The second and succeeding copies of repeated adjacent input lines will not be written.
Repeated lines in the input will not be detected if they are not adjacent.
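Because only adjacent lines are compared, a common way to catch non-adjacent repeats is to sort the input first so that equal lines end up next to each other. The sketch below assumes the original line order does not need to be preserved. The sample data and variable names are illustrative only.

```shell
input=$(mktemp)
printf '%s\n' 'TEST.' 'Computer.' 'TEST.' 'Software.' 'Computer.' > "$input"

# uniq alone leaves this input unchanged: no two equal lines are adjacent.
unsorted=$(uniq "$input")

# Sorting first makes equal lines adjacent, so uniq can collapse them.
# LC_ALL=C fixes the collation order so the result is predictable.
sorted=$(LC_ALL=C sort "$input" | uniq)

printf '%s\n' "$sorted"
rm -f "$input"
```

With this sample, `unsorted` still holds all five lines, while `sorted` holds one copy each of Computer., Software. and TEST.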
OPTIONS
The following options are supported:
-c          Precedes each output line with a count of the number of times the line occurred in the input.
-d          Suppresses the writing of lines that are not repeated in the input.
-f fields   Ignores the first fields fields on each input line when doing comparisons, where fields is a positive decimal integer.
            A field is the maximal string matched by the basic regular expression:
                [[:blank:]]*[^[:blank:]]*
            If fields specifies more fields than appear on an input line, a null string will be used for comparison.
-s chars    Ignores the first chars characters when doing comparisons, where chars is a positive decimal integer.
            If specified in conjunction with the -f option, the first chars characters after the first fields fields will be ignored.
            If chars specifies more characters than remain on an input line, a null string will be used for comparison.
-u          Suppresses the writing of lines that are repeated in the input.
-n          Equivalent to -f fields with fields set to n.
+m          Equivalent to -s chars with chars set to m.
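As an illustration of the -f and -s options, the sketch below compares lines while ignoring a leading sequence number. The data is invented for the example, and each number is assumed to be a single digit followed by one blank.

```shell
input=$(mktemp)
printf '%s\n' '1 apple' '2 apple' '3 pear' '4 pear' '5 apple' > "$input"

# -f 1 skips the first field (the number), so adjacent lines are compared
# from the blank in front of the fruit name onward.
by_field=$(uniq -f 1 "$input")

# -s 2 skips the first two characters (the digit and the blank),
# which leads to the same comparison for this data.
by_chars=$(uniq -s 2 "$input")

printf '%s\n' "$by_field"
rm -f "$input"
```

In both cases the first line of each adjacent group is the one written, so the lines numbered 1, 3 and 5 remain. The final apple line is kept because it is not adjacent to the earlier apple lines.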
OPERANDS
The following operands are supported:
input_file    A path name of the input file. If input_file is not specified, or if the input_file is -, the standard input will be used.
output_file   A path name of the output file. If output_file is not specified, the standard output will be used.
              The results are unspecified if the file named by output_file is the file named by input_file.
EXAMPLES
Example 1: Using the uniq command
The following example lists the contents of the uniq.test file and outputs a copy of the repeated lines.
example% cat uniq.test
This is a test.
This is a test.
TEST.
Computer.
TEST.
TEST.
Software.
example% uniq -d uniq.test
This is a test.
TEST.
example%
The next example outputs just those lines that are not repeated in the uniq.test file.
example% uniq -u uniq.test
TEST.
Computer.
Software.
example%
The last example outputs a report with each line preceded by a count of the number of times each line occurred in the file:
example% uniq -c uniq.test
2 This is a test.
1 TEST.
1 Computer.
2 TEST.
1 Software.
example%
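In the -c report above, TEST. is counted twice (once as 1 and once as 2) because its copies are not all adjacent. If a single total per distinct line is wanted, one option is to sort before counting. This is a sketch using the same uniq.test contents written to a temporary file, and the variable names are illustrative.

```shell
file=$(mktemp)
printf '%s\n' 'This is a test.' 'This is a test.' 'TEST.' 'Computer.' \
    'TEST.' 'TEST.' 'Software.' > "$file"

# Sort so that every copy of a line is adjacent, then count each group.
# LC_ALL=C fixes the sort order; the width of the count column varies
# between uniq implementations.
counts=$(LC_ALL=C sort "$file" | uniq -c)

printf '%s\n' "$counts"
rm -f "$file"
```

Here TEST. is reported once with a count of 3, This is a test. with a count of 2, and the remaining lines with a count of 1.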
ENVIRONMENT VARIABLES
See environ(5) for descriptions of the following environment variables that affect the execution of uniq: LANG, LC_ALL, LC_CTYPE, LC_MESSAGES, and NLSPATH.
EXIT STATUS
The following exit values are returned:
0 Successful completion.
>0 An error occurred.
ATTRIBUTES
See attributes(5) for descriptions of the following attributes:
+-----------------------------+-----------------------------+
|       ATTRIBUTE TYPE        |       ATTRIBUTE VALUE       |
+-----------------------------+-----------------------------+
|Availability                 |SUNWesu                      |
+-----------------------------+-----------------------------+
|CSI                          |Enabled                      |
+-----------------------------+-----------------------------+
|Interface Stability          |Standard                     |
+-----------------------------+-----------------------------+
SEE ALSO
comm(1), pack(1), pcat(1), sort(1), uncompress(1), attributes(5), environ(5), standards(5)
SunOS 5.10 20 Dec 1996 uniq(1)