I am not an expert with Linux, but following various posts on this forum, I have been trying to write a script to match patterns of characters occurring together in a file.
My file has approximately 200 million characters (upper and lower case), with about 50 characters per line. I have merged all the lines together to make it one line using
I now have all the characters in my file on the same line without spaces.
I am trying to count the number of times the specific characters occur together. For example, in the file below
I am trying to look for the pattern 'tr' that occurs in the sentence. The script I have now is
The above script works perfectly fine for a small file, but when I try to run it on my actual file with more than 200 million characters, it takes ages to finish the task (I lost patience and did not check the total time taken).
Is there a way I can optimize the code?
Next, I have been trying to get the position of each match. For example, in the above example file, 'tr' starts at the 4th and 27th positions. I just want the numbers as output.
By definition, grep, sort, and uniq work on text files, and the input you're feeding to grep is not a line. (A line ends with a newline character and, including the newline character, contains no more than LINE_MAX bytes. On most systems, LINE_MAX is the minimum allowed by the standards, 2048.)
The standards also require operands to follow options on the command line. So, what you are doing is not portable and will not work at all on many systems.
On Linux systems, where the command you showed might work, it will take a lot longer than processing a normal text file because you require the entire (200 MB) file to be read into the address space of grep at once.
If the command line you showed works on your system, you may be able to get the file offsets (0-based rather than 1-based) of each match (rather than the number of occurrences of TR, Tr, tR, and tr) by using the command line:
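For anyone reading along, here is a minimal sketch of that kind of command. It assumes GNU grep (the -o, -b, and -h options are extensions, not POSIX), and the function name match_offsets is made up for illustration; it is not necessarily the exact command line posted above.

```shell
# match_offsets PATTERN [FILE...]
# Print the 0-based byte offset of every case-insensitive match of PATTERN.
# Assumes GNU grep: with -o, the -b option reports the offset of each
# match itself rather than the offset of the line containing it.
match_offsets() {
    pat=$1
    shift
    # grep prints "offset:match"; cut keeps only the offset.
    grep -o -b -i -h -e "$pat" -- "$@" | cut -d: -f1
}

# Example:
#   printf 'abctrxxTRyy\n' | match_offsets tr
```

With the sample input in the comment, the two matches (tr and TR) start at byte offsets 3 and 7. Add 1 to each value if you want 1-based positions.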
@ Scrutinizer: The patterns in the original file are indeed spread over two consecutive lines. That is the reason I merged the two.
I did manage to get an answer for the problem from Jotne and Tom Fenech at stackoverflow.
To count the number of occurrences:
To get the position:
Another approach:
Thank you for trying to help me.
@ Don Cragun: perfect explanation for why the script I tried did not work.
I am amazed by what scripting can do.
The awk utility is also only defined to work when the input files it reads are text files. So, although some versions of awk can handle long lines, incomplete lines, or both, many cannot. If you would like something that should work on any UNIX or Linux system, you could try something like this:
Note that this works on your input file before stripping out the <newline> characters, so instead of having to allocate 200 MB of memory to read in your one-line file, it just needs to read one ~50 character line at a time.
With the following randomly generated list of upper- and lower-case letters (except for the 1st 8 and last 8 characters in the file):
it produces the output:
giving you the number of matches found and their positions in the file (not counting <newline> characters).
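The line-at-a-time idea described above can be sketched roughly as follows. This is an illustration only, not the script from this post: the function name find_pattern and the "matches:" output line are my own choices, the pattern is treated as a non-empty fixed string (not a regular expression), and matching is case-insensitive. A short tail of each line is carried over so that a match split across a line break is still found.

```shell
# find_pattern PATTERN [FILE...]
# Reads ordinary short lines and prints the 1-based position of each
# non-overlapping, case-insensitive match (newlines are not counted),
# followed by the number of matches found.
find_pattern() {
    pat=$1
    shift
    awk -v pat="$pat" '
    BEGIN { pat = tolower(pat); plen = length(pat) }
    {
        buf = carry tolower($0)        # unscanned tail of earlier lines + this line
        base = total - length(carry)   # file characters that precede buf
        total += length($0)
        start = 1
        while ((i = index(substr(buf, start), pat)) > 0) {
            pos = start + i - 1
            print base + pos
            count++
            start = pos + plen         # resume after the end of this match
        }
        # Keep at most plen-1 trailing characters so that a match spanning
        # a line break is still found, without reusing matched characters.
        keep = length(buf) - plen + 2
        if (keep < start) keep = start
        carry = substr(buf, keep)
    }
    END { print "matches: " (count + 0) }
    ' "$@"
}

# Example:
#   printf 'abcT\nRxxtr\n' | find_pattern tr
```

For the two-line sample in the comment, the first match straddles the line break, so the sketch reports positions 4 and 8 and then "matches: 2". Memory use stays at roughly one line plus the carried tail, whatever the size of the file.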
Indeed it is best to keep the file original. Awk can be easily adjusted to work with the original file. For example an adjustment of Jotne's suggestion:
This may work with gawk and perhaps mawk, since they have very generous line length limits.
Also a perl solution like:
But while it may be even less likely than awk to run into line length limitations, just like the awk approach it will read the entire file into memory, which with 200 million characters means a footprint of at least 200 MB...
I came up with a similar approach to Don's, but it uses index() rather than match() and it works for variable length patterns:
Also, all the approaches so far look for the next match only AFTER the end of the last match.
This next approach will also find additional matches that overlap a previous match:
If we take the last part of Don's example, trtRTrTR, and try to match "trt", it will find 3 matches, while the others find only two.
Output:
Whereas the previous (using the pattern "trt" ) will find:
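To make the difference concrete, here is a minimal sketch of an overlapping search built on index(). It is not the script from this post: the function name find_overlapping and the "matches:" output line are assumptions made for illustration, and the pattern is treated as a non-empty fixed string matched case-insensitively. The only real change from a non-overlapping search is that scanning resumes one character after the start of each match instead of after its end.

```shell
# find_overlapping PATTERN [FILE...]
# Prints the 1-based position of every case-insensitive match, including
# matches that overlap an earlier one (newlines are not counted), followed
# by the number of matches found.
find_overlapping() {
    pat=$1
    shift
    awk -v pat="$pat" '
    BEGIN { pat = tolower(pat); plen = length(pat) }
    {
        buf = carry tolower($0)        # tail of earlier lines + this line
        base = total - length(carry)   # file characters that precede buf
        total += length($0)
        start = 1
        while ((i = index(substr(buf, start), pat)) > 0) {
            pos = start + i - 1
            print base + pos
            count++
            start = pos + 1            # step one character: allows overlaps
        }
        # Carry the last plen-1 characters; they are too short to hold a
        # complete match, so nothing is reported twice.
        keep = length(buf) - plen + 2
        if (keep < 1) keep = 1
        carry = substr(buf, keep)
    }
    END { print "matches: " (count + 0) }
    ' "$@"
}

# Example:
#   printf 'trtRTrTR\n' | find_overlapping trt
```

On trtRTrTR with the pattern "trt", this sketch reports positions 1, 3, and 5 (3 matches), whereas a search that resumes after the end of each match reports only 1 and 5.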