Help to make awk script more efficient for large files
Hello,
What it is
It is a unix shell script that contains an awk program as well as some unix commands.
The heart of the script is the awk program.
It works, you can run it on your machine.
What it does
It takes an input file and removes duplicate records.
It creates an output file containing a sorted version of the input file.
It creates an output file containing a sorted archive of the duplicate records.
It creates an output file containing a sorted version of only the wanted records with the duplicates removed.
It tells you by standard output the number of duplicate records found, 0 or more.
It creates a flag file if duplicate records were found.
It sets the locale environment variables to the default locale of the server I run it on; in my case I set them to the HP-UX server defaults to override any changes to these environment variables.
What it needs to do
It needs to be made more efficient for processing large input files. When I run it with large input files (say 351 MB) I get this error:
How to execute the script
The script is run from the UNIX command line, passing parameters to the script.
Note*
You should give the script the required permissions with chmod. (for example chmod 777 rem_dups.sh)
If necessary, convert the script to unix format with the dos2unix command. (for example dos2unix rem_dups.sh )
You can run the script by pasting this command on the command line.
Description of parameters
There are 8 parameters passed to the script from the UNIX command line.
Content of input file
Description of the input file
Note* that there must always be 1 and only 1 blank line at the end of the input file.
The input file contains 4 columns per record separated by the tab character. (in ASCII this is 009)
The last column is the column used to determine which record to keep. The value with the greatest datetime is kept and the rest are moved
to the duplicates archive file.
For example, in sct_det3_10_20110516_143947.txt, 20110516_143947 is the date and time.
In the program, 20110516143947 (with the underscore removed) is the value used to determine which record to keep.
The key is made up, in this case, of columns 1, 2 and 3 (you could specify whatever column(s) to use as the key).
In the program the key columns are joined with hyphens to make, for example, 31erescca-010240-10-.
Then the site from the 4th column is added to the key.
The site in sct_det3_10_20110516_143936.txt is 10. (The 10 that you see after the det3)
So the final key value becomes the key plus the site: 31erescca-010240-10-10
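For illustration only (this snippet is not taken from the script; the sample record is the one above), the key could be built in awk like this:

printf '31erescca\t010240\t10\tsct_det3_10_20110516_143947.txt\n' |
awk -F'\t' '{ split($4, p, "_"); print $1 "-" $2 "-" $3 "-" p[3] }'
# prints: 31erescca-010240-10-10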
Descriptions and content of output files
temp_sort_inputfile_out.txt
This is the sorted version of the input file, used to compare with the sorted output file (the one that doesn't contain duplicates).
duplicates_flagfile_out.txt
This is just an empty file indicating duplicates were found. It's not used for anything else in this process.
dups_archive
This file contains only the duplicate records that were found in the input file. It is sorted.
out.txt
This is the final output file containing only the unique records. All duplicates have been removed. It is also sorted.
The code of the script
This is the code of the script. You can change sh to ksh or Bourne shell, whichever you are using.
Feel free to remove the locale environment variables if you wish. Some characters cause problems with the sort command;
that's why I added them. The locale info goes to standard output.
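For reference, here is a minimal sketch of the kind of awk logic described above. This is not the original script (the real one also sorts its outputs, writes the flag file and takes 8 parameters; here "$1" is just the input file); it shows why memory becomes a problem, since every key is held in awk's arrays:

#!/bin/sh
# Hypothetical sketch, not the original script: keep the record with the
# greatest datetime per key, move the rest to the duplicates archive.
awk -F'\t' '
{
    split($4, p, "_")                 # filename pieces
    key = $1 "-" $2 "-" $3 "-" p[3]   # columns 1-3 plus the site
    dt  = p[4] substr(p[5], 1, 6)     # e.g. 20110516143947
    if (!(key in best) || dt > best[key]) {
        if (key in best) print line[key] >> "dups_archive"
        best[key] = dt
        line[key] = $0
    } else {
        print >> "dups_archive"
    }
}
END {
    for (k in line) print line[k] >> "out.txt"
}' "$1"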
Use GNU Awk? "sort ... | uniq -d | wc -l | read dup_ct"
Hello, thank you very much for your post,
Do you mean using sort instead of awk? Do you think I could do it using only sort?
If so, could you explain the different piped sections?
Let me see if I understand:
sort |
this pipes the sorted input file to the uniq command (the -d option keeps only 1 copy of each duplicated line), which pipes to wc -l to count the # of non-duplicates? which then pipes to the read..
I'm not sure what the read command does in this case.
I'm also thinking about it, and I will surely post the final code on this thread.
I need to sort the input file based on the key columns specified for a particular file.
Say columns 1, 2 and 3 to keep it simple, and use column 4 (which is a datetime column) to determine which record to keep.
If I used sort to put the greatest column 4 value on top, I could then use awk to remove all the duplicates except the first occurrence.
Perhaps using code from this post where the same error was encountered:
Once sort ensures the newest record for each key comes first, a following sort -u on the key will keep that first record.
True, my script snippet counts full duplicates uniquely. If you want the count or the difference, you can compare the input of 'sort -u' to its output, perhaps using 'comm'. Since 'comm' expects unique records, if there are full duplicates, I run the sorted lists through 'uniq -c' first to collapse full duplicates to a count and a single record.
Awk is limited by its heap memory, but sort/comm/uniq/wc and pipes are robust with huge sets.
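A rough sketch of that decorate-sort-strip idea (untested; it assumes the tab-separated layout from the first post, and infile.txt is a placeholder):

#!/bin/sh
# Prepend "key<TAB>datetime" to each record, sort by key ascending and
# datetime descending, then keep the first record seen for each key.
TAB=$(printf '\t')
awk -F'\t' '{
    split($4, p, "_")
    print $1 "-" $2 "-" $3 "-" p[3] "\t" p[4] substr(p[5], 1, 6) "\t" $0
}' infile.txt |
sort -t"$TAB" -k1,1 -k2,2r |
awk -F'\t' '{
    rec = $0
    sub(/^[^\t]*\t[^\t]*\t/, "", rec)   # strip the key and datetime
    if ($1 == prev) print rec >> "dups_archive"
    else            print rec >> "out.txt"
    prev = $1
}'

Only sort ever holds the whole data set, and sort spills to temporary files on disk instead of exhausting memory the way awk arrays can.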
lol, yeah, you were reinventing the wheel. Oh well.
sort -s -k 3,6
for instance, sorts on the 3rd through 6th fields of the input. The -s allows the sort to be "stable" so you can make another sort later and the ordering will be consistent.
You can use uniq -c to count the number of times each line appears in a sorted list. Other options give you the ability to count only repeated lines, only count unique lines, skip the first N fields, etc.
comm is pretty useful too, reporting on lines common to both files (or only in the first but not the second, or vice-versa)
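For example, counting the surplus duplicate copies by comparing the sorted input against the sort -u output (file names are placeholders, and this counts full-line duplicates, per the note above):

sort infile.txt    > temp_sort_inputfile_out.txt   # all records, sorted
sort -u infile.txt > unique_out.txt                # one copy of each line
# comm -23 prints lines only in the first file: the surplus copies.
dup_ct=$(comm -23 temp_sort_inputfile_out.txt unique_out.txt | wc -l)
echo "$dup_ct duplicate records found"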