Help to make awk script more efficient for large files


 
# 1 - 05-21-2011

Hello,

Error

Code:
awk: Internal software error in the tostring function on TS1101?05044400?.0085498227?0?.0011041461?.0034752266?.00397045?0?0?0?0?0?0?11/02/10?09/23/10???10?no??0??no?sct_det3_10_20110516_143936.txt

What it is
It is a unix shell script that contains an awk program as well as some unix commands.
The heart of the script is the awk program.
It works; you can run it on your machine.

What it does
It takes an input file and removes duplicate records.
It creates an output file containing a sorted version of the input file.
It creates an output file containing a sorted archive of the duplicate records.
It creates an output file containing a sorted version of only the wanted records with the duplicates removed.
It tells you by standard output the number of duplicate records found, 0 or more.
It creates a flag file if duplicate records were found.
It sets the locale variables to the default locale of the server it runs on; in my case I set them to the HP-UX server defaults to override any changes to these environment variables.

What it needs to do
It needs to be made more efficient at processing large input files. When I run it with a large input file, say 351 MB, I get this error:

Code:
awk: Internal software error in the tostring function on TS1101?05044400?.0085498227?0?.0011041461?.0034752266?.00397045?0?0?0?0?0?0?11/02/10?09/23/10???10?no??0??no?sct_det3_10_20110516_143936.txt

How to execute the script
The script is run from the UNIX command line, passing parameters to the script.

Note*
You should give the script the required permissions with chmod. (for example chmod 777 rem_dups.sh)
If necessary, convert the script to unix format with the dos2unix command. (for example dos2unix rem_dups.sh)

You can run the script by pasting this command on the command line.
Code:
./rem_dups.sh '1,2,3' . in.txt . out.txt dups_archive 009 4

Description of parameters
There are 8 parameters passed to the script from the UNIX command line.
Code:
Unix parameter $1: Value: '1,2,3'        Description: comma-separated list of the column positions in the input file that together make up the key used to define a unique record.
Unix parameter $2: Value: .              Description: path to the input file (here the current directory; it could be any directory).
Unix parameter $3: Value: in.txt         Description: filename of the input file (any name you like).
Unix parameter $4: Value: .              Description: path to the output file with duplicates removed (any directory).
Unix parameter $5: Value: out.txt        Description: filename of the output file with duplicates removed (any name you like).
Unix parameter $6: Value: dups_archive   Description: filename of the archive file that receives the duplicate records (any name you like).
Unix parameter $7: Value: 009            Description: ASCII code of the delimiter in the input file (009 is the tab character).
Unix parameter $8: Value: 4              Description: column position in the input file of the site-datetime column used to decide which record to keep and which to move to the duplicates archive; it also supplies the site that is appended to the key to make the key complete.

Content of input file
Code:
31erescca    010240    10    sct_det3_10_20110516_143947.txt
11erescca    010240    10    sct_det3_10_20110516_143936.txt
31erescca    010240    10    sct_det3_10_20110516_143947.txt
21erescca    010240    10    sct_det3_10_20110516_143937.txt
31erescca    010240    10    sct_det3_10_20110516_233947.txt
11erescca    010240    10    sct_det3_10_20110516_143936.txt

Description of the input file
Note* There must always be exactly one blank line at the end of the input file.

The input file contains 4 columns per record, separated by the tab character (ASCII 009).
The last column is used to determine which record to keep: the record with the greatest datetime is kept and the rest are moved
to the duplicates archive file.

For example, in sct_det3_10_20110516_143947.txt the date and time is 20110516_143947.
In the program, 20110516143947 is the value used to determine which record to keep.

The key is made up, in this case (you could specify whatever column(s) to be the key), of columns 1, 2 and 3.
In the program the key is separated by a hyphen to make, for example, 31erescca-010240-10- .

Then the site, taken from the 4th column, is added to the key.
The site in sct_det3_10_20110516_143936.txt is 10 (the 10 that you see after the det3).
So the final key value becomes the key plus the site: 31erescca-010240-10-10.
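
To illustrate, here is a minimal awk sketch of that key construction, using the same logic as the full script below (for the example, the delimiter and key columns 1, 2 and 3 are hard-coded, and the sample file is called in.txt):
Code:
awk -F"\t" '{
    # split the filename column on "_": sct / det3 / 10 / 20110516 / 143936.txt
    n = split($4, z, "_")

    # the date plus the first 6 characters of the time: 20110516143936
    datetime = z[n-1] substr(z[n], 1, 6)

    # the site is the field just before the date: 10
    site = z[n-2]

    # key columns joined by hyphens, then the site: 31erescca-010240-10-10
    key = $1 "-" $2 "-" $3 "-" site

    print key, datetime
}' in.txt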

Descriptions and content of output files

temp_sort_inputfile_out.txt
This is the sorted version of the input file, used to compare with the sorted output file (the one that doesn't contain duplicates).
Code:
11erescca    010240    10    sct_det3_10_20110516_143936.txt
11erescca    010240    10    sct_det3_10_20110516_143936.txt
21erescca    010240    10    sct_det3_10_20110516_143937.txt
31erescca    010240    10    sct_det3_10_20110516_143947.txt
31erescca    010240    10    sct_det3_10_20110516_143947.txt
31erescca    010240    10    sct_det3_10_20110516_233947.txt

duplicates_flagfile_out.txt
This is just an empty file indicating duplicates were found. It's not used for anything else in this process.

dups_archive

This file contains only the duplicate records that were found in the input file. It is sorted.
Code:
11erescca    010240    10    sct_det3_10_20110516_143936.txt
31erescca    010240    10    sct_det3_10_20110516_143947.txt
31erescca    010240    10    sct_det3_10_20110516_143947.txt

out.txt
This is the final output file containing only the unique records. All duplicates have been removed. It is also sorted.
Code:
11erescca    010240    10    sct_det3_10_20110516_143936.txt
21erescca    010240    10    sct_det3_10_20110516_143937.txt
31erescca    010240    10    sct_det3_10_20110516_233947.txt

The code of the script
This is the code of the script. You can change sh to ksh or Bourne shell, whichever you are using.
Feel free to remove the locale environment variables if you wish; some characters cause problems with the sort command,
which is why I added them. The locale info goes to standard output.
Code:
#!/usr/bin/sh

LANG=""
LC_CTYPE="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_MESSAGES="C"
LC_ALL=""

export LANG
export LC_CTYPE
export LC_COLLATE
export LC_MONETARY
export LC_NUMERIC
export LC_TIME
export LC_MESSAGES
export LC_ALL

locale

pos="$1"
infile="$2/$3"
outfile="$4/$5"
temp_dups_file="$2/$6"
temp_sort_file="$2/temp_sort_inputfile_$5"
flagfile="$2/duplicates_flagfile_$5"
delimiter="$7"
site_datetime="$8"

awk -v key_cols="$pos" -v delim="$delimiter" -v site_dt="$site_datetime" '

BEGIN {
    # build the field separator from its ASCII code and split the key column list
    FS = sprintf("%c", delim);
    numkeys = split(key_cols, k, ",");
}

{
    # pull the datetime (YYYYMMDDHHMMSS) and the site out of the filename column
    n = split($(site_dt), z, "_");
    datetime = z[n-1] substr(z[n], 1, 6);
    site = z[n-2];

    # assemble the key from the requested columns, then append the site
    for (x = 1; x <= numkeys; x++) {
        key = key $(k[x]) "-";
    }
    keysite = key site;
    key = "";

    # remember only the newest record seen so far for this key
    if (datetime > m[keysite]) {
        m[keysite] = datetime;
        out[keysite] = $0;
    }
    next
}

END { for (keysite in out) print out[keysite] }' <$infile | sort > $outfile

sort $infile > $temp_sort_file

diff $outfile $temp_sort_file |awk '/^>/ {print $0}'|sed 's/^> //' > $temp_dups_file

LINES=$(wc -l < $temp_dups_file)

if [ "$LINES" -gt 0 ]
then
        echo "There were $LINES duplicate records."
     touch $flagfile
else
        echo "There were $LINES duplicate records. 0 duplicate records."
fi


# 2 - 05-26-2011
Use GNU Awk? "sort ... | uniq -d | wc -l | read dup_ct"
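
Spelled out, something like this (a sketch only, reusing the in.txt name from the first post; uniq -d prints each duplicated line once):
Code:
# count the distinct lines that occur more than once
sort in.txt | uniq -d | wc -l | read dup_ct   # the trailing read works in ksh-style shells such as HP-UX sh
echo "$dup_ct duplicated lines"

# portable alternative using command substitution
dup_ct=$(sort in.txt | uniq -d | wc -l)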
# 3 - 05-26-2011
Quote:
Originally Posted by DGPickett
Use GNU Awk? "sort ... | uniq -d | wc -l | read dup_ct"
Hello, thank you very much for your post,

Do you mean using sort instead of awk? Do you think I could do it using only sort?
If so, could you explain the different piped sections?

Let me see if I understand:

sort |
This pipes the sorted input file to the uniq command (the -d option keeps only one of each set of duplicate lines), which pipes to wc -l to count the # of non-duplicates(?), which pipes to the read.

I'm not sure what the read command does in this case.

I'm also thinking about and will surely post the final code on this thread.

I need to sort the input file based on the key columns specified for a particular file.
Say columns 1,2 and 3 to keep it simple. And use say column 4 (which is a datetime column) to determine which record to keep.

If I used sort to put the greatest column 4 value on top, I could then use awk to remove all the duplicates except the first occurrence.

Perhaps using code from this post where the same error was encountered:

https://www.unix.com/shell-programmin...based-key.html
Code:
awk -F "," ' NR == FNR {   cnt[$1] ++ } NR != FNR {   if (cnt[$1] == 1)     print $0 }' your-file your-file

I'm thinking about mixing sort with that type of awk code to make it work for large files.
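
Something along these lines is what I mean (just a rough, untested sketch, hard-coding the column layout from my sample file):
Code:
# sort so that, within each key, the newest filename comes first,
# then keep only the first line seen for each key+site
TAB=$(printf '\t')
sort -t "$TAB" -k1,3 -k4,4r in.txt |
awk -F"\t" '{
    n = split($4, z, "_")
    k = $1 "-" $2 "-" $3 "-" z[n-2]      # key columns plus the site
    if (!(k in seen)) { seen[k] = 1; print }
}' > out.txt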

If you have any more ideas, suggestion or code sample please let me know.
# 4 - 05-31-2011
Once sort can ensure the newest record for each key comes first, a following sort -u on the key will keep that first one.

True, my script bit counts full duplicates uniquely. If you want the count or difference, you can compare the input of 'sort -u' to the output, perhaps using 'comm'. Since 'comm' expects unique records, if there are full duplicates, I run the sorted lists through 'uniq -c' to take full duplicates to a count and one record.

Awk is limited by its heap memory, but sort/comm/uniq/wc and pipes are robust with huge sets.
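
Roughly like this for the comparison (a sketch, reusing the in.txt/out.txt names from the first post; both inputs to comm must be sorted):
Code:
sort in.txt  > in.sorted
sort out.txt > out.sorted

# column 1 of comm = lines only in the sorted input, i.e. the records
# that the deduplication dropped
comm -23 in.sorted out.sorted > dups_archive
wc -l < dups_archive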
# 5 - 06-01-2011
lol, yeah, you were reinventing the wheel. Oh well.

sort -s -k 3,6

for instance, sorts on the 3rd through 6th fields of the input. The -s allows the sort to be "stable" so you can make another sort later and the ordering will be consistent.

You can use uniq -c to count the number of times each line appears in a sorted list. Other options give you the ability to count only repeated lines, only count unique lines, skip the first N fields, etc.

comm is pretty useful too, reporting on lines common to both files (or only in the first but not the second, or vice-versa).
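
For instance, against the in.txt sample from the first post (a sketch):
Code:
# one copy of each distinct line, prefixed with how many times it appeared
sort in.txt | uniq -c

# only the lines that appear more than once
sort in.txt | uniq -d

# only the lines that appear exactly once
sort in.txt | uniq -u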