Help to make awk script more efficient for large files Post: 302525451


Post 302525451 by script_op2a on Thursday 26th of May 2011 05:17:11 PM


Quote:

Originally Posted by DGPickett

Use GNU Awk? "sort ... | uniq -d | wc -l | read dup_ct"

Hello, thank you very much for your post.

Do you mean using sort instead of awk? Do you think I could do it using only sort?
If so, could you explain the different piped sections?

Let me see if I understand:

sort ... | uniq -d | wc -l | read dup_ct
This pipes the sorted input file to the uniq command (the -d option keeps only one copy of each duplicated line), which pipes to wc -l (to count the number of duplicated lines?), which pipes to read.

I'm not sure what the read command does in this case.
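
For what it's worth, here is a minimal sketch of how I read that pipeline, run on a small made-up sample (the temporary file and its contents are hypothetical). sort groups identical lines together, uniq -d prints one copy of each line that occurs more than once, and wc -l counts those lines. The trailing read would store that count in the variable dup_ct; as far as I know that value only survives in shells such as ksh that run the last pipeline stage in the current shell, so the sketch captures it with command substitution instead.

```shell
# Hypothetical sample data: "a" and "c" are duplicated, "b" is not.
tmp=$(mktemp)
printf '%s\n' a b a c c c > "$tmp"

# sort groups identical lines; uniq -d prints one copy of each duplicated
# line; wc -l counts them. Command substitution captures the count.
dup_ct=$(sort "$tmp" | uniq -d | wc -l)
dup_ct=$((dup_ct))   # strip the padding some wc implementations add

echo "distinct duplicated lines: $dup_ct"
rm -f "$tmp"
```

Note that this only counts the duplicated lines; it does not remove them or choose which copy to keep.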

I'm still thinking about it and will surely post the final code in this thread.

I need to sort the input file based on the key columns specified for a particular file.
Say columns 1, 2 and 3 to keep it simple, with column 4 (a datetime column) used to determine which record to keep.

If I used sort to put the record with the greatest column 4 value on top for each key, I could then use awk to remove all the duplicates except the first occurrence of each key.
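
A rough sketch of that idea, under some assumptions of mine: the file is comma-separated, the key is columns 1 to 3, and the column 4 datetime is in a format that sorts correctly as plain text (year first). The sample rows and the temporary file are made up for illustration.

```shell
# Hypothetical sample: key k1,a,x appears twice with different datetimes.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
k1,a,x,2011-05-26 10:00:00,old
k2,b,y,2011-05-25 09:00:00,only
k1,a,x,2011-05-26 17:00:00,new
EOF

# Sort by key columns 1-3 ascending, then column 4 descending (the r
# modifier), so the newest record of each key comes first. awk then prints
# a line only when its key differs from the previous line's key.
result=$(LC_ALL=C sort -t, -k1,1 -k2,2 -k3,3 -k4,4r "$tmp" |
  awk -F, '{ key = $1 FS $2 FS $3 } key != prev { print } { prev = key }')

echo "$result"
rm -f "$tmp"
```

Because the sorted input keeps equal keys adjacent, awk only has to remember the previous key rather than an array with one entry per key, and sort can spill to temporary files when the input does not fit in memory. That should matter for large files, though I have not measured it.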

Perhaps using code from this post where the same error was encountered:

https://www.unix.com/shell-programmin...based-key.html

Code:

awk -F "," '
NR == FNR {
  cnt[$1]++
}
NR != FNR {
  if (cnt[$1] == 1)
    print $0
}' your-file your-file
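
As far as I can tell, this two-pass code prints only the records whose column 1 value occurs exactly once, so every copy of a duplicated key is dropped rather than one of them being kept. A small check on made-up data (the rows and the temporary file are hypothetical):

```shell
# Hypothetical sample: key k1 appears twice, key k2 once.
tmp=$(mktemp)
printf '%s\n' 'k1,old' 'k2,only' 'k1,new' > "$tmp"

# Pass 1 (NR == FNR) counts each column 1 value; pass 2 prints only the
# lines whose value was counted exactly once.
unique_only=$(awk -F "," '
NR == FNR { cnt[$1]++; next }
cnt[$1] == 1
' "$tmp" "$tmp")

echo "$unique_only"   # prints: k2,only
rm -f "$tmp"
```

The cnt array also holds one entry per distinct key, so its memory use grows with the number of keys in the file.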

I'm thinking about combining sort with that type of awk code to make it work for large files.

If you have any more ideas, suggestions or code samples, please let me know.

script_op2a

