Incredibly inefficient cat | grep script


 
# 8  
Old 08-14-2014
With your new information, you can speed up Corona's perl solutions with awk:
Code:
pipesplit.pl -l 5000 awk 'FILENAME=="-" {A[$1]; next} ($1 in A)' - FS="," production_list.csv < accurate_list.csv

The hash lookups are faster than a regular expression that can match anywhere in the line.
Also, increasing the 5000 by a factor of 10 will increase speed (and memory consumption) by the same factor.
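
To illustrate the hash-lookup idea on a small scale, here is a minimal sketch. The file names keys.csv and data.csv and their contents are made up for the example; they are not the OP's files, and the timings on real data will of course differ.

```shell
# Hypothetical sample data: keys.csv holds wanted IDs, data.csv holds comma-separated records.
tmp=$(mktemp -d)
printf '%s\n' 1001 1003 > "$tmp/keys.csv"
printf '%s\n' '1001,alice' '1002,bob' '1003,carol' > "$tmp/data.csv"

# While reading the first file (NR == FNR), remember each key as an array index.
# For the second file, print only records whose first field is a remembered key.
matches=$(awk -F, 'NR == FNR { seen[$1]; next } $1 in seen' "$tmp/keys.csv" "$tmp/data.csv")
printf '%s\n' "$matches"

rm -rf "$tmp"
```

Each record costs one array lookup on its first field, instead of being tested against every pattern.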
BTW I find the following order more intuitive:
Code:
< accurate_list.csv pipesplit.pl -l 5000 awk 'FILENAME=="-" {A[$1]; next} ($1 in A)' - FS="," production_list.csv


Last edited by MadeInGermany; 08-14-2014 at 08:40 AM.. Reason: added the missing FS=","
# 9  
Old 08-14-2014
Quote:
Originally Posted by Chubler_XL
Off topic a little (apologies to the OP), but I love this util Corona688!

I was wondering if it could be done as a bash script. I came up with the script below.
Thank you! Not my first crack at that problem, just the first one good enough to bother reusing.

I wrote it in Perl since a while read LINE loop isn't terribly efficient when thousands to millions of lines are involved. [edit: Actually, I wrote it in Perl since the original read raw binary...] This probably doesn't matter when the grep is going to be taking so much longer anyway. I see what you mean about that bug. I should be opening the process after a line is read, not before.

That's a nice translation, and putting the args in getopt makes it look so much more standard/official.

I really don't like the look of that eval, though. Eval would have made my life easier in perl too, but I worked hard to avoid it, to keep things safe and sane -- it will do weird things like eating quotes, evaluating accidental expressions, mangling things containing $, executing backticks, stopping at # considering it a comment, etc. Even if it takes 30 lines of code to do the same thing without it, that'd be better. Or maybe the value should be exported to the environment. I'll see what I can do...
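
A minimal sketch of the kind of mangling described above. The sample string in arg is made up for illustration, and it is deliberately harmless; with $(...) or backticks in it, eval would also run commands.

```shell
# Hypothetical argument containing characters the shell treats specially when re-parsed.
arg='a  b # not a comment'

# Passed as a quoted argument, the text reaches printf unchanged.
direct=$(printf '%s\n' "$arg")

# Through eval, the text is parsed a second time: it is split on whitespace,
# and everything from the # onward is dropped as a comment.
evaled=$(eval "printf '%s\n' $arg")

printf 'direct: [%s]\n' "$direct"
printf 'evaled: [%s]\n' "$evaled"
```

The direct call keeps the string as one argument, while the eval version ends up with two arguments, a and b, and loses the rest.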

Last edited by Corona688; 08-14-2014 at 03:08 PM..
# 10  
Old 08-14-2014
This replaces the eval with a safer text replacement loop. I've also turned fname into FNAME since I'm exporting it to the environment as an afterthought. Otherwise your code looks terrific.

Code:
#!/bin/bash
lines=1000
ccount=0

# Convert "command" "-flag" "@FNAME@" into "command" "-flag" "00000001"
# uses $FNAME external.  Result is in ARGS array.
parsecmd() {
        local N=0
        ARGS=()
        while [ "$#" -gt 0 ]
        do
                ARGS[$((N++))]="${1//@FNAME@/${FNAME}}"
                shift
        done
}

while getopts l: opt 2> /dev/null
do
    case $opt in
        l) lines=$OPTARG ;;
        *) echo "Illegal option -$opt" >&2
           exit 1
        ;;
    esac
done
shift $((OPTIND-1))

if [ $# -le 0 ]
then
    cat >&2 <<EOF
linesplit:  Designed to read from standard input
            and split into multiple streams, running one command per loop
            and writing into its STDIN.  @FNAME@ is available as a
            sequence number if needed.

syntax:  linesplit [-l lines] command arguments @FNAME@ ...
EOF
    exit 1
fi

while [[ ${PIPESTATUS[0]} != 33 ]]
do
    ((ccount++))
    printf -v FNAME "%08d" $ccount
    export FNAME
    parsecmd "$@" # Convert @FNAME@ into $FNAME, put in ${ARGS[@]}

    for((n=0; n<lines; n++))
    do
        read line && echo "$line" || exit 33
    done | "${ARGS[@]}"
done

printf "Wrote %'d full chunks of %'d lines\n" $((ccount-1)) $lines >&2

# 11  
Old 08-14-2014
I was thinking of the [command] parameter as being a lot like that implemented by ssh: which also has all the eval issues you mentioned. I find the double expansion of ssh a pain, but at least there is a precedent for this sort of thing (su -c also comes to mind).

Perhaps I'm missing something, but without the eval how would you do this:

Code:
$ printf "%s\n" {00..99} | ./linesplit -l 10 wc -l \> out_@FNAME@.txt

I also find it very cool that you can do stuff like this:

Code:
./linesplit 'tr -d '^M' | grep -vi ignore > out_@FNAME@.txt'

Can you explain what parsecmd is for? I can't see how it would be different from:

Code:
done | ${@//@FNAME@/$FNAME}

Edit: Another thought: the read should include the -r option to stop it treating backslashes in the input specially.
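
A small sketch of the difference, using a made-up input line (the variable names input, plain and raw are only for this example):

```shell
# Hypothetical input line containing backslashes.
input='C:\temp\new'

# Plain read treats each backslash as an escape character and removes it.
plain=$(printf '%s\n' "$input" | { read line; printf '%s\n' "$line"; })

# read -r keeps backslashes; IFS= additionally keeps leading and trailing whitespace.
raw=$(printf '%s\n' "$input" | { IFS= read -r line; printf '%s\n' "$line"; })

printf 'plain: %s\nraw:   %s\n' "$plain" "$raw"
```

Without -r the line comes back as C:tempnew; with -r it is passed through untouched.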

Last edited by Chubler_XL; 08-14-2014 at 05:20 PM..
# 12  
Old 08-14-2014
Quote:
Originally Posted by Chubler_XL
I was thinking of the [command] parameter as being a lot like that implemented by ssh: which also has all the eval issues you mentioned. I find the double expansion of ssh a pain, but at least there is a precedent for this sort of thing (su -c also comes to mind).
It's very traditional to get a shell when you do a shell login, but that's kind of a self-fulfilling prophecy.

sudo does not perform an old-fashioned tty LOGIN process and does not behave like that:

Code:
sudo 'echo $HOSTNAME'

sudo:  echo $HOSTNAME: command not found

Further, most common "command modifying" utilities like nice, nohup, xargs, env, etc. don't run their arguments through a shell either. The only exception I can think of offhand is 'watch'.

Furthermore, permitting expansion to happen in the same shell is asking for trouble. Expansion inside ssh won't cause ssh itself to blow up from syntax errors, whereas an eval inside the script can take the script down with it. Kind of bad form. I'd want to put it in an external shell at the very least.
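
One way the "external shell" idea could look, sketched under assumptions: the block below only imitates a single loop iteration of the script from post #10 (FNAME exported, a chunk arriving on stdin). The temporary directory and the three sample lines are made up.

```shell
# Hypothetical stand-in for one iteration: FNAME is exported, then the command reads a chunk on stdin.
tmp=$(mktemp -d)
FNAME=00000001
export FNAME

# The single-quoted script is expanded by the child sh, not by the calling shell,
# so a syntax error in it can only break the child. $1 is the output directory.
printf '%s\n' one two three | sh -c 'wc -l > "$1/out_$FNAME.txt"' sh "$tmp"

count=$(tr -d '[:space:]' < "$tmp/out_$FNAME.txt")
printf 'out_%s.txt holds %s\n' "$FNAME" "$count"

rm -rf "$tmp"
```

With the script itself, the same pattern would presumably be invoked as linesplit -l 10 sh -c 'wc -l > "out_$FNAME.txt"', which needs no eval because the child shell reads FNAME from the environment.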

Quote:
Perhaps I'm missing something, but without the eval how would you do this:

Code:
$ printf "%s\n" {00..99} | ./linesplit -l 10 wc -l \> out_@FNAME@.txt

That problem would be more easily solved with standard split I think.
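
A minimal sketch of the split approach, assuming POSIX split and a made-up chunk_ prefix in a temporary directory. The .count files stand in for the out_N.txt files from the question.

```shell
tmp=$(mktemp -d)

# Generate 100 lines (00..99) and cut them into 10-line files named chunk_aa, chunk_ab, ...
awk 'BEGIN { for (i = 0; i < 100; i++) printf "%02d\n", i }' | split -l 10 - "$tmp/chunk_"

# Record the line count of each piece, similar to the wc -l > out_N.txt request.
for f in "$tmp"/chunk_*; do
    wc -l < "$f" > "$f.count"
done

pieces=$(ls "$tmp" | grep -c '\.count$')
first=$(head -n 1 "$tmp/chunk_aa")
printf '%s pieces, first line of first piece: %s\n' "$pieces" "$first"

rm -rf "$tmp"
```

The trade-off is that split writes the chunks to disk first, whereas linesplit streams each chunk straight into a command.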

Quote:
I also find it very cool that you can do stuff like this:

Code:
./linesplit 'tr -d '^M' | grep -vi ignore > out_@FNAME@.txt'

Why not put tr before linesplit? I see what you're getting at, but it'd be simple enough to put it in an awk command.

The awk command suggested by MadeInGermany would mess up when re-parsed unless you re-quote and escape it extravagantly.

Quote:
Can you explain what parsecmd is for? I can't see how it would be different from:

Code:
done | ${@//@FNAME@/$FNAME}

It substitutes inside individual tokens instead of cramming into one string, substituting, and splitting back apart. This preserves splitting.
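
A small bash sketch of the difference. The count_args helper and the sample command line are hypothetical, chosen only to make the splitting visible.

```shell
#!/bin/bash
# Hypothetical demo: count_args and the sample command line are illustrative only.
FNAME=00000001
count_args() { echo "$#"; }

set -- awk '{ print $1 }' 'out_@FNAME@.txt'

# Per-token substitution, as parsecmd does: every argument stays a single word.
ARGS=()
for arg in "$@"; do
    ARGS+=("${arg//@FNAME@/$FNAME}")
done
per_token=$(count_args "${ARGS[@]}")

# Unquoted expansion: after substitution the result is word-split again,
# so the awk program falls apart into separate words.
unquoted=$(count_args ${@//@FNAME@/$FNAME})

echo "per-token: $per_token arguments, unquoted: $unquoted arguments"
```

The per-token version passes 3 arguments and the unquoted one passes 6. In bash, quoting the expansion as "${@//@FNAME@/$FNAME}" also applies the substitution to each argument separately and keeps the splitting intact.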

Last edited by Corona688; 08-14-2014 at 06:20 PM..
# 13  
Old 08-15-2014
Solved

Hello all,

I finally got it working. I used this:
Code:
gawk 'BEGIN { FS=","} NR == FNR{a[$0];next} $1 in a' new_msisdn_list.csv production_list.csv > output.txt

real   0m36.061s
user  0m30.694s
sys   0m0.704s


Thank you all who replied. I appreciate your help.

Last edited by rbatte1; 08-15-2014 at 07:43 AM.. Reason: Added CODE tags