Incredibly inefficient cat | grep script


# 1  

Hi there,

I have 2 files that I am trying to work on.

File 1 contains a reference list of unique subscriber numbers (7 million entries in total).

File 2 contains a list of the subscriber numbers and their tariffs (15 million entries in total). This file is in the production system and hasn't had old subscribers removed for some time, so more than half of the entries need to be removed.

I created the following couple of lines to try to obtain the active 7 million subscriber numbers and tariffs from the behemoth 15 million list.

Code:
cat accurate_list.csv | while read -r ref
do
    grep "$ref" production_list.csv >> new_msisdn_list.csv
done

While this is actually working, it's only producing around 20 entries per second, which will take days to complete.

I'm afraid I'm a complete noob and can't come up with anything more inventive. I'm sure awk/sed/perl or probably any other number of languages would be perfect for something like this.

Anyway any suggestions are gratefully received.

Thanks
Cludgie
# 2  
What's the exact contents of each file?

Right now, you're literally running millions of greps, comparing each ID to every other ID. That's a huge waste of time, obviously. You only really need to compare each ID to itself.

Because it seems to me that you're just looking for IDs that are in both files. If each ID only occurs once in each file, pull the IDs out of each file, combine them into one file, sort them, and only count IDs that are duplicated in the combined file.

Or you can figure out how to use the "join" utility. That may not be any faster than what you're already doing, but it does have the huge advantage of not having to fork() and exec() a new process millions of times.
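The sort-and-find-duplicates idea above can be sketched roughly as follows. This is only an illustration under assumptions the thread has not yet confirmed at this point: each ID appears at most once per file, the reference file holds one ID per line, and the ID is the first comma-separated field of the larger file. The function name common_ids is made up for the example.

```shell
# Sketch: list IDs present in both files by sorting the combined ID lists
# and keeping only values that occur more than once.
# Assumptions: at most one occurrence of each ID per file, and the ID is
# field 1 (comma-separated) of the production file.
common_ids() {
    ref_file=$1      # one ID per line
    prod_file=$2     # ID,other-fields per line
    {
        cat "$ref_file"
        cut -d, -f1 "$prod_file"
    } | sort | uniq -d
}

# Usage: common_ids accurate_list.csv production_list.csv > common_ids.txt
```

Note that this yields only the shared IDs, not the rest of each matching line, so a second step would still be needed to pull the full records.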
# 3  
Normally,
Code:
grep -f accurate_list.csv  production_list.csv

would be ideally suited to the job, but I'm afraid the sheer numbers will overwhelm it.
How about using split to break the accurate list into smaller lists (or, if both files are sorted, the production file into smaller files as well) and giving that a try?
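A rough sketch of the split idea, assuming temporary chunk files are acceptable. The function name grep_in_chunks, the default chunk size of 5000 and the temp-directory layout are all assumptions made for the example; -F is added on the assumption that the IDs are plain strings rather than regular expressions.

```shell
# Sketch: run grep once per chunk of the reference list instead of once
# per ID. Chunk size and temp-file layout are assumptions.
grep_in_chunks() {
    ref_file=$1
    prod_file=$2
    chunk_lines=${3:-5000}
    workdir=$(mktemp -d) || return 1
    # -a 4 gives four-letter suffixes (chunk.aaaa, chunk.aaab, ...) so a
    # large reference list does not run out of output file names
    split -a 4 -l "$chunk_lines" "$ref_file" "$workdir/chunk."
    for chunk in "$workdir"/chunk.*; do
        [ -e "$chunk" ] || continue   # nothing was split (empty input)
        # -F treats each pattern as a fixed string rather than a regex
        grep -F -f "$chunk" "$prod_file"
    done
    rm -rf "$workdir"
}

# Usage: grep_in_chunks accurate_list.csv production_list.csv 5000 > new_msisdn_list.csv
```

Because fixed-string matching looks for the ID anywhere in the line, this only behaves as intended if an ID cannot also appear inside some other field.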
# 4  
The other question is, how do you want your output?

Right now, the way you're doing it, it's going to bundle them in groups of particular IDs. If you do it in "bulk", it may not end up sorted that politely.

If that's not a problem, I wrote this perl script to do 'split' like things without the temp file mess:

Code:
#!/usr/bin/perl

# Die screaming instead of silently if child fails
$SIG{PIPE} = sub { die("SIGPIPE"); };

my $lines=1000;

my ($running, $ccount) = (1, 0);

while($#ARGV >= 0)
{
        if($ARGV[0] eq "-l") {
                $lines=$ARGV[1]+0;
                if($lines <= 0)
                {
                        printf(STDERR "Invalid size: %s\n", $ARGV[1]);
                        exit(1);
                }
                shift;  shift;
                next;
        }

        last;
}

if($#ARGV < 0)
{
        printf(STDERR "pipesplit.pl:  Designed to read from standard input, \n");
        printf(STDERR "and split into multiple streams, running one command per loop\n");
        printf(STDERR "and writing into its STDIN.  \@FNAME\@ is available as a\n");
        printf(STDERR "sequence number if needed.\n\n");
        printf(STDERR "syntax:  pipesplit.pl [-l lines] command arguments \@FNAME\@ ...\n");
        exit(1);
}

#print $ARGV[0], "\n";
#exit(0);

my @l=@ARGV;

while($running) {
        my $n;
        my $fname=sprintf("%08d", $ccount++);
        my $fr=\$fname;

        # Use given arguments as a command, with @FNAME@ substituted
        # for an incrementing number like 00000001

        open(OUT, "|-",
                map { my $v=$_; $v =~ s/\@FNAME\@/${$fr}/; $v } @l
        ) or die("Cannot run $l[0]: $!");

        for($n=0; $n<$lines; $n++)
        {
                my $line=<STDIN>;

                # <STDIN> returns undef at end of input
                if(!defined($line)) {
                        $running=0;
                        last;
                }

#               print STDERR "chunk $ccount line $line";

                print OUT $line;
        }

        close(OUT);
}

printf(STDERR "Wrote %d chunks of %d lines\n", $ccount-1, $lines);

You would use it like
Code:
./linesplit.pl -l 5000 grep -F -f - production_list.csv < accurate_list.csv > new_msisdn_list.csv


Last edited by Corona688; 08-13-2014 at 04:45 PM.. Reason: wrong code
# 5  
I'm not sure what these files look like. But if they are sorted on the ID field, it sounds like this is just what the "join" command does. Here is a sample run:
Code:
$
$
$ cat file1
8  user 092 kjhuhggty
4  user 343 nbvnvcvc
9  user 391 jllklklkj
6  user 549 rewrewer
2  user 654 kjlkjl
7  user 760 jbjftgd
1  user 777 hkhghgh
3  user 888 hghfgfhgf
5  user 984 nbvnbvmn
$
$
$ cat file2
391
654
760
777
888
999
$
$
$ join -1 3 -2 1 -o "1.1 1.2 1.3 1.4"  file1 file2
9 user 391 jllklklkj
2 user 654 kjlkjl
7 user 760 jbjftgd
1 user 777 hkhghgh
3 user 888 hghfgfhgf
$
$
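The same join approach could be adapted to comma-separated input along these lines. This is a sketch only: it assumes the ID is the first comma-separated field in both files, and the function name join_by_id is invented for the example. Both inputs are sorted first because join expects them in the same collation order, which is why LC_ALL=C is set for every command.

```shell
# Sketch: join the two lists on the ID column after sorting both.
# Assumes comma-separated input with the ID in field 1 of each file;
# adjust -t, -1 and -2 to the real layout.
join_by_id() {
    ref_file=$1
    prod_file=$2
    sorted_prod=$(mktemp) || return 1
    LC_ALL=C sort -t, -k1,1 "$prod_file" > "$sorted_prod"
    # "-" makes join read the sorted reference list from standard input
    LC_ALL=C sort "$ref_file" | LC_ALL=C join -t, -1 1 -2 1 - "$sorted_prod"
    rm -f "$sorted_prod"
}

# Usage: join_by_id accurate_list.csv production_list.csv > new_msisdn_list.csv
```

The output comes back sorted by ID rather than in the original order of the larger file.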

# 6  
Quote:
Originally Posted by Corona688
I wrote this perl script to do 'split' like things without the temp file mess
Off topic a little (apologies to the OP), but I love this util Corona688!

I was wondering if it could be done as a bash script. I came up with the script below.

While testing I did discover that if lines split evenly it does run the command 1 more time than needed with empty input (e.g. -l 10 for a 100 line file).

Code:
#!/bin/bash
lines=1000
ccount=0

# Leading ":" puts getopts in silent mode, so the offending option
# character is available in OPTARG
while getopts :l: opt
do
    case $opt in
        l) lines=$OPTARG ;;
        *) echo "Illegal option -$OPTARG" >&2
           exit 1
        ;;
    esac
done
shift $((OPTIND-1))

if [ $# -le 0 ]
then
    cat >&2 <<EOF
linesplit:  Designed to read from standard input
            and split into multiple streams, running one command per loop
            and writing into its STDIN.  @FNAME@ is available as a
            sequence number if needed.

syntax:  linesplit [-l lines] command arguments @FNAME@ ...
EOF
    exit 1
fi

while [[ ${PIPESTATUS[0]} != 33 ]]
do
    ((ccount++))
    printf -v fname "%08d" $ccount
    for((n=0; n<lines; n++))
    do
        # IFS= and -r keep leading whitespace and backslashes intact
        IFS= read -r line && printf '%s\n' "$line" || exit 33
    done | eval ${@//@FNAME@/$fname}
done

printf "Wrote %'d full chunks of %'d lines\n" $((ccount-1)) $lines >&2

eg:

Code:
$ printf "%s\n" $(seq 1 100) | ./linesplit.pl -l 10 wc -l
10
10
10
10
10
10
10
10
10
10
0
Wrote 10 chunks of 10 lines

$ printf "%s\n" $(seq 1 100) | ./linesplit.pl -l 3000 wc -l
100
Wrote 0 chunks of 3000 lines
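One possible way to avoid the extra run on empty input seen above is to read the first line of each chunk before starting the command. The following is a minimal sketch of just that loop, not a replacement for either script; run_in_chunks is a hypothetical name, and option parsing and the @FNAME@ substitution are left out.

```shell
# Sketch: only start the command once a line is known to be available,
# so no command is run on empty input.
run_in_chunks() {
    lines=$1
    shift
    # Read the first line of the next chunk before launching anything
    while IFS= read -r first; do
        {
            printf '%s\n' "$first"
            n=1
            # Copy up to lines-1 further lines into this chunk
            while [ "$n" -lt "$lines" ] && IFS= read -r line; do
                printf '%s\n' "$line"
                n=$((n + 1))
            done
        } | "$@"
    done
}

# Usage: seq 1 100 | run_in_chunks 10 wc -l
```

The inner group and the outer loop read from the same standard input, which relies on the shell's read consuming no more than one line at a time.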


Last edited by Chubler_XL; 08-13-2014 at 08:45 PM..
# 7  
@achenle - The smaller reference file contains the 12-digit numeric reference for every active subscriber we have (format=000000000000). The much larger production file contains the 12-digit reference and an alphanumeric tariff code, separated by a comma. Tariff codes have one of the following formats: A000000, AA00000, A000000-AA00000 or AA00000-AA00000. The longer tariff codes indicate an additional service or add-on. So the format is 000000000000,A000000 or any other tariff/add-on variant.

My aim is to extract the reference and tariff info from the production file, using the references contained in the accurate list.

@corona688 - The output should be in the same format as the production file, which is reference,tariff, i.e. 000000000000,AA00000. I don't think sorting is an issue at this point.

Guys thanks very much, I'll give it a go.
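
With the formats described above (one bare reference per line in the accurate list, reference,tariff in the production list), a single awk pass is another option that may be worth trying. This is a sketch rather than a tested solution for these files: the function name filter_by_ids is invented, and it assumes the machine has enough memory to hold all 7 million references in an awk array and that the accurate list is not empty.

```shell
# Sketch: load every reference into an awk array, then print only the
# production lines whose first field is one of those references.
filter_by_ids() {
    ref_file=$1
    prod_file=$2
    awk -F, '
        NR == FNR { wanted[$1]; next }   # first file: remember each ID
        $1 in wanted                     # second file: print matching lines
    ' "$ref_file" "$prod_file"
}

# Usage: filter_by_ids accurate_list.csv production_list.csv > new_msisdn_list.csv
```

The output keeps the reference,tariff format and the original order of the production file, and each file is read only once.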