The actual file has approximately 50 million such lines, with larger numbers.
Now I need a way to pull out all the unique sets.
Example
Cases satisfying the condition:
3 2 is unique, since for first-column value 3 the second column is always "2" and nothing else;
the same goes for 5 1, which appears only once.
Cases that do not satisfy it:
1 2 etc., because 1 has other numbers in the second column.
Already tried:
1. A MySQL database: does not work.
2. grep and awk: very slow; my script has been running for more than 3 days.
3. Sorting on the first column and comparing with the second column: I need help with any Unix/Linux tool that can do this.
4. The comm command also seems to choke on this much data.
5. Perl with binmode, reading in blocks, etc.: slower than grep and egrep.
Any ideas on how to get these details?
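Since the goal is just "first-column keys whose second column is always the same value", one sort-plus-awk pass over the data may be enough, with no per-key grep. A minimal sketch, assuming whitespace-separated columns; sample.log here is a small stand-in for the real 50M-line file:

```shell
# Small sample standing in for the big log (whitespace-separated columns).
printf '1 2\n1 3\n3 2\n5 1\n1 2\n' > sample.log

# sort -u collapses repeated identical pairs; after that, any key that still
# appears on more than one (adjacent) line pairs with several second values,
# so we keep only keys that occupy exactly one line.
sort -u sample.log | awk '
    $1 == prev { dup = 1; next }
    { if (NR > 1 && !dup) print saved; dup = 0; prev = $1; saved = $0 }
    END { if (NR && !dup) print saved }
'
# -> prints "3 2" then "5 1"
```

sort can spill to disk, so this scales to files that do not fit in memory, at the cost of one full sort of the data.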
Last edited by chakrapani; 03-29-2010 at 08:59 AM..
#!/bin/bash
# lookfor LOGFILE KEYFILE
# For every key (first-column value) listed in KEYFILE, count how many
# distinct second-column values it has in LOGFILE; keep it if exactly one.
function lookfor {
    log=$1
    keys=$2
    echo -e "NOW Processing file: $keys\n\n"
    while read -r key
    do
        # Make sure the length of the key we are looking for is at least 5
        if [ "${#key}" -ge 5 ]
        then
            # Match the key between its "|" delimiters (as the second grep
            # already did) so e.g. 123 does not also match 1234; then count
            # the distinct second-column values
            val=$(grep -F "|$key|" "$log" | awk '{ print $2 }' | sort -u | wc -l)
            if [ "$val" -eq 1 ]
            then
                echo -n "$key"
                grep -F "|$key|" "$log" >> fnd.txt
            fi
            echo -n "."   # Show some activity
        else
            echo -n "-"   # Show some activity that I rejected this number
        fi
    done < "$keys"
}
lookfor "hugelog1.log" "firstRowUniq.1"
lookfor "hugelog2.log" "firstRowUniq.2"
lookfor "hugelog3.log" "firstRowUniq.3"
I have split hugelog into several files on Linux, and each firstRowUniq file
holds the unique values of the first column of the corresponding hugelog file.
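For reference, a firstRowUniq-style key file can be produced in one pipeline. A sketch, again assuming whitespace-separated columns and using a small sample file in place of the real log:

```shell
# Build a firstRowUniq-style file: unique first-column values of the log.
printf '3 2\n5 1\n1 2\n1 4\n3 2\n' > hugelog.sample
awk '{ print $1 }' hugelog.sample | sort -u > firstRowUniq.sample
cat firstRowUniq.sample   # -> prints 1, 3, 5 (one per line)
```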
OK, what I need is for the set to remain the same: 1 2 and 1 4 should not both be in the output, because the second number changes from 2 to 4; but if a key always has the same set, that is fine, e.g. 3 2 and 5 1 in the original example.
What my code does is take the unique first-column numbers, grep for each number in the file, and then check with wc -l that only one distinct second number appears.
It works, but it is slow.
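The whole grep-per-key loop can also be replaced by a single awk pass that keeps one hash entry per key: remember the first second-column value seen, and mark the key bad if a different value ever shows up. A sketch under the same whitespace-separated assumption; note that memory grows with the number of distinct first-column values, which for 50M lines may or may not fit:

```shell
# Small sample standing in for the big log.
printf '3 2\n5 1\n1 2\n1 4\n3 2\n' > sample2.log

# One pass, no sort: val[k] holds the first second-column value seen for key
# k, bad[k] marks keys that later pair with a different value. awk's for-in
# order is unspecified, so sort the result for a stable listing.
awk '
    !($1 in val) { val[$1] = $2; next }
    val[$1] != $2 { bad[$1] = 1 }
    END { for (k in val) if (!(k in bad)) print k, val[k] }
' sample2.log | sort
# -> prints "3 2" then "5 1"
```

Repeated identical pairs (3 2 appearing twice here) stay unique, matching the "same set all the time is ok" rule above.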