Best search technique


 
# 1  
Old 03-29-2010
Best search technique - Need help

I have a data file; a snippet is shown below:

data file
Code:
1 2
1 3
1 3
4 2
3 2
2 1
2 2
5 1
3 2
3 2
2 3
1 4

The actual file has approximately 50 million such lines, with bigger numbers.

Now I need a way to pull out all unique sets.

Example

Cases satisfying the condition:
3 2 is unique, because whenever the first column is 3 the second column is always 2 and never any other number.
5 1 also qualifies; it appears only once.

Cases that do not qualify:
1 2 etc., where 1 has other numbers in the second column.

Already tried:
1. A MySQL database. It did not work.
2. grep and awk: very slow. My script has been running for more than 3 days.
3. Sorting a column and comparing it with the second column. I need help with any Unix/Linux tool that can do this.
4. The comm command also seems unable to cope with this much data.
5. Perl with binmode and block-wise reads. Slower than grep and egrep.

Any ideas on how to get these details?
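
A sort-based approach (item 3 above) could look roughly like the following. This is only a sketch under assumptions: the function name keys_with_single_value is made up for illustration, the file is assumed to be whitespace-separated with two columns, and the demo input is the sample data from this post. The idea is to reduce the file to distinct (first, second) pairs, then keep the first-column values that occur in exactly one distinct pair.

```shell
# Hypothetical helper: print every first-column value that is paired
# with exactly one distinct second-column value.
keys_with_single_value() {
  awk 'NF >= 2 { print $1, $2 }' "$@" |  # normalise to "col1 col2", skip short lines
    LC_ALL=C sort -u |                   # distinct pairs, grouped by col1
    awk '{ print $1 }' |                 # keep only the first column
    uniq -u                              # keys that occur in exactly one pair
}

# Demo on the sample data.
printf '%s\n' '1 2' '1 3' '1 3' '4 2' '3 2' '2 1' '2 2' '5 1' '3 2' '3 2' '2 3' '1 4' |
  keys_with_single_value
```

On the sample this prints 3, 4 and 5, assuming 4 counts as well because its only second-column value is 2. With a real file, pass the file name as an argument instead of piping. sort spills to temporary files when the data does not fit in memory, so this should cope with large inputs, though no timings are claimed here.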

Last edited by chakrapani; 03-29-2010 at 08:59 AM..
# 2  
Old 03-29-2010
It's hard to tell without seeing the code you run ...
Please try the following command and let us know the timings:
Code:
awk 'END {
  for (__ in _)
    if (_[__] < 2) 
      print __
  }
{ _[$0]++ }' infile

Use gawk, nawk or /usr/xpg4/bin/awk on Solaris.

What is the size (in MB) of your data file?
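
In the command above, _ is simply the name of an array and __ is the name of the loop variable; both are ordinary awk identifiers. The same logic with descriptive names might read as follows. The function name print_lines_seen_once and the variable names count and line are chosen here for illustration only, and the demo input is the sample data from post #1.

```shell
# Same logic as the one-liner above, with descriptive names.
# count[line] holds how many times each whole input line ($0) occurs.
print_lines_seen_once() {
  awk '
    { count[$0]++ }            # runs for every input line
    END {
      for (line in count)      # loop over the distinct lines
        if (count[line] < 2)   # keep lines that occurred exactly once
          print line
    }' "$@"
}

# Demo on the sample data; awk's "for (x in array)" order is unspecified,
# so the result is sorted for a stable listing.
printf '%s\n' '1 2' '1 3' '1 3' '4 2' '3 2' '2 1' '2 2' '5 1' '3 2' '3 2' '2 3' '1 4' |
  print_lines_seen_once | sort
```

Note that this counts whole lines, so it reports lines that occur once in the file, regardless of what else shares their first column.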
# 3  
Old 03-29-2010
Hi, thanks for your reply.

Currently I am using this code:

Code:
#!/bin/bash

function lookfor {
y=$1
s=$2
ONE="1"
echo -e "NOW Processing file: $s \n\n"
for file in $(cat $s )
do
# Only process numbers that are at least 5 characters long
if [ "${#file}" -ge 5 ]
 then
 VAL=$( grep "$file" $y | awk '{ print $2}' | sort | uniq | wc -l )
 if [ -n "$VAL" ]
   then
    if [ "x$VAL" == "x$ONE" ]
       then
        echo  -en "$file"
        grep "|$file|" $y >> fnd.txt
     fi
   fi
echo -en "." # Show some activity
else
  echo -en "-" # Show some activity that I rejected this number
fi
done
}

lookfor "hugelog1.log" "firstRowUniq.1" ;
lookfor "hugelog2.log" "firstRowUniq.2" ;
lookfor "hugelog3.log" "firstRowUniq.3" ;

I have split the huge log into several files, and each firstRowUniq file holds the unique values of the first column of the corresponding hugelog file. In Linux terms, what I mean is:

$ awk '{ print $1 }' hugelog1.log | sort | uniq > firstRowUniq.1

Please note that the hugelog*.log files are not sorted. The combined size of all the huge files is around 3 GB.
# 4  
Old 03-29-2010
Thanks,
please try the command I posted with the original data file.
# 5  
Old 03-29-2010
Hi,
When I try the awk command, I see that everything in the file is printed. Could you tell me what _ and __ are in the awk command?

Or was it a fill-in-the-blank for me? I am a bit confused.
# 6  
Old 03-29-2010
Quote:
Originally Posted by chakrapani
Hi,
When I try the awk command, I see that everything in the file is printed. Could you tell me what _ and __ are in the awk command?

Thanks
Could you post the complete output given the sample input above?

Based on your input file:
Code:
1 2
1 3
1 3
4 2
3 2
2 1
2 2
5 1
3 2
3 2
2 3
1 4

The output I get is:

Code:
1 2
1 4
4 2
2 1
2 2
2 3
5 1

Is it wrong and if yes, why?
# 7  
Old 03-29-2010
Code:
1 2
1 4
4 2
2 1
2 2
2 3
5 1

OK, what I need is for the set to stay the same, so 1 2 and 1 4 should not be there, because the second number changes (2 and 4). If the same set appears every time, it is OK, for example 3 2 and 5 1 in the original data.

What I was doing in my code was to get the unique first-column numbers, grep for each number in the file, and then check with wc -l whether only one second number appears.
It works, but it is slow.
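
The requirement described above (keep a pair only if its first number never appears with a different second number) can be checked in a single pass over the file, instead of one grep per number. The following is a minimal sketch, not a tested solution for the real 3 GB data: the function name single_valued_pairs and the array names first and bad are made up for illustration, and the input is assumed to be two whitespace-separated columns.

```shell
# Hypothetical helper: print "col1 col2" for every first-column value
# that is only ever seen with one second-column value.
single_valued_pairs() {
  awk '
    NF < 2 { next }                             # skip blank or short lines
    !($1 in first) { first[$1] = $2; next }     # first time this key is seen
    (first[$1] "") != ($2 "") { bad[$1] = 1 }   # a different second value (string compare)
    END {
      for (k in first)
        if (!(k in bad))
          print k, first[k]
    }' "$@"
}

# Demo on the sample data from post #1; sorted because awk's
# "for (k in array)" order is unspecified.
printf '%s\n' '1 2' '1 3' '1 3' '4 2' '3 2' '2 1' '2 2' '5 1' '3 2' '3 2' '2 3' '1 4' |
  single_valued_pairs | sort
```

On the sample this prints 3 2, 4 2 and 5 1, assuming 4 2 qualifies under the same rule as 5 1. The arrays hold one entry per distinct first-column value, so memory use depends on how many distinct first-column numbers the real file contains.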