awk uniq and longest string of a column as index


 
# 1  
Old 09-06-2012

I need to filter ~70 million sequence rows and would like to use awk, under these conditions:
1) the longest string of each pattern in column 2 is the index, and any substring of it is ignored;
2) only the unique patterns remaining after 1) are kept;
3) the whole row is printed.

input:
Code:
1 ABCDEFGHI longest_sequence1
2  ABCDEFGH substring_a
3    CDEFG  substring_b
4   ACBDEFGH longest_sequence2_# Note_the order ACB
5   ACBDEFG substring_c
6   ABCDE substring_d
7   ADBCE longest_sequence3_# Note the order ADB
8   ADBC substring_e
9   ABC substring_f
10   DBC substring_g

output:
Code:
1 ABCDEFGHI longest_sequence1
4   ACBDEFGH longest_sequence2_# Note_the order ACB
7   ADBCE longest_sequence3_# Note the order ADB

I first picked out only the unique patterns in column 2:
Code:
awk '!x[$2]++' infile > temp.file

That reduced the file to under ~5 million rows. I am not sure whether the second step, picking the longest string of each pattern, is doable with awk, and I would appreciate some expert help with it.
Thanks a lot in advance!

Yifang
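One possible way to approach the second step is sketched below, assuming GNU sort (for the stable -s flag) and input small enough to sort. The rows are sorted so the longest column 2 comes first, and a row is kept only if its column 2 is not contained in a string that was already kept; exact duplicates are dropped by the same test. The function name filter_longest and the temporary sample file are illustrative, not from the post. Every row is compared against every kept string, so this may be slow if many patterns survive.

```shell
#!/bin/sh
# Sketch: keep only rows whose column 2 is not contained in a longer column 2.
# filter_longest is a hypothetical helper name, not something from the thread.
filter_longest() {
  # 1) prefix each row with the length of column 2,
  # 2) sort longest first (-s keeps input order among equal lengths),
  # 3) strip the length prefix again.
  awk '{ print length($2) "\t" $0 }' "$1" |
    sort -s -k1,1nr |
    cut -f2- |
    awk '{
      # If a kept string contains $2, this row is a substring or a duplicate.
      for (i = 1; i <= n; i++)
        if (index(kept[i], $2)) next
      kept[++n] = $2   # new maximal string: remember it and print the row
      print
    }'
}

# Sample rows from the post above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1 ABCDEFGHI longest_sequence1
2  ABCDEFGH substring_a
3    CDEFG  substring_b
4   ACBDEFGH longest_sequence2_# Note_the order ACB
5   ACBDEFG substring_c
6   ABCDE substring_d
7   ADBCE longest_sequence3_# Note the order ADB
8   ADBC substring_e
9   ABC substring_f
10   DBC substring_g
EOF

filter_longest "$sample"
```

On the sample this prints rows 1, 4 and 7. Rows come out ordered by the length of column 2, not in their original order.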
# 2  
Old 09-06-2012
awk isn't the same everywhere; some implementations have much larger line-length limits than others. What's your system?
# 3  
Old 09-06-2012
Linux 3.2.0-3-amd64 #1 SMP Thu Jun 28 09:07:26 UTC 2012 x86_64 GNU/Linux
# 4  
Old 09-12-2012
I tried a Perl script but did not get what I wanted; the output was actually empty. Can anybody help me with my code?
Code:
#!/usr/bin/perl
#This script is to print the longest string of each type, ignore any substrings

use strict;
use warnings;

my $infile  = $ARGV[0];
my $outfile = $ARGV[1];
my @DB;

open(INFILE, "<$infile") or die "Cannot open the input file $!\n";

while (<INFILE>) {
    chomp $_;
    foreach my $member (@DB) {
        if (index($member, $_) >= 0) {
            next;
        } else {
            push (@DB, $_);
        }
    }
}
close(INFILE);

open (OUTFILE, ">$outfile") or die "Cannot open the output file $!\n";

foreach my $ID (@DB) {
    print OUTFILE "$ID\n";
}

close(OUTFILE);

infile.txt:
Code:
ABCDEFGHI
ABCDEFGH
CDEFG
ACBDEFGH
ACBDEFG
ABCDE
ADBCE
ADBC
ABC
DBC

I expect this output:
Code:
ABCDEFGHI
ACBDEFGH
ADBCE

Thanks again!
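The script above writes an empty file because @DB starts out empty: the foreach body, which holds the only push, never runs. A minimal awk sketch of the intended filter for the one-column file follows. It does not rely on input order: a line is skipped if a kept string contains it, and kept strings that a new line contains are dropped. The name keep_maximal is hypothetical, and the reversed sample order is chosen only to show the order independence.

```shell
#!/bin/sh
# Sketch: single pass over a one-column file, independent of input order.
# keep_maximal is a hypothetical helper name, not something from the thread.
keep_maximal() {
  awk '
    NF == 0 { next }                 # ignore blank lines
    {
      # Skip the line if a kept string already contains it (or equals it).
      for (s in kept)
        if (index(s, $1)) next
      # The line is new: drop every kept string that it contains.
      for (s in kept)
        if (index($1, s)) delete kept[s]
      kept[$1] = 1
    }
    END { for (s in kept) print s }  # output order is unspecified
  ' "$@"
}

# The one-column sample in reverse order; sort only makes the output stable.
printf '%s\n' DBC ABC ADBC ADBCE ABCDE ACBDEFG ACBDEFGH CDEFG ABCDEFGH ABCDEFGHI |
  keep_maximal | sort
```

For this input the three surviving strings are ABCDEFGHI, ACBDEFGH and ADBCE. As with any pairwise substring check, the cost grows with the number of strings kept at a time.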
# 5  
Old 09-13-2012
Quote:
Originally Posted by yifangt
Code:
awk '!x[$2]++' infile > temp.file

That reduced the file to under ~5 million rows. I am not sure whether the second step, picking the longest string of each pattern, is doable with awk, and I would appreciate some expert help with it.
Thanks a lot in advance!

Yifang
Something like this?
Code:
awk '
$3 ~ /^longest_sequence/ {
  if (x[$2] == 0) { print }
  x[$2]++
}
' file > output

# 6  
Old 09-13-2012
I meant filtering on the second column, not the third. The third column only holds descriptive comments for this example and is not present in my real data. Thanks for your input, though.
yt
# 7  
Old 09-13-2012
I don't quite follow how you select the 'longest' sequence: you pick some, but not others.
Here's my take, based on your most recent 1-column sample file. Run it as: awk -f yi.awk myFile
yi.awk:
Code:
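BEGIN { ord_init() }

# Build a lookup table from each single character to its numeric code.
function ord_init(   i, t) {
  for (i = 0; i <= 255; i++) {
    t = sprintf("%c", i)
    _ord[t] = i
  }
}

# Return the sum of the character codes of str.
function norm(str,   i, n) {
  for (i = 1; i <= length(str); i++)
    n += _ord[sprintf("%c", substr(str, i, 1))]
  return(n)
}

# Remember the last string seen for each sum and count the lines sharing it.
{
  _n = norm($1)
  la[_n] = $1
  lc[_n]++
}

# Print the remembered string for every sum shared by more than one line.
END {
  for (i in la)
    if (lc[i] > 1)
      print la[i]
}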
