awk uniq and longest string of a column as index


 
# 8  
Old 09-13-2012

Thanks vgersh99!
The key point is whether the current line is a substring of any line that has already been read, or the other way round. It is a kind of recursive comparison.
What I have in mind is:
Code:
read in a line;
compare the current line to the remembered ones;
If it is new, remember it;
if any remembered line is a substring of the current line (i.e. the current line is longer), replace that remembered line with the current line;
if it is a substring of any remembered line, ignore the current line;

As awk processes one line at a time, I thought it would be a good fit for this problem.
Code:
If it is new, remember it;

may not be accurate. Each line is certainly a unique string, but it can be a substring or a "parent" string of another line.
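The steps listed above could be sketched as a single pass in awk, roughly as follows. This is only an illustration under stated assumptions: the wrapper name longest_only and the array name keep are hypothetical, blank lines are skipped, and the substring test uses a literal index() match.

```shell
# Hypothetical helper name; reads lines from stdin or from the named files.
longest_only() {
  awk '
  # Assumption: blank lines are ignored.
  $0 == "" { next }
  {
    # Ignore the current line if a remembered line already contains it.
    for (k in keep)
      if (index(k, $0)) next
    # Otherwise drop every remembered line that the current line contains.
    for (k in keep)
      if (index($0, k)) delete keep[k]
    # Remember the current line.
    keep[$0]
  }
  END { for (k in keep) print k }
  ' "$@"
}

# Sample data invented for illustration; output order is unspecified, hence sort.
printf 'ab\nabc\nxyz\nbc\nabcd\n' | longest_only | sort
```

Each input line is compared against everything remembered so far, so the work still grows with the number of remembered lines; the sketch only avoids holding lines that are already known to be substrings.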
Thanks a lot!
yi

Last edited by yifangt; 09-13-2012 at 10:33 AM.. Reason: bug of the algorithm
# 9  
Old 09-13-2012
Still a bit vague, but this gets your desired output. It is probably not the most efficient approach given the amount of data.
awk -f yi.awk myFile
yi.awk:
Code:
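# Remember every distinct input line as an array index.
{a[$0]}
END {
  # Compare every pair of remembered lines.
  for (i in a)
    for (j in a)
      # i ~ j treats j as a regular expression matched against i.
      if (length(i) > length(j) && i ~ j)
        delete a[j]

  # Whatever is left is not contained in any longer line.
  for (i in a)
    print i
}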


Last edited by vgersh99; 09-13-2012 at 11:17 AM..
# 10  
Old 09-13-2012
Can I ask why i<=255 is needed?
# 11  
Old 09-13-2012
Quote:
Originally Posted by yifangt
Can I ask why i<=255 is needed?
huh? I don't see any mention of '255' in my most recent posting.
The mention of '255' was in the post where I didn't quite understand what you're after - try the most recent post/solution.
# 12  
Old 09-13-2012
Quote:
Originally Posted by vgersh99
Code:
i ~ j

It's probably a good idea to avoid regular expressions. If the real data can contain regular expression metacharacters, they could lead to an erroneous result. Even if the data is strictly alphabetical (as in the sample data), it might be a little bit faster to just use index(i,j).
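As a rough sketch of that suggestion, the same nested loop can use index(i, j) in place of i ~ j. The wrapper name longest_by_index and the sample lines are made up for illustration; the logic otherwise mirrors the yi.awk script posted earlier.

```shell
# Hypothetical wrapper name; same logic as yi.awk with index() in place of "~".
longest_by_index() {
  awk '
  { a[$0] }                       # remember every distinct line
  END {
    for (i in a)
      for (j in a)
        # index(i, j) is non-zero only when j occurs literally inside i
        if (length(i) > length(j) && index(i, j))
          delete a[j]
    for (i in a)
      print i
  }
  ' "$@"
}

# "a.c" is not a literal substring of "xabc", so both lines survive,
# while "ab" is dropped because "xabc" contains it.
printf 'xabc\na.c\nab\n' | longest_by_index | sort
```

With the regex test, "a.c" would be read as a pattern in which the dot matches any character, so it would match "xabc" and the line "a.c" would be deleted by mistake.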

Regards,
Alister
# 13  
Old 09-13-2012
Quote:
Originally Posted by alister
It's probably a good idea to avoid regular expressions. If the real data can contain regular expression metacharacters, they could lead to an erroneous result. Even if the data is strictly alphabetical (as in the sample data), it might be a little bit faster to just use index(i,j).

Regards,
Alister
Nice 'nit-picking' - good idea!
Thanks