awk : extracting unique lines based on columns


 
# 8  
Old 05-01-2010
Quote:
Originally Posted by itkamaraj
please explain the logic

what is a !~ $2

thanks
kamaraj
Nope :)
I found it on the net some time ago; it simply emulates uniq, applied to the 2nd column.
Create a test file with the following content and play with it: try $0, $1, and $2 and see the difference:
Code:
abc 000
abc 123
abc 123
sdf 234
fds 234
jkl 999
lkj 222
lkj 222
qwe 443
rew 323
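
For anyone who wants to run that experiment, here is a minimal sketch. The temp directory and the name testfile are my own choices, and the last two commands (the pitfall input and the string-comparison variant) are additions for comparison only, not part of the one-liner from the net. As far as I can tell, `a !~ $2` reads "the previous line's 2nd column does not regex-match the current 2nd column", and `{a=$2}` then remembers the current value for the next line.

```shell
# Create the test file from the post (directory and file name are illustrative).
dir=$(mktemp -d)
cat > "$dir/testfile" <<'EOF'
abc 000
abc 123
abc 123
sdf 234
fds 234
jkl 999
lkj 222
lkj 222
qwe 443
rew 323
EOF

# Print a line unless its 2nd column matches the previous line's 2nd column.
by_col2=$(awk 'a !~ $2; {a=$2}' "$dir/testfile")
# Same idea applied to the 1st column and to the whole line.
by_col1=$(awk 'a !~ $1; {a=$1}' "$dir/testfile")
by_line=$(awk 'a !~ $0; {a=$0}' "$dir/testfile")

# Pitfall: !~ is a regex match, so "rs123" right after "rs1234" counts as a repeat.
pitfall=$(printf 'x rs1234\ny rs123\n' | awk 'a !~ $2; {a=$2}')
# Hypothetical safer variant: force a plain string comparison instead.
safe=$(printf 'x rs1234\ny rs123\n' | awk 'NR == 1 || (prev "") != ($2 ""); {prev = $2}')

printf '%s\n' "$by_col2"
rm -r "$dir"
```

With $2 the output keeps 7 of the 10 lines (one line per run of equal values in column 2), with $1 it also keeps 7 but a different set, and with $0 it keeps 8. On the two-line pitfall input the regex version prints only the first line, while the string-comparison variant prints both.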

genehunter,
how do you know that it didn't work? Just because our results differ?
Without the original file it will be hard to find the error.
You could copy, say, the first 500 lines of the data file into a temp file and see whether the results still differ.
Maybe you can then attach that temp file to your post, so we can all examine it and try to find the error (if there is one :) ).
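
A minimal sketch of that check, in case it helps. I don't have the original data, so the snp.txt below is a generated stand-in with made-up values, and the output file names are hypothetical. The second one-liner is the array-based approach that appears elsewhere in this thread.

```shell
dir=$(mktemp -d)
# Stand-in for the real data file: column 2 cycles through 7 made-up ids.
awk 'BEGIN { for (i = 1; i <= 2000; i++) print "chr1", "rs" (i % 7), i }' > "$dir/snp.txt"

# Copy the first 500 lines into a temp file.
head -n 500 "$dir/snp.txt" > "$dir/snp_sample.txt"

# Run both approaches on the sample.
awk 'a !~ $2; {a=$2}' "$dir/snp_sample.txt" > "$dir/out_adjacent.txt"
awk '!seen[$2]++' "$dir/snp_sample.txt" > "$dir/out_seen.txt"

# Compare the two results byte for byte and count their lines.
if cmp -s "$dir/out_adjacent.txt" "$dir/out_seen.txt"; then
    verdict="same"
else
    verdict="different"
fi
n_adjacent=$(awk 'END { print NR }' "$dir/out_adjacent.txt")
n_seen=$(awk 'END { print NR }' "$dir/out_seen.txt")

echo "$verdict: $n_adjacent vs $n_seen lines"
rm -r "$dir"
```

If the two outputs differ on a small sample like this, the sample itself is small enough to attach and inspect by eye.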
# 9  
Old 05-01-2010
Try this...

Code:
awk '!arr[$2]++' infile

# 10  
Old 05-01-2010
Quote:
Originally Posted by malcomex999
Try this...

Code:
awk '!arr[$2]++' infile

That simply means, just like "my" suggestion, print if not already in array, right?
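Spelled out, I believe the one-liner behaves like the long form below. This is only a sketch: the array name seen and the sample input are mine, not from the posts above.

```shell
# Made-up sample input with non-adjacent duplicates in column 2.
input='a 1
b 2
c 1
d 3
b 2'

# Short form: arr[$2]++ yields the old count (0 on first sight) and then increments it.
# !0 is true, so a line is printed only the first time its 2nd column appears.
short=$(printf '%s\n' "$input" | awk '!arr[$2]++')

# Long form with the same behaviour, written out step by step.
long=$(printf '%s\n' "$input" | awk '{
    if (!($2 in seen)) {   # first time this key shows up
        print              # same as the default action of the short form
    }
    seen[$2] = 1           # remember the key
}')

printf '%s\n' "$short"
```

Both forms print the lines "a 1", "b 2" and "d 3", in input order, without needing the duplicates to be adjacent.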
# 11  
Old 05-01-2010
Solution #2 implicitly assumes that identical values in column 2 are adjacent, which may not be the case. That would explain the higher line count. You can check this by sorting first, e.g.:
Code:
sort -k2,2 snp.txt | awk 'a !~ $2; {a=$2}'
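
To illustrate the adjacency point, here is a small sketch with made-up data in which the duplicates in column 2 are not next to each other:

```shell
# Made-up input: rs1 and rs2 each occur twice, never on consecutive lines.
input='a rs1
b rs2
c rs1
d rs2'

# Without sorting, every line differs from its predecessor, so nothing is dropped.
unsorted=$(printf '%s\n' "$input" | awk 'a !~ $2; {a=$2}')

# After sorting on column 2 the duplicates are adjacent and get dropped.
sorted=$(printf '%s\n' "$input" | sort -k2,2 | awk 'a !~ $2; {a=$2}')

n_unsorted=$(printf '%s\n' "$unsorted" | awk 'END { print NR }')
n_sorted=$(printf '%s\n' "$sorted" | awk 'END { print NR }')
echo "unsorted: $n_unsorted lines, sorted: $n_sorted lines"
```

Here the unsorted run keeps all 4 lines and the sorted run keeps 2. Note that sorting changes the order of the output, and which of the tied lines survives depends on how sort breaks ties.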

# 12  
Old 05-01-2010
Another solution:

Code:
nawk '! _[$2]++' infile.txt

The above solution doesn't need a pipe or sed, and it works even if the rows with the same 2nd column are not next to each other.

BR

---------- Post updated at 15:50 ---------- Previous update was at 15:30 ----------

Also, in Perl:

Code:
# @F is zero-indexed, so $F[1] is the 2nd column
perl -lane 'print if ( ! $h{$F[1]}++ );' snp.txt

# 13  
Old 05-01-2010
After printing just col2 from each result file and running it through sort and uniq, I found that the code from malcomex999 and pseudocoder gives the same results.
I then looked at the first few lines of col2 from the results (without uniq) and found that kamaraj's code shows duplicates.
Ahmead.diab, your code shows the same results as malcomex999 and pseudocoder. Thanks!
Code:
Final_Apr30 :~>head kamarajtestsnpuniq
SNP_A
rs996312
rs9942844
rs9942844
rs990327
rs988961
rs987824
rs976263
rs976263
rs976240
Final_Apr30 :~>head malcomxsnpuniq
SNP_A
rs7259854
rs2981575
rs11150978
rs11200014
rs2981579
rs1078806
rs1219648
rs6590505
rs6590504
Final_Apr30 :~>head pesudocodersnpuniq
SNP_A
rs7259854
rs2981575
rs11150978
rs11200014
rs2981579
rs1078806
rs1219648
rs6590505
rs6590504

Code:
line count
    56446 malcomx
    56446 malcomxsnpuniq
    57747 pesudocoder
    57747 pesudocodersnpuniq
    56446 kamarajtest
    24657 kamarajtestuniq

Hope this helps clarify.
Thank you all for your help and patience.
~GH
"Stand Up to Cancer!"

Last edited by genehunter; 05-01-2010 at 02:48 PM..