Print lines based upon unique values in Nth field


 
Thread Tools Search this Thread
Top Forums UNIX for Beginners Questions & Answers Print lines based upon unique values in Nth field
# 1  
Old 12-28-2017
Print lines based upon unique values in Nth field

For some reason I am having difficulty performing what should be a fairly easy task. I would like to print lines of a file that have a unique value in the first field. For example, I have a large data-set with the following excerpt:

Code:
PS003,001 MZMWR/ L-DWD// *
PS003,001 B-!!BRX[/+W M(N-PN(H/J >BCLWM// BN/+W *
PS004,001 L-(H-1M]]NYX[/ B-NGJN(H/WT MZMWR/ L-DWD// *
PS005,001 L-(H-1M]]NYX[/ >L H-NXJLWT/(W(T MZMWR/ L-DWD// *
PS006,001 L-(H-1M]]NYX[/ B-NGJN(H/WT <L H-CMJNJ/T MZMWR/ L-DWD// *
PS007,001 CGJWN/ L-DWD// >CR C(JR[ L-JHWH// *
PS007,001 <L DBR/J KWC=// BN/ JMJNJ/ *
PS008,001 L-(H-1M]]NYX[/ <L H-GTJT/ MZMWR/ L-DWD// *
PS009,001 L-(H-1M]]NYX[/ <LMWT/ L-(H-BN/ MZMWR/ L-DWD// *
PS011,001 L-(H-1M]]NYX[/ L-DWD// B-JHWH// XS)HJ[TJ >JK !T!>MR[W L-NPC/+J *
PS011,001 !!NWD[)JW HR/+KM YPWR/ *

The output I desire is this:
Code:
PS004,001 L-(H-1M]]NYX[/ B-NGJN(H/WT MZMWR/ L-DWD// *
PS005,001 L-(H-1M]]NYX[/ >L H-NXJLWT/(W(T MZMWR/ L-DWD// *
PS006,001 L-(H-1M]]NYX[/ B-NGJN(H/WT <L H-CMJNJ/T MZMWR/ L-DWD// *
PS008,001 L-(H-1M]]NYX[/ <L H-GTJT/ MZMWR/ L-DWD// *
PS009,001 L-(H-1M]]NYX[/ <LMWT/ L-(H-BN/ MZMWR/ L-DWD// *

I have attempted 'sort' with appropriate flags which should work, but for some reason I cannot get it to. For example:

Code:
sort -u -k1,1

I have also tried an 'awk' solution:

Code:
awk '!a[$1]++'

Both of the latter seem to give me the first of the two repeated values in $1, such as:

Code:
PS003,001 MZMWR/ L-DWD// *
PS004,001 L-(H-1M]]NYX[/ B-NGJN(H/WT MZMWR/ L-DWD// *
PS005,001 L-(H-1M]]NYX[/ >L H-NXJLWT/(W(T MZMWR/ L-DWD// *
PS006,001 L-(H-1M]]NYX[/ B-NGJN(H/WT <L H-CMJNJ/T MZMWR/ L-DWD// *
PS007,001 CGJWN/ L-DWD// >CR C(JR[ L-JHWH// *
PS008,001 L-(H-1M]]NYX[/ <L H-GTJT/ MZMWR/ L-DWD// *
PS009,001 L-(H-1M]]NYX[/ <LMWT/ L-(H-BN/ MZMWR/ L-DWD// *
PS011,001 L-(H-1M]]NYX[/ L-DWD// B-JHWH// XS)HJ[TJ >JK !T!>MR[W L-NPC/+J *

However, this is not correct. Any help would be greatly appreciated.
# 2  
Old 12-28-2017
With a little more work, you can do this in awk, without the sort, but:
Code:
$ awk '{A[$1]++; L[$1]=$0} END { for( a in A ) if( A[a] == 1 ) print L[a] }' file | sort
PS004,001 L-(H-1M]]NYX[/ B-NGJN(H/WT MZMWR/ L-DWD// *
PS005,001 L-(H-1M]]NYX[/ >L H-NXJLWT/(W(T MZMWR/ L-DWD// *
PS006,001 L-(H-1M]]NYX[/ B-NGJN(H/WT <L H-CMJNJ/T MZMWR/ L-DWD// *
PS008,001 L-(H-1M]]NYX[/ <L H-GTJT/ MZMWR/ L-DWD// *
PS009,001 L-(H-1M]]NYX[/ <LMWT/ L-(H-BN/ MZMWR/ L-DWD// *

(noting that PS004,001, not just PS004 counts as $1, as , is not a field separator)
# 3  
Old 12-28-2017
Thanks Scott! Worked like a charm!
# 4  
Old 12-28-2017
Depending on your uniq version, this might also work:
Code:
uniq -uw5 file
PS004,001 L-(H-1M]]NYX[/ B-NGJN(H/WT MZMWR/ L-DWD// *
PS005,001 L-(H-1M]]NYX[/ >L H-NXJLWT/(W(T MZMWR/ L-DWD// *
PS006,001 L-(H-1M]]NYX[/ B-NGJN(H/WT <L H-CMJNJ/T MZMWR/ L-DWD// *
PS008,001 L-(H-1M]]NYX[/ <L H-GTJT/ MZMWR/ L-DWD// *
PS009,001 L-(H-1M]]NYX[/ <LMWT/ L-(H-BN/ MZMWR/ L-DWD// *

# 5  
Old 01-03-2018
Provided the input file is sorted on column 1, the following awk works as well and does not consume much memory (especially with a big input file)
Code:
awk '
{
  if ($1!=p1) {
    if (c==1) print p0
    c=0
  }
  c++
  p1=$1; p0=$0
}
END {
  if (c==1) print p0
}
' file

Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

awk to print lines based on text in field and value in two additional fields

In the awk below I am trying to print the entire line, along with the header row, if $2 is SNV or MNV or INDEL. If that condition is met or is true, and $3 is less than or equal to 0.05, then in $7 the sub pattern :GMAF= is found and the value after the = sign is checked. If that value is less than... (0 Replies)
Discussion started by: cmccabe
0 Replies

2. Shell Programming and Scripting

Print count of unique values

Hello experts, I am converting a number into its binary output as : read n echo "obase=2;$n" | bc I wish to count the maximum continuous occurrences of the digit 1. Example : 1. The binary equivalent of 5 = 101. Hence the output must be 1. 2. The binary... (3 Replies)
Discussion started by: H squared
3 Replies

3. Shell Programming and Scripting

awk to print unique text in field before hyphen

Trying to print the unique values in $2 before the -, currently the count is displayed. Hopefully, the below is close. Thank you :). file chr2:46603668-46603902 EPAS1-902|gc=54.3 253.1 chr2:211471445-211471675 CPS1-1205|gc=48.3 264.7 chr19:15291762-15291983 NOTCH3-1003|gc=68.8 195.8... (3 Replies)
Discussion started by: cmccabe
3 Replies

4. Shell Programming and Scripting

awk to print unique text in field

I am trying to use awk to print the unique entries in $2 So in the example below there are 3 lines but 2 of the lines match in $2 so only one is used in the output. File.txt chr17:29667512-29667673 NF1:exon.1;NF1:exon.2;NF1:exon.38;NF1:exon.4;NF1:exon.46;NF1:exon.47 703.807... (5 Replies)
Discussion started by: cmccabe
5 Replies

5. UNIX for Dummies Questions & Answers

Print unique lines without sort or unique

I would like to print unique lines without sort or unique. Unfortunately the server I am working on does not have sort or unique. I have not been able to contact the administrator of the server to ask him to add it for several weeks. (7 Replies)
Discussion started by: cokedude
7 Replies

6. Shell Programming and Scripting

awk - printing nth field based on parameter

I have a need to print nth field based on the parameter passed. Suppose I have 3 fields in a file, passing 1 to the function should print 1st field and so on. I have attempted below function but this throws an error due to incorrect awk syntax. function calcmaxlen { FIELDMAXLEN=0 ... (5 Replies)
Discussion started by: krishmaths
5 Replies

7. UNIX for Dummies Questions & Answers

Print Nth to last field

Hey, I'm sure this is answered somewhere but my Googling has turned up nothing. I have a file with data in the following format: <desription of event> at <time and date>The desription of the event is variable length and hence when the list is displayed it is hard to easily see the date (and... (8 Replies)
Discussion started by: RECrerar
8 Replies

8. Shell Programming and Scripting

How to Print from nth field to mth fields using awk

Hi, Is there any short method to print from a particular field till another filed using awk? Example File: File1 ==== 1|2|acv|vbc|......|100|342 2|3|afg|nhj|.......|100|346 Expected output: File2 ==== acv|vbc|.....|100 afg|nhj|.....|100 (8 Replies)
Discussion started by: machomaddy
8 Replies

9. Shell Programming and Scripting

Compare Tab Separated Field with AWK to all and print lines of unique fields.

Hi. I have a tab separated file that has a couple nearly identical lines. When doing: sort file | uniq > file.new It passes through the nearly identical lines because, well, they still are unique. a) I want to look only at field x for uniqueness and if the content in field x is the... (1 Reply)
Discussion started by: rocket_dog
1 Replies

10. Shell Programming and Scripting

Find top N values for field X based on field Y's value

I want to find the top N entries for a certain field based on the values of another field. For example if N=3, we want the 3 best values for each entry: Entry1 ||| 100 Entry1 ||| 95 Entry1 ||| 30 Entry1 ||| 80 Entry1 ||| 50 Entry2 ||| 40 Entry2 ||| 20 Entry2 ||| 10 Entry2 ||| 50... (1 Reply)
Discussion started by: FrancoisCN
1 Replies
Login or Register to Ask a Question