Find unique lines based off of bytes

09-11-2013

Hello All,
I have two VERY large .csv files whose lines I want to compare based on a substring (an id). If a line's id is unique, I want to print that line.

For example, if I run

Code:

diff file1.csv file2.csv

I get results similar to

Code:

+_id34,brown,car,2006
+_id1,blue,train,1985
+_id73,white,speed_boat,1990
-_id34,brown,car,2006
-_id72,white,plane,2010
-_id73,white,speed_boat,1990

I want to compare the ids (the string between the first "_" and the first ",") and print a line only if its id is unique, so my output would look like the following:

Output:

Code:

+_id1,blue,train,1985
-_id72,white,plane,2010

I was thinking I could use the cut command and delimit on the first "_", but I didn't know how to compare all the values up to the first comma.

Any suggestions?

Last edited by Scott; 09-13-2013 at 12:39 PM.. Reason: Code tags for input and output too

jl487

09-11-2013

Hi,
An awk command would be better, but you can try:

Code:

$ cat comp.txt 
+_id34,brown,car,2006
+_id1,blue,train,1985
+_id73,white,speed_boat,1990
-_id34,brown,car,2006
-_id72,white,plane,2010
-_id73,white,speed_boat,1990

Code:

$ sed 's/[+-]_\([^,]*,\).*/\1/' comp.txt | sort | uniq -u | grep  -f - comp.txt 
+_id1,blue,train,1985
-_id72,white,plane,2010
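
The id-extraction step can also be written with cut, as suggested in the question. The following is only a rough sketch: the function name unique_ids is made up for illustration, and it assumes every line has the form shown in comp.txt (one prefix character, then "_", then the id, then ","). It lists the ids that occur exactly once; it does not print the full lines.

```shell
# Hypothetical helper (name chosen for illustration): list the ids that
# occur exactly once in the file given as the first argument.
unique_ids() {
    cut -d_ -f2- "$1" |   # drop everything up to the first "_"
        cut -d, -f1 |     # keep only the text before the first ","
        sort | uniq -u    # uniq -u prints only lines that are not repeated
}
```

For example, unique_ids comp.txt lists one id per line, which can then be used to select the matching lines.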

Regards.

disedorgue

09-11-2013

Try...

Code:

diff file[12].csv | awk -F '_|,' '{a[$2]=$0;b[$2]++}END{for(i in a)if(b[i]==1)print a[i]}'
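
Note that awk does not guarantee any particular order for "for (i in a)", so the one-liner above may print the surviving lines in a different order than they appeared. If order matters, a possible variant is to read the input twice. This is only a sketch under assumptions: the function name unique_by_id is hypothetical, the input is a saved file rather than a pipe (for example, the diff output redirected to a file first), and the id is taken to be the text between the first "_" and the first ",".

```shell
# Hypothetical helper (name chosen for illustration): print the lines whose
# id occurs exactly once in the file, keeping the original line order.
unique_by_id() {
    # The same file is passed twice: pass 1 counts ids, pass 2 prints.
    awk '
        {
            id = $0
            sub(/^[^_]*_/, "", id)   # drop everything up to the first "_"
            sub(/,.*/, "", id)       # drop everything from the first "," on
        }
        NR == FNR { count[id]++; next }   # first pass: count each id
        count[id] == 1                    # second pass: print unique ids
    ' "$1" "$1"
}
```

Usage would be something like unique_by_id comp.txt. Because the id is cut out with sub() rather than by field splitting, spaces elsewhere in a line do not affect which id is compared.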

Ygor

09-11-2013

It works! Thanks!

jl487

09-13-2013

I think I may have found a flaw in this code. It seems that if there is a space anywhere in a line, that line is ignored/thrown out, even if its id is unique.

Can someone please help me?

jl487

09-13-2013

This appears to work for the example text:

Code:

fgrep -v -f<(sed 's/^._\(id[0-9][0-9]*\).*$/\1/' < comp.txt  | sort | uniq -d) comp.txt

Code:

fgrep -v -f file input

lists the lines of input that contain none of the strings in file (each line of file is treated as a fixed substring to search for)

Code:

<(...)

is a process substitution, which allows the output of another command to be used in place of a file

Code:

sed 's/^._\(id[0-9][0-9]*\).*$/\1/'

gives us the list of ids in the form idxx (the leading "+_" or "-_" and the rest of the line are stripped)

Code:

uniq -d

prints one copy of each duplicated line (unique lines are thrown away).
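
One caveat: because fgrep matches fixed substrings, a duplicated id such as id7 would also remove a unique line whose id is id73. A possible way around this is to anchor each pattern to the whole id field. The sketch below is only an illustration: the function name drop_duplicated_ids is made up, it assumes ids contain no regular-expression metacharacters (plain letters and digits, as in the example), and it uses mktemp for a scratch file instead of process substitution.

```shell
# Hypothetical helper (name chosen for illustration): print the file without
# any line whose id is duplicated, matching the whole id, not a substring.
drop_duplicated_ids() {
    dups=$(mktemp) || return 1
    # Build one anchored pattern per duplicated id, e.g. "^._id34,".
    sed 's/^._\([^,]*\),.*$/\1/' "$1" | sort | uniq -d |
        sed 's/.*/^._&,/' > "$dups"
    if [ -s "$dups" ]; then
        grep -v -f "$dups" "$1"   # drop lines matching any anchored pattern
    else
        cat "$1"                  # no duplicated ids, so every line is kept
    fi
    rm -f "$dups"
}
```

Running drop_duplicated_ids comp.txt on the example should leave only the id1 and id72 lines.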

Does this work for you?

Andrew

apmcd47
