Find unique lines based off of bytes


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Find unique lines based off of bytes
# 1  
Old 09-11-2013
Find unique lines based off of bytes

Hello All,
I have two VERY large .csv files that I want to compare values based on substrings. If the lines are unique, then print the line.

For example, if I run a

Code:
diff file1.csv and file2.csv

I get results similar to
Code:
+_id34,brown,car,2006
+_id1,blue,train,1985
+_id73,white,speed_boat,1990
-_id34,brown,car,2006
-_id72,white,plane,2010
-_id73,white,speed_boat,1990

I want to compare the ids (string between "_" and ",") and if it's unique, then print the line so my output would be like the following:

Output:
Code:
+_id1,blue,train,1985
-_id72,white,plane,2010

I was thinking I could sue the cut command and delimit on the first "_" but didn't know how to compare all the values up until you reach the first comma.

Any suggestions?

Last edited by Scott; 09-13-2013 at 12:39 PM.. Reason: Code tags for input and output too
# 2  
Old 09-11-2013
Hi,
Awk command will better, but you can try:
Code:
$ cat comp.txt 
+_id34,brown,car,2006
+_id1,blue,train,1985
+_id73,white,speed_boat,1990
-_id34,brown,car,2006
-_id72,white,plane,2010
-_id73,white,speed_boat,1990

Code:
$ sed 's/[+-]_\([^,]*,\).*/\1/' comp.txt | sort | uniq -u | grep  -f - comp.txt 
+_id1,blue,train,1985
-_id72,white,plane,2010

Regards.
# 3  
Old 09-11-2013
Try...
Code:
diff file[12].csv | awk -F '_|,' '{a[$2]=$0;b[$2]++}END{for(i in a)if(b[i]==1)print a[i]}'

# 4  
Old 09-11-2013
it works! Thanks!
# 5  
Old 09-13-2013
i think i may have found a flaw in this code. It seems as though if there is a space in any of the lines, it's ignored/thrown out, even if it's unique.

Can someone please help me?
# 6  
Old 09-13-2013
This appears to work for the example text:
Code:
fgrep -v -f<(sed 's/^._\(id[0-9][0-9]*\).*$/\1/' < comp.txt  | sort | uniq -d) comp.txt

so
Code:
fgrep -v -f file input

lists lines from input where they don't match the lines in file (each line is a substring, obviously)
Code:
<(...)

is a process substitution allowing another command to be used
instead of a file
Code:
sed 's/^._\(id[0-9][0-9]*\).*$/\1/'

gives us the list of _idxx
Code:
uniq -d

lists duplicated lines (uniq lines are thrown away).

Does this work for you?

Andrew
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Print lines based upon unique values in Nth field

For some reason I am having difficulty performing what should be a fairly easy task. I would like to print lines of a file that have a unique value in the first field. For example, I have a large data-set with the following excerpt: PS003,001 MZMWR/ L-DWD// * PS003,001... (4 Replies)
Discussion started by: jvoot
4 Replies

2. UNIX for Dummies Questions & Answers

Print unique lines without sort or unique

I would like to print unique lines without sort or unique. Unfortunately the server I am working on does not have sort or unique. I have not been able to contact the administrator of the server to ask him to add it for several weeks. (7 Replies)
Discussion started by: cokedude
7 Replies

3. Shell Programming and Scripting

Based on column in file1, find match in file2 and print matching lines

file1: file2: I need to find matches for any lines in file1 that appear in file2. Desired output is '>' plus the file1 term, followed by the line after the match in file2 (so the title is a little misleading): This is honestly beyond what I can do without spending the whole night on it, so I'm... (2 Replies)
Discussion started by: pathunkathunk
2 Replies

4. Shell Programming and Scripting

Transpose lines from individual blocks to unique lines

Hello to all, happy new year 2013! May somebody could help me, is about a very similar problem to the problem I've posted here where the member rdrtx1 and bipinajith helped me a lot. https://www.unix.com/shell-programming-scripting/211147-map-values-blocks-single-line-2.html It is very... (3 Replies)
Discussion started by: Ophiuchus
3 Replies

5. UNIX for Dummies Questions & Answers

X bytes of 0, Y bytes of random data, Z bytes of 5, T bytes of 1. ??

Hello guys. I really hope someone will help me with this one.. So, I have to write this script who: - creates a file home/student/vmdisk of 10 mb - formats that file to ext3 - mounts that partition to /mnt/partition - creates a file /mnt/partition/data. In this file, there will... (1 Reply)
Discussion started by: razolo13
1 Replies

6. Shell Programming and Scripting

Find and count unique date values in a file based on position

Hello, I need some sort of way to extract every date contained in a file, and count how many of those dates there are. Here are the specifics: The date format I'm looking for is mm/dd/yyyy I only need to look after line 45 in the file (that's where the data begins) The columns of... (2 Replies)
Discussion started by: ronan1219
2 Replies

7. Shell Programming and Scripting

compare 2 files and return unique lines in each file (based on condition)

hi my problem is little complicated one. i have 2 files which appear like this file 1 abbsss:aa:22:34:as akl abc 1234 mkilll:as:ss:23:qs asc abc 0987 mlopii:cd:wq:24:as asd abc 7866 file2 lkoaa:as:24:32:sa alk abc 3245 lkmo:as:34:43:qs qsa abc 0987 kloia:ds:45:56:sa acq abc 7805 i... (5 Replies)
Discussion started by: anurupa777
5 Replies

8. UNIX for Advanced & Expert Users

In a huge file, Delete duplicate lines leaving unique lines

Hi All, I have a very huge file (4GB) which has duplicate lines. I want to delete duplicate lines leaving unique lines. Sort, uniq, awk '!x++' are not working as its running out of buffer space. I dont know if this works : I want to read each line of the File in a For Loop, and want to... (16 Replies)
Discussion started by: krishnix
16 Replies

9. Shell Programming and Scripting

parsing file based on characters/bytes

I have a datafile that is formatted as fixed. I know that each line should contain 880 characters. I want to separate the file into 2 files, one that has lines with 880 characters and the other file with everything else. Is this possible ? (9 Replies)
Discussion started by: cheeko111
9 Replies

10. Shell Programming and Scripting

awk : extracting unique lines based on columns

Hi, snp.txt CHR_A SNP_A BP_A_st BP_A_End CHR_B BP_B SNP_B R2 p-SNP_A p-SNP_B 5 rs1988728 74904317 74904318 5 74960646 rs1427924 0.377333 0.000740085 0.013930081 5 ... (12 Replies)
Discussion started by: genehunter
12 Replies
Login or Register to Ask a Question