Remove Doubles Without Sort?


 
# 8  
Old 12-12-2012
Quote:
Originally Posted by jim mcnamara
That was helpful. If I've understood correctly, bipinajith's pattern is telling awk to treat the data as a single column array, assign keys to each value, (hash table?), then if the same key comes up again, it's negated.
It's still hard for me to parse the pattern, though. It seems awk is doing an awful lot with very little. But I guess these are all just operators, and all the functions are built into awk.
I still have questions, of course. But I feel like I need to do some reading, first. I appreciate all of your help, everyone!
# 9  
Old 12-14-2012
Quote:
Originally Posted by sudon't
That was helpful. If I've understood correctly, bipinajith's pattern is telling awk to treat the data as a single column array
$0 means 'the entire unmodified line', pure and simple. You can change what 'line' means to awk, but by default, it uses newlines like everything else.

Quote:
assign keys to each value, (hash table?)
You can think of it as a hash table if you like. It could actually be built from a tree or list, but that doesn't really matter -- the point is, you can do a["qwerty"]=5; print a["qwerty"] and get 5 out.
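A quick way to try that out from the shell, as a minimal sketch. The variable names `result` and `unset_check` and the key "never set" are made up for illustration; the BEGIN block just lets awk run without reading any input.

```shell
# Store a value under a string key and read it back.
result=$(awk 'BEGIN { a["qwerty"] = 5; print a["qwerty"] }')
echo "$result"        # prints 5

# An element that was never assigned compares equal to the blank string.
unset_check=$(awk 'BEGIN { if (a["never set"] == "") print "blank" }')
echo "$unset_check"   # prints blank
```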
Quote:
It's still hard for me to parse the pattern, though. It seems awk is doing an awful lot with very little.
awk is like grep or sed, in that it has a built-in loop which runs code on every line individually. But it's like perl or shell in that it has no hardcoded function.

awk statements are like conditional { code block }. Whenever the conditional is true, it runs the { code block }. If you leave off { code block }, it assumes { print }, which will print the entire unmodified line.

So, awk '1' acts like cat, because 1 is always true. awk '/regex/' acts like grep "regex", because /regex/ is true whenever the current line matches the regular expression.
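Those two equivalences can be checked side by side with a small sketch. The sample words and the variable names `sample`, `all` and `matched` are made up for illustration.

```shell
# Three lines of made-up input.
sample=$(printf 'apple\nbanana\ncherry\n')

# Pattern 1 is always true and there is no code block, so every line is printed.
all=$(printf '%s\n' "$sample" | awk '1')

# /an/ is true only for lines matching the regex, like grep "an".
matched=$(printf '%s\n' "$sample" | awk '/an/')

echo "$all"
echo "$matched"       # prints banana
```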

Now imagine what happens for every line for awk 'A[$0]++'.

The first time a line is seen, the value of A for that line will be "", a blank string. awk will consider this false, and not print the line. Next time, it will be a nonzero number, which awk considers true, causing it to print the line. Put a ! in front, as in '!A[$0]++', and the test is inverted: the first occurrence of a line is printed and every repeat is skipped.
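Running both forms on the same input makes the difference concrete. This is only a sketch: the five input lines and the variable names `input`, `repeats` and `uniques` are made up for illustration.

```shell
# Made-up input with two repeated lines (b and a).
input=$(printf 'b\na\nb\nc\na\n')

# Without !, a line is printed only once its count is already nonzero,
# so only the second and later occurrences come out.
repeats=$(printf '%s\n' "$input" | awk 'A[$0]++')

# With !, a line is printed only while its count is still blank/zero,
# so only the first occurrence comes out, in the original order.
uniques=$(printf '%s\n' "$input" | awk '!A[$0]++')

echo "$repeats"       # prints b, then a
echo "$uniques"       # prints b, then a, then c
```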
These 2 Users Gave Thanks to Corona688 For This Post:
# 10  
Old 12-15-2012
Code:
awk '!arr[$0]++' wordlist_file

# 11  
Old 12-20-2012
"Work in Place"

Sorry, had a bit on my plate over the last week, so I had to come back to this.

So, I tried bipinajith's one-liner:
Code:
awk '!arr[$0]++' ~/path/to/file.txt

While it did remove doubles, the output only went to stdout, and the file was untouched. I thought simple redirection would do the job, but no soap. I get a "no such file" complaint.
Code:
awk '!arr[$0]++' ~/path/to/file.txt >~/path/to/newfile.txt

After a while, I figured out that the print statement would get me there:
Code:
awk '!arr[$0]++' ~/path/to/file.txt print$0 >~/path/to/newfile.txt

But I'm still wondering how you get it to "work in place" like sed's -i flag? Isn't that the behavior bipinajith was expecting from his original line?
Also, I got this error, although as far I can tell, everything worked as expected:
Code:
awk: can't open file print-bash
 input record number 9427508, file print-bash
 source line number 1

What does that mean?
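A likely reading of that error, assuming the command was typed at a bash login shell: the shell, not awk, expands the unquoted $0 in print$0 to its own name, -bash, so awk receives print-bash as a second input file. It finishes reading the real file first, which would explain why the output still looked right, and only then complains that print-bash can't be opened. The sketch below only shows the expansion; passing -bash after the command string is a stand-in for a login shell's $0, and `expanded` is a made-up variable name.

```shell
# With bash -c, the first argument after the command string becomes $0
# inside the child shell. "-bash" stands in for a login shell's name.
expanded=$(bash -c 'printf "%s\n" print$0' -bash)
echo "$expanded"      # prints print-bash
```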
# 12  
Old 12-20-2012
This should definitely work!!
Code:
awk '!arr[$0]++' ~/path/to/file.txt > ~/path/to/newfile.txt

Please double check and verify if you are missing something.
This User Gave Thanks to Yoda For This Post:
# 13  
Old 12-20-2012
Quote:
Originally Posted by bipinajith
This should definitely work!!
Code:
awk '!arr[$0]++' ~/path/to/file.txt > ~/path/to/newfile.txt

Please double check and verify if you are missing something.
Huh, you're right, I must have overlooked something. Of course it should work, and it does. But didn't you expect your original code to work in place? In other words, to overwrite the original file? Or did I misinterpret?
# 14  
Old 12-20-2012
The contents of a field, as seen by awk, can be changed within an awk program; this changes what awk perceives as the current input record. The actual input is untouched; awk never modifies the input file.

So you should redirect the output to another file and rename it back to the original file name if required.
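One way that redirect-and-rename step might look, as a minimal sketch rather than the only approach. The scratch directory, the file name wordlist.txt, its contents and the variable names are all made up for illustration.

```shell
# Build a throwaway file to work on.
workdir=$(mktemp -d)
file="$workdir/wordlist.txt"
printf 'pear\nfig\npear\nkiwi\nfig\n' > "$file"

# Write the de-duplicated lines to a temp file in the same directory,
# then move it over the original only if awk succeeded.
tmp=$(mktemp "$workdir/dedup.XXXXXX")
awk '!arr[$0]++' "$file" > "$tmp" && mv "$tmp" "$file"

result=$(cat "$file")
echo "$result"        # prints pear, fig, kiwi
rm -rf "$workdir"
```

Redirecting straight back onto the input (awk ... file > file) would not work, because the shell empties the file before awk gets to read it. Also note that after the mv the file carries the temp file's permissions, which may differ from the original's.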
This User Gave Thanks to Yoda For This Post:
 