Removing duplicates depending on file size


 
# 1  
Old 07-08-2013
Removing duplicates depending on file size

Hi all,

I am working with a huge number of files in a Linux environment and I am trying to filter my data. Here's what my data looks like:

Code:
Name                       Size
OLUSDN.gf.gif-1.JPEG       5 kb
LKJFDA01.gf.gif-1.JPEG     3 kb
LKJFDA01.gf.gif-2.JPEG     1 kb
LKJFDA01.gif-3.JPEG        0 kb
JLKJAIN11.gf.gif-1.JPEG    3 kb
LKJFAD.gf.gif-1.JPEG       2 kb
LKJFAD.gf.gif-4.JPEG       5 kb
LKJFAD.gf.gif-5.JPEG       7 kb


The first part of the filename (anything before the first dot) is the same for many of them.
I would like to keep the files with unique names. Whenever several files share that first part, I want to keep only the one with the largest file size. My resulting data should look something like this:

Code:
Name                       Size
OLUSDN.gf.gif-1.JPEG       5 kb
LKJFDA01.gf.gif-1.JPEG     3 kb
JLKJAIN11.gf.gif-1.JPEG    3 kb
LKJFAD.gf.gif-5.JPEG       7 kb

I think `awk` can do it, but I am not sure how to handle the duplicates with `awk`.
I hope this is not too complicated.
Many thanks,

Last edited by jim mcnamara; 07-08-2013 at 07:07 AM..
# 2  
Old 07-08-2013
I tried the awk code below and it returns the expected result.

Code:
sort -t. -k1,1 yourfile | awk '{split($1, a, "."); k = a[1]
    if (k != prev) {if (NR > 1) print row; max = -1; prev = k}
    if ($2+0 > max) {max = $2+0; row = $0}}
    END {if (NR) print row}'

Input File:
Code:
OLUSDN.gf.gif-1.JPEG 5 kb
LKJFDA01.gf.gif-1.JPEG 3 kb
LKJFDA01.gf.gif-2.JPEG 1 kb
LKJFDA01.gif-3.JPEG 0 kb
JLKJAIN11.gf.gif-1.JPEG 3 kb
LKJFAD.gf.gif-1. 2 kb
LKJFAD.gf.gif-4. 5 kb
LKJFAD.gf.gif-5. 7 kb


Output obtained:
Code:
JLKJAIN11.gf.gif-1.JPEG 3 kb
LKJFAD.gf.gif-5. 7 kb
LKJFDA01.gf.gif-1.JPEG 3 kb
OLUSDN.gf.gif-1.JPEG 5 kb

If your file has a header line, strip it before the sort, for example:
Code:
tail -n +2 yourfile | sort -t. -k1,1 | awk ...

so that the header does not get mixed into the sorted data.
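
If the output order does not matter, an alternative sketch keeps everything in awk arrays and skips the sort entirely (same assumed layout: filename, size, then "kb" on each line):

Code:
awk '{split($1, a, ".")                    # key is the part before the first dot
      k = a[1]
      if (!(k in max) || $2+0 > max[k]) {  # keep the largest size seen per key
          max[k] = $2+0
          best[k] = $0
      }}
      END {for (k in best) print best[k]}' yourfile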

Last edited by krishmaths; 07-08-2013 at 07:21 AM.. Reason: Comments for excluding header
# 3  
Old 07-08-2013
Try sort:

Code:
sort -u -t. -k1,1 filename
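
Note that sort -u keeps only the first line of each run of equal keys, so on its own it will not pick the largest file. With GNU sort, a possible two-pass sketch is to order by size first and then deduplicate stably by key:

Code:
sort -k2,2nr filename | sort -t. -k1,1 -u -s

The -s keeps the size ordering within each key, so -u then retains the largest entry.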

# 4  
Old 07-08-2013
Hi Krishmaths,

Thanks for the script; this is essentially what I want to do, but I still have a couple of concerns about the command. The file size isn't really in a second column ($2); I was just using that to represent the size of each file. How do I get the actual file size into the command?

The second thing is about the sort command. I have multiple files in a folder, so can I use a directory instead of "yourfile" in your command?

Many Thanks,
# 5  
Old 07-08-2013
To sort multiple files, you can pass them to sort with a wildcard, as below:

Code:
sort file*

If there is no pattern that a wildcard can match, then you either need another way to pass all the filenames as arguments, or you can redirect all the files into a single file and sort that one; see the sketch below.
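
For instance, one possible sketch (assuming GNU find and that the data files are the only regular files in the directory) is:

Code:
find . -maxdepth 1 -type f -exec cat {} + | sort -t. -k1,1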

Coming to the problem of the file size position: do all the records end with "<size> kb"?

If yes, then we can grab the number with a sed command, provided the "kb" suffix is fixed.
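
A minimal sketch (assuming every record really does end in a number followed by " kb") would be:

Code:
sed -n 's/.* \([0-9][0-9]*\) kb$/\1/p' yourfile

which prints just the size of each record.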
# 6  
Old 07-08-2013
So basically I am unable to work with directories directly? Putting everything into a file and finding it again is a very long process, and for a Linux newbie like me the margin for error is huge!


The file sizes aren't listed anywhere; I just know that they differ, and I want to grab the largest one. My files after an ls command look like this:
Code:
OLUSDN.gf.gif-1.JPEG    LKJFDA01.gf.gif-1.JPEG    LKJFDA01.gf.gif-2.JPEG    LKJFDA01.gif-3.JPEG

and so on. I don't really need to see the file sizes myself if Linux can work them out on its own... I'm happy with an output of file names, just like the input, placed in some other location.
Cheers,

---------- Post updated at 07:09 AM ---------- Previous update was at 06:47 AM ----------

I know that
Code:
ls -al

gives the file size in the fifth column, but I am not sure how to attach that size to each filename so it can be used in the comparison...
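
One possible way to do the whole job straight from the directory, sketched here with several assumptions (GNU find for -printf, filenames without spaces or newlines, and ./keepers as a hypothetical destination directory):

Code:
mkdir -p ./keepers
find . -maxdepth 1 -type f -printf '%s %f\n' |    # size and bare filename per line
awk '{split($2, a, ".")                           # key: filename up to the first dot
      k = a[1]
      if (!(k in max) || $1+0 > max[k]) {         # remember the largest file per key
          max[k] = $1+0
          best[k] = $2
      }}
      END {for (k in best) print best[k]}' |
while IFS= read -r f; do cp -- "$f" ./keepers/; done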
# 7  
Old 07-08-2013
Duplication analysis

Hi,

I think the duplication analysis below can help you.
Code:
[goksel@gokcell 2july]$ cat file1 
Goksel Yangin
Deneme Test
Goksel Yangin
Deneme Test
Ali Veli
Hasan Huseyin
Test 12345
Unix Linux
Linux Unix
Goksel Yangin
[goksel@gokcell 2july]$ cat file1  | sort | uniq -c
      1 Ali Veli
      2 Deneme Test
      3 Goksel Yangin
      1 Hasan Huseyin
      1 Linux Unix
      1 Test 12345
      1 Unix Linux
[goksel@gokcell 2july]$ cat file1  | sort | uniq -c | awk '{print $3"\t"$2"\t""DUP"$1}'
Veli    Ali     DUP1
Test    Deneme  DUP2
Yangin  Goksel  DUP3
Huseyin Hasan   DUP1
Unix    Linux   DUP1
12345   Test    DUP1
Linux   Unix    DUP1
[goksel@gokcell 2july]$ cat file1  | sort | uniq -c | awk '{print $3"\t"$2"\t""DUP"$1}' | grep "DUP1"
Veli    Ali     DUP1
Huseyin Hasan   DUP1
Unix    Linux   DUP1
12345   Test    DUP1
Linux   Unix    DUP1
[goksel@gokcell 2july]$ cat file1  | sort | uniq -c | awk '{print $3"\t"$2"\t""DUP"$1}' | grep "DUP2"
Test    Deneme  DUP2
[goksel@gokcell 2july]$ cat file1  | sort | uniq -c | awk '{print $3"\t"$2"\t""DUP"$1}' | grep "DUP3"
Yangin  Goksel  DUP3

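Note that this counts duplicate whole lines rather than comparing file sizes. For the original question, where the key is only the part before the first dot, a one-liner along these lines keeps the first occurrence per key (though not necessarily the largest):

Code:
awk -F. '!seen[$1]++' yourfile
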
Regards,
Goksel Yangin
Computer Engineer

Last edited by Franklin52; 07-08-2013 at 10:53 AM.. Reason: Please use code tags, thank you