Removing duplicates depending on file size


 
# 1  
Old 07-08-2013
Removing duplicates depending on file size

Hi all,

I am working with a huge number of files in a Linux environment and I am trying to filter my data. Here's what my data looks like:

Code:
Name                       Size
OLUSDN.gf.gif-1.JPEG       5 kb
LKJFDA01.gf.gif-1.JPEG     3 kb
LKJFDA01.gf.gif-2.JPEG     1 kb
LKJFDA01.gif-3.JPEG        0 kb
JLKJAIN11.gf.gif-1.JPEG    3 kb
LKJFAD.gf.gif-1.JPEG       2 kb
LKJFAD.gf.gif-4.JPEG       5 kb
LKJFAD.gf.gif-5.JPEG       7 kb


The first part of the filename (anything before the first dot) is the same for many of them.
I would like to keep the files with unique names. Whenever several files share that first part, I want to keep only the one with the largest file size. My resulting data should look something like this:

Code:
Name                       Size
OLUSDN.gf.gif-1.JPEG       5 kb
LKJFDA01.gf.gif-1.JPEG     3 kb
JLKJAIN11.gf.gif-1.JPEG    3 kb
LKJFAD.gf.gif-5.JPEG       7 kb

I think `awk` can do it, but I am not sure how to handle the duplicates with `awk`.
I hope this is not too complicated.
Many thanks,

Last edited by jim mcnamara; 07-08-2013 at 07:07 AM..
# 2  
Old 07-08-2013
I tried the awk code below and it returns the expected result.

Code:
sort -t. -k1,1 yourfile | awk '{split($1, a, "."); k = a[1]
    if (k != prev) {if (NR > 1) print row; max = -1; prev = k}
    if ($2+0 > max) {max = $2+0; row = $0}}
    END {if (NR) print row}'

Input File:
Code:
OLUSDN.gf.gif-1.JPEG 5 kb
LKJFDA01.gf.gif-1.JPEG 3 kb
LKJFDA01.gf.gif-2.JPEG 1 kb
LKJFDA01.gif-3.JPEG 0 kb
JLKJAIN11.gf.gif-1.JPEG 3 kb
LKJFAD.gf.gif-1. 2 kb
LKJFAD.gf.gif-4. 5 kb
LKJFAD.gf.gif-5. 7 kb


Output obtained:
Code:
JLKJAIN11.gf.gif-1.JPEG 3 kb
LKJFAD.gf.gif-5. 7 kb
LKJFDA01.gf.gif-1.JPEG 3 kb
OLUSDN.gf.gif-1.JPEG 5 kb

If your file has a header line, strip it before the sort, for example:
Code:
tail -n +2 yourfile | sort -t. -k1,1 | awk ...

so that the header does not get mixed into the sorted data.
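
If the output order does not matter, an alternative sketch keeps everything in awk arrays and skips the sort entirely (same assumed layout: filename, size, then "kb" on each line):

Code:
awk '{split($1, a, ".")                    # key is the part before the first dot
      k = a[1]
      if (!(k in max) || $2+0 > max[k]) {  # keep the largest size seen per key
          max[k] = $2+0
          best[k] = $0
      }}
      END {for (k in best) print best[k]}' yourfile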

Last edited by krishmaths; 07-08-2013 at 07:21 AM.. Reason: Comments for excluding header
# 3  
Old 07-08-2013
Try sort:

Code:
sort -u -t. -k1,1 filename
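
Note that sort -u keeps only the first line of each run of equal keys, so on its own it will not pick the largest file. With GNU sort, a possible two-pass sketch is to order by size first and then deduplicate stably by key:

Code:
sort -k2,2nr filename | sort -t. -k1,1 -u -s

The -s keeps the size ordering within each key, so -u then retains the largest entry.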

# 4  
Old 07-08-2013
Hi Krishmaths,

Thanks for the script; this is essentially what I want to do, but I still have a couple of concerns about the command. The file size isn't really in a second column ($2); I was just using that to represent the size of each file. How do I get the actual file size into the command?

The second thing is about the sort command. I have multiple files in a folder, so can I use a directory instead of "yourfile" in your command?

Many Thanks,
# 5  
Old 07-08-2013
To sort multiple files, you can pass them to sort with a wildcard, as below:

Code:
sort file*

If there is no pattern that a wildcard can match, then you either need another way to pass all the filenames as arguments, or you can redirect all the files into a single file and sort that one; see the sketch below.
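
For instance, one possible sketch (assuming GNU find and that the data files are the only regular files in the directory) is:

Code:
find . -maxdepth 1 -type f -exec cat {} + | sort -t. -k1,1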

Coming to the problem of the file size position: do all the records end with "<size> kb"?

If yes, then we can grab the number with a sed command, provided the "kb" suffix is fixed.
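
A minimal sketch (assuming every record really does end in a number followed by " kb") would be:

Code:
sed -n 's/.* \([0-9][0-9]*\) kb$/\1/p' yourfile

which prints just the size of each record.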
# 6  
Old 07-08-2013
So basically I am unable to work with directories directly? Putting everything into a file and finding it again is a very long process, and for a Linux newbie like me the margin for error is huge!


The file sizes aren't listed anywhere; I just know that they differ, and I want to grab the largest one. My files after an ls command look like this:
Code:
OLUSDN.gf.gif-1.JPEG    LKJFDA01.gf.gif-1.JPEG    LKJFDA01.gf.gif-2.JPEG    LKJFDA01.gif-3.JPEG

and so on. I don't really need to see the file sizes myself if Linux can work them out on its own... I'm happy with an output of file names, just like the input, placed in some other location.
Cheers,

---------- Post updated at 07:09 AM ---------- Previous update was at 06:47 AM ----------

I know that
Code:
ls -al

gives the file size in the fifth column, but I am not sure how to attach that size to each filename so it can be used in the comparison...
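
One possible way to do the whole job straight from the directory, sketched here with several assumptions (GNU find for -printf, filenames without spaces or newlines, and ./keepers as a hypothetical destination directory):

Code:
mkdir -p ./keepers
find . -maxdepth 1 -type f -printf '%s %f\n' |    # size and bare filename per line
awk '{split($2, a, ".")                           # key: filename up to the first dot
      k = a[1]
      if (!(k in max) || $1+0 > max[k]) {         # remember the largest file per key
          max[k] = $1+0
          best[k] = $2
      }}
      END {for (k in best) print best[k]}' |
while IFS= read -r f; do cp -- "$f" ./keepers/; done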
# 7  
Old 07-08-2013
Duplication analysis

Hi,

I think the duplication analysis below can help you.
Code:
[goksel@gokcell 2july]$ cat file1 
Goksel Yangin
Deneme Test
Goksel Yangin
Deneme Test
Ali Veli
Hasan Huseyin
Test 12345
Unix Linux
Linux Unix
Goksel Yangin
[goksel@gokcell 2july]$ cat file1  | sort | uniq -c
      1 Ali Veli
      2 Deneme Test
      3 Goksel Yangin
      1 Hasan Huseyin
      1 Linux Unix
      1 Test 12345
      1 Unix Linux
[goksel@gokcell 2july]$ cat file1  | sort | uniq -c | awk '{print $3"\t"$2"\t""DUP"$1}'
Veli    Ali     DUP1
Test    Deneme  DUP2
Yangin  Goksel  DUP3
Huseyin Hasan   DUP1
Unix    Linux   DUP1
12345   Test    DUP1
Linux   Unix    DUP1
[goksel@gokcell 2july]$ cat file1  | sort | uniq -c | awk '{print $3"\t"$2"\t""DUP"$1}' | grep "DUP1"
Veli    Ali     DUP1
Huseyin Hasan   DUP1
Unix    Linux   DUP1
12345   Test    DUP1
Linux   Unix    DUP1
[goksel@gokcell 2july]$ cat file1  | sort | uniq -c | awk '{print $3"\t"$2"\t""DUP"$1}' | grep "DUP2"
Test    Deneme  DUP2
[goksel@gokcell 2july]$ cat file1  | sort | uniq -c | awk '{print $3"\t"$2"\t""DUP"$1}' | grep "DUP3"
Yangin  Goksel  DUP3

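Note that this counts duplicate whole lines rather than comparing file sizes. For the original question, where the key is only the part before the first dot, a one-liner along these lines keeps the first occurrence per key (though not necessarily the largest):

Code:
awk -F. '!seen[$1]++' yourfile
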
Regards,
Goksel Yangin
Computer Engineer

Last edited by Franklin52; 07-08-2013 at 10:53 AM.. Reason: Please use code tags, thank you