Delete lines in file containing duplicate strings, keeping longer strings

09-16-2011

The question is not as simple as the title suggests. I have a file that looks like this:

Code:

<string name="string1">RZ-LED</string>
<string name="string2">2.0</string>
<string name="string2">Version 2.0</string>
<string name="string3">BP</string>

I would like to check for duplicate entries (string2 here) and keep only the longer of the two lines.

The output would ideally be:

Code:

<string name="string1">RZ-LED</string>
<string name="string2">Version 2.0</string>
<string name="string3">BP</string>

Is this possible using GNU tools?

raidzero

09-16-2011

Are the duplicate lines always consecutive?
What should happen if more than one line has the same length?

radoulov

09-16-2011

Assuming the XML is as you've shown it and not some slightly different arrangement:

Code:

$ awk -v FS="\"" '{
        # Remember the order tokens come in
        if(!L[$2]) { C[N++]=$2; L[$2]=1; }
        # Save the longest
        if(length($3) > length(A[$2])) { A[$2]=$3; B[$2]=$0 }
}

END { for(M=0; M<N; M++) print B[C[M]] }' < data
<string name="string1">RZ-LED</string>
<string name="string2">Version 2.0</string>
<string name="string3">BP</string>
$

Corona688

09-16-2011

Duplicate lines are not always consecutive, and the item names can vary.

string1 may be defined at line 18, and then string1 might be defined again at line 818...

but you know what? corona, your solution seems to work

my awk-fu is weak

raidzero

09-16-2011

I kind of cheated. I split on " to get string1/string2/string3 directly (as $2). As long as there's no " anywhere else, $3 is the entire rest of the line, which I use to compare the lengths. I also store the entire line (in B) for printing later, and use the C array to remember the order in which the names first appeared.
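For readers who find the one-letter array names hard to follow, here is a minimal sketch of the same idea wrapped in a shell function. The function name keep_longest and the array names best and order are made up for illustration. It also compares the length of the whole line ($0) rather than $3, which is a small departure from the script above; the two give the same result as long as everything before the closing quote of the name is identical between duplicates. On a tie, the line seen first is kept.

```shell
#!/bin/sh
# keep_longest (hypothetical name): for each name="..." key, print only the
# longest line, in the order the keys were first seen.
keep_longest() {
    awk -F'"' '
    {
        key = $2                          # text between the first pair of double quotes
        if (!(key in best)) {             # first time this key appears
            order[count++] = key          # remember first-seen order
            best[key] = $0
        } else if (length($0) > length(best[key])) {
            best[key] = $0                # a strictly longer line replaces the stored one
        }
    }
    END {
        for (i = 0; i < count; i++) print best[order[i]]
    }' "$@"
}
```

It reads the files given as arguments, or standard input if there are none, for example `keep_longest data > data.new`.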

Corona688

09-16-2011

The string lengths can vary a lot. It actually causes issues with long strings: it creates new lines inside the strings, which doesn't fly.

Here is an example string that gets mangled:

Code:

%1$s\n\nFrom: %2$s\n\nTo: %3$s

The reason it is mangled is the \n sequences in the string: the awk script interprets them as newlines, when in fact the newline is not supposed to show up until application runtime.

Ignoring the "\n" sequences would be ideal. Can that be done? I don't really understand any of your script...

I got around the newline thing with sed, by doubling every backslash in the file first: sed -i -e 's/\\/\\\\/g'
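To make the effect of that substitution concrete, here is a small sketch that runs the same sed expression as a filter (without -i) on the example string from this post. The function name double_backslashes is made up for illustration. Whether doubling is the right fix depends on which later step is un-escaping the text, which this thread does not establish.

```shell
#!/bin/sh
# Sample string taken from the post above.
sample='%1$s\n\nFrom: %2$s\n\nTo: %3$s'

# Same substitution as in the post, used as a filter instead of editing in place.
# The pattern \\ matches one literal backslash; the replacement \\\\ writes two.
double_backslashes() { sed -e 's/\\/\\\\/g'; }

doubled=$(printf '%s\n' "$sample" | double_backslashes)
printf '%s\n' "$doubled"
# prints: %1$s\\n\\nFrom: %2$s\\n\\nTo: %3$s
```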

raidzero

09-16-2011

I did ask if the text was always as shown; apparently not. This is why XML is so hard to process with awk...

Something like that would've been my suggestion to fix it anyway, though.

I don't understand how that string would cause awk to mess up, though! Can you show the actual XML surrounding it?
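One way to narrow this down is to check where backslash sequences actually get expanded. The sketch below, using the example string from the thread, shows that awk prints backslashes in its input records unchanged, whereas a value passed with -v has its escape sequences processed (that is standard awk behaviour). It does not show what went wrong in the original setup; other steps in a pipeline, such as a shell read without -r, also treat backslashes specially. The helper name count_lines is made up for illustration.

```shell
#!/bin/sh
# Sample string taken from the thread.
sample='%1$s\n\nFrom: %2$s\n\nTo: %3$s'

# 1. Input records: awk prints the backslashes exactly as it read them.
from_input=$(printf '%s\n' "$sample" | awk '{ print }')

# 2. -v assignment: awk processes escape sequences, so each \n becomes a real newline.
from_var=$(awk -v s="$sample" 'BEGIN { print s }')

# Count the lines in a string (tr strips the padding some wc versions add).
count_lines() { printf '%s\n' "$1" | wc -l | tr -d ' '; }

echo "lines via input: $(count_lines "$from_input")"
echo "lines via -v:    $(count_lines "$from_var")"
```

If the first count is 1 and the second is larger, the text read from the file is passing through awk intact, and the mangling is more likely to come from how the data is fed in or handled afterwards.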

Corona688
