Delete lines in file containing duplicate strings, keeping longer strings


# 1  
Old 09-16-2011

The question is not as simple as the title... I have a file that looks like this:

Code:
<string name="string1">RZ-LED</string>
<string name="string2">2.0</string>
<string name="string2">Version 2.0</string>
<string name="string3">BP</string>

I would like to check for duplicate entries of string2, keeping the longer of two lines...

output would ideally be

Code:
<string name="string1">RZ-LED</string>
<string name="string2">Version 2.0</string>
<string name="string3">BP</string>

Is this possible using GNU tools?
# 2  
Old 09-16-2011
Are the duplicate lines always consecutive?
What should happen if more than one line has the same length?
# 3  
Old 09-16-2011
Assuming the XML is as you've shown it and not some slightly different arrangement:

Code:
$ awk -v FS="\"" '{
        # Remember the order tokens come in
        if(!L[$2]) { C[N++]=$2; L[$2]=1; }
        # Save the longest
        if(length($3) > length(A[$2])) { A[$2]=$3; B[$2]=$0 }
}

END { for(M=0; M<N; M++) print B[C[M]] }' < data
<string name="string1">RZ-LED</string>
<string name="string2">Version 2.0</string>
<string name="string3">BP</string>
$
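
The question in post #2 about non-consecutive duplicates and equal-length lines can be checked against this script directly. The sketch below wraps the same awk program in a shell function; the name `keep_longest` and the sample data are made up for illustration. Because the comparison is a strict `>`, a later line of equal length does not replace an earlier one, and because entries are stored in arrays keyed by name, duplicates do not need to be adjacent.

```shell
# keep_longest is a hypothetical wrapper name; the awk program is the one from post #3.
keep_longest() {
    awk -v FS='"' '{
        # Remember the order in which names first appear
        if (!L[$2]) { C[N++] = $2; L[$2] = 1 }
        # Keep the line whose remainder ($3) is strictly longest
        if (length($3) > length(A[$2])) { A[$2] = $3; B[$2] = $0 }
    }
    END { for (M = 0; M < N; M++) print B[C[M]] }'
}

# Made-up sample: string2 occurs three times, not adjacent,
# and "Version 2.0" and "Release 2.0" have the same length.
sample='<string name="string2">2.0</string>
<string name="string1">RZ-LED</string>
<string name="string2">Version 2.0</string>
<string name="string3">BP</string>
<string name="string2">Release 2.0</string>'

printf '%s\n' "$sample" | keep_longest
# Prints:
# <string name="string2">Version 2.0</string>
# <string name="string1">RZ-LED</string>
# <string name="string3">BP</string>
```

Note that each surviving line is printed at the position where its name first appeared, not where the longest line itself was found.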

# 4  
Old 09-16-2011
Duplicate lines are not always consecutive, and the item names can vary.


string1 may be defined at line 18, and then string1 might be defined again at line 818...

---------- Post updated at 04:39 PM ---------- Previous update was at 04:36 PM ----------

but you know what? corona, your solution seems to work

my awk-fu is weak
# 5  
Old 09-16-2011
I kind of cheated. I split on " to get string1/string2/string3 directly (as $2). As long as there's no " anywhere else in the line, $3 is the entire rest of the line, and I compare the lengths of $3 to pick the longest. I also store the entire line ($0) for printing later, and use the C array to remember the order in which the names first appeared.
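
To see what that split produces, here is a small sketch that prints each field of one of the sample lines. The helper name `show_fields` is made up for illustration.

```shell
line='<string name="string2">Version 2.0</string>'

# show_fields (hypothetical helper): split each input line on the
# double-quote character and print every field with its number.
show_fields() {
    awk -F'"' '{ for (i = 1; i <= NF; i++) printf "$%d = [%s]\n", i, $i }'
}

printf '%s\n' "$line" | show_fields
# Prints:
# $1 = [<string name=]
# $2 = [string2]
# $3 = [>Version 2.0</string>]
```

If the value itself contained a " character, the line would split into more than three fields and $3 would hold only the text up to that quote, which is why the approach depends on there being no other quotes in the line.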
# 6  
Old 09-16-2011
The string lengths can vary a lot. It actually causes issues with long strings: it creates new lines in the file, which doesn't fly.

---------- Post updated at 05:02 PM ---------- Previous update was at 05:01 PM ----------

new lines in the strings is what I meant*

---------- Post updated at 05:04 PM ---------- Previous update was at 05:02 PM ----------

here is an example string that gets mangled:

Code:
%1$s\n\nFrom: %2$s\n\nTo: %3$s

---------- Post updated at 05:11 PM ---------- Previous update was at 05:04 PM ----------

and the reason it is mangled is those newline escapes in the string... the awk script interprets the "\n" sequences as newlines, when in fact the newline is not supposed to show up until application runtime

---------- Post updated at 05:13 PM ---------- Previous update was at 05:11 PM ----------

ignoring the "\n"'s would be ideal, can that be done? I don't really understand any of your function...

---------- Post updated at 05:29 PM ---------- Previous update was at 05:13 PM ----------

I got around the newline thing with sed, by doubling every backslash first: sed -i -e 's/\\/\\\\/g'
# 7  
Old 09-16-2011
I did ask if the text was always as shown; apparently not. This is why xml is so hard to awk...

Something like that would've been my suggestion to fix it anyway, though.

I don't understand how that string would cause awk to mess up, though! Can you show the actual XML surrounding it?
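
The thread never establishes how the script was actually run, so the following is only a sketch of one possible cause, not a diagnosis. awk leaves backslash sequences alone when they arrive as input data, but it does process escapes in values assigned with `-v`. If the text reached awk that way (or passed through some other escape-processing step), `\n` would turn into a real newline, and doubling the backslashes beforehand would undo that. The element name `msg` and the variable names are made up for illustration.

```shell
# Example string from post #6 wrapped in a made-up element; \n here is a
# literal backslash followed by "n", not a newline.
s='<string name="msg">%1$s\n\nFrom: %2$s\n\nTo: %3$s</string>'

# Read as input data: awk prints the record unchanged.
from_input=$(printf '%s\n' "$s" | awk -F'"' '{ print $0 }')

# Assigned with -v: awk processes escape sequences, so each \n
# becomes a real newline and the value spans several lines.
from_var=$(awk -v v="$s" 'BEGIN { print v }')

# Doubling every backslash first (the sed expression from post #6)
# lets the text survive one round of escape processing.
doubled=$(printf '%s\n' "$s" | sed -e 's/\\/\\\\/g')
roundtrip=$(awk -v v="$doubled" 'BEGIN { print v }')

printf 'input unchanged: %s\n' "$from_input"
printf 'lines after -v:  %s\n' "$(printf '%s\n' "$from_var" | wc -l | tr -d ' ')"
printf 'after doubling:  %s\n' "$roundtrip"
```

Here `from_input` and `roundtrip` both match the original string, while `from_var` is split across five lines.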
