Duplicates in an XML file

08-31-2011


Hi All,

I have an XML file that contains information like this:

Code:

 
<ID>574922<COMMENT>TEXT
TEXT
TEXT</COMMENT></ID>
<ID>574922<COMMENT>TEXT
TEXT
TEXT</COMMENT></ID>
<ID>412659<COMMENT>TEXT
TEXT
TEXT
TEXT
TEXT</COMMENT></ID>
<ID>873520<COMMENT>TEXT</COMMENT></ID>
<ID>480622<COMMENT>TEXT</COMMENT></ID>
<ID>873520<COMMENT>TEXT
TEXT</COMMENT></ID>
<ID>480622<COMMENT>TEXT</COMMENT></ID>

I want to remove duplicate entries. The problem is that I cannot simply sort the file, because the comments in some entries span several lines.

I tried the following code (called as awk -f script file):

Code:

/^end:/ {
   if (! (Record in Records)) {
      Records[Record];
      print RecordLabel ":";
      print Record;
      print $0;
      Record = "";
   }
   next;
}
$1 ~ /^.*:/ {
   sub(/:.*/, "", $1);
   RecordLabel = $1;
   next;
}
{
   Record = (Record ? Record "\n" : "") $0;
}

It was provided by Aigles in another post, and I modified it for my input. For some reason it does not work: it does not change anything. I have tried varying it, unsuccessfully so far.

Any ideas would be much appreciated.

Many thanks

Last edited by TasosARISFC; 09-08-2011 at 05:34 AM..

TasosARISFC


08-31-2011


The usual way to do this is to use a modified form of Muenchian grouping in an XSLT 1.0 stylesheet.

fpmurphy


08-31-2011


Hi, sadly I have no XSLT processor, nor can I install one, as my machine is restricted.

---------- Post updated at 03:57 PM ---------- Previous update was at 03:53 PM ----------

Is there a way to check what's between <ID> and </ID>, see whether it exists somewhere else in the file, and delete it if it does? Also, how can I bypass "/" in awk? For example, I cannot do this:

Code:

/^</ID>/

I want to search for </ID>.

TasosARISFC


08-31-2011


Escape it with a backslash, like \/
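A minimal sketch of the escape in use, assuming input shaped like the sample in the first post. The function names are made up for illustration. The second form passes the pattern as a string, where the slash does not need escaping.

```shell
# closing_id_lines: print the input lines that contain a closing </ID> tag.
# (closing_id_lines is a name made up for this sketch.)
closing_id_lines() {
    # Inside /.../ an unescaped slash would end the pattern, hence \/
    awk '/<\/ID>/'
}

# Same match with a string used as a dynamic regex; no escape needed here.
closing_id_lines_str() {
    awk '$0 ~ "</ID>"'
}

printf '%s\n' '<ID>873520<COMMENT>TEXT' 'TEXT</COMMENT></ID>' | closing_id_lines
```

With that two-line input, only the second line contains </ID>, so only that line is printed.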

Unfortunately, processing XML isn't trivial. Without a proper recursive parser for it you end up building one yourself, brute-force, character by character, because the record-based language constructs of awk don't help you. That's why tools like XSLT processors exist.

Working on something.

---------- Post updated at 10:55 AM ---------- Previous update was at 09:06 AM ----------

Here's a semi-ugly GNU awk solution. It works by breaking apart records on < and fields on >, meaning the first token is always a complete tag and the second, if any, is text -- <stuff param=1>text would get split into "stuff param=1", "text". I've tried to make it tolerate improperly nested tags, uppercase vs. lowercase, etc., but can't possibly make it perfect.

Code:

$ cat xml.awk
#!/usr/bin/gawk -f

function inside(STR, OFF, N)
{
        for(N=(D-(1+OFF)); N>=0; N--)
        if(TAG[N] == tolower(STR))      return(1);

        return(0);
}

BEGIN {         RS="<"  ;       FS=">"  ;       LAST="C"        }

{
        # closing tag
        if($0 ~ /^[/]/)
        {
                TMP=D;
                D--;
                # Try to deal gracefully with improperly nested tags
                while(tolower(substr($1, 2, length($1)-1)) != TAG[D])
                {
                        if(D > 0)       D--;
                        else    {       D=TMP;  break;          }
                }

                if(!inside("ID", 0))    printf("<%s>\n", $1);
                LAST="C"
        }
        else if($1)     # open tag
        {
                if(!inside("ID", 0))
                {
                        if(LAST == "O") { printf("\n"); }

                        printf("<%s>", $1);
                        LAST="O";
                }

                if(! ($1 ~ /[/]$/))     # ignore self-closing tags
                {
                        split($1, a, "[ \r\n\t]");
                        TAG[D++]=tolower(a[1]);
                }

                if(!inside("ID", 1))
                if(NF > 1)      printf("%s", $2);
        }
}
$ ./xml.awk < data.xml
<ID>574922</ID>
<ID>574922</ID>
<ID>412659</ID>
<ID>873520</ID>
<ID>480622</ID>
<ID>873520</ID>
<ID>480622</ID>
$
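To see what the RS="<" / FS=">" split hands the script, here is a small sketch that only prints the tokens. The show_tokens name and the tag=[...] output format are made up for this illustration; it is not part of the script above.

```shell
# show_tokens: print the tag and the trailing text of every "<"-delimited record.
# (show_tokens is a name made up for this sketch.)
show_tokens() {
    awk 'BEGIN { RS = "<"; FS = ">" }
         $1 != "" {                      # skip the empty record before the first "<"
             text = $2
             gsub(/\n/, " ", text)       # keep each token on one output line
             sub(/ +$/, "", text)        # drop the space left by a trailing newline
             printf "tag=[%s] text=[%s]\n", $1, text
         }'
}

printf '%s\n' '<ID>574922<COMMENT>TEXT' 'TEXT</COMMENT></ID>' | show_tokens
```

For that input it should print:

tag=[ID] text=[574922]
tag=[COMMENT] text=[TEXT TEXT]
tag=[/COMMENT] text=[]
tag=[/ID] text=[]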

Last edited by Corona688; 08-31-2011 at 02:33 PM.. Reason: [edit] improved version with fewer extra newlines

Corona688


09-08-2011


Hi Corona, thank you for your effort. As in your example, what this did was create a list of <ID>0000000</ID> entries with all other tags removed, but the duplicates are still there.

I thought of sorting this list and using uniq to find the duplicate IDs, then deleting them from the original file (not the list). However, this does not solve the problem that some elements have comments that extend onto new lines, and those lines do not get removed.

For example,

Code:

 
<ID> 000000<COMMENT> TEXT
TEXT
TEXT
TEXT </COMMENT></ID>

only the first line would be deleted.

So again, I am looking for a way to find the <ID></ID> tags and, if what's between them already exists in the file, delete it.

---------- Post updated at 10:11 AM ---------- Previous update was at 09:29 AM ----------

I can see this is rather complicated... Another way I could do this is by providing the list of duplicate IDs, searching the file for them and deleting them. I already know the duplicate IDs from a sys out, so I can place them in a text file as a list. However, I have two issues:

First, how do I delete all but one?
Second, how do I define where to start deleting and where to stop?

TasosARISFC


09-12-2011


You could replace all newlines with spaces, then insert newlines only where you want them:

Code:

$ tr '\n' ' ' < data  | sed 's#</ID>#</ID>\n#g'
<ID>574922<COMMENT>TEXT TEXT TEXT</COMMENT></ID>
 <ID>574922<COMMENT>TEXT TEXT TEXT</COMMENT></ID>
 <ID>412659<COMMENT>TEXT TEXT TEXT TEXT TEXT</COMMENT></ID>
 <ID>873520<COMMENT>TEXT</COMMENT></ID>
 <ID>480622<COMMENT>TEXT</COMMENT></ID>
 <ID>873520<COMMENT>TEXT TEXT</COMMENT></ID>
 <ID>480622<COMMENT>TEXT</COMMENT></ID>
$
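One possible way to finish the job from here, sketched in plain awk so that it does not depend on sed turning \n into a newline (a GNU sed behaviour). It joins the lines with spaces as above, cuts the text after every </ID>, and prints a record only the first time its ID value is seen. The dedupe_ids name is made up for this sketch, and it assumes the ID value is the text directly after <ID> up to the next tag, as in the sample data.

```shell
# dedupe_ids FILE: print each <ID>...</ID> record once, keyed on the ID value.
# (dedupe_ids is a name made up for this sketch.)
dedupe_ids() {
    awk '
        # Join the whole file into one string; newlines become spaces.
        { buf = buf $0 " " }
        END {
            # Cut the string after every closing </ID> tag.
            n = split(buf, rec, "</ID>")
            for (i = 1; i <= n; i++) {
                sub(/^ +/, "", rec[i])          # drop leftover leading spaces
                # The ID value sits between <ID> and the next tag.
                if (match(rec[i], /<ID>[^<]*/) == 0) continue
                id = substr(rec[i], RSTART + 4, RLENGTH - 4)
                if (!(id in seen)) {            # first time this ID shows up
                    seen[id] = 1
                    print rec[i] "</ID>"
                }
            }
        }
    ' "$1"
}
```

Called as dedupe_ids data, this keeps the first record for each ID and drops later ones even when their comments differ (the second 873520 entry in the sample, for instance). It holds the whole file in memory, which should be fine for files of modest size.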

Corona688

