Shell Programming and Scripting

06-18-2014

Registered User

1,393, 20

Join Date: Nov 2013

Last Activity: 1 May 2020, 2:35 PM EDT

Location: Chicago

Posts: 1,393

Thanks Given: 901

Thanked 20 Times in 19 Posts

I attached an example output file. Thanks.

UBE3A.txt (213 Bytes)

cmccabe

06-18-2014

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

Please post short text in code tags instead of attachments.

Here is the content of your attachment:

Code:

TestName	UBE3A sequencing
Offerer	Genetic Services Laboratory University of Chicago
Address	"5841 S. Maryland Ave. Rm G701, MC0077"
City	Chicago
State	Illinois
Method	Bi-directional Sanger Sequence Analysis

Corona688

06-18-2014

I apologize. That is the desired output; the code I posted is close, but not perfect. Thanks.

cmccabe

06-18-2014

I see what you mean -- you don't just want <string>text</string>, you want the CORRECT <string>text</string>. Unfortunately the difference between that and what you have is code that understands XML versus code which just greps lines... I'll take a gander at it.

Corona688

06-18-2014

I'm afraid it's not a one-liner anymore but it is the shortest even marginally-compliant parser I've written:

Code:

$ cat uniqxml.awk

BEGIN {
        FS=">"
        RS="<"
        OFS="\t"
}

NR==1 { next } # The first "line" is blank when RS=<
/^[!?]/ {       next    }               # Skip XML specification junk
{       gsub(/[\r\n]*$/, " ");  }       # Clean up newlines

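# Handle open-tags.  The name ends at ">", "/" or whitespace; without ">" in
# the set, the text after the tag would be swallowed into the tag name.
match($0, /^[^\/> \r\n\t]+/) {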
        TAG=substr(toupper($0), RSTART, RLENGTH);
        TAGS=TAG "%" TAGS;
}

# Handle close-tags
/^[\/]/ {
        sub(/^\//, "", $1);
        sub("^.*" toupper($1) "%", "", TAGS);
        next;
}
TAGS ~ /^(TESTNAME|OFFERER|LINE1|CITY|STATE|STRING%METHODLIST%CATEGORY)%/ {
        print $1, $2
}

$ awk -f uniqxml.awk input.xml

TestName        UBE3A sequencing
Offerer Genetic Services Laboratory University of Chicago
Line1   5841 S. Maryland Ave. Rm G701, MC0077
City    Chicago
State   Illinois
string  Bi-directional Sanger Sequence Analysis

$

It processes the input tag-by-tag instead of line-by-line, and keeps a list of the tags that are currently open. "<html><body><h1>" would put "H1%BODY%HTML%" in TAGS, for example. Then you can check which tags you're inside, and print accordingly.
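To watch the tag list change, a stripped-down sketch along these lines may help. It is not the script above: `tagstack` is a hypothetical helper name, the HTML snippet is made-up sample input, and it assumes well-formed markup with no self-closing tags such as `<br/>`. It prints the stack next to every non-empty piece of text.

```shell
# Hypothetical helper: show the open-tag stack for each non-empty text node.
# Assumes well-formed input and no self-closing tags.
tagstack() {
    awk '
    BEGIN { FS = ">"; RS = "<"; OFS = "\t" }
    NR == 1 { next }                  # whatever precedes the first "<"
    /^[!?]/ { next }                  # declarations, comments, doctype
    /^\// {                           # close tag: pop through the matching name
        sub(/^\//, "", $1)
        name = toupper($1) "%"
        i = index(TAGS, name)
        if (i) TAGS = substr(TAGS, i + length(name))
        next
    }
    match($1, /^[^\/ \r\n\t]+/) {     # open tag: push the upper-cased name
        TAGS = toupper(substr($1, RSTART, RLENGTH)) "%" TAGS
    }
    {
        text = $2                     # text that follows the tag
        gsub(/^[ \t\r\n]+|[ \t\r\n]+$/, "", text)
        if (text != "") print TAGS, text
    }'
}

printf '<html><body><h1>Title</h1></body></html>\n' | tagstack
# prints: H1%BODY%HTML%<TAB>Title
```

Matching on `$1` rather than `$0` keeps the text after ">" out of the tag name, and `index` pops only through the nearest matching name instead of relying on a greedy regex.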

Corona688