Command for non-unique text


 
# 1  
Old 06-18-2014

Code:
 awk -F "[<>]" '/<TestName>|<testname>|<Offerer>|<offerer>|<Line1>|<line1>|<City>|<city>|<State>|<state>/ {print $2, $3}' OFS='\t' UBE3A.xml > UBE3A.txt

Is it possible to use the code above to search for a pattern that is non-unique?

For example, if I wanted to capture only the <string> inside <MethodList> and not the other 2 occurrences of <string>, is this possible? Also, is it possible to find and match <Line1>|<line1> but output it as "Address"? Thanks
# 2  
Old 06-18-2014
Please show the output you want from this input.

Here is the XML from the attachment:

Code:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE eSummaryResult PUBLIC "-//NLM//DTD esummary gtr 20140110//EN" "http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20140110/esummary_gtr.dtd">
<eSummaryResult>
<DocumentSummarySet status="OK">
<DbBuild>Build140618-0600.1</DbBuild>

<DocumentSummary uid="6814">
	<Id>6814</Id>
	<Source>GTR</Source>
	<Accession>GTR000006814</Accession>
	<TestName>UBE3A sequencing</TestName>
	<TestType>Clinical</TestType>
	<ConditionList>
		<Condition>
			<Name>Angelman syndrome</Name>
			<Acronym></Acronym>
			<CUI>C0162635</CUI>
		</Condition>
	</ConditionList>
	<Analytes>
		<Analyte>
			<AnalyteType>Gene</AnalyteType>
			<Name>UBE3A</Name>
			<GeneID>7337</GeneID>
			<Location>15q11.2</Location>
		</Analyte>
	</Analytes>
	<GeneList>
	</GeneList>
	<Offerer>Genetic Services Laboratory University of Chicago</Offerer>
	<OffererLocation>
		<Line1>5841 S. Maryland Ave. Rm G701, MC0077</Line1>
		<Line2></Line2>
		<Line3></Line3>
		<City>Chicago</City>
		<State>Illinois</State>
		<PostCode>60637-6726</PostCode>
		<Country>United States</Country>
	</OffererLocation>
	<OffererID>1238</OffererID>
	<DirectorList>
		<string>Soma Das, PhD, ABMG, Lab Director</string>
	</DirectorList>
	<Summary></Summary>
	<Flags>
	</Flags>
	<Method>
		<TopCategory>
			<Name>Molecular Genetics</Name>
			<CategoriesString>_C_______________</CategoriesString>
			<CategoryList>
				<Category>
					<Name>Sequence analysis of the entire coding region</Name>
					<Code>C</Code>
					<MethodList>
						<string>Bi-directional Sanger Sequence Analysis</string>
					</MethodList>
				</Category>
			</CategoryList>
		</TopCategory>
	</Method>
	<AnalyticalValidity>Analytical Sensitivity 99-100%   Accuracy 100%   Precision 100%</AnalyticalValidity>
	<TargetPopulation>The target population is patients suspected of having a diagnosis of Angelman syndrome.</TargetPopulation>
	<Certifications>
		<Certification>
			<CertificationType>CLIA</CertificationType>
			<id>14D0917593</id>
		</Certification>
		<Certification>
			<CertificationType>CAP</CertificationType>
			<id>18827-49</id>
		</Certification>
	</Certifications>
	<StudyDesc></StudyDesc>
	<TestTargetList>
		<string>UBE3A</string>
	</TestTargetList>
	<ConditionCount>1</ConditionCount>
	<TestTargetCount>1</TestTargetCount>
	<Extra><![CDATA[]]></Extra>
</DocumentSummary>

</DocumentSummarySet>
</eSummaryResult>

# 3  
Old 06-18-2014
I attached an example output file. Thanks.
# 4  
Old 06-18-2014
Please post short text in code tags instead of attachments.

Here is the content of your attachment:

Code:
TestName	UBE3A sequencing
Offerer	Genetic Services Laboratory University of Chicago
Address	"5841 S. Maryland Ave. Rm G701, MC0077"
City	Chicago
State	Illinois
Method	Bi-directional Sanger Sequence Analysis

# 5  
Old 06-18-2014
I apologize. That is the desired output; the code I posted is close, but not perfect. Thanks
# 6  
Old 06-18-2014
I see what you mean -- you don't just want <string>text</string>, you want the CORRECT <string>text</string>. Unfortunately the difference between that and what you have is code that understands XML versus code which just greps lines... I'll take a gander at it.
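To see the problem concretely, here is a rough sketch of the line-based approach from post #1 narrowed to <string> alone. The function name line_based_strings is made up for illustration; it just wraps the same -F "[<>]" idea. Because awk only sees one line at a time, it has no way to know which parent element a <string> belongs to, so every occurrence is printed.

```shell
# Hypothetical helper (not from the thread): print every <string> line
# the way the one-liner in post #1 would, splitting fields on < and >.
line_based_strings() {
    awk -F '[<>]' '/<string>/ { print $2 "\t" $3 }' "$1"
}
```

Run against the XML in post #2 (line_based_strings UBE3A.xml), this should list the director, the method and the test target alike, since all three sit on lines containing <string>.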
# 7  
Old 06-18-2014
I'm afraid it's not a one-liner anymore but it is the shortest even marginally-compliant parser I've written:

Code:
$ cat uniqxml.awk

BEGIN {
        FS=">"
        RS="<"
        OFS="\t"
}

NR==1 { next } # The first "line" is blank when RS=<
/^[!?]/ {       next    }               # Skip XML specification junk
{       gsub(/[\r\n]*$/, " ");  }       # Clean up newlines

# Handle open-tags: the tag name ends at whitespace, "/" or ">"
match($0, /^[^\/ \r\n\t>]+/) {
        TAG=substr(toupper($0), RSTART, RLENGTH);
        TAGS=TAG "%" TAGS;
}

# Handle close-tags
/^[\/]/ {
        sub(/^\//, "", $1);
        sub("^.*" toupper($1) "%", "", TAGS);
        next;
}
TAGS ~ /^(TESTNAME|OFFERER|LINE1|CITY|STATE|STRING%METHODLIST%CATEGORY)%/ {
        print $1, $2
}

$ awk -f uniqxml.awk input.xml

TestName        UBE3A sequencing
Offerer Genetic Services Laboratory University of Chicago
Line1   5841 S. Maryland Ave. Rm G701, MC0077
City    Chicago
State   Illinois
string  Bi-directional Sanger Sequence Analysis

$

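It processes the file tag by tag instead of line by line, and keeps a list of the tags it's seen, innermost first. "<html><body><h1>" would put "H1%BODY%HTML%" in TAGS, for example. Then you can check what tags you're inside, and print accordingly. Note that the labels it prints are the tag names themselves (Line1, string), not the "Address" and "Method" labels from the desired output.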
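For the "Address" and "Method" labels asked for in post #1, one option is to keep the same tag-stack idea and add a lookup table from tag path to output label. The following is only a sketch under some assumptions: the input is well-formed XML, CDATA sections contain no "<", and the names extract_labeled and LABEL are invented here rather than taken from uniqxml.awk. It also pops exactly one entry per close tag instead of using the greedy "^.*" substitution.

```shell
# Sketch only: tag-stack parser with a label table.
# extract_labeled and LABEL are hypothetical names, not part of uniqxml.awk.
extract_labeled() {
    awk '
    BEGIN {
        FS = ">"; RS = "<"; OFS = "\t"
        # Key: innermost tag path, upper case. Value: label to print.
        LABEL["TESTNAME"] = "TestName"
        LABEL["OFFERER"]  = "Offerer"
        LABEL["LINE1"]    = "Address"
        LABEL["CITY"]     = "City"
        LABEL["STATE"]    = "State"
        LABEL["STRING%METHODLIST"] = "Method"
    }
    NR == 1 || /^[!?]/ { next }                 # empty first record, XML declaration, DOCTYPE, CDATA
    /^\// { sub(/^[^%]*%/, "", TAGS); next }    # close tag: pop the innermost entry
    $1 ~ /\/$/ { next }                         # self-closing tag: nothing to push
    {
        split($1, part, /[ \t\r\n]/)            # drop attributes, keep the tag name
        TAGS = toupper(part[1]) "%" TAGS        # push the tag name
        text = $2
        sub(/[ \t\r\n]+$/, "", text)            # trim trailing whitespace
        for (path in LABEL)
            if (text != "" && index(TAGS, path "%") == 1)
                print LABEL[path], text
    }' "$1"
}
```

Usage would be extract_labeled UBE3A.xml > UBE3A.txt. Since no key in LABEL is a prefix of another, at most one label should match per tag, so the output follows document order.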