Why Parsing Can't be Done With sed ( or similar tools)


Login or Register to Reply

 
Thread Tools Search this Thread
# 1  
Why Parsing Can't be Done With sed ( or similar tools)

Regularly we have questions like: i have an XML (C, C++, ...) file with this or that property and i want to extract the content of this or that tag (function, ...). How do i do it in sed?

Yes, in some (very limited) cases this is possible, but in general this can't be done. That is: you can do it, but whatever your solution will look like there will be a chance that it fails. In a theoretical sense this impossible and hence every solution will have one (or more) shortcomings. It will - in a strict sense - not be a solution, just an approximation thereof.

Here is why:

sed (and similar pattern recognition engines) searches for text patterns. Let us do an example: this regexp searches for text in (round) brackets:

Code:
([^)]*)

It would find (this text) and (that text too), even this: (). We could with some effort tweak it to search for the pattern spanning multiple lines (like
this) or even
Code:
(
     like
     this
)

if we want.

Now, to some extent, language features are patterns too, no? Consider the last text example and the similarity strikes:

Code:
(                 if <clause> ; then
     this              this
     that              that
)                 fi

Yes... but the problem is that language constructs (unlike patterns) can contain themselves as parts - they can be nested:

Code:
if <clause> ; then
     this              
     if <other clause> ; then
          something_else
     fi
     that              
fi

What would our text-in-brackets regexp do with this input:

Code:
text (more text (and even more) text) and so on

Answer: it would collect the text from the first opening bracket up to the first closing bracket: (more text (and even more). In all likeliness this is not what we want. We would probably want to get either the content of the outer bracket pair or the content of the inner bracket pair.

Exactly this ability - to understand and follow the level of nesting - is what sets apart a parser from a regexp engine.

In fact it would be possible to put provisions for a certain amount of nesting into the regexp. But we would have to anticipate the maximum level of nesting and so our (theoretical) regexp parser would still be limited to this maximum. Real world parsers work recursively. This means, they would work similar to a regexp engine, but when they encounter a nested construct (like the second opening bracket in the input stream above) they halt and spawn another instance of themselves which will parse the nested part. Once this spawned instance is finished they resume their work.

If you are interested in the theoretical background of all this: the theory of formal languages knows four types of languages (the so-called Chomsky hierarchy or "Chomsky-Schützenberger hierarchy", after its developers Noam Chomsky and Marcel-Paul Schützenberger).

Natural languages like English are of the type 0. It is possible to create a set of rules which will produce all sorts of sentences: they will always be grammatically correct although they might not say something meaningful or even coherent. Chomsky himself gave

Colorless green ideas sleep furiously.

as an example for such a sentence, which is (grammatically) correct, but meaningless.

The subsequent types 1,2 and 3 are continually more restricted than the type 0 which has no restrictions and each type adding some restrictions to the ones alreadyin place. Each member of a hierarchy can do approximately the same (you can say the same things in any language, for instance) but as you descend down the hierarchies the possibilities become more and ore limited.

Programming languages (well, most of them, there are exceptions) are of type 2 while regular expressions (or their counterpart "regular grammars") are of type 3. Because it is possible to something further up inthe hierarchy to create something lower in that hierarchy one can write books (type 0) about programming (type 2) but not a program which writes books and for the same reasons you canuse a programming language (type 2) to create a regexp engine (type 3) but not vice versa a regexp engine that reads (and "understands") a programming language.

bakunin

Last edited by bakunin; 09-18-2015 at 10:07 AM..
These 3 Users Gave Thanks to bakunin For This Post:
Login or Register to Reply

|
Thread Tools Search this Thread
Search this Thread:
Advanced Search

More UNIX and Linux Forum Topics You Might Find Helpful
Isolate text with sed or similar utility
ejianu
All, I'm getting a list like the following and I'd like to kill each PID in turn. pid (17797) pid (21748) pid (21754) pid (21704) pid (2199) pid (2159) pid (17809) pid (21769) pid (21778) pid (21715) ...... Shell Programming and Scripting
3
Shell Programming and Scripting
Editing files with sed or something similar
SkySmart
{ "AFafa": "FAFA","AFafa": "FAFA" "baseball":"soccer","wrestling":"dancing" "rhinos":"crocodiles","roles":"foodchain" } I need to insert a new line before the closing brackets "}" so that the final output looks like this: { "AFafa": "FAFA","AFafa": "FAFA"...... Shell Programming and Scripting
6
Shell Programming and Scripting
Help with parsing mailbox folder list (identify similar folders)
spacegoose
List sample: user/xxx/Archives/2010 user/xxx/BLARG user/xxx/BlArG user/xxx/Burton user/xxx/DAY user/yyy/Trainees/Nutrition interns user/yyy/Trainees/Primary Care user/yyy/Trainees/Psychiatric NP interns user/yyy/Trainees/Psychiatric residents user/yyy/Trainees/Psychology...... Shell Programming and Scripting
4
Shell Programming and Scripting
parsing file names and then grouping similar files
EAGL€
Hello Friends, I have .tar files which exists under different directories after the below code is run: find . -name "*" -type f -print | grep .tar > tmp.txt cat tmp.txt ./dir1/subdir1/subdir2/database-db1_28112009.tar ./dir2/subdir3/database-db2_28112009.tar...... Shell Programming and Scripting
2
Shell Programming and Scripting
Using sed (or similar) to rename variable headings
redseventyseven
Hello, I'm rather new to the world of regular expressions and sed, though am excited by its possibilities. I have a particular task I'd like to achieve, and have googled the topic quite a bit. However, having found some codes that perform a task very similar to what I'd like to do, I can't for...... Shell Programming and Scripting
2
Shell Programming and Scripting

Featured Tech Videos