Why Parsing Can't be Done With sed ( or similar tools)
Regularly we have questions like: i have an XML (C, C++, ...) file with this or that property and i want to extract the content of this or that tag (function, ...). How do i do it in sed?
Yes, in some (very limited) cases this is possible, but in general this can't be done. That is: you can do it, but whatever your solution will look like there will be a chance that it fails. In a theoretical sense this impossible and hence every solution will have one (or more) shortcomings. It will - in a strict sense - not be a solution, just an approximation thereof.
Here is why:
sed (and similar pattern recognition engines) searches for text patterns. Let us do an example: this regexp searches for text in (round) brackets:
It would find (this text) and (that text too), even this: (). We could with some effort tweak it to search for the pattern spanning multiple lines (like
this) or even
if we want.
Now, to some extent, language features are patterns too, no? Consider the last text example and the similarity strikes:
Yes... but the problem is that language constructs (unlike patterns) can contain themselves as parts - they can be nested:
What would our text-in-brackets regexp do with this input:
Answer: it would collect the text from the first opening bracket up to the first closing bracket: (more text (and even more). In all likeliness this is not what we want. We would probably want to get either the content of the outer bracket pair or the content of the inner bracket pair.
Exactly this ability - to understand and follow the level of nesting - is what sets apart a parser from a regexp engine.
In fact it would be possible to put provisions for a certain amount of nesting into the regexp. But we would have to anticipate the maximum level of nesting and so our (theoretical) regexp parser would still be limited to this maximum. Real world parsers work recursively. This means, they would work similar to a regexp engine, but when they encounter a nested construct (like the second opening bracket in the input stream above) they halt and spawn another instance of themselves which will parse the nested part. Once this spawned instance is finished they resume their work.
If you are interested in the theoretical background of all this: the theory of formal languages knows four types of languages (the so-called Chomsky hierarchy or "Chomsky-Schützenberger hierarchy", after its developers Noam Chomsky and Marcel-Paul Schützenberger).
Natural languages like English are of the type 0. It is possible to create a set of rules which will produce all sorts of sentences: they will always be grammatically correct although they might not say something meaningful or even coherent. Chomsky himself gave
as an example for such a sentence, which is (grammatically) correct, but meaningless.
The subsequent types 1,2 and 3 are continually more restricted than the type 0 which has no restrictions and each type adding some restrictions to the ones alreadyin place. Each member of a hierarchy can do approximately the same (you can say the same things in any language, for instance) but as you descend down the hierarchies the possibilities become more and ore limited.
Programming languages (well, most of them, there are exceptions) are of type 2 while regular expressions (or their counterpart "regular grammars") are of type 3. Because it is possible to something further up inthe hierarchy to create something lower in that hierarchy one can write books (type 0) about programming (type 2) but not a program which writes books and for the same reasons you canuse a programming language (type 2) to create a regexp engine (type 3) but not vice versa a regexp engine that reads (and "understands") a programming language.
bakunin
Last edited by bakunin; 09-18-2015 at 10:07 AM..
These 3 Users Gave Thanks to bakunin For This Post:
All,
I'm getting a list like the following and I'd like to kill each PID in turn.
pid (17797)
pid (21748)
pid (21754)
pid (21704)
pid (2199)
pid (2159)
pid (17809)
pid (21769)
pid (21778)
pid (21715)
... (3 Replies)
{
"AFafa": "FAFA","AFafa": "FAFA"
"baseball":"soccer","wrestling":"dancing"
"rhinos":"crocodiles","roles":"foodchain"
}
I need to insert a new line before the closing brackets "}" so that the final output looks like this:
{
"AFafa": "FAFA","AFafa": "FAFA"... (6 Replies)
Hello guys,
I guess you are fed up with sed command and parse questions, but after a while researching the forum, I could not get an answer to my doubt. I know it must be easy done with sed command, but unfortunately, I never get right syntax of this command
OK, this is what I have in my... (3 Replies)
All,
Can somebody provide me with some sed expertise on how to parse the
following line.
27-MAR-2011 10:28:01 * (CONNECT_DATA=(SID=dmart)(CID=(PROGRAM=sqlplus)(HOST=mtasnprod1)(USER=mtasnord))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.197.7.47)(PORT=54881)) * establish * dmart * 0
I would like... (3 Replies)
Hello
I want to convert my cron list into a csv
Can you please help me with sed ?
eg:
Convert
#06,21,36,51 * * 1,2 * (. ~/.profile ; timex /some/path/script -30 -15) >> /some/path/logfile2 2>&1
* * * * * (. ~/.profile ; timex /some/path/script2) > /some/path/logfile2
To:... (1 Reply)
I want to delete corrupt records from a file through awk or sed.
Can anyone help me with this
Thanks
Striker
Change subject to a descriptive one, ty. (1 Reply)
Hello Friends,
I have .tar files which exists under different directories after the below code is run:
find . -name "*" -type f -print | grep .tar > tmp.txt
cat tmp.txt
./dir1/subdir1/subdir2/database-db1_28112009.tar
./dir2/subdir3/database-db2_28112009.tar... (2 Replies)
Hello,
I'm rather new to the world of regular expressions and sed, though am excited by its possibilities. I have a particular task I'd like to achieve, and have googled the topic quite a bit. However, having found some codes that perform a task very similar to what I'd like to do, I can't for... (2 Replies)
I have a log file that for some reason, once or two time a month, line foods are missing.
This log is generated from vmstat everyminute. I dont know why sometimes it does this.
Each line in the log should have 18 columns separated by one or more spaces.
Good Log: (not actual log)
1 1... (8 Replies)