Why Parsing Can't be Done With sed ( or similar tools)


Login or Register for Dates, Times and to Reply

 
Thread Tools Search this Thread
# 1  
Why Parsing Can't be Done With sed ( or similar tools)

Regularly we have questions like: i have an XML (C, C++, ...) file with this or that property and i want to extract the content of this or that tag (function, ...). How do i do it in sed?

Yes, in some (very limited) cases this is possible, but in general this can't be done. That is: you can do it, but whatever your solution will look like there will be a chance that it fails. In a theoretical sense this impossible and hence every solution will have one (or more) shortcomings. It will - in a strict sense - not be a solution, just an approximation thereof.

Here is why:

sed (and similar pattern recognition engines) searches for text patterns. Let us do an example: this regexp searches for text in (round) brackets:

Code:
([^)]*)

It would find (this text) and (that text too), even this: (). We could with some effort tweak it to search for the pattern spanning multiple lines (like
this) or even
Code:
(
     like
     this
)

if we want.

Now, to some extent, language features are patterns too, no? Consider the last text example and the similarity strikes:

Code:
(                 if <clause> ; then
     this              this
     that              that
)                 fi

Yes... but the problem is that language constructs (unlike patterns) can contain themselves as parts - they can be nested:

Code:
if <clause> ; then
     this              
     if <other clause> ; then
          something_else
     fi
     that              
fi

What would our text-in-brackets regexp do with this input:

Code:
text (more text (and even more) text) and so on

Answer: it would collect the text from the first opening bracket up to the first closing bracket: (more text (and even more). In all likeliness this is not what we want. We would probably want to get either the content of the outer bracket pair or the content of the inner bracket pair.

Exactly this ability - to understand and follow the level of nesting - is what sets apart a parser from a regexp engine.

In fact it would be possible to put provisions for a certain amount of nesting into the regexp. But we would have to anticipate the maximum level of nesting and so our (theoretical) regexp parser would still be limited to this maximum. Real world parsers work recursively. This means, they would work similar to a regexp engine, but when they encounter a nested construct (like the second opening bracket in the input stream above) they halt and spawn another instance of themselves which will parse the nested part. Once this spawned instance is finished they resume their work.

If you are interested in the theoretical background of all this: the theory of formal languages knows four types of languages (the so-called Chomsky hierarchy or "Chomsky-Schützenberger hierarchy", after its developers Noam Chomsky and Marcel-Paul Schützenberger).

Natural languages like English are of the type 0. It is possible to create a set of rules which will produce all sorts of sentences: they will always be grammatically correct although they might not say something meaningful or even coherent. Chomsky himself gave

Colorless green ideas sleep furiously.

as an example for such a sentence, which is (grammatically) correct, but meaningless.

The subsequent types 1,2 and 3 are continually more restricted than the type 0 which has no restrictions and each type adding some restrictions to the ones alreadyin place. Each member of a hierarchy can do approximately the same (you can say the same things in any language, for instance) but as you descend down the hierarchies the possibilities become more and ore limited.

Programming languages (well, most of them, there are exceptions) are of type 2 while regular expressions (or their counterpart "regular grammars") are of type 3. Because it is possible to something further up inthe hierarchy to create something lower in that hierarchy one can write books (type 0) about programming (type 2) but not a program which writes books and for the same reasons you canuse a programming language (type 2) to create a regexp engine (type 3) but not vice versa a regexp engine that reads (and "understands") a programming language.

bakunin

Last edited by bakunin; 09-18-2015 at 10:07 AM..
These 3 Users Gave Thanks to bakunin For This Post:
Login or Register for Dates, Times and to Reply

Previous Thread | Next Thread
Thread Tools Search this Thread
Search this Thread:
Advanced Search

Test Your Knowledge in Science: Gadgets
Difficulty: Medium
The Western Electric Model 500 telephone uses tone dialing to dial phone numbers.
True or False?

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Isolate text with sed or similar utility

All, I'm getting a list like the following and I'd like to kill each PID in turn. pid (17797) pid (21748) pid (21754) pid (21704) pid (2199) pid (2159) pid (17809) pid (21769) pid (21778) pid (21715) ... (3 Replies)
Discussion started by: ejianu
3 Replies

2. UNIX for Dummies Questions & Answers

sed or Grep Parsing

I would like to parse two strings from lines in a file only when both strings appear on the same line. For example, if I have the following line: string1 string2 string3 string4 string5 string6 string7 string8 string9 I would like the output to be: string2: string7 Can someone give me... (5 Replies)
Discussion started by: ARBlue79
5 Replies

3. Shell Programming and Scripting

Editing files with sed or something similar

{ "AFafa": "FAFA","AFafa": "FAFA" "baseball":"soccer","wrestling":"dancing" "rhinos":"crocodiles","roles":"foodchain" } I need to insert a new line before the closing brackets "}" so that the final output looks like this: { "AFafa": "FAFA","AFafa": "FAFA"... (6 Replies)
Discussion started by: SkySmart
6 Replies

4. Shell Programming and Scripting

sed (parsing value)

All, Can somebody provide me with some sed expertise on how to parse the following line. 27-MAR-2011 10:28:01 * (CONNECT_DATA=(SID=dmart)(CID=(PROGRAM=sqlplus)(HOST=mtasnprod1)(USER=mtasnord))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.197.7.47)(PORT=54881)) * establish * dmart * 0 I would like... (3 Replies)
Discussion started by: BeefStu
3 Replies

5. Shell Programming and Scripting

Help with parsing mailbox folder list (identify similar folders)

List sample: user/xxx/Archives/2010 user/xxx/BLARG user/xxx/BlArG user/xxx/Burton user/xxx/DAY user/yyy/Trainees/Nutrition interns user/yyy/Trainees/Primary Care user/yyy/Trainees/Psychiatric NP interns user/yyy/Trainees/Psychiatric residents user/yyy/Trainees/Psychology... (4 Replies)
Discussion started by: spacegoose
4 Replies

6. Shell Programming and Scripting

Parsing cron with sed

Hello I want to convert my cron list into a csv Can you please help me with sed ? eg: Convert #06,21,36,51 * * 1,2 * (. ~/.profile ; timex /some/path/script -30 -15) >> /some/path/logfile2 2>&1 * * * * * (. ~/.profile ; timex /some/path/script2) > /some/path/logfile2 To:... (1 Reply)
Discussion started by: drbiloukos
1 Replies

7. Shell Programming and Scripting

Parsing with awk or sed

I want to delete corrupt records from a file through awk or sed. Can anyone help me with this Thanks Striker Change subject to a descriptive one, ty. (1 Reply)
Discussion started by: Rahul_us
1 Replies

8. Shell Programming and Scripting

parsing file names and then grouping similar files

Hello Friends, I have .tar files which exists under different directories after the below code is run: find . -name "*" -type f -print | grep .tar > tmp.txt cat tmp.txt ./dir1/subdir1/subdir2/database-db1_28112009.tar ./dir2/subdir3/database-db2_28112009.tar... (2 Replies)
Discussion started by: EAGL€
2 Replies

9. Shell Programming and Scripting

Using sed (or similar) to rename variable headings

Hello, I'm rather new to the world of regular expressions and sed, though am excited by its possibilities. I have a particular task I'd like to achieve, and have googled the topic quite a bit. However, having found some codes that perform a task very similar to what I'd like to do, I can't for... (2 Replies)
Discussion started by: redseventyseven
2 Replies

10. Shell Programming and Scripting

awk, sed or similar log repair help

I have a log file that for some reason, once or two time a month, line foods are missing. This log is generated from vmstat everyminute. I dont know why sometimes it does this. Each line in the log should have 18 columns separated by one or more spaces. Good Log: (not actual log) 1 1... (8 Replies)
Discussion started by: Ikon
8 Replies

Featured Tech Videos