String matching question

10-20-2007

Folks,
I need help with this:
I have a text file with a lot of lines; each line is a string holding a directory tree (a path). I would like to ignore any line starting with "#", then grep for an exact match of a string, and if the matching string has no child directory, print it out. The details are below:

The text file looks something like:

/new/tree/xxx/yyy/zzz
#/new/free/opt/yyy
/aaa/bbb/ccc/
/aaa/bbb/ccc/ddd/eee
/aaa/bbb/ccc/ddd

Now, first I want to ignore any line that starts with "#".

Second, I want to do the following for EACH line, starting with the first:
- look for a string exactly matching that line
then
1. If the matching string has any children (directories under the string), ignore it.
2. If there is no child directory under the string, print the string with the phrase "/hello/every/one" appended and redirect the output to a new text file.
This should be done for each line in the original text file.

Thanks in advance

Katkota

10-20-2007

Usually the recipe for writing good regular expressions is to phrase the problem correctly; most of the time this alone provides the solution.

In your case you were *almost* there already, so this is simple:

First we filter out all the lines starting with "#". This is done with a special regexp device: "^", if used at the beginning of a regexp, means "start of line". That is, "^#" doesn't mean a caret followed by an octothorpe, but an octothorpe as the first character of a line. Here is the script with some sample text; the one line that is filtered out says so:

Code:

sed '/^#/d' file > file.changed

Sample text:

this line goes through
# this line is blocked
this line goes through even if it has a # in it

Now for the next problem: match a line with an exact content and print it (to a file). Your problem with the child directories could be stated as "match a line with a content and no additional content". We achieve this by using a similar device as above: the "$" at the end of a regexp means "end of line". That is: "x$" means not "x followed by a dollar sign", but "x as the last character of a line".

By the way, as we are just searching for specific lines and ignoring all the others, we could simply skip filtering out the lines starting with an octothorpe ("#"), as we won't find them anyway. We can turn off all automatic output of sed (the -n option) and explicitly print only the lines we find. I left the filter for the comment lines in there, but it is redundant.

Here it is with a sample text; only the last line, the exact match, is printed:

Code:

sed -n '/^#/d;/^this is my text to find$/p' file > file.changed

Sample text:

# this line is blocked by rule 1
this is my text to find but with additional text
this is my text to find

Now for the last part, adding the additional text: we simply change rule 2, which finds and prints the text, into a substitution. Here we use sed's ability to reuse the matched part of the text in the output. The "&" in the replacement stands for whatever the search expression actually matched:

Code:

sed -n '/^#/d;s/^this is my text to find$/& with added text/p' file > file.changed

Sample text:

# this line is blocked by rule 1
this is my text to find but with additional text
this is my text to find

The content of file.changed should be a single line "this is my text to find with added text".

Now back to your problem: your text contains slash characters, and as "/" is part of the sed syntax too, you will have to "escape" each one by putting a "\" in front of it. To match "/usr/bin", use the expression "\/usr\/bin".
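As a side note, sed also accepts a delimiter other than "/", which avoids the escaping altogether. This is a minimal sketch, not part of the original answer; the sample paths and the "/local" suffix are made up for illustration. In the s command any character can follow the "s"; in an address the custom delimiter has to be introduced with a backslash.

```shell
# Sample input; the paths are illustrative only.
input=$(printf '%s\n' '/usr/bin' '/usr/bin/X11' '#/usr/bin')

# "|" is the delimiter of the s command, so "/" is an ordinary character here.
with_s=$(printf '%s\n' "$input" | sed -n 's|^/usr/bin$|&/local|p')

# In an address, a custom delimiter is announced with a backslash: \|regex|
with_addr=$(printf '%s\n' "$input" | sed -n '\|^/usr/bin$|p')

printf '%s\n' "$with_s" "$with_addr"
```

Whether this reads better than the escaped form is a matter of taste; it mostly pays off when the pattern contains many slashes.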

Furthermore, it is usually a good idea to clear any unnecessary whitespace from a line before matching it. Most of the time we do NOT want leading or trailing blanks, tabs, etc. to get in the way; "match" and "<tab><blank>match" should be treated the same. So I would write it this way ("<spc>" is a literal space, "<tab>" is a TAB character):

Code:

sed -n 's/^[<tab><spc>]*//
        s/[<tab><spc>]*$//
        /^#/d
        s/^\/the\/directory\/to\/find$/&\/hello\/every\/one/p' file > file.changed
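For anyone who wants to try this without typing literal tabs, here is a runnable sketch of the same four rules. It is an adaptation, not the script above: it assumes the POSIX character class [[:space:]] in place of the <tab><spc> placeholders, uses "|" as the delimiter of the s command so the slashes need no escaping, and picks /aaa/bbb/ccc/ddd from the question's sample as the directory to find. The stray blanks around that line are added on purpose.

```shell
# Sample data from the question, with stray whitespace added to one line.
input=$(printf '%s\n' \
    '/new/tree/xxx/yyy/zzz' \
    '#/new/free/opt/yyy' \
    '  /aaa/bbb/ccc/ddd  ' \
    '/aaa/bbb/ccc/ddd/eee')

# Rule 1: strip leading whitespace.  Rule 2: strip trailing whitespace.
# Rule 3: drop comment lines.        Rule 4: print only the exact match,
# with the suffix appended.
result=$(printf '%s\n' "$input" | sed -n '
    s/^[[:space:]]*//
    s/[[:space:]]*$//
    /^#/d
    s|^/aaa/bbb/ccc/ddd$|&/hello/every/one|p')

printf '%s\n' "$result"
```

Only the exact line is reported; the longer /aaa/bbb/ccc/ddd/eee is not touched because of the "$" anchor.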

Here is a last tip: when you prepare regexps, test them against short texts. Prepare the most difficult examples you can think of. There are four kinds of lines; try to provide one of each:

the ones that are matched and should be matched;
the ones that are matched but shouldn't be matched;
the ones that are not matched but should be matched;
the ones that are not matched, and correctly so.
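One way to make such a test repeatable is a tiny check function. This is only a sketch: the function name check, the variable pattern and the test lines are all hypothetical, and grep stands in for whatever tool the regexp is meant for. Lines of the second and third kind show up as FAIL.

```shell
# Regexp under test (hypothetical example).
pattern='^/aaa/bbb/ccc$'

# check LINE EXPECTED   (EXPECTED is "match" or "nomatch")
check() {
    if printf '%s\n' "$1" | grep -q -e "$pattern"; then got=match; else got=nomatch; fi
    if [ "$got" = "$2" ]; then
        printf 'ok:   %s\n' "$1"
    else
        printf 'FAIL: %s (expected %s, got %s)\n' "$1" "$2" "$got"
    fi
}

# One exact hit, then three near misses: a child directory,
# a shorter look-alike, and a comment line.
report=$(
    check '/aaa/bbb/ccc'     match
    check '/aaa/bbb/ccc/ddd' nomatch
    check '/aaa/bbb/cc'      nomatch
    check '#/aaa/bbb/ccc'    nomatch
)
printf '%s\n' "$report"
```

Dropping the "$" from the pattern makes the second check fail, which is exactly the kind of mistake the tip is about.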

bakunin

10-21-2007

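A possible solution with sort and awk:

Code:

sort katkota.dat | \
awk '
   # Called with the current line while "path" still holds the previous one.
   # After sorting, a child of "path" is expected right after it.
   function print_if_no_child(curr_path) {
      if (match(curr_path, "^" path "/") == 0)
         print path, "/hello/every/one";
      else print path, "directories under the string";
   }

   /^#/         { next }                     # skip comment lines
   path_cnt > 0 { print_if_no_child($0) }    # judge the previous path
                { gsub(/\/*$/, ""); path = $0 ; path_cnt++ }   # strip trailing slashes, remember the path
   END          { print_if_no_child("") }    # the last path has no successor
'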

Input File:

Code:

/new/tree/xxx/yyy/zzz
#/new/free/opt/yyy
/aaa/bbb/ccc/
/aaa/bbb/ccc/ddd/eee
/aaa/bbb/ccc/ddd
/aaa/xxx
/aaa/xxx/yyy/zzz
#end of datas

Output:

Code:

/aaa/bbb/ccc directories under the string
/aaa/bbb/ccc/ddd directories under the string
/aaa/bbb/ccc/ddd/eee /hello/every/one
/aaa/xxx directories under the string
/aaa/xxx/yyy/zzz /hello/every/one
/new/tree/xxx/yyy/zzz  /hello/every/one
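If the output should follow the original request more literally (parents skipped, suffix appended without a space), a variant could look like the sketch below. It is not the script above: the variable prev is made up, index() replaces match() so that regexp metacharacters in a path do no harm, and it assumes no duplicate lines and no names containing characters that sort before "/" (which could separate a parent from its children).

```shell
# Sample data from the original question.
input=$(printf '%s\n' \
    '/new/tree/xxx/yyy/zzz' \
    '#/new/free/opt/yyy' \
    '/aaa/bbb/ccc/' \
    '/aaa/bbb/ccc/ddd/eee' \
    '/aaa/bbb/ccc/ddd')

result=$(printf '%s\n' "$input" | LC_ALL=C sort | awk '
    /^#/ { next }                 # skip comment lines
    { sub(/\/+$/, "") }           # drop trailing slashes
    # The previous path is a leaf if the current one does not start with "previous/".
    prev != "" && index($0, prev "/") != 1 { print prev "/hello/every/one" }
    { prev = $0 }
    END { if (prev != "") print prev "/hello/every/one" }
')
printf '%s\n' "$result"
```

Redirecting the final printf to a file gives the new text file asked for in the question.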

Jean-Pierre.

aigles

10-21-2007

Folks,
I very much appreciate your help, but the requirements have changed (I apologize for the confusion), and I would appreciate some help with the new version:

Now I need to look at each line (each line is a directory tree), and for each tree search the whole file for the shortest one, "the tree with no children", then append a text phrase to it and redirect the output to a new text file.
In detail:

Let's say the text file looks like:

/aa/bb/cc/dd/ee
/xxx/yyy/zzz
/aa/bb/cc
/xxx/yyy/zzz/fff/nnn
/aa/bb/cc/dd
/mm/uu/ss/tt/rr
/mm/uu/ss/tt

For this sample, I should take the first line, then find the similar trees and keep looking until I find the shortest one, which in this example is "/aa/bb/cc". It has only three directories, while the other two matching lines have longer paths (one is /aa/bb/cc/dd/ee and the other is /aa/bb/cc/dd).
After extracting the shortest, "/aa/bb/cc", I append a phrase or another folder such as "plus" so that it looks like "/aa/bb/cc/plus", then redirect this result to a new text file.
Then I go to the second line and do the same thing.

I hope I explained it well.

Once again, i appreciate the help.

Katkota

10-22-2007

If I understand correctly and sorting is acceptable:

Code:

sort file | awk '!x[$2]++ && $0 = $0 "/plus"' FS="/" > new_text_file

Use nawk or /usr/xpg4/bin/awk on Solaris.
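For readers puzzled by the one-liner, this is how I read it: with "/" as the field separator, $1 is the empty text before the leading slash and $2 is the top-level directory; x[$2]++ counts how often that directory has been seen, so only the first line per top-level directory passes and gets "/plus" appended. Below is a longer sketch of the same idea run on the sample from the question. The array name seen and the LC_ALL=C setting are additions for the example. Note that it groups by top-level directory only, which happens to be enough for this sample.

```shell
# Sample data from the second question.
input=$(printf '%s\n' \
    '/aa/bb/cc/dd/ee' \
    '/xxx/yyy/zzz' \
    '/aa/bb/cc' \
    '/xxx/yyy/zzz/fff/nnn' \
    '/aa/bb/cc/dd' \
    '/mm/uu/ss/tt/rr' \
    '/mm/uu/ss/tt')

result=$(printf '%s\n' "$input" | LC_ALL=C sort | awk -F/ '
    # $1 is empty (text before the leading "/"), so $2 is the top-level directory.
    # Print only the first line seen for each top-level directory.
    !seen[$2]++ { print $0 "/plus" }
')
printf '%s\n' "$result"
```

Two unrelated trees under the same top-level directory (say /aa/bb and /aa/zz) would be merged into one by this approach.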

radoulov

10-22-2007

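Try and adapt the following script:

Code:

sort katkota.dat | \
awk '
   /^#/ { next }                          # skip comment lines
   # strip trailing slashes and whitespace; the first path becomes the first root
   { gsub(/\/*[[:space:]]*$/, ""); if (! root) root=$0}
   # a line that does not lie under the current root ends that tree
   root { if (match($0 "/", "^" root "/")==0) {
        print root "/file"
        root = $0
     }
   }
   END { print root "/file" }
'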

Input:

Code:

/a
/b
/c
/usr
/new/tree
/new/tree/xxx/yyy/zzz
#/new/free/opt/yyy
/aaa/bbb/ccc/
/aaa/bbb/ccc/ddd/eee
/aaa/bbb/ccc/ddd
/aaa/xxx
/aaa/xxx/yyy/zzz
#end of datas

Output:

Code:

$ k2.sh
/a/file
/aaa/bbb/ccc/file
/aaa/xxx/file
/b/file
/c/file
/new/tree/file
/usr/file

Jean-Pierre.

aigles

10-22-2007

Thanks a lot.
But aigles, could you please explain your code to me? I'm a little puzzled by it.

moe2266
