Word boundaries in GAWK?


06-11-2009


I wanted to use GAWK's 'word boundary' feature but can't
get it to work. Doesn't GAWK support \<word\>?

Sample record:

Code:

Title                   Bats in the fifth act of Chushingura (top);
                        the world of the bell - the story of Anchin and Kiyohime (bottom)                               
Series Title            Sketches by Yoshitoshi     
Title-Alternative       Yoshitoshi ryakuga: Komori no godanme (top); Kane no sekai (bottom)

Shouldn't /^\<Title\>/ work to remove "Title-Alternative"? It doesn't. I have to use this:
$1 ~ /^Title$/

Bubnoff


06-12-2009


because you are escaping \> and \< whereas in your data, there is no < >

ghostdog74


06-12-2009


Update on GAWK boundaries.

Thanks for answering, ghostdog74. However, I'm still a bit unclear on
what you mean. I am aware that I do not have the < and > characters in the data; I was trying to use GAWK's word boundary operators.

According to the documentation (GAWK: Effective ...etc.), the regex operators
\< and \> can be used to indicate word boundaries. They do, but they
use a space as the delimiter (if I had RTFMed a bit closer, I
would have saved myself this confusion).

e.g. "Title-Alternative" will match but "TitleAlternative" will not.

This still makes no sense. How is this working?

I originally thought I could remove "Title-Alternative" by using the word
boundary operators like:

\<Title\>

But since Title-Alternative has a hyphen, it still matches (why exactly, I can't say). This regex will
exclude "TitleAlternative", which is closer to the example in the docs, but won't exclude "Title-Alternative".

So I think my problem was not fully understanding the way GAWK's W.B.
operators worked ( still don't ).

I am new to AWK and am wondering how others would pull "Title" from
a record that looks similar to what is in my above post.

Code:

 gawk '$1 ~ /^Title$/{print}'

The above works. Any insight into these crazy Word Boundary operators in
GAWK would be much appreciated.
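For reference, the field test above can be checked against a few lines modelled on the sample record. This is only a sketch: the variable names are made up for illustration, and the wrapped continuation line of the record is left out.

```shell
# Sample lines modelled on the record in the first post (continuation line omitted).
sample='Title                   Bats in the fifth act of Chushingura (top);
Series Title            Sketches by Yoshitoshi
Title-Alternative       Yoshitoshi ryakuga: Komori no godanme (top); Kane no sekai (bottom)'

# $1 is the first whitespace-delimited field, so anchoring it with ^ and $
# accepts only a field that is exactly "Title".
first_field_hits=$(printf '%s\n' "$sample" | awk '$1 ~ /^Title$/ { print $1 }')
printf '%s\n' "$first_field_hits"
```

Only the first line is selected: "Series" and "Title-Alternative" are the first fields of the other two lines, and neither is exactly "Title".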

Thanks -

Bub

Bubnoff


06-12-2009


From GNU Regexp Operators - The GNU Awk User's Guide

Quote:

a word is a sequence of one or more letters, digits, or underscores

So hyphens are out.
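A small sketch of what that definition means in practice, assuming gawk is installed (\< and \> are GNU extensions). The four test strings are illustrative, not taken from the data file:

```shell
# Requires gawk: \< and \> are GNU regexp operators, not POSIX awk.
# A hyphen is not a word character, so a word ends right before it;
# a letter or an underscore after "Title" keeps the word going.
boundary_hits=$(printf '%s\n' 'Title' 'Title-Alternative' 'TitleAlternative' 'Title_Alt' |
  gawk '/^\<Title\>/ { print }')
printf '%s\n' "$boundary_hits"
```

This prints "Title" and "Title-Alternative" but neither of the other two, which is why the boundary operators alone cannot separate "Title" from "Title-Alternative".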

Ygor


06-12-2009


GAWK boundaries

Thanks Ygor!

I'm embarrassed to say I read this section at least twice, today alone, and didn't catch that. It's times like these when a person should just step
away from the screen, grab a cup o' joe and go for a walk.

Bub

Bubnoff


06-12-2009


Quote:

Originally Posted by Bubnoff

Thanks for answering ghostdog74, however, I'm still a bit unclear on
what you mean.

My bad, I didn't read your requirement properly. If you want to match whole words, there is no need for a regular expression. Just go through each field and test it:

Code:

awk '{
  for (i = 1; i <= NF; i++) {
    if ($i == "Title") {   # or: $i ~ /^Title$/
      # ........
    }
  }
}'
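A runnable sketch of the loop above. The action is elided in the post, so the print here is only a placeholder assumption, and the two input lines are illustrative:

```shell
# Two illustrative lines; the second contains "Title" only inside a longer field.
whole_word_hits=$(printf '%s\n' 'Series Title Sketches' 'Title-Alternative Yoshitoshi' |
  awk '{
    for (i = 1; i <= NF; i++) {
      if ($i == "Title") {
        # Placeholder action: report where the whole field matched.
        print "line " NR ", field " i
      }
    }
  }')
printf '%s\n' "$whole_word_hits"
```

Because == compares the whole field, "Title-Alternative" is skipped without any regex. Note that this matches "Title" in any field, not only $1.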

ghostdog74


06-12-2009


Forgot to mention the case possibilities -

Each test has to take into account possible capitalization (or lack thereof). So actually, I've been using:

Code:

/^[Tt]itle$/

The background to this is that I have around one hundred Dublin Core
records to analyze, and the elements I'm testing for are always in
field $1 with the values in fields $2 or $3. "Title" is one of around 15 DC elements I'm testing for, plus or minus the screwy ones people insist on adding. Some catalogers capitalize and others do not.

I could use another GNU Awk feature though:

Code:

 gawk -v IGNORECASE=1 '$1 == "title"{print $1}' test.notes

- as you suggest, instead of the regex version:

Code:

 gawk '$1 ~ /^[Tt]itle$/{print}' test.notes

The regex version is quicker to type, though.

To distinguish title from:
"Title Alternative" or "title alternative", I am using:

Code:

 gawk '$1 ~ /^[Tt]itle$/&&$2 !~ /[Aa]lternative/{print $1}' test.notes
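A possible alternative to the two-field test above, sketched under one assumption: element names and values are separated by two or more spaces, as the sample record suggests. With that separator, "Title Alternative" stays in one field, and tolower() handles the capitalization in plain awk. The input lines and variable names are illustrative.

```shell
# Assumption: name and value are separated by two or more spaces,
# so single spaces stay inside a field.
records='Title                   Bats in the fifth act of Chushingura (top);
title                   Sketches
Title Alternative       Yoshitoshi ryakuga
Series Title            Sketches by Yoshitoshi'

# -F'  +' is a regex: one space followed by one or more spaces.
title_values=$(printf '%s\n' "$records" |
  awk -F'  +' 'tolower($1) == "title" { print $2 }')
printf '%s\n' "$title_values"
```

Only the first two lines are selected. If some records use tabs or a single space between name and value, the separator would need adjusting, so treat this as a starting point rather than a drop-in replacement.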

Thanks for the replies -

Bub


Bubnoff
