Word boundaries in GAWK?


 
# 1  
Old 06-11-2009
Word boundaries in GAWK?

I wanted to use GAWK's 'word boundary' feature but can't
get it to work. Doesn't GAWK support \<word\>?

Sample record:


Code:
Title                   Bats in the fifth act of Chushingura (top);
                        the world of the bell - the story of Anchin and Kiyohime (bottom)                               
Series Title            Sketches by Yoshitoshi     
Title-Alternative       Yoshitoshi ryakuga: Komori no godanme (top); Kane no sekai (bottom)

Shouldn't /^\<Title\>/ match "Title" but exclude "Title-Alternative"? It doesn't. I have to use this instead:
$1 ~ /^Title$/

Bubnoff

Last edited by Bubnoff; 06-11-2009 at 11:55 PM.. Reason: formatting issue
# 2  
Old 06-12-2009
because you are escaping \> and \< whereas in your data, there is no < >
# 3  
Old 06-12-2009
Update on GAWK boundaries.

Thanks for answering ghostdog74, however, I'm still a bit unclear on
what you mean. I am aware that I do not have the gt lt characters in the data, I was trying to use GAWK's word boundary operators.

According to the documentation ( GAWK: Effective ...etc. ) the regex operators:

\< and \> can be used to indicate word boundaries. They do, but they
use a space as the delimiter ( if I would've RTFMed a bit closer I
would've saved myself this confusion ).

eg. "Title-Alternative" will be true but "TitleAlternative" will be false.

This still makes no sense. How is this working?

I originally thought I could remove "Title-Alternative" by using the word
boundary operators like:

\<Title\>

But since Title-Alternative has a hyphen it still matches ( why exactly I can't say ). This regex will
exclude "TitleAlternative", which is closer to the example in the docs, but won't exclude "Title-Alternative".

So I think my problem was not fully understanding the way GAWK's W.B.
operators worked ( still don't ).

I am new to AWK and am wondering how others would pull "Title" from
a record that looks similar to what is in my above post.

Code:
 gawk '$1 ~ /^Title$/{print}'

The above works. Any insight into these crazy Word Boundary operators in
GAWK would be much appreciated.

Thanks -

Bub
# 4  
Old 06-12-2009
From GNU Regexp Operators - The GNU Awk User's Guide
Quote:
a word is a sequence of one or more letters, digits, or underscores
So hyphens are out.
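
To see that definition in action, here is a minimal sketch (not taken from the manual) that spells the word characters out as an explicit class, so it should run in any awk. In gawk the same test can be written as /\<Title\>/. The helper name title_as_word and the four test strings are made up for illustration, and ASCII input is assumed.

```shell
# Hypothetical helper: print lines that contain "Title" as a whole word,
# where a word character is a letter, a digit or an underscore (ASCII assumed).
# gawk shorthand for the same test: /\<Title\>/
title_as_word() {
  awk '/(^|[^A-Za-z0-9_])Title([^A-Za-z0-9_]|$)/'
}

# Only the hyphenated name and "Series Title" get through; the hyphen and the
# space are not word characters, so "Title" stands alone as a word there.
printf '%s\n' 'Title-Alternative' 'TitleAlternative' 'Title_Alternative' 'Series Title' |
  title_as_word
```

So the boundary operators cannot separate "Title" from "Title-Alternative"; an exact test on the whole field, such as $1 ~ /^Title$/, can.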
# 5  
Old 06-12-2009
GAWK boundaries

Thanks Ygor!

I'm embarrassed to say I read this section at least twice, today alone, and didn't catch that. It's times like these when a person should just step
away from the screen, grab a cup o' joe and go for a walk.

Bub
# 6  
Old 06-12-2009
Quote:
Originally Posted by Bubnoff
Thanks for answering ghostdog74, however, I'm still a bit unclear on
what you mean.
My bad, I didn't read your requirement properly. If you want to match whole words, there is no need for a regular expression. Just go through each field and test it:
Code:
awk '{
  for (i = 1; i <= NF; i++) {
    if ($i == "Title") {   # or: $i ~ /^Title$/
      # ........
    }
  }
}'
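
A filled-in version of that loop, as a rough sketch only: the action (printing the record and field number) is just a stand-in for the elided part, and the helper name find_title_fields and the three shortened sample lines are assumptions for illustration.

```shell
# Hypothetical helper: report every whitespace-separated field that is
# exactly "Title" (plain string comparison, no regex involved).
find_title_fields() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i == "Title")
        print NR ": field " i
  }'
}

# Shortened lines modelled on the sample record.
printf '%s\n' 'Title  Bats in the fifth act' 'Series Title  Sketches' 'Title-Alternative  Komori' |
  find_title_fields
```

Note that the loop also reports "Series Title", because "Title" is its second field. If only the element name in the first column matters, testing $1 alone avoids that.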

# 7  
Old 06-12-2009
Forgot to mention the case possibilities -

Each test has to take into account possible capitalization ( or lack thereof ). So actually, I've been using:

Code:
/^[Tt]itle$/

The background to this is that I have around one hundred Dublin Core
records to analyze and the elements I'm testing for are always in
field $1 with the values in fields $2 or $3. "Title" is one of around 15 DC elements I'm testing for, plus or minus the screwy ones people insist on adding. Some catalogers capitalize and others do not.

I could use another GNU Awk feature, though, and compare strings as you suggest:

Code:
 gawk -v IGNORECASE=1 '$1 == "title"{print $1}' test.notes

instead of using a regex:

Code:
 gawk '$1 ~ /^[Tt]itle$/{print}' test.notes

Spelling it out as a regex is quicker, though.

To distinguish title from:
"Title Alternative" or "title alternative", I am using:

Code:
 gawk '$1 ~ /^[Tt]itle$/&&$2 !~ /[Aa]lternative/{print $1}' test.notes
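
For awks without IGNORECASE, a portable variant might look like the sketch below. It lower-cases the fields with tolower() and compares whole strings. The helper names sample and title_rows are made up, and the last two sample lines are invented to cover the case variations; they are not from my real records.

```shell
# Sample lines modelled on the record above; the last two are invented
# to exercise the lower-case and upper-case spellings.
sample() {
  cat <<'EOF'
Title                   Bats in the fifth act of Chushingura (top);
Series Title            Sketches by Yoshitoshi
Title-Alternative       Yoshitoshi ryakuga: Komori no godanme (top)
title alternative       Kane no sekai (bottom)
TITLE                   Upper-case element name
EOF
}

# Hypothetical helper: exact, case-insensitive match on field 1, skipping
# records whose second field is "alternative" in any capitalization.
title_rows() {
  awk 'tolower($1) == "title" && tolower($2) != "alternative" { print $1 }'
}

sample | title_rows
```

Unlike /^[Tt]itle$/, this also accepts "TITLE", which may or may not be what you want for a given set of records.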

Thanks for the replies -

Bub

Last edited by Bubnoff; 06-12-2009 at 04:16 AM.. Reason: Forgot case.