penalty for case insensitive grep | Unix Linux Forums | UNIX for Advanced & Expert Users

  Go Back    


UNIX for Advanced & Expert Users Expert-to-Expert. Learn advanced UNIX, UNIX commands, Linux, Operating Systems, System Administration, Programming, Shell, Shell Scripts, Solaris, Linux, HP-UX, AIX, OS X, BSD.

penalty for case insensitive grep

UNIX for Advanced & Expert Users


Tags
grep, grep case insensitive penalty

Closed Thread    
 
Thread Tools Search this Thread Display Modes
    #1  
Old 02-24-2011
phil518 phil518 is offline
Registered User
 
Join Date: Mar 2009
Last Activity: 15 May 2014, 9:03 PM EDT
Posts: 43
Thanks: 15
Thanked 0 Times in 0 Posts
penalty for case insensitive grep

I just found out there were a big performance penalty for case insensitive "grep" on big files.

It would be understandable, except that the penalty seems to be exaggerated out of proportion.

A real example, if I only grep a single letter "V" (or "v") , without "-i", on a big file, (file is doctored so only a few "V" or "v" exist). It takes 0.157 user time to finish.

Then I grep the same letter "V", but with "-i" option, it takes 32.0 user time to finish. That is 200 times longer than without "-i" for a single character.

Can someone provide some insight why this is the case?

Thanks.

NB Phil
Sponsored Links
    #2  
Old 02-24-2011
frank_rizzo frank_rizzo is offline Forum Advisor  
Resident BOFH
 
Join Date: Dec 2007
Last Activity: 21 August 2014, 4:30 PM EDT
Posts: 1,136
Thanks: 2
Thanked 88 Times in 85 Posts
probably because it has to convert the case of every line. try this and see what happens.
Code:
grep -e V -e v ...

The Following User Says Thank You to frank_rizzo For This Useful Post:
phil518 (02-25-2011)
Sponsored Links
    #3  
Old 02-24-2011
Corona688 Corona688 is offline Forum Staff  
Mead Rotor
 
Join Date: Aug 2005
Last Activity: 28 August 2014, 4:23 PM EDT
Location: Saskatchewan
Posts: 19,269
Thanks: 774
Thanked 3,236 Times in 3,034 Posts
Quote:
Originally Posted by phil518 View Post
Can someone provide some insight why this is the case?
Is this GNU grep? The GNU utilities, unusually, try to handle your character set as appropriate. This means when you tell it to be case insensitive, that busts out some pretty heavy-duty routines in order to do so properly.
The Following User Says Thank You to Corona688 For This Useful Post:
phil518 (02-25-2011)
    #4  
Old 02-24-2011
phil518 phil518 is offline
Registered User
 
Join Date: Mar 2009
Last Activity: 15 May 2014, 9:03 PM EDT
Posts: 43
Thanks: 15
Thanked 0 Times in 0 Posts
Quote:
Originally Posted by Corona688 View Post
Is this GNU grep? The GNU utilities, unusually, try to handle your character set as appropriate. This means when you tell it to be case insensitive, that busts out some pretty heavy-duty routines in order to do so properly.
It is GNU grep. Wow, 200 times longer for case insensitive grep. Those routines sound very heavy.

Is this a known fact?

is x200 the upper bound of the performance penalty, regardless the length of the input string? (thinking n-character long string will have 2^n case permutations).
Sponsored Links
    #5  
Old 02-25-2011
binlib binlib is offline
Registered User
 
Join Date: Aug 2009
Last Activity: 15 March 2013, 10:40 AM EDT
Location: New Jersey
Posts: 380
Thanks: 7
Thanked 90 Times in 75 Posts
Again, it's the fault of I18N. Set LC_ALL to C, you will see that the -i run is only twice as long.
The Following User Says Thank You to binlib For This Useful Post:
phil518 (02-25-2011)
Sponsored Links
    #6  
Old 02-25-2011
methyl methyl is offline Forum Advisor  
Advisor
 
Join Date: Mar 2008
Last Activity: 18 April 2014, 5:13 AM EDT
Posts: 6,396
Thanks: 287
Thanked 672 Times in 642 Posts
This problem has cropped up a few times recently.

The default "locale" for GNU utility programs including "grep" and "sort" has changed to "UTF" which means that mapping one "character" can take more than one character. If your file definitely does not contain "UTF" characters you can massively improve performance by changing your locale back to the basic value of "C".

To check how your system is now, type and check the output from this enquiry:

Code:
locale

The Following User Says Thank You to methyl For This Useful Post:
phil518 (02-25-2011)
Sponsored Links
    #7  
Old 02-25-2011
Corona688 Corona688 is offline Forum Staff  
Mead Rotor
 
Join Date: Aug 2005
Last Activity: 28 August 2014, 4:23 PM EDT
Location: Saskatchewan
Posts: 19,269
Thanks: 774
Thanked 3,236 Times in 3,034 Posts
Quote:
Originally Posted by phil518 View Post
It is GNU grep. Wow, 200 times longer for case insensitive grep. Those routines sound very heavy.
Imagine every possible language UTF8 supports, including ones where a letter's "case" has strict and complicated rules. All of that's what you're asking grep to check for when doing case-insensitive on UTF8.
The Following User Says Thank You to Corona688 For This Useful Post:
phil518 (02-25-2011)
Sponsored Links
Closed Thread

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

More UNIX and Linux Forum Topics You Might Find Helpful
Thread Thread Starter Forum Replies Last Post
Using sed for case insensitive search sbhuvana20 UNIX for Dummies Questions & Answers 7 02-19-2011 07:05 AM
Case Insensitive search gregarion Shell Programming and Scripting 3 01-19-2010 10:31 AM
How to make sed case insensitive vickylife Shell Programming and Scripting 2 01-06-2009 04:57 AM
case insensitive ROOZ Shell Programming and Scripting 4 03-11-2008 12:40 PM
awk case-insensitive forever_49ers Shell Programming and Scripting 7 09-13-2006 05:15 PM



All times are GMT -4. The time now is 10:03 AM.