Unix/Linux Go Back    


UNIX for Advanced & Expert Users Expert-to-Expert. Learn advanced UNIX, UNIX commands, Linux, Operating Systems, System Administration, Programming, Shell, Shell Scripts, Solaris, Linux, HP-UX, AIX, OS X, BSD.

penalty for case insensitive grep

UNIX for Advanced & Expert Users


Tags
grep, grep case insensitive penalty

Closed Linux or Unix Question    
 
Thread Tools Search this Thread Display Modes
    #1  
Old Unix and Linux 02-24-2011
phil518 phil518 is offline
Registered User
 
Join Date: Mar 2009
Last Activity: 15 May 2014, 9:03 PM EDT
Posts: 43
Thanks: 15
Thanked 0 Times in 0 Posts
penalty for case insensitive grep

I just found out there were a big performance penalty for case insensitive "grep" on big files.

It would be understandable, except that the penalty seems to be exaggerated out of proportion.

A real example, if I only grep a single letter "V" (or "v") , without "-i", on a big file, (file is doctored so only a few "V" or "v" exist). It takes 0.157 user time to finish.

Then I grep the same letter "V", but with "-i" option, it takes 32.0 user time to finish. That is 200 times longer than without "-i" for a single character.

Can someone provide some insight why this is the case?

Thanks.

NB Phil
Sponsored Links
    #2  
Old Unix and Linux 02-24-2011
frank_rizzo frank_rizzo is offline Forum Advisor  
Resident BOFH
 
Join Date: Dec 2007
Last Activity: 23 May 2015, 9:51 AM EDT
Posts: 1,141
Thanks: 2
Thanked 90 Times in 87 Posts
probably because it has to convert the case of every line. try this and see what happens.
Code:
grep -e V -e v ...

The Following User Says Thank You to frank_rizzo For This Useful Post:
phil518 (02-25-2011)
Sponsored Links
    #3  
Old Unix and Linux 02-24-2011
Corona688 Corona688 is offline Forum Staff  
Mead Rotor
 
Join Date: Aug 2005
Last Activity: 30 June 2015, 2:18 PM EDT
Location: Saskatchewan
Posts: 20,740
Thanks: 901
Thanked 3,690 Times in 3,449 Posts
Quote:
Originally Posted by phil518 View Post
Can someone provide some insight why this is the case?
Is this GNU grep? The GNU utilities, unusually, try to handle your character set as appropriate. This means when you tell it to be case insensitive, that busts out some pretty heavy-duty routines in order to do so properly.
The Following User Says Thank You to Corona688 For This Useful Post:
phil518 (02-25-2011)
    #4  
Old Unix and Linux 02-24-2011
phil518 phil518 is offline
Registered User
 
Join Date: Mar 2009
Last Activity: 15 May 2014, 9:03 PM EDT
Posts: 43
Thanks: 15
Thanked 0 Times in 0 Posts
Quote:
Originally Posted by Corona688 View Post
Is this GNU grep? The GNU utilities, unusually, try to handle your character set as appropriate. This means when you tell it to be case insensitive, that busts out some pretty heavy-duty routines in order to do so properly.
It is GNU grep. Wow, 200 times longer for case insensitive grep. Those routines sound very heavy.

Is this a known fact?

is x200 the upper bound of the performance penalty, regardless the length of the input string? (thinking n-character long string will have 2^n case permutations).
Sponsored Links
    #5  
Old Unix and Linux 02-25-2011
binlib binlib is offline
Registered User
 
Join Date: Aug 2009
Last Activity: 15 March 2013, 10:40 AM EDT
Location: New Jersey
Posts: 380
Thanks: 7
Thanked 90 Times in 75 Posts
Again, it's the fault of I18N. Set LC_ALL to C, you will see that the -i run is only twice as long.
The Following User Says Thank You to binlib For This Useful Post:
phil518 (02-25-2011)
Sponsored Links
    #6  
Old Unix and Linux 02-25-2011
methyl methyl is offline Forum Advisor  
Advisor
 
Join Date: Mar 2008
Last Activity: 22 June 2015, 6:47 PM EDT
Posts: 6,399
Thanks: 288
Thanked 675 Times in 644 Posts
This problem has cropped up a few times recently.

The default "locale" for GNU utility programs including "grep" and "sort" has changed to "UTF" which means that mapping one "character" can take more than one character. If your file definitely does not contain "UTF" characters you can massively improve performance by changing your locale back to the basic value of "C".

To check how your system is now, type and check the output from this enquiry:

Code:
locale

The Following User Says Thank You to methyl For This Useful Post:
phil518 (02-25-2011)
Sponsored Links
    #7  
Old Unix and Linux 02-25-2011
Corona688 Corona688 is offline Forum Staff  
Mead Rotor
 
Join Date: Aug 2005
Last Activity: 30 June 2015, 2:18 PM EDT
Location: Saskatchewan
Posts: 20,740
Thanks: 901
Thanked 3,690 Times in 3,449 Posts
Quote:
Originally Posted by phil518 View Post
It is GNU grep. Wow, 200 times longer for case insensitive grep. Those routines sound very heavy.
Imagine every possible language UTF8 supports, including ones where a letter's "case" has strict and complicated rules. All of that's what you're asking grep to check for when doing case-insensitive on UTF8. Linux
The Following User Says Thank You to Corona688 For This Useful Post:
phil518 (02-25-2011)
Sponsored Links
Closed Linux or Unix Question

Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Linux More UNIX and Linux Forum Topics You Might Find Helpful
Thread Thread Starter Forum Replies Last Post
Using sed for case insensitive search sbhuvana20 UNIX for Dummies Questions & Answers 7 02-19-2011 07:05 AM
Case Insensitive search gregarion Shell Programming and Scripting 3 01-19-2010 10:31 AM
How to make sed case insensitive vickylife Shell Programming and Scripting 2 01-06-2009 04:57 AM
case insensitive ROOZ Shell Programming and Scripting 4 03-11-2008 12:40 PM
awk case-insensitive forever_49ers Shell Programming and Scripting 7 09-13-2006 05:15 PM



All times are GMT -4. The time now is 08:57 PM.