Post 302425953 by Ilja on Monday 31st of May 2010 06:08:07 AM
PHP: preg_match_all with multibyte characters?

Hi! I'm trying to separate text into sentences, like this:
Code:
$pattern = "/[A-Z]([a-z]|[[:space:]]|,)*[\.\!\?:]*/";
preg_match_all($pattern, $text, $matches);

This works fine unless the text contains multibyte characters, like "åäö". How can I make this work with these exotic characters?

An example phrase that doesn't match:
"Detta är ett test!"
The character 'ä' breaks the match, but I would like characters like it to be matched as well.
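
One possible way to handle this, sketched under the assumption that $text is valid UTF-8: add the /u pattern modifier so PCRE treats pattern and subject as UTF-8, and swap the ASCII ranges [A-Z] and [a-z] for the Unicode properties \p{Lu} (uppercase letter) and \p{Ll} (lowercase letter). The sample text and its second sentence are made up for illustration, and this is not necessarily the only or best fix.

```php
<?php
// Assumption: the input is UTF-8 encoded. With /u, PCRE reads the subject
// as UTF-8 code points instead of single bytes, so 'ä' is one character.
$text = "Detta är ett test! Ännu en mening.";

// \p{Lu} = any uppercase letter, \p{Ll} = any lowercase letter.
// Single quotes keep the backslashes intact for the regex engine.
$pattern = '/\p{Lu}(?:\p{Ll}|\s|,)*[.!?:]*/u';

// preg_match_all() returns false on error, e.g. if the subject
// is not valid UTF-8 while the /u modifier is in use.
if (preg_match_all($pattern, $text, $matches) === false) {
    fwrite(STDERR, "Regex error code: " . preg_last_error() . PHP_EOL);
    exit(1);
}

// $matches[0] holds the full matches, one per sentence.
print_r($matches[0]);
```

If the text arrives in another encoding (for example ISO-8859-1), it would have to be converted to UTF-8 first, for instance with mb_convert_encoding(), before the /u modifier can be used.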
 
