Post 302302462 by Ilja on Tuesday 31st of March 2009 04:37:28 AM
PHP: preg_match_all with multibyte characters?

Hi! I'm trying to separate text into sentences, like this:
Code:
$pattern = "/[A-Z]([a-z]|[[:space:]]|,)*[\.\!\?:]*/";
preg_match_all($pattern, $text, $matches);

This works fine unless the text contains multibyte characters, like "åäö". How can I make this work with these exotic characters?
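One possible approach, sketched under the assumption that $text is valid UTF-8 (and that the script file itself is saved as UTF-8): add the "u" pattern modifier and swap the ASCII-only classes [A-Z] and [a-z] for the Unicode properties \p{Lu} and \p{Ll}. The sample text below is made up for illustration, and the pattern keeps the shape of the original, so it still ends a match at any capital letter in the middle of a sentence.

```php
<?php
// Hypothetical sample text; assumed to be valid UTF-8.
$text = "Hän söi äpplen, päron och öl. Åsa åkte hem! Är det sant?";

// Same shape as the original pattern, but with Unicode properties:
//   \p{Lu} = any uppercase letter, \p{Ll} = any lowercase letter.
// The trailing "u" modifier makes PCRE treat pattern and subject as UTF-8.
$pattern = '/\p{Lu}(?:\p{Ll}|\s|,)*[.!?:]*/u';

$count = preg_match_all($pattern, $text, $matches);

if ($count === false) {
    // With /u, a subject that is not valid UTF-8 makes the call fail.
    echo "preg error code: " . preg_last_error() . "\n";
} else {
    // $matches[0] holds the full matches, one per sentence.
    print_r($matches[0]);
}
```

With this sample the call should find three matches, one per sentence. If the input is in a single-byte encoding such as ISO-8859-1 rather than UTF-8, it would need converting first (for example with mb_convert_encoding(), if the mbstring extension is available), otherwise the "u" modifier rejects the subject.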
 

PREG_MATCH_ALL(3)							 1							 PREG_MATCH_ALL(3)

preg_match_all - Perform a global regular expression match

SYNOPSIS
int preg_match_all (string $pattern, string $subject, [array &$matches], [int $flags = PREG_PATTERN_ORDER], [int $offset])

DESCRIPTION
Searches $subject for all matches to the regular expression given in $pattern and puts them in $matches in the order specified by $flags.

After the first match is found, subsequent searches continue from the end of the last match.

PARAMETERS
o $pattern - The pattern to search for, as a string.

o $subject - The input string.

o $matches - Array of all matches in a multi-dimensional array ordered according to $flags.

o $flags - Can be a combination of the following flags (note that it doesn't make sense to use PREG_PATTERN_ORDER together with PREG_SET_ORDER):

  o PREG_PATTERN_ORDER - Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on.

      <?php
      preg_match_all("|<[^>]+>(.*)</[^>]+>|U",
          "<b>example: </b><div align=left>this is a test</div>",
          $out, PREG_PATTERN_ORDER);
      echo $out[0][0] . ", " . $out[0][1] . "\n";
      echo $out[1][0] . ", " . $out[1][1] . "\n";
      ?>

    The above example will output:

      <b>example: </b>, <div align=left>this is a test</div>
      example: , this is a test

    So, $out[0] contains an array of strings that matched the full pattern, and $out[1] contains an array of strings enclosed by tags.

  o PREG_SET_ORDER - Orders results so that $matches[0] is an array of the first set of matches, $matches[1] is an array of the second set of matches, and so on.

      <?php
      preg_match_all("|<[^>]+>(.*)</[^>]+>|U",
          "<b>example: </b><div align=\"left\">this is a test</div>",
          $out, PREG_SET_ORDER);
      echo $out[0][0] . ", " . $out[0][1] . "\n";
      echo $out[1][0] . ", " . $out[1][1] . "\n";
      ?>

    The above example will output:

      <b>example: </b>, example: 
      <div align="left">this is a test</div>, this is a test

  o PREG_OFFSET_CAPTURE - If this flag is passed, for every occurring match the appendant string offset will also be returned. Note that this changes the value of $matches into an array where every element is an array consisting of the matched string at offset 0 and its string offset into $subject at offset 1.

  If no order flag is given, PREG_PATTERN_ORDER is assumed.

o $offset - Normally, the search starts from the beginning of the subject string. The optional parameter $offset can be used to specify the alternate place from which to start the search (in bytes).

  Note: Using $offset is not equivalent to passing substr($subject, $offset) to preg_match_all(3) in place of the subject string, because $pattern can contain assertions such as ^, $ or (?<=x). See preg_match(3) for examples.

RETURN VALUES
Returns the number of full pattern matches (which might be zero), or FALSE if an error occurred.

CHANGELOG
+--------+---------------------------------------------------+
|Version | Description                                       |
+--------+---------------------------------------------------+
| 5.4.0  | The $matches parameter became optional.           |
| 5.3.6  | Returns FALSE if $offset is higher than $subject  |
|        | length.                                           |
| 5.2.2  | Named subpatterns now accept the syntax           |
|        | (?<name>) and (?'name') as well as (?P<name>).    |
|        | Previous versions accepted only (?P<name>).       |
+--------+---------------------------------------------------+

EXAMPLES
Example #1 Getting all phone numbers out of some text.

    <?php
    preg_match_all("/\(?  (\d{3})?  \)?  (?(1)  [\-\s] ) \d{3}-\d{4}/x",
        "Call 555-1212 or 1-800-555-1212", $phones);
    ?>

Example #2 Find matching HTML tags (greedy)

    <?php
    // The \\2 is an example of backreferencing. This tells pcre that
    // it must match the second set of parentheses in the regular expression
    // itself, which would be the ([\w]+) in this case. The extra backslash is
    // required because the string is in double quotes.
    $html = "<b>bold text</b><a href=howdy.html>click me</a>";

    preg_match_all("/(<([\w]+)[^>]*>)(.*?)(<\/\\2>)/", $html, $matches, PREG_SET_ORDER);

    foreach ($matches as $val) {
        echo "matched: " . $val[0] . "\n";
        echo "part 1: " . $val[1] . "\n";
        echo "part 2: " . $val[2] . "\n";
        echo "part 3: " . $val[3] . "\n";
        echo "part 4: " . $val[4] . "\n";
    }
    ?>

The above example will output:

    matched: <b>bold text</b>
    part 1: <b>
    part 2: b
    part 3: bold text
    part 4: </b>
    matched: <a href=howdy.html>click me</a>
    part 1: <a href=howdy.html>
    part 2: a
    part 3: click me
    part 4: </a>

Example #3 Using named subpattern

    <?php
    $str = <<<FOO
    a: 1
    b: 2
    c: 3
    FOO;

    preg_match_all('/(?P<name>\w+): (?P<digit>\d+)/', $str, $matches);

    /* This also works in PHP 5.2.2 (PCRE 7.0) and later, however
     * the above form is recommended for backwards compatibility */
    // preg_match_all('/(?<name>\w+): (?<digit>\d+)/', $str, $matches);

    print_r($matches);
    ?>

The above example will output:

    Array
    (
        [0] => Array
            (
                [0] => a: 1
                [1] => b: 2
                [2] => c: 3
            )

        [name] => Array
            (
                [0] => a
                [1] => b
                [2] => c
            )

        [1] => Array
            (
                [0] => a
                [1] => b
                [2] => c
            )

        [digit] => Array
            (
                [0] => 1
                [1] => 2
                [2] => 3
            )

        [2] => Array
            (
                [0] => 1
                [1] => 2
                [2] => 3
            )

    )

SEE ALSO
PCRE Patterns, preg_quote(3), preg_match(3), preg_replace(3), preg_split(3), preg_last_error(3).

PHP Documentation Group                                                  PREG_MATCH_ALL(3)