
Regular Expression to match Multi-Byte

Here's something I figured out today: how to match a UTF-8 multibyte character with a regular expression without enabling PCRE's Unicode support.

The task was to match a single character following the ✆ symbol. The initial approach would look like this:

preg_match('/✆./', $input, $match);

And that works fine for inputs like ✆A. But what if your input is ✆Ф? Then $match will not contain what you expected.

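That's because you didn't take UTF-8 into account. Without the /u modifier PCRE treats the input as a plain byte string, so a dot only matches a single byte. That Ф character is two bytes in UTF-8 (0xD0 0xA4), so $match ends up holding ✆ plus only the first byte of Ф.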
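To make the problem visible, here is a minimal sketch that dumps the matched bytes as hex. The variable name $phone is my own, and the characters are written as byte escapes so the result does not depend on how the file is saved.

```php
<?php
// "\xE2\x9C\x86" is the UTF-8 encoding of ✆ (U+2706).
$phone = "\xE2\x9C\x86";
// ✆Ф, where Ф (U+0424) is encoded as the two bytes D0 A4.
$input = $phone . "\xD0\xA4";

// Same pattern as above, without the /u modifier.
preg_match('/' . $phone . './', $input, $match);

// The dot consumed only the first byte of Ф.
echo bin2hex($match[0]), "\n"; // e29c86d0
echo strlen($match[0]), "\n";  // 4 (3 bytes for ✆ plus 1 stray byte)
```

The trailing d0 is half a character, which is why printing $match[0] shows garbage after the ✆.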

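The obvious solution is using PCRE's /u modifier, which makes PCRE treat both pattern and input as UTF-8. Here \X matches one extended grapheme cluster (a plain dot would match one code point):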

preg_match('/✆\X/u', $input, $match);

However, matching with the /u modifier is considerably slower. That doesn't matter for this simple example, but it did for my use case.
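If you want to check whether the difference matters for your own data, a rough timing sketch like this one can help. The input string, the iteration count and the labels 'bytes' and 'utf8' are arbitrary choices of mine, and the numbers will vary with the PHP and PCRE versions, so treat the output as a hint rather than a benchmark.

```php
<?php
$phone = "\xE2\x9C\x86"; // ✆ as UTF-8 bytes
// Some filler text followed by ✆Ф at the very end.
$input = str_repeat('lorem ipsum ', 1000) . $phone . "\xD0\xA4";

$patterns = [
    'bytes' => '/' . $phone . './',     // byte mode, dot matches one byte
    'utf8'  => '/' . $phone . '\X/u',   // UTF-8 mode
];

$matches = [];
foreach ($patterns as $name => $pattern) {
    $start = microtime(true);
    for ($i = 0; $i < 2000; $i++) {
        preg_match($pattern, $input, $match);
    }
    $elapsed = microtime(true) - $start;
    $matches[$name] = bin2hex($match[0]);
    printf("%-6s %.4fs  matched %s\n", $name, $elapsed, $matches[$name]);
}
```

Note that the two patterns do not return the same thing: the byte-mode pattern still cuts Ф in half, so this only compares speed, not correctness.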

But there is another way. We simply want to identify a single UTF-8 character. Its encoded length varies between 1 and 4 bytes. The first 128 code points (byte values 0 to 127) are encoded exactly like ASCII. A byte above 127 belongs to a multibyte sequence, and the first byte of such a sequence tells you how many bytes follow. Since we can match on the byte level, it should be possible to match a multibyte marker and the right amount of bytes following it.
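The byte ranges involved can be sketched as a small lookup. The function name utf8_sequence_length is hypothetical, made up for this illustration, and it assumes a non-empty string.

```php
<?php
// Hypothetical helper: number of bytes in the UTF-8 sequence that starts
// with the first byte of $char, or 0 if that byte cannot start a sequence.
function utf8_sequence_length(string $char): int
{
    $b = ord($char[0]);
    if ($b <= 0x7F) {
        return 1; // 0xxxxxxx: same as ASCII
    }
    if ($b >= 0xC2 && $b <= 0xDF) {
        return 2; // 110xxxxx
    }
    if ($b >= 0xE0 && $b <= 0xEF) {
        return 3; // 1110xxxx
    }
    if ($b >= 0xF0 && $b <= 0xF4) {
        return 4; // 11110xxx
    }
    return 0; // continuation byte (0x80-0xBF) or invalid lead byte
}

// A, Ф, ✆ and 😀 written as UTF-8 byte escapes.
foreach (["A", "\xD0\xA4", "\xE2\x9C\x86", "\xF0\x9F\x98\x80"] as $char) {
    echo bin2hex($char), ' => ', utf8_sequence_length($char), "\n";
}
// 41 => 1
// d0a4 => 2
// e29c86 => 3
// f09f9880 => 4
```

The lead bytes 0xC0, 0xC1 and 0xF5 to 0xFF never appear in valid UTF-8, which is why the ranges start at 0xC2 and end at 0xF4.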

This is what I came up with:

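// A lead byte followed by the right number of continuation bytes
$multi2 = '(?:[\xC2-\xDF].)';   // 2-byte sequence
$multi3 = '(?:[\xE0-\xEF]..)';  // 3-byte sequence
$multi4 = '(?:[\xF0-\xF4]...)'; // 4-byte sequence
$latin = '[0-9A-Za-z]';         // ASCII letters and digits only
$anychar = "(?:$multi4|$multi3|$multi2|$latin)";

preg_match("/✆$anychar/", $input, $match);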

Seems to work fine in my limited testing so far.
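For anyone who wants to repeat that kind of testing, here is a minimal sketch that runs the pattern against one sample character of each length. The sample characters and the $results array are my own additions, not part of the original code.

```php
<?php
$multi2  = '(?:[\xC2-\xDF].)';
$multi3  = '(?:[\xE0-\xEF]..)';
$multi4  = '(?:[\xF0-\xF4]...)';
$latin   = '[0-9A-Za-z]';
$anychar = "(?:$multi4|$multi3|$multi2|$latin)";
$phone   = "\xE2\x9C\x86"; // ✆ as UTF-8 bytes

$samples = [
    "A",                // 1 byte
    "\xD0\xA4",         // Ф, 2 bytes
    "\xE2\x82\xAC",     // €, 3 bytes
    "\xF0\x9F\x98\x80", // 😀, 4 bytes
];

$results = [];
foreach ($samples as $char) {
    // The trailing "xyz" checks that the match stops after one character.
    preg_match("/$phone$anychar/", $phone . $char . 'xyz', $match);
    $ok = ($match[0] === $phone . $char);
    $results[] = $ok;
    echo bin2hex($char), ': ', $ok ? 'ok' : 'FAIL', "\n";
}
```

One limitation to keep in mind: $latin only covers ASCII letters and digits, so an input like ✆! will not match at all.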

Update: as Bruno points out in the comments, we can simplify this further by taking into account that the follow-up bytes of UTF-8 characters are always in the 128-191 range (0x80-0xBF). So we can simply match any single byte followed by zero or more bytes in that range. No need to “count” the follower bytes. Nice.

preg_match("/✆.[\x80-\xbf]*/", $input, $match);
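A quick sketch to see the simplified pattern at work on a 4-byte and a 1-byte character. It assumes the input is valid UTF-8, and the variable names are mine.

```php
<?php
$phone = "\xE2\x9C\x86"; // ✆ as UTF-8 bytes
// Single-quoted, so PCRE (not PHP) interprets \x80 and \xBF.
$pattern = '/' . $phone . '.[\x80-\xBF]*/';

// ✆ followed by 😀 (F0 9F 98 80) and some ASCII text.
preg_match($pattern, $phone . "\xF0\x9F\x98\x80" . 'abc', $emoji);
echo bin2hex($emoji[0]), "\n"; // e29c86f09f9880

// ✆ followed by plain ASCII: the continuation class matches zero bytes.
preg_match($pattern, $phone . 'Abc', $ascii);
echo bin2hex($ascii[0]), "\n"; // e29c8641
```

Two things this pattern does not handle: the dot will not match a newline unless you add the /s modifier, and on invalid input it will happily swallow stray continuation bytes.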

Tags: php, regex, regexp, pcre, utf-8, programming

splitbrain.org
