Mbstring and PHP: A Must-Use in Web Application Development

Bits and bytes are two units for storing information. A bit can be thought of as a single slot that holds one of two values: 0 or 1.

A byte is a group of eight bits. Mathematically, a byte can represent 2^8 = 256 different values.

Let’s think about a language, say English. Its characters (a, b, c, and so on) are represented in a computer by bytes. English uses far fewer than 256 characters, so every character can be represented by a different 8-bit sequence.

[Figure: Bytes and Bits]

Strings are simply sequences of characters. By default, PHP's string functions operate on strings of single-byte characters. For example, you may want to compare the strings “Hello” and “Hi”. With strcmp(), the two strings are compared on the assumption that every character takes one byte.

But think about a language that has more than 256 characters (Japanese, for example), or a text that mixes characters from several languages. One byte of storage per character is not enough. This is where the multibyte concept comes in.

With a string of Japanese text, single-byte functions such as strcmp() or strlen() may return wrong or meaningless values, because the assumption that one byte represents one character no longer holds. Multibyte-encoded strings need special functions rather than the common single-byte string functions. In PHP, the mbstring extension provides these multibyte-specific string functions.
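The mismatch is easy to see by counting. The following is a minimal sketch, assuming the mbstring extension is loaded and PHP 7 or later (for the \u{...} escape syntax); the sample word is only an illustration.

```php
<?php
// "日本語" written with Unicode escapes, so the example does not depend
// on the encoding in which this file is saved.
$text = "\u{65E5}\u{672C}\u{8A9E}";

$byteCount = strlen($text);             // counts bytes
$charCount = mb_strlen($text, 'UTF-8'); // counts characters

echo $byteCount, "\n"; // 9, three bytes per character in UTF-8
echo $charCount, "\n"; // 3
```

The single-byte function reports the storage size, while the multibyte function reports what a reader would call the length of the word.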

Understanding UTF-8

UTF stands for Unicode Transformation Format, a family of encodings for Unicode, which aims to represent every character of every language in one character set. There are several UTF encodings, some of which are shown below:

Encoding Format   Description
UTF-1             Compatible with ISO-2022; obsolete, removed from the Unicode Standard.
UTF-7             7-bit encoding, mainly used in e-mail; not part of the Unicode Standard.
UTF-8             8-bit encoding, variable-width, ASCII-compatible.
UTF-EBCDIC        8-bit encoding, variable-width, EBCDIC-compatible.
UTF-16            16-bit encoding, variable-width.
UTF-32            32-bit encoding, fixed-width.

We find ourselves using UTF-8 most of the time when working with multibyte text, so let’s focus on that for a moment. UTF-8 encodes each character in one or more bytes using the following scheme:

[Figure: Design of UTF-8]

So, how does a decoder know whether a character is stored in one byte or in several? It looks at the high-order bits of the first byte.

Code       Meaning
0xxxxxxx   A single-byte code
110xxxxx   One more byte follows this byte
1110xxxx   Two more bytes follow this byte
11110xxx   Three more bytes follow this byte
111110xx   Four more bytes follow this byte
1111110x   Five more bytes follow this byte
10xxxxxx   Continuation of a multibyte character

Each continuation byte in a multibyte sequence starts with 1 and 0 in its two highest-order bits, which provides a way to detect corrupt data.
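The lead-byte rules can be sketched as a small function. This is only an illustration: utf8SequenceLength() is a hypothetical helper, not part of mbstring, and it follows the table as printed here, including the five- and six-byte forms that current UTF-8 (RFC 3629) no longer permits.

```php
<?php
/**
 * Hypothetical helper: classify a byte by its high-order bits.
 * Returns the total length of the sequence that the byte starts,
 * or 0 if the byte is a continuation byte or is never valid.
 */
function utf8SequenceLength(int $byte): int
{
    if (($byte & 0x80) === 0x00) return 1; // 0xxxxxxx
    if (($byte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($byte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($byte & 0xF8) === 0xF0) return 4; // 11110xxx
    if (($byte & 0xFC) === 0xF8) return 5; // 111110xx (legacy form)
    if (($byte & 0xFE) === 0xFC) return 6; // 1111110x (legacy form)
    return 0; // 10xxxxxx continuation byte, or 0xFE / 0xFF
}

$char = "\u{5927}";    // "大", stored as the bytes E5 A4 A7
$lead = ord($char[0]); // 0xE5

echo utf8SequenceLength($lead), "\n";    // 3
echo utf8SequenceLength(ord('A')), "\n"; // 1
echo utf8SequenceLength(0xA4), "\n";     // 0, a continuation byte
```

A decoder that reads a lead byte this way knows how many continuation bytes to expect, and can flag the data as corrupt if the following bytes do not start with 10.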

Multibyte Equivalents of Common String Functions

For commonly used string functions such as strlen(), strpos(), and substr(), there are multibyte equivalents. You should use these equivalents when working with multibyte strings.

Table 4: Single-byte string functions and their multibyte equivalents

Single byte      Multibyte           Description
strlen()         mb_strlen()         Get string length
strpos()         mb_strpos()         Find position of first occurrence of a string in a string
substr()         mb_substr()         Return part of a string
strtolower()     mb_strtolower()     Make a string lowercase
strtoupper()     mb_strtoupper()     Make a string uppercase
substr_count()   mb_substr_count()   Count the number of substring occurrences
split()          mb_split()          Split string into array by regular expression
mail()           mb_send_mail()      Send encoded mail
ereg()           mb_ereg()           Regular expression match
eregi()          mb_eregi()          Case-insensitive regular expression match

Note that split(), ereg(), and eregi() were removed in PHP 7.0; their mb_ counterparts are still available.

Here is an example of a multibyte function:
  • Function name: int mb_strlen ( string $str [, string $encoding ] )
  • Description: Get the string length.
  • Parameters:
      str (the input string whose length should be determined)
      encoding (the character encoding)
  • Return value: the number of characters in the input string str, interpreted in the character encoding encoding
  • Return type: int

Example code: Here is an example of how to use the mb_strlen() function. The input string is a Chinese word, and its length is measured under three different character encodings.

// Assumes this file is saved as UTF-8.
$str = "大大";
echo mb_strlen($str, 'UTF-8') . "\n";  // 2
// The same bytes interpreted as GBK or GB2312 may give a different count.
echo mb_strlen($str, 'GBK') . "\n";
echo mb_strlen($str, 'GB2312') . "\n";
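The other functions in Table 4 follow the same pattern. The sketch below compares a few single-byte calls with their multibyte counterparts; it assumes the mbstring extension is loaded and PHP 7 or later, and the sample text is just an illustration.

```php
<?php
$text = "na\u{00EF}ve caf\u{00E9}"; // "naïve café" in UTF-8

$byteSlice = substr($text, 0, 5);             // first 5 bytes
$charSlice = mb_substr($text, 0, 5, 'UTF-8'); // first 5 characters

$bytePos = strpos($text, 'caf');                // offset in bytes
$charPos = mb_strpos($text, 'caf', 0, 'UTF-8'); // offset in characters

$upper = mb_strtoupper($text, 'UTF-8');

echo $byteSlice, "\n"; // "naïv", because ï occupies two of the five bytes
echo $charSlice, "\n"; // "naïve"
echo $bytePos, "\n";   // 7
echo $charPos, "\n";   // 6
echo $upper, "\n";     // "NAÏVE CAFÉ"
```

Mixing byte offsets from strpos() with character-based functions such as mb_substr() is a common source of off-by-some errors, so it helps to stay within one family of functions.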

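Constraints: UTF-8 has some constraints, such as:

  • In the original design, a UTF-8 encoded character could theoretically be up to six bytes long; RFC 3629 later restricted UTF-8 to at most four bytes per character.
  • The bytes 0xFE and 0xFF are never used in this encoding.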
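These constraints make it possible to validate input. A minimal sketch using mb_check_encoding(), assuming the mbstring extension is loaded; the sample byte strings are made up for illustration.

```php
<?php
$valid  = "\xE5\xA4\xA7"; // UTF-8 bytes of "大"
$hasFf  = "abc\xFF";      // contains 0xFF, which never occurs in UTF-8
$cutOff = "\xE5\xA4";     // lead byte missing one of its two continuation bytes

var_dump(mb_check_encoding($valid, 'UTF-8'));  // bool(true)
var_dump(mb_check_encoding($hasFf, 'UTF-8'));  // bool(false)
var_dump(mb_check_encoding($cutOff, 'UTF-8')); // bool(false)
```

Checking user input this way before storing or processing it is one way to avoid carrying corrupt sequences through an application.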

Enable mbstring from php.ini (on Windows):

  • Confirm that php_mbstring.dll exists in the ext folder.
  • Uncomment ;extension=php_mbstring.dll in php.ini (i.e. change it to extension=php_mbstring.dll). On PHP 7.2 and later the line is typically written as extension=mbstring.
  • Restart the server.

Runtime configuration: Some mbstring features also depend on additional settings.

Table 5: Configurations in php.ini

Name                            Default Value   Changeable
mbstring.language               neutral         PHP_INI_SYSTEM | PHP_INI_PERDIR
mbstring.detect_order           NULL            PHP_INI_ALL
mbstring.http_input             pass            PHP_INI_ALL
mbstring.http_output            pass            PHP_INI_ALL
mbstring.internal_encoding      NULL            PHP_INI_ALL
mbstring.script_encoding        NULL            PHP_INI_ALL
mbstring.substitute_character   NULL            PHP_INI_ALL
mbstring.func_overload          0               PHP_INI_SYSTEM | PHP_INI_PERDIR
mbstring.encoding_translation   0               PHP_INI_SYSTEM | PHP_INI_PERDIR

Explanation of the configuration options:

The “Changeable” column gives the change mode of each option, which describes how and from where the option can be set. The mode values have the following meanings:

Table 6: Change modes

Mode             Meaning
PHP_INI_SYSTEM   The entry can be set in php.ini or httpd.conf
PHP_INI_PERDIR   The entry can be set in php.ini, .htaccess, httpd.conf, or .user.ini
PHP_INI_ALL      The entry can be set anywhere
PHP_INI_USER     The entry can be set in user scripts
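The change mode of each option can also be inspected at run time. The sketch below uses ini_get_all(), which reports a numeric access level per setting; describeAccess() is a hypothetical helper that maps that number back to the mode names in Table 6. The values printed depend on the PHP version, so no output is shown.

```php
<?php
// Hypothetical helper: translate the numeric access level reported by
// ini_get_all() into the mode names used in Table 6.
function describeAccess(int $access): string
{
    if ($access === INI_ALL) {
        return 'PHP_INI_ALL';
    }
    $names = [];
    if ($access & INI_SYSTEM) { $names[] = 'PHP_INI_SYSTEM'; }
    if ($access & INI_PERDIR) { $names[] = 'PHP_INI_PERDIR'; }
    if ($access & INI_USER)   { $names[] = 'PHP_INI_USER'; }
    return implode(' | ', $names);
}

// Only query the settings if the extension is actually loaded.
$settings = extension_loaded('mbstring') ? ini_get_all('mbstring') : [];

foreach ($settings as $name => $info) {
    echo $name, ' => ', describeAccess($info['access']), "\n";
}
```

Running this on the target server is a quick way to confirm which options can be changed from a script and which require editing php.ini.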

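How to change from a user script:

We can use the following code to set the internal encoding of mbstring from a user script. Note that the mbstring.internal_encoding setting has been deprecated since PHP 5.6, so newer PHP versions report a deprecation notice for it.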

<?php
ini_set('mbstring.internal_encoding', 'UTF-8');
?>
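On current PHP versions, the same effect is usually achieved with the mb_internal_encoding() function instead of the ini setting. A minimal sketch, assuming the mbstring extension is loaded:

```php
<?php
// Set the default encoding that mbstring functions use when no explicit
// encoding argument is passed.
mb_internal_encoding('UTF-8');

echo mb_internal_encoding(), "\n";        // UTF-8
echo mb_strlen("\u{5927}\u{5927}"), "\n"; // 2, uses the internal encoding
```

Passing the encoding explicitly to each call, as in the mb_strlen() example earlier, remains the most predictable option when code may run under different configurations.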

How to change from php.ini:

We can edit the php.ini file to set some mbstring options.

; Set default language
mbstring.language = Neutral      ; Neutral (UTF-8), the default
; mbstring.language = English    ; alternative: set default language to English

; Enable HTTP input encoding translation
mbstring.encoding_translation = On

; Set default HTTP input character encoding
mbstring.http_input = pass       ; no conversion
; mbstring.http_input = auto     ; alternative: set HTTP input to auto

Some issues related to mbstring:

Using mbstring functions can sometimes cause trouble. Here I will discuss a problem with multibyte function overloading. Consider the following scenario.

You have enabled the mbstring.func_overload option in your php.ini file, so the single-byte string functions are overloaded by their multibyte equivalents, and your own code works fine. But what happens if you need an external library that relies heavily on the single-byte behavior of those string functions? (Note that mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0, so this scenario applies to older PHP versions.)

There is a partial solution to this problem: you can switch mbstring.internal_encoding. Before calling the external library, set a single-byte internal encoding; when control returns to your project, switch back to the multibyte encoding. But what happens if the library calls back into your project code? The callback runs under the single-byte encoding, and the approach fails.

So you have to keep these issues in mind when using mbstring options.
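The switching approach can be sketched as a small wrapper. This is an illustration only: withInternalEncoding() is a hypothetical helper, the closure stands in for a call into an external library, and the sketch uses mb_internal_encoding() rather than function overloading so that it also runs on current PHP versions.

```php
<?php
/**
 * Hypothetical helper: run $callback with a temporary internal encoding,
 * then restore the previous one, even if the callback throws.
 */
function withInternalEncoding(string $encoding, callable $callback)
{
    $previous = mb_internal_encoding();
    mb_internal_encoding($encoding);
    try {
        return $callback();
    } finally {
        mb_internal_encoding($previous);
    }
}

mb_internal_encoding('UTF-8');
$text = "\u{5927}\u{5927}"; // "大大"

// Stand-in for a library call that expects byte semantics.
$asBytes = withInternalEncoding('8bit', function () use ($text) {
    return mb_strlen($text);
});

echo $asBytes, "\n";         // 6, each byte counted separately
echo mb_strlen($text), "\n"; // 2, UTF-8 is active again
```

The limitation described above is visible here too: any project code invoked from inside the closure would run under the temporary encoding, not under UTF-8.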

Importance of mbstring for web development:

To develop an international web application, mbstring is a must. Otherwise your application will be limited to certain countries and languages. As a developer, I suggest you build up some knowledge of this area to become a more effective web programmer.
