Block non-Latin characters with PHP

How to block characters from other scripts, such as Russian, Chinese, and Arabic, that most of your users cannot read.

Edited: 2013-04-24 20:46

You might want to allow only common characters in comments and other textual content that users submit to your website. This can easily be done with PHP and regular expressions.

To do this, we check which Unicode script each character belongs to. Every Unicode code point is assigned to a script, and a script is either used by a single human language or shared across several languages.
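As a rough illustration of how scripts map to characters, the sketch below tests whether a string contains any character from a named script. It assumes PHP 7 or later and a PCRE build with Unicode property support. The helper name contains_script() is hypothetical and not part of PHP, and the script name is inserted into the pattern unescaped, so it should only ever receive trusted, hard-coded names.

```php
<?php
// Hypothetical helper: reports whether $text contains at least one
// character from the given Unicode script (e.g. 'Cyrillic', 'Han').
// $script must be a trusted, hard-coded script name.
function contains_script(string $text, string $script): bool
{
    // \p{...} matches one code point by script; /u treats input as UTF-8.
    return preg_match('/\p{' . $script . '}/u', $text) === 1;
}

var_dump(contains_script('Hello world', 'Latin'));    // bool(true)
var_dump(contains_script('Привет', 'Cyrillic'));      // bool(true)
var_dump(contains_script('你好', 'Han'));              // bool(true)
var_dump(contains_script('Hello world', 'Cyrillic')); // bool(false)
```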

Why you may want to limit allowed characters

So why should you limit the characters that you allow, now that you may have gone through the trouble of switching to UTF-8? Well, some languages have their own characters, which are practically unreadable to most of your users. Besides, you are not really limiting the characters that your site can handle, just the ones that you allow users to submit.

Brugbart was under attack by Russian spam. All of this spam stopped when we started accepting only common characters (Latin letters, punctuation, whitespace, etc.). Since we only allow users to comment in English anyway, this was implemented with no loss on our part.

Only allow Latin characters

The check below looks for any character that falls outside the Common and Latin scripts:

// preg_match() returns 0 when no character outside Common/Latin is found.
if (preg_match('/[^\p{Common}\p{Latin}]/u', $_POST['text']) === 0) {
 // Post your data to the database.
}

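The /u modifier at the end turns on Unicode matching, so both the pattern and the subject are treated as UTF-8. This assumes that the string you are testing is valid UTF-8; if it is not, preg_match returns false instead of 0 or 1.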
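To see why the return value deserves a strict comparison, the short sketch below feeds a deliberately malformed byte sequence to the same pattern. The test string is an invented example. With /u, preg_match reports an error rather than "no match".

```php
<?php
// "\xC3\x28" is not a valid UTF-8 sequence (invented example input).
$invalid = "abc\xC3\x28";

$result = preg_match('/[^\p{Common}\p{Latin}]/u', $invalid);

var_dump($result);                                   // bool(false)
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```

Because false and 0 are both falsy, a loose check such as !preg_match(...) would treat malformed input as clean text, whereas comparing with === 0 does not.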

Inside the negated character class [^...], \p{Common} covers characters that are shared by many scripts, such as punctuation, whitespace, digits and symbols. The \p{Latin} part covers Latin characters, which includes the Danish letters (such as æ, ø and å) that we wanted to allow on Brugbart. Anything outside these two scripts makes the pattern match.
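Putting the pieces together, one possible way to package the check is a small function that can be reused wherever user text is accepted. This is only a sketch, assuming PHP 7 or later. The name is_latin_text() and the sample strings are made up for illustration.

```php
<?php
// Hypothetical helper: true only when $text is valid UTF-8 and every
// character belongs to the Common or Latin script.
function is_latin_text(string $text): bool
{
    // 1 = disallowed character found, 0 = none found, false = error.
    return preg_match('/[^\p{Common}\p{Latin}]/u', $text) === 0;
}

var_dump(is_latin_text('Rødgrød med fløde, 100%!')); // bool(true)
var_dump(is_latin_text('Купить дешево'));            // bool(false)
var_dump(is_latin_text('你好'));                      // bool(false)
```

Note that an empty string also passes, so a minimum length should be checked separately if one is required. Combining accents belong to the Inherited script, so text with decomposed accented letters would be rejected unless \p{Inherited} is added to the class.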

See also

  1. PHP Regular Expressions Tutorial
  2. List of Unicode Scripts