Question
For a general website built with PHP and MySQL, what character set and collation should be used when you are not fully sure what users may enter?
I understand that encoding should be consistent across the stack, including MySQL, Apache or HTTP headers, HTML, and PHP output. In the past, I have configured PHP to output UTF-8, but I am unsure which MySQL collation corresponds to that.
I have previously used collations such as:
utf8_unicode_ci
utf8_general_ci
utf8_bin
Which MySQL collation best matches UTF-8 output from PHP, and what is the most appropriate choice for a typical website?
Short Answer
By the end of this page, you will understand the difference between a character set and a collation in MySQL, how they relate to PHP UTF-8 output, and how to choose a sensible default for a typical web application. You will also learn why utf8mb4 is usually the safest modern choice, and when to use case-insensitive versus binary collations.
Concept
A common source of confusion is that UTF-8 and collation are not the same thing.
- A character set defines how text is stored.
- A collation defines how text is compared and sorted.
In MySQL:
- utf8 or utf8mb4 = character sets
- utf8_general_ci, utf8_unicode_ci, utf8_bin = collations
If PHP outputs UTF-8, that does not map to one single MySQL collation. It only means your application is sending text encoded in UTF-8. MySQL still needs:
- a character set to store the bytes
- a collation to decide how string comparisons behave
For example, a collation controls questions like:
- Should A equal a?
- Should accented letters be treated similarly when sorting?
- How should words be ordered alphabetically?
This matters in real programming because collations affect:
- ORDER BY
- GROUP BY
- equality checks in WHERE
- unique indexes on text columns
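As a rough illustration of why comparison rules matter for ordering, the PHP sketch below sorts the same list with two different comparison functions. It is only an analogy: strcmp stands in for a binary collation and strcasecmp for a case-insensitive one, and MySQL's real collation rules are more involved than either. The sample words are invented for the example.

```php
<?php
// Sample data invented for this illustration.
$words = ['apple', 'Banana', 'cherry'];

// Byte-wise ordering, loosely comparable to a _bin collation:
// uppercase ASCII letters sort before lowercase ones.
$binaryOrder = $words;
usort($binaryOrder, 'strcmp');

// ASCII case-insensitive ordering, loosely comparable to a _ci collation.
$caseInsensitiveOrder = $words;
usort($caseInsensitiveOrder, 'strcasecmp');

echo implode(', ', $binaryOrder), PHP_EOL;          // Banana, apple, cherry
echo implode(', ', $caseInsensitiveOrder), PHP_EOL; // apple, Banana, cherry
```

The same stored values come back in a different order depending on the rulebook, which is the effect a collation has on ORDER BY.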
Mental Model
Think of text handling like a library system:
- The character set is the alphabet the library knows how to store.
- The collation is the rulebook for how books are alphabetized.
Two libraries may store the same letters, but sort them differently.
In the same way, PHP outputting UTF-8 means your app is speaking a text language MySQL can store, but MySQL still needs instructions for how to compare words.
So:
- UTF-8 = the letters available
- collation = the sorting and comparison rules
Syntax and Examples
The core idea is to define both the character set and the collation explicitly.
Creating a database
CREATE DATABASE app_db
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
Creating a table
CREATE TABLE users (
    id INT PRIMARY KEY AUTO_INCREMENT,
    username VARCHAR(50) NOT NULL,
    display_name VARCHAR(100) NOT NULL
)
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci;
Setting the connection character set from PHP
Using PDO:
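<?php
// charset=utf8mb4 in the DSN sets the connection character set.
$pdo = new PDO(
    'mysql:host=localhost;dbname=app_db;charset=utf8mb4',
    'username',
    'password',
    [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,      // throw exceptions on SQL errors
        PDO::ATTR_DEFAULT_FETCH_MODE => PDO::FETCH_ASSOC, // return rows as associative arrays
    ]
);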
Column-level collation when needed
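One possible shape for a column-level override is sketched below, kept as a PHP string so it can be passed to an existing PDO connection. The api_tokens table and its columns are hypothetical and exist only for this example: most columns inherit the table default, while token overrides it with a binary collation.

```php
<?php
// Hypothetical table: `label` inherits the table default collation,
// while `token` overrides it with utf8mb4_bin for exact matching.
$createTokensTable = <<<'SQL'
CREATE TABLE api_tokens (
    id INT PRIMARY KEY AUTO_INCREMENT,
    label VARCHAR(100) NOT NULL,
    token VARCHAR(64) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL
)
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci
SQL;

// With a live connection this would be run as: $pdo->exec($createTokensTable);
echo $createTokensTable, PHP_EOL;
```

Keeping the override on a single column means the rest of the table still gets case-insensitive, Unicode-aware comparisons.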
Step by Step Execution
Consider this table:
CREATE TABLE demo (
    word VARCHAR(20)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
And these rows:
INSERT INTO demo (word) VALUES
('apple'),
('Apple'),
('ápple');
Now run this query:
SELECT * FROM demo WHERE word = 'apple';
What happens step by step
- MySQL reads the value 'apple' using the connection character set.
- The word column is stored using utf8mb4.
- The column collation is utf8mb4_unicode_ci.
- The _ci part means case-insensitive comparison.
- MySQL compares 'apple' with each stored value according to that collation.
- Because utf8mb4_unicode_ci ignores both case and accents, all three rows ('apple', 'Apple', and 'ápple') match.
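To make the difference between exact and case-insensitive matching concrete, here is a small PHP sketch that filters the same three demo values in memory. It is an analogy rather than a reproduction of MySQL's behavior: strcasecmp folds ASCII case only, so unlike utf8mb4_unicode_ci it does not treat á as a. The variable names are illustrative, and the arrow functions assume PHP 7.4 or later.

```php
<?php
// The same three values inserted into the demo table above.
// "\u{E1}" is the UTF-8 encoding of the letter a with an acute accent.
$rows = ['apple', 'Apple', "\u{E1}pple"];

// Exact, byte-for-byte matching, similar in spirit to a _bin collation.
$binaryMatches = array_values(
    array_filter($rows, fn (string $word): bool => $word === 'apple')
);

// ASCII case-insensitive matching, a loose stand-in for the "ci" in _ci.
$asciiCiMatches = array_values(
    array_filter($rows, fn (string $word): bool => strcasecmp($word, 'apple') === 0)
);

echo implode(', ', $binaryMatches), PHP_EOL;  // apple
echo implode(', ', $asciiCiMatches), PHP_EOL; // apple, Apple
```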
Real World Use Cases
General website content
For usernames, article titles, comments, and profile names, developers usually want text stored in full Unicode and compared in a human-friendly way.
Typical choice:
utf8mb4_unicode_ci
Exact identifiers
For API keys, hashes, case-sensitive codes, or internal identifiers, exact comparison matters more than friendly sorting.
Typical choice:
utf8mb4_bin
Search and filtering
If users search for names or titles, a case-insensitive collation usually returns results closer to what they expect, because a search for "alice" also finds "Alice".
International applications
If your app stores multiple languages, Unicode-aware collations help sorting and comparisons behave more consistently than older simplified collations.
E-commerce
- product names: Unicode-aware collation
- SKUs: binary or exact matching collation
- customer names: Unicode-aware collation
Real Codebase Usage
In real projects, developers usually set defaults at multiple levels so text handling is consistent.
Common pattern: set database and table defaults
- database default character set: utf8mb4
- database default collation: Unicode-aware collation
- override at column level only when necessary
Common pattern: set connection charset in PHP
A very common mistake is storing UTF-8 data in UTF-8 tables but forgetting to tell the MySQL connection to use UTF-8. In PDO, developers typically use:
$pdo = new PDO('mysql:host=localhost;dbname=app_db;charset=utf8mb4', 'user', 'pass');
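One way to make the charset hard to forget is to build the DSN in a single place. The helper below is a minimal sketch under that assumption; buildMysqlDsn is a hypothetical name, not part of PDO, and the host and database values are the same placeholders used above.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that always names a charset,
// defaulting to utf8mb4 so callers cannot omit it by accident.
function buildMysqlDsn(string $host, string $dbName, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbName, $charset);
}

$dsn = buildMysqlDsn('localhost', 'app_db');
echo $dsn, PHP_EOL; // mysql:host=localhost;dbname=app_db;charset=utf8mb4

// With a reachable server, the DSN would be passed to PDO as usual:
// $pdo = new PDO($dsn, 'user', 'pass');
```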
Common pattern: binary for exact fields
Real codebases often use binary collations for:
- tokens
- invitation codes
- case-sensitive login identifiers
- checksum-like values
Validation and guard clauses
Applications often validate user input before insert, but they still rely on the database collation for consistent comparison behavior.
Example:
if ($username === '') {
    throw new InvalidArgumentException('Username is required.');
}
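A guard clause like this can also reject input that is not valid UTF-8 before it reaches the database. The sketch below is one way to do that, not a required pattern: validateUsername and its messages are hypothetical, and it relies on preg_match with the u modifier failing on malformed UTF-8.

```php
<?php
// Hypothetical validator; the function name and messages are illustrative.
function validateUsername(string $username): string
{
    if ($username === '') {
        throw new InvalidArgumentException('Username is required.');
    }

    // preg_match with the /u modifier fails on malformed UTF-8 input,
    // so any result other than 1 means the bytes are not valid UTF-8.
    if (preg_match('//u', $username) !== 1) {
        throw new InvalidArgumentException('Username must be valid UTF-8.');
    }

    return $username;
}

echo validateUsername("Jos\u{E9}"), PHP_EOL; // José
```

Validation of this kind only checks the encoding. Whether two usernames count as equal is still decided by the column's collation.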
Common Mistakes
1. Confusing character set with collation
Broken assumption:
UTF-8 output from PHP means MySQL automatically knows the correct collation.
Why it is wrong:
- UTF-8 is a character encoding
- collation controls sorting and comparison
2. Using MySQL utf8 and assuming it means full UTF-8
Broken setup:
CHARACTER SET utf8
Why to avoid it:
- MySQL's older utf8 (also known as utf8mb3) stores at most three bytes per character
- it cannot store four-byte characters, such as most emoji
Prefer:
CHARACTER SET utf8mb4
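The byte widths behind this limit can be checked from PHP. The sketch below measures a few sample characters and flags text that would need four bytes per character; fitsInThreeByteUtf8 is a hypothetical helper name, and it assumes its input is already valid UTF-8.

```php
<?php
// Returns true when every character encodes to at most three UTF-8 bytes,
// which is the most that MySQL's legacy utf8 character set can store.
// Hypothetical helper; assumes the input is valid UTF-8.
function fitsInThreeByteUtf8(string $text): bool
{
    // Code points U+10000 and above need four bytes in UTF-8.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $text) === 0;
}

$samples = [
    'latin a'  => 'a',         // 1 byte
    'e acute'  => "\u{E9}",    // 2 bytes
    'CJK char' => "\u{674E}",  // 3 bytes
    'emoji'    => "\u{1F600}", // 4 bytes
];

foreach ($samples as $label => $char) {
    printf(
        "%s: %d byte(s), fits in legacy utf8: %s\n",
        $label,
        strlen($char),
        fitsInThreeByteUtf8($char) ? 'yes' : 'no'
    );
}
```

Only the emoji sample fails the check, which is why a utf8mb4 column is needed as soon as users can enter such characters.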
3. Forgetting the connection charset
Broken PHP example:
<?php
$pdo = new PDO('mysql:host=localhost;dbname=app_db', 'user', 'pass');
Better:
<?php
$pdo = new PDO('mysql:host=localhost;dbname=app_db;charset=utf8mb4', 'user', 'pass');
Comparisons
| Option | Type | Meaning | Good for | Notes |
|---|---|---|---|---|
| utf8mb4 | Character set | Full Unicode storage | Modern applications | Preferred over MySQL utf8 |
| utf8 | Character set | Older limited UTF-8 support | Legacy systems | Usually avoid for new work |
| utf8mb4_unicode_ci | Collation | Unicode-aware, case-insensitive | General website text | Good practical default |
| utf8mb4_bin | Collation | Binary, exact comparison | Tokens, hashes, case-sensitive identifiers | Use only where exact matching is required |
Cheat Sheet
- Character set = how text is stored
- Collation = how text is compared and sorted
- PHP outputting UTF-8 does not choose a MySQL collation automatically
- For new projects, prefer:
CHARACTER SET utf8mb4
COLLATE utf8mb4_unicode_ci
- Use utf8mb4_bin for exact matching fields
- Avoid relying on MySQL's old utf8 for new applications
- Set the connection charset in PHP:
$pdo = new PDO('mysql:host=localhost;dbname=app_db;charset=utf8mb4', 'user', 'pass');
- _ci means case-insensitive
- _bin means binary comparison
- Collation affects: WHERE, ORDER BY, GROUP BY, UNIQUE constraints
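Quick default rule: use utf8mb4 with utf8mb4_unicode_ci for general text, and utf8mb4_bin only for fields that must match exactly.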
FAQ
Does PHP UTF-8 output correspond to one specific MySQL collation?
No. UTF-8 output only describes the text encoding. MySQL still needs a character set and a collation defined separately.
What is the difference between utf8 and utf8mb4 in MySQL?
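utf8mb4 supports full Unicode, using up to four bytes per character. MySQL's older utf8 is limited to three bytes per character, so it cannot store characters such as most emoji and should usually be avoided for new projects.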
Which MySQL collation is best for a general website?
For most modern websites, a Unicode-aware case-insensitive collation on utf8mb4 is a good default, such as utf8mb4_unicode_ci.
When should I use a binary collation?
Use a binary collation for values that must match exactly, such as tokens, hashes, codes, or other case-sensitive identifiers.
Does collation affect unique usernames?
Yes. With a case-insensitive collation, Alice and alice may be treated as the same value in a unique column.
Do I need to set UTF-8 in both PHP and MySQL?
Yes. Your HTML or HTTP output, PHP database connection, and MySQL schema should all be configured consistently.
Is utf8_general_ci still a good default?
It is mostly a legacy choice. Unicode-aware collations are usually a better default for general text.
Mini Project
Description
Build a small PHP and MySQL example for storing user profiles safely with full Unicode support. The project demonstrates how to configure a database, table, and PHP connection so names like José, 李雷, and emoji-containing bios can be stored correctly while usernames are compared in a predictable way.
Goal
Create a PHP script and MySQL schema that use utf8mb4 correctly and apply a sensible collation strategy for user-facing text.
Requirements
- Create a MySQL database using utf8mb4 and a Unicode-aware collation.
- Create a users table with username, display_name, and bio columns.
- Connect to MySQL from PHP using PDO with charset=utf8mb4.
charset=utf8mb4. - Insert at least one row containing non-ASCII characters.
- Query and display the saved data correctly.