MySQL Duplicate Entry, But Not Really

I chased an interesting bug today. MySQL was complaining about “Integrity constraint violation: 1062 Duplicate entry”. I had all the necessary safeguards in my code to prevent duplicates in tha column.

I gave up on logic and simply dumped the contents of the problematic column for every record. I found that there was a record with and without an accent on one of the characters. PHP saw each as a unique value, but MySQL did not make a distinction, which is why it complained about a duplicate value. It’s a good thing too, because based on my goal, these should have been treated as duplicates.

My solution was to substitute accented characters before filtering duplicates in the code. This way, similar records were rejected before they were sent to the database.

Happy coding!

[Edit 2016-08-26] As pointed out in the comments, a more robust and versatile solution would be to check the collation on the column. Thanks!

8 thoughts on “MySQL Duplicate Entry, But Not Really

  1. This depends on your choice of collation and it’s important to be aware of this.

    mysql> SELECT “ö” = “o” COLLATE utf8_bin;
    +—————————–+
    | “ö” = “o” COLLATE utf8_bin |
    +—————————–+
    | 0 |
    +—————————–+
    1 row in set (0.00 sec)

    mysql> SELECT “ö” = “o” COLLATE utf8_unicode_ci;
    +————————————+
    | “ö” = “o” COLLATE utf8_unicode_ci |
    +————————————+
    | 1 |
    +————————————+
    1 row in set (0.00 sec)

    mysql> SELECT “ö” = “oe” COLLATE utf8_unicode_ci;
    +————————————-+
    | “ö” = “oe” COLLATE utf8_unicode_ci |
    +————————————-+
    | 0 |
    +————————————-+
    1 row in set (0.01 sec)

    mysql> SELECT “ö” = “oe” COLLATE utf8_german2_ci;
    +————————————-+
    | “ö” = “oe” COLLATE utf8_german2_ci |
    +————————————-+
    | 1 |
    +————————————-+
    1 row in set (0.00 sec)

    • Good point. I was just checking that out actually. I’m using utf8_unicode_ci. Which one do you use and in which scenario? Or perhaps you can point me to a good article on the matter. Thanks!

  2. Kine

    Indexes use collated values. Unique constraints use indexes. This is the standard SQL behavior, it’s not specific to MySQL. See Postgresql’s documentation for another example. Any good book teaching MySQL or SQL will state this. Note that this behavior is the only sane one, since all comparisons on a column are expected to apply the column’s collation by default: the unique constraint is no exception.

    You’ll find explanation of the collations utf8_general, utf8_unicode, utf8mb4_unicode… in the official MySQL documentation. Note the latter is the recommended one, especially if you handle multi-lingual texts. The “ci” suffix is of course for “case independant”.

    You really shouldn’t use two different ways of validating a unique constraint. Will you still get a crash when your PHP code considers the value is unique while MySQL doesn’t? When you “remove the accents”, how do you handle “ß”, “æ”, “ţ”, “É” or “ff” (ligature)? Do you normalize the Unicode input? Either remove the constraint on this column, or use an INSERT to check if the value exists, or simply handle the successfulness of INSERT (you do check the return value, of course?).

    • shouldn’t use two different ways of validating

      I save data in large batches so filtering out duplicates before sending them to the DB is desirable in my case. Single inserts would kill performance, which I can’t afford. It’s always a tradeoff based on what is more important, but that’s a different topic.

      how do you handle “ß”, “æ”, “ţ”, “É” or “ff”

      Where I live, I only need to deal with English and occasionally French. However, with the information gathered in the comments, I would solve that problem through appropriate collations rather than normalizing the input. This is literally the first time that I encountered this issue since 1997.

  3. youp in some scenario for me its compeltly different value “choj” != “chój” and i need see the differents on database level. So push coders to skip accents on code level its not best idea and when u code in international apps cause u’ll have probably big problem…

    • Right. As I mentioned in the post, my scenario made it perfectly valid to remove the accents and this is the solution I used on short notice. This was just a small utility I wrote for internal needs.

      So yes, in an application with languages other than English and French, I would definitely check my collation.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes:

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">