Monday, July 15, 2013

EPFImporter with error [WARNING]: Incorrect string value: '\xE6\xB0\x91\xE8\xA8\xB4...' for column 'title' at row 15

When using EPFImporter provided by Apple .Inc to import EPF data into table, it reported the following errors.

[WARNING]: Incorrect string value: '\xE6\xB0\x91\xE8\xA8\xB4...' for column 'title' at row 15

The reason is that the encoding of the column title is utf8 (assuming it is utf8) which cannot store characters larger than 3 bytes. For mysql, utf8 is designed as 3 bytes. After MySQL 5.5, they add another type utf8mb4 to deal with such problem  which is 4-byte encoding.

The solution is to make sure your table is look like this:
CREATE TABLE `epf_application_detail_tmp` (
`export_date` BIGINT(20) NULL DEFAULT NULL,
`application_id` INT(11) NOT NULL DEFAULT '0',
`language_code` VARCHAR(20) NOT NULL DEFAULT '' COLLATE 'utf8mb4_unicode_ci',
`title` VARCHAR(1000) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`description` LONGTEXT NULL COLLATE 'utf8mb4_unicode_ci',
`release_notes` LONGTEXT NULL COLLATE 'utf8mb4_unicode_ci',
`company_url` VARCHAR(1000) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`support_url` VARCHAR(1000) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`screenshot_url_1` VARCHAR(1000) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`screenshot_url_2` VARCHAR(1000) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`screenshot_url_3` VARCHAR(1000) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`screenshot_url_4` VARCHAR(1000) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`screenshot_width_height_1` VARCHAR(20) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`screenshot_width_height_2` VARCHAR(20) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`screenshot_width_height_3` VARCHAR(20) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`screenshot_width_height_4` VARCHAR(20) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`ipad_screenshot_url_1` VARCHAR(1000) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`ipad_screenshot_url_2` VARCHAR(1000) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`ipad_screenshot_url_3` VARCHAR(1000) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`ipad_screenshot_url_4` VARCHAR(1000) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`ipad_screenshot_width_height_1` VARCHAR(20) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`ipad_screenshot_width_height_2` VARCHAR(20) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`ipad_screenshot_width_height_3` VARCHAR(20) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
`ipad_screenshot_width_height_4` VARCHAR(20) NULL DEFAULT NULL COLLATE 'utf8mb4_unicode_ci',
PRIMARY KEY (`application_id`, `language_code`)
)
COLLATE='utf8mb4_unicode_ci'
ENGINE=InnoDB;

For convenience, modify the default setting of mysql.

add the following to my.cnf
[mysqld]
collation-server=utf8mb4_unicode_ci
character-set-server=utf8mb4

Then, restart mysql and then create database e.g., epf to store EPF data. If you create epf already, make sure to change its default setting.

2 comments:

  1. Hello Zhen Xie,

    Thanks for this post, I got the same warnings.

    So I modified the my.cnf like you suggested and my CREATE statement is similar to yours now but I still get these warnings.
    Did you also used the original Apple EPFImporter from 2010 or did you modified something there?

    If I’m using the EPFImporter without source changes the warnings are the same (with and without your suggested modifications), if I modify the EPFIngester.py like this:

    def connect(self):
    conn = MySQLdb.connect(
    charset='utf8mb4', # use 'utf8mb4' instead of'utf8'
    host=self.dbHost,
    user=self.dbUser,
    passwd=self.dbPassword,
    db=self.dbName)
    conn.autocommit(True)
    return conn

    I get the Error: “LookupError: unknown encoding: utf8mb4”

    Did you had the same problems? Do you have some handy tips?
    Thanks in advance...

    ReplyDelete
  2. This comment has been removed by the author.

    ReplyDelete