I'm scraping data from multiple pages and inserting it into my MySQL database. There could be duplicates; I only want to store unique entries. Just in case my primary key isn't sufficient, I put in a test which is checked when I get a MySQL 1062 error* (duplicate entry on primary key**). The test checks that every piece of the tuple to be inserted is identical to the stored tuple. What I found is that when I get the 1062 error, the stored tuple and the scraped tuple differ by only one element/field, a TEXT field.
First, I retrieved the already stored entry and passed them both into htmlspecialchars() to compare the output visually; they looked identical.
According to strlen(), the string retrieved from the DB was 304 characters in length but the newly scraped string was 305. similar_text() backed that up by returning 304***.
So then I looped through one string, comparing it character by character with the other string, stopping at the first mismatch. The mismatch was at the very first character: in the string coming from the DB it was N, yet both strings appear to start with I (even in their output from htmlspecialchars()). Plus the DB string was supposedly one character shorter, not longer.
I then checked the output (printing htmlspecialchars()) and the strlen() again, but this time before the original string (the one that ends up in the DB) was inserted, and before the duplicate was inserted. They looked the same as before, and strlen() returned 305 for both.
So this made me think there must be something happening between my PHP and my MySQL. So instead of comparing the newly scraped string to the string in the database with the same primary key (the ID), I tried to retrieve a tuple where every single field is equal to its respective part of the newly scraped tuple, like SELECT * FROM table WHERE value1='{$MYSQL_ESCAPED['value1']}' .... AND valueN='{$MYSQL_ESCAPED['valueN']}'; and the tuple is returned. Therefore they are identical in every way, including that problematic TEXT field.
What's going on here?
Straight away, when I see an N in front of a string I think of NVARCHAR etc. from MSSQL, but as far as I know that's not part of MySQL, but...
Could it have anything to do with the fact that "Each TEXT value is stored using a two-byte length prefix that indicates the number of bytes in the value."?
Or does this just point to a character encoding problem?
Edit:
There are no multi-byte characters stored in the database.
mb_strlen() returns the same results as strlen() where mentioned above.
Using utf8_encode() or mb_convert_encoding() before inserting to the DB makes no difference; an invisible N is still prefixing the string retrieved from the DB.
Notes:
Before inserting any string into my database I pass it through mysql_real_escape_string(trim(preg_replace('/\s\s+/', ' ', $str))), which collapses runs of two or more whitespace characters into a single space, removes leading & trailing spaces and escapes it for MySQL insertion.
The page I print the output & testing to is UTF-8.
Upon creation, my DB has its character set set to utf8, its collation to utf8_general_ci and I use the SET NAMES 'utf8' COLLATE 'utf8_general_ci'; command too, as a precaution.
Foot notes:
* When this happens I also force an exit from the scraping.
** The primary key is just an ID (VARCHAR(10)) which I scrape from the pages.
*** Number of common characters
TEXT fields are subject to character set conversion as/when MySQL sees fit. However, MySQL will not randomly add/remove data without a reason. While text fields DO store the length of the data as 2 extra bytes at the head of the on-disk data blob containing the text field data, those 2 bytes are NEVER exposed to the end user. Assuming character set settings are the same throughout the client->database->on-disk->database->client pipeline, there should never be a change in string length anywhere.
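If you want to verify this from the application side, comparing raw bytes on both ends bypasses every display layer. Here is a minimal sketch, assuming a mysqli connection and hypothetical table/column names (scrape, body) standing in for yours:

// Hypothetical stand-ins: $db is an open mysqli connection,
// `scrape` is the table and `body` is the TEXT column in question.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$db->set_charset('utf8'); // keep the client charset consistent with the table

$scraped = $newlyScrapedString; // the string about to be inserted

// HEX() returns the stored bytes untouched by client-side conversion.
$res = $db->query("SELECT HEX(body) FROM scrape WHERE id = 'ABC1234567'");
$storedHex  = strtolower($res->fetch_row()[0]);
$scrapedHex = bin2hex($scraped); // the PHP-side bytes

if ($storedHex !== $scrapedHex) {
    // Walk the hex strings two digits (one byte) at a time.
    $len = min(strlen($storedHex), strlen($scrapedHex));
    for ($i = 0; $i < $len; $i += 2) {
        if (substr($storedHex, $i, 2) !== substr($scrapedHex, $i, 2)) {
            printf("First differing byte at offset %d: DB=%s, PHP=%s\n",
                $i / 2, substr($storedHex, $i, 2), substr($scrapedHex, $i, 2));
            break;
        }
    }
}

If the bytes differ at offset 0, the extra character really is in the stored data; if they match, the difference is being introduced by whatever is displaying the strings.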
Related
I am performing an insert on a db table that has a column called json_props - when a word has special characters, like Everyone's, it appears in the json column (and back on the frontend) like this
{"col1": "Everyone's"}
I am using the Laravel framework - what is the best/recommended Laravel way to insert/escape this correctly into the database?
-- Expected output --
Insert a string with a special character (like an apostrophe) and it should look normal on the frontend, e.g.
term's
-- Actual Output --
term&#039;s
when a word that has special characters like Everyone's this appears in the json column (and back on the frontend) like this {"col1": "Everyone&#039;s"}
It is the client software which replaces the quote char with the corresponding entity when inserting the value into the table, not MySQL.
Find the place in your code where this substitution is performed, and replace it with quote char doubling. Then the value will be successfully saved and retrieved with MySQL.
CREATE TABLE test (json_props JSON);
INSERT INTO test VALUES ('{"col1": "Everyone''s"}');
SELECT CAST(json_props AS CHAR) FROM test;
CAST(json_props AS CHAR)
{"col1": "Everyone's"}
db<>fiddle here
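On the Laravel side, the recommended route is to let the query builder or Eloquent bind the value as a parameter, so the apostrophe never needs entity-encoding at all. A minimal sketch, assuming a model name (Test) that is not in the original question:

use Illuminate\Database\Eloquent\Model;
use Illuminate\Support\Facades\DB;

// Query builder with bindings: the driver quotes the apostrophe itself,
// so no htmlspecialchars()/entity step is needed anywhere.
DB::table('test')->insert([
    'json_props' => json_encode(['col1' => "Everyone's"]),
]);

// Or with an Eloquent model (class name assumed) and a JSON cast,
// so the encode/decode happens automatically:
class Test extends Model
{
    protected $table = 'test';
    protected $fillable = ['json_props'];
    protected $casts = ['json_props' => 'array'];
    public $timestamps = false;
}

Test::create(['json_props' => ['col1' => "Everyone's"]]);

Either way the apostrophe is stored as a plain apostrophe, and whatever entity-encoding the frontend needs can be applied at output time.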
I am returning a short array via:
return $stmt->fetchAll( PDO::FETCH_ASSOC );
and running into a very strange issue. In my test case, I have three rows of data. Each row has a unique ID, a processed flag, and a content varchar:
`incomingstring` VARCHAR(5000) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci NULL DEFAULT NULL;
This varchar contains a multiline string, delimited by CRLF and terminated with a message suffix:
define( "MESSAGE_SUFFIX", '\x7c\x1c\x0d' );
I parse this string for message parts. The problem is that not everything is returned as expected. The first row comes back fine. The second and third rows come back, but only part (the last part) of the message is there. I know the data is there - I can see it if I look into the database directly - but what comes into the array does not contain the whole varchar.
My best guess is that the termination character is messing up the associative array in some way? Maybe? Any suggestions for a fix? I can't tear the control characters out of the DB record itself because I need them there for other processes, this parser needs to be able to read the data without altering it.
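A quick way to confirm what fetchAll() is actually handing back is to dump the raw bytes instead of echoing the strings. A minimal sketch; $pdo and the table name messages are assumptions, not the original code:

// Assumed: $pdo is an open PDO connection; `messages` is a stand-in table name.
$stmt = $pdo->query("SELECT id, incomingstring FROM messages");
foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    // bin2hex() exposes the embedded control bytes (7c 1c 0d) that a
    // plain echo to a browser would render invisibly or not at all.
    printf("id %s: %d bytes\n%s\n",
        $row['id'],
        strlen($row['incomingstring']),
        chunk_split(bin2hex($row['incomingstring']), 32, "\n"));
}

If the byte counts match what the database reports, the full varchar is arriving and the truncation is happening in the display or parsing step, not in PDO.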
Using MySQL 8.0.11 and PHP 7.2.7
Thanks!
Here's my situation.
I'm migrating from one server to another. As part of this, I'm moving across database data.
The migration method involved running the same CREATE TABLE query on the new server, then using a series of INSERT commands to insert the data row by row. It's possible this resulted in different data; however, the CHECKSUM command was used to validate the contents. CHECKSUM was run on the whole table after the transfer, on a new table with the problem row isolated, and after truncating the string with the LEFT() function. Every time, the result was identical between the old and new server, indicating the raw data should be exactly identical at the byte level.
CHECKSUM TABLE `test`
I've checked the structure and it's exactly the same as well.
SHOW CREATE TABLE `test`
Here is the structure:
CREATE TABLE test (
  item varchar(32) COLLATE utf8_unicode_ci NOT NULL,
  amount mediumint(5) NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
The field is of type:
`item` varchar(32) COLLATE utf8_unicode_ci NOT NULL
Here is my connection code in PHP:
$sql = new mysqli($db_host, $db_user, $db_pass, $db_name);
if ($sql->connect_error) {
die('Connect Error ('.$sql->connect_errno.') '.$sql->connect_error);
}
When I go to retrieve the data in PHP with a simple query:
SELECT * FROM `test`
The data displays like this:
Â§lO
On the old server/host, I get this sequence of raw bytes:
Decimal: -194-167-108-79-
HEX: -C2-A7-6C-4F-
And on the new server, I get a couple of extra bytes at the beginning:
Decimal: -195-130-194-167-108-79-
HEX: -C3-82-C2-A7-6C-4F-
Why might the exact same raw data, table structure, and query, return a different result between the two servers? What should I do to ensure that results are as consistent as possible in the future?
§lO is "Mojibake" for §lO. I presume the latter (3-character) is "correct"?
The raw data looks like this (in both cases when I display it)
is bogus because the technique used for displaying it probably messed with the encoding.
Since the 3 characters became 4 and then became 6, you probably have "double-encoding".
This discusses how "double encoding" can occur: Trouble with UTF-8 characters; what I see is not what I stored
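To see the mechanics, here is a minimal PHP sketch reproducing the observed bytes: treating already-UTF-8 bytes as latin1 and encoding them to UTF-8 a second time turns C2 A7 into C3 82 C2 A7:

// The UTF-8 bytes of "§lO" (C2 A7 6C 4F), as stored on the old server.
$original = "\xC2\xA7lO";

// Double-encoding: the bytes are misread as latin1 (Â and §) and
// re-encoded to UTF-8 -- which is what a latin1 connection charset
// does to utf8 data on the way in.
$double = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo strtoupper(bin2hex($double)); // C382C2A76C4F -- the new server's bytes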
If you provide some more info (CREATE TABLE, hex, method of migrating the data, etc), we may be able to further unravel the mess you have.
More
When using mysqli, do $sql->set_charset('utf8');
(The HEX confirms my analysis.)
The migration method involved running the same CREATE TABLE query on the new server
Was it preceded by some character set settings, as in mysqldump?
then using a series of INSERT commands to insert the data row by row.
Can you get the HEX of some accented character in the file?
... CHECKSUM ...
OK, being the same rules out one thing.
CHECKSUM was done on ... a new table with that row isolated
How did you do that? SELECTing the row could have modified the text, thereby invalidating the test.
indicating the raw data should be exactly identical at the byte level.
For checking the data in the table, SELECT HEX(col)... is the only way to bypass all possible character set conversions that could happen. Please provide the HEX for some column with a non-ascii character (such as the example given). And do the CHECKSUM against the HEX output.
And provide SHOW VARIABLES LIKE 'char%';
I have a table field of type varchar(36) and I want MySQL to generate its value dynamically, so I used this code:
$sql_code = "insert into table1 (id, text) values (uuid(), 'some text');";
mysql_query($sql_code);
How can I retrieve the generated UUID immediately after inserting the record?
char(36) is better
You cannot. The only solution is to perform two separate queries:
SELECT UUID()
INSERT INTO table1 (id, text) VALUES ($uuid, 'text')
where $uuid is the value retrieved on the 1st step.
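In PHP, the two-step approach looks roughly like this (a sketch using mysqli, since the old mysql_* API from the question has been removed; $mysqli is an assumed open connection):

// Step 1: ask MySQL to generate the UUID.
$uuid = $mysqli->query("SELECT UUID()")->fetch_row()[0];

// Step 2: use it in the INSERT, keeping a copy in $uuid for later use.
$text = 'some text';
$stmt = $mysqli->prepare("INSERT INTO table1 (id, text) VALUES (?, ?)");
$stmt->bind_param('ss', $uuid, $text);
$stmt->execute();

echo $uuid; // the id of the row that was just inserted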
You can do everything you need to with SQL triggers. The following SQL adds triggers to tablename to automatically create the primary key UUID (table_id) when inserting, then stores the newly created ID into an SQL variable for retrieval later:
CREATE TRIGGER `tablename_newid`
BEFORE INSERT ON `tablename`
FOR EACH ROW
BEGIN
    -- NEW values can only be modified in a BEFORE trigger
    IF ASCII(NEW.table_id) = 0 THEN
        SET NEW.table_id = UNHEX(REPLACE(UUID(),'-',''));
    END IF;
END;

CREATE TRIGGER `tablename_lastid`
AFTER INSERT ON `tablename`
FOR EACH ROW
    -- only record the id once the row has actually been written
    SET @last_uuid = NEW.table_id;
As a bonus, it inserts the UUID in binary form to a binary(16) field to save storage space and greatly increase query speed.
edit: the trigger should check for an existing column value before inserting its own UUID in order to mimic the ability to provide values for table primary keys in MySQL - without this, any values passed in will always be overridden by the trigger. The example has been updated to use ASCII() = 0 to check for the existence of the primary key value in the INSERT, which will detect empty string values for a binary field.
edit 2: after a comment here it has since been pointed out to me that setting @last_uuid in a BEFORE INSERT trigger means it gets set even if the row insert fails. I have updated my answer to record @last_uuid in a separate AFTER INSERT trigger (the UUID itself must still be assigned in the BEFORE INSERT trigger, since MySQL does not allow modifying NEW values in an AFTER trigger) - whilst I feel this is a totally fine approach under general circumstances, it may have issues with row replication under clustered or replicated databases. If anyone knows, I would love to know as well!
To read the new row's insert ID back out, just run SELECT @last_uuid.
When querying and reading such binary values, the MySQL functions HEX() and UNHEX() will be very helpful, as will writing your query values in hex notation (preceded by 0x). The PHP-side code for your original question, given this type of trigger applied to table1, would be:
// insert row
$sql = "INSERT INTO table1(text) VALUES ('some text')";
mysql_query($sql);
// get last inserted UUID
$sql = "SELECT HEX(#last_uuid)";
$result = mysql_query($sql);
$row = mysql_fetch_row($result);
$id = $row[0];
// perform a query using said ID
mysql_query("SELECT FROM table1 WHERE id = 0x" . $id);
Following up in response to @ina's comment:
A UUID is not a string, even if MySQL chooses to represent it as such. It's binary data in its raw form, and those dashes are just MySQL's friendly way of representing it to you.
The most efficient storage for a UUID is to create it as UNHEX(REPLACE(UUID(),'-','')) - this will remove that formatting and convert it back to binary data. Those functions will make the original insertion slower, but all following comparisons you do on that key or column will be much faster on a 16-byte binary field than a 36-character string.
For one, character data requires parsing and localisation. Any strings coming in to the query engine are generally being collated automatically against the character set of the database, and some APIs (wordpress comes to mind) even run CONVERT() on all string data before querying. Binary data doesn't have this overhead. For the other, your char(36) is actually allocating 36 characters, which means (if your database is UTF-8) each character could be as long as 3 or 4 bytes depending on the version of MySQL you are using. So a char(36) can range anywhere from 36 bytes (if it consists entirely of low-ASCII characters) to 144 if consisting entirely of high-order UTF8 characters. This is much larger than the 16 bytes we have allocated for our binary field.
Any logic performed on this data can be done with UNHEX(), but is better accomplished by simply escaping data in queries as hex, prefixed with 0x. This is just as fast as reading a string, gets converted to binary on the fly and directly assigned to the query or cell in question. Very fast.
Reading data out is slightly slower - you have to call HEX() on all binary data read out of a query to get it in a useful format if your client API doesn't deal well with binary data (PHP in particular will usually determine that binary strings === null and will break them if manipulated without first calling bin2hex(), base64_encode() or similar) - but this overhead is about as minimal as character collation and, more importantly, is only being incurred on the actual cells SELECTed, not all cells involved in the internal computations of a query result.
So of course, all these small speed increases are very minimal and other areas result in small decreases - but when you add them all up binary still comes out on top, and when you consider use cases and the general 'reads > writes' principle it really shines.
... and that's why binary(16) is better than char(36).
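Putting it all together, a minimal sketch of the binary(16) pattern (table and column names are illustrative, and the hex literal in the SELECT is just an example id):

CREATE TABLE items (
    id   BINARY(16) NOT NULL PRIMARY KEY, -- 16 bytes vs. 36+ for char(36)
    name VARCHAR(64) NOT NULL
);

-- Strip the dashes and pack the UUID down to raw bytes on insert
INSERT INTO items (id, name)
VALUES (UNHEX(REPLACE(UUID(), '-', '')), 'example');

-- Query with a hex literal; HEX() makes the id readable on the way out
SELECT HEX(id), name FROM items
WHERE id = 0x11EE2F0A8C4B11EE89AB0242AC120002; -- example id only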
It's pretty easy actually
you can pass this to mysql and it will return the inserted id.
set @id=UUID();
insert into <table>(<col1>,<col2>) values (@id,'another value');
select @id;
Depending on how the uuid() function is implemented, this is very bad programming practice - if you try to do this with binary logging enabled (i.e. in a cluster) then the insert will most likely fail. Ivan's suggestion looks like it might solve the immediate problem - however I thought that only returned the value generated for an auto-increment field - indeed, that's what the manual says.
Also, what's the benefit of using a uuid()? It's computationally expensive to generate, requires a lot of storage, increases the cost of querying the data and is not cryptographically secure. Use a sequence generator or autoincrement instead.
Regardless of whether you use a sequence generator or uuid, if you must use this as the only unique key on the database, then you'll need to assign the value first, read it back into PHP-land and embed/bind the value as a literal in the subsequent insert query.
When I am searching my MySQL database with some query like:
SELECT * FROM mytable WHERE mytable.title LIKE '%副教授%';
("副教授" are three Chinese characters, whose decimal numeric character reference, NCR, is "副教授"), I got no result.
By looking into phpMyAdmin and browsing "mytable", the should-be-found entry is shown as "&#21103;&#25945;&#25480;". I think that is the reason for the failure of the search.
Not all the entries in the same column are numeric character references; some of them are just normal. Here is one pic of the table column shown in phpMyAdmin.
I wonder how I could search across all entries in my table in MySQL using one format, regardless of whether they are stored as NCRs or not. Or should I convert the NCR entries by running some script? Thanks.
Your database table encoding should be utf8, and when you insert new data you should run a SET NAMES 'utf8' query before insertion; this will keep all your data intact.
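Setting the connection charset won't fix rows that were already stored as NCR entities, though; those need a one-off conversion. A minimal sketch, assuming mysqli and an id column (the mytable/title names are from the question):

// Assumed: $db is an open mysqli connection; `mytable` has an `id` primary key.
$db->set_charset('utf8');

$res = $db->query("SELECT id, title FROM mytable");
while ($row = $res->fetch_assoc()) {
    // html_entity_decode() turns "&#21103;&#25945;&#25480;" back into "副教授";
    // rows without entities pass through unchanged.
    $decoded = html_entity_decode($row['title'], ENT_QUOTES, 'UTF-8');
    if ($decoded !== $row['title']) {
        $stmt = $db->prepare("UPDATE mytable SET title = ? WHERE id = ?");
        $stmt->bind_param('si', $decoded, $row['id']);
        $stmt->execute();
    }
}

After normalising the column once, a plain LIKE '%副教授%' search will match every row in one format.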