Detecting spammers with MySQL - php

I see an ever-increasing number of users signing up on my site just to send duplicate SPAM messages to other users. I've added some server-side code to detect duplicate messages with the following MySQL query:
SELECT count(content) as msgs_sent
FROM messages
WHERE sender_id = '.$sender_id.'
GROUP BY content having count(content) > 10
The query works well, but now they're getting around it by changing a few characters in their messages. Is there a way to detect this with MySQL, or do I need to look at each grouping returned from MySQL and then use PHP to determine the percentage of similarity?
Any thoughts or suggestions?

Fulltext Match
You could look at implementing something similar to the MATCH example here:
mysql> SELECT id, body, MATCH (title,body) AGAINST
-> ('Security implications of running MySQL as root') AS score
-> FROM articles WHERE MATCH (title,body) AGAINST
-> ('Security implications of running MySQL as root');
+----+-------------------------------------+-----------------+
| id | body | score |
+----+-------------------------------------+-----------------+
| 4 | 1. Never run mysqld as root. 2. ... | 1.5219271183014 |
| 6 | When configured properly, MySQL ... | 1.3114095926285 |
+----+-------------------------------------+-----------------+
2 rows in set (0.00 sec)
So for your example, perhaps:
SELECT id, MATCH (content) AGAINST ('your string') AS score
FROM messages
WHERE MATCH (content) AGAINST ('your string')
HAVING score > 1;
(Note that the alias score cannot be referenced in the WHERE clause, so the threshold belongs in HAVING.)
Note that to use these functions, your content column needs to be covered by a FULLTEXT index.
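For example, a minimal sketch of adding such an index (assuming a PDO connection $db; the index name is arbitrary):
// Sketch only: add a FULLTEXT index on content so MATCH ... AGAINST works.
// Prior to MySQL 5.6, FULLTEXT indexes require the MyISAM engine
// (InnoDB supports them from 5.6 on).
$db->exec("ALTER TABLE messages ADD FULLTEXT INDEX ft_content (content)");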
What is score in this example?
It is a relevance value. It is computed through the process described below:
Every correct word in the collection and in the query is weighted
according to its significance in the collection or query.
Consequently, a word that is present in many documents has a lower
weight (and may even have a zero weight), because it has lower
semantic value in this particular collection. Conversely, if the word
is rare, it receives a higher weight. The weights of the words are
combined to compute the relevance of the row.
From the documentation page.
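If spammers keep tweaking a few characters, the PHP fallback the asker describes is also workable. Below is a minimal sketch using similar_text(); the $pdo connection, the id ordering column, the 50-message window, and the 90% threshold are all assumptions, not anything from the original post:
// Fetch the sender's recent messages and compare them pairwise.
// similar_text() reports the similarity percentage via its third argument.
$stmt = $pdo->prepare(
    'SELECT content FROM messages WHERE sender_id = ? ORDER BY id DESC LIMIT 50'
);
$stmt->execute([$sender_id]);
$messages = $stmt->fetchAll(PDO::FETCH_COLUMN);

$nearDuplicates = 0;
$count = count($messages);
for ($i = 0; $i < $count; $i++) {
    for ($j = $i + 1; $j < $count; $j++) {
        similar_text($messages[$i], $messages[$j], $percent);
        if ($percent > 90) {   // tweaked copies of the same message
            $nearDuplicates++;
        }
    }
}

if ($nearDuplicates > 10) {
    // Flag $sender_id for review as a likely spammer.
}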

Related

MySQL Like function

On my HTML page, the user has the option to either enter a text string, tick checkbox options, or do both. This data is then placed inside a MySQL query, which displays the data.
The fact that the user is allowed to enter a string means that I am using the LIKE function in the mysql query.
Correct me if I am wrong, but I believe the LIKE function can slow the query down a lot.
In relation to the above statement, I would like to know whether an empty string in the LIKE function would make a difference, so for example:
select * from hello;
select * from hello where name like "%%";
If it does make a significant difference (I believe this database will be growing larger) what are your ideas on how to deal with this.
My first idea was to have two queries:
one with the LIKE functionality,
and one without the LIKE functionality. Depending on what the user enters, the correct query will be called.
So, for example, if the user leaves the search box empty, the LIKE function is not needed; therefore the form will send an empty value, and an if statement will select the other query (without the LIKE functionality) when it sees that the value is empty.
Is there a better way of doing this?
In general, LIKE will be slow unless the pattern begins with a fixed string and the column has an index. If you do LIKE 'foo%', it can use the index to find all rows that begin with foo, because MySQL indexes use B-trees. But LIKE '%foo' cannot make use of an index, because B-trees only optimize prefix lookups; this has to do a sequential scan of the entire table.
And even when you use the version with a prefix, the performance improvement depends on how much that prefix reduces the number of rows that have to be searched. If you do LIKE 'foo%bar', and 90% of your rows begin with foo, this will still have to scan 90% of the table to test whether they end with bar.
Since LIKE '%%' doesn't have a fixed prefix, it will perform a full scan of the table, even though there isn't actually anything to search for. It would be best if your PHP script tested whether the user provided a search string, and omit the LIKE test if there's nothing to search for.
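A minimal sketch of that approach in PHP, assuming a PDO connection $pdo and the hello table from the question (the name request parameter is hypothetical):
// Build the query conditionally: skip the LIKE clause entirely when the
// search box is empty. The prefix-only wildcard keeps an index on name usable.
$search = trim($_GET['name'] ?? '');

$sql    = 'SELECT * FROM hello';
$params = [];

if ($search !== '') {
    $sql     .= ' WHERE name LIKE ?';
    $params[] = $search . '%';
}

$stmt = $pdo->prepare($sql);
$stmt->execute($params);
$rows = $stmt->fetchAll();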
I believe the LIKE function can slow the query down a lot
I would expect that not to be the case. How hard would it be to test it?
Regardless of which version of the query you run, the DBMS still has to examine every row in the table. That will require some extra work by the CPU, but for large tables, disk I/O will be the limiting factor. LIKE '%%' will discard rows with NULL values, potentially reducing the amount of data the DBMS needs to retain in the result set and transfer to the client, which may be a significant saving.
As Barbar says, providing an expression without a leading wildcard will allow the DBMS to use an index (if one is available), which will have a big impact on performance.
It's hard to tell from your question (you didn't provide much in the way of example queries or data, nor any detail of what the application does), but the solution to your problem might be full-text indexing.
Using the World sample database from the MySQL software distribution, I first ran a simple EXPLAIN on queries with and without WHERE clauses that have no filtering effect:
mysql> explain select * from City;
mysql> explain select * from City where true;
mysql> explain select * from City where Name = Name;
In these first three cases, the result is as follows:
+----+-------------+-------+------+---------------+------+---------+------+------+-------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------+
|  1 | SIMPLE      | City  | ALL  | NULL          | NULL | NULL    | NULL | 4080 |       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------+
While for the last query, I got the following:
mysql> explain select * from City where Name like "%%";
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | City  | ALL  | NULL          | NULL | NULL    | NULL | 4080 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
You can see that for this particular query, the where condition was not optimized away.
I also performed a couple of measurements to check whether there would be a noticeable difference, but:
the table having only 4080 rows, I used a self cross join to produce longer computation times;
I used HAVING clauses to cut down on display overhead (1).
Measurement results:
mysql> select c1.Name, c2.Name from City c1, City c2 where concat(c1.Name,c2.Name) = concat(c1.Name,c2.Name) having c1.Name = "";
Empty set (5.22 sec)
The above query, as well as versions with true or c1.Name = c1.Name, performed essentially the same, within less than a 0.1 sec margin.
mysql> reset query cache;
mysql> select c1.Name, c2.Name from City c1, City c2 where concat(c1.Name,c2.Name) like "%%" having c1.Name = "";
Empty set (13.80 sec)
This one also took around the same amount of time when run several times (in between query cache resets) (2).
Clearly the query optimizer doesn't see an opportunity in the latter case. The conclusion is that you should avoid that clause as much as possible, even if it doesn't change the result set.
(1): since HAVING filtering happens after data consolidation from the query, I assumed it wouldn't change the relative computation load of the queries.
(2): interestingly, I initially tried a simple where c1.Name like "%%", and got timings around 5.0 sec, which led me to try a more elaborate clause. I don't think that result changes the overall conclusion; it could be that in that very specific case the filtering actually has a beneficial effect. Hopefully a MySQL guru will explain that result.

MySQL SELECT SUM(Column) and SELECT * Cardinality violation: 1241 Operand should contain 1 column(s)

I'm trying to write a single statement that selects all columns (*) and sums one column from the same table, depending on conditions.
I wrote this statement (based on Multiple select statements in Single query):
SELECT ( SELECT SUM(Amount) FROM 2_1_journal), ( SELECT * FROM 2_1_journal WHERE TransactionPartnerName = ? )
I understand that SELECT SUM(Amount) FROM 2_1_journal will sum all values in the Amount column (not based on a condition).
But first I want to understand what the correct statement is.
With the above statement I get the error SQLSTATE[21000]: Cardinality violation: 1241 Operand should contain 1 column(s)
I cannot understand the error message. From the advice here MySQL - Operand should contain 1 column(s), I understand that the subquery SELECT * FROM 2_1_journal WHERE TransactionPartnerName = ? must select only one column?
I tried changing the statement to SELECT ( SELECT * FROM 2_1_journal WHERE TransactionPartnerName = ? ), ( SELECT SUM(Amount) FROM 2_1_journal), but I get the same error...
What would be the correct statement?
SELECT *, (SELECT SUM(Amount) FROM 2_1_journal)
FROM 2_1_journal
WHERE TransactionPartnerName = ?
This sums up Amount from the entire table and "appends" the total to every row where TransactionPartnerName matches the parameter you bind in the client code.
If you want to limit the sum to the same criteria as the rows you select, just include it:
SELECT *, (SELECT SUM(Amount) FROM 2_1_journal WHERE TransactionPartnerName = ?)
FROM 2_1_journal
WHERE TransactionPartnerName = ?
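With positional placeholders, the same value has to be bound twice, once per ?. A quick PDO sketch in the question's style ($db is the question's connection; $partner is a hypothetical variable holding the partner name):
$stmt = $db->prepare(
    'SELECT *, (SELECT SUM(Amount) FROM 2_1_journal WHERE TransactionPartnerName = ?)
     FROM 2_1_journal
     WHERE TransactionPartnerName = ?'
);
$stmt->execute([$partner, $partner]);   // same value, bound once per placeholder
$rows = $stmt->fetchAll();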
A whole different thing: table names like 2_1_journal are strong indicators of a broken database design. If you can redo it, you should look into how to normalize the database properly. It will most likely pay back many times over.
With regard to normalization (added later):
Since the current design uses keys in table names (such as the 2 and 1 in 2_1_journal), I'll quickly illustrate how I think you can vastly improve that design. Let's say that the table 2_1_journal has the following data (I'm just guessing here, because the tables haven't been described anywhere yet):
title | posted     | content
------+------------+---------------------
Yes!  | 2013-01-01 | This is just a test
2nd   | 2013-01-02 | Another test
This stuff belongs to user 2 in company 1. But hey! If you look at the rows, the fact that this data belongs to user 2 in company 1 is nowhere to be found.
The problem is that this design violates one of the most basic principles of database design: don't use keys in object (here: table) names. A clear indication that something is very wrong is if you have to create new tables if something new is added. In this case, adding a new user or a new company requires adding new tables.
This issue is easily fixed. Create one table named journal. Next, use the same columns, but add another two:
company | user | title | posted     | content
--------+------+-------+------------+---------------------
1       | 2    | Yes!  | 2013-01-01 | This is just a test
1       | 2    | 2nd   | 2013-01-02 | Another test
Doing it like this means:
You never add or modify tables unless the application changes.
Doing joins across companies or users (and anything else that used to be part of the table naming scheme) is now possible with a single, fairly simple select statement.
Enforcing integrity is easy: if you upgrade the application and want to change the tables, the changes don't have to be repeated for each company and user. More importantly, this lowers the risk of having the application get out of sync with the tables in the database (such as adding the field comments to all x_y_journal tables but forgetting 5313_4324_journal, causing the application to break only when user 5313 logs in). This is the kind of problem you don't want to deal with.
I am not writing this because it is a matter of personal taste. Databases are just designed to handle tables that are laid out as I describe above. The design where you use object keys as part of table names has a host of other problems associated with it that are very hard to deal with.
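To make this concrete, here is a rough sketch of the normalized layout and a query against it; the column types and the PDO-style $db connection are guesses, since the real schema was never posted:
// One journal table replaces every x_y_journal table. Types are guesses.
$db->exec('
    CREATE TABLE journal (
        company INT NOT NULL,
        user    INT NOT NULL,
        title   VARCHAR(255) NOT NULL,
        posted  DATE NOT NULL,
        content TEXT NOT NULL,
        KEY idx_company_user (company, user)
    )
');

// Fetching one user of one company is now a plain parameterized query:
$stmt = $db->prepare('SELECT title, posted, content FROM journal WHERE company = ? AND user = ?');
$stmt->execute([1, 2]);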

Performance and sorting, and distinct unique between mysql and php

In situations like this which method or mix of methods performs the quickest?
$year = db_get_fields("select distinct year from car_cache order by year desc");
Or
$year = db_get_fields("select year from car_cache");
$year = array_unique($year);
sort($year);
I've heard that DISTINCT in MySQL is a big performance hit for large queries, and this table can have a million rows or more. I also wondered what combination of storage engines, InnoDB or MyISAM, would work best; I know many optimizations are very query dependent. Year is an unsigned number, but the other fields are varchars of different lengths, which I know may make a difference too. For example:
$line = db_get_fields("select distinct line from car_cache where year='$postyear' and make='$postmake' order by line desc");
I read that using the new InnoDB multiple-key support can make queries like this one very quick, but the DISTINCT and ORDER BY clauses are red flags to me.
Have MySQL do as much work as possible. If it isn't being efficient at what it's doing, then things likely aren't set up correctly (whether that is proper indexing for the query you are trying to run, or settings such as sort buffers).
If you have an index on the year column, then using DISTINCT should be efficient. If you do not, then a full table scan is necessary in order to fetch the distinct rows. If you try to sort out the distinct rows in PHP rather than MySQL, then you transmit (potentially) much more data from MySQL to PHP, and PHP consumes much more memory to store all that data before eliminating the duplicates.
Here is some sample output from a dev database I have. Also note that this database is on a different server on the network from where the queries are being executed.
SELECT COUNT(SerialNumber) FROM `readings`;
> 97698592
SELECT SQL_NO_CACHE DISTINCT `SerialNumber`
FROM `readings`
ORDER BY `SerialNumber` DESC
LIMIT 10000;
> Fetched 10000 records. Duration: 0.801 sec, fetched in: 0.082 sec
> EXPLAIN *above_query*
+----+-------------+----------+-------+---------------+---------+---------+------+------+-----------------------------------------------------------+
| id | select_type | table    | type  | possible_keys | key     | key_len | ref  | rows | Extra                                                     |
+----+-------------+----------+-------+---------------+---------+---------+------+------+-----------------------------------------------------------+
|  1 | SIMPLE      | readings | range | NULL          | PRIMARY | 18      | NULL |   19 | Using index for group-by; Using temporary; Using filesort |
+----+-------------+----------+-------+---------------+---------+---------+------+------+-----------------------------------------------------------+
If I attempt the same query, except replace the SerialNumber column with one that is non-indexed, then it takes forever to run because MySQL has to examine all 97 million rows.
Some of the efficiency has to do with how much data you expect to get back. If I slightly modify the above queries to operate on the time column (the timestamp of the reading), then it takes 1 min 40 sec to get a distinct list of 273,505 times; most of the overhead there is in transferring the records over the network. So keep in mind how much data you are getting back: you want to keep it as low as possible for the data you are trying to fetch.
As for your final query:
select distinct line from car_cache
where year='$postyear' and make='$postmake'
order by line desc
There should be no problem with that either, just make sure you have a compound index on year and make and possibly an index on line.
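For instance, a sketch only: the index names are arbitrary, and $db stands in for a PDO connection, since the question's db_get_fields() wrapper wasn't shown:
// A compound index covering the WHERE clause, plus one on the selected column.
$db->exec('ALTER TABLE car_cache ADD INDEX idx_year_make (year, make)');
$db->exec('ALTER TABLE car_cache ADD INDEX idx_line (line)');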
On a final note, the engine I am using for the readings table is InnoDB, and my server is 5.5.23-55-log Percona Server (GPL), Release 25.3, a version of MySQL by Percona Inc.
Hope that helps.

Optimizing mysql fulltext search

I want to implement full-text search on my site, and I need it paginated. My database has 50,000+ rows per table. I have altered my tables to index (title, content, date). The tables are constantly updated; there is also an auto-increment id column, and the latest date is always at the end of the table.
date varchar(10)
title text
content text
But the whole query takes 1.5+ seconds. I have searched through many articles via Google; some wrote that limiting the indexed field's length can make the search quicker, but a TEXT column cannot be altered to a fixed length like that (I tried ALTER TABLE table_1 CHANGE `title` `title` TEXT(500) CHARACTER SET utf8 COLLATE utf8_unicode_ci NOT NULL; it did not work):
date varchar(10)
title text(500)
content text(1000)
So, excluding Sphinx and third-party scripts, how can I optimize full-text search with SQL alone? The query is here:
(SELECT
title,content,date
FROM table_1
WHERE MATCH (title,content,date)
AGAINST ('+$Search' IN BOOLEAN MODE))
UNION
(SELECT
title,content,date
FROM table_2
WHERE MATCH (title,content,date)
AGAINST ('+$Search' IN BOOLEAN MODE))
Order By date DESC
Thanks.
Based on the question's follow-up comments, you have a BTREE index on your columns rather than a FULLTEXT index.
For MATCH (title,content) against search, you would need:
CREATE FULLTEXT INDEX index_name ON tbl_name (title,content);
I'm not sure it'll accept the date field in there (the latter is probably not relevant anyway).
I have a comprehensive plan for you to optimize MySQL for FULLTEXT indexing as thoroughly as possible.
The first thing you should do is: get rid of the stopword list.
This has annoyed people over the years, many of whom were unaware that over 600 words are excluded from a FULLTEXT index.
Here is a tabular view of those stopwords.
There are two ways to bypass this:
Bypass Option 1) Create a custom stopword list.
You can actually submit to mysql a list of your preferred stopwords. Here is the default:
mysql> show variables like 'ft%';
+--------------------------+----------------+
| Variable_name | Value |
+--------------------------+----------------+
| ft_boolean_syntax | + -><()~*:""&| |
| ft_max_word_len | 84 |
| ft_min_word_len | 4 |
| ft_query_expansion_limit | 20 |
| ft_stopword_file | (built-in) |
+--------------------------+----------------+
5 rows in set (0.00 sec)
OK, now let's create our stopword list. I usually set the English articles as the only stopwords.
echo "a" > /var/lib/mysql/stopwords.txt
echo "an" >> /var/lib/mysql/stopwords.txt
echo "the" >> /var/lib/mysql/stopwords.txt
Next, add the options to /etc/my.cnf, also allowing 1-letter, 2-letter, and 3-letter words:
[mysqld]
ft_min_word_len=1
ft_stopword_file=/var/lib/mysql/stopwords.txt
Finally, restart mysql
service mysql restart
If you have any tables with FULLTEXT indexes already in place, you must drop those FULLTEXT indexes and create them again.
Bypass Option 2) Recompile the source code
The file is storage/myisam/ft_static.c. Just alter the C structure that holds the 600+ words so that it is empty. Have fun recompiling!
Now that the FULLTEXT config is solidified, here is another major aspect to consider:
Write properly refactored queries so that the MySQL Query Optimizer works right!
What I am about to mention is really undocumented: whenever you perform queries that do JOINs and the WHERE clause contains the MATCH function for FULLTEXT searching, the MySQL Query Optimizer tends to treat the query like a full table scan when it comes to searching the columns involved in the FULLTEXT index. If you plan to query a table using a FULLTEXT index, ALWAYS refactor your query so that the FULLTEXT search returns only keys in a subquery, and join those keys back to your main table, as shown below. Otherwise, the FULLTEXT index will put the MySQL Query Optimizer in a tailspin.
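As an illustration of that refactoring, here is a sketch against table_1 from the question, using the auto-increment id column the asker mentioned as the join key:
// The FULLTEXT search runs in a derived table that returns only primary
// keys; the outer query then joins back to fetch the full rows. $Search is
// interpolated here only to mirror the question's code; a bound parameter
// would be safer.
$sql = "
    SELECT t.title, t.content, t.date
    FROM table_1 AS t
    INNER JOIN (
        SELECT id
        FROM table_1
        WHERE MATCH (title, content) AGAINST ('+$Search' IN BOOLEAN MODE)
    ) AS ft ON ft.id = t.id
    ORDER BY t.date DESC
";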
For further ideas regarding full-text search optimization in MySQL, see How to optimize MySQL Boolean Full-Text Search? (Or what to replace it with?) - C#

Trouble combining Two sql queries into one

I have a table which contains due dates for individual member records. Each row contains four fields:
ID  | Next_Due   | Next_Due_Missed | Amount
===========================================
123 | 2010-12-05 | NULL            | 41.32
456 | 2010-12-10 | 2010-12-05      | 21.44
789 | 2010-12-20 | 2010-12-10      | 39.99
ID is the unique id of each MEMBER.
Next_Due is the next due date of their regular subscription period.
Next_Due_Missed is populated ONLY if there was an error collecting the first round of subscription payment.
Amount is the amount owed for the subscription.
My goal is to create a SQL query that checks whether next_due_missed exists and is not NULL. If it does, use that value as '$date'. If not, set $date to the value of next_due.
This is done easily enough, except that my results are grouped by next_due in normal circumstances, and next_due_missed will be omitted if I combine the way I currently am.
Every payment period, there may be 600+ records with next_due equal to the desired date (and 10-15 equal to next_due_missed).
My current query is:
$stmt = $db->prepare("SELECT next_due, next_due_missed FROM table_name WHERE (next_due > CURDATE() OR next_due_missed > CURDATE()) GROUP BY next_due ASC");
This only returns results for next_due, however. Omitting the GROUP BY clause returns hundreds of results (and I need to group at this stage).
Similarly, at a later point I will need to break out those individual records and actually create payment records based on the next_due and next_due_missed values.
Any ideas what I am missing?
I am not sure of the purpose of your GROUP BY other than to get DISTINCT values, but I left it in, in case you provided a partial query:
SELECT coalesce(next_due_missed, next_due) as EffectiveNextDue
FROM table_name
WHERE coalesce(next_due_missed, next_due) > CURDATE()
GROUP BY coalesce(next_due_missed, next_due)
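Used from PHP in the question's own $db->prepare style, that looks something like this (table_name is the question's placeholder; the alias is hypothetical):
$stmt = $db->prepare(
    'SELECT COALESCE(next_due_missed, next_due) AS effective_next_due
     FROM table_name
     WHERE COALESCE(next_due_missed, next_due) > CURDATE()
     GROUP BY effective_next_due'
);
$stmt->execute();
$dates = $stmt->fetchAll(PDO::FETCH_COLUMN);   // one row per distinct due date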
