I have been given the task of creating a "Mass Crawler" which relies entirely on proxies stored in a database. Here's a simple overview of what I'm attempting to achieve:
1 x CronJob Bootstrap file - This is the file which sends 50 parallel curl requests to the individual crawler file
1 x Individual Crawler file - This is supposed to grab a UNIQUE row (proxy) from the database which another process hasn't selected.
I've had a look into TRANSACTIONS with MySQL, but I still believe they wouldn't help, as the query would be executed at exactly the same time by each individual crawler process.
Here's roughly the idea I had in mind for the individual crawler file:
$db = new MysqliDb("localhost", "username", "password", "database");
$db->connect();
$db->startTransaction();
// grab one proxy that hasn't been used in the last 30 seconds
$db->where("last_used", array("<" => "DATE_SUB(NOW(), INTERVAL 30 SECOND)"));
$proxies = $db->get("proxies", 1);
if(count($proxies) == 1) {
    //complete any scraping that needs to be done
    //update the database to say the proxy has just been used
    $db->where("id", $proxies[0]['id']);
    $db->update("proxies", array("last_used" => date("Y-m-d H:i:s")));
    //commit the complete transaction
    $db->commit();
}
$db->disconnect();
Would the above example be the correct way to use the MySQL TRANSACTION feature and ensure that ALL parallel queries select different rows?
You need a column in the table that indicates that the row is in use by one of the crawler processes. Your first SELECT should look for WHERE in_use = 0; it needs to use the FOR UPDATE clause to lock the rows that are processed, though.
SELECT *
FROM proxies
WHERE in_use = 0
LIMIT 1
FOR UPDATE;
I don't know how to write that query with the DB API you're using; you may need to use its function for performing raw queries.
Then update that row with SET in_use = 1. By doing both operations in a transaction, you ensure that no other process will get that row.
When it's done processing the row, it can SET in_use = 0.
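If the wrapper can't express that, a raw mysqli sketch of the idea might look like this (the in_use column is the one suggested above, last_used comes from your table, and error handling is deliberately minimal):
<?php
// minimal sketch (not the wrapper's API): claim one unused proxy inside a transaction
$mysqli = new mysqli("localhost", "username", "password", "database");

$mysqli->begin_transaction();

// lock one free row so parallel crawlers cannot pick the same proxy
$result = $mysqli->query("SELECT * FROM proxies WHERE in_use = 0 LIMIT 1 FOR UPDATE");

if ($row = $result->fetch_assoc()) {
    $id = (int)$row['id'];

    // mark it as taken before doing any scraping
    $stmt = $mysqli->prepare("UPDATE proxies SET in_use = 1, last_used = NOW() WHERE id = ?");
    $stmt->bind_param("i", $id);
    $stmt->execute();
    $mysqli->commit();   // releases the row lock

    // ... do the scraping with the proxy details in $row ...

    // hand the proxy back when finished
    $mysqli->query("UPDATE proxies SET in_use = 0 WHERE id = $id");
} else {
    $mysqli->rollback();  // nothing free right now
}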
I have a query which finds aggregate sums and counts from a table, and I need to update these details in another table. If I run the query without any condition that limits the number of rows fetched at a time, it fails (times out), so I am planning to fetch 10,000 records at a time and insert them into the database. Now I have some confusion regarding this.
Is it a good idea to use a database transaction that includes all the operations, from fetching the data to inserting/updating it in the DB? I would like to know whether this will lock the table I am fetching from until the transaction is completed. If so, I can't use that approach, as there are other API queries acting upon this data in real time.
Is it better to do this operation in loops, i.e. fetching 1000 records and updating 1000 records in the DB at a time? Or do I need an extra layer in between, like Redis or file storage, which would hold the entire data set so the update operation can be performed at once?
My operation flow in the application level is as follows:
while ($i <= $maxCount) {
$limit = 1000;
$results = fetchResults($i, $limit);
$formattedResults = formatResults($results);
updateRecords($formattedResults);
$i = $i + 1000;
}
Is there any downside to doing something like this? One thing I noted is that it needs to hit the database multiple times: if I have 100,000 entries to be processed, it needs to hit the database 100000/1000 = 100 times. Is there any other way to do this? I am using this in a background worker which runs on a daily basis.
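For illustration, a per-chunk version of that loop with one transaction per batch might look like this (just a sketch, assuming a PDO connection in $pdo and the helper functions from the pseudocode above):
<?php
// sketch: process in batches, committing each one separately so any locks
// are held only for the duration of a single batch
$chunkSize = 1000;
$i = 0;
while ($i <= $maxCount) {                      // $maxCount as in the pseudocode above
    $pdo->beginTransaction();                  // locks apply only to this batch
    $results = fetchResults($i, $chunkSize);   // hypothetical helpers from the question
    $formattedResults = formatResults($results);
    updateRecords($formattedResults);
    $pdo->commit();                            // release locks before the next batch
    $i += $chunkSize;
}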
This code works, but the problem is that if several people use it simultaneously it will cause problems, in the sense that some people won't be registered. So I need to rewrite this in a way that all queries per person are executed and finished before the queries of the next person start.
First, the code reads from the database in order to get to the string of all the people that are registered so far.
$sql_s = $con -> query("select * from schedule where date='$date'");
$row_schedule = $sql_s->fetch_array(MYSQLI_BOTH);
$participants = $row_schedule['participants'];
$participants is a string that looks something like "'Sara':'Richard':'Greg'"
Now the current user (Fredrik) wants to add his name to the string, like this:
$current_user='Fredrik';
$participants_new=add_participant($participants,$current_user);
add_participant is a PHP function that adds 'Fredrik' to the participant string. Then I want to replace the old participant string with the new one in the SQL database, like this:
$sql = $con->query("UPDATE schedule SET participants='{$participants_new}' where date='{$date}'");
The specific problem is that if another person (Linda) reads the database before Fredrik executes
$sql = $con->query("UPDATE schedule SET participants='{$participants_new}' where date='{$date}'");
Linda won't get a string that includes Fredrik; she will get "'Sara':'Richard':'Greg'". When she has added her name it will look like "'Sara':'Richard':'Greg':'Linda'", and when she updates the database like this
$sql = $con->query("UPDATE schedule SET participants='{$participants_new}' where date='{$date}'");
The string including Fredrik ("'Sara':'Richard':'Greg':'Fredrik'") will be overwritten with "'Sara':'Richard':'Greg':'Linda'", and no one will ever know that Fredrik registered for the class.
Thus, how can I rewrite this code such that all Fredrik's queries are executed before Linda's queries start?
Your question is a very good example of why one should always learn database design basics and always follow them.
A separator-delimited string in a database is a deadly sin, for many reasons, but we are interested in this particular case.
Had you designed your database properly, storing participants in separate rows, there would not be a single problem.
So just change your design by adding a table for participants, and there will be no problem adding or removing any number of them.
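For example, with a schema along these lines (table and column names are just illustrative), every registration becomes its own row and concurrent signups can never overwrite each other:
<?php
// sketch of a normalised design: one row per participant per date
// (the schedule_participants table and its columns are hypothetical)
$con->query("
    CREATE TABLE IF NOT EXISTS schedule_participants (
        id   INT AUTO_INCREMENT PRIMARY KEY,
        date DATE NOT NULL,
        name VARCHAR(100) NOT NULL,
        UNIQUE KEY uniq_signup (date, name)  -- each user can register once per date
    )
");

// registering Fredrik is now a single INSERT; concurrent users simply add
// their own rows and nothing gets overwritten
$stmt = $con->prepare("INSERT IGNORE INTO schedule_participants (date, name) VALUES (?, ?)");
$stmt->bind_param("ss", $date, $current_user);
$stmt->execute();

// reading the list back
$result = $con->query("SELECT name FROM schedule_participants WHERE date = '$date'");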
Here is one approach to it:
Theoretical explanation:
Something like this could work: every time a user runs the update, the code first checks when the last update request was made, relying on there being a time difference between the users' update requests.
Note: it's still not guaranteed to work. If the user who submitted the request first has connection problems, his update query may be delayed; meanwhile another user who sent the request later, but without connection problems, gets his update applied first, and so the first user's update will be lost.
Here is the code:
<?php
// You need to add another column for saving the time of the last query execution
$current_time = time();
$current_date = date("Y-m-d", $current_time);
$query_execution_new_time = $current_time.":".$current_date;
if (empty($row_schedule['query_execution_time'])) {
    $sql = $con->query("UPDATE schedule SET query_execution_time='{$query_execution_new_time}' where date='{$date}'");
} else {
    $query_execution_time = explode(":", $row_schedule['query_execution_time']);
    if ($query_execution_time[0] < $current_time) {
        $con->query("UPDATE schedule SET participants='{$participants_new}' where date='{$date}'");
        $sql = $con->query("UPDATE schedule SET query_execution_time='{$query_execution_new_time}' where date='{$date}'");
    }
}
?>
Try this
There is no need to first fetch all the participants and then update;
just append the new participant in the UPDATE itself.
You can CONCAT onto the value that is already saved in the database column.
update schedule
set participants = case when participants is null or participants = ''
                        then 'Fredrik'  -- assume Fredrik is the new participant
                        else concat(participants, ':', 'Fredrik')
                   end
where date='$date';
That way, even if multiple participants sign up at the same time, each registration is a single atomic UPDATE, so the database serializes them and no user gets lost.
You don't need to worry about multiple users clicking at the same time unless you have millions of users.
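If you want to avoid interpolating values into the SQL, the same statement can be run as a prepared query (a sketch assuming the $con mysqli connection from the question):
<?php
// append the current user atomically; no read-modify-write in PHP
$sql = "UPDATE schedule
        SET participants = CASE WHEN participants IS NULL OR participants = ''
                                THEN ?
                                ELSE CONCAT(participants, ':', ?)
                           END
        WHERE date = ?";
$stmt = $con->prepare($sql);
$stmt->bind_param("sss", $current_user, $current_user, $date);
$stmt->execute();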
I have a process that selects the next item to process from a MySQL InnoDB table based on some criteria. When a row has been selected as the next to process, its processing field is set to 1 while the processing happens outside the database. I do this so that many processors can run at once without processing the same row.
If I use transactions to execute the following queries, are they guaranteed to be executed together (i.e. without any other MySQL connection executing queries in between)? If they are not, then multiple processors could get the same id from the SELECT query and the processing would be redundant.
Pseudo Code Example
Prepare Transaction...
$id = SELECT id
FROM companies
WHERE processing = 0
ORDER BY last_crawled ASC
LIMIT 1;
UPDATE companies
SET processing = 1
WHERE id = $id;
Execute Transaction
I've been struggling to accomplish this fast enough using a single UPDATE query (see this question). Assume that is not an option for the purposes of this question.
You still have a possibility of a race condition, even though you execute the SELECT followed by the UPDATE in a single transaction. SELECT by itself does not lock anything, so you could have two concurrent sessions both SELECT and get the same id. Then both would attempt to UPDATE, but only one would "win" - the other would have to wait.
To get around this, use the SELECT...FOR UPDATE clause, which creates a lock on the rows it returns.
Prepare Transaction...
$id = SELECT id
FROM companies
WHERE processing = 0
ORDER BY last_crawled ASC
LIMIT 1
FOR UPDATE;
This means that the lock is created as the row is selected. This is atomic, which means no other session can sneak in and get a lock on the same row. If they try, their transaction will block on the SELECT.
UPDATE companies
SET processing = 1
WHERE id = $id;
Commit Transaction
I changed your "execute transaction" pseudocode to "commit transaction." Statements within a transaction execute immediately, which means they create locks and so on. Then when you COMMIT, the locks are released and any changes are committed. Committed means they can't be rolled back, and they are visible to other transactions.
Here's a quick example of using mysqli to accomplish this:
$mysqli = new mysqli(...);
mysqli_report(MYSQLI_REPORT_ERROR | MYSQLI_REPORT_STRICT); /* throw exceptions on error */
$mysqli->begin_transaction();
$sql = "SELECT id
FROM companies
WHERE processing = 0
ORDER BY last_crawled ASC
LIMIT 1
FOR UPDATE";
$result = $mysqli->query($sql);
while ($row = $result->fetch_array(MYSQLI_ASSOC)) {
$id = $row["id"];
}
$sql = "UPDATE companies
SET processing = 1
WHERE id = ?";
$stmt = $mysqli->prepare($sql);
$stmt->bind_param("i", $id);
$stmt->execute();
$mysqli->commit();
Re your comment:
I tried an experiment: I created a table companies, filled it with 512 rows, then started a transaction and issued the SELECT...FOR UPDATE statement above. I did this in the mysql client; no need to write PHP code.
Then, before committing my transaction, I examined the locks reported:
mysql> show engine innodb status\G
=====================================
2013-12-04 16:01:28 7f6a00117700 INNODB MONITOR OUTPUT
=====================================
...
---TRANSACTION 30012, ACTIVE 2 sec
2 lock struct(s), heap size 376, 513 row lock(s)
...
Despite using LIMIT 1, this report shows that the transaction appears to lock every row in the table (plus 1, for some reason).
So you're right, if you have hundreds of requests per second, it's likely that the transactions are queuing up. You should be able to verify this by watching SHOW PROCESSLIST and seeing many processes stuck in a state of Locked (i.e. waiting for access to rows that another thread has locked).
If you have hundreds of requests per second, you may have outgrown the ability for an RDBMS to function as a fake message queue. This isn't what an RDBMS is good at.
There are a variety of scalable message queue frameworks with good integration with PHP, like RabbitMQ, STOMP, AMQP, Gearman, Beanstalk.
Check out http://www.slideshare.net/mwillbanks/message-queues-a-primer-international-php-conference-fall-2012
That depends. There are (in general) different isolation levels in SQL. In MySQL you can change which one to use with SET TRANSACTION ISOLATION LEVEL.
While "SERIALIZABLE" (the strictest one) still doesn't imply that no other statements are executed in between the ones from your transaction, it DOES make sure that the result is no different from executing the simultaneous transactions one after another; if that can't be guaranteed, one transaction is rolled back and executed later.
Note however that the stricter the isolation, the more locking and rollbacks have to be done, so make sure you really need that before using it.
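For instance, the isolation level can be set just before starting the transaction; a minimal sketch with PDO (the same SET TRANSACTION statement works from any client):
<?php
// sketch: run a transaction under SERIALIZABLE isolation
$pdo = new PDO("mysql:host=localhost;dbname=test", "user", "password");
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// applies to the next transaction started on this connection
$pdo->exec("SET TRANSACTION ISOLATION LEVEL SERIALIZABLE");

$pdo->beginTransaction();
// ... your reads and writes ...
$pdo->commit();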
I have a php script that executes mysql pdo queries. There are a few reads and writes to the same table in this script.
For sake of example let's say that there are 4 queries, a read, write, another read, another write, each read takes 10 second to execute, and each write takes .1 seconds to execute.
If I execute this script from the CLI (nohup php execute_queries.php &) twice within 1/100th of a second, what would the execution order of the queries be?
Would all the queries from the first instance of the script need to finish before the queries from the 2nd instance begin to run, or would the first read from both instances start and finish before the table is locked by the write?
NOTE: assume that I'm using MyISAM and that the write is an update to a record (i.e., the entire table gets locked during the write).
Since you are not using transactions, then no, one instance won't wait for all the queries in the other script to finish, so the queries may get interleaved.
There is an entire field of study called concurrent programming that teaches this.
In databases it's about transactions, isolation levels and data locks.
Typical (simple) race condition:
$visits = $pdo->query('SELECT visits FROM articles WHERE id = 44')->fetch(PDO::FETCH_ASSOC)['visits'];
/*
* do some time-consuming thing here
*
*/
$visits++;
$pdo->exec('UPDATE articles SET visits = '.$visits.' WHERE id = 44');
The above race condition can easily turn sour if two PHP processes read visits from the database one millisecond apart. Assuming the initial value of visits was 6, both would increment it to 7 and both would write 7 back into the database, even though the desired effect was that the two visits increment the value by 2 (the final value of visits should have been 8).
The solution to this is using atomic operations (because the operation is simple and can be reduced to one single atomic operation).
UPDATE articles SET visits = visits+1 WHERE id = 44;
Atomic operations are guaranteed by the database engines to take place uninterrupted by other processes/threads. Usually the database has to queue incoming updates so that they don't affect each other. Queuing obviously slows things down because each process has to wait for all processes before it until it gets the chance to be executed.
In a less simple operation we need more than one statement:
SELECT @visits := visits FROM articles WHERE ID = 44;
SET @visits = @visits + 1;
UPDATE articles SET visits = @visits WHERE ID = 44;
But again, even at the database level, three separate atomic statements are not guaranteed to yield an atomic result; they can overlap with other operations, just like in the PHP example.
To solve this you have to do the following:
START TRANSACTION;
SELECT @visits := visits FROM articles WHERE ID = 44 FOR UPDATE;
SET @visits = @visits + 1;
UPDATE articles SET visits = @visits WHERE ID = 44;
COMMIT;
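From PHP, the same locking transaction might look roughly like this (a sketch assuming a PDO connection with exceptions enabled):
<?php
// sketch: increment visits safely using SELECT ... FOR UPDATE inside a transaction
$pdo->beginTransaction();

// lock the row so concurrent increments queue up behind this one
$stmt = $pdo->query('SELECT visits FROM articles WHERE id = 44 FOR UPDATE');
$visits = (int)$stmt->fetchColumn();

/*
 * do some time-consuming thing here
 */

$visits++;

$update = $pdo->prepare('UPDATE articles SET visits = ? WHERE id = 44');
$update->execute([$visits]);

$pdo->commit();  // releases the row lock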
I'm working on a research project that requires me to process large csv files (~2-5 GB) with 500,000+ records. These files contain information on government contracts (from USASpending.gov). So far, I've been using PHP or Python scripts to attack the files row-by-row, parse them, and then insert the information into the relevant tables. The parsing is moderately complex. For each record, the script checks to see if the entity named is already in the database (using a combination of string and regex matching); if it is not, it first adds the entity to a table of entities and then proceeds to parse the rest of the record and inserts the information into the appropriate tables. The list of entities is over 100,000.
Here are the basic functions (part of a class) that try to match each record with any existing entities:
private function _getOrg($data)
{
// if name of organization is null, skip it
if($data[44] == '') return null;
// use each of the possible names to check if organization exists
$names = array($data[44],$data[45],$data[46],$data[47]);
// cycle through the names
foreach($names as $name) {
// check to see if there is actually an entry here
if($name != '') {
if(($org_id = $this->_parseOrg($name)) != null) {
$this->update_org_meta($org_id,$data); // updates some information of existing entity based on record
return $org_id;
}
}
}
return $this->_addOrg($data);
}
private function _parseOrg($name)
{
// check to see if it matches any org names
// db class function, performs simple "like" match
$this->db->where('org_name',$name,'like');
$result = $this->db->get('orgs');
if(mysql_num_rows($result) == 1) {
$row = mysql_fetch_object($result);
return $row->org_id;
}
// check to see if matches any org aliases
$this->db->where('org_alias_name',$name,'like');
$result = $this->db->get('orgs_aliases');
if(mysql_num_rows($result) == 1) {
$row = mysql_fetch_object($result);
return $row->org_id;
}
return null; // no matches, have to add new entity
}
The _addOrg function inserts the new entity's information into the db, where hopefully it will match subsequent records.
Here's the problem: I can only get these scripts to parse about 10,000 records per hour, which, given the size, means a few solid days for each file. The way my DB is structured requires several different tables to be updated for each record, because I'm compiling multiple external datasets. So each record updates two tables, and each new entity updates three tables. I'm worried that this adds too much lag time between the MySQL server and my script.
Here's my question: is there a way to import the text file into a temporary MySQL table and then use internal MySQL functions (or PHP/Python wrapper) to speed up the processing?
I'm running this on my Mac OS 10.6 with local MySQL server.
Load the file into a temporary/staging table using LOAD DATA INFILE and then use a stored procedure to process the data - it shouldn't take more than 1-2 minutes at most to completely load and process the data.
you might also find some of my other answers of interest:
Optimal MySQL settings for queries that deliver large amounts of data?
MySQL and NoSQL: Help me to choose the right one
How to avoid "Using temporary" in many-to-many queries?
60 million entries, select entries from a certain month. How to optimize database?
Interesting presentation:
http://www.mysqlperformanceblog.com/2011/03/18/video-the-innodb-storage-engine-for-mysql/
example code (may be of use to you)
truncate table staging;
start transaction;
load data infile 'your_data.dat'
into table staging
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\n'
(
org_name
...
)
set
org_name = nullif(org_name,'');
commit;
drop procedure if exists process_staging_data;
delimiter #
create procedure process_staging_data()
begin
insert ignore into organisations (org_name) select distinct org_name from staging;
update...
etc..
-- or use a cursor if you have to ??
end#
delimiter ;
call process_staging_data();
Hope this helps
It sounds like you'd benefit the most from tuning your SQL queries, which is probably where your script spends the most time. I don't know how the PHP MySQL client performs, but MySQLdb for Python is fairly fast; doing naive benchmark tests I can easily sustain 10k/sec insert/select queries on one of my older quad-cores. Instead of doing one SELECT after another to test if the organization exists, using a REGEXP to check for them all at once might be more efficient (discussed here: MySQL LIKE IN()?). MySQLdb lets you use executemany() to do multiple inserts simultaneously; you could almost certainly leverage that to your advantage, and perhaps your PHP client lets you do the same thing?
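In PHP, a rough equivalent of executemany() is batching rows into a single multi-row INSERT (a sketch assuming a PDO connection; the orgs table and org_name column are the ones from the question, and the sample names are made up):
<?php
// sketch: insert a batch of organisation names with one round trip
$names = ['Acme Corp', 'Globex', 'Initech'];   // hypothetical batch of new entities

// build "(?), (?), (?)" - one placeholder group per row
$placeholders = implode(', ', array_fill(0, count($names), '(?)'));
$stmt = $pdo->prepare("INSERT INTO orgs (org_name) VALUES $placeholders");
$stmt->execute($names);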
Another thing to consider: with Python you can use multiprocessing and try to parallelize as much as possible. PyMOTW has a good article about multiprocessing.