I developed a PHP script that makes some calls to a SOAP web service.
When I get the results I modify the data and append them to a CSV file.
I have a MySQL table FOUNDS that contains all the fund IDs I use to query the web service:
FOUNDS
+------+---------+
| ID   | status  |
+------+---------+
| AADR | ok      |
| AAIT | ok      |
| AAXJ | pending |
| ACIM | pending |
+------+---------+
I wrote a PHP page that reads from the FOUNDS table with a LIMIT of 10 rows, queries the SOAP service, gets the results, writes them to the CSV, and flags each processed row as "ok".
After looping over the retrieved rows, the script checks whether there are any other "pending" rows.
If the total number of "pending" rows is greater than zero, it does a PHP redirect to itself, so the page basically reloads itself. Otherwise it redirects to another page.
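In outline, the page currently does something like this (a minimal sketch of the flow described above; the DSN, the callSoapService() helper and the page names are placeholders, not my actual code):

<?php
$db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// Grab the next batch of pending fund IDs.
$ids = $db->query("SELECT ID FROM FOUNDS WHERE status = 'pending' LIMIT 10")
          ->fetchAll(PDO::FETCH_COLUMN);

$csv = fopen('results.csv', 'a');
foreach ($ids as $id) {
    $result = callSoapService($id);   // placeholder for the real SOAP call
    fputcsv($csv, [$id, $result]);    // append the (modified) data to the CSV
    $db->prepare("UPDATE FOUNDS SET status = 'ok' WHERE ID = ?")->execute([$id]);
}
fclose($csv);

// Reload this page while there is still work left, otherwise move on.
$pending = $db->query("SELECT COUNT(*) FROM FOUNDS WHERE status = 'pending'")->fetchColumn();
header('Location: ' . ($pending > 0 ? 'process.php' : 'done.php'));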
When we started I had only 100 IDs in the FOUNDS table.
Now I have 40,000 rows in the FOUNDS table and the process takes about 8-10 hours to complete.
How can I modify my script so that I don't need to keep my browser open and don't run into a timeout?
Related
I am going to set up a cyclic process like this:
CRON runs the script process.php, which takes 1000 URLs;
process.php works with those URLs (up to 20 min);
CRON runs process.php again and I want it to take the next (different) 1000 URLs;
How can I prevent it from picking up URLs that are already being processed?
P.S.
process.php runs every 10 min.
The table format is:
+----+------+
| id | url |
+----+------+
| 1 | url1 |
| 2 | url2 |
| 3 | url3 |
| 4 | url4 |
| 5 | url5 |
+----+------+
There are many approaches to this "process once" requirement. The choice often depends upon:
How quickly records are 'grabbed'
Whether records are processed in parallel
How to handle processing failures
Here are some ideas:
Use a queue
You could create a queue using Amazon Simple Queue Service (SQS). First, run a job that extracts the URLs from the database and pushes them into the queue as messages. Then, process.php can read the details from the queue instead of the database.
While the script is running, the SQS message is invisible so other processes can't get it. When the process is finished, it should delete the message from the queue. If the process fails mid-way, the invisible message reappears after a pre-defined interval to be reprocessed.
Queues are a standard way of processing many records. They allow the processing to be distributed over multiple applications/servers. You could even insert single URLs into the queue rather than batching them.
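If you go the SQS route, the pattern is roughly the following (a sketch against the AWS SDK for PHP v3; the region, queue URL, message format and timeouts are assumptions, not part of your setup):

<?php
use Aws\Sqs\SqsClient;

require 'vendor/autoload.php';

$sqs = new SqsClient(['region' => 'us-east-1', 'version' => 'latest']);
$queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/url-queue'; // placeholder

// Producer: push each URL (or a small batch) into the queue as a message.
$sqs->sendMessage([
    'QueueUrl'    => $queueUrl,
    'MessageBody' => json_encode(['id' => 1, 'url' => 'url1']),
]);

// Consumer (process.php): receive messages, do the work, delete on success.
$result = $sqs->receiveMessage([
    'QueueUrl'            => $queueUrl,
    'MaxNumberOfMessages' => 10,
    'VisibilityTimeout'   => 1200,   // keep the messages invisible for up to 20 minutes
]);

foreach ((array) $result->get('Messages') as $message) {
    $job = json_decode($message['Body'], true);
    // ... process $job['url'] here ...
    $sqs->deleteMessage([
        'QueueUrl'      => $queueUrl,
        'ReceiptHandle' => $message['ReceiptHandle'],
    ]);
}

If the process dies before deleteMessage() runs, the message becomes visible again after the visibility timeout and gets picked up by a later run.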
Mark them as processed in the database
Add a processed_timestamp column to the database. When a URL is processed, do an UPDATE command on the database to mark the URL as processed. When retrieving URLs, only SELECT ones that have not been processed.
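A minimal sketch of this approach, assuming a urls table with id, url and a nullable processed_timestamp column:

<?php
$db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');

// Fetch the next batch of unprocessed URLs.
$rows = $db->query("SELECT id, url FROM urls
                    WHERE processed_timestamp IS NULL
                    ORDER BY id
                    LIMIT 1000")->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    // ... process $row['url'] here ...
    $db->prepare("UPDATE urls SET processed_timestamp = NOW() WHERE id = ?")
       ->execute([$row['id']]);
}

If a run can still be working when the next cron invocation starts, you may prefer to stamp the whole batch up front so the next run skips those rows.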
Remember last processed
When retrieving URLs, store the 'last processed' ID number. This could be stored in another database table, in a cache, a disk file, an S3 file, or anywhere that is generally accessible. Then, retrieve this value to determine which records next need to be processed and update it when starting a batch of URLs.
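A sketch of this variant, assuming the 'last processed' ID is kept in a small file next to the script:

<?php
$checkpoint = __DIR__ . '/last_processed_id.txt';
$lastId = is_file($checkpoint) ? (int) file_get_contents($checkpoint) : 0;

$db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$stmt = $db->prepare("SELECT id, url FROM urls WHERE id > :last ORDER BY id LIMIT 1000");
$stmt->execute(['last' => $lastId]);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

if ($rows) {
    // Advance the checkpoint as soon as the batch is claimed, so the next
    // cron run starts after these rows even if this one is still working.
    file_put_contents($checkpoint, end($rows)['id']);
    foreach ($rows as $row) {
        // ... process $row['url'] here ...
    }
}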
I need to check my DB entries for broken images. Right now I select all the items from the table and use cURL to check whether each one is broken or not. I have almost 5000 items in the DB and cURL is taking a lot of time; a single check alone shows a total time of about 0.07 seconds. My table structure is the following:
+----+----------------------------------------+
| id | image_url |
+----+----------------------------------------+
| 1 | http://s3.xxx.com/images/imagename.gif |
| 2 | http://s3.xxx.com/images/imagename.gif |
| 3 | http://s3.xxx.com/images/imagename.gif |
| 4 | http://s3.xxx.com/images/imagename.gif |
+----+----------------------------------------+
So is there any other way to check for broken images? I think I cannot use LIMIT here, as I need to check all the items and then print the result. I have also used file_get_contents(), but it takes a lot of time too.
What you can do here is the following:
Use multi_curl to cURL the images in parallel.
Request the headers only (as you're not interested in the image data); if the status code is anything but 200 OK (or a 302 redirect), the image does not exist.
Chunk the 5000 items first; don't push them all through multi_curl at once. About 50-100 items at a time is fine.
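A minimal sketch of that approach (the checkUrls() helper, the timeout and the chunk size are my own assumptions):

<?php
// Check a chunk of URLs in parallel; returns [id => url] for the broken ones.
function checkUrls(array $urls)
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $id => $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_NOBODY         => true,  // headers only, no image data
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,  // a 302 that leads to a 200 counts as OK
            CURLOPT_TIMEOUT        => 10,
        ]);
        curl_multi_add_handle($mh, $ch);
        $handles[$id] = $ch;
    }

    // Run all transfers in this chunk in parallel.
    do {
        curl_multi_exec($mh, $running);
        if (curl_multi_select($mh) === -1) {
            usleep(1000);
        }
    } while ($running > 0);

    $broken = [];
    foreach ($handles as $id => $ch) {
        if (curl_getinfo($ch, CURLINFO_HTTP_CODE) !== 200) {
            $broken[$id] = $urls[$id];
        }
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $broken;
}

// Usage: feed the 5000 rows through in chunks of about 100.
// foreach (array_chunk($rows, 100, true) as $chunk) {   // $chunk is [id => image_url]
//     $broken += checkUrls($chunk);
// }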
I'm selling my Windows application over my website. I'm using Paypal and LibertyReserve as my payment processors. I created checkout buttons to automate the purchase process. After the purchase is complete, the buyer is redirected to a thank you page.
I'm issuing a serial and a download link after the checkout is complete. I'm currently doing it by using PHP, but it's not secure.
I have a text file with some 100 serials in it. Every time the user visits the thank-you page, the PHP script checks whether a cookie is set (to avoid handing out another serial on repeated visits). If the cookie is not set, it opens the text file with the serials, reads the first serial, stores it in a cookie and deletes it from the text file. After that, the user is redirected with "header" to the final page, which shows the cookie content: "This is your serial key: [cookie value containing the serial goes here]".
Now, anyone with a bit of brains can just read the whole text file with the serial keys. How can I secure this / make it better?
Any suggestion will come in handy. Thanks!
Store every serial code in a MySQL database; the table could look like this:
id, serial, purchased, date, hashcode
When the user comes back from PayPal with a successful payment, you redirect him to something like serial.php?hashcode={hashcode}, a script that retrieves the serial number by its hashcode and prints it to the user. The script also marks the serial row as purchased, and if you don't want the user to be able to see this page a second time, you can enforce that as well.
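A rough sketch of what serial.php could look like (the table and column names follow the layout above; everything else is an assumption):

<?php
// serial.php - look up a serial by its hashcode and mark it as purchased.
$db = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');

$hash = isset($_GET['hashcode']) ? $_GET['hashcode'] : '';

$stmt = $db->prepare("SELECT id, serial FROM serials
                      WHERE hashcode = :hash AND purchased = 0");
$stmt->execute(['hash' => $hash]);
$row = $stmt->fetch(PDO::FETCH_ASSOC);

if (!$row) {
    http_response_code(404);
    exit('Invalid or already used link.');
}

// Flag the row as purchased so the page cannot be viewed a second time.
$db->prepare("UPDATE serials SET purchased = 1, date = NOW() WHERE id = :id")
   ->execute(['id' => $row['id']]);

echo 'This is your serial key: ' . htmlspecialchars($row['serial']);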
You have a table with all the serial keys.
+--------+-----------+
| id | serial |
+--------+-----------+
| 1 | 123-45 |
+--------+-----------+
| 2 | 456-43 |
+--------+-----------+
When a transaction is executed and you have a unique identifier for it, you would have another table that records which serial id belongs to which transaction identifier.
+--------+-----------+-----------+
| id | trans | serialId |
+--------+-----------+-----------+
| 1 | 654326 | 1 |
+--------+-----------+-----------+
| 2 | 876430 | 2 |
+--------+-----------+-----------+
The serialId and trans columns are unique, so no two transactions can ever end up with the same serial id.
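One way to hand out the next free serial under that layout could look like this (a sketch; the table names serials and transactions, and the use of FOR UPDATE row locking on InnoDB, are my assumptions):

<?php
$db = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');
$transactionId = 654326; // unique identifier from the payment processor

$db->beginTransaction();

// Pick the first serial that has not been handed out yet and lock it.
$serialId = $db->query("SELECT s.id FROM serials s
                        LEFT JOIN transactions t ON t.serialId = s.id
                        WHERE t.id IS NULL
                        ORDER BY s.id
                        LIMIT 1 FOR UPDATE")->fetchColumn();

if ($serialId === false) {
    $db->rollBack();
    exit('No serials left.');
}

// The UNIQUE constraints on trans and serialId guarantee one serial per transaction.
$db->prepare("INSERT INTO transactions (trans, serialId) VALUES (:trans, :serial)")
   ->execute(['trans' => $transactionId, 'serial' => $serialId]);

$db->commit();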
I have a DB table which serves as a kind of spool for tasks to perform:
| id | status | owner | param1 |
+----+--------+-------+--------+
| 1 | used | user1 | AAA1 |
| 2 | free | user2 | AAA2 |
| 3 | free | user1 | AAA3 |
| 4 | free | user1 | AAA4 |
| 5 | free | user3 | AAA2 |
This table is accessed by many parallel processes. What would be the best way to ensure that each row from the table is "used" by just a single process, while at the same time rows are handed out in the same order as they appear in the table (sorted by the id column)?
My first idea was to simply mark the next row in the queue with a simple update:
UPDATE table
SET status = "used"
WHERE owner = "userX"
AND status <> "used"
ORDER BY id
LIMIT 1
and then fetch the marked row.
This did not perform at all: with some data (e.g. 3,000,000 rows) and bigger loads, the process list was full of UPDATE statements and MySQL crashed with an "Out of sort memory" error...
So my next idea is doing following steps/queries:
step1
get the first unused row:
SELECT id
FROM table
WHERE owner = "userX"
AND status = "free"
ORDER BY id
LIMIT 1
step2
try to mark it as used if it is still free:
UPDATE table
SET status = "used"
WHERE id = <id from SELECT above>
AND status = "free"
step3
go back to step1 if the row was NOT updated (because some other process already grabbed it), or go to step4 if it was updated
step4
do the required work with successfully found row
The disadvantage is that with many concurrent processes there will always be a lot of bouncing between steps 1 and 2 until each process finds its "own" row. So to be sure the system works stably, I would need to limit the number of tries each process makes, and accept the risk that a process may hit the limit and find nothing while there are still entries in the table.
Maybe there is some better way to solve this problem?
P.S. everything is done at the moment with PHP+MySQL
Just a suggestion: instead of sorting and limiting to 1, maybe just grab MIN(id):
SELECT MIN(id)
FROM table
WHERE owner = "userX"
AND status = "free"
I am also using a MySQL database to choose rows that need to be enqueued for lengthy processing, preferring to handle them in the order of the primary ID column, and also using optimistic concurrency control as shown above (no transactions needed). Thank you to #sleeperson for the MIN(id) answer; it is far superior to ORDER BY / LIMIT 1.
I am posting one additional suggestion that allows for graceful restart. I implemented the following that's done only at startup time:
step0
get lingering rows:
SELECT id
FROM table
WHERE owner = "userX"
AND status = "used"
call step4
Etc. After a crash or other unwelcome (yet oh so common) event, this hands the rows that should have been processed previously back out for processing, instead of leaving them marked 'used' in the database to trip me up later.
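In PHP that startup step might look like this (a sketch; the table name spool and the doWork() step4 worker are placeholders, and it assumes finished rows are removed or given a different status):

<?php
// step0: at startup, re-dispatch rows this owner claimed but never finished.
$db = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$owner = 'userX';

$stmt = $db->prepare("SELECT id, param1 FROM spool
                      WHERE owner = :owner AND status = 'used'");
$stmt->execute(['owner' => $owner]);

foreach ($stmt as $row) {
    doWork($db, $row['id']);   // hand each lingering row straight to the step4 worker
}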
I have a table with 200 rows. I'm running a cron job every 10 minutes to perform some kind of insert/update operation on the table. The operation needs to be performed on only 5 rows at a time each time the cron job runs. So in the first 10 minutes records 1-5 are updated, records 6-10 in the 20th minute, and so on.
When the cron job runs for the 40th time, every record in the table will have been updated exactly once. That, at least, is what needs to be achieved. And the next cron job should repeat the process again.
The problem is that, every time a cron job runs, the insert/update operation should be performed on N rows (not just 5). So if N is 100, all the records would be updated after just 2 cron jobs, and the next cron job would start the process over again.
Here's an example:
This is the table I currently have (200 records). Every time a cron job executes, it needs to pick N records (N is a variable I set in PHP) and update the time_md5 field with the MD5 hash of the current time.
+---------+-------------------------------------+
| id | time_md5 |
+---------+-------------------------------------+
| 10 | 971324428e62dd6832a2778582559977 |
| 72 | 1bd58291594543a8cc239d99843a846c |
| 3 | 9300278bc5f114a290f6ed917ee93736 |
| 40 | 915bf1c5a1f13404add6612ec452e644 |
| 599 | 799671e31d5350ff405c8016a38c74eb |
| 56 | 56302bb119f1d03db3c9093caf98c735 |
| 798 | 47889aa559636b5512436776afd6ba56 |
| 8 | 85fdc72d3b51f0b8b356eceac710df14 |
| .. | ....... |
| .. | ....... |
| .. | ....... |
| .. | ....... |
| 340 | 9217eab5adcc47b365b2e00bbdcc011a | <-- 200th record
+---------+-------------------------------------+
So, the first record (id 10) should not be updated more than once until all 200 records have been updated once; the process should start over only after every record has been updated.
I have some idea on how this could be achieved, but I'm sure there are more efficient ways of doing it.
Any suggestions?
You could use a Red/Black system (like in cluster management).
Basically, all your rows start out as black. When your cron runs, it marks the rows it updates as red. Once all the rows are red, you switch and start turning the red rows back to black. You keep alternating like this, and it effectively marks rows so that you do not update them twice per pass. (You could store the current colour goal in a file or something so that it is shared between cron runs, as in the sketch below.)
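A minimal sketch of that alternation, assuming an extra color column on the table, a small goal file, and the table name items (all three are assumptions):

<?php
$db   = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$n    = 100;                                    // N rows to update per cron run
$file = __DIR__ . '/color_goal.txt';
$goal = is_file($file) ? trim(file_get_contents($file)) : 'red';
$from = ($goal === 'red') ? 'black' : 'red';

// Flip up to N rows that still carry the old colour, updating them as we go.
$stmt = $db->prepare("UPDATE items
                      SET time_md5 = MD5(NOW()), color = :goal
                      WHERE color = :from
                      LIMIT $n");
$stmt->execute(['goal' => $goal, 'from' => $from]);

// Once no rows of the old colour remain, the pass is complete: swap the goal.
$left = $db->query("SELECT COUNT(*) FROM items WHERE color = '$from'")->fetchColumn();
if ((int) $left === 0) {
    file_put_contents($file, $from);            // the next pass turns them back again
}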
I would just run the PHP script every 10/5 minutes with cron, and then use PHP's time and date functions to perform the rest of the logic. If you cannot rely on the timing, you could store a position marker in a small file, as in the sketch below.
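A sketch of the file-based position marker (the file name, batch size and table name items are assumptions):

<?php
$db     = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass');
$n      = 100;                                   // N rows per cron run
$file   = __DIR__ . '/last_id.txt';
$lastId = is_file($file) ? (int) file_get_contents($file) : 0;

// Pick the next N ids after the stored position.
$stmt = $db->prepare("SELECT id FROM items WHERE id > :last ORDER BY id LIMIT $n");
$stmt->execute(['last' => $lastId]);
$ids = $stmt->fetchAll(PDO::FETCH_COLUMN);

if (!$ids) {
    // Reached the end of the table: start the next pass from the beginning.
    file_put_contents($file, 0);
    exit;
}

$in = implode(',', array_map('intval', $ids));
$db->exec("UPDATE items SET time_md5 = MD5(NOW()) WHERE id IN ($in)");

// Remember how far this run got for the next cron invocation.
file_put_contents($file, end($ids));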