I'm developing a small web-based (javascript) 'application' for an art project. The thing is called 'Poetry Generator', and it's a script that generates random poems based on user input.
The script has to display a random word to the user every 1/10th of a second. The word list contains 109,582 words.
I've already tried different solutions:
Put all the words in a text file and fetch a random line from it -> too slow (and the user has to download a 3 MB text file before being able to use the application).
Put all the words in a JavaScript array -> JavaScript arrays apparently can't handle 109,585 items.
Pull the words from a database using jQuery's Ajax function together with a JavaScript interval -> this worked perfectly when testing on my localhost, but once uploaded to a real web environment it proved to be too slow. (And I can imagine my hosting provider wouldn't be too happy if I executed 10 queries against their server every second.)
So, does anybody know a different approach I could use to show a random word on a webpage every 1/10th of a second? It doesn't necessarily have to use PHP or JavaScript; as long as it runs in a browser, I'm happy!
Thanks in advance
Teis
There's no reason you have to pull the entire dataset every tenth of a second. Pull a reasonable amount from a database every minute (which would be about 600 words), load it into a local javascript object, and iterate through it.
When either the array index becomes high enough or the timer hits one minute, poll for another set of 600.
When dealing with intervals as short as a tenth of a second, you don't want to invoke the server every single time. You could even load the entire data set into memcached and poll it for random words, skipping costly database calls entirely, since the whole set then lives in memory.
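As a rough sketch of the server side of that idea (the table name words, its columns, and the credentials are placeholders), a small PHP endpoint could hand the client a fresh batch of ~600 random words as JSON:

    <?php
    // words.php -- hypothetical batch endpoint: returns 600 random words as JSON.
    // Assumes a MySQL table words(id INT PRIMARY KEY, word VARCHAR(64)).
    $pdo = new PDO('mysql:host=localhost;dbname=poetry', 'user', 'pass');

    // ORDER BY RAND() over ~110k short rows is acceptable for one call per minute;
    // the point is one request per minute instead of ten per second.
    $stmt  = $pdo->query('SELECT word FROM words ORDER BY RAND() LIMIT 600');
    $words = $stmt->fetchAll(PDO::FETCH_COLUMN);

    header('Content-Type: application/json');
    echo json_encode($words);

The client then calls this once a minute and steps through the returned array every 100 ms.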
You could load only a subset of your words into your JS array. For example, fetch 1,000 random words from your database and show those.
As long as you don't need to generate insanely long text, you could divide the randomization into two steps:
First preselect some of the words server-side (let's say -- 5000?)
Then, client-side, use JS to pick some more at random, from the preselected words.
Pros: no additional requests necessary, and JS can handle an array that big.
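A minimal sketch of that two-step approach, assuming PHP with a MySQL words table on the server; the page embeds the preselected pool directly, so the client never makes another request:

    <?php
    // Step 1: preselect ~5000 words server-side and embed them in the page itself.
    $pdo  = new PDO('mysql:host=localhost;dbname=poetry', 'user', 'pass');
    $pool = $pdo->query('SELECT word FROM words ORDER BY RAND() LIMIT 5000')
                ->fetchAll(PDO::FETCH_COLUMN);
    ?>
    <div id="word"></div>
    <script>
      // Step 2, client-side: pick from the embedded pool at random every 100 ms.
      var pool = <?php echo json_encode($pool); ?>;
      setInterval(function () {
        document.getElementById('word').textContent =
          pool[Math.floor(Math.random() * pool.length)];
      }, 100);
    </script>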
Related
I'm building a PHP application which has a database containing approximately 140 URLs.
The goal is to download a copy of the contents of these web pages.
I've already written code which reads the URLs from my database and then uses cURL to grab a copy of each page. It then extracts everything between <body> and </body> and writes it to a file. It also takes redirects into account, e.g. if I go to a URL and the response code is 302, it follows the appropriate link. So far so good.
This all works fine for a number of URLs (maybe 20 or so), but then my script times out because max_execution_time is set to 30 seconds. I don't want to override or increase this, as I feel that's a poor solution.
I've thought of two workarounds but would like to know if these are a good or bad approach, or if there are better ways.
The first approach is to use a LIMIT on the database query so that it splits the task up into 20 rows at a time (i.e. run the script 7 separate times if there were 140 rows). I understand that with this approach the script, download.php, still needs to be called 7 separate times, so I would need to pass in the LIMIT figures.
The second is to have a script where I pass in the ID of each individual database record I want the URL for (e.g. download.php?id=2) and then make multiple Ajax requests to it (download.php?id=2, download.php?id=3, download.php?id=4, etc.). Based on $_GET['id'] it could query the database to find the URL, and so on. In theory I'd be making 140 separate requests, as it's a one-request-per-URL setup.
I've read some other posts which have pointed to queueing systems, but these are beyond my knowledge. If this is the best way then is there a particular system which is worth taking a look at?
Any help would be appreciated.
Edit: There are 140 URLs at the moment, and this is likely to increase over time. So I'm looking for a solution that will scale without hitting any timeout limits.
I don't agree with your logic. If the script is running OK and it just needs more time to finish, giving it more time is not a poor solution. What you are suggesting makes things more complicated and will not scale well as your URLs increase.
I would suggest moving your script to the command line where there is no time limit and not using the browser to execute it.
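A rough sketch of what the command-line version could look like (the urls table, its columns, and the output directory are all assumptions; on the CLI, max_execution_time defaults to 0, i.e. no limit):

    #!/usr/bin/env php
    <?php
    // cli_download.php -- run with: php cli_download.php
    $pdo  = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $rows = $pdo->query('SELECT id, url FROM urls')->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        $ch = curl_init($row['url']);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,   // follow 302 redirects
            CURLOPT_TIMEOUT        => 30,
        ]);
        $html = curl_exec($ch);
        curl_close($ch);

        if ($html !== false) {
            file_put_contents("pages/{$row['id']}.html", $html);
        }
        echo "Fetched {$row['url']}\n";
    }

The same file can then be pointed at by a cron entry instead of being hit through the browser.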
When you have a list of unknown length which will take an unknown amount of time, asynchronous calls are the way to go.
Split your script into a single page download (like you proposed, download.php?id=X).
From the "main" script get the list from the database, iterate over it and send an ajax call to the script for each one. As all the calls will be fired all at once, check for your bandwidth and CPU time. You could break it into "X active task" using the success callback.
You can either have the download.php file return success data or have it save the result to a database together with the id of the website. I recommend the latter, because then you can just leave the main script alone and grab the results at a later time.
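For illustration, a single-page download.php might look roughly like this (the urls table with url, result and fetched_at columns is an assumption):

    <?php
    // download.php?id=X -- fetch one URL by id and store the result with that id.
    $pdo = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');

    $id   = (int)($_GET['id'] ?? 0);
    $stmt = $pdo->prepare('SELECT url FROM urls WHERE id = ?');
    $stmt->execute([$id]);
    $url = $stmt->fetchColumn();

    $ok = false;
    if ($url !== false) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // handle 302 redirects
        $body = curl_exec($ch);
        curl_close($ch);

        if ($body !== false) {
            // Store the result keyed by id so the main script can collect it later.
            $pdo->prepare('UPDATE urls SET result = ?, fetched_at = NOW() WHERE id = ?')
                ->execute([$body, $id]);
            $ok = true;
        }
    }
    echo json_encode(['id' => $id, 'ok' => $ok]);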
You can't increase the time limit indefinitely, and you can't wait an indefinite amount of time for the request to complete, so you need "fire and forget", and that's what asynchronous calls do best.
As @apokryfos pointed out, depending on how often this sort of "backup" needs to run, you could fit it into a task scheduler (like cron). If you run it on demand, put it behind a GUI; if you run it every X amount of time, point a cron task at the main script and it will do the same thing.
What you are describing sounds like a job for the console. The browser is for the users; your task is something the programmer will run, so use the console. Or schedule the file to run with a cron job or anything similar that is handled by the developer.
Execute all the requests simultaneously using stream_socket_client() and save all the socket IDs in an array.
Then loop through the array of IDs with stream_select() to read the responses.
It's almost like multi-tasking within PHP.
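A sketch of that pattern, with placeholder hosts and no error handling or HTTPS; the connects are done blocking for simplicity, and stream_select() then multiplexes the reads:

    <?php
    $hosts = ['www.example.com', 'www.example.org', 'www.example.net'];

    $sockets = $responses = [];
    foreach ($hosts as $host) {
        $s = stream_socket_client("tcp://$host:80", $errno, $errstr, 5);
        if ($s === false) {
            continue;
        }
        fwrite($s, "GET / HTTP/1.0\r\nHost: $host\r\nConnection: close\r\n\r\n");
        stream_set_blocking($s, false);
        $sockets[(int)$s]   = $s;
        $responses[(int)$s] = '';
    }

    // Keep selecting until every socket has been read to the end.
    while ($sockets) {
        $read   = $sockets;
        $write  = null;
        $except = null;
        if (stream_select($read, $write, $except, 5) === false) {
            break;
        }
        foreach ($read as $s) {
            $chunk = fread($s, 8192);
            if ($chunk !== false && $chunk !== '') {
                $responses[(int)$s] .= $chunk;
            }
            if (feof($s)) {
                fclose($s);
                unset($sockets[(int)$s]);
            }
        }
    }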
I have an old table with thousands of posts. Each post has 0-4 links to pictures of a product. Now I have to use the data in the old table to create a new one, and to move the pictures into different folders. I also have to create thumbnails for every picture.
This sounds like a big job to me, with a long runtime. Should I write the code for this in a single .php file and just let the server run it, or is there a special technique for this? All in all, how do you work with huge tables and tons of pictures via PHP?
It's possible for PHP to do this dynamically whenever it's needed (i.e. when a user attempts to access a post from the old table, PHP detects it and automatically migrates it to the new location). This way you only process the posts you need, and you only need to process each one once. You also spread the work over a longer period of time, which will save you some server load.
This method is not recommended if you have a multitude of users and/or high traffic.
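A rough sketch of that lazy approach, assuming JPEG sources and the GD extension; the paths, thumbnail size and function name are illustrative:

    <?php
    // On-demand migration: only when a picture is first requested do we
    // create its thumbnail and move the original to its new folder.
    function serve_thumbnail(string $oldPath, string $newDir): string {
        $thumbPath = $newDir . '/thumb_' . basename($oldPath);

        if (!file_exists($thumbPath)) {
            $src   = imagecreatefromjpeg($oldPath);   // assumes JPEG source
            $thumb = imagescale($src, 150);           // 150 px wide thumbnail
            imagejpeg($thumb, $thumbPath);
            imagedestroy($src);
            imagedestroy($thumb);

            // Move the full-size picture to its new folder as well.
            rename($oldPath, $newDir . '/' . basename($oldPath));
        }
        return $thumbPath;
    }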
I finally found out how to solve it, so I'll share the technique I used.
I wrote PHP code that manipulates the database, moves the pictures and creates thumbnails for each post (row in the table). Because of the large number of posts, I wasn't able to run it in one go: max_execution_time was set to 30 seconds and the runtime was much longer, so I divided the task into 30-second parts. I added a new column to the table (where I store whether a row has been processed) and always selected only the rows that hadn't been processed yet. Finally, I created a cron job that runs this PHP file every minute.
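For illustration, the batched script could look something like this (table and column names are made up; the 25-second budget stays safely under the 30-second limit):

    <?php
    // migrate_batch.php -- run by cron every minute.
    $pdo   = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');
    $start = time();

    $rows = $pdo->query(
        'SELECT id, picture_path FROM old_posts WHERE processed = 0 LIMIT 200'
    )->fetchAll(PDO::FETCH_ASSOC);

    foreach ($rows as $row) {
        if (time() - $start > 25) {        // stop before max_execution_time hits
            break;
        }
        // ... move this post's pictures and create its thumbnails here ...

        $pdo->prepare('UPDATE old_posts SET processed = 1 WHERE id = ?')
            ->execute([$row['id']]);
    }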
I could also have used the command line, because there is no time limit there, but I didn't have SSH access.
I am running a sweepstakes-like thing and want all my users to be able to load the page and see the exact same countdown and number generator running. I tried this in AS3, but each user caches their own SWF file and gets a different result from the random number generator, and the AS3 countdown is a few seconds off for each user. How would I go about making a countdown that is exactly the same for every user looking at it at the same time, and then a random number generator where every user sees the same result? Is it even possible?
Sorry I wasn't clear on this. I would like the viewers to see the number being generated when the timer runs out, kind of like watching the lotto on TV. Again, not sure if this is possible.
I have looked around; I know AS3, some PHP, and some JavaScript. I have given up on doing this in Flash.
Assuming I understand what you want correctly: store a random value (RV), associated with a UNIX timestamp in seconds, every time someone accesses the page. Make the time column unique; then, if another request is made in the same second, the already-stored random number is read back out of the database.
Store the timer result in a table and show users the stored random number. Then, when needed, simply create another random number, store it again, and show it to the user. Repeat.
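A minimal sketch of that idea, assuming MySQL and a made-up table draws(ts INT PRIMARY KEY, rv INT); the unique timestamp column guarantees every visitor in the same second gets the same number:

    <?php
    $pdo = new PDO('mysql:host=localhost;dbname=sweepstakes', 'user', 'pass');
    $ts  = time();

    // INSERT IGNORE: only the first request in a given second stores a value;
    // everyone else just reads the stored number back.
    $stmt = $pdo->prepare('INSERT IGNORE INTO draws (ts, rv) VALUES (?, ?)');
    $stmt->execute([$ts, random_int(0, 999999)]);

    $stmt = $pdo->prepare('SELECT rv FROM draws WHERE ts = ?');
    $stmt->execute([$ts]);
    echo $stmt->fetchColumn();   // identical for every user in this second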
For every user that comes to the page, first print the time from the server, so that if your server time is 12:00 am, for example, every visitor sees 12:00 am.
Then refresh this time with Ajax every X seconds: the Ajax call sends a request to the server, and the server tells you the time to display.
Keep the random logic on your server. When your application logic decides to change to a different number, the server returns the new number and your clients pick it up on the next Ajax poll.
I hope that answers your question.
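As a sketch, the polling endpoint could return both the server time and the current number as JSON (how the number is produced is up to your server-side draw logic; the field names are made up):

    <?php
    // state.php -- what every client polls via Ajax.
    $currentNumber = 42;   // placeholder: read this from your database instead

    header('Content-Type: application/json');
    echo json_encode([
        'server_time' => time(),   // one clock for every visitor
        'number'      => $currentNumber,
    ]);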
I'm not too familiar with Flash, but what I would do is run the countdown client-side using JavaScript's getUTCSeconds(), getUTCMinutes(), getUTCHours(), etc. to compare the current time against the counter's end time (with the end time expressed in UTC).
Then you could use PHP to generate the random numbers (and a corresponding remaining time associated with each? how often do you want these?) and store them somewhere for later retrieval (a database, a file, or some such). You could use Ajax to grab the random number at the specified times.
For more about JS date/time functions, w3schools has a pretty good resource:
http://www.w3schools.com/jsref/jsref_obj_date.asp
This is possible.
All logic should be stored on the server side. Use Flash only to show results.
Countdown: create it with PHP and store it server-side (database, memory, files, whatever). All clients (written in Flash) request the counter value and display the countdown client-side, starting from the value taken from the server.
Lotto results are also generated on the server and passed to the clients. You could generate intermediate results on the server and have the clients read them one by one, but I'd generate all the results at once and pass them to the client.
Intermediate results can then be synchronized with the counter.
I'm currently building a user panel which will scrape daily information using curl. For each URL it will INSERT a new row into the database. Every user can add multiple URLs to scrape. For example, the database might contain 1,000 users, and every user might have 5 URLs to scrape on average.
How do I run the curl scraping - with a cron job once a day at a specific time? Will a single dedicated server handle this without lag? Are there any techniques to reduce the server load? And about the MySQL database: with 5,000 new rows a day, the database will be huge after a single month.
In case you're wondering, I'm building a statistics service which will show the daily growth of these pages (not talking about traffic), so as I understand it I need to insert a new value per user per day.
Any suggestions will be appreciated.
5,000 x 365 is only about 1.8 million rows... nothing to worry about for the database. If you want, you can put the data into MongoDB (needs a 64-bit OS). That will let you expand and shuffle loads around to multiple machines more easily when you need to.
If you want to run curl non-stop from cron until it is finished, just "nice" the process so it doesn't use too many system resources. Otherwise, you can run a script which sleeps a few seconds between each curl pull. If each scrape takes 2 seconds, that would allow you to scrape 43,200 pages per 24-hour period. If you slept 4 seconds between each 2-second pull, that would let you do 14,400 pages per day (5k is about a third of 14.4k, so you should be done in roughly a third of a day with a 4-second sleep between 2-second scrapes).
This seems very doable on a minimal VPS, at least for the first six months to a year. After that, you can think about utilizing more machines.
(Edit: also, if you want, you can store the scraped page source gzipped if you're worried about space.)
I understand that each customer's pages need to be checked at the same time each day to keep the growth stats accurate. But do all customers need to be checked at the same time? I would divide the customers into chunks based on their ids. That way you can update each customer at the same time every day, but you don't have to do them all at once.
For the database size problem I would do two things. First, use partitions to break the data up into manageable pieces. Second, if the value did not change from one day to the next, I would not insert a new row for the page; when processing the data for presentation, I would extrapolate the missing values. Unless all you are storing is small bits of text, in which case I'm not sure the number of rows is going to be that big a problem if you use proper indexing and pagination for queries.
Edit: adding a bit of an example
function do_curl($start_index, $stop_index) {
    // Query for all pages with ids between start index and stop index
    $query = "SELECT * FROM db_table WHERE id >= $start_index AND id <= $stop_index";
    for ($i = $start_index; $i <= $stop_index; $i++) {
        // do curl here
    }
}
URLs would look roughly like:
http://xxx.example.com/do_curl?start_index=1&stop_index=10;
http://xxx.example.com/do_curl?start_index=11&stop_index=20;
The best way to deal with the growing number of pages is probably to write a single cron script that generates the start_index and stop_index based on the number of pages you need to fetch and how often you intend to run the script.
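As a sketch, that cron entry point could compute the chunks itself and call do_curl() for each range (the chunk size, the max-id query and the connection details are assumptions; do_curl() is the function above):

    <?php
    // cron_driver.php -- hypothetical cron entry point.
    $pdo   = new PDO('mysql:host=localhost;dbname=app', 'user', 'pass');
    $chunk = 10;
    $total = (int)$pdo->query('SELECT MAX(id) FROM db_table')->fetchColumn();

    for ($start = 1; $start <= $total; $start += $chunk) {
        do_curl($start, $start + $chunk - 1);   // do_curl() as defined above
    }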
Use multi cURL and properly optimise (not simply normalise) your database design. If I were running this cron job, I would spend time studying whether it's possible to do the work in chunks. Regarding hardware, start with an average configuration, keep monitoring it, and add CPU or memory as needed. Remember, there is no silver bullet.
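A sketch of fetching several URLs in parallel with the curl_multi API (the URL list is a placeholder; in practice it would come from the database):

    <?php
    $urls = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c'];

    $mh      = curl_multi_init();
    $handles = [];
    foreach ($urls as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$i] = $ch;
    }

    // Run all transfers until they complete.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh);   // wait for activity instead of busy-looping
        }
    } while ($running && $status === CURLM_OK);

    $pages = [];
    foreach ($handles as $i => $ch) {
        $pages[$i] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);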
I am attempting to build a script that will log data that changes every 1 second. The initial thought was "Just run a php file that does a cURL every second from cron" -- but I have a very strong feeling that this isn't the right way to go about it.
Here are my specifications:
There are currently 10 sites I need to gather data from and log to a database -- this number will inevitably increase over time, so the solution needs to be scalable. Each site spits its data out to a URL every second, but only keeps 10 lines on the page, and it can sometimes spit out up to 10 new lines each time, so I need to pick up the data every second to make sure I get all of it.
As I will also be writing this data to my own DB, there's going to be I/O every second of every day for a considerably long time.
Barring magic, what is the most efficient way to achieve this?
It might help to know that the data I am getting every second is very small, under 500 bytes.
The most efficient way is NOT to use cron, but instead to write an app that runs continuously, keeps its curl handles open, and repeats the request every second. That way the connections stay alive more or less forever and the repeated requests will be very fast.
However, if the target servers aren't yours or your friends', there's a good chance they won't appreciate you hammering them.
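A rough sketch of such a long-running poller, reusing one curl handle per site so the connections stay open (the endpoints are placeholders, and the database write is left as a comment):

    <?php
    $endpoints = ['http://site1.example/data', 'http://site2.example/data'];

    $handles = [];
    foreach ($endpoints as $i => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        // HTTP keep-alive means the TCP connection is reused across
        // iterations as long as the handle stays open.
        $handles[$i] = $ch;
    }

    while (true) {
        $start = microtime(true);
        foreach ($handles as $ch) {
            $data = curl_exec($ch);   // connection reuse keeps this fast
            if ($data !== false) {
                // ... write the ~500-byte payload to your database here ...
            }
        }
        // Sleep away whatever is left of the current second.
        $elapsed = microtime(true) - $start;
        if ($elapsed < 1.0) {
            usleep((int)((1.0 - $elapsed) * 1000000));
        }
    }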