Hi, I need suggestions for my research project.
I'm building a tool that reads the RSS feeds produced by Google Alerts and saves the results in a database for later categorization.
I'm using WordPress and the Pods framework to handle the database and UI.
I have four objects (pods), each with its own table:
Resources: the site data taken from an alert feed.
Sources: the domain of the site, like stackoverflow.com.
Feeds: the alert query and RSS URL.
Topics: the main topics under which the other objects are categorized.
In short, the program flow is this:
For every topic, take its feeds.
For every feed, load the RSS XML.
For every entry URL in the RSS, if it is newer than the last check, check whether the domain is already saved in the Sources object.
If the source is present, check whether the URL is saved in the Resources object.
If the resource is present (that is, we already have the data for this URL), add the current topic and feed of the loop to the resource (if not already linked).
If the resource is not present, save the resource with some data plus the current topic and feed.
If the source is not present, save the source with the current topic.
This way I end up with a set of resources linked to feeds and topics, and the related sources with their topics.
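For what it's worth, here is a minimal sketch of the lookup side of that flow. It assumes direct indexed queries through WordPress's $wpdb instead of comparing each entry against all stored records in PHP; the table and column names and the save_source / save_resource / link_topic_and_feed helpers are placeholders, not the real Pods schema:

```php
<?php
// Sketch only: table/column names are assumptions, not the actual Pods schema.
// The point is one indexed lookup per URL instead of scanning every stored record.
function resource_id_for_url( $url ) {
    global $wpdb;
    // Needs an index on the URL column to stay fast as the table grows.
    return $wpdb->get_var( $wpdb->prepare(
        "SELECT id FROM {$wpdb->prefix}pods_resource WHERE permalink = %s LIMIT 1",
        $url
    ) );
}

function source_id_for_domain( $domain ) {
    global $wpdb;
    return $wpdb->get_var( $wpdb->prepare(
        "SELECT id FROM {$wpdb->prefix}pods_source WHERE domain = %s LIMIT 1",
        $domain
    ) );
}

foreach ( $entries as $entry ) {                          // entries newer than the last check
    $domain    = parse_url( $entry['link'], PHP_URL_HOST );
    $source_id = source_id_for_domain( $domain );
    if ( ! $source_id ) {
        $source_id = save_source( $domain, $topic_id );           // hypothetical helper
    }

    $resource_id = resource_id_for_url( $entry['link'] );
    if ( $resource_id ) {
        link_topic_and_feed( $resource_id, $topic_id, $feed_id ); // hypothetical helper
    } else {
        save_resource( $entry, $source_id, $topic_id, $feed_id ); // hypothetical helper
    }
}
```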
The problem is that the data grows into the hundreds very fast; in one month I already reached more than 1500 resource records.
So now every time I run the script it systematically hangs, since every new entry has to be compared with all the previous ones.
So I need a way to make it more efficient, or to avoid the problem by splitting the process.
Since the script is called via AJAX, I thought this flow would work:
Ask the server for the topics/feeds structure.
For every feed in every topic, ask the server to load the XML and pass it back as an array.
Then, in the front end, send a compare-and-save call for every entry.
Of course the drawback is that I will get plenty of calls.
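A minimal sketch of what the per-feed server side of that split could look like in WordPress, assuming one wp_ajax handler per feed; the action name, the get_feed_url_for() and process_feed_entries() helpers, and the omitted nonce check are all placeholders:

```php
<?php
// One AJAX call processes one feed and returns a summary, so no single request
// has to walk every topic and feed.
add_action( 'wp_ajax_overseer_process_feed', function () {
    $feed_id  = (int) $_POST['feed_id'];
    $topic_id = (int) $_POST['topic_id'];

    $rss = fetch_feed( get_feed_url_for( $feed_id ) );   // hypothetical helper returning the alert RSS URL
    if ( is_wp_error( $rss ) ) {
        wp_send_json_error( $rss->get_error_message() );
    }

    // Hypothetical helper wrapping the compare-and-save loop for this feed's entries.
    $saved = process_feed_entries( $rss->get_items(), $topic_id, $feed_id );
    wp_send_json_success( array( 'feed' => $feed_id, 'saved' => $saved ) );
} );
```

The front end would then fire one such call per feed (serially or a few at a time) instead of one giant request.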
Another technique I heard about is to flush the data to the client during the server process; as I understood it, this should trick the server time limit into resetting, but I'm not sure I understood it correctly.
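For reference, a minimal sketch of how that trick is usually put together (process_entry() is a placeholder): set_time_limit() is what actually restarts the execution timer, while flush() just pushes buffered output to the browser so the connection is not dropped mid-run.

```php
<?php
foreach ( $entries as $entry ) {
    set_time_limit( 30 );    // give each iteration its own 30-second budget
    process_entry( $entry ); // placeholder for the compare-and-save step

    echo '.';                // minimal progress output
    if ( ob_get_level() ) {
        ob_flush();          // empty PHP's output buffer first, if one is active
    }
    flush();                 // then push it to the client
}
```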
Of course the best solution would be to rebuild everything with more specific code instead of two general-purpose abstraction layers, but I'm really short on time!
Edit: code here https://github.com/bakaburg1/overseer
I think you're loading too much data at once. The whole point of AJAX is to load data a bit at a time, on-demand as you need it. It doesn't really matter that you have a lot of HTTP calls to the server as long as the scripts you're running to return data don't hang like yours is doing.
I would suggest questioning whether you really need to return all this data at once. If you think you do, a front-end redesign is probably in order. Don't load so much data at once on a single page that it takes forever to load. If you do, caching and other things will help, but it will never truly fix the problem.
Related
Here's the context: we're currently using the basic web stack, and our website builds HTML templates with data it gets from the database directly.
For a ton of reasons, we're splitting this into two projects: one will be responsible for talking to the database directly, the other for displaying the data.
To make it simple, one is the API, the other is the client.
Now we're wondering how we should ask our API for data. To us, there are two totally different options:
One request, one route, per page. We would get one huge object containing everything needed to build the corresponding page.
One request per small chunk of data. For example, on a listing page we'd make one request to get data about the currently logged-in user and display their name along with their avatar, then another request to get all the articles, another to get data about the current page category...
Some like the first option; I don't like it at all. I feel like we're going to have a lot of redundancy. I'm also not sure one huge request is that much faster than X tiny requests. I also don't like binding data to a specific page, as I feel the API should be (somewhat) independent from our front website.
Some also don't like the second option; they fear we'll overload the server by making too many calls, and I can understand this fear. It also looks like it will be hard to properly define the scope of what to send and what not to send without any redundancy. If we're sending only what's needed to display a page, isn't that the first option in the end? But isn't sending unneeded information a waste?
What do you think?
The first approach will be good if getting all the data is fast enough. The fewer the requests, the faster the app. By redundancy I think you mean code redundancy, because sending the same amount of data in one request will definitely be faster than in ten small non-parallel ones (network overhead). If you send a few parallel requests from the UI you can of course get a performance gain, but you should take into account that browsers limit the number of parallel requests.
In another case, if getting some of the data is fast but the rest is slow, you can return the fast data first, show a loading indicator in the UI, and fill in the slow data when it arrives. This improves the user experience by showing the page as fast as possible.
The second approach is more flexible, as you can reuse some requests from other pages. But it comes with a price: the logic for making these requests (gathering the information) has to move to the UI code, making it more complex. And if you need the same data in another app, such as a mobile client, you have to copy that logic. As a rule, writing such code on the backend is easier.
You can also take a look at this pattern, which lets you keep business/domain logic in one service and "frontend-friendly" logic in another (an orchestration service).
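A rough sketch of what such an orchestration endpoint could look like, assuming PHP on the server; get_current_user_data(), get_articles() and get_category_for() are placeholders for calls into the domain API, not an existing library:

```php
<?php
// listing_page.php — composes one page-shaped response out of generic domain calls,
// so the domain API stays page-agnostic while the client still makes a single request.
header( 'Content-Type: application/json' );

$page = isset( $_GET['page'] ) ? (int) $_GET['page'] : 1;

echo json_encode( array(
    'user'     => get_current_user_data(),    // name + avatar for the header
    'articles' => get_articles( $page ),      // the listing itself
    'category' => get_category_for( $page ),  // current page category
) );
```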
I need some advice on website design.
Let's take Twitter as an example for my question. Let's say I am making Twitter. On home_page.php I need both data about tweets (tweet id, who tweeted, tweet time, etc.) and data about the user (userId, username, user profile pic).
Now, to display all this, I have two options in mind:
1) Make separate PHP files like tweets.php and userDetails.php, and use AJAX requests to get the data onto home_page.php.
2) Add all the PHP code (connecting to the DB, fetching data) to home_page.php itself.
With option one I need to make many HTTP requests, which (I think) will put load on the network, so it might slow down the website.
But with option two I will have a defined REST API, which will be good for adding more features in the future.
Please give me some advice on picking the best one. I am still a learner, so if there are other ways of implementing this, please share.
With option 1 you're reliant on JavaScript, which doesn't follow progressive enhancement or graceful degradation; if a user doesn't have JS they will see zero content, which is obviously bad.
Split your code into manageable PHP files to make it easier to read, and require them all in one main PHP file; this won't cost any extra HTTP requests because all the includes are done server-side and one page is sent back.
You can add additional JavaScript to grab more "tweets" like Twitter does, but don't make the main functionality rely on JavaScript.
Don't think of PHP applications as a collection of PHP files that map to different URLs. A single PHP file should handle all your requests and include functionality as needed.
In network programming, it's usually good to minimize the number of network requests, because each request introduces an overhead beyond the time it takes for the raw data to be transmitted (due to protocol-specific information being transmitted and the time it takes to establish a connection for example).
Don't rely on JavaScript. JavaScript can be used for usability enhancements, but must not be used to provide essential functionality of your application.
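A minimal front-controller sketch of that idea, assuming a plain-PHP setup; the route names and the included files are placeholders:

```php
<?php
// index.php — every URL hits this one file, which includes the right handler.
$route = isset( $_GET['r'] ) ? $_GET['r'] : 'home';

require 'db.php';      // shared database connection
require 'helpers.php'; // shared functions

switch ( $route ) {
    case 'home':
        require 'pages/home.php';  // queries tweets + user details, renders full HTML
        break;
    case 'tweets':
        require 'api/tweets.php';  // returns JSON for the optional AJAX layer
        break;
    default:
        http_response_code( 404 );
        require 'pages/not_found.php';
}
```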
Adding to Kiee's answer:
It can also depend on the size of your content. If your tweets and user info are very large, the response from the single PHP file will take considerable time to prepare and deliver. In that case you should go for a "minimal viable response" (e.g. the last 10 tweets + the 10 most popular users, or similar).
But one thing you will definitely have to do is create an API to bring your page to life, no matter which approach you use...
I really hate to ask these vague questions, but I could use some help figuring out the best method to go about this idea.
What I'm doing is fairly simple. I'm creating a page that will display data from several sportsbooks as accurately as possible. There are several different sites, all with their own APIs, but the data is pretty much the same. Some of them have restrictions on the number of times they can be accessed per day, so I'm trying to figure out how to work around that.
Because of the whole cross-domain issue with JavaScript, I was looking into setting something up with PHP to retrieve the data locally. Would it be possible for a script to access each API, say, 4 times a day at certain times, parse the data locally, and then serve it up with AJAX or something else?
Would it make sense to store the data in a MySQL database and then query it? What I'm looking to do seems as simple as displaying live XML data; I just need to figure out a way not to exceed the API call limits for the day. I'm fairly new to server-side code in general, so I apologize if this is blatantly obvious.
If anyone has any ideas, I would greatly appreciate it.
Thanks!
You do indeed need to store the data locally. It can be either a database or files.
The setup would go like this:
Have a PHP script that would:
Fetch the data associated with an API.
Store that result in a cache file.
Have a cron job that runs the above script 3 times a day.
Have a frontend PHP page that displays the content of the cache files.
If your server gets overloaded, you might also consider caching the frontend PHP page to a static HTML file each time an API gets refreshed.
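A minimal sketch of the cron-driven fetch script, assuming the file-based variant; the API URLs and the cache directory are placeholders:

```php
<?php
// fetch_odds.php — run from cron a few times a day. Visitors only ever read the
// cached files, so the rate-limited APIs are hit on your schedule, not theirs.
$apis = array(
    'book_a' => 'https://api.example-book-a.com/odds', // placeholder URL
    'book_b' => 'https://api.example-book-b.com/odds', // placeholder URL
);

foreach ( $apis as $name => $url ) {
    $data = file_get_contents( $url ); // or cURL if you need headers/authentication
    if ( $data !== false ) {
        file_put_contents( __DIR__ . "/cache/{$name}.json", $data, LOCK_EX );
    }
}
```

The frontend page (or a small AJAX endpoint) then just reads cache/*.json, and a crontab entry along the lines of 0 6,14,22 * * * php /path/to/fetch_odds.php covers the three daily refreshes.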
This issue has been quite the brain teaser for me for a little while. Apologies if I write quite a lot; I just want to be clear on what I've already tried, etc.
I will explain the idea of my problem as simply as possible, as the complexities are pretty irrelevant.
We may have up to 80-90 users on the site at any one time. They will most likely all be accessing the same page, which I will call result.php, but different results via a GET variable for the ID (result.php?ID=456). It is likely that fewer than 3 or 4 users will be on an individual record at any one time, and there are upwards of 10,000 records.
I need to know, with less than a 20-25 second margin of error (this is very important), who is on that particular ID on that page, and update the page accordingly, removing their name as soon as possible once they are no longer on the page.
At the moment I am using a jQuery script which calls a PHP file that reads from a database of "currently accessing" usernames for this particular ID, counting only those whose last access is within the last 25 seconds. The file also removes all entries older than 5 minutes, to keep the table tidy.
This was alright with 20 or 30 users, but now that the load has more than doubled, I am noticing this is a particularly slow method.
What other methods are available to me? Has anyone had any experience in a similar situation?
Everything we use at the moment is coded in PHP with a little jQuery. We are running on a server managed offsite by a hosting company, if that matters.
I have come across something called Comet or a Comet Server which sounds like it could potentially be of assistance, but it also sounds extremely complicated for my purposes and far beyond my understanding at the moment.
Look into WebSockets for a real-time socket connection. You could use WebSockets to push out updates in real time (instead of polling) to ensure changes to the 'currently online users' list are sent within milliseconds.
What you want is an in-memory cache with a service layer that maintains the state of activity on the site. Using memcached might be a good starting point. Your pseudo-code would be something like:
On page access, make a call to CurrentUserService
CurrentUserService takes as a parameter the page you're accessing and who you are.
Each time you call it, it removes whatever you were accessing before from the cache.
Then it adds what you're currently accessing.
Then it compiles a list of who else is accessing the same thing based on the current state in the cache.
It returns this list, which your page processes and displays.
If you record when someone accesses a page, you can set a timeout for when the service stops 'counting' them as accessing the page.
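A rough sketch of such a service backed by the PHP Memcached extension; the key names, the 25-second window, and the class/method names are assumptions that mirror the pseudo-code above rather than an existing library:

```php
<?php
// In-memory "who is viewing this record" tracking; no database table to poll.
class CurrentUserService {
    private $cache;
    private $ttl = 25; // seconds before a viewer stops being "counted"

    public function __construct() {
        $this->cache = new Memcached();
        $this->cache->addServer( '127.0.0.1', 11211 );
    }

    // Record that $username is viewing $pageId and return the other current viewers.
    public function checkIn( $pageId, $username ) {
        $key     = "viewers:$pageId";
        $viewers = $this->cache->get( $key ) ?: array();
        $now     = time();

        // Drop viewers whose last check-in is too old, then record/refresh this one.
        foreach ( $viewers as $user => $lastSeen ) {
            if ( $now - $lastSeen > $this->ttl ) {
                unset( $viewers[ $user ] );
            }
        }
        $viewers[ $username ] = $now;
        $this->cache->set( $key, $viewers, 300 );

        unset( $viewers[ $username ] );
        return array_keys( $viewers );
    }
}
```

The jQuery polling call would then hit a tiny endpoint that calls checkIn() and echoes the returned names.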
I have a large database of links, which are all sorted in specific ways and are attached to other information, which is valuable (to some people).
Currently my setup (which seems to work) simply calls a PHP file like link.php?id=123, which logs the request with a timestamp into the DB. Before it spits out the link, it checks how many requests were made from that IP in the last 5 minutes. If it's greater than x, it redirects you to a captcha page.
That all works fine and dandy, but the site has been getting really popular (as well as getting DDoSed for about 6 weeks), so PHP has been getting floored, and I'm trying to minimize the number of times I have to hit PHP to do something. I wanted to show links in plain text instead of through link.php?id= and have an onclick function simply add 1 to the view count. I'm still hitting PHP, but at least if it lags, it does so in the background, and the user can see the link they requested right away.
The problem is that this makes the site REALLY scrapable. Is there anything I can do to prevent this, while still not relying on PHP to do the check before spitting out the link?
It seems that the bottleneck is the database. Each request performs an insert (to log the request), then a select (to determine the number of requests from that IP in the last 5 minutes), and then whatever database operations are necessary to perform the core function of the application.
Consider maintaining the request throttling data (IP, request time) in server memory rather than burdening the database. Two solutions are memcache (http://www.php.net/manual/en/book.memcache.php) and memcached (http://php.net/manual/en/book.memcached.php).
As others have noted, ensure that indexes exist for whatever keys are queried (fields such as the link id). If indexes are in place and the database still suffers from the load, try an HTTP accelerator such as Varnish (http://varnish-cache.org/).
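A minimal sketch of moving the throttling check into memory with the Memcached extension (the older memcache extension works much the same way); the threshold and key format are placeholders:

```php
<?php
// Counts hits per IP in a rolling 5-minute window without touching MySQL.
$cache = new Memcached();
$cache->addServer( '127.0.0.1', 11211 );

$key = 'hits:' . $_SERVER['REMOTE_ADDR'];

$cache->add( $key, 0, 300 );            // starts a 5-minute window; no-op if it already exists
$hits = $cache->increment( $key );

if ( $hits !== false && $hits > 50 ) {  // placeholder threshold ("x")
    header( 'Location: /captcha.php' ); // placeholder captcha page
    exit;
}
// ...otherwise look up the link and redirect as before.
```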
You could do the IP throttling at the web server level. Maybe a module exists for your web server, or, as an example, with Apache you can write your own RewriteMap and have it consult a daemon program so you can do more complex things. Have the daemon program query an in-memory database. It will be fast.
Check your database. Are you indexing everything properly? A table with this many entries will get big very fast and slow things down. You might also want to run a nightly process that deletes entries older than 1 hour etc.
If none of this works, you are looking at upgrading/load balancing your server. Linking directly to the pages will only buy you so much time before you have to upgrade anyway.
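As a sketch of the housekeeping side, assuming the request log lives in a table like request_log(ip, requested_at); the table, column, and credential values are placeholders:

```php
<?php
$db = new mysqli( 'localhost', 'user', 'pass', 'links' ); // placeholder credentials

// Run once: give the "requests from this IP in the last 5 minutes" query an index to use.
$db->query( 'CREATE INDEX idx_requests_ip_time ON request_log (ip, requested_at)' );

// Run nightly from cron: drop rows the throttling check will never look at again.
$db->query( 'DELETE FROM request_log WHERE requested_at < NOW() - INTERVAL 1 HOUR' );
```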
Anything you do on the client side can't really be protected, so why not just use AJAX?
Have an onclick event that calls an AJAX function which returns just the link and fills it into a DIV on your page; because the request and the response are small, it will work fast enough for what you need. Just make sure the function you call still checks the timestamp: it is easy to write a script that calls that function many times to steal your links.
You can check out jQuery or other AJAX libraries (I use jQuery and sAjax). I have lots of pages that dynamically change content very fast, and the client doesn't even know it's not pure JS.
Most scrapers just analyze static HTML, so encode your links and then decode them dynamically in the client's web browser with JavaScript.
Determined scrapers can still get around this, but they can get around any technique if the data is valuable enough.