API autocomplete call (performance, etc.) - PHP

I'm just starting to plan out a web app which allows users to save information about movies. This relies on the TMDb API. Now, I'd like to include an autocomplete feature for when they are searching for a movie. Do you think it's wise to make an API call on every keyup? Or should I wait for a certain amount of time after a keyup? Overall, is this the best way to carry this out?
I will be using PHP and jQuery, and saving/retrieving user data with MySQL.

Delay after keyup unless you (or the server you are hitting) have the speed to handle it. You'll have to account for race conditions anyway, but firing that many calls isn't going to be very helpful. Your API queries are going to be slower than most users' typing speed, which means you'd be making unnecessary calls to the API, wasting both your bandwidth and your users'.
Also, I would set a minimum number of characters entered before you query (~3 is probably good). This also helps lower the number of queries, and you won't be running a query for 'a' or even 'ap', which could both match a lot of things. Once you get to 3 characters ('app') you get a much smaller list of results, which is more helpful for the user.

You could use the TypeWatch jQuery plugin or something similar to only call the API after the user has stopped typing for a certain amount of time. Stack Overflow's Tags and Users pages, for example, use TypeWatch to only process the input after the user has stopped typing for 500ms.
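Whichever client-side delay you use, the server side can be a small PHP script that enforces the minimum-length rule and proxies the TMDb search. This is only a rough sketch; the endpoint path, the TMDB_API_KEY constant, and the returned JSON shape are assumptions you'd adapt to your own setup:

<?php
// search.php - sketch of the autocomplete endpoint the jQuery code would call.
// Assumes TMDb's v3 search/movie endpoint and a TMDB_API_KEY constant.

header('Content-Type: application/json');

$term = isset($_GET['term']) ? trim($_GET['term']) : '';

// Enforce the minimum-length rule server-side too, so a misbehaving client
// can't spam the API with one- or two-character queries.
if (strlen($term) < 3) {
    echo json_encode(array());
    exit;
}

$url = 'https://api.themoviedb.org/3/search/movie?api_key=' . TMDB_API_KEY
     . '&query=' . urlencode($term);

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 5);   // fail fast; autocomplete is latency-sensitive
$response = curl_exec($ch);
curl_close($ch);

if ($response === false) {
    echo json_encode(array());          // degrade gracefully instead of breaking the UI
    exit;
}

$data    = json_decode($response, true);
$results = isset($data['results']) ? $data['results'] : array();
$titles  = array_slice(array_column($results, 'title'), 0, 10);
echo json_encode($titles);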

Atomic/safe serving of single-use codes

I have a list of single-use discount codes for an ecommerce site I'm partnering with. I need to set up a page on my site where my users can fill out a form, and then will be given one of the codes. The codes are pre-determined and have been sent to me in a text file; I can't just generate them on the fly. I need to figure out the best way to get an unused code from the list, and then remove it from the list (or update a flag to mark it as used) at the same time, to avoid any possibility of giving two people the same code. In other words, something similar to a queue, where I can remove one item from the queue atomically.
This webapp will be running on AWS and the current code is Python (though I could potentially use something else if necessary; PHP would be easy). Ideally I'd use one of the AWS services or mysql to do this, but I'm open to other solutions if they're not a royal pain to get integrated. Since I thought "queue," SQS popped into my head, but this is clearly not what it's intended for (e.g. the 14 day limit on messages remaining in the queue will definitely not work for me). While I'm expecting very modest traffic (which means even really hacky solutions would probably work), I'd rather learn about the RIGHT way to do this even at scale.
I can't give actual code examples, but one of the easiest ways to do it would just be to keep an incrementing counter at the top of the file, something like
0
code1
code2
code3
etc
and just skipping that many lines every time a code is used.
You could also do this pretty simply in a database.
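For the database version, a rough sketch with MySQL and PDO might look like the following. It assumes a table codes(code VARCHAR(64) PRIMARY KEY, claimed_token CHAR(32) NULL) pre-loaded from the text file, where a NULL claimed_token means the code is still unused:

<?php
// Sketch: atomically claim one unused code from a MySQL (InnoDB) table.

$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass', array(
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
));

function claimCode(PDO $pdo)
{
    // A random token lets us find the exact row this request claimed.
    $token = bin2hex(openssl_random_pseudo_bytes(16));

    // The UPDATE ... LIMIT 1 is a single atomic statement: only one request can
    // flip any given row from NULL to a token, so no code is handed out twice.
    $claim = $pdo->prepare(
        'UPDATE codes SET claimed_token = ? WHERE claimed_token IS NULL LIMIT 1'
    );
    $claim->execute(array($token));

    if ($claim->rowCount() === 0) {
        return null;                    // no codes left
    }

    $select = $pdo->prepare('SELECT code FROM codes WHERE claimed_token = ?');
    $select->execute(array($token));
    return $select->fetchColumn();
}

Either way (file or database), the important part is that claiming a code is a single atomic operation, not a SELECT followed by a separate UPDATE.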
Amazon DynamoDB is a fast, NoSQL database from AWS, and it is potentially a good fit for this use case. Setting up a database table is easy, and you could load your codes into there. DynamoDB has a DeleteItem operation that also allows you to retrieve the data within the same, atomic operation (by setting the ReturnValues parameter to ALL_OLD). This would allow you to get and delete a code in one shot, so no other requests/processes can get the same code. AWS publishes official SDKs to help you connect to and use their services, including both a Python and PHP SDK (see http://aws.amazon.com/tools/).
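A rough sketch of that get-and-delete pattern with the AWS SDK for PHP (v3) follows; the table name, key name, and region are assumptions, and at real scale you would want something smarter than a Scan to pick the candidate code:

<?php
// Sketch: claim a code from a DynamoDB table "discount_codes" whose only
// key is the string attribute "code", with one item per unused code.
require 'vendor/autoload.php';

use Aws\DynamoDb\DynamoDbClient;
use Aws\DynamoDb\Exception\DynamoDbException;

$client = new DynamoDbClient(array(
    'region'  => 'us-east-1',           // assumption: use your region
    'version' => 'latest',
));

function claimCode(DynamoDbClient $client)
{
    while (true) {
        // Pick any remaining code; fine at modest traffic.
        $scan = $client->scan(array('TableName' => 'discount_codes', 'Limit' => 1));
        if (empty($scan['Items'])) {
            return null;                // no codes left
        }
        $candidate = $scan['Items'][0]['code']['S'];

        try {
            // DeleteItem is atomic: ALL_OLD hands back the item we removed, and
            // the condition fails if another request deleted it first.
            $result = $client->deleteItem(array(
                'TableName'           => 'discount_codes',
                'Key'                 => array('code' => array('S' => $candidate)),
                'ConditionExpression' => 'attribute_exists(code)',
                'ReturnValues'        => 'ALL_OLD',
            ));
            return $result['Attributes']['code']['S'];
        } catch (DynamoDbException $e) {
            if ($e->getAwsErrorCode() === 'ConditionalCheckFailedException') {
                continue;               // lost the race; try another code
            }
            throw $e;
        }
    }
}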

How to deal with External API latency

I have an application that fetches several e-commerce websites using cURL, looking for the best price.
This process returns a table comparing the prices of all searched websites.
But now we have a problem: the number of stores is starting to increase, and the loading time is unacceptable from a user-experience standpoint (currently around a 10s page load).
So we decided to create a database and start storing all the filtered cURL results in it, in order to reduce the external calls and improve page load time.
What I want to know is: despite all these efforts, is there still an advantage to implementing a Memcache module?
I mean, will it help even more, or is it just a waste of time?
The Memcache idea was inspired by this topic, of a guy that had a similar problem: Memcache to deal with high latency web services APIs - good idea?
Memcache could be helpful, but (in my opinion) it's kind of a weird way to approach the issue. If it was me, I'd go about it this way:
Firstly, I would indeed cache everything I could in my database. When the user searches, or whatever interaction triggers this, I'd show them a "searching" page with whatever results the server currently has, and a progress bar that fills up as the asynchronous searches complete.
I'd use AJAX to add additional results as they become available. I'm imagining that the search takes about ten seconds - it might take longer, and that's fine. As long as you've got a progress bar, your users will appreciate and understand that Stuff Is Going On.
Obviously, the more searches go through your system, the more up-to-date data you'll have in your database. I'd use cached results that are under a half-hour old, and I'd also record search terms and make sure I kept the top 100 (or so) searches cached at all times.
Know your customers and have what they want available. This doesn't have much to do with any specific technology, but it is all about your ability to predict what they want (or write software that predicts for you!)
Oh, and there's absolutely no reason why PHP can't handle the job. Tying together a bunch of unrelated interfaces is one of the things PHP is best at.
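As a rough illustration of the "serve anything under half an hour old" read path mentioned above (the table and column names here are assumptions, not from the question):

<?php
// Sketch: serve cached prices if they are fresh enough; otherwise flag the
// product so a background fetcher refreshes it and the AJAX page fills in
// results as they arrive.

$pdo = new PDO('mysql:host=localhost;dbname=prices', 'user', 'pass');
$productId = isset($_GET['product_id']) ? (int) $_GET['product_id'] : 0;

$stmt = $pdo->prepare(
    'SELECT store, price, fetched_at
       FROM price_cache
      WHERE product_id = ?
        AND fetched_at > DATE_SUB(NOW(), INTERVAL 30 MINUTE)'
);
$stmt->execute(array($productId));
$fresh = $stmt->fetchAll(PDO::FETCH_ASSOC);

if (count($fresh) === 0) {
    // Nothing fresh: queue the product for the background fetcher.
    $pdo->prepare('INSERT IGNORE INTO refresh_queue (product_id) VALUES (?)')
        ->execute(array($productId));
}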
Your solution lies outside the bounds of PHP alone. Don't bother hacking together a result in PHP at request time when a cron job could easily be used to populate your database and your PHP script can simply query that database.
If you plan to stick with only PHP, then I suggest you change your script to read from the database you have populated with results. To populate those results, have a cron job hit a PHP script that is not accessible to users and that performs all of your cURL work, as sketched below.
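Something like the following, run from cron every few minutes; the store list, the parsePrice() helper, and the table names are all assumptions:

<?php
// fetch_prices.php - run from cron, e.g.  */15 * * * * php /path/to/fetch_prices.php
// Fetches all store pages in parallel and refreshes the price cache.
// Assumes a unique key on (product_id, store) in price_cache.

$stores = array(
    'storeA' => 'https://store-a.example.com/product/123',
    'storeB' => 'https://store-b.example.com/item/123',
);

$mh = curl_multi_init();
$handles = array();
foreach ($stores as $store => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$store] = $ch;
}

// Run all transfers concurrently instead of one 10-second fetch after another.
do {
    curl_multi_exec($mh, $running);
    if (curl_multi_select($mh) === -1) {
        usleep(100000);                 // avoid a busy loop if select() fails
    }
} while ($running > 0);

$pdo = new PDO('mysql:host=localhost;dbname=prices', 'user', 'pass');
$upsert = $pdo->prepare(
    'REPLACE INTO price_cache (product_id, store, price, fetched_at)
     VALUES (?, ?, ?, NOW())'
);

foreach ($handles as $store => $ch) {
    $html = curl_multi_getcontent($ch);
    if ($html !== false && $html !== null) {
        $price = parsePrice($html);     // hypothetical: extract the price from the page
        $upsert->execute(array(123, $store, $price));
    }
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);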

Page hit count or Google Analytics?

I'm developing a website that is sensitive to page visits. For instance, it has sections that will show the users which parts of the website (which items) have been visited the most. To implement this feature, two strategies come to mind:
Create a page hit counter, sort the pages by the number of visits and pick the highest ones.
Create a Google Analytics account and use its info.
If I go with the first strategy, I would need a very fast and accurate hit counter that can distinguish unique IPs (or users). I believe MySQL wouldn't be a good choice, since a lot of page visits means a lot of DB locks and performance problems. I think a fast logging class would be a better fit.
The second option seems very interesting given all the problems of the first, but I don't know if Google Analytics offers a way (such as an API) to access the information I want. And if it does, is it fast enough?
Which approach (or even an alternative approach) would you suggest I take? Which one is faster? Performance is my top priority. Thanks.
UPDATE:
Thank you. It's interesting to see different answers. These answers reminded me of an important factor. My website updates the "most visited" items every 8 minutes, so I don't need the data in real time, but I do need it to be reasonably accurate every 8 minutes or so. What I had in mind was this:
Log every page visit to a simple text log file
Send a cookie to the user to separate unique users
Every 8 minutes, load the log file, collect the info and update the MySQL tables.
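A rough sketch of the logging step (the cookie name, log path, and line format are just placeholders):

<?php
// Sketch: tag the visitor with a cookie and append one line per hit to a
// plain text log; a cron job aggregates the log into MySQL every 8 minutes.

$visitorId = isset($_COOKIE['visitor_id']) ? $_COOKIE['visitor_id'] : null;
if ($visitorId === null) {
    $visitorId = bin2hex(openssl_random_pseudo_bytes(8));
    setcookie('visitor_id', $visitorId, time() + 86400 * 365, '/');
}

// One tab-separated line per hit; FILE_APPEND | LOCK_EX keeps concurrent
// requests from interleaving their writes.
$line = implode("\t", array(time(), $visitorId, $_SERVER['REQUEST_URI'])) . "\n";
file_put_contents('/var/log/myapp/hits.log', $line, FILE_APPEND | LOCK_EX);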
That said, I wouldn't want to reinvent the wheel. If a 3rd party service can meet my requirements, I would be happy to use it.
Given you are planning to use the page hit data to determine what to display on your site, I'd suggest logging the page hit info yourself. You don't want to be reliant upon some 3rd party service that you'd have to interrogate in order to build your page. This is especially true if you are loading that data in real time, as you'd have to interrogate that service for every incoming request to your site.
I'd be inclined to save the data yourself in a database. If you're really concerned about the performance of the inserts, then you could investigate intercepting requests (I'm not sure how you go about this in PHP, but I'm assuming it's possible) and passing the request data off to a separate thread to store the request info. By having a separate thread handle the logging, you won't hold up your response to the end user.
Also, given you are planning to use the collected data to "... show the users which parts of the website (which items) have been visited the most", you'll need to think about how you access this data to build your dynamic page. It may be a good idea to store a consolidated count for each resource: rather than having 30,000 rows showing that index.php was requested, have one row showing that index.php was requested 30,000 times. This would certainly be quicker to reference than querying what could become quite a large table.
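As a sketch of that consolidated approach (the table name is an assumption):

<?php
// Sketch: keep one row per page with a running hit count.
// Assumes  page_hits(page VARCHAR(255) PRIMARY KEY, hits INT UNSIGNED NOT NULL)

$pdo = new PDO('mysql:host=localhost;dbname=stats', 'user', 'pass');

$stmt = $pdo->prepare(
    'INSERT INTO page_hits (page, hits) VALUES (?, 1)
     ON DUPLICATE KEY UPDATE hits = hits + 1'
);
$stmt->execute(array($_SERVER['REQUEST_URI']));

// Building the "most visited" list is then a cheap read over a small table:
$top = $pdo->query('SELECT page, hits FROM page_hits ORDER BY hits DESC LIMIT 10')
           ->fetchAll(PDO::FETCH_ASSOC);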
Google Analytics has latency to it, and it samples some of the data returned through the API, so that's out.
You could try the API from Clicky. Bear in mind that:
Free accounts are limited to the last 30 days of history, and 100 results per request.
There are many examples of hit counters out there, but it sounds like you didn't find one that met your needs.
I'm assuming you don't need real-time data. If that's the case, I'd probably just read the data out of the web server log files.
Your web server can distinguish IP addresses. There's no fully reliable way to distinguish users. I live in a university town; half the dormitory students have the same university IP address. I think Google Analytics relies on cookies to identify users, but shared computers make that somewhat less than 100% reliable. (But that might not be a big deal.)
"Visited the most" is also a little fuzzy. The easy way out is to count every hit on a particular page as a visit. But a "visit" of 300 milliseconds is of questionable worth. (Probably realized they clicked the wrong link, and hit the "back" button before the page rendered.)
Unless there are requirements I don't know about, I'd probably start by using awk to extract timestamp, ip address, and page name into a CSV file, then load the CSV file into a database.

Twitter competition ~ saving tweets (PHP & MySQL)

I am creating an application to help our team manage a Twitter competition. So far I've managed to interact with the API fine and return the set of tweets that I need.
I'm struggling to decide on the best way to handle the storage of the tweets in the database, how often to check for them, and how to ensure there are no overlaps or gaps.
You can get a maximum of 100 tweets per page. At the moment, my idea is to run a cron script, say once every 5 minutes or so, grab a full 100 tweets at a time, and loop through them, checking the db to see whether each one is already there before adding it.
This has the obvious drawback of running 100 queries against the db every 5 minutes, plus however many INSERTs follow, which I really don't like. I would also much rather have something a little more real-time: as Twitter is a live service, it stands to reason that we should update our list of entrants as soon as they enter.
This again has the drawback of repeatedly polling Twitter which, although it might be necessary, means hammering their API in a way I'm not sure I want to.
Does anyone have any ideas for an elegant solution? I need to ensure that I capture all the tweets and don't leave anyone out, while keeping each user unique in the db. I have considered just adding everything and then grouping the resulting table by username, but that's not tidy.
I'm happy to deal with the display side of things separately, as that's just a pull from MySQL and display. But the backend design is giving me a headache, as I can't see an efficient way to keep it ticking over without hammering either the API or the db.
100 queries in 5 minutes is nothing, especially since a tweet has essentially only four pieces of data associated with it: user ID, timestamp, tweet text, tweet ID - say, about 170 characters' worth of data per tweet. Unless you're running your database on a 4.77MHz 8088, your database won't even blink at that kind of "load".
The Twitter API offers a streaming API that is probably what you want to do to ensure you capture everything:
http://dev.twitter.com/pages/streaming_api_methods
If I understand what you're looking for, you'll probably want a statuses/filter, using the track parameter with whatever distinguishing characteristics (hashtags, words, phrases, locations, users) you're looking for.
Many Twitter API libraries have this built in, but basically you keep an HTTP connection open and Twitter continuously sends you tweets as they happen. See the streaming API overview for details on this. If your library doesn't do it for you, you'll have to check for dropped connections and reconnect, check the error codes, etc - it's all in the overview. But adding them as they come in will allow you to completely eliminate duplicates in the first place (unless you only allow one entry per user - but that's client-side restrictions you'll deal with later).
As far as not hammering your DB, once you have Twitter just sending you stuff, you're in control on your end - you could easily have your client cache up the tweets as they come in, and then write them to the db at given time or count intervals - write whatever it has gathered every 5 minutes, or write once it has 100 tweets, or both (obviously these numbers are just placeholders). This is when you could check for existing usernames if you need to - writing a cached-up list would allow you the best chance to make things efficient however you want to.
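If your streaming library hands you each decoded status as it arrives, the cache-and-flush idea might look roughly like this; the callback shape, table definition, and flush thresholds are assumptions:

<?php
// Sketch: buffer tweets from the streaming connection and flush them to MySQL
// in batches. A unique key on tweet_id lets INSERT IGNORE silently drop any
// duplicates delivered after a reconnect.
// Assumes  tweets(tweet_id BIGINT PRIMARY KEY, username VARCHAR(50),
//                 created_at DATETIME, text VARCHAR(200))

$pdo = new PDO('mysql:host=localhost;dbname=contest', 'user', 'pass', array(
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
));

$buffer    = array();
$lastFlush = time();

// Hypothetical callback your streaming client invokes for every decoded status.
function onTweet(array $status, PDO $pdo, array &$buffer, &$lastFlush)
{
    $buffer[] = $status;

    // Flush every 100 tweets or every 5 minutes, whichever comes first.
    if (count($buffer) >= 100 || time() - $lastFlush >= 300) {
        flushBuffer($pdo, $buffer);
        $buffer    = array();
        $lastFlush = time();
    }
}

function flushBuffer(PDO $pdo, array $buffer)
{
    if (count($buffer) === 0) {
        return;
    }
    $stmt = $pdo->prepare(
        'INSERT IGNORE INTO tweets (tweet_id, username, created_at, text)
         VALUES (?, ?, ?, ?)'
    );
    $pdo->beginTransaction();           // one transaction per batch keeps the writes cheap
    foreach ($buffer as $t) {
        $stmt->execute(array(
            $t['id'],
            $t['user']['screen_name'],
            date('Y-m-d H:i:s', strtotime($t['created_at'])),
            $t['text'],
        ));
    }
    $pdo->commit();
}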
Update:
My solution above is probably the best way to do it if you want to get live results (which it seems like you do). But as is mentioned in another answer, it may well be possible to just use the Search API to gather entries after the contest is over, and not worry about storing them at all - you can specify pages when you ask for results (as outlined in the Search API link), but there are limits as to how many results you can fetch overall, which may cause you to miss some entries. What solution works best for your application is up to you.
I read over your question, and it seems to me that you want to duplicate data already stored by Twitter. Without more specifics on the competition you're running (how users enter, for example, or the estimated number of entries), it's impossible to know whether storing this information locally in a database is the best way to approach the problem.
A better solution might be to skip storing duplicate data locally and pull the entrants directly from Twitter when you're attempting to find a winner.
You could then eliminate duplicate entries on the fly while the code is running. You would just need to request the next page once it has finished processing the 100 entries it already fetched, although I'm not sure whether this is possible directly through the Twitter API.
I think running a cron every X minutes and basing it on the tweets' creation date may work. You can query your database for the date/time of the last recorded tweet, then skip anything at or before that time to prevent duplicates. When you do your inserts into the database, use one or two insert statements containing all the entries you want to record, to keep performance up.
INSERT INTO `tweets` (id, date, ...) VALUES (..., ..., ...), (..., ..., ...), ...;
This doesn't seem too intensive, though it also depends on the number of tweets you expect to record. Also make sure to index the table properly.
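Building that multi-row statement safely with PDO placeholders could look like this (column names are assumptions):

<?php
// Sketch: write a whole cron run's worth of tweets with one multi-row INSERT.

$pdo = new PDO('mysql:host=localhost;dbname=contest', 'user', 'pass');

$rows = array(
    // array($tweetId, $username, $createdAt, $text), ... gathered from the API this run
);

if (count($rows) > 0) {
    $placeholders = implode(', ', array_fill(0, count($rows), '(?, ?, ?, ?)'));
    $stmt = $pdo->prepare(
        "INSERT IGNORE INTO tweets (tweet_id, username, created_at, text)
         VALUES $placeholders"
    );

    // Flatten the rows into one flat parameter list matching the placeholders.
    $params = array();
    foreach ($rows as $row) {
        foreach ($row as $value) {
            $params[] = $value;
        }
    }
    $stmt->execute($params);
}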

How To Aggregate API Data?

I have a system that connects to 2 popular APIs. I need to aggregate the data from each into a unified result that can then be paginated. The scope of the project means that the system could end up supporting tens of APIs.
Each API imposes a max limit of 50 results per request.
What is the best way of aggregating this data so that it is reliable, i.e. ordered, with no duplicates, etc.?
I am using CakePHP framework on a LAMP environment, however, I think this question relates to all programming languages.
My approach so far is to query the search API of each provider and then populate a MySQL table. From this the results are ordered, paginated etc. However, my concern is performance: API communication, parsing, inserting and then reading all in one execution.
Am I missing something, does anyone have any other ideas? I'm sure this is a common problem with many alternative solutions.
Any help would be greatly appreciated.
Yes, this is a common problem.
Search SO for questions like https://stackoverflow.com/search?q=%5Bphp%5D+background+processing
Everyone who tries this realizes that calling other sites for data is slow. The first one or two seem quick, but then other sites break (and your app breaks) and other sites are slow (and your app is slow).
You have to disconnect the front-end from the back-end.
Choice 1 - pre-query the data with a background process that simply gets and loads the database.
Choice 2 - start a long-running background process and check back from a JavaScript function to see if it's done yet.
Choice 3 - the user's initial request spawns the background process -- you then email them a link so they can return when the job is done.
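A rough sketch of that front-end/back-end split using a jobs table (which covers Choice 1 or 2; every name here is an assumption):

<?php
// Sketch: the web request only enqueues work and polls for status; a cron-driven
// or long-running worker does the slow API calls and fills a results table.
// Assumes  search_jobs(id INT AUTO_INCREMENT PRIMARY KEY, query VARCHAR(255),
//                      status ENUM('pending','running','done'), created_at DATETIME)

$pdo = new PDO('mysql:host=localhost;dbname=aggregator', 'user', 'pass');

// Front end: enqueue the search and hand the browser a job id immediately.
function enqueueSearch(PDO $pdo, $query)
{
    $stmt = $pdo->prepare(
        "INSERT INTO search_jobs (query, status, created_at)
         VALUES (?, 'pending', NOW())"
    );
    $stmt->execute(array($query));
    return (int) $pdo->lastInsertId();
}

// Front end: a lightweight status endpoint the browser can poll until the
// background worker has marked the job 'done' and the results can be paginated.
function jobStatus(PDO $pdo, $jobId)
{
    $stmt = $pdo->prepare('SELECT status FROM search_jobs WHERE id = ?');
    $stmt->execute(array((int) $jobId));
    $status = $stmt->fetchColumn();
    return $status !== false ? $status : 'unknown';
}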
I have a site doing just that with over 100 RSS/Atom feeds; this is what I do:
I have a list of feeds and a cron job that iterates over them, about 5 feeds a minute, meaning I cycle through all the feeds every 20 minutes or so.
I fetch the feed and try to insert each entry into the database, using the URL as a unique field; if the URL already exists, I do not insert. The entry date is my current system clock and is set by my application, as date fields in RSS cannot be trusted and in some cases can't even be parsed.
For some feeds (and only experience can tell you which), I also check for duplicate titles, since some websites change their URLs for their own reasons.
The items all end up in the same database table, ready to be queried.
One last thought: if your application is likely to have new feeds added while in production, you should also check whether a feed is "new" (i.e. has no previous entries in the db). If it is, mark all currently available links as inactive; otherwise, when you add a feed, there will be a block of articles from that feed all with the same date and time. (Simply put: the method I described is for future additions to a feed only; past articles will not be available.)
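The insert step boils down to something like this (a rough sketch; the table and column names are placeholders):

<?php
// Sketch: insert feed entries using the URL as the unique field. The UNIQUE KEY
// on url makes INSERT IGNORE skip entries already in the table, and NOW()
// stores my clock rather than the feed's untrustworthy date.
// Assumes  feed_items(url VARCHAR(255), title VARCHAR(255), feed_id INT,
//                     added_at DATETIME, UNIQUE KEY (url))

$pdo = new PDO('mysql:host=localhost;dbname=feeds', 'user', 'pass');

$stmt = $pdo->prepare(
    'INSERT IGNORE INTO feed_items (url, title, feed_id, added_at)
     VALUES (?, ?, ?, NOW())'
);

// $feedId and $entries come from whatever feed parser you use.
$feedId  = 1;
$entries = array();                     // e.g. array(array('url' => ..., 'title' => ...), ...)

foreach ($entries as $entry) {
    $stmt->execute(array($entry['url'], $entry['title'], $feedId));
}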
Hope this helps.
