You see them everywhere, like the Twitter and Facebook buttons that show up on blogs and websites and display a number of "tweets" or "likes".
All I need to be able to do is display a number from my MySQL database based on two variables (username and an ID). It would probably be useful to encrypt the variables somehow so that users can't just alter the badge's code and display another user's number.
But more importantly, I just need to know how to use the HTML code like you find in social network badges and have it talk to a PHP script on my server which will calculate the number from the database based on the variables held within the badge.
Any clue where to start?
Edit: I'm not talking about the kind of badges like you find on stackoverflow, I mean the kind other sites let you paste on your blog/site. Like Digg lets you show that your site has been dugg 7000 times, etc.
You may wish to look up the GD library for PHP and related tutorials. Basically, all those badges consist of is a static image used as a template with some dynamic text drawn on top, usually a username and a number (likes, tweets, etc.).
For the HTML code, you would do something similar to:
<img src="http://www.yourserver.com/yourscript.php?username=miki&id=1337" />
This will send an HTTP GET request to your script, causing it to execute. Your script can then talk to the database, fetch the user's information, use GD to draw that text onto a template, and return the result to the browser with the correct MIME type and content.
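As a rough illustration, yourscript.php could look something like the sketch below. The template file, table, and column names here are all assumptions; adjust them to your own schema.

<?php
// yourscript.php -- minimal sketch; badge_template.png and the users table
// schema are invented for illustration.
$username = isset($_GET['username']) ? $_GET['username'] : '';
$id       = isset($_GET['id']) ? (int) $_GET['id'] : 0;

$db   = new mysqli('localhost', 'dbuser', 'dbpass', 'mydb');
$stmt = $db->prepare('SELECT likes FROM users WHERE username = ? AND id = ?');
$stmt->bind_param('si', $username, $id);
$stmt->execute();
$stmt->bind_result($likes);
$count = $stmt->fetch() ? (int) $likes : 0;

// Draw the number onto the static template.
$img   = imagecreatefrompng('badge_template.png');
$black = imagecolorallocate($img, 0, 0, 0);
imagestring($img, 5, 10, 10, $username . ': ' . $count, $black);

// Return it with the correct MIME type.
header('Content-Type: image/png');
imagepng($img);
imagedestroy($img);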
You're talking about calling a remote script, essentially.
I assume you mean something like this -
You are viewing your profile on your site. You have a widget that promises to display the total number of points you have, for example.
You offer a "code" button like the YouTube embed or the Facebook "Like" button.
The user clicks this, gets a segment of code and is expected to be able to paste it anywhere on the internet where applicable and the code will generate an icon or something with presumably the username and their points.
First, you can do this several ways. The most cost-effective, in my opinion, is to generate the user's button on your server on update. Say your points meant "number of thumbs up your articles received", so it's an integer value: every time you get a thumbs up, you would re-cache the button and write it out to a flat file. If you're good, you would write it to an image and flatten it to a jpg or gif. If you don't know how to do that, you can write it to HTML and save the file under a user-specific "slug" like md5($username) . '.html'. That way, every time the server is called you don't pile on bandwidth with redundant queries and account lookups; you only serve the optimized image or HTML file.
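A hypothetical sketch of that re-cache-on-update idea (the function name and paths are made up):

<?php
// Hypothetical sketch: call this whenever the user's points change, so the
// public badge is just a static file and no query runs per view.
function recache_badge($username, $points) {
    $html = '<div class="badge">' . htmlspecialchars($username) . ': '
          . (int) $points . ' points</div>';
    file_put_contents('badges/' . md5($username) . '.html', $html);
}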
Second, you can give the user an iframe that has the HTML in it. This is generally how the Facebook "Like" button does it for people who don't use the FBML method. The problem is that many sites see iframes as a potential XSS attack and will strip them out. So, in order to make use of the iframe, you would need control of the domain, which may defeat the purpose of your request if the intention is to share your profile goodies.
Third, you can call a JS file on your server that makes an AJAX call to your database and serves the results. This is also most likely going to be seen as an XSS attack, and you should probably not give it much more thought.
I mentioned the iframe and JS methods in case you were looking to provide an option for other people who run their own sites. That's how "Like" is used by site owners to show how many times their domain has been "liked", and so on. These people have control of their domains, so the iframe and JS methods are logical.
So -
This answer may not have much in way of raw code for you, but it should help you start.
I would go with the image method since it is safer. You would give the user an image tag with their slug in the src attribute. They can paste it anywhere, and there is no way to rewrite the number within the image. Most forums and other places where you can post to other people's sites allow images. Just do a Google search on drawing images with PHP, as well as on using the ImageMagick library to merge text and images.
The question is how to get the source code that is loaded by AJAX calls; this is not crawled. For example, how do I crawl the pictures on a link like this? http://www.tiendeo.nl/Catalogi/amsterdam/16558&subori=web_sliders&buscar=Boni&sw=1366
If you do Inspect Element, it will show you the right code in the middle where the pictures are. But how do I crawl this? If you click the next page, the source will have other images. How do I get the source for all the images?
If I understand your question correctly (how do I crawl information loaded into the page by AJAX calls?), the answer is that you either need some sort of JavaScript-aware crawler, or you need to inspect the JavaScript to figure out which resources are being polled to load the content you're interested in. From PHP you should be able to send cURL GET requests to these URLs and receive the same responses the site's JavaScript is using to render the entries.
The latter option has some rewards--namely that you will most likely be able to get simple, easy-to-use JSON responses to your requests.
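A rough sketch of that second approach, assuming you've found the endpoint in your browser's network tab (the URL below is a placeholder):

<?php
// Fetch the same endpoint the site's own JavaScript calls.
$url = 'http://www.example.com/ajax/catalog?page=2';   // placeholder URL

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; MyCrawler/1.0)');
$body = curl_exec($ch);
curl_close($ch);

// Many such endpoints return JSON, which is easy to work with.
$data = json_decode($body, true);
print_r($data);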
As with most web scraping efforts, it tends to be the case that some content providers won't appreciate your interest in their data (especially if you're collecting it in ways that put undue strain on their systems or resources). Keep in mind that they may take steps (technical or legal) to stop you if they notice/mind.
Addendum:
If you're hoping to crawl a variety of similar sites without needing to look through the source to find the resources they're using (let's say for the sake of argument that you're just trying to naively scrape all images over a certain size from several sites selling the same sort of items), you'd need the former option--some sort of JavaScript-aware scraper. I don't know if such a thing exists, but it wouldn't surprise me.
On Imgur, you can input an image URL, and a few seconds later there's a thumbnail of the image. Or in Bing Search, you can (or used to be able to) view a thumbnail of the website in the search results before visiting it.
I would love to implement something similar for my website, but I can't wrap my head around how it is done. Moreover, are there not security concerns? I'd imagine the servers have to at least download the website, render it and take a screenshot. What if it's a malicious website, and you download something malicious onto your server?
A headless web browser engine like PhantomJS can be used for this. See the examples on their wiki. Yes, it would be prudent to run this in some sort of a sandbox, feeding a queue of URLs into it and then taking the generated thumbnails from the file system.
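For instance, you could shell out to PhantomJS from PHP along these lines (a sketch; it assumes phantomjs is installed and that a capture script such as the rasterize.js from the PhantomJS examples sits next to your PHP file):

<?php
// Sketch: render a URL to a PNG by invoking PhantomJS. phantomjs and
// rasterize.js are assumed to be available on this machine.
$url = 'http://www.example.com/';
$out = '/tmp/thumbs/' . md5($url) . '.png';

$cmd = 'phantomjs rasterize.js ' . escapeshellarg($url) . ' ' . escapeshellarg($out);
shell_exec($cmd);   // in production, run this from a sandboxed queue worker instead

if (file_exists($out)) {
    echo 'Captured ' . $out;
}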
While I don't know the internal workings of any of the aforementioned services, I'd guess that they download/create a local copy of the images and generate a thumbnail from that.
Imgur, as an image hosting service, definitely needs a copy of the image prior to being able to generate thumbnails or anything else from it. The image may be stored locally or just in memory, but either way, it must be downloaded.
The search engines displaying screenshots of the sites likely have services that periodically take a screenshot of the viewable area when the content is getting indexed, and then serve those screenshots (or derivatives) along with the search results. Taking a screenshot really isn't dangerous, so there's nothing to worry about there, and whatever tools are used to load/parse/index the websites will obviously be written with security considerations in mind.
Of course, there are security concerns about the data you're downloading, too; the images can easily contain executable code (such as PHP) in their EXIF data, so you need to be careful about what you do with the images and how.
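One common precaution (just a sketch, with invented paths) is to re-encode anything you download with GD, which discards EXIF and everything else that isn't pixel data:

<?php
// Re-encode a downloaded image so no embedded metadata survives.
$raw = file_get_contents('/tmp/downloaded_image');   // placeholder path
$img = imagecreatefromstring($raw);                  // false if it isn't really an image
if ($img !== false) {
    imagejpeg($img, '/var/www/thumbs/safe.jpg', 85); // clean copy, no EXIF
    imagedestroy($img);
}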
I am looking for a way to create functionality similar to what happens when you post a link to an existing website on Facebook. If this statement is rather ambiguous, I will try to elaborate.
When you paste your link and submit your post, Facebook, together with your link, shows a small preview of the page you are posting (text and maybe a small image).
What are the ways to achieve this?
I read a similar post, but the thing is that I do not need the image so much; text will be sufficient.
I am working in PHP, but the language is not important, because I am looking for a high-level idea.
Previously I was thinking about parsing the content of the link with cURL, but the thing is that in a lot of situations the text returned by Facebook is not available on the page.
Are there other ways?
From what I can tell, Facebook pulls from the meta name="description" tag's content attribute on the linked page.
If no meta description tag is available, it seems to pull from the beginning of the first paragraph <p> tag it can find on the page.
Images are pulled from available <img> tags on the page, with a carousel selection available to pick from when posting.
Finally, the link subtext is also user-editable (start a status update, include a link, and then click in the link subtext area that appears).
Personally, I would go with this route: cURL the page, parse it for a meta description tag, and if there isn't one, grab some likely text using a basic algorithm or just the first paragraph tag; then allow the user to edit whatever was presented (it's friendlier to the user and also works around pages that return different content depending on the user agent). Do the user-facing control as AJAX so that you don't have issues with however long it takes your server to fetch the link you want to preview.
I'd recommend using a DOM library (you could even use DOMDocument if you're comfortable with it and know how to handle possibly malformed html pages) instead of regex to parse the page for the <meta>, <p>, and potentially also <img> tags. Building a regex which will properly handle all of the myriad potential different cases you will encounter "in the wild" versus from a known set of sites can get very rough. QueryPath usually comes recommended, and there are stackoverflow threads covering many of the available options.
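A rough DOMDocument sketch of that parsing step (assuming $html already holds the page fetched with cURL):

<?php
// Pull the meta description, falling back to the first <p> tag.
libxml_use_internal_errors(true);   // real-world HTML is rarely valid
$doc = new DOMDocument();
$doc->loadHTML($html);

$description = '';
foreach ($doc->getElementsByTagName('meta') as $meta) {
    if (strtolower($meta->getAttribute('name')) === 'description') {
        $description = $meta->getAttribute('content');
        break;
    }
}
if ($description === '') {
    $paragraphs = $doc->getElementsByTagName('p');
    if ($paragraphs->length > 0) {
        $description = trim($paragraphs->item(0)->textContent);
    }
}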
Most modern sites, especially larger ones, are good about populating the meta description tag, especially for dynamically generated pages.
You can scrape the page for <img> tags as well, but you'll want to then host the images locally: You can either host all of the images, and then delete all except the one chosen, or you can host thumbnails (assuming you have an image processing library installed and turned on). Which you choose depends on whether bandwidth and storage are more important, or the one-time processing of running an imagecopyresampled, imagecopyresized, Gmagick::thumbnailimage, etc, etc. (pick whatever you have at hand/your favorite). You don't want to hot link to the images on the page due to both the morality of it in terms of bandwidth and especially the likelihood of ending up with broken images when linking any site with hotlink prevention (referrer/etc methods), or from expiration/etc. Personally I would probably go for storing thumbnails.
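If you go the thumbnail route with GD, the resampling step could look roughly like this (a sketch; it assumes a JPEG source and that GD is enabled):

<?php
// Scale an image down to a fixed width, preserving aspect ratio.
function make_thumbnail($srcPath, $dstPath, $maxWidth = 200) {
    list($w, $h) = getimagesize($srcPath);
    $tw = $maxWidth;
    $th = (int) round($h * ($maxWidth / $w));

    $src = imagecreatefromjpeg($srcPath);        // assumes a JPEG source
    $dst = imagecreatetruecolor($tw, $th);
    imagecopyresampled($dst, $src, 0, 0, 0, 0, $tw, $th, $w, $h);
    imagejpeg($dst, $dstPath, 85);

    imagedestroy($src);
    imagedestroy($dst);
}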
You can wrap the entire link entity up as an object for handling expiration/etc if you want to eventually delete the image/thumbnail files on your own server. I'll leave particular implementation up to you since you asked for a high level idea.
"but the thing is that in a lot of situations the text returned by facebook is not available on the page."
Have you looked at the page's meta tags? I've tested with a few pages so far and this is generally where content not otherwise visible on the rendered linked pages is coming from, and seems to be the first choice for Facebook's algorithm.
Full disclosure upfront, I'm a developer at ThumbnailApp.com.
It's a JSON API service with an optional JavaScript SDK which I think does exactly what you're after: it will parse a string to detect any URLs and return the title, description, and thumbnail of the asset. If the page has OpenGraph tags, it will use those for the image thumbnail. It's currently in private beta, but we're adding more accounts each week.
If you feel that you really need a do-it-yourself solution:
Check out the Python-based webkit2png and the headless browser PhantomJS. They can render web pages to an image (the default size is 800x600); then you'll have to write some code to resize and crop the image, like taswyn mentioned. Ideally you would then upload the resized image to Amazon S3 and have it hosted on a CDN such as CloudFront.
To get the title and description, first fetch the URL's content (with cURL or whatever) and check the Content-Type header to make sure it's a web page. If it is, you can then use an HTML parser such as the SimpleHTMLDOM PHP library to grab the title and description metadata. If you want it exactly like Facebook, you will also need to check for any OpenGraph tags, specifically the og:image tag.
Also don't forget about caching. The first render and description parsing can take a long time. Even if your site is fast, the webpage you're rendering could be slow and the best approach is to render / parse it once, then just save and return the resized image and meta data for subsequent requests. Depending on what your requirements are you may need to refresh the cached data every hour or you could get away with refreshing it once a day.
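A very simple file-based cache sketch (the cache path, the one-hour lifetime, and the build_preview() helper are all invented for illustration):

<?php
// Return cached preview data if it is fresh enough, otherwise rebuild it.
function get_preview($url) {
    $cacheFile = '/tmp/previews/' . md5($url) . '.json';
    if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < 3600) {
        return json_decode(file_get_contents($cacheFile), true);
    }
    $preview = build_preview($url);   // hypothetical: your render/parse step from above
    file_put_contents($cacheFile, json_encode($preview));
    return $preview;
}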
To do it yourself takes quite a bit of work and lots of server configuration. I feel using a 3rd party service is a better way to go, but obviously I have a biased opinion :)
Recently I've been working on a little side project, aimed at being a wallpaper site. It'll mainly consist of my own photos, but I may implement some kind of upload system for other users to submit their own.
Essentially, I can make a page that takes an ID from the URL, e.g., 284 from /wallpaper?id=284. I can easily pop a .jpg after it and make a different image load depending on the ID. The annoying thing is, I'd also like a title/description to accompany the photos, and I don't want to also have a title attribute in the URL, as it would be easy to tamper with, even if no damage could be done.
I'm guessing this has something to do with MySQL tables (judging by the name, and the sudden realisation that not much would be possible without something like it), which I can use as I'm currently running on a local WAMP server. I should also be able to use them when and if I upload this to an online server.
I'm basically asking how I can create some kind of document or table containing information that can be retrieved using some kind of ID.
Thanks
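What you're describing is essentially a database table keyed by that ID. A minimal sketch (the table and column names are invented):

<?php
// Look up a wallpaper's title, description, and file by the id in the URL.
$id = isset($_GET['id']) ? (int) $_GET['id'] : 0;

$pdo  = new PDO('mysql:host=localhost;dbname=wallpapers', 'dbuser', 'dbpass');
$stmt = $pdo->prepare('SELECT title, description, filename FROM wallpapers WHERE id = ?');
$stmt->execute(array($id));
$row = $stmt->fetch(PDO::FETCH_ASSOC);

if ($row) {
    echo '<h1>' . htmlspecialchars($row['title']) . '</h1>';
    echo '<p>' . htmlspecialchars($row['description']) . '</p>';
    echo '<img src="/images/' . htmlspecialchars($row['filename']) . '" />';
}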
You have a forum (vBulletin) that has a bunch of images. How easy would it be to have a page that visits a thread, steps through each page, and forwards the images to the user (via AJAX or whatever)? I'm not asking about filtering (that's easy, of course).
Doable in a day? :)
I have a site that uses CodeIgniter as well. Would it be even simpler using it?
Assuming this is to be carried out on the server, cURL + regexps are your friends... and yes, doable in a day.
There are also some open-source HTML parsers that might make this cleaner.
It depends on where your scraping script runs.
If it runs on the same server as the forum software, you might want to access the database directly and check for image links there. I'm not familiar with vBulletin, but it probably offers a plugin API that allows for high-level database access. That would simplify querying all the posts in a thread.
If, however, your script runs on a different machine (or, in other words, is unrelated to the forum software), it would have to act as an HTTP client. It could fetch all pages of a thread (either automatically by searching for a NEXT link in the page, or manually by having all pages specified as parameters) and search the HTML source code for image tags (<img .../>).
A regular expression could then be used to extract the image URLs. Finally, the script could use these image URLs to construct another page displaying all the images, or it could download them and create a package.
In the second case the script actually acts as a "spider", so it should respect things like robots.txt or meta tags.
When doing this, make sure to rate-limit your fetching. You don't want to overload the forum server by requesting many pages per second. The simplest way to do this is probably just to sleep for X seconds between each fetch.
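A sketch of that HTTP-client variant in PHP (the thread URL format and page count are assumptions):

<?php
// Fetch every page of a thread, collect image URLs, and pause between requests.
$lastPage  = 5;   // however many pages the thread has
$imageUrls = array();

for ($page = 1; $page <= $lastPage; $page++) {
    $html = file_get_contents('http://forum.example.com/showthread.php?t=1234&page=' . $page);
    preg_match_all('/<img[^>]+src=["\']([^"\']+)["\']/i', $html, $matches);
    $imageUrls = array_merge($imageUrls, $matches[1]);
    sleep(2);   // rate-limit so the forum server isn't hammered
}

$imageUrls = array_unique($imageUrls);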
Yes, doable in a day.
Since you already have a working CI setup I would use it.
I would use the following approach:
1) Make a model in CI capable of:
logging in to vBulletin (images are often added as attachments, and you need to be logged in before you can download them). Use something like Snoopy.
collecting the URL for the "last page" button using preg_match(), parsing the URL with parse_url() and parse_str(), and generating links from page 1 to the last page (see the sketch after this list)
collecting the HTML from all the generated links, still using Snoopy
finding all images in the HTML using preg_match_all()
downloading all images, still using Snoopy
moving each downloaded image from a tmp directory into another directory, renaming it imagename_01, imagename_02, etc. if the same image name already exists
saving the image name and precise byte size in a DB table, so you can avoid downloading the same image more than once
2) Make a method in a controller that collects all images
3) Set up a cronjob that collects images at regular intervals. wget -O /tmp/useless.html http://localhost/imageminer/collect should do nicely
4) Write the code that outputs pretty HTML for the end user, using the DB table to get the images.
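As referenced in step 1, here's a rough sketch of turning the "last page" link into the full list of page URLs (the showthread.php?t=...&page=... format is an assumption about the forum's URL scheme):

<?php
// Pretend preg_match() already pulled this href out of the pagination HTML.
$lastUrl = 'showthread.php?t=1234&page=17';

parse_str(parse_url($lastUrl, PHP_URL_QUERY), $query);
$lastPage = (int) $query['page'];

// Generate the links from page 1 to the last page.
$pageUrls = array();
for ($p = 1; $p <= $lastPage; $p++) {
    $pageUrls[] = 'http://forum.example.com/showthread.php?t=' . $query['t'] . '&page=' . $p;
}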