The question is: how do I get the source of content loaded by ajax calls? It isn't present in what I crawl. For example, how do I crawl the pictures on a link like this? http://www.tiendeo.nl/Catalogi/amsterdam/16558&subori=web_sliders&buscar=Boni&sw=1366
If you inspect the element, it shows the right markup in the middle where the pictures are, but how do I crawl this? If you click through to the next page, the source contains other images. How do I get the source for all the images?
If I understand your question correctly (How do I crawl information loaded into the page by ajax calls?), the answer is that you either need some sort of javascript-aware crawler, or you need to inspect the javascript to figure out what resources are being polled to load the content you're interested in. From PHP you should be able to send cURL GET requests to those URLs and receive the same responses the site's javascript is using to render the entries.
The latter option has some rewards--namely that you will most likely be able to get simple, easy-to-use JSON responses to your requests.
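As a rough illustration of that second approach (the endpoint URL below is just a placeholder for whatever you find in the browser's network tab, not anything from the site in question):

<?php
// Minimal sketch: fetch a JSON endpoint discovered in the browser's network tab.
// The URL is hypothetical -- substitute the actual request the site's javascript
// makes when you click "next page".
$url = 'http://www.example.com/catalog/page-data?page=2';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// Some endpoints only answer requests that look like they came from the page itself.
curl_setopt($ch, CURLOPT_HTTPHEADER, array('X-Requested-With: XMLHttpRequest'));
$body = curl_exec($ch);
curl_close($ch);

$data = json_decode($body, true);
if (is_array($data)) {
    // Walk whatever structure the endpoint returns and pull out the image URLs.
    var_dump($data);
}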
As with most web scraping efforts, it tends to be the case that some content providers won't appreciate your interest in their data (especially if you're collecting it in ways that put undue strain on their systems or resources.) Keep in mind that they may take steps (technical or legal) to stop you if they notice/mind.
Addendum:
If you're hoping to crawl a variety of similar sites without needing to look through the source to find the resources they're using, (let's say for the sake of argument that you're just trying to naively scrape all images over a certain size from several sites selling the same sort of items), you'd need the former option--some sort of javascript-aware scraper. I don't know if such a thing exists, but it wouldn't surprise me.
I am looking for a way to create functionality similar to what happens when you post a link to an existing website on Facebook. If that statement is rather ambiguous, I will try to elaborate.
When you paste your link and submit your post, Facebook shows, together with your link, a small preview of the page you are posting (text and maybe a small image).
What are the ways to achieve this?
I read a similar post, but the thing is that I do not need the image so much; the text will be sufficient.
Working in PHP, but language is not important, because I am looking for a high level idea.
Previously I was thinking about parsing the content of the link with cURL, but the thing is that in a lot of situations the text returned by Facebook is not available on the page.
Are there other ways?
From what I can tell, Facebook pulls from the meta name="description" tag's content attribute on the linked page.
If no meta description tag is available, it seems to pull from the beginning of the first paragraph <p> tag it can find on the page.
Images are pulled from available <img> tags on the page, with a carousel selection available to pick from when posting.
Finally, the link subtext is also user-editable (start a status update, include a link, and then click in the link subtext area that appears).
Personally I would go with this route: cURL the page, parse it for a meta description tag, fall back to some likely text found with a basic algorithm (or just the first paragraph tag) if there isn't one, and then allow the user to edit whatever was presented (it's friendlier to the user and also sidesteps the differences in what sites return for different user agents). Do the user-facing control as ajax so that you don't have issues with however long it takes your site to fetch the link you want to preview.
I'd recommend using a DOM library (you could even use DOMDocument if you're comfortable with it and know how to handle possibly malformed html pages) instead of regex to parse the page for the <meta>, <p>, and potentially also <img> tags. Building a regex which will properly handle all of the myriad cases you will encounter "in the wild", versus a known set of sites, can get very rough. QueryPath usually comes recommended, and there are stackoverflow threads covering many of the available options.
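A minimal DOMDocument sketch of that meta-description-with-paragraph-fallback idea (assuming you've already fetched the page body into $html with cURL):

<?php
// Pull a description from a fetched page, preferring the meta description
// tag and falling back to the first paragraph.
function extractDescription($html) {
    $doc = new DOMDocument();
    // Suppress warnings from the malformed HTML you'll meet in the wild.
    @$doc->loadHTML($html);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            return trim($meta->getAttribute('content'));
        }
    }

    $paragraphs = $doc->getElementsByTagName('p');
    if ($paragraphs->length > 0) {
        return trim($paragraphs->item(0)->textContent);
    }

    return '';
}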
Most modern sites, especially larger ones, are good about populating the meta description tag, especially for dynamically generated pages.
You can scrape the page for <img> tags as well, but you'll want to host the images locally. You can either store every candidate image and then delete all except the one chosen, or store thumbnails (assuming you have an image processing library installed and enabled). Which you choose depends on whether bandwidth and storage matter more to you than the one-time processing cost of running imagecopyresampled, imagecopyresized, Gmagick::thumbnailimage, etc. (pick whatever you have at hand / your favourite). You don't want to hotlink to the images on the original page, both because leeching their bandwidth is poor form and because you're likely to end up with broken images on any site that uses hotlink prevention (referrer checks and the like) or expires its image URLs. Personally I would probably go for storing thumbnails.
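If you go the thumbnail route with GD, a bare-bones version might look like this (assumes JPEG input; the width and quality numbers are arbitrary choices, not anything prescribed here):

<?php
// Rough thumbnail sketch using GD; swap the imagecreatefrom*/image* calls
// for PNG/GIF sources as needed.
function makeThumbnail($sourcePath, $thumbPath, $maxWidth = 150) {
    list($width, $height) = getimagesize($sourcePath);
    $ratio = $maxWidth / $width;
    $thumbWidth  = $maxWidth;
    $thumbHeight = (int) round($height * $ratio);

    $src = imagecreatefromjpeg($sourcePath);
    $dst = imagecreatetruecolor($thumbWidth, $thumbHeight);
    imagecopyresampled($dst, $src, 0, 0, 0, 0, $thumbWidth, $thumbHeight, $width, $height);
    imagejpeg($dst, $thumbPath, 85);

    imagedestroy($src);
    imagedestroy($dst);
}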
You can wrap the entire link entity up as an object for handling expiration/etc if you want to eventually delete the image/thumbnail files on your own server. I'll leave particular implementation up to you since you asked for a high level idea.
"...but the thing is that in a lot of situations the text returned by Facebook is not available on the page."
Have you looked at the page's meta tags? I've tested with a few pages so far and this is generally where content not otherwise visible on the rendered linked pages is coming from, and seems to be the first choice for Facebook's algorithm.
Full disclosure upfront, I'm a developer at ThumbnailApp.com.
It's a JSON API service with an optional JavaScript SDK which I think does exactly what you're after: it will parse a string, detect any URLs, and return the title, description and thumbnail of the asset. If the page has OpenGraph tags, it will use those for the image thumbnail. It's currently in private beta but we're adding more accounts each week.
If you feel that you really need a do-it-yourself solution:
Check out the Python-based Webkit2Png and the headless browser PhantomJS. They can render web pages to an image (the default size is 800x600); then you'll have to write some code to resize and crop the image, as taswyn mentioned. Ideally you would then upload the resized image to Amazon S3 and have it served from a CDN such as CloudFront.
To get the title and description, first fetch the URL content (cURL or whatever) and check the Content-Type header to make sure it's a web page. If it is, you can then use an HTML parser such as the SimpleHTMLDOM PHP library to grab the title and description metadata. If you want it exactly like Facebook, you will also need to check for any OpenGraph tags, specifically og:image.
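A rough sketch of that content-type check plus OpenGraph scan, using DOMDocument rather than SimpleHTMLDOM (the URL is a placeholder):

<?php
// Fetch a URL, confirm it's HTML, then collect any og:* meta tags
// (og:title, og:description, og:image, ...).
$url = 'http://www.example.com/some-article';

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
$contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);
curl_close($ch);

if (strpos((string) $contentType, 'text/html') !== false) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);

    $og = array();
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        $property = $meta->getAttribute('property');
        if (strpos($property, 'og:') === 0) {
            $og[$property] = $meta->getAttribute('content');
        }
    }
    // $og['og:image'] etc. are now available if the page defines them.
    print_r($og);
}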
Also don't forget about caching. The first render and description parsing can take a long time. Even if your site is fast, the webpage you're rendering could be slow and the best approach is to render / parse it once, then just save and return the resized image and meta data for subsequent requests. Depending on what your requirements are you may need to refresh the cached data every hour or you could get away with refreshing it once a day.
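A naive file-based cache along those lines might look like this; buildPreview() stands in for your own fetch/parse/thumbnail routine, and the TTL and cache path are arbitrary choices:

<?php
// Store the parsed preview data keyed by URL and refresh it once the
// entry is older than the TTL.
function getCachedPreview($url, $ttl = 86400) {
    $cacheFile = sys_get_temp_dir() . '/preview_' . md5($url) . '.json';

    if (file_exists($cacheFile) && (time() - filemtime($cacheFile)) < $ttl) {
        return json_decode(file_get_contents($cacheFile), true);
    }

    $preview = buildPreview($url); // hypothetical: your fetch/parse routine
    file_put_contents($cacheFile, json_encode($preview));
    return $preview;
}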
To do it yourself takes quite a bit of work and lots of server configuration. I feel using a 3rd party service is a better way to go, but obviously I have a biased opinion :)
This is a very broad question, so I'm just looking for the best way of doing this.
I want to periodically monitor certain pages on my website.
I am looking to write a PHP script which will load the page as if it were being loaded in a browser. That means it loads all CSS, JavaScript, images, videos, etc.
I just want to get the load time of these pages and then email the results to myself from a cron job. For this I was going to use microtime() and PHPMailer.
Does anyone know of a script to load a complete page, or have any suggestions on how to go about this?
Thanks.
What if the page has dynamic content? You would also need to execute all the JavaScript and fetch all the CSS and images to get the final time. I believe that is impossible using only PHP.
A PHP script you run from the same server that hosts your site will give you abnormally low readings, since it's essentially loading from the first hop. What you really want to do is run the script from various servers outside your own. There are also limits to what PHP can see, i.e. JS, jQuery, etc.
The simplest approach is to check from your home PC using JMeter. You set your home browser to use it as a proxy and go to whichever page you want; JMeter will record the statistics, and when you are happy you can save them.
This avoids the problems of handling JS and JQuery through a script.
This could get very complicated. You'd basically have to parse the HTML, and then there are tons of edge cases, like JS including further resources, etc. I would definitely recommend using something like the Network tab of Chrome's dev tools instead.
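If you still want to try the PHP-only route despite those limits, a rough sketch of the parse-and-time approach might look like the following. It won't execute JavaScript, the URL is a placeholder, and relative resource URLs would still need resolving against the page's base URL:

<?php
// Time the base HTML fetch, then time each <img> it references.
// Stylesheets and scripts could be handled the same way via <link> and <script>.
function timeFetch($url) {
    $start = microtime(true);
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $body = curl_exec($ch);
    curl_close($ch);
    return array(microtime(true) - $start, $body);
}

list($htmlTime, $html) = timeFetch('http://www.example.com/');
$total = $htmlTime;

$doc = new DOMDocument();
@$doc->loadHTML($html);
foreach ($doc->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    if ($src !== '') {
        $res = timeFetch($src); // relative URLs would need resolving first
        $total += $res[0];
    }
}
echo "Approximate load time: {$total}s\n";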
You see them everywhere. Like the twitter and facebook buttons that show up on blogs and websites that display a number of "tweets" or "likes".
All I need to be able to do is display a number from my MySQL database based on two variables (username and an ID). It would probably be useful to encrypt the variables somehow so that users can't just alter the badge's code and display another user's number.
But more importantly, I just need to know how to use the HTML code like you find in social network badges and have it talk to a PHP script on my server which will calculate the number from the database based on the variables held within the badge.
Any clue where to start?
Edit: I'm not talking about the kind of badges like you find on stackoverflow, I mean the kind other sites let you paste on your blog/site. Like Digg lets you show that your site has been dugg 7000 times, etc.
You may wish to look up the GD library for PHP and related tutorials. Basically, all those badges consist of a static image used as a template, with some dynamic text drawn on top, usually the username and a number (likes, tweets, etc.).
For the HTML code, you would do something similar to:
<img src="http://www.yourserver.com/yourscript.php?username=miki&id=1337" />
This will send an HTTP GET request to your script, causing it to execute. Your script can then communicate with the database, fetch the user's information, use GD to draw that text onto a template image, and return the result to the browser with the correct MIME type and content.
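A rough sketch of what yourscript.php could do; the database credentials, table and column names, and the template file are all made up for illustration:

<?php
// yourscript.php -- draw a per-user badge on top of a static template.
$username = isset($_GET['username']) ? $_GET['username'] : '';
$id       = isset($_GET['id']) ? (int) $_GET['id'] : 0;

$pdo  = new PDO('mysql:host=localhost;dbname=mydb', 'user', 'pass'); // placeholder credentials
$stmt = $pdo->prepare('SELECT score FROM badges WHERE username = ? AND id = ?'); // hypothetical schema
$stmt->execute(array($username, $id));
$score = (int) $stmt->fetchColumn();

$img   = imagecreatefrompng('badge-template.png'); // your static template
$black = imagecolorallocate($img, 0, 0, 0);
imagestring($img, 4, 10, 10, $username . ': ' . $score, $black);

header('Content-Type: image/png');
imagepng($img);
imagedestroy($img);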
You're talking about calling a remote script, essentially.
I assume you mean something like this -
You are viewing your profile on your side. You have a widget that promises to display the total number of points you have, for example.
You offer a "code" button like youtube embed or facebook "like"
The user clicks this, gets a segment of code and is expected to be able to paste it anywhere on the internet where applicable and the code will generate an icon or something with presumably the username and their points.
First, you can do this several ways. The most cost-effective, in my opinion, is to regenerate the user's button on your server whenever it changes. Say your points mean "number of thumbs up your articles received", so it's an integer value: every time you get a thumbs up, you re-cache the button and write it to a flat file. Ideally you render it to an image and flatten it to a jpg or gif; if you don't know how to do that, you can write it to HTML and save the file under a user-specific "slug" like md5(username).'.html'. That way every request for the badge just serves the optimized image or HTML file, instead of piling on bandwidth with redundant queries and account lookups. (There's a rough sketch of this after the list of options below.)
Second, you can give the user an iframe that has the HTML in it. This is generally how the Facebook "like" button works for people who don't use the FBML method. The problem is that many sites see iframes as a potential XSS attack and will strip them out. So, in order to make use of the iframe, you would need control of the domain, which may defeat the purpose of your request if the intention is to share your profile goodies.
Third, you can call a JS file on your server that makes an ajax call to your database and serves the results. This is also most likely going to be treated as an XSS risk, and you should probably not give it much more thought.
I mentioned the iframe and JS methods in case you were looking to provide an option for people who run their own sites; that's the way "like" is used by site owners to show how many times their domain has been "liked", and so on. Those people control their domains, so the iframe and JS methods make sense for them.
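Here's the promised sketch of the "re-cache on update" idea from the first option; renderBadge() is a hypothetical stand-in for whatever GD/ImageMagick drawing routine you end up writing, and the path is arbitrary:

<?php
// Whenever the count changes, regenerate the badge once and write it to a
// predictable per-user path, so serving it later is just a static file hit.
function refreshBadge($username, $count) {
    $path  = __DIR__ . '/badges/' . md5($username) . '.png';
    $image = renderBadge($username, $count); // hypothetical drawing helper
    imagepng($image, $path);
    imagedestroy($image);
}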
So -
This answer may not have much in way of raw code for you, but it should help you start.
I would go with the image method since it is safer. You give the user an image tag with their slug in the src attribute; they can paste it anywhere, and there is no way to rewrite the number inside the image. Most forums and other places where you can post to other people's sites allow images. Just do a Google search on drawing images with PHP, and on using the ImageMagick library to merge text and images.
I am currently developing a web site with PHP + MySQL and jQuery. So far I have been doing it on my local machine. I notice that when I view the page, the images on it take some time to load (not much, but it's visible). All the images are small (PNGs of less than 3 KB). When I load the page, there are also some database connections happening in order to get some data that I will display.
I am not sure if this loading-time issue has something to do with the number of images, or with the time that the PHP script and the DB connections take to execute. (I have very little data in my database, so I wouldn't assume the latter.)
My question is: is it a good approach to preload all the images at the beginning of each page? I tried it with jQuery and it works fine; I'm just not sure what disadvantages it brings. For example, to do so I need to include the jQuery library at the beginning of the page, which I thought was bad practice.
If these PNGs are stored in the database as BLOBs — not clear from your question — don't do that. Serving images from a DB through PHP is not as efficient as letting the web server serve them straight from the filesystem. If the images are tied to particular records, just name the PNG after the row ID, so you can find it in a directory dedicated to storing those images. The PHP code then just generates the URL that points to the PNG file on disk, so the web server can send them statically.
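For example (the directory and array key here are arbitrary, just to show the idea):

<?php
// Name the file after the row ID and emit a static URL the web server can handle.
$row = array('id' => 42); // stand-in for your fetched record
echo '<img src="/images/products/' . (int) $row['id'] . '.png" alt="" />';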
I don't think preloading the images within the same page is going to buy you anything. If anything, it might slow the apparent overall page load time because the browser can only retrieve a fixed number of resources concurrently, typically 2-4. Loading images at the top of the <body> means there are other things at the top of the page "above the fold" that have to wait for some HTTP connection slots to free up. Better to let the images load in their natural order.
Preloading makes sense in two situations:
The image isn't shown by default, but is expected to be needed as the user interacts with the page. Good examples of this are the hover and click state images for rollovers.
The image isn't used on this page, but will be needed on the next. Good examples of this are any site where there is a clear progression from one page to the next, like in a shopping cart.
Either way, do the preload at the very bottom of the <body>, so everything else loads first.
Having addressed those two issues, run YSlow on your site. It started out as a plugin for Firebug, which in turn is a plugin for Firefox, but it's since been ported to all major browsers except IE.
The beauty of YSlow is that it detects common slowdowns automatically, just by loading the page while the extension is active. It then gives you a clear grade for the page, so you can judge when you're "done" optimizing. If you're below an A, you're not done yet. :) It's not uncommon to see sites rating a D or worse, because the default configuration for web servers is conservative to avoid causing problems. Fixing YSlow warnings is generally pretty easy, but you have to be careful to avoid creating caching and other problems, which is why the default server config doesn't do these things.
Another answer recommended the Google PageSpeed offering. It's available as a plugin for Chrome and Firefox, as a server-side Apache module, and as a Google-hosted service. I have no idea how it compares to YSlow, but it looks interesting.
Also consider using the browser's debugger to get a waterfall graph of resource load times:
In Firebug you get this in the Net tab.
In Safari, you get to it via the Develop menu, which is normally disabled in Preferences. Turn it on if needed, then say Develop > Start Timeline Recording. That puts you into the Network Requests instrument. You can also get to it through Develop > Show Web Inspector.
In Chrome, say View > Developer > Developer Tools, then go to the Network tab.
IE has a very weak form of this, via Tools > Developer Tools > Profiler. It just gives a table of numbers, rather than a waterfall graph, so the information is there, but you can't just visually scan for long bars to find the slowest resources.
You should use the Page Speed plugin from Google to check what takes most of the time to load. It will show separate load times for the images as well.
If you're using lots of small PNGs, I suggest combining them into one image and controlling what's displayed via the CSS background property, since they are part of the styling and not the information. That way, instead of several images only one will be loaded and reused across all elements, and preloading that single image is really easy.
Have you considered using CSS Sprites to combine all of your images into a single download? There are a number of tools online to help you do this, and it will significantly reduce the number of HTTP requests required by your page.
Make sure you have set the correct Expires header on your images to allow them to be cached.
Finally, take a look at YSlow, which can provide you with further optimisation tips.
I need to write a text file viewer (not the directory tree, but the actual file contents) for use in a browser. It will be used to view large files. I want to give the user the ability to actually, umm, browse the file, i.e. prev-page and next-page buttons, while each page shows only a portion of the file.
Two questions:
Is there any way to pass the file descriptor through POST (or something) so that on each page I can keep reading from an already-open file instead of starting all over again (again: huge files)?
Is there a way to read the file backwards? That would be very useful for browsing back through a file.
Any other implementation ideas are very welcome. Thanks
Keeping the file open between requests is not a good idea - you don't have to "start all over again" - just maintain an offset and use fseek() to jump to that offset. That way, you can also implement the "backwards jumping".
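A minimal sketch of that offset-based paging; the file path is fixed server-side (never taken from the user), and the client just passes the byte offset it wants:

<?php
// Seek straight to the requested offset and return one page worth of data.
$file     = '/path/to/huge-file.txt'; // placeholder path
$pageSize = 64 * 1024;                // 64 KB per "page", an arbitrary choice
$offset   = isset($_GET['offset']) ? max(0, (int) $_GET['offset']) : 0;

$fh = fopen($file, 'rb');
fseek($fh, $offset);
$chunk = fread($fh, $pageSize);
fclose($fh);

header('Content-Type: text/plain; charset=utf-8');
echo $chunk;
// A "next" link would use offset + strlen($chunk); "previous" uses
// max(0, offset - pageSize). No file handle has to survive between requests.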
Cut your huge files into smaller files once, and then serve the small files to the user.
You should consider pagination. If you're concerned about the user being frustrated by needing to click "next" too often, you could make each chunk reasonably large (so a normal reader pages every 20min).
Another option is the Chunked-Encoding transfer type: Wikipedia Entry. This would allow your server to respond quickly and give the user something to read while it streams the rest of the file over the network (rather than the server needing to read in the whole file and send it all at once). This could dramatically improve the perceived performance compared to serving the files normally, but it still consumes a lot of bandwidth on your server.
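Whether the response actually goes out chunked depends on your server setup, but the usual way to get this streaming effect from PHP is to read and flush as you go. A rough sketch:

<?php
// Send the file in pieces and flush after each one, so the browser starts
// rendering before the whole file has been read server-side.
$file = '/path/to/huge-file.txt'; // placeholder path

header('Content-Type: text/plain; charset=utf-8');
$fh = fopen($file, 'rb');
while (!feof($fh)) {
    echo fread($fh, 8192);
    if (ob_get_level() > 0) {
        ob_flush();   // drain PHP's output buffer if one is active
    }
    flush();          // push what we have to the client
}
fclose($fh);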
You might be able to simulate a large document with Javascript and AJAX, but only send pieces at a time for better performance.
Consider sending a few pages' worth of your document and attaching listeners to the browser's scroll event. Over time, or as the user scrolls down, you AJAX in more chunks. This creates a few annoying UX edge cases, like:
Scroll bar indicates a much smaller document than there actually is
You might be able to avoid this by filling in the bottom of your document with many page breaks, but it'll be difficult to make the length perfect.
Scrolling past the point of currently-available content will show a blank page.
You could detect this using JavaScript and display a "loading" icon to let the user know what's going on.
Built-in "find" feature doesn't work
Hard to avoid this without the user downloading the entire document, but you could provide your own search feature for them to use instead (not as good but perhaps adequate).
Really though, you're probably best off with pagination using medium-sized pages. It's a very well understood design pattern that's relatively easy (compared to the other options, at least) to implement and make fast.
Hope that helps!