Fetching images faster from a web page - php

I'm looking for a plugin or some simple code that fetches images from a link FASTER. I have been using http://simplehtmldom.sourceforge.net/ to extract the first 3 images from a given link.
simplehtmldom is quite slow and many users on my site are reporting it as an issue.
Correct me if I'm wrong, but I believe this plugin takes a lot of time to fetch the complete HTML code from the URL I pass, and only then searches for img tags.
Can someone please suggest a technique to improve the speed of fetching the HTML code, or an alternative plugin that I could try?
What I'm thinking of is something like fetching the HTML code only until the first three img tags are found, and then killing the fetching process, so that things get faster.
I'm not sure if that's possible with PHP, although I'm trying hard to design it using jQuery.
Thanks for your help!

Cross-site scripting rules will prevent you from doing something like this in jQuery/JS (unless you control all the domains that you'll be grabbing content from). What you're doing is not going to be super fast in any case, but try writing your own using file_get_contents() paired with DOMDocument... the DOMDocument getElementsByTagName() method may be faster than simplehtmldom's find() method.
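Something along these lines, as a rough, untested sketch (error handling is omitted and $url stands for whatever link you are processing):
$html = file_get_contents($url);
$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML is rarely valid, so silence the warnings
$doc->loadHTML($html);
libxml_clear_errors();
$images = array();
foreach ($doc->getElementsByTagName('img') as $img) {
    $images[] = $img->getAttribute('src');
    if (count($images) >= 3) break; // you only need the first three
}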
You could also try a regex approach. It won't be as fool-proof as a true DOM parser, but it will probably be faster... Something like:
$html = file_get_contents($url);
preg_match_all("/<img[^']*?src=\"([^']*?)\"[^']*?>/", $html, $arr, PREG_PATTERN_ORDER);
If you want to avoid reading whole large files, you can also skip the file_get_contents() call and sub in an fopen() / while(!feof()) loop, and just check for images after each chunk is read from the remote server. If you take this approach, however, make sure you're regexing the WHOLE buffered string, not just the most recent chunk, as the code for an image could easily be broken across several reads.
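For example, a sketch only (no error handling, and the three-image cutoff comes from your question):
$images = array();
$buffer = '';
$arr = array(array(), array());
$fp = fopen($url, 'r');
while (!feof($fp)) {
    $buffer .= fread($fp, 4096);
    // always regex the WHOLE buffer, since an img tag may be split across reads
    if (preg_match_all('/<img[^>]*?src="([^"]*?)"[^>]*?>/i', $buffer, $arr) >= 3) {
        break; // we have three images, stop downloading the rest of the page
    }
}
fclose($fp);
$images = array_slice($arr[1], 0, 3);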
Keep in mind that real-life variability in HTML will make regex an imperfect solution at best, but if speed is a major concern it might be your best option.

Related

Taking information from one site and using it on my analyser/script made with PHP

I need to extract the "Toner Cartridges" levels from this site and "send" them to the one I'm working on. I'm guessing I can use GET or something similar, but I'm new to this so I don't know how it could be done.
The information then needs to be run through an if/else sequence which checks for 4 possible states: 100%-50%, 50%-25%, 25%-5%, 5%-0%.
I have the if/else written down, but I can't seem to find any good way of extracting the information from the index.php file.
EDIT: I just need someone to point me in the right direction.
To read the page you can use file_get_contents():
$page = file_get_contents("http://example.com");
But in order to make the function work with URLs, allow_url_fopen must be set to true in your PHP config file (php.ini).
Then you can use a regular expression to filter the text and get data.
The php function to perform a regular expression is preg_match
Example:
preg_match('/[^.]+\.[^.]+$/', $host, $matches);
echo "domain name is: {$matches[0]}\n";
will output
domain name is: php.net
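Applied to your toner question, it could look something like the sketch below - note that the URL and the "Toner ... %" pattern are only guesses, so you will have to adapt the regex to the actual HTML of the status page:
$page = file_get_contents("http://printer.example.com/index.php"); // placeholder URL
if (preg_match('/Toner[^0-9]*([0-9]{1,3})\s*%/i', $page, $m)) {
    $level = (int) $m[1];
    if ($level > 50) {
        $status = "100% - 50%";
    } elseif ($level > 25) {
        $status = "50% - 25%";
    } elseif ($level > 5) {
        $status = "25% - 5%";
    } else {
        $status = "5% - 0%";
    }
    echo "Toner level: $level% ($status)";
}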
I imagine you are reading from a Printer Status Page. In which case, to give yourself the flexibility to use sessions and login, I would look into cURL. The nice thing about cURL is that you can use the PHP library in your code, but you can also test things at the command line rather quickly.
After you are retrieving the HTML contents, I would look into using an XML parser, like SimpleXML or DOMDocument. Either one will get you to the information you need. SimpleXML is a little easier to use for people new to traversing XML (this is, at the same time, like and very not like jQuery).
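A rough sketch of the cURL + DOMDocument combination (again, the URL and the XPath query are placeholders to adapt to the real status page):
$ch = curl_init('http://printer.example.com/index.php'); // placeholder URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// placeholder query - point it at whatever element actually holds the toner level
foreach ($xpath->query('//td[contains(., "Toner")]') as $cell) {
    echo trim($cell->textContent) . "\n";
}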
Although, that said, you could also hack at the data just as quickly (if you are just now jumping in) with regular expressions (it is seriously quick once you get the hang of it).
Best of luck!

is it possible to make a proxy with file_get_contents() or cURL?

I've just been messing around with file_get_contents() at school and have noticed that it lets me open websites that are blacklisted at school.
Only a few issues:
No images load
Clicking a link on the website just takes me back to the original blocked page.
I think I know a way of fixing the linking issue, but I haven't really thought it through...
I could do a str_replace() on the content from file_get_contents() to replace any link with another file_get_contents() call on that link... right?
Would it make things easier if i used cURL instead?
Is what I'm trying to do even possible, or am I just wasting my valuable time?
I know this isn't a good way to go about something like this, but it's just a thought that's made me curious.
This is not a trivial task. It is possible, but you would need to parse the returned document(s) and replace everything that refers to external content so that they are also relayed through your proxy, and that is the hard part.
Keep in mind that you would need to be able to deal with (for a start, this is not a complete list):
Relative and absolute paths that may or may not fetch external content
Anchors, forms, images and any number of other HTML elements that can refer to external content, and may or may not explicitly specify the content they refer to.
CSS and JS code that refers to external content, including JS that modifies the DOM to create elements with click events that act as links, to name but one challenge.
This is a fairly mammoth task. Personally I would suggest that you don't bother - you probably are wasting your valuable time.
Especially since some nice people have already done the bulk of the work for you:
http://sourceforge.net/projects/php-proxy/
http://sourceforge.net/projects/knproxy/
;-)
Your "problem" comes from the fact that HTTP is a stateless protocol and different resources like css, js, images, etc have their own URL, so you need a request for each. If you want to do it yourself, and not use php-proxy or similar, it's "quite trivial": you have to clean up the html and normalize it with tidy to xml (xhtml), then process it with DOMDocument and XPath.
You could learn a lot of things from this - it's not overly complicated, but it involves a few interesting "technologies".
What you'll end up with is what is called a crawler or screen scraper.

Parsing HTML with PHP to get data for several articles of the same kind

I'm working on a web-site which parses coupon sites and lists those coupons. There are some sites which provide their listings as an XML file - no problem with those. But there are also some sites which do not provide XML. I'm thinking of parsing their sites and get the coupon information from the site content - grabbing that data from HTML with PHP. As an example, you can see the following site:
http://www.biglion.ru/moscow/
I'm working with PHP. So, my question is - is there a relatively easy way to parse HTML and get the data for each coupon listed on that site just like I get while parsing XML?
Thanks for the help.
You can always use a DOM parser, but scraping content from sites is unreliable at best.
If their layout changes ever so slightly, your app could fail. Oh, and in most cases it's also against the site's TOS to do so...
While using a DOM parser might seem a good idea, I usually prefer good old regular expressions for scraping. It's much less work, and if the site changes its layout you're screwed anyway, whatever your approach is. But if you use a smart enough regex, your code should be immune to changes that do not directly impact the part you're interested in.
One thing to remember is to include some class names in the regex when they're provided, but to assume anything can be in between the pieces of info you need. E.g.:
preg_match_all('#class="actionsItemHeadding".*?<a[^>]*href="([^"]*)"[^>]*>(.*?)</a>#s', file_get_contents('http://www.biglion.ru/moscow/'), $matches, PREG_SET_ORDER);
print_r($matches);
The most reliable method is the PHP Simple HTML DOM Parser, if you prefer working with PHP.
Here is an example of parsing only the anchor (a) elements:
// Include the library
include('simple_html_dom.php');
// Retrieve the DOM from a given URL
$html = file_get_html('http://mypage.com/');
// Find all "A" tags and print their HREFs
foreach($html->find('a') as $e)
echo $e->href . '<br>';
I am providing some more information about parsing the other html elements too.
I hope that will be useful to you.

scraping a page

What would be best practice in scraping a horrible mess of a distributor's inventory page (it uses JS to document.write a <td>, then plain HTML to close it)? No divs/tds/anything are labelled with any ids or classes, etc.
Should I just straight up preg_match(_all) the thing, or is there some XPath magic I can do?
There is no api, no feeds, no xml, nothing clean at all.
Edit: What I'm basically thinking of at the moment is something like http://pastebin.com/raw.php?i=EuMfRVD5 - is that my best bet, or is there another way?
Your example is not enough of an example. But since you seemingly don't need the highlighting meta info anyway, the JS-obfuscation could be undone with a bit of:
$html = preg_replace('# <script\b[^>]*> (?: (?:(?!</script>).)*? document\.write\("(.*?)"\) )? .*? </script> #six', "$1", $html);
Maybe that's already good enough to pipe it through one of the DOM libraries afterwards.
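For example, something like this (assuming the cleaned-up page is in $html):
$doc = new DOMDocument();
libxml_use_internal_errors(true); // the markup is a mess, so ignore the warnings
$doc->loadHTML($html);
libxml_clear_errors();

foreach ($doc->getElementsByTagName('td') as $td) {
    echo trim($td->textContent) . "\n";
}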
In general you should always use http://www.php.net/DOM to parse a page. Regex is horrible and usually downright impossible to use for parsing html, because that's not what it was built for.
However... if the page uses a lot of JavaScript to output stuff, you are kind of SoL regardless. The best you can really do to get a complete picture is to grab it, run it through a browser, and parse what is rendered. It is possible to automate that, though it's kind of a pita to set up.
But... given the issue with JS outputting a lot of it, maybe regex really would be the best route. I guess first and foremost it depends on what the actual content is and what it is you are trying to get from the page.

How to know if the website being scraped has changed?

I'm using PHP to scrape a website and collect some data. It's all done without using regex. I'm using PHP's explode() function to find particular HTML tags instead.
It is possible that if the structure of the website changes (CSS, HTML), then wrong data may be collected by the scraper. So the question is: how do I know if the HTML structure has changed? How can I identify this before storing any data to my database, to avoid wrong data being stored?
I don't think there are any clean solutions if you are scraping a page whose content can change.
I have developed several Python scrapers and I know how frustrating it can be when a site makes even a subtle change to its layout.
You could try a solution à la mechanize (I don't know the PHP counterpart) and, if you are lucky, you could isolate the content you need to extract (links?).
Another possible approach would be to code some constraints and check them before storing to the db.
For example, if you are scraping URLs, you will need to verify that what the scraper has parsed is formally a valid URL; the same goes for integer IDs or anything else you scrape that can be recognized as valid.
If you are scraping plain text, it will be more difficult to check.
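For the URL and integer-ID cases, a minimal sketch could look like this ($scrapedUrl and $scrapedId are just hypothetical variables holding whatever your scraper extracted):
// reject anything that doesn't look like what we expected to scrape
if (filter_var($scrapedUrl, FILTER_VALIDATE_URL) === false) {
    // the layout probably changed - don't store this row
}
if (filter_var($scrapedId, FILTER_VALIDATE_INT) === false) {
    // same here: the "ID" we extracted isn't a number any more
}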
It depends on the site, but you could count the number of page elements in the scraped page (div, class and style tags, for example) and then, by comparing these totals against those of later scrapes, detect whether the page structure has been changed.
A similar process could be used for the CSS file, where the name of each class or id could be extracted using a simple regex, stored and checked as needed. If this list has new additions then the page structure has almost certainly changed somewhere on the site being scraped.
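A sketch of the counting idea (the structure.json file name is just for illustration - store the totals wherever suits you):
$html = file_get_contents($url);
$counts = array(
    'div'   => substr_count($html, '<div'),
    'table' => substr_count($html, '<table'),
    'class' => substr_count($html, 'class='),
);
// compare against the totals saved from the previous scrape
$previous = json_decode(file_get_contents('structure.json'), true);
if ($counts !== $previous) {
    // the page structure has probably changed - flag it for review
}
file_put_contents('structure.json', json_encode($counts));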
Speaking out of my ass here, but it's possible you might want to look at some Document Object Model (DOM) PHP methods.
http://php.net/manual/en/book.dom.php
If my very, very limited understanding of DOM is correct, a change in HTML site structure would change the Document Object Model, but a simple content change within a fixed structure wouldn't. So, if you could capture the DOM state, and then compare it at each scrape, couldn't you in theory determine that such a change has been made?
(By the way, the way I did this when I was trying to get an email notification when the bar exam results were posted on a particular page was just compare file_get_contents() values. Surprisingly, worked flawlessly: No false positives, and emailed me as soon as the site posted the content.)
If you want to know about changes with respect to structure, I think the best way is to store the DOM structure of your first page and then compare it with the new one.
There are a lot of ways you can do it:
a SAX parser
a DOM parser, etc.
I have a small blog post which gives some pointers to what I mean:
http://let-them-c.blogspot.com/2009/04/xml-as-objects-in-oops.html
Or you can use http://en.wikipedia.org/wiki/Simple_API_for_XML or a DOM utility parser.
First, in some cases you may want to compare hashes of the original HTML to the new HTML. MD5 and SHA1 are two popular hashes. This may or may not be valid in all circumstances, but it is something you should be familiar with. It will tell you if something has changed - content, tags, or anything.
To understand whether the structure has changed, you would need to capture a histogram of the tag occurrences and then compare those. If you care about tags being out of order, then you would have to capture a tree of the tags and compare it to see whether the tags occur in the same order. This is going to be very specific to what you want to achieve.
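A minimal sketch of both ideas - hashing the whole page and comparing a tag histogram ($previousHash and $previousHistogram stand for values saved from your previous scrape):
$html = file_get_contents($url);

// 1) any change at all: compare a hash of the raw page
$changedAtAll = (md5($html) !== $previousHash);

// 2) structural change: compare a histogram of tag occurrences
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$histogram = array();
foreach ($doc->getElementsByTagName('*') as $node) {
    $name = $node->nodeName;
    $histogram[$name] = isset($histogram[$name]) ? $histogram[$name] + 1 : 1;
}
$structureChanged = ($histogram !== $previousHistogram);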
PHP Simple HTML DOM Parser is a tool which will help you parse the HTML.
Explode() is not an HTML parser, but you want to know about changes in the HTML structure. That's going to be tricky. Try using an HTML parser. Nothing else will be able to do this properly.
