Extract data from website via PHP - php

I am trying to create a simple alert app for some friends.
Basically I want to be able to extract the "price" and "stock availability" data from a webpage like the following two:
http://www.sparkfun.com/commerce/product_info.php?products_id=5
http://www.sparkfun.com/commerce/product_info.php?products_id=9279
I have made the e-mail and SMS alert part, but now I want to get the quantity and price out of the webpages (those two or any others) so that I can compare the price and the available quantity and alert us to place an order if a product is between some thresholds.
I have tried some regexes (found in tutorials, but I am way too much of a n00b for this) but haven't managed to get this working. Any good tips or examples?

$content = file_get_contents('http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
preg_match('#<tr><th>(.*)</th> <td><b>price</b></td></tr>#', $content, $match);
$price = $match[1];
preg_match('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content, $match);
$in_stock = $match[1];
echo "Price: $price - Availability: $in_stock\n";

It's called screen scraping, in case you need to google for it.
I would suggest that you use a dom parser and xpath expressions instead. Feed the HTML through HtmlTidy first, to ensure that it's valid markup.
For example:
$html = file_get_contents("http://www.example.com");
$html = tidy_repair_string($html);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query('//table[@class="pricing"]//th') as $node) {
echo $node->nodeValue, "\n";
}

Whatever you do: don't use regular expressions to parse HTML, or bad things will happen. Use a parser instead.
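A minimal sketch of that DOM and XPath approach, using only PHP's bundled DOM extension. The HTML below is a made-up stand-in for a product page: the "pricing" table class and the hidden quantity_on_hand field are borrowed from the snippets above, but the real page layout is an assumption, so adjust the XPath expressions to the actual markup.

```php
<?php
// Hypothetical markup standing in for a product page (not the real page source).
$html = <<<'HTML'
<html><body>
<table class="pricing">
<tr><th>price</th><td><b>$19.95</b></td></tr>
</table>
<form><input type="hidden" name="quantity_on_hand" value="42"></form>
</body></html>
HTML;

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// string(...) returns the text of the first matching node, or '' if nothing matches.
$priceText = $xpath->evaluate('string(//table[@class="pricing"]//td)');
$stockText = $xpath->evaluate('string(//input[@name="quantity_on_hand"]/@value)');

$price = (float) ltrim(trim($priceText), '$');
$stock = (int) $stockText;

echo "Price: $price - Availability: $stock\n";
```

With real pages, check that both strings are non-empty before casting, since an empty result usually means the layout has changed.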

First, this question goes into too much detail. Second, extracting data from a website might not be legitimate. However, I have some hints:
Use Firebug or the Chrome/Safari Inspector to explore the HTML content and find the pattern around the information you are interested in.
Test your regex to see if it matches. You may need to do this many times (multi-pass parsing/extraction).
Write a client with cURL or, much simpler, use file_get_contents (note that some hosts disable loading URLs with file_get_contents).
Personally, I would rather use Tidy to convert the page to valid XHTML and then use XPath to extract the data, instead of regex. Why? Because HTML is not a regular language, and XPath is very flexible. You can also learn XSLT to transform it.
Good luck!

You are probably best off loading the HTML code into a DOM parser like this one and searching for the "pricing" table. However, any kind of scraping you do can break whenever they change their page layout, and is probably illegal without their consent.
The best way, though, would be to talk to the people who run the site, and see whether they have alternative, more reliable forms of data delivery (Web services, RSS, or database exports come to mind).

The simplest method to extract data from a website. I found that all my data is contained within <h3> tags only, so I prepared this:
<?php
include('simple_html_dom.php');
// Create DOM from URL, paste your destined web url in $page
$page = 'http://facebook4free.com/category/facebookstatus/amazing-facebook-status/';
$html = new simple_html_dom();
//Within $html your webpage will be loaded for further operation
$html->load_file($page);
// Find all links
$links = array();
//Within find() function, I have written h3 so it will simply fetch the content from <h3> tag only. Change as per your requirement.
foreach($html->find('h3') as $element)
{
$links[] = $element;
}
reset($links);
//$out will be having each of HTML element content you searching for, within that web page
foreach ($links as $out)
{
echo $out;
}
?>

Related

Using PHP to extract specific data from websites

I am new to PHP and I am looking to extract data such as inventory quantity and sizes from different websites. I am kind of confused about how to go about this. Would DOMDocument be the way to go?
Not sure if that is the best method for this.
I was attempting from lines 164-174 on here.
Any help is greatly appreciated!
EDIT - this is my updated code. I don't really think it's the most efficient way to do things, though.
<html>
<body>
<?php
$url = 'https://kithnyc.com/collections/adidas/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455';
$html = file_get_contents($url);
//preg_match('~itemprop="image"\scontent="(\w+.\w+.\w+.\w+.\w+.\w+)~', $html, $image);
//$image = $image[1];
preg_match('~,"title":"(\w+.\w+.\w+.\w+.\w+.\w+)~', $html, $title);
$title = $title[1];
preg_match_all('~{"id":(\d+)~', $html, $id);
$id = $id[1];
preg_match_all('~","public_title":"(\d+..)~', $html, $size);
$size = $size[1];
preg_match_all('~inventory_quantity":(\d+)~', $html, $quantity);
$quantity = $quantity[1];
function plain_url_to_link($url) {
return preg_replace(
'%(https?|ftp)://([-A-Z0-9./_*?&;=#]+)%i',
'<a rel="nofollow" href="$0" target="_blank">$0</a>', $url);
}
$i = 0;
$j = 2;
echo "$title<br />";
echo "<br />";
//echo $image;
echo plain_url_to_link($url);
echo "<br />";
echo "<br />";
for($i = 0; $i < 18; $i++) {
print "Size: $size[$i] --- Quantity: $quantity[$i] --- ID: $id[$j]";
$j++;
echo "<br />";
}
echo "<br />";
//print_r($quantity);
?>
</body>
</html>
As a general rule of thumb, you must avoid parsing HTML/XML content with regular expressions. Here's why:
Entire HTML parsing is not possible with regular expressions, since it depends on matching the opening and the closing tag which is not possible with regexps.
Regular expressions can only match regular languages but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics but that will not work on every condition. It should be possible to present a HTML file that will be matched wrongly by any regular expression.
— https://stackoverflow.com/a/590789/65732
Use a DOM parser instead which is specifically designed for the purpose of parsing HTML/XML documents. Here's an example:
# Installing Symfony's dom parser using Composer
composer require symfony/dom-crawler symfony/css-selector
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');
$crawler = new Crawler($html);
$price = $crawler->filter('.product-header-title[itemprop="price"]')->text();
// UPDATE: Does not work! as the page updates the button text
// later with javascript. Read more for another solution.
$in_stock = $crawler->filter('#AddToCartText')->text();
if ($in_stock == 'Sold Out') {
$in_stock = 0; // or `false`, if you will
}
echo "Price: $price - Availability: $in_stock";
// Outputs:
// Price: $220.00 - Availability: Buy Now
// We'll fix "Availability" later...
Using such parsers, you have the ability to extract elements using XPath as well.
But if you want to parse the javascript code included in that page, you'd better use a browser emulator like Selenium. Then you have programmatic access to all the globally available javascript vars/functions in that page.
Update
Getting the price
So you were getting this error running the above code:
PHP Fatal error:
Uncaught Symfony\Component\CssSelector\Exception\SyntaxErrorException: Expected identifier, but found.
That's because the target page uses an invalid class name for the price element (.-price), and Symfony's CSS selector component cannot parse it correctly, hence the exception. Here's the element:
<span id="ProductPrice" class="product-header-title -price" itemprop="price" content="220">$220.00</span>
To work around it, let's use the itemprop attribute instead. Here's the selector that can match it:
.product-header-title[itemprop="price"]
I updated the above code accordingly to reflect it. I tested it and it's working for the price part.
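If you would rather sidestep the CSS selector parser entirely, the same element can also be matched with XPath through PHP's bundled DOM classes. This is only a sketch run against the single span quoted above, not against the live page:

```php
<?php
// The price element as quoted above; the rest of the page is omitted.
$html = '<span id="ProductPrice" class="product-header-title -price" itemprop="price" content="220">$220.00</span>';

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match on the itemprop attribute, so the odd "-price" class name never matters.
$node = $xpath->query('//*[@itemprop="price"]')->item(0);

$priceText  = $node !== null ? trim($node->textContent) : null;    // displayed text
$priceValue = $node !== null ? $node->getAttribute('content') : null; // machine-readable value

echo "Text: $priceText - content attribute: $priceValue\n";
```

The content attribute is handy if you want a number to compare against a threshold, since it carries no currency symbol.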
Getting the stock status
Now that I actually tested the code, I see that the stock status of products is set later using javascript. It's not there when you fetch the page using file_get_contents(). You can see it for yourself, refresh the page, the button appears as Buy Now, then a second later it changes to Sold Out.
But fortunately, the quantity of the product variant is buried deep somewhere in the page. Here's a pretty printed copy of the huge object Shopify uses to render the product pages.
So now the problem is parsing javascript code with PHP. There are a few general approaches to tackle the problem:
Feel free to skip these approaches as they are not specific to your problem. Jump straight to number 6, if you just want a solution to your question.
1. The most reliable and common approach to scraping data from such sites (that rely heavily on javascript) is to use a browser emulator like Selenium, which can execute javascript code. Have a look at Facebook's PHP WebDriver package, which is the most sophisticated PHP binding for Selenium WebDriver available. It provides you with an API to remotely control web browsers and execute javascript against them.
Also, see Behat's Mink, which comes with various drivers for both headless browsers and full-fledged browser controllers. The drivers include Goutte, BrowserKit, Selenium1/2, Zombie.js, Sahi and WUnit.
2. See V8js, the PHP extension that embeds the V8 javascript engine into PHP. It allows you to evaluate javascript code right from your PHP script. It's a bit overkill to install a PHP extension if you're not using the feature heavily, but if you want to extract the relevant script using the DOM parser:
$script = $crawler->filterXPath('//head/following-sibling::script[2]')->text();
3. Use HtmlUnit to parse the page and then feed the final HTML to PHP. You're going to need a small Java wrapper. Right, overkill for your case.
4. Extract the javascript code and parse it using a JS parser/tokenizer library like hiltonjanfield/js4php5 or squizlabs/PHP_CodeSniffer, which has a JS tokenizer.
5. In case the application is making ajax calls to manipulate the DOM, you might be able to re-dispatch those requests and parse the responses for your own application's sake. An example is the ajax call the page makes to cart.js to retrieve the data related to the cart items. But that's not the case for reading the product variant quantity here.
6. You may recall that I told you it's a bad idea to use regular expressions to parse entire HTML/XML documents. But it's OK to use them selectively to extract strings from an HTML/XML document when other approaches are even harder. Read the SO answer I quoted at the top of this post if you are unsure about when to use them.
This approach is about matching the inventory_quantity of the product variant by running a simple regex against the whole page source (or, for better performance, against only the relevant script tag):
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');
$crawler = new Crawler($html);
$price = trim($crawler->filter('.product-header-title[itemprop="price"]')->text());
preg_match('/35276776455,.+?inventory_quantity":(\d+)/', $html, $in_stock); // \d+ so that multi-digit quantities are captured in full
$in_stock = $in_stock[1];
echo "Price: $price - Availability: $in_stock";
// Outputs:
// Price: $220.00 - Availability: 0
This regex needs a variant ID (35276776455 in this case) to work, as the quantity of each product comes with a variant. You can extract it from the URL's query string: ?variant=35276776455.
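Here is a rough sketch of pulling the variant ID out of the URL and plugging it into the pattern instead of hard-coding it. The $source string is a made-up, simplified excerpt (the real Shopify object has many more fields), and the pattern assumes the quantity sits in the same flat {...} object as the id:

```php
<?php
$url = 'https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455';

// Pull "variant" out of the query string.
parse_str((string) parse_url($url, PHP_URL_QUERY), $params);
$variantId = $params['variant'] ?? null;

// Hypothetical, simplified excerpt of the embedded product data.
$source = '{"id":35276776455,"title":"9","inventory_quantity":0},'
        . '{"id":35276776519,"title":"10","inventory_quantity":12}';

$inStock = null;
if (is_string($variantId) && ctype_digit($variantId)) {
    // [^{}]*? keeps the match inside the object that starts with this id.
    $pattern = '/"id":' . $variantId . ',[^{}]*?"inventory_quantity":(\d+)/';
    if (preg_match($pattern, $source, $m)) {
        $inStock = (int) $m[1];
    }
}

echo "Variant: $variantId - Availability: ", var_export($inStock, true), "\n";
```

Checking the ID with ctype_digit before building the pattern avoids injecting arbitrary URL content into the regex.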
Now that we're done with the stock status and we've done it with regex, you might want to do the same with the price and drop the DOM parser dependency:
<?php
$html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');
// You need to check if it's matched before assigning
// $price[1]. Anyway, this is just an example.
preg_match('/itemprop="price".+?>\s*\$(.+?)\s*<\/span>/s', $html, $price);
$price = $price[1];
preg_match('/35276776455,.+?inventory_quantity":(\d+)/', $html, $in_stock);
$in_stock = $in_stock[1];
echo "Price: $price - Availability: $in_stock";
// Outputs (the price regex captures the amount without the leading "$"):
// Price: 220.00 - Availability: 0
Conclusion
Even though I still believe that it's a bad idea to parse HTML/XML documents with regex, I must admit that the available DOM parsers cannot parse embedded javascript code (and probably never will), which is your case. We can use regular expressions selectively to extract strings from HTML/XML: the parts that DOM parsers cannot parse. So, all in all:
Use DOM parsers to parse/scrape the HTML code that initially exists in the page.
Intercept ajax calls that may include information you want. Re-call them in a separate http request to get the data.
Use browser emulators for parsing/scraping JS-heavy sites that populate their pages using ajax calls and such.
Partially use regex to extract what is not extractable using DOM parsers.
If you just want these two fields, you're fine to go with regex. Otherwise, consider other approaches.

PHP simplehtmldom read only viewable text

I have the following html format
<p>This is viewable <span style="display:none">This is not viewable</span></p>
I want to use php simplehtmldom to extract only the "This is viewable" part.
Is there any way to do it directly?
Sure you can, just remove that text:
$str = '<p>This is viewable <span style="display:none">This is not viewable</span></p>';
$html = str_get_html($str);
foreach($html->find('[style*=display:none]') as $el){
$el->innertext = '';
}
echo $html->find('p', 0)->text();
// This is viewable
No, SimpleHTMLDOM is merely a DOM parser, it does not process the attributes in any meaningful way, let alone process inline styles. To properly do what you intend to achieve, it would also need to be able to process extended inline styles, like style="anyother:'attribute';display:none" and alternative ways of hiding content, like visibility:hidden and opacity:0, or brilliant stuff like -webkit-transform:rotateY(90deg).
In a nutshell, there is no remotely easy way to achieve the intended result.

Simple PHP script for downloading results

I need to download results from a website using a for loop to compile them.
(Note that it's an ASP request which displays a webpage with these parameters)
I wrote the following code to get me this:
<?php
for ($i=10; $i<500; $i++) {
$m = $i*10;
$dl = $query;
$text = file_get_contents($dl);
$doc = new DOMDocument('1.0');
$doc->loadHTML($text);
$aObj = $doc->find('Academic');
if (count($aObj) > 0)
{
echo "<h4>Found</h4>";
//Don't download this
}
else
{
echo "<h4>Not found</h4>";
//Download this
}
}
?>
But it returns several errors. Apparently it can't copy the ASPX file to the HTML DOM. How do I go about doing this? Also, how can I download/save the pages where the string 'Download' is not found?
I also think my method of finding 'Download' in the document is not working. What is the correct way to do this?
The website you're attempting to parse contains a lot of errors, so you won't be able to use the standard DOMDocument object. You can try a library such as SimpleHTMLDOM (http://simplehtmldom.sourceforge.net/) or phpQuery (https://code.google.com/p/phpquery/) and hope that those are good enough to parse the malformed document.
In case you just need some information it might be easier to use regular expressions and preg_match_all (http://www.php.net/manual/en/function.preg-match-all.php) to find every occurrence of 'Academic' for example.
Note that it is usually not advisable to use regular expressions when working with structured documents such as HTML, since you won't be able to take advantage of the structure. But since those documents seem to contain over 300 errors and differ from each other, it might be the only way.
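A small sketch of that fallback: count the occurrences with preg_match_all and save only the pages where the string is absent. The $pages array is a hypothetical stand-in for the downloaded responses (in the real script each body would come from file_get_contents), and the file names are made up:

```php
<?php
// Hypothetical page bodies keyed by the loop counter from the question.
$pages = [
    10 => '<html><body><h1>Academic results</h1></body></html>',
    11 => '<html><body><h1>Other results</h1></body></html>',
];

$saved = [];
foreach ($pages as $id => $text) {
    // preg_match_all returns the number of matches found.
    if (preg_match_all('/Academic/', $text) > 0) {
        echo "<h4>Found</h4>\n";      // don't download this one
        continue;
    }
    echo "<h4>Not found</h4>\n";
    // Save the raw page; the temp directory is only used for illustration.
    $path = sys_get_temp_dir() . "/result_$id.html";
    file_put_contents($path, $text);
    $saved[] = $id;
}
```

For a plain substring like this, strpos($text, 'Academic') !== false would work just as well; the regex only pays off once the pattern gets more complicated.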

Blog display code, keeping other content in post

Alright, I have some code that finds a <code></code> tag set and cleans up any code inside it so that it is displayed instead of being interpreted as markup. Everything works, but my problem is: how can I find the tag set (or multiple tag sets) inside, say, $content, clean the code, and still keep ALL of the other content? Here is my code. It checks for matches and cleans each one it finds, but after cleaning there is no way to put the result back into its original position in $content. ($content is grabbed from a form.)
<?php
preg_match_all("'<code>(.*?)</code>'si", $html, $match);
if ($match) {
foreach ($match[1] as $snippet) {
$fixedCode = htmlspecialchars($snippet, ENT_QUOTES);
}
}
?>
What do I do with $fixedCode, now that it is clean?
Using regex for parsing HTML is bad. I'd suggest getting familiar with a DOM parser, such as PHP's DOM module.
The DOM extension allows you to operate on XML documents through the DOM API with PHP 5.
Using the DOM module, in order to get the HTML/data from <code> tags in the document, you'd want to do something like this:
<?php
//So many variables!
$html = "<div> Testing some <code>code</code></div><div>Nother div, nother <code>Code</code> tag</div>";
$dom_doc = new DOMDocument;
$dom_doc->loadHTML($html);
$code = $dom_doc->getElementsByTagName('code');
foreach ($code as $scrap) {
echo htmlspecialchars($scrap->nodeValue, ENT_QUOTES), "<br />";
}
?>
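As for putting the cleaned code back where it came from: if you do stay with the pattern from the question, one option is preg_replace_callback, which swaps each match for the callback's return value and leaves everything else in the string untouched. A minimal sketch, with a made-up $content string:

```php
<?php
// Made-up sample input; in the question this comes from a form.
$content = 'Intro <code><b>bold</b></code> outro <code>a && b</code>.';

$cleaned = preg_replace_callback(
    '~<code>(.*?)</code>~si',
    function (array $m): string {
        // $m[1] is the text between the tags; escape it and re-wrap it.
        return '<code>' . htmlspecialchars($m[1], ENT_QUOTES) . '</code>';
    },
    $content
);

echo $cleaned, "\n";
// Intro <code>&lt;b&gt;bold&lt;/b&gt;</code> outro <code>a &amp;&amp; b</code>.
```

This will still trip over nested <code> tags, which is where the DOM approach above is the safer choice.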

using preg_match_all to get name of image

After using curl on an external page, I've got all the source code, which includes something like this (the part I'm interested in):
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
So I'm using preg_match_all, and I want to get only "buy_tickets.gif".
$pattern_before = "<td valign='top' class='rdBot' align='center'>";
$pattern_after = "</td>";
$pattern = '#'.$pattern_before.'(.*?)'.$pattern_after.'#si';
preg_match_all($pattern, $buffer, $matches, PREG_SET_ORDER);
Everything is fine up to now, but the problem is that sometimes the external page changes and the image I'm looking for is inside a link:
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
and I don't know how to make my code always work (not just when the image has no link).
I hope you understand.
Thanks in advance.
Don't use regex to parse HTML, Use PHP's DOM Extension. Try this:
$doc = new DOMDocument;
@$doc->loadHTMLFile( 'http://ventas.entradasmonumental.com/eventperformances.asp?evt=18' ); // Using the @ operator to hide parse errors
$xpath = new DOMXPath( $doc );
$img = $xpath->query( '//td[@class="BrdBot"][@align="center"][1]//img[1]')->item( 0 ); // Xpath->query returns a 'DOMNodeList', get the first item which is a 'DOMElement' (or null)
$imgSrc = $img->getAttribute( 'src' );
$imgSrcInfo = pathInfo( $imgSrc );
$imgFilename = $imgSrcInfo['basename']; // All you need
You're going to get lots of advice not to use regex for pulling stuff out of HTML code.
There are times when it's appropriate to use regex for this kind of thing, and I don't always agree with the somewhat rigid advice given on the subject here (and elsewhere). However in this case, I would say that regex is not the appropriate solution for you.
The problem with using regex for searching for things in HTML code is exactly the problem you've encountered -- HTML code can vary wildly, making any regex virtually impossible to get right.
It is just about possible to write a regex for your situation, but it will be an insanely complex regex, and very brittle -- ie prone to failing if the HTML code is even slightly outside the parameters you expect.
Contrast this with the recommended solution, which is to use a DOM parser. Load the HTML code into a DOM parser, and you will immediately have an object structure which you can query for individual elements and attributes.
The details you've given make it almost a no-brainer to go with this rather than a regex.
PHP has a built-in DOM parser, which you can call as follows:
$mydom = new DOMDocument;
$mydom->loadHTMLFile("http://....");
You can then use XPath to search the DOM for your specific element or attribute that you want:
$myxpath = new DOMXPath($mydom);
$myattr = $myxpath->query("//td[@class='rdBot']//img[1]/@src"); // XPath positions start at 1; class names are case-sensitive
Hope that helps.
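To show why the DOM route copes with the "sometimes inside a link" case, here is a small self-contained sketch. The two HTML strings are simplified stand-ins for the external page, the href="#" is invented, and image_name_in_cell is a hypothetical helper name:

```php
<?php
// Hypothetical helper: file name of the first image inside a td.rdBot cell, or null.
function image_name_in_cell(string $html): ?string
{
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // "//img" matches at any depth inside the cell, so a wrapping <a> makes no difference.
    $src = $xpath->evaluate('string((//td[@class="rdBot"]//img)[1]/@src)');

    return $src === '' ? null : basename($src);
}

$cell   = '<td valign="top" class="rdBot" align="center">%s</td>';
$img    = '<img src="/images/buy_tickets.gif" border="0" alt="T">';
$plain  = '<table><tr>' . sprintf($cell, $img) . '</tr></table>';
$linked = '<table><tr>' . sprintf($cell, '<a href="#">' . $img . '</a>') . '</tr></table>';

echo image_name_in_cell($plain), "\n";   // buy_tickets.gif
echo image_name_in_cell($linked), "\n";  // buy_tickets.gif
```

The same XPath expression handles both variants, which is exactly what the regex from the question could not do.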
function GetFilename($file) {
$filename = substr($file, strrpos($file,'/')+1,strlen($file)-strrpos($file,'/'));
return $filename;
}
echo GetFilename('/images/buy_tickets.gif');
This will output buy_tickets.gif
Do you only need images inside of the "td" tags?
$regex='/<img src="\/images\/([^"]*)"[^>]*>/im';
edit:
to grab the specific image this should work:
$regex='/<td valign=\'top\' class=\'rdBot\' align=\'center\'>.*src="\/images\/([^"]*)".*<\/td>/';
Parsing HTML with Regex is not recommended, as has been mentioned by several posters.
However, if the path of your images always follows the pattern src="/images/name.gif", you can easily extract it in Regex:
$pattern = <<<EOD
#src\s*=\s*['"]/images/(.*?)["']#
EOD;
If you are sure that the images always follow the path "/images/name.ext" and that you don't care where the image link is located in the page, this will do the job. If you have more detailed requirements (such matching only within a specific class), forget Regex, it's not the right tool for the job.
I just read in your comments that you need to match within a specific tag. Use a parser, it will save you untold headaches.
If you still want to go through regex, try this:
\(?<=<td .*?class\s*=\s*['"]rdBot['"][^<>]*?>.*?)(?<!</td>.*)<img [^<>]*src\s*=\s*["']/images/(.*?)["']\i
This should work. It does work in C#, I am not totally sure about php's brand of regex.
