Simple html dom parser table - php

I'm using Simple HTML DOM to parse data into my own PHP script. I need to get the text inside a single td out of the several in the table on the website I'm parsing. Specifically, I need the first USD td.
The result must be
$ 0.0137
Source php:
<?php
include('../simple_html_dom.php');
$html = file_get_html('https://rub.currencyrate.today/');
foreach($html->find('table') as $e){
foreach($e->find('td',0) as $f){
echo strip_tags($f->innertext) . '<br>';
}
}
?>
This code displays result
₽ 1 $ 0.0137 € 0.0115 £ 0.00988 ¥ 0.0884 Ƀ 0.00000040
I've tried several ways to get that, but I've failed with every one of them. Can someone give me a hand?

You're looking for the second <td> in the first <table>.
Therefore there is no need to iterate (foreach) over all tables, and iterating over the first <td> is even wrong (if you check the error log, it will show you that already).
Let's take the first table and its second table-data cell; the numbers in find() are zero-based:
$dollar = $html->find('table', 0)->find('td', 1)->innertext();
For your output, take care to encode it properly as HTML. strip_tags is not of much use there; you just want the HTML special characters properly encoded with htmlspecialchars (something strip_tags is not even capable of):
echo htmlspecialchars($dollar, ENT_QUOTES | ENT_HTML5), '<br>';
$ 0.0137
A few further notes:
run with simplehtmldom 2.0-RC2: the version you use might have bugs. I could not fully reproduce your output with that version (but the traversal was wrong anyway)
you should allow yourself the "luxury" to be able to see errors more prominently on your development box.
take care encoding HTML output properly.
the closing ?> php tag is not necessary at the end of file, leave it out before it causes problems.
last but not least, if you allow me the remark: simplehtmldom is really old. At some point you may want to consider the DOMDocument class, which comes from the dom PHP extension, and use it together with the other xml PHP extensions (simplexml, xmlreader etc.).
Example in full:
<?php declare(strict_types=1);
include __DIR__ . '/../simple_html_dom.php';
$html = file_get_html('https://rub.currencyrate.today/');
$dollar = $html->find('table', 0)->find('td', 1)->innertext();
echo htmlspecialchars($dollar, ENT_QUOTES | ENT_HTML5), '<br>';
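For comparison, here is a minimal sketch of the same lookup with the built-in DOMDocument and DOMXPath classes mentioned above. The $sample markup is a hypothetical stand-in for the fetched page (the real page structure may differ), and note that XPath positions are one-based, unlike find():

```php
<?php
declare(strict_types=1);

// Hypothetical sample markup standing in for the fetched page.
$sample = '<table><tr><td>RUB 1</td><td>$ 0.0137</td><td>EUR 0.0115</td></tr></table>';

$doc = new DOMDocument();
// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// First <table> in the document, then every <td> that is the second cell of its row.
$nodes = $xpath->query('(//table)[1]//td[2]');

// item(0) is the first such cell in document order; null when nothing matched.
$dollar = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;

if ($dollar !== null) {
    echo htmlspecialchars($dollar, ENT_QUOTES | ENT_HTML5), '<br>';
}
```

If the real page contains non-ASCII currency symbols, the input encoding has to be declared to the parser as well; that part is left out of this sketch.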


Is it possible to write PHP in jade/pug?

Is it possible? If so, how?
If it's not, do I have to abandon pug if I need to write PHP in my documents?
After searching around I didn't find anyone who has addressed this.
You can embed PHP in Pug templates the same way you would any literal plain text that you want passed through relatively unmolested[*]. There are a number of options covered in the docs, but I think these are most likely the best options for embedding PHP:
After an element, it will just work. For example, p Good morning, <?php echo $user->name ?>.
On a single line by itself. Since any line beginning with "<" is passed as plain text, any one line PHP statement (e.g., <?php echo $foo; ?>) will just work.
Multi-line PHP is the one case where it gets a bit complicated. If you're ok with wrapping it in an HTML element, you can use Pug's block text syntax: Put a dot after the element, then include your plain text indented underneath.
p.
<?php
if ($multiline) {
echo 'Foo!';
}
?>
If you need it outside an element, the other option is to prefix every line with a pipe:
|<?php
|if ($multiline) {
| echo 'Foo!';
|}
|?>
(Technically, the first line doesn't need to be prefixed due to point 2 above, but if using this method I would prefix it anyway just for consistency.)
To use PHP in attributes, you just need to prevent escaping by prefixing the equals sign with a bang: p(class!="<?php echo $foo ?>"). (Interestingly, support for unescaped attribute values was added specifically for this use case.)
Of course by default .pug files are compiled to .html files, so if you're using them to generate PHP, you'll want to change the extension. One easy way to do this is to process them using gulp with the gulp-pug and gulp-rename plugins, which would look something like this:
var gulp = require('gulp'),
pug = require('gulp-pug'),
rename = require('gulp-rename');
gulp.task('default', function () {
return gulp.src('./*.pug')
.pipe(pug())
.pipe(rename({
extname: '.php'
}))
.pipe(gulp.dest('.'));
});
I haven't worked extensively with Pug, so I don't know if there are any potential gotchas that would come up in real world use cases, but the simple examples above all work as expected.
[*] Pug still performs variable interpolation on plain text, but it uses the #{variable} format, which should not conflict with anything in PHP's standard syntax.
Since PHP doesn't care whether the "outside" code is HTML or really anything specific, you could simply use PHP as you normally would and have it output Pug-formatted code instead of HTML. For instance:
myPugTemplate.pug.php
html
head
title "<?= $this->title ?>"
body
<?php
// Since we're outputting Pug markup, we have to take care of
// preserving indentation.
$indent=str_repeat(' ', 2);
if ($this->foo) {
echo $indent . 'bar= myPost';
} else {
echo $indent . 'baz= myNav';
}
?>
footer
+footerContent
And if your Pug is processed on the server, then you'd also include a Pug-processing step. For instance, if you use Apache, you could use mod_ext_filter configured in the following fashion, with pug-cli installed:
ExtFilterDefine pug-to-html mode=output intype=text/pug outtype=text/html \
cmd="pug"
<Location />
SetOutputFilter pug-to-html
</Location>
Have you checked out the pug-php project? I personally have no experience with this particular module, but it seems to do just what you're trying to accomplish: Being able to use PHP in Pug.
You can use the unescaped interpolation syntax with quotes:
!{'<?php #php code ?>'}
For example:
p Hello !{'<?php echo "My name"; ?>'}
Will render:
<p>Hello <?php echo "My name"; ?></p>
You can test it here: https://pug-demo.herokuapp.com/
There is a well-known and well-maintained Pug processor written natively in PHP. You can use it to process your Pug files into HTML, just like the original Pug, with the advantage that it allows you to embed and use PHP code in your Pug file with ease. If you're working with PHP inside Pug, check it out:
Phug - the Pug template engine for PHP
p Hello !{'<?php echo "My name"; ?>'}
works, but
link(href="../assets/css/style.css?v=!{'<?=$AppsVersion?>'}" rel="stylesheet" type="text/css")
doesn't work.

Using PHP to extract specific data from websites

I am new to PHP and I am looking to extract data like inventory quantity and sizes from different websites. I am kind of confused about how to go about doing this. Would DOMDocument be the way to go?
I'm not sure if that is the best method for this.
I was attempting it in lines 164-174 on here.
Any help is greatly appreciated!
EDIT - this is my updated code. I don't really think it's the most efficient way to do things, though.
<html>
<?php
$url = 'https://kithnyc.com/collections/adidas/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455';
$html = file_get_contents($url);
//preg_match('~itemprop="image"\scontent="(\w+.\w+.\w+.\w+.\w+.\w+)~', $html, $image);
//$image = $image[1];
preg_match('~,"title":"(\w+.\w+.\w+.\w+.\w+.\w+)~', $html, $title);
$title = $title[1];
preg_match_all('~{"id":(\d+)~', $html, $id);
$id = $id[1];
preg_match_all('~","public_title":"(\d+..)~', $html, $size);
$size = $size[1];
preg_match_all('~inventory_quantity":(\d+)~', $html, $quantity);
$quantity = $quantity[1];
function plain_url_to_link($url) {
return preg_replace(
'%(https?|ftp)://([-A-Z0-9./_*?&;=#]+)%i',
'<a target="blank" rel="nofollow" href="$0" target="_blank">$0</a>', $url);
}
$i = 0;
$j = 2;
echo "$title<br />";
echo "<br />";
//echo $image;
echo plain_url_to_link($url);
echo "<br />";
echo "<br />";
for($i = 0; $i < 18; $i++) {
print "Size: $size[$i] --- Quantity: $quantity[$i] --- ID: $id[$j]";
$j++;
echo "<br />";
}
echo "<br />";
//print_r($quantity);
?>
</body>
</html>
As a general rule of thumb, you must avoid parsing HTML/XML content with regular expressions. Here's why:
Entire HTML parsing is not possible with regular expressions, since it depends on matching the opening and the closing tag which is not possible with regexps.
Regular expressions can only match regular languages but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics but that will not work on every condition. It should be possible to present a HTML file that will be matched wrongly by any regular expression.
— https://stackoverflow.com/a/590789/65732
Use a DOM parser instead which is specifically designed for the purpose of parsing HTML/XML documents. Here's an example:
# Installing Symfony's dom parser using Composer
composer require symfony/dom-crawler symfony/css-selector
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');
$crawler = new Crawler($html);
$price = $crawler->filter('.product-header-title[itemprop="price"]')->text();
// UPDATE: Does not work! as the page updates the button text
// later with javascript. Read more for another solution.
$in_stock = $crawler->filter('#AddToCartText')->text();
if ($in_stock == 'Sold Out') {
$in_stock = 0; // or `false`, if you will
}
echo "Price: $price - Availability: $in_stock";
// Outputs:
// Price: $220.00 - Availability: Buy Now
// We'll fix "Availability" later...
Using such parsers, you have the ability to extract elements using XPath as well.
But if you want to parse the javascript code included in that page, you'd better use a browser emulator like Selenium. Then you have programmatic access to all the globally available javascript vars/functions in that page.
Update
Getting the price
So you were getting this error running the above code:
PHP Fatal error:
Uncaught Symfony\Component\CssSelector\Exception\SyntaxErrorException: Expected identifier, but found.
That's because the target page uses an invalid class name for the price element (.-price) and Symfony's CSS selector component cannot parse it correctly, hence the exception. Here's the element:
<span id="ProductPrice" class="product-header-title -price" itemprop="price" content="220">$220.00</span>
To work around it, let's use the itemprop attribute instead. Here's the selector that can match it:
.product-header-title[itemprop="price"]
I updated the above code accordingly to reflect it. I tested it and it's working for the price part.
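As a rough sketch of the XPath route mentioned earlier, the same element can be matched by its itemprop attribute with PHP's built-in DOMXPath, which sidesteps the CSS class name entirely. The markup below is the price element quoted above; everything else about the page is left out, so treat this as an illustration rather than a drop-in replacement:

```php
<?php
declare(strict_types=1);

// The price element as quoted above, used here as a standalone fragment.
$html = '<span id="ProductPrice" class="product-header-title -price" itemprop="price" content="220">$220.00</span>';

$doc = new DOMDocument();
// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Match any element carrying itemprop="price"; the odd class name is never parsed.
$node = $xpath->query('//*[@itemprop="price"]')->item(0);

// Visible text of the element, and the machine-readable "content" attribute.
$price  = $node !== null ? trim($node->textContent) : null;
$amount = $node instanceof DOMElement ? $node->getAttribute('content') : null;

echo "Price: $price (content attribute: $amount)\n";
```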
Getting the stock status
Now that I actually tested the code, I see that the stock status of products is set later using javascript. It's not there when you fetch the page using file_get_contents(). You can see it for yourself, refresh the page, the button appears as Buy Now, then a second later it changes to Sold Out.
But fortunately, the quantity of the product variant is buried deep somewhere in the page. Here's a pretty printed copy of the huge object Shopify uses to render the product pages.
So now the problem is parsing javascript code with PHP. There are a few general approaches to tackle the problem:
Feel free to skip these approaches as they are not specific to your problem. Jump straight to number 6, if you just want a solution to your question.
1. The most reliable and common approach to scraping data from sites that rely heavily on javascript is to use a browser emulator like Selenium, which is able to execute javascript code. Have a look at Facebook's PHP WebDriver package, which is the most sophisticated PHP binding for Selenium WebDriver available. It provides you with an API to remotely control web browsers and execute javascript against them.
Also, see Behat's Mink, which comes with various drivers for both headless browsers and full-fledged browser controllers. The drivers include Goutte, BrowserKit, Selenium1/2, Zombie.js, Sahi and WUnit.
2. See V8js, the PHP extension which embeds the V8 javascript engine into PHP. It allows you to evaluate javascript code right from your PHP script. But it's a little bit overkill to install a PHP extension if you're not heavily using the feature. If you do go this way, you can extract the relevant script using the DOM parser:
$script = $crawler->filterXPath('//head/following-sibling::script[2]')->text();
3. Use HtmlUnit to parse the page and then feed the final HTML to PHP. You're going to need a small Java wrapper. Right, overkill for your case.
4. Extract the javascript code and parse it using a JS parser/tokenizer library like hiltonjanfield/js4php5 or squizlabs/PHP_CodeSniffer, which has a JS tokenizer.
5. In case the application is making ajax calls to manipulate the DOM, you might be able to re-dispatch those requests and parse the response for your own application's sake. An example is the ajax call the page is making to cart.js to retrieve the data related to the cart items. But that's not the case for reading the product variant quantity here.
6. You may recall that I told you that it's a bad idea to use regular expressions to parse entire HTML/XML documents. But it's OK to use them partially to extract strings from an HTML/XML document when other approaches are even harder. Read the SO answer I quoted at the top of this post if you are unsure about when to use them.
This approach is about matching the inventory_quantity of the product variant by running a simple regex against the whole page source (or, for better performance, you can run it only against the relevant script tag):
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');
$crawler = new Crawler($html);
$price = trim($crawler->filter('.product-header-title[itemprop="price"]')->text());
// (\d+) captures the whole number, not just its first digit
preg_match('/35276776455,.+?inventory_quantity":(\d+)/', $html, $in_stock);
$in_stock = $in_stock[1];
echo "Price: $price - Availability: $in_stock";
// Outputs:
// Price: $220.00 - Availability: 0
This regex needs a variant ID (35276776455 in this case) to work, as the quantity of each product comes with a variant. You can extract it from the URL's query string: ?variant=35276776455.
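A minimal sketch of pulling the variant ID out of the query string with parse_url() and parse_str(), then building the pattern from it instead of hard-coding the number. The variable names are illustrative, and the pattern here uses (\d+) so that multi-digit quantities are captured in full:

```php
<?php
declare(strict_types=1);

$url = 'https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455';

// parse_url() returns the query part ("variant=35276776455"), or null if there is none.
$query  = parse_url($url, PHP_URL_QUERY);
$params = [];
if (is_string($query)) {
    parse_str($query, $params);
}

// Accept the ID only if it is a plain string of digits.
$variantId = $params['variant'] ?? null;
$pattern   = null;
if (is_string($variantId) && ctype_digit($variantId)) {
    // preg_quote() keeps the ID from being interpreted as regex syntax.
    $pattern = '/' . preg_quote($variantId, '/') . ',.+?inventory_quantity":(\d+)/';
}

echo $pattern === null ? "No variant ID in URL\n" : "Pattern: $pattern\n";
```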
Now that we're done with the stock status and we've done it with regex, you might want to do the same with the price and drop the DOM parser dependency:
<?php
$html = file_get_contents('https://kithnyc.com/collections/footwear/products/kith-x-adidas-consortium-response-trail-boost?variant=35276776455');
// You need to check if it's matched before assigning
// $price[1]. Anyway, this is just an example.
preg_match('/itemprop="price".+?>\s*\$(.+?)\s*<\/span>/s', $html, $price);
$price = $price[1];
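// (\d+) captures the whole number, not just its first digit
preg_match('/35276776455,.+?inventory_quantity":(\d+)/', $html, $in_stock);
$in_stock = $in_stock[1];
echo "Price: $price - Availability: $in_stock";
// Outputs (the price regex captures the amount without the dollar sign):
// Price: 220.00 - Availability: 0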
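The comment in the code above notes that each match should be checked before its capture group is used. One way to do that is sketched below with a small helper, firstCapture(), which is a hypothetical name and not part of any library. The $html fragment is also made up for illustration: the span is the price element quoted earlier, and the JSON snippet only imitates the shape the regex expects.

```php
<?php
declare(strict_types=1);

/**
 * Returns the first capture group of $pattern in $subject,
 * or null when the pattern does not match.
 */
function firstCapture(string $pattern, string $subject): ?string
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[1] : null;
}

// Hypothetical page fragment (assumption: the real page is much larger).
$html = '<span id="ProductPrice" class="product-header-title -price" itemprop="price" content="220">$220.00</span>'
      . '<script>{"id":35276776455,"public_title":"9","inventory_quantity":12}</script>';

$price   = firstCapture('/itemprop="price".+?>\s*\$(.+?)\s*<\/span>/s', $html);
$inStock = firstCapture('/35276776455,.+?inventory_quantity":(\d+)/', $html);

if ($price === null || $inStock === null) {
    // Fail loudly instead of reading an undefined array offset.
    echo "Could not find the expected fields; the page layout may have changed.\n";
} else {
    echo "Price: $price - Availability: $inStock\n";
}
```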
Conclusion
Even though I still believe that it's a bad idea to parse HTML/XML documents with regex, I must admit that the available DOM parsers are not able to parse embedded javascript code (and probably never will be), which is your case. We can use regular expressions partially, to extract strings from the parts of an HTML/XML document that are not parsable using DOM parsers. So, all in all:
Use DOM parsers to parse/scrape the HTML code that initially exists in the page.
Intercept ajax calls that may include information you want. Re-call them in a separate http request to get the data.
Use browser emulators for parsing/scraping JS-heavy sites that populate their pages using ajax calls and such.
Partially use regex to extract what is not extractable using DOM parsers.
If you just want these two fields, you're fine to go with regex. Otherwise, consider other approaches.

Storing HTML in MySQL

I'm storing HTML and text data in my database table in its raw form - however I am having a slight problem in getting it to output correctly. Here is some sample data stored in the table AS IS:
<p>Professional Freelance PHP & MySQL developer based in Manchester.
<br />Providing an unbeatable service at a competitive price.</p>
To output this data I do:
echo $row['details'];
And this outputs the data correctly, however when I do a W3C validator check it says:
character "&" is the first character of a delimiter but occurred as data
So I tried using htmlentities and htmlspecialchars, but this just causes the HTML tags to be output on the page.
What is the correct way of doing this?
Use &amp; instead of &.
What you want to do is use the php function htmlentities()...
It will convert your input into html entities, and then when it is outputted it will be interpreted as HTML and outputted as the result of that HTML...For example:
$mything = "<b>BOLD & BOLD</b>";
//normally would throw an error if not converted...
//lets convert!!
$mynewthing = htmlentities($mything);
Now, just insert $mynewthing to your database!!
htmlentities is basically a superset of htmlspecialchars, and htmlspecialchars also replaces < and >.
Actually, what you are trying to do is to fix invalid HTML code, and I think this needs an ad-hoc solution:
$row['details'] = preg_replace("/&(?![#0-9a-z]+;)/i", "&amp;", $row['details']);
This is not a perfect solution, since it will fail for strings like: someone&son; (with a trailing ;), but at least it won't break existing HTML entities.
However, if you have decision power over how the data is stored, please enforce that the HTML code stored in the database is correct.
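To illustrate what the negative lookahead does, here is a small sketch that wraps the replacement in a function (escapeBareAmpersands() is a hypothetical name) and runs it on made-up sample data containing one bare ampersand and two existing entities:

```php
<?php
declare(strict_types=1);

/**
 * Escapes ampersands that do not already start an entity such as
 * &amp; or &#169;. Tags and existing entities are left untouched.
 */
function escapeBareAmpersands(string $html): string
{
    // The lookahead skips any "&" followed by letters, digits or "#" and then ";".
    return preg_replace('/&(?![#0-9a-z]+;)/i', '&amp;', $html) ?? $html;
}

// Hypothetical stored value: one bare "&" plus two existing entities.
$details = '<p>PHP & MySQL developer &amp; consultant &#169; 2010</p>';

$fixed = escapeBareAmpersands($details);
echo $fixed, "\n";
```

Only the bare ampersand between "PHP" and "MySQL" is rewritten; the tags still render as HTML, which is what htmlspecialchars on the whole string would not allow.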
In my projects I use an XSLT parser, so I had to change &nbsp; to &#160; (for example). But this is the safest way I have found...
here is my code
$html = trim(addslashes(htmlspecialchars(
html_entity_decode($_POST['html'], ENT_QUOTES, 'UTF-8'),
ENT_QUOTES, 'UTF-8'
)));
And when you read from DB, don't forget to use stripslashes();
$html = stripslashes($mysq_row['html']);

PHP> Extracting html data from an html file?

What I've been trying to do recently is to extract listing information from a given html file,
For example, I have an html page that has a list of many companies, with their phone number, address, etc.
Each company is in its own table, and every table starts like this: <table border="0">
I tried to use PHP to get all of the information, and use it later, like put it in a txt file, or just import into a database.
I assume that the way to achieve my goal is by using regex, which is one of the things that I really have problems with in php,
I would appreciate if you guys could help me here.
(I only need to know what to look for, or at least something that could help me a little, not complete code or anything like that)
Thanks in advance!!
I recommend taking a look at the PHP DOMDocument and parsing the file using an actual HTML parser, not regex.
There are some very straight-forward ways of getting tables, such as the GetElementsByTagName method.
<?php
$htmlCode = '...'; // your html code goes here
// create a new HTML parser
// http://php.net/manual/en/class.domdocument.php
$dom = new DOMDocument();
// Load the HTML in to the parser
// http://www.php.net/manual/en/domdocument.loadhtml.php
$dom->LoadHTML($htmlCode);
// Locate all the tables within the document
// http://www.php.net/manual/en/domdocument.getelementsbytagname.php
$tables = $dom->GetElementsByTagName('table');
// iterate over all the tables
$t = 0;
while ($table = $tables->item($t++))
{
// you can now work with $table and find children within, check for
// specific classes applied--look for anything that would flag this
// as the type of table you'd like to parse and work with--then begin
// grabbing information from within it and treating it as a DOMElement
// http://www.php.net/manual/en/class.domelement.php
}
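Building on the loop above, here is a minimal sketch that collects the cell text of every table with border="0" into an array, ready to be written to a file or a database. The sample markup and the company data are invented for illustration; the real page will have its own layout:

```php
<?php
declare(strict_types=1);

// Hypothetical listing markup: one table per company.
$htmlCode = '<table border="0"><tr><td>Acme Ltd</td><td>555-0100</td></tr></table>'
          . '<table border="0"><tr><td>Globex</td><td>555-0199</td></tr></table>'
          . '<table border="1"><tr><td>Not a company table</td></tr></table>';

$dom = new DOMDocument();
// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom->loadHTML($htmlCode);
libxml_clear_errors();

$companies = [];
foreach ($dom->getElementsByTagName('table') as $table) {
    // Only keep the tables that start like <table border="0">.
    if ($table->getAttribute('border') !== '0') {
        continue;
    }
    $cells = [];
    foreach ($table->getElementsByTagName('td') as $td) {
        $cells[] = trim($td->textContent);
    }
    $companies[] = $cells;
}

print_r($companies);
```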
If you're familiar with jQuery (and even if you're not, as its commands are simple enough), I recommend this PHP counterpart: http://code.google.com/p/phpquery/
If your HTML is valid XML, as in XHTML, then you could parse it using SimpleXML

using preg_match_all to get name of image

After using curl on an external page, I've got all of its source code, including something like this (the part I'm interested in):
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
So I'm using preg_match_all, and I want to get only "buy_tickets.gif":
$pattern_before = "<td valign='top' class='rdBot' align='center'>";
$pattern_after = "</td>";
$pattern = '#'.$pattern_before.'(.*?)'.$pattern_after.'#si';
preg_match_all($pattern, $buffer, $matches, PREG_SET_ORDER);
Everything is fine up to now, but the problem is that the external page sometimes changes and the image I'm looking for ends up inside a link:
(page...)<td valign='top' class='rdBot' align='center'><img src="/images/buy_tickets.gif" border="0" alt="T"></td> (page...)
and I don't know how to get my code to always work (not just when the image has no link)
hope u understand
thanks in advance
Don't use regex to parse HTML, Use PHP's DOM Extension. Try this:
$doc = new DOMDocument;
@$doc->loadHTMLFile( 'http://ventas.entradasmonumental.com/eventperformances.asp?evt=18' ); // Using the @ operator to hide parse errors
$xpath = new DOMXPath( $doc );
$img = $xpath->query( '//td[@class="BrdBot"][@align="center"][1]//img[1]')->item( 0 ); // Xpath->query returns a 'DOMNodeList', get the first item which is a 'DOMElement' (or null)
$imgSrc = $img->getAttribute( 'src' );
$imgSrcInfo = pathInfo( $imgSrc );
$imgFilename = $imgSrcInfo['basename']; // All you need
You're going to get lots of advice not to use regex for pulling stuff out of HTML code.
There are times when it's appropriate to use regex for this kind of thing, and I don't always agree with the somewhat rigid advice given on the subject here (and elsewhere). However in this case, I would say that regex is not the appropriate solution for you.
The problem with using regex for searching for things in HTML code is exactly the problem you've encountered -- HTML code can vary wildly, making any regex virtually impossible to get right.
It is just about possible to write a regex for your situation, but it will be an insanely complex regex, and very brittle -- ie prone to failing if the HTML code is even slightly outside the parameters you expect.
Contrast this with the recommended solution, which is to use a DOM parser. Load the HTML code into a DOM parser, and you will immediately have an object structure which you can query for individual elements and attributes.
The details you've given make it almost a no-brainer to go with this rather than a regex.
PHP has a built-in DOM parser, which you can call as follows:
$mydom = new DOMDocument;
$mydom->loadHTMLFile("http://....");
You can then use XPath to search the DOM for your specific element or attribute that you want:
$myxpath = new DOMXPath($mydom);
$myattr = $myxpath->query('//td[@class="rdBot"]//img[1]/@src');
Hope that helps.
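A self-contained sketch of that approach, showing that the same query works whether or not the image is wrapped in a link. imageFilename() is a hypothetical helper name, the table wrapper and the href="#" link are assumptions made for the example, and the class value is matched exactly as it appears in the question's markup (XPath comparisons are case-sensitive):

```php
<?php
declare(strict_types=1);

/**
 * Returns the file name of the first image inside a td with class "rdBot",
 * or null when no such image exists.
 */
function imageFilename(string $html): ?string
{
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // "//img" matches at any depth, so a wrapping <a> makes no difference.
    $src = $xpath->evaluate('string((//td[@class="rdBot"]//img)[1]/@src)');

    return $src === '' ? null : basename($src);
}

$cell   = "<td valign='top' class='rdBot' align='center'>%s</td>";
$img    = '<img src="/images/buy_tickets.gif" border="0" alt="T">';
$plain  = '<table><tr>' . sprintf($cell, $img) . '</tr></table>';
$linked = '<table><tr>' . sprintf($cell, '<a href="#">' . $img . '</a>') . '</tr></table>';

echo imageFilename($plain), "\n";
echo imageFilename($linked), "\n";
```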
function GetFilename($file) {
$filename = substr($file, strrpos($file,'/')+1,strlen($file)-strrpos($file,'/'));
return $filename;
}
echo GetFilename('/images/buy_tickets.gif');
This will output buy_tickets.gif
Do you only need images inside of the "td" tags?
$regex='/<img src="\/images\/([^"]*)"[^>]*>/im';
edit:
to grab the specific image this should work:
$regex='/<td valign=\'top\' class=\'rdBot\' align=\'center\'>.*src="\/images\/([^"]*)".*<\/td>/';
Parsing HTML with Regex is not recommended, as has been mentioned by several posters.
However, if the path of your images always follows the pattern src="/images/name.gif", you can easily extract it in Regex:
$pattern = <<<EOD
#src\s*=\s*['"]/images/(.*?)["']#
EOD;
If you are sure that the images always follow the path "/images/name.ext" and that you don't care where the image link is located in the page, this will do the job. If you have more detailed requirements (such matching only within a specific class), forget Regex, it's not the right tool for the job.
I just read in your comments that you need to match within a specific tag. Use a parser, it will save you untold headaches.
If you still want to go through regex, try this:
\(?<=<td .*?class\s*=\s*['"]rdBot['"][^<>]*?>.*?)(?<!</td>.*)<img [^<>]*src\s*=\s*["']/images/(.*?)["']\i
This should work. It does work in C#, I am not totally sure about php's brand of regex.
