PHP web scraping

PHP web scraping - php

I use php web scraping, and I want to get the price (3.65) on Sunday form the html code below:
<tr class="odd">
<td >
<b>Sunday</b> Info
<div class="test">test</div>
</td>
<td>
€ 3.65 *
</td>
</tr>
But I don't find the best regex to do this...
I use this php code:
<?php
$data = file_get_contents('http://www.test.com/');
preg_match('/<tr class="odd"><td ><b>Sunday</b> Info<div class="test">test<\/div><\/td><td>€ (.*) *<\/td><\/tr>/i', $data, $matches);
$result = $matches[1];
?>
But no result... What's wrong in the regex? (I think it's because of the new lines/spaces?)

Don't use regular expressions, HTML is not regular.
Instead, use a DOM Tree parser like DOMDocument. This documentation may help you.
The /s switch should help you with your original regex though I haven't tried it.

The problems are the spaces between the tags.
there a line breaks, tabs and/or spaces.
your regex doesn't match to them.
you also need to setup your preg_match for multiline!
i think it is more easy to use xpath for scraping.

Try to replace newlines with '' and then perform the regexp again.

Try in this way:
$uri = ('http://www.test.com/');
$get = file_get_contents($uri);
$pos1 = strpos($get, "<tr class=\"odd\"><td ><b>Sunday</b> Info<div class=\"test\">test</div></td><td>€");
$pos2 = strpos($get, "*</td></tr>", $pos1);
$text = substr($get,$pos1,$pos2-$pos1);
$text1 = strip_tags($text);

Using PHP DOMDocument Object. We're going to parse the HTML DOM data from the web page
$dom = new DOMDocument();
$dom->loadHTML($data);
$trs = $dom->getElementsByTagName('tr'); // this gives us all the tr elements on the webpage
// loop through all the tr tags
foreach($trs as $tr) {
// until we get one with the class 'odd' and has a b tag value of SUNDAY
if ($tr->getAttribute('class') == 'odd' && $tr->getElementsByTagName('b')->item(0)->nodeValue == 'Sunday') {
// now set the price to the node value of the second td tag
$price = trim($tr->getElementsByTagName('td')->item(1)->nodeValue);
break;
}
}
Instead of using DOMDocument for web scraping, it's a bit tedious, you can get your hands on SimpleHtmlDomParser, it's open source.

Related

regex to match a specific HTML string with any number of spaces inside it

I have code with several lines like this
<p> <inset></p>
Where there may be any number of spaces or tabs (or none) between the opening <p> tag and the rest if the string. I need to replace these, but I can't get it to work.
I thought this would do it, but it doesn't work:
<p>[ \t]+<inset></p>

Try this:
$html = preg_replace('#(<p>)\s+(<inset></p>)#', '$1$2', $html);

If you want true text-trimming for HTML including everything you can encounter like those entitites, comments, child-elements and all that stuff, you can make use of a TextRangeTrimmer and TextRange:
$htmlFragment = '<p> <inset></p>';
$dom = new DOMDocument();
$dom->loadHTML($htmlFragment);
$parent = $dom->getElementsByTagName('body')->item(0);
if (!$parent)
{
throw new Exception('Parent element not found.');
}
$range = new TextRange($parent);
$trimmer = new TextRangeTrimmer($range);
$trimmer->ltrim();
// inner HTML (PHP >= 5.3.6)
foreach($parent->childNodes as $node)
{
echo $dom->saveHTML($node);
}
Output:
<p><inset></p>
I've both classes in a gist: https://gist.github.com/1894360/ (codepad viper is down).
See as well the related questions / answers:
Wordwrap / Cut Text in HTML string
Ignore html tags in preg_replace

Try to load your HTML string into a DOM tree instead, and then trim all the text values in the tree.
http://php.net/domdocument.loadhtml
http://php.net/trim

Regex / DOMDocument - match and replace text not in a link

I need to find and replace all text matches in a case insensitive way, unless the text is within an anchor tag - for example:
<p>Match this text and replace it</p>
<p>Don't match this text</p>
<p>We still need to match this text and replace it</p>
Searching for 'match this text' would only replace the first instance and last instance.
[Edit] As per Gordon's comment, it may be preferred to use DOMDocument in this instance. I'm not at all familiar with the DOMDocument extension, and would really appreciate some basic examples for this functionality.

Here is an UTF-8 safe solution, which not only works with properly formatted documents, but also with document fragments.
The mb_convert_encoding is needed, because loadHtml() seems to has a bug with UTF-8 encoding (see here and here).
The mb_substr is trimming the body tag from the output, this way you get back your original content without any additional markup.
<?php
$html = '<p>Match this text and replace it</p>
<p>Don\'t match this text</p>
<p>We still need to match this text and replace itŐŰ</p>
<p>This is a link <span>with <strong>don\'t match this text</strong> content</span></p>';
$dom = new DOMDocument();
// loadXml needs properly formatted documents, so it's better to use loadHtml, but it needs a hack to properly handle UTF-8 encoding
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xpath = new DOMXPath($dom);
foreach($xpath->query('//text()[not(ancestor::a)]') as $node)
{
$replaced = str_ireplace('match this text', 'MATCH', $node->wholeText);
$newNode = $dom->createDocumentFragment();
$newNode->appendXML($replaced);
$node->parentNode->replaceChild($newNode, $node);
}
// get only the body tag with its contents, then trim the body tag itself to get only the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
References:
1. find and replace keywords by hyperlinks in an html fragment, via php dom
2. Regex / DOMDocument - match and replace text not in a link
3. php problem with russian language
4. Why Does DOM Change Encoding?
I read dozens of answers in the subject, so I am sorry if I forgot somebody (please comment it and I will add yours as well in this case).
Thanks for Gordon and stillstanding for commenting on my other answer.

Try this one:
$dom = new DOMDocument;
$dom->loadHTML($html_content);
function preg_replace_dom($regex, $replacement, DOMNode $dom, array $excludeParents = array()) {
if (!empty($dom->childNodes)) {
foreach ($dom->childNodes as $node) {
if ($node instanceof DOMText &&
!in_array($node->parentNode->nodeName, $excludeParents))
{
$node->nodeValue = preg_replace($regex, $replacement, $node->nodeValue);
}
else
{
preg_replace_dom($regex, $replacement, $node, $excludeParents);
}
}
}
}
preg_replace_dom('/match this text/i', 'IT WORKS', $dom->documentElement, array('a'));

This is the stackless non-recursive approach using pre-order traversal of the DOM tree.
libxml_use_internal_errors(TRUE);
$dom=new DOMDocument('1.0','UTF-8');
$dom->substituteEntities=FALSE;
$dom->recover=TRUE;
$dom->strictErrorChecking=FALSE;
$dom->loadHTMLFile($file);
$root=$dom->documentElement;
$node=$root;
$flag=FALSE;
for (;;) {
if (!$flag) {
if ($node->nodeType==XML_TEXT_NODE &&
$node->parentNode->tagName!='a') {
$node->nodeValue=preg_replace(
'/match this text/is',
$replacement, $node->nodeValue
);
}
if ($node->firstChild) {
$node=$node->firstChild;
continue;
}
}
if ($node->isSameNode($root)) break;
if ($flag=$node->nextSibling)
$node=$node->nextSibling;
else
$node=$node->parentNode;
}
echo $dom->saveHTML();
libxml_use_internal_errors(TRUE); and the 3 lines of code after $dom=new DOMDocument; should be able to handle any malformed HTML.

$a='<p>Match this text and replace it</p>
<p>Don\'t match this text</p>
<p>We still need to match this text and replace it</p>';
echo preg_replace('~match this text(?![^<]*</a>)~i','replacement',$a);
The negative lookahead ensures the replacement happens only if the next tag is not a closing link . It works fine with your example, though it won't work if you happen to use other tags inside your links.

You can use PHP Simple HTML DOM Parser. It is similar to DOMDocument, but in my opinion it's simpler to use.
Here is the alternative in parallel with Netcoder's DomDocument solution:
function replaceWithSimpleHtmlDom($html_content, $search, $replace, $excludedParents = array()) {
require_once('simple_html_dom.php');
$html = str_get_html($html_content);
foreach ($html->find('text') as $element) {
if (!in_array($element->parent()->tag, $excludedParents))
$element->innertext = str_ireplace($search, $replace, $element->innertext);
}
return (string)$html;
}
I have just profiled this code against my DomDocument solution (witch prints the exact same output), and the DomDocument is (not surprisingly) way faster (~4ms against ~77ms).

<?php
$a = '<p>Match this text and replace it</p>
<p>Don\'t match this text</p>
<p>We still need to match this text and replace it</p>
';
$res = preg_replace("#[^<a.*>]match this text#",'replacement',$a);
echo $res;
?>
This way works. Hope you want realy case sensitive, so match with small letter.

HTML parsing with regexs is a huge challenge, and they can very easily end up getting too complex and taking up loads of memory. I would say the best way is to do this:
preg_replace('/match this text/i','replacement text');
preg_replace('/(<a[^>]*>[^(<\/a)]*)replacement text(.*?<\/a)/is',"$1match this text$3");
If your replacement text is something which might occur otherwise, you might want to add an intermediate step with some unique identifier.

Unable to use regex to search in PHP?

I'm trying to get the code of a html document in specific tags.
My method works for some tags, but not all, and it not work for the tag's content I want to get.
Here is my code:
<html>
<head></head>
<body>
<?php
$url = "http://sf.backpage.com/MusicInstruction/";
$data = file_get_contents($url);
$pattern = "/<div class=\"cat\">(.*)<\/div>/";
preg_match_all($pattern, $data, $adsLinks, PREG_SET_ORDER);
var_dump($adsLinks);
foreach ($adsLinks as $i) {
echo "<div class='ads'>".$i[0]."</div>";
}
?>
</body>
</html>
The above code doesn't work, but it works when I change the $pattern into:
$pattern = "/<div class=\"date\">(.*)<\/div>/";
or
$pattern = "/<div class=\"sponsorBoxPlusImages\">(.*)<\/div>/";
I can't see any different between these $pattern. Please help me find the error.
Thanks.

Use PHP DOM to parse HTML instead of regex.
For example in your case (code updated to show HTML):
$doc = new DOMDocument();
#$doc->loadHTML(file_get_contents("http://sf.backpage.com/MusicInstruction/"));
$nodes = $doc->getElementsByTagName('div');
for ($i = 0; $i < $nodes->length; $i ++)
{
$x = $nodes->item($i);
if($x->getAttribute('class') == 'cat');
echo htmlspecialchars($x->nodeValue) . "<hr/>"; //this is the element that you want
}

The reason your regex fails is that you are expecting . to match newlines, and it won't unless you use the s modifier, so try
$pattern = "/<div class=\"cat\">(.*)<\/div>/s";
When you do this, you might find the pattern a little too greedy as it will try to capture everything up to the last closing div element. To make it non-greedy, and just match up the very next closing div, add a ? after the *
$pattern = "/<div class=\"cat\">(.*?)<\/div>/s";
This just serves to illustrate that for all but the simplest cases, parsing HTML with regexes is the road to madness. So try using DOM functions for parsing HTML.

Using regex to remove HTML tags

I need to convert
$text = 'We had <i>fun</i>. Look at this photo of Joe';
[Edit] There could be multiple links in the text.
to
$text = 'We had fun. Look at this photo (http://example.com) of Joe';
All HTML tags are to be removed and the href value from <a> tags needs to be added like above.
What would be an efficient way to solve this with regex? Any code snippet would be great.

First do a preg_replace to keep the link. You could use:
preg_replace('(.*?)', '$\2 ($\1)', $str);
Then use strip_tags which will finish off the rest of the tags.

try an xml parser to replace any tag with it's inner html and the a tags with its href attribute.
http://www.php.net/manual/en/book.domxml.php

The DOM solution:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//a[#href]') as $node) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
echo strip_tags($dom->saveHTML());
and the same without XPath:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $node) {
if($node->hasAttribute('href')) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
}
echo strip_tags($dom->saveHTML());
All it does is load any HTML into a DomDocument instance. In the first case it uses an XPath expression, which is kinda like SQL for XML, and gets all links with an href attribute. It then creates a text node element from the innerHTML and the href attribute and replaces the link. The second version just uses the DOM API and no Xpath.
Yes, it's a few lines more than Regex but this is clean and easy to understand and it won't give you any headaches when you need to add additional logic.

I've done things like this using variations of substring and replace. I'd probably use regex today but you wanted an alternative so:
For the <i> tags, I'd do something like:
$text = replace($text, "<i>", "");
$text = replace($text, "</i>", "");
(My php is really rusty, so replace may not be the right function name -- but the idea is what I'm sharing.)
The <a> tag is a bit more tricky. But, it can be done. You need to find the point that <a starts and that the > ends with. Then you extract the entire length and replace the closing </a>
That might go something like:
$start = strrpos( $text, "<a" );
$end = strrpos( $text, "</a>", $start );
$text = substr( $text, $start, $end );
$text = replace($text, "</a>", "");
(I don't know if this will work, again the idea is what I want to communicate. I hope the code fragments help but they probably don't work "out of the box". There are also a lot of possible bugs in the code snippets depending on your exact implementation and environment)
Reference:
strrpos - http://www.php.net/manual/en/function.strrpos.php
replace - http://www.php.net/manual/en/function.str-replace.php
substr - http://php.net/manual/en/function.substr.php

It's also very easy to do with a parser:
# available from http://simplehtmldom.sourceforge.net
include('simple_html_dom.php');
# parse and echo
$html = str_get_html('We had <i>fun</i>. Look at this photo of Joe');
$a = $html->find('a');
$a[0]->outertext = "{$a[0]->innertext} ( {$a[0]->href} )";
echo strip_tags($html);
And that produces the code you want in your test case.

Matching everything between html <body> tags using PHP

I have a script that returns the following in a variable called $content
<body>
<p><span class=\"c-sc\">dgdfgdf</span></p>
</body>
I however need to place everything between the body tag inside an array called matches
I do the following to match the stuff between the body tag
preg_match('/<body>(.*)<\/body>/',$content,$matches);
but the $mathces array is empty, how could I get it to return everything inside the body tag

Don't try to process html with regular expressions! Use PHP's builtin parser instead:
$dom = new DOMDocument;
$dom->loadHTML($string);
$bodies = $dom->getElementsByTagName('body');
assert($bodies->length === 1);
$body = $bodies->item(0);
for ($i = 0; $i < $body->children->length; $i++) {
$body->remove($body->children->item($i));
}
$string = $dom->saveHTML();

You should not use regular expressions to parse HTML.
Your particular problem in this case is you need to add the DOTALL modifier so that the dot matches newlines.
preg_match('/<body>(.*)<\/body>/s', $content, $matches);
But seriously, use an HTML parser instead. There are so many ways that the above regular expression can break.

If for some reason you don't have DOMDocument installed, try this
Step 1. Download simple_html_dom
Step 2. Read the documentation about how to use its selectors
require_once("simple_html_dom.php");
$doc = new simple_html_dom();
$doc->load($someHtmlString);
$body = $doc->find("body")->innertext;

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

PHP web scraping - php

Don't use regular expressions, HTML is not regular. Instead, use a DOM Tree parser like DOMDocument. This documentation may help you. The /s switch should help you with your original regex though I haven't tried it.

The problems are the spaces between the tags. there a line breaks, tabs and/or spaces. your regex doesn't match to them. you also need to setup your preg_match for multiline! i think it is more easy to use xpath for scraping.

Try to replace newlines with '' and then perform the regexp again.

Related

regex to match a specific HTML string with any number of spaces inside it

Regex / DOMDocument - match and replace text not in a link

Unable to use regex to search in PHP?

Using regex to remove HTML tags

Matching everything between html <body> tags using PHP

Categories

Resources