Screen scraping with cURL and Regex


Consider a document in the following format:
<!DOCTYPE html>
<html>
<head>
<title></title>
<body>
<div class="blog_post_item first">
<?php // some child elements ?>
</div><!-- end blog_post_item -->
</body>
</html>
I am loading a document like this from one domain to another with PHP cURL. I would like to trim my cURL result to only include div.blog_post_item.first and its children. I know the structure of the other page, yet I can't edit it. I imagine I can use preg_match to find the opening and closing tags; they will always look the same, including that ending comment.
I have searched for examples/tutorials of screen scraping with cURL/XPath/XSLT/whatever, and it's mostly a cyclical rattling off of names of HTML parsing libraries. For that reason, please provide a simple working example. Please do not simply explain that parsing HTML with regex is a potential security vulnerability. Please do not just list libraries and specifications that I should read further into.
I have some simple PHP cURL code:
$ch = curl_init("http://a.web.page.com");
curl_setopt($ch, CURLOPT_HEADER, 0);
$output = curl_exec($ch);
curl_close($ch);
Of course, now $output contains the entire source. How will I get just the contents of that element?

That's quite easy if you are sure the beginning and end are ALWAYS the same. All you have to do is search for the beginning and the end and match everything in between. I think a lot of people will be pissed at me for using regex to find a bit of HTML, but it'll do the job!
// cURL
$ch = curl_init("http://a.web.page.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
curl_close($ch);
if(empty($output)) exit('Couldn\'t download the page');
// finding your data
$pattern = '/<div class="blog_post_item first">(.*?)<\/div><!-- end blog_post_item -->/';
preg_match_all($pattern, $output, $matches);
var_dump($matches); // all matches
Because I don't know which website you're trying to crawl I'm not sure if this works or not.
After searching for quite a while (26 minutes to be exact) I found out why it didn't work: by default the dot (.) doesn't match newlines, and because HTML is full of newlines the pattern couldn't match the contents. Using a slightly dirty hack I managed to get it matching anyway (even though you already picked an answer).
// cURL
$ch = curl_init('http://blogg.oscarclothilde.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
curl_close($ch);
if(empty($output)) exit('Couldn\'t download the page');
// finding your data
$pattern = '/<div class="blog_post_item first">(([^.]|.)*?)<\/div><!-- end blog_post_item -->/';
preg_match_all($pattern, $output, $matches);
var_dump($matches[1][0]); // contents of the first match
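A less hacky way to let the dot cross line breaks is the PCRE "s" (dot-all) modifier. The following is a minimal sketch, not the answerer's code: the sample markup is made up to stand in for the cURL response, and preg_match() is used instead of preg_match_all() on the assumption that only the first block is wanted.

```php
<?php
// Sample markup standing in for the cURL response; in practice use $output from curl_exec().
$output = <<<'HTML'
<body>
<div class="blog_post_item first">
<p>First post</p>
</div><!-- end blog_post_item -->
</body>
HTML;

// The trailing "s" makes "." match newlines too; "~" as delimiter avoids escaping "/".
$pattern = '~<div class="blog_post_item first">(.*?)</div><!-- end blog_post_item -->~s';

$inner = null;
if (preg_match($pattern, $output, $m) === 1) {
    // $m[1] holds everything between the opening tag and the closing marker.
    $inner = trim($m[1]);
}
echo $inner, "\n";
```

The lazy quantifier (.*?) stops at the first closing marker, so this still relies on the marker not appearing inside the block.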

If you are sure about the following structure:
<div class="blog_post_item first">
WHATEVER
</div><!-- end blog_post_item -->
AND you are sure the ending-code doesn't appear in WHATEVER, then you can simply grab it.
(Note please that I replaced your original PHP with WHATEVER. CURL will only fetch the HTML, and it will contain content, not PHP.)
You don't need a regex. You can also do it simply by searching for the wanted strings, like in my example below.
$curlResponse = '
<!DOCTYPE html>
<html>
<head>
<title></title>
<body>
<div class="blog_post_item first">
<?php // some child elements ?>
</div><!-- end blog_post_item -->
</body>
</html>';
$startStr = '<div class="blog_post_item first">';
$endStr = '</div><!-- end blog_post_item -->';
$startStrPos = strpos($curlResponse, $startStr)+strlen($startStr);
$endStrPos = strpos($curlResponse, $endStr);
$wanted = substr($curlResponse, $startStrPos, $endStrPos-$startStrPos );
echo htmlentities($wanted);
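The snippet above assumes both markers are present. If either is missing, strpos() returns false and substr() silently returns the wrong slice. A hedged variant that checks for this might look as follows; extractBetween() is a hypothetical helper name, not part of PHP.

```php
<?php
// Hypothetical helper: returns the text between two markers, or null if either is missing.
function extractBetween(string $haystack, string $startStr, string $endStr): ?string
{
    $start = strpos($haystack, $startStr);
    if ($start === false) {
        return null;
    }
    $start += strlen($startStr);

    // Search for the end marker only after the start marker.
    $end = strpos($haystack, $endStr, $start);
    if ($end === false) {
        return null;
    }
    return substr($haystack, $start, $end - $start);
}

$page = "<body>\n<div class=\"blog_post_item first\">\nWHATEVER\n</div><!-- end blog_post_item -->\n</body>";

$wanted  = extractBetween($page, '<div class="blog_post_item first">', '</div><!-- end blog_post_item -->');
$missing = extractBetween($page, '<div class="nope">', '</div>');

echo htmlentities(trim((string) $wanted)), "\n";
```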

This piece of code should work (PHP >= 5.3.6 with the dom extension):
$s = <<<EOM
<!DOCTYPE html>
<html>
<head>
<title></title>
<body>
<div class="blog_post_item first">
<?php // some child elements ?>
</div><!-- end blog_post_item -->
</body>
</html>
EOM;
$d = new DOMDocument;
$d->loadHTML($s);
$x = new DOMXPath($d);
foreach ($x->query('//div[contains(@class, "blog_post_item") and contains(@class, "first")]') as $el) {
echo $d->saveHTML($el);
}
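One caveat with contains(@class, "first") is that it also matches class names that merely contain that text, such as "first_draft". A common XPath 1.0 workaround pads the attribute with spaces and matches whole tokens. The sketch below assumes made-up sample markup with a second div added to show the difference.

```php
<?php
$html = <<<'HTML'
<!DOCTYPE html>
<html><head><title></title></head>
<body>
<div class="blog_post_item first"><p>Newest post</p></div><!-- end blog_post_item -->
<div class="blog_post_item first_draft"><p>Not this one</p></div>
</body></html>
HTML;

$previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
// Pad the class attribute with spaces so only whole class names match.
$query = '//div[contains(concat(" ", normalize-space(@class), " "), " blog_post_item ")'
       . ' and contains(concat(" ", normalize-space(@class), " "), " first ")]';

$found = [];
foreach ($xpath->query($query) as $el) {
    $found[] = $doc->saveHTML($el); // outer HTML of the matching div
}
print_r($found);
```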

Related

How to format plaintext in PHP Simple HTML DOM Parser?

I'm trying to extract the content of a webpage in plain text - without the html tags. Here's some sample code:
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html($url);
$result['body'] = $dom->find('body', 0)->plaintext;
The problem is that what I get in $result['body'] is very messy. The HTML was removed, sure, but sentences often merge into others since there are no spaces or periods to delimit where the text from one HTML tag ended, and text from the following tag begins.
An example:
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
Results in:
"Headerthis is a paragraphthis is another paragraph"
Desired result:
"Header. this is a paragraph. this is another paragraph"
Is there any way to format the result from plaintext or perhaps apply extra manipulation on the innertext before using plaintext to achieve clear delimiters for sentences?
EDIT:
I'm thinking of doing something like this:
foreach($dom->find('div') as $element) {
$text = $element->plaintext;
$result['body'] .= $text.'. ';
}
but there's a problem when the divs are nested, since it would add the content of the parent, which includes text from all children, and then add the content of the children, effectively duplicating the text. This can be fixed simply by checking if there is a </div> inside the $text though.
Perhaps I should try callbacks.

Possibly something like this? Tested.
<?php
require_once 'vendor/autoload.php';
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html("index.html");
$result['body'] = implode('. ', array_map(function($element) {
return $element->plaintext;
}, $dom->find('div')));
echo $result['body'];
Tested against this index.html:
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>

Try this code:
$result = array();
foreach($html->find('div') as $e){
$result[] = $e->plaintext;
}
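Neither loop above deals with the nested-div duplication raised in the question. One possible way around it is to collect text only from divs that contain no further divs. The sketch below swaps in PHP's bundled DOM extension rather than Simple HTML DOM, and the wrapper div in the sample is an assumption added to show the nesting case.

```php
<?php
$html = <<<'HTML'
<body>
<div class="wrapper">
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</div>
</body>
HTML;

$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);
$parts = [];
// "not(.//div)" keeps only divs with no div descendants, so the wrapper is skipped.
foreach ($xpath->query('//div[not(.//div)]') as $div) {
    $text = trim($div->textContent);
    if ($text !== '') {
        $parts[] = $text;
    }
}
$body = implode('. ', $parts);
echo $body, "\n";
```

Note that text sitting directly inside a wrapper div, outside any innermost div, would be dropped by this query.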

How to stop adding new values to an array after one value is added to that array?

I have some HTML files that contain the same tags with different strings between them. I want to get the string from a specific tag, and once the first match is found, that string should be the only one added to the array. For more details, see this code.
The html:
<!DOCTYPE html>
<html>
<head></head>
<body>
<h1>Some Text</h1>
<p>This is the first Paragraph</p>
<ul>
<li></li>
<li></l1>
</ul>
<p>This is the second Pharagraph</p>
</body>
</html>
The HTML files will contain more elements than this.
I want to get the text inside the first <p> only, without wasting time searching the whole HTML file when I need just one value from a specific tag.
The PHP:
//Loop inside all the HTML files inside a folder
$files = glob("files/*.html");
foreach($files as $file){
//Get the whole content of each HTMl file
$content = file_get_contents($file);
//Search for specific tag
preg_match_all('#<p>(.*?)<\/p>#', $content, $matches);
}
I only want to add the value of the first match to $matches.
I can't edit the HTML to add a class or id to the tags I want values from, because I didn't create the files and I can't edit them all manually.
I don't mind using another way to get these values, as long as it achieves what I want (take only the first match, then stop searching the rest of the file).

You can do this with DomDocument.
<?php
$html = '<!DOCTYPE html>
<html>
<head></head>
<body>
<h1>Some Text</h1>
<p>This is the first Paragraph</p>
<ul>
<li></li>
<li></l1>
</ul>
<p>This is the second Pharagraph</p>
</body>
</html>';
$err = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($err);
// find all p tags, select the first, get its value
$pValue = $dom->getElementsByTagName('p')->item(0)->nodeValue;
//This is the first Paragraph
echo $pValue;
https://3v4l.org/kjFoC
So if you wanted to add to your code, perhaps do it like:
<?php
function getFirstParagraph($src) {
$err = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($src);
libxml_clear_errors();
libxml_use_internal_errors($err);
return $dom->getElementsByTagName('p')->item(0)->nodeValue;
}
//Loop inside all the HTML files inside a folder
$files = glob("files/*.html");
foreach($files as $file){
//Get the whole content of each HTMl file
$content = file_get_contents($file);
//
$matches[] = getFirstParagraph($content);
}
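If staying with regular expressions is preferred, preg_match() (as opposed to preg_match_all()) already stops at the first match. The following is a rough sketch under that assumption; firstParagraphByRegex() is a hypothetical helper name and the sample string is a shortened version of the HTML from the question.

```php
<?php
// Hypothetical helper: returns the text of the first <p>...</p>, or null if there is none.
function firstParagraphByRegex(string $content): ?string
{
    // preg_match() stops at the first match; "s" lets "." span line breaks,
    // "i" ignores tag case, and "\b" keeps <pre> from being mistaken for <p>.
    if (preg_match('#<p\b[^>]*>(.*?)</p>#si', $content, $m) === 1) {
        return trim(strip_tags($m[1]));
    }
    return null;
}

$sample = "<h1>Some Text</h1>\n<p>This is the first Paragraph</p>\n"
        . "<ul><li></li></ul>\n<p>This is the second Pharagraph</p>";

$matches = [];
$first = firstParagraphByRegex($sample);
if ($first !== null) {
    $matches[] = $first; // only the first paragraph is ever added
}
print_r($matches);
```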

PHP: parsing only namespaced xml

I'm trying to parse data like this:
<vin:layout name="Page" xmlns:vin="http://www.example.com/vin">
<header>
{someText}
<div>
<!-- some invalid xml code -->
<aas>
<nav class="main">
<vin:show section="Menu" />
</nav>
</div>
</header>
</vin:layout>
How can I parse data like this in PHP?
I tried DOM, but it doesn't work because of the malformed XML inside the root element. Can I tell the parser that everything without the vin namespace is text?

I would probably throw a sort of tag soup parser at it: something that can read your format, which apart from those deficiencies looks pretty well written. Nothing in the text stands in the way of a simple regular-expression-based scanner. I called mine Tagsoup; it has just the four node types you have: Starttag, Endtag, Text and Comment. For the tags you need to know their Tagname and NamespacePrefix. The naming resembles XML/HTML only for convenience; in fact this is all "roll your own", so do not hold these terms to any standard.
Usage that changes every tag (start or end) lacking the namespace prefix could look like this ($string contains the data from your question):
$scanner = new TagsoupIterator($string);
$nsPrefix = 'vin';
foreach ($scanner as $node) {
$isTag = $node instanceof TagsoupTag;
$isOfNs = $isTag && $node->getTagNsPrefix() === $nsPrefix;
if ($isTag && !$isOfNs) {
$node = strtr($node, ['&' => '&amp;', '<' => '&lt;']);
}
echo $node;
}
Output:
<vin:layout name="Page" xmlns:vin="http://www.example.com/vin">
&lt;header>
{someText}
&lt;div>
<!-- some invalid xml code -->
&lt;aas>
&lt;nav class="main">
<vin:show section="Menu" />
&lt;/nav>
&lt;/div>
&lt;/header>
</vin:layout>
A usage to extract everything inside a certain tag of a namespace could look like:
$scanner = new TagsoupIterator($string);
$parser = new TagsoupForwardNavigator($scanner);
$startTagWithNsPrefix = function ($namespace) {
return function (TagsoupNode $node) use ($namespace) {
/* @var $node TagsoupTag */
return $node->getType() === Tagsoup::NODETYPE_STARTTAG
&& $node->getTagNsPrefix() === $namespace;
};
};
$start = $parser->nextCondition($startTagWithNsPrefix('vin'));
$tag = $start->getTagName();
$parser->next();
echo $html = implode($parser->getUntilEndTag($tag));
Output:
<header>
{someText}
<div>
<!-- some invalid xml code -->
<aas>
<nav class="main">
<vin:show section="Menu" />
</nav>
</div>
</header>
The next part is to replace that part of the $string. As Tagsoup offers binary offsets and lengths, this is easy (and I take a slightly dirty shortcut via SimpleXML):
$xml = substr($string, 0, $start->getEnd()) . substr($string, $parser->getOffset());
$doc = new SimpleXMLElement($xml);
$doc[0] = $html;
echo $doc->asXML();
Output:
<vin:layout xmlns:vin="http://www.example.com/vin" name="Page">
<header>
{someText}
<div>
<!-- some invalid xml code -->
<aas>
<nav class="main">
<vin:show section="Menu" />
</nav>
</div>
</header>
</vin:layout>
Depending on your concrete needs, the implementation would have to change. For example, this one does not handle a tag nested inside another tag of the same name; it won't fail, but it won't handle that case correctly either. I have no idea whether you have that case. If so, you would need to add an open/close counter; the navigator class could easily be extended for that, even to offer two kinds of end-tag finding methods.
The examples given here are using the Tagsoup which you can see at this gist: https://gist.github.com/4415105
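For readers who do not want to pull in the gist, the escaping step alone can be approximated with a single regular expression. This is a simplified sketch, not the Tagsoup implementation: it assumes the input contains no stray "&" characters, that comments are well formed, and that no "<" appears inside attribute values.

```php
<?php
$string = <<<'XML'
<vin:layout name="Page" xmlns:vin="http://www.example.com/vin">
<header>
{someText}
<div>
<!-- some invalid xml code -->
<aas>
<nav class="main">
<vin:show section="Menu" />
</nav>
</div>
</header>
</vin:layout>
XML;

$nsPrefix = 'vin';
// Escape every "<" that does not open a vin: tag, a closing vin: tag, or a comment.
$pattern = '~<(?!/?' . preg_quote($nsPrefix, '~') . ':|!--)~';
$escaped = preg_replace($pattern, '&lt;', $string);

// With the foreign tags turned into text, a regular XML parser accepts the document.
$doc = new DOMDocument();
$loaded = $doc->loadXML($escaped);
$show = $doc->getElementsByTagNameNS('http://www.example.com/vin', 'show')->item(0);
echo $show->getAttribute('section'), "\n";
```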

Only show certain ID with PHP web scrape?

I'm working on a personal project that fetches my local weather station's school/business closings and displays the results on my personal site. Since the site doesn't offer an RSS feed (sadly), I was thinking of using a PHP scrape to get the contents of the page, but I only want to show a certain ID element. Is this possible?
My PHP code is,
<?php
$url = 'http://website.com';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
echo $output;
?>
I was thinking of using preg_match, but I'm not sure of the syntax or if that's even the right command. The ID element I want to show is #LeftColumnContent_closings_dg.

Here's an example using DOMDocument. It pulls the text from the first <h1> element with the id="test" ...
$html = '
<html>
<body>
<h1 id="test">test element text</h1>
<h1>test two</h1>
</body>
</html>
';
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$res = $xpath->query('//h1[@id="test"]');
if ($res->item(0) !== NULL) {
$test = $res->item(0)->nodeValue;
}
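Applied to the ID from the question, the same approach might look like the sketch below. The sample markup is invented for illustration: the real page's structure is unknown, and the assumption that the element is a table is only a guess, which is why the query matches any element with that id.

```php
<?php
// Stand-in for the cURL response; the table content is invented for illustration.
$output = '<html><body>
<div id="header">Site header</div>
<table id="LeftColumnContent_closings_dg"><tr><td>Example School: Closed</td></tr></table>
</body></html>';

$previous = libxml_use_internal_errors(true); // real-world pages rarely parse without warnings
$dom = new DOMDocument();
$dom->loadHTML($output);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
// "*" matches any element name, so the element type does not need to be known.
$node = $xpath->query('//*[@id="LeftColumnContent_closings_dg"]')->item(0);

// saveHTML($node) returns the element with its markup, ready to echo on another page.
$fragment = $node !== null ? $dom->saveHTML($node) : '';
echo $fragment;
```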

A library I've used with great success for this sort of thing is PHPQuery: http://code.google.com/p/phpquery/ .
You basically get your website into a string (like you have above), then do:
phpQuery::newDocument($output);
$titleElement = pq('title');
$title = $titleElement->html();
For instance - that would get the contents of the title element. The benefit is that all the methods are named after the jQuery ones, making it pretty easy to learn if you already know jQuery.

Get HTML source code of page with PHP

If I have the html file:
<!doctype html>
<html>
<head></head>
<body>
<!-- Begin -->
Important Information
<!-- End -->
</body>
</head>
</html>
How can I use PHP to get the string "Important Information" from the file?

If you already have the parsing sorted, just use file_get_contents(). You can pass it a URL and it will return the content found at the URL, in this case, the html. Or if you have the file locally, you pass it the file path.

In this simple example you can open the file and call fgets() until you find a line with <!-- Begin -->, then save the lines until you find <!-- End -->.
If your HTML is in a variable you can just do:
<?php
$begin = strpos($var, '<!-- Begin -->') + strlen('<!-- Begin -->'); // Could hardcode 14 here (the length of your 'needle')
$end = strpos($var, '<!-- End -->');
$text = substr($var, $begin, ($end - $begin));
echo $text;
?>
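The fgets() approach mentioned at the start of this answer could be sketched as follows. It assumes each marker sits on its own line, and it writes the sample page to a temporary file only so that the example is self-contained.

```php
<?php
// Write the sample page to a temporary file so the sketch is self-contained.
$path = tempnam(sys_get_temp_dir(), 'page');
file_put_contents(
    $path,
    "<!doctype html>\n<html>\n<body>\n<!-- Begin -->\nImportant Information\n<!-- End -->\n</body>\n</html>\n"
);

$lines = [];
$inside = false;
$handle = fopen($path, 'r');
while (($line = fgets($handle)) !== false) {
    if (strpos($line, '<!-- End -->') !== false) {
        break; // stop reading as soon as the end marker appears
    }
    if ($inside) {
        $lines[] = rtrim($line, "\r\n");
    }
    if (strpos($line, '<!-- Begin -->') !== false) {
        $inside = true; // start collecting from the next line on
    }
}
fclose($handle);
unlink($path);

$text = implode("\n", $lines);
echo $text, "\n";
```

Reading line by line means the rest of the file after the end marker is never loaded, which can matter for large files.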

You can fetch the HTML like this:
//file_get_html function from third party library
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
For any further operations on the DOM, read the following docs:
http://de.php.net/manual/en/book.dom.php
