I wrote a small helper function that does basic search/replace using XPath, because it keeps DOM manipulations short while staying easy to read and understand.
Code:
<?php
function xml_search_replace($dom, $search_replace_rules) {
if (!is_array($search_replace_rules)) {
return;
}
$xp = new DOMXPath($dom);
foreach ($search_replace_rules as $search_pattern => $replacement) {
foreach ($xp->query($search_pattern) as $node) {
$node->nodeValue = $replacement;
}
}
}
The problem is that I now need to do different "search/replace" operations on different parts of the XML DOM. I had hoped something like the following would work, but DOMXPath can't use a DOMDocumentFragment :(
The first part (up to the foreach loop) of the example below works like a charm. I'm looking for inspiration for an alternative approach that is still short and readable (without too much boilerplate).
Code:
<?php
$dom = new DOMDocument;
$dom->loadXml(file_get_contents('container.xml'));
$payload = $dom->getElementsByTagName('Payload')->item(0);
xml_search_replace($dom, array('//MessageReference' => 'SRV4-ID00000000001'));
$payloadXmlTemplate = file_get_contents('payload_template.xml');
foreach (array(array('id' => 'some_id_1'),
array('id' => 'some_id_2')) as $request) {
$fragment = $dom->createDocumentFragment();
$fragment->appendXML($payloadXmlTemplate);
xml_search_replace($fragment, array('//PayloadElement' => $request['id']));
$payload->appendChild($fragment);
}
Thanks to Francis Avila I came up with the following:
<?php
function xml_search_replace($node, $search_replace_rules) {
if (!is_array($search_replace_rules)) {
return;
}
$xp = new DOMXPath($node->ownerDocument);
foreach ($search_replace_rules as $search_pattern => $replacement) {
foreach ($xp->query($search_pattern, $node) as $matchingNode) {
$matchingNode->nodeValue = $replacement;
}
}
}
$dom = new DOMDocument;
$dom->loadXml(file_get_contents('container.xml'));
$payload = $dom->getElementsByTagName('Payload')->item(0);
xml_search_replace($dom->documentElement, array('//MessageReference' => 'SRV4-ID00000000001'));
$payloadXmlTemplate = file_get_contents('payload_template.xml');
foreach (array(array('id' => 'some_id_1'),
array('id' => 'some_id_2')) as $request) {
$fragment = $dom->createDocumentFragment();
$fragment->appendXML($payloadXmlTemplate);
xml_search_replace($payload->appendChild($fragment),
array('//PayloadElement' => $request['id']));
}
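One detail worth checking in the version above: passing a context node to DOMXPath::query() only limits the search when the expression is relative. The sketch below uses a made-up document (the Item element is hypothetical; only PayloadElement comes from the example) to show the difference between a leading `//` and a leading `.//`.

```php
<?php
// Hypothetical sample document; only PayloadElement mirrors the example above.
$dom = new DOMDocument;
$dom->loadXML('<Container><Payload>'
    . '<Item><PayloadElement>a</PayloadElement></Item>'
    . '<Item><PayloadElement>b</PayloadElement></Item>'
    . '</Payload></Container>');

$xp = new DOMXPath($dom);
$secondItem = $dom->getElementsByTagName('Item')->item(1);

// A leading "//" starts at the document root, even with a context node.
$absoluteCount = $xp->query('//PayloadElement', $secondItem)->length;

// A leading ".//" restricts the search to descendants of the context node.
$relativeCount = $xp->query('.//PayloadElement', $secondItem)->length;

echo $absoluteCount, ' ', $relativeCount, "\n";
```

If the rules are meant to touch only the subtree that was just appended, a relative pattern such as `.//PayloadElement` is probably what you want.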
Related
I wrote the code below. How can I make it shorter? I don't want to use foreach for every regex match. Thank you.
<?php
preg_match_all('#<article [^>]*>(.*?)<\/article>#sim', $content, $article);
foreach($article[1] as $posts) {
preg_match_all('#<img class="images" [^>]*>#si', $posts, $matches);
$img[] = $matches[0];
}
$result = array_filter($img);
foreach($result as $res) {
preg_match_all('#src="(.*?)" data-highres="(.*?)"#si', $res[0], $out);
$final[] = array(
'src' => $proxy.base64_encode($out[1][0]),
'highres' => $proxy.base64_encode($out[2][0])
);
}
?>
If you want robust code (that always works), avoid parsing HTML with regex, because HTML is more complicated and unpredictable than you think. Instead, use the built-in tools available for this particular task, i.e. the DOMxxx classes.
$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_use_internal_errors($state);
$xp = new DOMXPath($dom);
$imgList = $xp->query('//article//img[@src][@data-highres]');
foreach($imgList as $img) {
$final[] = [
'src' => $proxy.base64_encode($img->getAttribute('src')),
'highres' => $proxy.base64_encode($img->getAttribute('data-highres'))
];
}
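To try this without a live page, here is a minimal self-contained sketch. The markup and the $proxy value are made up for illustration; the query only keeps img elements that carry both attributes, so the second article is skipped.

```php
<?php
// Hypothetical input; $proxy and the markup are made up for illustration.
$proxy = 'https://proxy.example/?u=';
$content = '<html><body>'
    . '<article><img class="images" src="a.jpg" data-highres="a-big.jpg"></article>'
    . '<article><img class="images" src="b.jpg"></article>'
    . '</body></html>';

$dom = new DOMDocument;
$state = libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($content);
libxml_clear_errors();
libxml_use_internal_errors($state);

$xp = new DOMXPath($dom);
$final = [];
// Only <img> elements inside <article> with both attributes present match.
foreach ($xp->query('//article//img[@src][@data-highres]') as $img) {
    $final[] = [
        'src'     => $proxy . base64_encode($img->getAttribute('src')),
        'highres' => $proxy . base64_encode($img->getAttribute('data-highres')),
    ];
}
print_r($final);
```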
I am using Simple HTML DOM to parse an HTML file.
I have a dynamic array called links2. It can be empty, or it may have 4 or more elements, depending on the case.
<?php
include('simple_html_dom.php');
$url = 'http://www.example.com/';
$html = file_get_html($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
//////////////////////////////////////////////////////////////////////////////
foreach ($doc->getElementsByTagName('p') as $link)
{
$intro2 = $link->nodeValue;
$links2[] = array(
'value' => $link->textContent,
);
$su=count($links2);
}
$word = 'document.write(';
Assume that two elements of the links2 array contain $word. When I try to filter the array by removing the matching elements with
unset( $links2[array_search($word, $links2 )] );
print_r($links2);
the filter removes only one element, and array_diff doesn't solve the problem. Any suggestions?
Solved by adding an exception:
foreach ($doc->getElementsByTagName('p') as $link)
{
$dont = $link->textContent;
if (strpos($dont, 'document') === false) {
$links2[] = array(
'value' => $link->textContent,
);
}
$su=count($links2);
echo $su;
}
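As far as I can tell, the original attempt fails because $links2 is an array of arrays: array_search() compares the string against each inner array, finds nothing and returns false, and unset() then treats false as key 0. An alternative sketch that filters after the fact, using made-up data shaped like $links2:

```php
<?php
// Hypothetical data shaped like $links2 from the question.
$links2 = [
    ['value' => 'First paragraph'],
    ['value' => 'document.write("ad 1")'],
    ['value' => 'Second paragraph'],
    ['value' => 'document.write("ad 2")'],
];
$word = 'document.write(';

// Keep only entries whose text does not contain $word,
// then reindex so the keys run 0..n-1 again.
$filtered = array_values(array_filter($links2, function ($entry) use ($word) {
    return strpos($entry['value'], $word) === false;
}));

print_r($filtered);
```

This removes every matching entry in one pass, no matter how many there are.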
I have this basic code that doesn't work. How can I use Xpath with html5lib php? Or Xpath with HTML5 in any other way.
$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url);
$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//h1');
//$elements = $dom->getElementsByTagName('h1');
foreach ($elements as $element)
{
var_dump($element);
}
No elements are found. Using $xpath->query('.') works for getting the root element (xpath in general seems to work). $dom->getElementsByTagName('h1') is working.
Use the disable_html_ns option.
$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5(array(
'disable_html_ns' => true, // add `disable_html_ns` option
));
$dom = $html5->loadHTML($response);
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//h1');
foreach ($elements as $element) {
var_dump($element);
}
https://github.com/Masterminds/html5-php#options
disable_html_ns (boolean): Prevents the parser from automatically assigning the HTML5 namespace to the DOM document. This is for non-namespace aware DOM tools.
So it looks like html5lib is setting us up with a default namespace.
$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);
$de = $dom->documentElement;
if ($de->isDefaultNamespace($de->namespaceURI)) {
echo $de->namespaceURI . "\n";
}
This outputs:
http://www.w3.org/1999/xhtml
To query against namespaced nodes with xpath you need to register the namespace and use the prefix in the query.
$xpath = new DOMXPath($dom);
$xpath->registerNamespace('n', $de->namespaceURI);
$elements = $xpath->query('//n:h1');
foreach ($elements as $element)
{
echo $element->nodeValue;
}
This outputs PHP.
Generally I find it tedious to prefix everything in xpath queries when there's a default namespace involved, so I just strip it.
$de = $dom->documentElement;
$de->removeAttributeNS($de->getAttributeNode("xmlns")->nodeValue,"");
$dom->loadXML($dom->saveXML()); // reload the existing dom, now sans default ns
After that you can use your original xpath and it'll work just fine.
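$xpath = new DOMXPath($dom); // re-create it, since an earlier DOMXPath still points at the old document
$elements = $xpath->query('//h1');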
foreach ($elements as $element)
{
echo $element->nodeValue;
}
This now outputs PHP as well.
So the modified version of the example would be something like:
Example:
$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);
$de = $dom->documentElement;
if ($de->isDefaultNamespace($de->namespaceURI)) {
$de->removeAttributeNS($de->getAttributeNode("xmlns")->nodeValue,"");
$dom->loadXML($dom->saveXML());
}
$xpath = new DOMXPath($dom);
$elements = $xpath->query('//h1');
foreach ($elements as $element)
{
var_dump($element);
}
Output:
class DOMElement#11 (18) {
public $tagName =>
string(2) "h1"
public $schemaTypeInfo =>
NULL
public $nodeName =>
string(2) "h1"
public $nodeValue =>
string(3) "PHP"
...
public $textContent =>
string(3) "PHP"
}
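The namespace behaviour can be reproduced without html5-php. The sketch below loads a small made-up XHTML string with plain DOMDocument and compares an unprefixed query, a prefixed query, and a local-name() query; the prefix `n` is arbitrary.

```php
<?php
// Hypothetical XHTML snippet standing in for the parsed page.
$dom = new DOMDocument;
$dom->loadXML('<html xmlns="http://www.w3.org/1999/xhtml">'
    . '<body><h1>PHP</h1></body></html>');

$xpath = new DOMXPath($dom);

// Unprefixed names only match elements that are in no namespace.
$plain = $xpath->query('//h1')->length;

// Register a prefix for the default namespace and use it in the query.
$xpath->registerNamespace('n', 'http://www.w3.org/1999/xhtml');
$prefixed = $xpath->query('//n:h1');

// Alternative without registering anything: match on the local name.
$byLocalName = $xpath->query('//*[local-name()="h1"]');

echo $plain, ' ', $prefixed->item(0)->nodeValue, ' ', $byLocalName->length, "\n";
```

The local-name() form avoids both the prefix and the strip-and-reload step, at the cost of a more verbose expression.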
I am trying to do some web scraping using simple_html_dom, but I only want the inner text of a single span element. Do I have to load the entire page for that? It takes a lot of time because I am running it in a loop. What are the alternatives for doing this faster?
Here is what I am doing now-
$html = file_get_html($url);
foreach($html->find('span') as $element) {
if($element->innertext=="some text") {
$html->clear();
unset($html);
break;
}
else {
//do something
}
}
This is too slow when used inside a loop. Is there a faster way to do this?
I am not sure about the speed, but instead of looping with foreach, you can fetch a single element by index like this:
$html->find( $selector, $idx )
<?php
$html = file_get_html( $url );
if ( is_object( $html ) ) {
if ( $span = $html->find( "span", 0 ) ) {
$span->innertext = "some text";
}
}
?>
You could give the following a try:
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$content = $xpath->query("//span")->item(0)->nodeValue;
echo $content;
Probably the fastest option is to let XPath do the text matching:
$dom = new DOMDocument();
$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$content = $xpath->query("//span[contains(text(), 'some text')]")->item(0)->nodeValue;
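Note that item(0) returns null when nothing matches, so reading nodeValue directly can fail. Here is a small sketch on made-up markup that guards against that and also shows the difference between a substring match with contains() and an exact match with normalize-space().

```php
<?php
// Hypothetical markup; in practice this would come from loadHTMLFile($url).
$html = '<html><body><span>other</span><span> some text </span>'
    . '<span>some text here</span></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Exact match after trimming and collapsing whitespace.
$exact = $xpath->query("//span[normalize-space(.) = 'some text']");

// Substring match: also picks up "some text here".
$partial = $xpath->query("//span[contains(., 'some text')]");

// item(0) is null when nothing matches, so check before reading nodeValue.
$first = $partial->item(0);
$content = $first !== null ? trim($first->nodeValue) : null;

echo $exact->length, ' ', $partial->length, ' ', $content, "\n";
```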
I would like to extract the Vine image URL and video URL using PHP Simple HTML DOM Parser.
http://simplehtmldom.sourceforge.net/
Here is an example Vine URL:
https://vine.co/v/bjHh0zHdgZT
I need to extract this information from the page. For the image URL:
<meta property="twitter:image" content="https://v.cdn.vine.co/v/thumbs/8B474922-0D0E-49AD-B237-6ED46CE85E8A-118-000000FFCD48A9C5_1.0.6.mp4.jpg?versionId=mpa1lJy2aylTIEljLGX63RFgpSR5KYNg">
and for the video URL:
<meta property="twitter:player:stream" content="https://v.cdn.vine.co/v/videos/8B474922-0D0E-49AD-B237-6ED46CE85E8A-118-000000FFCD48A9C5_1.0.6.mp4?versionId=ul2ljhBV28TB1dUvAWKgc6VH0fmv8QCP">
I only want the content of these meta tags. If anyone can help, I would really appreciate it. Thanks.
Instead of using the lib you pointed out, I'm using native PHP DOM in this example, and it should work.
Here's a small class I created for something like that:
<?php
class DomFinder {
private $xpath; // DOMXPath for the loaded page, or null if loading failed
function __construct($page) {
$html = @file_get_contents($page);
$doc = new DOMDocument();
$this->xpath = null;
if ($html) {
$doc->preserveWhiteSpace = true;
$doc->resolveExternals = true;
@$doc->loadHTML($html);
$this->xpath = new DOMXPath($doc);
$this->xpath->registerNamespace("html", "http://www.w3.org/1999/xhtml");
}
}
function find($criteria = NULL, $getAttr = FALSE) {
if ($criteria && $this->xpath) {
$entries = $this->xpath->query($criteria);
$results = array();
foreach ($entries as $entry) {
if (!$getAttr) {
$results[] = $entry->nodeValue;
} else {
$results[] = $entry->getAttribute($getAttr);
}
}
return $results;
}
return NULL;
}
function count($criteria = NULL) {
$items = 0;
if ($criteria && $this->xpath) {
$entries = $this->xpath->query($criteria);
foreach ($entries as $entry) {
$items++;
}
}
return $items;
}
}
To use it you can try:
$url = "https://vine.co/v/bjHh0zHdgZT";
$dom = new DomFinder($url);
$content_cell = $dom->find("//meta[@property='twitter:player:stream']", 'content');
print $content_cell[0];
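If you only need the two values, a shorter sketch without the helper class is possible with DOMXPath::evaluate(). The HTML string and its content URLs below are placeholders, not the real Vine page.

```php
<?php
// Hypothetical page source; the content URLs are placeholders.
$html = '<html><head>'
    . '<meta property="twitter:image" content="https://example.com/thumb.jpg">'
    . '<meta property="twitter:player:stream" content="https://example.com/video.mp4">'
    . '</head><body></body></html>';

$doc = new DOMDocument();
$state = libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($state);

$xpath = new DOMXPath($doc);

// Wrapping the path in string(...) makes evaluate() return the attribute
// value directly; it yields an empty string when the tag is missing.
$image = $xpath->evaluate("string(//meta[@property='twitter:image']/@content)");
$video = $xpath->evaluate("string(//meta[@property='twitter:player:stream']/@content)");

echo $image, "\n", $video, "\n";
```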