PHP DomDocument XPATH does not match to the HTML real structure

I'm trying to validate the following HTML code. Note the text content inside the IMG tag, which is well-formed as markup but invalid as HTML:
<html>
<head>
</head>
<body>
<img src="./">
Some Text
</img>
</body>
</html>
Using PHP and DOMDocument, I try to read the entire tree with XPath:
$dom = new DOMDocument();
$dom->validateOnParse = 0;
$dom->loadHTML($htmlSource);
$xpath = new DOMXPath($dom);
$allNodes = $xpath->query("//node()");
The result I get:
/html
/html/head
/html/body
/html/body/#text[1]
/html/body/img
/html/body/#text[2]
which obviously does not match the exact HTML structure.
What I expected to see is
....
/html/body/img/#text
....
Why does XPATH interpret the tree this way?
How can I get it to work as I expected?
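One likely explanation, offered as a sketch rather than a definitive answer: img is a void element in HTML, so the HTML parser behind loadHTML() closes it as soon as its start tag ends. The text that follows becomes a sibling inside body, and the stray </img> is reported as a parse error. Parsing the same markup as XML with loadXML() keeps the text inside img. The helper name textParent is made up for this illustration.

```php
<?php
// Same markup as in the question.
$htmlSource = <<<HTML
<html>
<head>
</head>
<body>
<img src="./">
Some Text
</img>
</body>
</html>
HTML;

// Hypothetical helper: name of the element that owns the "Some Text" node.
function textParent(DOMDocument $dom): string
{
    $xpath = new DOMXPath($dom);
    $node = $xpath->query("//text()[contains(., 'Some Text')]")->item(0);
    return $node->parentNode->nodeName;
}

// Collect parser errors (such as the unexpected </img>) instead of printing warnings.
libxml_use_internal_errors(true);

$asHtml = new DOMDocument();
$asHtml->loadHTML($htmlSource);

$asXml = new DOMDocument();
$asXml->loadXML($htmlSource);

echo "loadHTML: text belongs to <", textParent($asHtml), ">\n";
echo "loadXML:  text belongs to <", textParent($asXml), ">\n";
```

Note that loadXML() only accepts well-formed input, so it is an option for checking structure like this but not for arbitrary real-world HTML.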

Related

how to give some words in html a link

I have this HTML, just as an example:
this is some html code, and this is html
this is image <img src="any url with html word" alt="html" />
<iframe src="html"></iframe>
<script type="text/javascript">
var html = "any thing here";
var x = "this is html"
</script>
I want a way to turn every occurrence of the word html into a link on that word.
As shown, the word may also appear inside tag attributes and scripts. Those occurrences must be left alone, and the word should be replaced only when it is plain text inside a span, p or div.
I tried every DOM approach I could think of, without success:
$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);
$query_entries = $xpath->evaluate("(//div | //span | //p)[not(ancestor::a)]/text()");
foreach($query_entries as $element){
if($element instanceof DOMText){
$element->nodeValue = str_replace('html','html',$element->nodeValue);
}
}
When I set nodeValue to a string containing HTML, the markup gets escaped, and if I decode the output afterwards it breaks the JavaScript code.
Is there a regex solution?
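Setting nodeValue on a text node always stores plain text, which is why the markup gets escaped. One way around it, sketched here under assumptions, is to split each text node around the word and insert real a elements between the pieces. The link target https://example.com/html and the function name linkWord are placeholders, not taken from the question.

```php
<?php
// Hypothetical helper: wrap every occurrence of $word in eligible text nodes
// with a link to $href and return the resulting HTML.
function linkWord(string $str, string $word, string $href): string
{
    libxml_use_internal_errors(true); // tolerate imperfect HTML
    $dom = new DOMDocument();
    $dom->loadHTML($str);
    $xpath = new DOMXPath($dom);

    // Text directly inside div, span or p, skipping anything already inside a link.
    $textNodes = $xpath->query("(//div | //span | //p)[not(ancestor::a)]/text()");

    // Work on a snapshot array so tree edits cannot disturb the iteration.
    foreach (iterator_to_array($textNodes) as $textNode) {
        $parts = explode($word, $textNode->nodeValue);
        if (count($parts) < 2) {
            continue; // word not present in this node
        }
        $fragment = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($i > 0) {
                // Between two pieces there was one occurrence of the word.
                $a = $dom->createElement('a');
                $a->setAttribute('href', $href);
                $a->appendChild($dom->createTextNode($word));
                $fragment->appendChild($a);
            }
            if ($part !== '') {
                $fragment->appendChild($dom->createTextNode($part));
            }
        }
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }
    return $dom->saveHTML();
}

$str = '<div><p>this is html</p><img src="html.png" alt="html">'
     . '<script>var x = "this is html";</script></div>';
$result = linkWord($str, 'html', 'https://example.com/html');
echo $result;
```

The match is a plain case-sensitive substring search, so it would also hit words such as xhtml. Attribute values and script contents are never touched because only text children of div, span and p are selected.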

Get IFrame src with SimpleHTMLDOM parser

Hi, I am working on a scraper but I am unable to get one piece of information.
This is the link: http://sfglobe.com/?id=19110
<div class="video_container">
<div class="video_object">
<iframe id="player" width="100%" height="100%" frameborder="0" allowfullscreen="1" title="YouTube video player"
src="http://www.youtube.com/embed/KMYrIi_Mt8A?enablejsapi=1&controls=1&showinfo=0&color=white&rel=0&wmode=transparent&modestbranding=1&theme=light&autohide=1&start=4&origin=http%3A%2F%2Fsfglobe.com">
<!DOCTYPE html>
<html lang="en" data-cast-api-enabled="true" dir="ltr"
I need src="http://www.youtube.com/embed/KMYrIi_Mt8A....."
This is my code, which does not work:
foreach ($html->find('.video_object') as $iframe) {
    echo "this is video " . $iframe->outertext . " <br>";
}
Thank you very much.

Does this return anything in your code?
$html->find('.video_object iframe')
If so, try calling ->getAttribute('src') on the result; it might work.
For further information, take a look at PHP's DOMElement.
EDIT
Use XPath instead; it will output the expected result:
// init DOMDocument
$dom = new DOMDocument();
// get the source from the URL
$html = file_get_contents("URL");
// load the HTML from the string
$dom->loadHTML($html);
// init XPath
$xpath = new DOMXPath($dom);
// fetch the src attribute of the iframe inside the element with the given class
$iframe_src = $xpath->query('//*[@class="CLASSNAME"]/iframe/@src');
var_dump($iframe_src);
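As a self-contained illustration of the same idea, here is a minimal sketch that runs against an inline string instead of a fetched page. The markup is a trimmed-down stand-in for the page in the question, with the src URL shortened for readability.

```php
<?php
// Trimmed-down stand-in for the page markup shown in the question.
$html = <<<HTML
<div class="video_container">
<div class="video_object">
<iframe id="player" src="http://www.youtube.com/embed/KMYrIi_Mt8A?enablejsapi=1"></iframe>
</div>
</div>
HTML;

libxml_use_internal_errors(true); // real pages often trigger parser warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// An attribute query returns DOMAttr nodes; ->value holds the string.
$srcs = [];
foreach ($xpath->query('//div[@class="video_object"]/iframe/@src') as $attr) {
    $srcs[] = $attr->value;
}
print_r($srcs);
```

The @class="video_object" test is an exact string comparison, so it would not match an element that carries additional classes.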

How can I preg_match script tag src, but avoid affecting img tag src?

I have to match local src's and make them load via the web. Example:
src="/js/my.js">
Becomes:
src="http://cdn.example.com/js/my.js">
This is what I have now:
if (!preg_match("#<script(.+?) src=\"http#i",$page)){
$page = preg_replace("#<script(.+?) src=\"#is", "<script$1 src=\"$workingUrl", $page);
}
It works fine when it encounters something like this:
<script type='text/javascript' src='/wp-includes/js/jquery/jquery.js?ver=1.8.3'></script>
It fails when it encounters something like this:
<script language="JavaScript">
window.moveTo(0,0);
window.resizeTo(screen.width,screen.height);
</script>
If the script tag doesn't contain a src it will then find the src of the first image tag and switch out its URL.
I need to know how to get it to terminate the match on the script tag only and/or how to perform the replacement better.

Definitely use a DOM parser. XPath with DOMDocument will cleanly and reliably rewrite the script tags that:
have a src attribute, and
whose src attribute does not start with http.
I could have further developed the xpath query expression to check for the leading http substring, but I didn't want to scare you off with more syntax.
Code: (Demo)
$html = <<<HTML
<html>
<head>
<script type='text/javascript' src='/wp-includes/js/jquery/jquery.js?ver=1.8.3'></script>
<script language="JavaScript">
window.moveTo(0,0);
window.resizeTo(screen.width,screen.height);
</script>
</head>
</html>
HTML;
$workingUrl = 'https://www.example.com';
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//script[@src]") as $node) {
    $src = $node->getAttribute('src');
    if (strpos($src, 'http') !== 0) {
        // prepend the base URL so the original path is kept
        $node->setAttribute('src', $workingUrl . $src);
    }
}
echo $dom->saveHTML();
Output:
<html>
<head>
<script type="text/javascript" src="https://www.example.com/wp-includes/js/jquery/jquery.js?ver=1.8.3"></script>
<script language="JavaScript">
window.moveTo(0,0);
window.resizeTo(screen.width,screen.height);
</script>
</head>
</html>
The only slightly "scarier" XPath version: (Demo)
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query("//script[@src and not(starts-with(@src,'http'))]") as $node) {
    // prepend the base URL so the original path is kept
    $node->setAttribute('src', $workingUrl . $node->getAttribute('src'));
}
echo $dom->saveHTML();

If you would rather not use DOMDocument::loadHTML and work with the DOM, then dropping the . and only accepting everything up to the first > will probably work better as a fallback (although it is not perfect, since in theory other attributes of <script> could contain a >).
Using:
#<script([^>]+?) src=\"#is
as your pattern instead makes the pattern stop matching when it encounters the first > after <script.
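A quick way to see the difference is to run the pattern over a page fragment where an inline script is followed by an image. This is only a sketch: the sample markup and the /logo.png path are invented for the demonstration, and the CDN host is the example one from the question.

```php
<?php
$workingUrl = 'http://cdn.example.com'; // example CDN host from the question
$page = '<script language="JavaScript">window.moveTo(0,0);</script>'
      . '<img src="/logo.png">'
      . '<script type="text/javascript" src="/js/my.js"></script>';

// [^>]+? cannot cross the ">" that ends the opening tag, so a script tag
// without src no longer drags the match forward into the next img tag.
$page = preg_replace('#<script([^>]+?) src="#is', '<script$1 src="' . $workingUrl, $page);

echo $page, "\n";
```

Like the pattern in the question, this only handles double-quoted src attributes and does not skip URLs that are already absolute.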

Rewriting HTML tags with DOM/Xpath (PHP)

I'm parsing a block of HTML with DOM/Xpath in PHP. Within this HTML, there are a few p tags that I want to convert to h4 tags, instead.
Raw HTML =>
<p class="archive">Awesome line of text</p>
Desired HTML =>
<h4>Awesome line of text</h4>
How can I do this with Xpath? I think I need to call on appendChild, but I'm not sure. Thank you for any guidance.

Something along these lines should do it:
<?php
$html = <<<END
<html>
<head>
<title>Test</title>
</head>
<body>
<p>hi</p>
<p class="archive">Awesome line of text</p>
<p>bye</p>
<p class="archive">Another line of <b>text</b></p>
<p>welcome</p>
<p class="archive">Another <u>line</u> of <b>text</b></p>
</body>
</html>
END;
$doc = new DOMDocument();
$doc->loadXML($html);
$xpath = new DOMXPath($doc);
// Find the nodes we want to change
$nodes = $xpath->query("//p[@class = 'archive']");
foreach ($nodes as $node) {
// Create a new H4 node
$h4 = $doc->createElement('h4');
// Move the children of the current node to the new one
while ($node->hasChildNodes())
$h4->appendChild($node->firstChild);
// Replace the current node with the new
$node->parentNode->replaceChild($h4, $node);
}
echo $doc->saveXML();
?>

find if a img have "alt", if not then add from array ( serverside )

First I need to find all img elements on the site,
and then check whether each img has an "alt" attribute. If the image has the attribute it is skipped; if it has none, or the alt is empty, a string chosen at random from a list or array is added to the img.
Here is how to do it with JavaScript:
find if a img have alt in jquery if not then add from array
But that did not help me, because according to this:
How do search engines crawl Javascript?
search bots can't read it, so to add keywords to img alt you need a server-side language rather than JavaScript.
What next? PHP? Can I do it with simple code?

Well, import it into a DOMDocument object and find all the images inside.
Seems rather trivial. See the DOMDocument class.
Here's my code for the problem:
<?php
$html = <<<HTML
<html lang="en-US">
<head>
<meta charset="UTF-8">
<title></title>
</head>
<body>
<p>
<img src="test.png">
<img src="test.jpg" alt="Testing">
<img src="test.gif">
</p>
</body>
</html>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$images = $dom->getElementsByTagName("img");
foreach ($images as $image) {
if (!$image->hasAttribute("alt")) {
    // setAttribute() creates the attribute if it does not exist yet
    $image->setAttribute("alt", "Ready Value!");
}
}
}
echo $dom->saveHTML();
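The question also asks for empty alt values to be filled and for the text to come from a list at random. A minimal sketch of that variation follows; the $altTexts entries are hypothetical placeholders, and the sample markup is shortened.

```php
<?php
// Hypothetical list of fallback texts; replace with your own keywords.
$altTexts = ['Mountain view', 'Product photo', 'Company logo'];

$html = '<p><img src="test.png"><img src="test.jpg" alt="Testing"><img src="test.gif" alt=""></p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);

foreach ($dom->getElementsByTagName('img') as $image) {
    // getAttribute() returns '' for a missing attribute, so this covers
    // both "no alt at all" and "alt is empty".
    if (trim($image->getAttribute('alt')) === '') {
        $image->setAttribute('alt', $altTexts[array_rand($altTexts)]);
    }
}
echo $dom->saveHTML();
```

Images that already carry a non-empty alt, such as test.jpg here, are left unchanged.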
