PHP Dom Document Don't fix markup

How do I stop DOMDocument from having a mind of its own?
$dom = new DOMDocument();
$validHtml = '<body>Test</body>';
$dom->loadHTML($validHtml);
After loading, the anchor attribute is encoded. I want it not to do this.
$body = $dom->saveHTML();
var_dump($body);
//<body>Test</body>
I realize this has been covered before, but everywhere I look, it's more useless Ninja code. Any help is appreciated.

Here's how I fixed my own problem. Basically, I decided to strip all the {{...}} tags out of the markup and put in placeholders that I can use later to put them back:
$validHtml = '<body>Test</body>';
$matches = array();
preg_match_all('/{{[^}]+}}/',$validHtml, $matches);
$matches = $matches[0];
if (count($matches)>0){
foreach ($matches as $i=>$match){
$validHtml = str_replace($match, "<!--INDEX-$i-->", $validHtml);
}
}
$dom = new DOMDocument();
$dom->loadHTML($validHtml);
... //do processing on the loaded dom
Later on, after manipulating the DOM, I put all the matches back:
$validHtml = $dom->saveHTML();
if (count($matches)>0){
foreach ($matches as $i=>$match){
$validHtml = str_replace(array("<!--INDEX-$i-->", "<!--INDEX-$i-->"), $match, $validHtml);
}
}
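
The same protect-and-restore idea can be sketched as a pair of helpers. This is only an illustration under stated assumptions: the function names, the sample markup, and the token format (TPLTOKEN0END and so on) are hypothetical and not part of the answer above. Tokens made only of letters and digits are used here because they contain nothing for DOMDocument to escape, whether they land in text or in an attribute value.

```php
<?php
// Hypothetical helper: swap every {{...}} tag for a plain alphanumeric token
// and remember the mapping in $map (token => original tag).
function protectPlaceholders(string $html, array &$map): string
{
    $map = [];
    return preg_replace_callback('/{{[^}]+}}/', function (array $m) use (&$map): string {
        $token = 'TPLTOKEN' . count($map) . 'END';
        $map[$token] = $m[0];
        return $token;
    }, $html);
}

// Hypothetical helper: put the original tags back after the DOM work is done.
function restorePlaceholders(string $html, array $map): string
{
    return strtr($html, $map);
}

// Assumed sample markup; the question's original anchor is not shown.
$map  = [];
$safe = protectPlaceholders('<body><a href="{{url}}">Hello {{name}}</a></body>', $map);

$dom = new DOMDocument();
$dom->loadHTML($safe);
// ... do processing on the loaded DOM here ...

$result = restorePlaceholders($dom->saveHTML(), $map);
echo $result;
```

Because each token ends in END, the token for index 1 is never a substring of the token for index 10, so the restore step cannot mix them up.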

Related

preg_match returns nothing

I'm trying to get the weather data from http://www.weather-forecast.com/locations/Berlin/forecasts/latest
but preg_match just returns nothing
<?php
$contents=file_get_contents("http://www.weather-forecast.com/locations/Berlin/forecasts/latest");
preg_match('/3Day Weather Forecast Summary:<\/b><span class="phrase">(.*?)</s', $contents, $matches);
print_r($matches)
?>

Don't use a regex to parse HTML; use an HTML parser like DOMDocument:
$contents = file_get_contents("http://www.weather-forecast.com/locations/Berlin/forecasts/latest");
$dom = new DOMDocument();
// collect parser warnings about invalid markup instead of printing them
libxml_use_internal_errors(true);
$dom->loadHTML($contents);
$x = new DOMXpath($dom);
// select every span whose class attribute contains "phrase"
foreach($x->query('//span[contains(@class,"phrase")]') as $phrase)
{
echo $phrase->textContent;
}
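
For trying the query without a network request, here is a small offline sketch. The sample markup is an assumption that stands in for the downloaded page, and the stricter class test (matching "phrase" as a whole class token rather than as a substring) is a variation on the query above, not something the answer requires.

```php
<?php
// Hypothetical sample markup standing in for the downloaded page.
$contents = '<p><b>3 Day Weather Forecast Summary:</b>'
          . '<span class="phrase">Mostly dry.</span>'
          . '<span class="paraphrase">Not a match.</span></p>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($contents);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Pad the class list with spaces so that only the whole token "phrase" matches.
$query = '//span[contains(concat(" ", normalize-space(@class), " "), " phrase ")]';

$phrases = [];
foreach ($xpath->query($query) as $span) {
    $phrases[] = $span->textContent;
}
print_r($phrases);
```

With a plain contains(@class, "phrase") the second span would match as well, because "paraphrase" contains "phrase".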

How to get links with mp3 as extension

I have this code which extracts all links from a website. How do I edit it so that it only extracts links that end in .mp3?
Here is the code:
preg_match_all("/\<a.+?href=(\"|')(?!javascript:|#)(.+?)(\"|')/i", $html, $matches);

Update:
A nice solution would be to use DOM together with XPath, as @zerkms mentioned in the comments:
$doc = new DOMDocument();
$doc->loadHTML($yourHtml);
$xpath = new DOMXPath($doc);
// DOMXPath implements XPath 1.0, which has no ends-with(), so compare the last four characters of href instead
$links = $xpath->query('//a[substring(@href, string-length(@href) - 3) = ".mp3"]/@href');
Original Answer:
I would use DOM for this:
$doc = new DOMDocument();
$doc->loadHTML($yourHtml);
$links = array();
foreach($doc->getElementsByTagName('a') as $elem) {
if($elem->hasAttribute('href')
&& preg_match('/.*\.mp3$/i', $elem->getAttribute('href'))) {
$links []= $elem->getAttribute('href');
}
}
var_dump($links);

I would prefer XPath, which is meant for querying XML/XHTML:
$DOM = new DOMDocument();
@$DOM->loadHTML($html); // use the @ to suppress warnings from invalid HTML
$XPath = new DOMXPath($DOM);
$links = array();
$link_nodes = $XPath->query('//a[contains(@href, ".mp3")]');
foreach($link_nodes as $link_node) {
$source = $link_node->getAttribute('href');
// do some extra work to make sure .mp3 is at the end of the string
$links[] = $source;
}
XPath 2.0 has an ends-with() function that could replace contains(), but PHP's DOMXPath implements only XPath 1.0. Otherwise, you might want to add an extra conditional to make sure the .mp3 is at the end of the string. It may not be necessary though.
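
As a sketch of that extra conditional done inside the query itself, XPath 1.0 can emulate ends-with() with substring() and string-length(). The sample markup below is hypothetical, and the comparison is case-sensitive, so ".MP3" would not match.

```php
<?php
// Hypothetical sample markup.
$html = '<p><a href="/files/song.mp3">Song</a>'
      . '<a href="/mp3.html">About</a>'
      . '<a href="/files/track.mp3?download=1">Track</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// substring() is 1-indexed: starting at (length - 3) yields the last 4 characters.
$query = '//a[substring(@href, string-length(@href) - 3) = ".mp3"]/@href';

$links = [];
foreach ($xpath->query($query) as $attr) {
    $links[] = $attr->value;   // each result is a DOMAttr node
}
var_dump($links);
```

Only /files/song.mp3 is selected; the other two hrefs contain ".mp3" or "mp3" somewhere but do not end with it.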

PHP nodeValue strips html tags - innerHTML alternative?

I'm using the following script for a lightweight DOM editor. However, nodeValue in my for loop is converting my html tags to plain text. What is a PHP alternative to nodeValue that would maintain my innerHTML?
$page = $_POST['page'];
$json = $_POST['json'];
$doc = new DOMDocument();
$doc->loadHTMLFile($page); // call on the instance; the static call is an error in PHP 8
$xpath = new DOMXPath($doc);
$entries = $xpath->query('//*[@class="editable"]');
$edits = json_decode($json, true);
$num_edits = count($edits);
for($i=0; $i<$num_edits; $i++)
{
$entries->item($i)->nodeValue = $edits[$i]; // nodeValue strips html tags
}
$doc->saveHTMLFile($page);

Since $edits[$i] is a string, you need to parse it into a DOM structure and replace the original content with the new structure.
Update
The code fragment below works well with HTML that is not XML-compliant (e.g. HTML 4/5).
for($i=0; $i<$num_edits; $i++)
{
$f = new DOMDocument();
$edit = mb_convert_encoding($edits[$i], 'HTML-ENTITIES', "UTF-8");
$f->loadHTML($edit);
$node = $f->documentElement->firstChild;
$entries->item($i)->nodeValue = "";
foreach($node->childNodes as $child) {
$entries->item($i)->appendChild($doc->importNode($child, true));
}
}

I haven't worked with that library in PHP before, but from my other XPath experience I think that nodeValue on anything other than a text node does strip tags. If you're unsure about what's underneath that node, then I think you'll need to recursively descend $entries->item($i)->childNodes if you need to get the markup back.
Or... you may want textContent instead of nodeValue:
http://us.php.net/manual/en/class.domnode.php#domnode.props.textcontent
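
For getting the markup back without writing the recursion by hand, one common approach is to serialize each child node with DOMDocument::saveHTML(), which accepts a node argument. A minimal sketch, assuming a hypothetical helper name innerHtml and made-up sample markup:

```php
<?php
// Hypothetical helper: concatenate the serialized children of a node,
// which approximates the browser's innerHTML property.
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<div class="editable">Hello <b>world</b></div>');
$xpath = new DOMXPath($doc);

$entry = $xpath->query('//*[@class="editable"]')->item(0);
$inner = innerHtml($entry);

echo $inner, "\n";
echo $entry->textContent, "\n";   // for comparison: the tags are gone here
```

The first echo keeps the b element, while textContent (like nodeValue) returns only the text.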

PHP Reg ex for parsing a link

I have a PHP script that parses the POST content of a form (message) and transforms any URL into a real HTML link. These are the two regular expressions I use:
$dbQueryList['sb_message'] = preg_replace("#(^|[\n ])([\w]+?://[^ \"\n\r\t<]*)#is", "\\1\\2", $dbQueryList['sb_message']);
$dbQueryList['sb_message'] = preg_replace("#(^|[\n ])((www|ftp)\.[^ \"\t\n\r<]*)#is", "\\1\\2", $dbQueryList['sb_message']);
OK, it works well, but now, in another script, I would like to do the opposite. So in my $dbQueryList['sb_message'] I could have a link like this "Google" and I would like to have just "http://google.com".
I cannot write a regex that does that. Could you help me, please?
Thanks :)

Something like this, I think:
echo preg_replace('/ Google helloworld');

It's safer to use DOMDocument instead of regex to parse HTML contents.
Try this code:
<?php
function extractAnchors($html)
{
$dom = new DOMDocument();
// loadHtml() needs mb_convert_encoding() to work well with UTF-8 encoding
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a') as $node)
{
if ($node->hasAttribute('href'))
{
$newNode = $dom->createDocumentFragment();
$newNode->appendXML($node->getAttribute('href'));
$node->parentNode->replaceChild($newNode, $node);
}
}
// get only the body tag with its contents, then trim the body tag itself to get only the original content
return mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
}
$html = 'Some text Google some text <img src="http://dontextract.it" alt="alt"> some text.';
echo extractAnchors($html);
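
A variation on the function above, offered as a sketch rather than a replacement: it swaps each anchor for a text node created with createTextNode(), so an href containing characters such as & does not have to be valid XML, and it serializes the body's children with saveHTML() instead of trimming a fixed number of characters. The function name and the sample input are assumptions, and UTF-8 handling is left out to keep the sketch short.

```php
<?php
// Hypothetical variant: replace every <a href="..."> with its URL as plain text.
function anchorsToUrls(string $html): string
{
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//a[@href]') as $a) {
        // A text node needs no XML parsing, unlike appendXML().
        $text = $dom->createTextNode($a->getAttribute('href'));
        $a->parentNode->replaceChild($text, $a);
    }

    // Serialize only what is inside <body>.
    $out = '';
    foreach ($xpath->query('//body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Assumed sample input with an anchor pointing at http://google.com.
$result = anchorsToUrls('Some text <a href="http://google.com">Google</a> some text.');
echo $result;
```

Note that the HTML parser wraps bare text in a p element, so the returned string includes that wrapper.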

Str_replace with regex

Say I have the following link:
<li class="hook">
I_have_underscores
</li>
How would I remove the underscores only in the text and not in the href? I have used str_replace, but this removes all underscores, which isn't ideal.
So basically I would be left with this output:
<li class="hook">
I have underscores
</li>
Any help is much appreciated.

You can use an HTML DOM parser to get the text within the tags, and then run your str_replace() function on the result.
Using the DOM Parser I linked, it is as simple as something like this:
$html = str_get_html(
'<li class="hook">I_have_underscores</li>');
$links = $html->find('a'); // You can use any css style selectors here
foreach($links as $l) {
$l->innertext = str_replace('_', ' ', $l->innertext);
}
echo $html;
//<li class="hook">I have underscores</li>
That's it.

It's safer to parse HTML with DOMDocument instead of regex. Try this code:
<?php
function replaceInAnchors($html)
{
$dom = new DOMDocument();
// loadHtml() needs mb_convert_encoding() to work well with UTF-8 encoding
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xpath = new DOMXPath($dom);
foreach($xpath->query('//text()[(ancestor::a)]') as $node)
{
$replaced = str_ireplace('_', ' ', $node->wholeText);
$newNode = $dom->createDocumentFragment();
$newNode->appendXML($replaced);
$node->parentNode->replaceChild($newNode, $node);
}
// get only the body tag with its contents, then trim the body tag itself to get only the original content
return mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
}
$html = '<li class="hook">
I_have_underscores
</li>';
echo replaceInAnchors($html);
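
A shorter variation, sketched under assumptions: instead of building a document fragment, it edits each text node in place through its data property, which leaves attributes such as href untouched. The function name and the sample anchor (including its href) are hypothetical, since the question's original markup is not shown.

```php
<?php
// Hypothetical variant: replace underscores only in text inside <a> elements.
function underscoresToSpacesInLinks(string $html): string
{
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    // Text nodes that have an <a> ancestor; attribute values are not text nodes.
    foreach ($xpath->query('//a//text()') as $text) {
        $text->data = str_replace('_', ' ', $text->data);
    }

    // Serialize only what is inside <body>.
    $out = '';
    foreach ($xpath->query('//body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Assumed sample input.
$result = underscoresToSpacesInLinks(
    '<li class="hook"><a href="/i_have_underscores">I_have_underscores</a></li>'
);
echo $result;
```

The href keeps its underscores because only text nodes are modified.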
