I have a function that gets the title from an HTML source (I curl it first, then pass the source to this):
function get_dom_page_title($source){
$doc = new DOMDocument('1.0', 'utf-8');
$doc->formatOutput = false;
$doc->preserveWhiteSpace = false;
$doc->strictErrorChecking = false;
@$doc->loadHTML('<?xml encoding="UTF-8">' . $source);
$title = $doc->getElementsByTagName("title")->item(0)->nodeValue;
if ($title !== ""){
return (string)$title;
}
else{
return false;
}
}
However, when I pass in a YouTube link (http://www.youtube.com/watch?v=IFeE4q4-M0o), the title returned is garbled: ‪Arsenal vs Benfica FT Highlights‬†- YouTube, or \n \u202aArsenal vs Benfica FT Highlights\u202c\u200f\n - YouTube\n.
How can I sort this?
Use PHP Simple HTML DOM Parser
Code:
include("simple_html_dom.php");
$html = file_get_html('http://www.youtube.com/watch?v=IFeE4q4-M0o');
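$title = $html->find('title', 0)->innertext;
// The /e modifier was removed in PHP 7; preg_replace_callback does the same job
echo preg_replace_callback('/&#x([0-9a-f]+);/i', function ($m) { return chr(hexdec($m[1])); }, $title);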
will output *Arsenal vs Merdosos FT Highlights, - YouTube
PHP Simple HTML DOM Parser means less code and consistent results :)
You can do the same thing with DOMDocument
$doc = new DOMDocument();
$doc->loadHTML(file_get_contents('http://www.youtube.com/watch?v=IFeE4q4-M0o'));
$t = $doc->getElementsByTagName("title")->item(0)->nodeValue;
print_r($t);
Using DOMDocument means faster DOM processing compared to Simple HTML DOM.
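Neither answer removes the invisible characters shown in the question (\u202a, \u202c, \u200f are Unicode bidirectional control characters). A minimal sketch of one way to clean them up, assuming the source is UTF-8; the function name get_clean_title and the sample string are made up for illustration:

```php
<?php
// Hypothetical helper: extract the <title> and strip bidi control characters.
function get_clean_title(string $source)
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect warnings instead of printing them
    // The encoding declaration makes libxml treat the input as UTF-8
    $doc->loadHTML('<?xml encoding="UTF-8">' . $source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementsByTagName('title')->item(0);
    if ($node === null) {
        return false; // no <title> element at all
    }

    // Remove U+200E, U+200F and U+202A..U+202E, then collapse runs of whitespace
    $title = preg_replace('/[\x{200E}\x{200F}\x{202A}-\x{202E}]/u', '', $node->nodeValue);
    $title = trim(preg_replace('/\s+/u', ' ', $title));

    return $title !== '' ? $title : false;
}

// Sample input modelled on the escaped title from the question
$sample = "<html><head><title>\n \u{202A}Arsenal vs Benfica FT Highlights\u{202C}\u{200F}\n - YouTube\n</title></head><body></body></html>";
echo get_clean_title($sample), PHP_EOL;
```

Checking item(0) for null also avoids a fatal error on pages that have no title element.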
Related
I want to get the HTML content in this page using file_get_contents as string :
https://www.emitennews.com/search/
Then I want to unminify the html code.
This is what I have done so far to unminify it:
$html = file_get_contents("https://www.emitennews.com/search/");
$dom = new \DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html,LIBXML_HTML_NOIMPLIED);
$dom->formatOutput = true;
print $dom->saveXML($dom->documentElement);
But with the code above I get this error:
DOMDocument::loadHTML(): Tag header invalid in Entity, line: 1
What is the proper way to do it ?
You must prepend an XML encoding declaration to the HTML before loading it:
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
This is the correct code :
$html = file_get_contents("https://www.emitennews.com/search/");
$dom = new \DOMDocument();
libxml_use_internal_errors(true);
$dom->preserveWhiteSpace = false;
$dom->loadHTML('<?xml encoding="UTF-8">' . $html,LIBXML_HTML_NOIMPLIED);
$dom->formatOutput = true;
print $dom->saveXML($dom->documentElement);
The problem is that the site uses HTML5 elements, which libxml's HTML parser does not recognize. So we need to silence its errors with:
libxml_use_internal_errors(true);
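With internal error handling switched on, the parser messages are not lost; they can be read back with libxml_get_errors(). A small sketch of that pattern, using an inline HTML5 snippet as a stand-in for the fetched page (whether any message is reported depends on the libxml2 version installed):

```php
<?php
// Stand-in for the downloaded page; <header> is an HTML5 element
$html = '<!DOCTYPE html><html><body><header><h1>Hi</h1></header></body></html>';

$previous = libxml_use_internal_errors(true); // buffer errors instead of emitting warnings

$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);

// Each entry is a LibXMLError object with message, line, column, level, ...
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), ' (line ', $error->line, ')', PHP_EOL;
}

libxml_clear_errors();                  // empty the buffer
libxml_use_internal_errors($previous);  // restore the earlier setting

$dom->formatOutput = true;
echo $dom->saveHTML();
```

The unknown element is still added to the tree, so the document can be formatted and saved as usual.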
All I want to do is save the first div with attribute role="main" as a string from an external URL using PHP.
So far I have this:
$doc = new DOMDocument();
@$doc->loadHTMLFile("http://example.com/");
$xpath = new DOMXPath($doc);
$elements = $xpath->query('//div[@role="main"]');
$str = "";
if ($elements->length > 0) {
$str = $elements->item(0)->textContent;
}
echo htmlentities($str);
But unfortunately the $str does not seem to be displaying the HTML tags. Just the text.
You can get the HTML via the saveHTML() method.
$str = $doc->saveHTML($elements->item(0));
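Passing a node to saveHTML() returns that element's markup including its own tag. If only the inner markup is wanted, the children can be serialized one by one. A short sketch with an inline document standing in for the external page:

```php
<?php
// Inline stand-in for the external page
$html = '<html><body><div role="main"><p>Hello <b>world</b></p></div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$elements = $xpath->query('//div[@role="main"]');

$outer = '';
$inner = '';
if ($elements->length > 0) {
    $main = $elements->item(0);

    // Markup of the element itself, tags included
    $outer = $doc->saveHTML($main);

    // Markup of the children only (an "innerHTML" equivalent)
    foreach ($main->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}

echo htmlentities($outer), PHP_EOL;
echo htmlentities($inner), PHP_EOL;
```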
I have this code which extracts all links from a website. How do I edit it so that it only extracts links that ends on .mp3?
Here are the following code:
preg_match_all("/\<a.+?href=(\"|')(?!javascript:|#)(.+?)(\"|')/i", $html, $matches);
Update:
A nice solution would be to use DOM together with XPath, as @zerkms mentioned in the comments:
$doc = new DOMDocument();
$doc->loadHTML($yourHtml);
$xpath = new DOMXPath($doc);
// ends-with() is XPath 2.0 and is not available in PHP's DOMXPath (XPath 1.0),
// so emulate it with substring() and string-length()
$links = $xpath->query('//a[substring(@href, string-length(@href) - 3) = ".mp3"]/@href');
Original Answer:
I would use DOM for this:
$doc = new DOMDocument();
$doc->loadHTML($yourHtml);
$links = array();
foreach($doc->getElementsByTagName('a') as $elem) {
if($elem->hasAttribute('href')
&& preg_match('/.*\.mp3$/i', $elem->getAttribute('href'))) {
$links []= $elem->getAttribute('href');
}
}
var_dump($links);
I would prefer XPath, which is designed for querying XML/XHTML:
$DOM = new DOMDocument();
@$DOM->loadHTML($html); // use the @ to suppress warnings from invalid HTML
$XPath = new DOMXPath($DOM);
$links = array();
$link_nodes = $XPath->query('//a[contains(@href, ".mp3")]');
foreach($link_nodes as $link_node) {
$source = $link_node->getAttribute('href');
// do some extra work to make sure .mp3 is at the end of the string
$links[] = $source;
}
XPath 2.0 has an ends-with() function that can replace contains(). PHP's DOMXPath is built on libxml2, which implements only XPath 1.0, so there you need an extra condition (in the query or in PHP) to make sure the .mp3 is at the end of the string. It may not be necessary, though.
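One way to do that extra check on the PHP side is to look at the extension of the URL path, which also copes with query strings and upper-case extensions. This is only a sketch; the sample links are invented for the demonstration:

```php
<?php
// Made-up sample markup
$html = '<html><body>
<a href="song.mp3">plain</a>
<a href="http://example.com/a/TRACK.MP3?dl=1">upper case with query string</a>
<a href="page.mp3.html">not an mp3</a>
<a href="#top">fragment only</a>
</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$links = [];
foreach ($xpath->query('//a[@href]') as $a) {
    $href = $a->getAttribute('href');

    // Look only at the path component, ignoring ?query and #fragment
    $path = parse_url($href, PHP_URL_PATH);

    if (is_string($path)
        && strtolower(pathinfo($path, PATHINFO_EXTENSION)) === 'mp3') {
        $links[] = $href;
    }
}

print_r($links);
```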
I'm using the following script for a lightweight DOM editor. However, nodeValue in my for loop is converting my html tags to plain text. What is a PHP alternative to nodeValue that would maintain my innerHTML?
$page = $_POST['page'];
$json = $_POST['json'];
$doc = new DOMDocument();
$doc->loadHTMLFile($page); // call on the instance; the static call fails on PHP 8
$xpath = new DOMXPath($doc);
$entries = $xpath->query('//*[@class="editable"]');
$edits = json_decode($json, true);
$num_edits = count($edits);
for($i=0; $i<$num_edits; $i++)
{
$entries->item($i)->nodeValue = $edits[$i]; // nodeValue strips html tags
}
$doc->saveHTMLFile($page);
Since $edits[$i] is a string, you need to parse it into a DOM structure and replace the original content with the new structure.
Update
The code fragment below does an incredible job when using non-XML compliant HTML. (e.g. HTML 4/5)
for($i=0; $i<$num_edits; $i++)
{
$f = new DOMDocument();
$edit = mb_convert_encoding($edits[$i], 'HTML-ENTITIES', "UTF-8");
$f->loadHTML($edit);
$node = $f->documentElement->firstChild;
$entries->item($i)->nodeValue = "";
foreach($node->childNodes as $child) {
$entries->item($i)->appendChild($doc->importNode($child, true));
}
}
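The same idea can be wrapped in a small reusable function. The sketch below is one possible variant, not the answer's exact code: the name set_inner_html is hypothetical, the fragment is wrapped in a div so its children are easy to locate, and an XML encoding declaration stands in for mb_convert_encoding().

```php
<?php
// Hypothetical helper: replace the children of $target with parsed HTML.
function set_inner_html(DOMElement $target, string $html): void
{
    $tmp = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment so its nodes can be found under a known element
    $tmp->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $wrapper = $tmp->getElementsByTagName('div')->item(0);

    // Remove the existing children
    while ($target->firstChild) {
        $target->removeChild($target->firstChild);
    }

    // Copy the parsed nodes into the target's own document
    foreach ($wrapper->childNodes as $child) {
        $target->appendChild($target->ownerDocument->importNode($child, true));
    }
}

$doc = new DOMDocument();
$doc->loadHTML('<html><body><p class="editable">old</p></body></html>');
$xpath = new DOMXPath($doc);
$entry = $xpath->query('//*[@class="editable"]')->item(0);

set_inner_html($entry, 'new <em>text</em>');
$result = $doc->saveHTML($entry);
echo $result, PHP_EOL;
```

importNode() is needed because nodes belong to the document that created them and cannot be appended to another document directly.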
I haven't worked with that library in PHP before, but from my other XPath experience I think that nodeValue on anything other than a text node does strip tags. If you're unsure about what's underneath that node, then I think you'll need to recursively descend $entries->item($i)->childNodes if you need to get the markup back.
Or... you may want textContent instead of nodeValue:
http://us.php.net/manual/en/class.domnode.php#domnode.props.textcontent
I am looking for something equivalent to this:
$e= xmlwriter_open_uri("test.xml");
....
print htmlentities(xmlwriter_output_memory($e));
This print lets me display the contents of the XML in a table.
But with my SimpleXML code (combined with $dom for formatting) I have no idea how to display it. It writes the output I want into the XML file, but how do I display the XML below? Is there something similar to print?
The purpose is to display the values of the XML in a table.
$dom = new DOMDocument('1.0');
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$xml = new SimpleXMLElement('<test></test>');
$one= $xml->addChild('enemy', 'yes');
$two= $xml->addChild('friend', 'maybe');
$dom->loadXML($xml->asXML());
$dom->save('test.xml');
Regards
You don't need to stringify (technical term!) the SimpleXMLElement to load it into a DOMDocument, in fact that's a terrible idea (though, you're forgiven).
$xml = new SimpleXMLElement('<test></test>');
$one= $xml->addChild('enemy', 'yes');
$two= $xml->addChild('friend', 'maybe');
// Get the DOMDocument associated with this XML
$dom = dom_import_simplexml($xml)->ownerDocument;
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
echo $dom->saveXML(); // or echo htmlentities($dom->saveXML()) if you really must
More info about retrieving a DOMElement (and its DOMDocument) from a SimpleXMLElement can be found in the docs for dom_import_simplexml().
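If the goal is an HTML table of the values rather than the raw XML, the SimpleXMLElement can be walked directly. A minimal sketch under that assumption; the column headings are invented:

```php
<?php
$xml = new SimpleXMLElement('<test></test>');
$xml->addChild('enemy', 'yes');
$xml->addChild('friend', 'maybe');

// One table row per child element: element name and text value
$rows = '';
foreach ($xml->children() as $child) {
    $rows .= '<tr><td>' . htmlspecialchars($child->getName()) . '</td>'
           . '<td>' . htmlspecialchars((string) $child) . '</td></tr>' . PHP_EOL;
}

$table = "<table>\n<tr><th>Element</th><th>Value</th></tr>\n" . $rows . "</table>\n";
echo $table;
```

htmlspecialchars() keeps any markup characters in the values from breaking the table.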