Getting title of youtube videos using php domdocument


I have a function that extracts the title from an HTML source (I fetch the page with cURL first, then pass the source to this function):
    function get_dom_page_title($source){
        $doc = new DOMDocument('1.0', 'utf-8');
        $doc->formatOutput = false;
        $doc->preserveWhiteSpace = false;
        $doc->strictErrorChecking = false;
        // @ suppresses warnings from invalid HTML
        @$doc->loadHTML('<?xml encoding="UTF-8">' . $source);
        $title = $doc->getElementsByTagName("title")->item(0)->nodeValue;
        if ($title !== ""){
            return (string)$title;
        }
        else{
            return false;
        }
    }
However, when I pass in a YouTube link such as http://www.youtube.com/watch?v=IFeE4q4-M0o, the title returned is garbled: â€ªArsenal vs Benfica FT Highlightsâ€¬â€ - YouTube, or \n \u202aArsenal vs Benfica FT Highlights\u202c\u200f\n - YouTube\n.
How can I sort this?

Use PHP Simple HTML DOM Parser.
Code:
    include("simple_html_dom.php");
    $html = file_get_html('http://www.youtube.com/watch?v=IFeE4q4-M0o');
    // find('title', 0) returns the first <title> element rather than an array
    $title = $html->find('title', 0)->innertext;
    // Decode hexadecimal entities; the /e modifier was removed in PHP 7, so use a callback
    echo preg_replace_callback('/&#x([0-9a-f]+);/i', function ($m) {
        return chr(hexdec($m[1]));
    }, $title);
This will output *Arsenal vs Merdosos FT Highlights,‏ - YouTube
PHP Simple HTML DOM Parser means less code and consistent results :)

You can do the same thing with DOMDocument:
    $doc = new DOMDocument();
    $doc->loadHTML(file_get_contents('http://www.youtube.com/watch?v=IFeE4q4-M0o'));
    $t = $doc->getElementsByTagName("title")->item(0)->nodeValue;
    print_r($t);
Using DOMDocument means faster DOM processing compared to Simple HTML DOM.
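
The stray characters in the question (\u202a, \u202c, \u200f) are Unicode directional formatting marks that surround the title text. A minimal sketch of one way to deal with them, assuming the page is UTF-8: hint the encoding to libxml, then strip those marks and collapse the whitespace. The function name extract_clean_title and the sample markup are hypothetical and only there for illustration.

```php
<?php
// Hypothetical helper: returns the cleaned <title> text, or null if none is found.
function extract_clean_title(string $source): ?string
{
    $doc = new DOMDocument();
    // Collect parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix hints libxml to read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementsByTagName('title')->item(0);
    if ($node === null) {
        return null;
    }
    // Remove directional marks: U+200E, U+200F and U+202A through U+202E.
    $title = preg_replace('/[\x{200E}\x{200F}\x{202A}-\x{202E}]/u', '', $node->nodeValue);
    // Collapse runs of whitespace (including newlines) and trim the ends.
    $title = trim(preg_replace('/\s+/u', ' ', $title));
    return $title === '' ? null : $title;
}

// Sample markup imitating the title reported in the question (an assumption).
$sample = "<html><head><title>\n \u{202A}Arsenal vs Benfica FT Highlights\u{202C}\u{200F}\n - YouTube\n</title></head><body></body></html>";
echo extract_clean_title($sample), PHP_EOL;
```

Returning null instead of false for a missing title is a design choice of this sketch, not something the original function does.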

Related

How to get HTML from file_get_content PHP then unminify it

I want to get the HTML content of this page as a string using file_get_contents:
https://www.emitennews.com/search/
Then I want to unminify the HTML code.
This is what I have done so far to unminify it:
    $html = file_get_contents("https://www.emitennews.com/search/");
    $dom = new \DOMDocument();
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);
    $dom->formatOutput = true;
    print $dom->saveXML($dom->documentElement);
But the code above gives me this error:
DOMDocument::loadHTML(): Tag header invalid in Entity, line: 1
What is the proper way to do it?

You must add the XML declaration as the first line:
    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

This is the correct code:
    $html = file_get_contents("https://www.emitennews.com/search/");
    $dom = new \DOMDocument();
    libxml_use_internal_errors(true);
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html, LIBXML_HTML_NOIMPLIED);
    $dom->formatOutput = true;
    print $dom->saveXML($dom->documentElement);
The problem is that the site uses HTML5, and libxml's HTML parser reports HTML5 tags such as header as invalid. To keep those warnings from being emitted, we need to add:
    libxml_use_internal_errors(true);
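
With internal errors enabled, the warnings are not lost; they can still be inspected with libxml_get_errors(). The following is a rough sketch on an inline sample string (the markup is an assumption, not the real page), and it omits LIBXML_HTML_NOIMPLIED for simplicity. Whether any warnings are recorded for a given tag depends on the installed libxml2 version.

```php
<?php
// Sample HTML5 markup standing in for the fetched page (an assumption).
$html = '<!DOCTYPE html><html><body><header><h1>News</h1></header><main><p>Hello</p></main></body></html>';

// Collect libxml warnings internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

// Keep a copy of whatever the parser complained about, then reset the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$dom->formatOutput = true;
$pretty = $dom->saveXML($dom->documentElement);
echo $pretty, PHP_EOL;

// Report collected parser messages, if any, on standard error.
foreach ($errors as $error) {
    fwrite(STDERR, sprintf("line %d: %s\n", $error->line, trim($error->message)));
}
```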

Parse a HTML document and get a specific element in PHP and save its HTML

All I want to do is save the first div with the attribute role="main" from an external URL as a string using PHP.
So far I have this:
    $doc = new DOMDocument();
    @$doc->loadHTMLFile("http://example.com/");
    $xpath = new DOMXPath($doc);
    $elements = $xpath->query('//div[@role="main"]');
    $str = "";
    if ($elements->length > 0) {
        $str = $elements->item(0)->textContent;
    }
    echo htmlentities($str);
But unfortunately $str does not contain the HTML tags, just the text.

You can get the HTML via the saveHTML() method:
    $str = $doc->saveHTML($elements->item(0));
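
Put together, the approach might look like the sketch below. It parses an inline sample string rather than a remote URL so that it runs on its own; the markup is made up for illustration.

```php
<?php
// Made-up sample document; in the question this would come from an external URL.
$html = '<html><body><div role="main"><p>Hello <b>world</b></p></div><div>other</div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// Select every div whose role attribute equals "main".
$elements = $xpath->query('//div[@role="main"]');

$str = '';
if ($elements->length > 0) {
    // Passing a node to saveHTML() serializes that node with its markup intact.
    $str = $doc->saveHTML($elements->item(0));
}

echo htmlentities($str), PHP_EOL;
```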

How to get links with mp3 as extension

I have this code, which extracts all links from a website. How do I edit it so that it only extracts links that end in .mp3?
Here is the code:
    preg_match_all("/\<a.+?href=(\"|')(?!javascript:|#)(.+?)(\"|')/i", $html, $matches);

Update:
A nice solution would be to use DOM together with XPath, as @zerkms mentioned in the comments:
    $doc = new DOMDocument();
    $doc->loadHTML($yourHtml);
    $xpath = new DOMXPath($doc);
    // ends-with() is XPath 2.0 only and PHP's DOMXPath implements XPath 1.0,
    // so compare the last four characters of href using substring() instead
    $links = $xpath->query('//a[substring(@href, string-length(@href) - 3) = ".mp3"]/@href');
Original Answer:
I would use DOM for this:
    $doc = new DOMDocument();
    $doc->loadHTML($yourHtml);
    $links = array();
    foreach($doc->getElementsByTagName('a') as $elem) {
        if($elem->hasAttribute('href')
          && preg_match('/.*\.mp3$/i', $elem->getAttribute('href'))) {
            $links []= $elem->getAttribute('href');
        }
    }
    var_dump($links);

I would prefer XPath, which is meant to parse XML/XHTML:
    $DOM = new DOMDocument();
    @$DOM->loadHTML($html); // use the @ to suppress warnings from invalid HTML
    $XPath = new DOMXPath($DOM);
    $links = array();
    $link_nodes = $XPath->query('//a[contains(@href, ".mp3")]');
    foreach($link_nodes as $link_node) {
        $source = $link_node->getAttribute('href');
        // do some extra work to make sure .mp3 is at the end of the string
        $links[] = $source;
    }
There is an ends-with() XPath function that could replace contains(), but it requires XPath 2.0, and PHP's DOMXPath only supports XPath 1.0. So you might want to add an extra conditional to make sure the .mp3 is at the end of the string. It may not be necessary though.
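
Under XPath 1.0, an ends-with check can be emulated with substring() and string-length(). The sketch below runs that expression against a small made-up document; the sample links are assumptions chosen to show which ones match.

```php
<?php
// Made-up markup: only the first link actually ends in ".mp3".
$html = '<html><body>'
      . '<a href="/a/song.mp3">1</a>'
      . '<a href="/b/file.mp3?x=1">2</a>'
      . '<a href="page.html">3</a>'
      . '<a>4</a>'
      . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// substring() is 1-based, so starting at (length - 3) yields the last four characters.
$expr = '//a[substring(@href, string-length(@href) - 3) = ".mp3"]/@href';

$links = [];
foreach ($xpath->query($expr) as $attr) {
    // Each result is an attribute node; nodeValue holds the href string.
    $links[] = $attr->nodeValue;
}

print_r($links);
```

Note that this comparison is case-sensitive, unlike the /i regular expression in the original answer, so a link ending in .MP3 would not match.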

PHP nodeValue strips html tags - innerHTML alternative?

I'm using the following script for a lightweight DOM editor. However, nodeValue in my for loop is converting my HTML tags to plain text. What is a PHP alternative to nodeValue that would preserve my innerHTML?
    $page = $_POST['page'];
    $json = $_POST['json'];
    $doc = new DOMDocument();
    // loadHTMLFile() is an instance method; calling it statically fails in PHP 8
    $doc->loadHTMLFile($page);
    $xpath = new DOMXPath($doc);
    $entries = $xpath->query('//*[@class="editable"]');
    $edits = json_decode($json, true);
    $num_edits = count($edits);
    for($i=0; $i<$num_edits; $i++)
    {
        $entries->item($i)->nodeValue = $edits[$i]; // nodeValue strips html tags
    }
    $doc->saveHTMLFile($page);

Since $edits[$i] is a string, you need to parse it into a DOM structure and replace the original content with the new structure.
Update
The code fragment below does an incredible job when using non-XML-compliant HTML (e.g. HTML 4/5):
    for($i=0; $i<$num_edits; $i++)
    {
        $f = new DOMDocument();
        // Encode non-ASCII characters as entities (the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
        $edit = mb_convert_encoding($edits[$i], 'HTML-ENTITIES', "UTF-8");
        $f->loadHTML($edit);
        // loadHTML() wraps the fragment in <html><body>, so this is the <body> element
        $node = $f->documentElement->firstChild;
        // Clear the old content, then import each parsed child into the target document
        $entries->item($i)->nodeValue = "";
        foreach($node->childNodes as $child) {
            $entries->item($i)->appendChild($doc->importNode($child, true));
        }
    }

I haven't worked with that library in PHP before, but from my other XPath experience I think that nodeValue on anything other than a text node does strip tags. If you're unsure about what's underneath that node, then I think you'll need to recursively descend $entries->item($i)->childNodes if you need to get the markup back.
Or... you may want textContent instead of nodeValue (although it also returns only the text, without the tags):
http://us.php.net/manual/en/class.domnode.php#domnode.props.textcontent
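
For reading markup back out of a node, one common approach is to serialize each child node with saveHTML() and concatenate the results, which avoids writing the recursion by hand. A minimal sketch, where inner_html is a hypothetical helper name and the markup is a made-up sample:

```php
<?php
// Hypothetical helper: concatenates the serialized children of a node.
function inner_html(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes that node and its descendants.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<div class="editable">Hello <em>there</em></div>');

$xpath = new DOMXPath($doc);
$node = $xpath->query('//*[@class="editable"]')->item(0);

echo $node->textContent, PHP_EOL;  // text only, tags are dropped
echo inner_html($node), PHP_EOL;   // markup of the children is kept
```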

displaying Simple XML php

I am looking for something equivalent to this:
    $e= xmlwriter_open_uri("test.xml");
    ....
    print htmlentities(xmlwriter_output_memory($e));
This print lets me display what is in the XML, so I can show it in a table.
But with my SimpleXML version (combined with $dom for formatting) I have no idea how to display it. The code below writes the output I want into the XML file, but how do I display that XML? Is there something similar to print?
The purpose is to display the values of the XML in a table.
    $dom = new DOMDocument('1.0');
    $dom->preserveWhiteSpace = false;
    $dom->formatOutput = true;
    $xml = new SimpleXMLElement('<test></test>');
    $one= $xml->addChild('enemy', 'yes');
    $two= $xml->addChild('friend', 'maybe');
    $dom->loadXML($xml->asXML());
    $dom->save('test.xml');
Regards

You don't need to stringify (technical term!) the SimpleXMLElement to load it into a DOMDocument; in fact, that's a terrible idea (though you're forgiven).
    $xml = new SimpleXMLElement('<test></test>');
    $one= $xml->addChild('enemy', 'yes');
    $two= $xml->addChild('friend', 'maybe');
    // Get the DOMDocument associated with this XML
    $dom = dom_import_simplexml($xml)->ownerDocument;
    $dom->preserveWhiteSpace = false;
    $dom->formatOutput = true;
    echo $dom->saveXML(); // or echo htmlentities($dom->saveXML()) if you really must
More info about retrieving a DOMElement (and its DOMDocument) from a SimpleXMLElement can be found in the docs for dom_import_simplexml().
