Can't decode html entities in title - php

I am having trouble decoding entities in the title of this YouTube video:
http://www.youtube.com/watch?v=p7NMsywVQhY
Here is my code:
$url = 'http://www.youtube.com/watch?v=p7NMsywVQhY';
$html = @file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;
//decode the '‪' in the title
$title = html_entity_decode($title,ENT_QUOTES,'UTF-8'); //does not seem to have any effect
//decode the utf data
$title = utf8_decode($title);
$title comes back fine, except that question marks appear where the entity originally is in the title.
Thanks.

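I don't know whether PHP provides a built-in function for that, but you can use preg_replace_callback like this (the /e modifier of preg_replace was removed in PHP 7, and chr() cannot produce code points above 255):
$string = preg_replace_callback('/&#x([0-9a-f]+);/i', fn($m) => mb_chr(hexdec($m[1]), 'UTF-8'), $string); // needs PHP 7.4+ and mbstring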
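For what it's worth, html_entity_decode() does accept hexadecimal numeric entities, so a regex is not strictly required. A minimal sketch, assuming the entity in the title is a reference such as &#x202a; (U+202A, an invisible directional control character) and using a made-up sample title rather than the real page:

```php
<?php
// Hypothetical sample title; the real page title is not reproduced here.
$raw = '&#x202a;Some video title';

// html_entity_decode() understands hexadecimal numeric entities as well as named ones.
$decoded = html_entity_decode($raw, ENT_QUOTES, 'UTF-8');

// U+202A is invisible, so it can simply be dropped. The /u flag makes PCRE
// treat both the pattern and the subject as UTF-8.
$clean = preg_replace('/\x{202A}/u', '', $decoded);

echo $clean, "\n";
```

The question marks in the question most likely come from utf8_decode(): it converts to ISO-8859-1, which has no equivalent for U+202A, so the character is replaced with ?.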

Try this to force correct detection of the charset:
$doc = new DOMDocument();
@$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;
echo $title;
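A self-contained way to try this without fetching the page, assuming a made-up HTML snippet in place of the YouTube response. libxml_use_internal_errors() is used here instead of @ to keep parser warnings quiet; that is a substitution for illustration, not part of the answer above:

```php
<?php
// Hypothetical markup standing in for the downloaded page.
$html = '<html><head><title>Caf&#xe9; &#x202a;video</title></head><body></body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);   // encoding hint for the HTML parser
libxml_clear_errors();

// DOM hands back node values as UTF-8 with the entities already decoded.
$title = $doc->getElementsByTagName('title')->item(0)->nodeValue;
echo $title, "\n";
```

Because the parser has already decoded the entities, a later html_entity_decode() call has nothing left to do, which would explain why it seemed to have no effect in the question.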

Related

simple_html_dom scrape all lines with characteristic and then output them below

This is how far I have gotten with scraping using htmldom (going by the examples):
<?php
require 'simple_html_dom.php';
$html = file_get_html('https://nitter.absturztau.be/chillartaholic');
$title = $html->find('title', 0);
$image = $html->find('img', 0);
echo $title->plaintext."<br>\n";
echo $image->src;
?>
However instead of retrieving a title and image,
I'd like to instead get all lines in the target page that begin with:
<a class="tweet-link"
and display the lines scraped - in their entirety - top to bottom below.
The first scraped line would then be:
> <a class="tweet-link"
> href="/ChillArtaholic/status/1413973360841744390#m"></a>
Is this possible with htmldom, or are there limits on the number of lines that can be scraped?
Strangely enough, the answer from yesterday is gone.
This is the approach from it that works
(although that answer also covered several other approaches):
<?php
$url = 'https://nitter.absturztau.be/chillartaholic';
$html = file_get_contents($url);
$dom = new DOMDocument();
// @ suppresses the warnings libxml raises on real-world HTML
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// select every <a> whose class attribute is exactly "tweet-link"
$nodes = $xpath->query('//a[@class="tweet-link"]');
foreach ($nodes as $node) {
    echo $node->getAttribute('href'), '<br>';
}
?>
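To print each matching element in its entirety rather than just its href, DOMDocument::saveHTML() can serialize a single node. A sketch against a made-up fragment shaped like the markup in the question (the second link and its extra class are invented for illustration); the contains() test is one common way to match an element that carries several classes:

```php
<?php
// Hypothetical fragment; only the first link comes from the question.
$html = <<<'HTML'
<html><body>
<div><a class="tweet-link" href="/ChillArtaholic/status/1413973360841744390#m"></a></div>
<div><a class="tweet-link extra" href="/example/status/2#m"></a></div>
<a class="other" href="/ignored"></a>
</body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Match "tweet-link" as a whole word inside the class attribute.
$query = '//a[contains(concat(" ", normalize-space(@class), " "), " tweet-link ")]';

$hrefs = [];
foreach ($xpath->query($query) as $node) {
    $hrefs[] = $node->getAttribute('href');
    // saveHTML($node) returns the element's own markup, tags included;
    // htmlspecialchars() makes that markup visible in a browser.
    echo htmlspecialchars($dom->saveHTML($node)), "<br>\n";
}
```

DOMXPath::query() returns every match, so the number of links on the page should not be a limiting factor by itself.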

PHP echo returns error if using saveXML()?

I am using PHP to generate an XML document and display it in my browser. I have the following code:
<?php
header("content-type:application/xml; charset=ISO-8859-15");
$doc = new DOMDocument('1.0');
// we want a nice output
$doc->formatOutput = true;
$doc->preserveWhiteSpace = false;
$root = $doc->createElement('book');
$root = $doc->appendChild($root);
$title = $doc->createElement('title');
$title = $root->appendChild($title);
$text = $doc->createTextNode('This is the title');
$text = $title->appendChild($text);
//echo "Saving all the document:\n";
//echo $doc->saveXML()."\n";
echo "Saving only the title part:\n";
echo $doc->saveXML($title);
?>
If I comment out echo "Saving only the title part:\n";, the XML is displayed in my browser without any problem. But if I add that echo before echo $doc->saveXML($title);, I get the following error in my browser:
This page contains the following errors:
error on line 1 at column 1: Document is empty
Below is a rendering of the page up to the first error.
Does anybody know why? Is there a way to display a string before printing out the XML in my browser?
This is because you have set your document's content type to application/xml, so the browser expects the entire response to be well-formed XML. A plain-text line printed before the markup breaks that.
To get the above code to work, change the content type to text/plain.
The following code works (tested on my machine):
header("content-type:text/plain; charset=ISO-8859-15");
$doc = new DOMDocument('1.0');
$doc->formatOutput = true;
$doc->preserveWhiteSpace = false;
$root = $doc->createElement('book');
$root = $doc->appendChild($root);
$title = $doc->createElement('title');
$title = $root->appendChild($title);
echo "Saving only the title part:\n";
echo $doc->saveXML($title);
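If the response has to stay application/xml, one option is to keep the message inside the document, for instance as an XML comment placed before the root element. This is a sketch of that alternative, not the approach from the answer above; the comment will generally only be visible in the page source or in an XML tree view:

```php
<?php
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->formatOutput = true;

$root  = $doc->appendChild($doc->createElement('book'));
$title = $root->appendChild($doc->createElement('title'));
$title->appendChild($doc->createTextNode('This is the title'));

// Carry the note as a comment so the output remains well-formed XML.
$doc->insertBefore($doc->createComment(' Saving only the title part '), $root);

$xml = $doc->saveXML();

// When serving over HTTP, send the header before any output:
// header('Content-Type: application/xml; charset=UTF-8');
echo $xml;
```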

Get element from given URL with DOM in PHP

I am building an MTDB database with PHP and need to scrape a specific tag from a URL.
Tag to get from the URL:
vars.disqus = '';
vars.lists = [];
vars.titleId = '35079';
vars.trailersPlayer = 'default';
vars.userId = '907791';
vars.title = {"id":35079,"title":"Family Vacation","trailer":35097.flv,"timing":0.50sec}
I need the
"id":35079,"title":"Family Vacation","trailer":35097.flv,"timing":0.50sec
My code:
$html = 'myurl';
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->loadHTMLFile($html);
libxml_clear_errors();
$xp = new DOMXpath($dom);
$nodes = $xp->query('//script[@\'id','trailer','title');
echo $nodes->item(0)->nodeValue;
The "tag" is not HTML; it looks like JavaScript code.
To extract the string, simply use a regex:
preg_match('/title\s*=\s*\{([^}]+)}/', $str, $matches);
var_dump($matches[1]);
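Putting the two pieces together, DOM can collect the script text and the regex can pull out the object body. A sketch using a made-up page whose script content is copied from the question; the stricter vars\.title prefix is an assumption about how the real page names the variable:

```php
<?php
// Hypothetical page; the two script lines are taken from the question.
$html = <<<'HTML'
<html><head><script>
vars.titleId = '35079';
vars.title = {"id":35079,"title":"Family Vacation","trailer":35097.flv,"timing":0.50sec}
</script></head><body></body></html>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$found = null;
foreach ($dom->getElementsByTagName('script') as $script) {
    // Capture everything between the braces of the vars.title assignment.
    if (preg_match('/vars\.title\s*=\s*\{([^}]+)\}/', $script->textContent, $m)) {
        $found = $m[1];
        break;
    }
}
var_dump($found);
```

As written in the question the object is not valid JSON (35097.flv and 0.50sec are unquoted), so json_decode() would reject it; the real page may differ.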

Parse a HTML document and get a specific element in PHP and save its HTML

All I want to do is save the first div with attribute role="main" as a string from an external URL using PHP.
So far I have this:
$doc = new DOMDocument();
@$doc->loadHTMLFile("http://example.com/");
$xpath = new DOMXPath($doc);
$elements = $xpath->query('//div[@role="main"]');
$str = "";
if ($elements->length > 0) {
$str = $elements->item(0)->textContent;
}
echo htmlentities($str);
But unfortunately $str does not contain the HTML tags, just the text.
You can get the HTML via the saveHTML() method.
$str = $doc->saveHTML($elements->item(0));
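The difference between the two is easy to see on a small in-memory document. A sketch with a made-up snippet in place of the external URL:

```php
<?php
// Hypothetical markup standing in for the remote page.
$html = '<html><body><div role="main"><p>Hello <b>world</b></p></div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$elements = $xpath->query('//div[@role="main"]');

$text = '';
$markup = '';
if ($elements->length > 0) {
    $node   = $elements->item(0);
    $text   = $node->textContent;       // concatenated text, no tags
    $markup = $doc->saveHTML($node);    // the element serialized with its tags
}

echo htmlentities($markup), "\n";
```

Passing the node to saveHTML() serializes only that element and its descendants; calling it without an argument dumps the whole document.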

How to decode HTML entities when saving an XML file?

I have the following code in my PHP script:
$str = '<item></item>';
$xml = new DOMDocument('1.0', 'UTF-8');
$xml->formatOutput = true;
$xml->load('file.xml');
$items = $xml->getElementsByTagName('items')->item(0);
$items->nodeValue = $str;
$xml->save('file.xml');
In the saved file.xml I see the following:
&lt;item&gt;&lt;/item&gt;
How can I save it in the XML file without encoding HTML entities?
Use a DOMDocumentFragment:
<?php
$doc = new DOMDocument();
$doc->load('file.xml'); // '<doc><items/></doc>'
$item = $doc->createDocumentFragment();
$item->appendXML('<item id="1">item</item>');
$doc->getElementsByTagName('items')->item(0)->appendChild($item);
$doc->save('file.xml');
If you're appending to the root element, use $doc->documentElement->appendChild($item); instead of getElementsByTagName.
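The same contrast can be reproduced without touching the file system. A sketch that loads a small in-memory document (an assumption standing in for file.xml) and compares assigning to nodeValue with appending a fragment:

```php
<?php
// In-memory stand-in for file.xml.
$doc = new DOMDocument();
$doc->loadXML('<doc><items/></doc>');
$items = $doc->getElementsByTagName('items')->item(0);

// Assigning a string to nodeValue stores it as text, so the markup is escaped.
$items->nodeValue = '<item></item>';
$escaped = $doc->saveXML($doc->documentElement);

// Reset the element, then append parsed XML through a fragment instead.
$items->nodeValue = '';
$fragment = $doc->createDocumentFragment();
$fragment->appendXML('<item id="1">item</item>');
$items->appendChild($fragment);
$parsed = $doc->saveXML($doc->documentElement);

echo $escaped, "\n", $parsed, "\n";
```

appendXML() parses its argument, so the string must be well-formed XML; it returns false otherwise.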
