PHP problem with Russian language - php

I fetch a UTF-8 page containing Russian text using cURL. If I echo the text, it displays correctly. Then I use the following code:
$dom = new domDocument;
/*** load the html into the object ***/
@$dom->loadHTML($html);
/*** discard white space ***/
$dom->preserveWhiteSpace = false;
/*** the table by its tag name ***/
$tables = $dom->getElementsByTagName('table');
/*** get all rows from the table ***/
$rows = $tables->item(0)->getElementsByTagName('tr');
/*** loop over the table rows ***/
for ($i = 0; $i <= 5; $i++)
{
/*** get each column by tag name ***/
$cols = $rows->item($i)->getElementsByTagName('td');
echo $cols->item(2)->nodeValue;
echo '<hr />';
}
$html contains Russian text. After this, the line echo $cols->item(2)->nodeValue; displays garbled characters instead of Russian. I tried iconv, but it did not work. Any ideas?

I suggest using mb_convert_encoding before loading the UTF-8 page.
$dom = new DomDocument();
$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
@$dom->loadHTML($html);
Or else you could try this:
$dom = new DomDocument('1.0', 'UTF-8');
@$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
..........
echo html_entity_decode($cols->item(2)->nodeValue,ENT_QUOTES,"UTF-8");
..........

The DOM cannot recognize the HTML's encoding.
You can try something like:
$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
// taken from http://php.net/manual/en/domdocument.loadhtml.php#95251

mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
The same thing worked for PHPQuery.
P.S. I use phpQuery::newDocument($html);
instead of $dom->loadHTML($html);
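The encoding-hint approach from the answers above can be tried on a small sample. This is a minimal sketch, not a definitive fix: the table markup and the variable names are made up for illustration and stand in for the page fetched with cURL. It prepends the XML encoding declaration so the parser treats the input as UTF-8. Note that passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated as of PHP 8.2, which is one reason to prefer the hint.

```php
<?php
// Hypothetical sample input standing in for the page fetched with cURL.
$html = '<table><tr><td>1</td><td>2</td><td>Привет</td></tr></table>';

$dom = new DOMDocument();

// Collect parser warnings internally instead of suppressing them with @.
libxml_use_internal_errors(true);

// The encoding declaration tells the HTML parser that the input is UTF-8.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

// Third cell of the first row, as in the question.
$cell = $dom->getElementsByTagName('td')->item(2);
echo $cell->nodeValue, PHP_EOL;
```

If the text still looks wrong in a browser, it may be worth checking that the response itself is sent as UTF-8, for example with header('Content-Type: text/html; charset=utf-8').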

Related

How does one strip tags (and their content) from an HTML string using PHP's DOMDocument?

I'd like to remove all links and their contents from an HTML string.
So this ...
LINK1 and <i>also</i> LINK2 ... should become this: and <i>also</i>
The following ...
$html = '<a href="#">LINK1</a> - and <i>also</i> <a href="#">LINK2</a>';
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->validateOnParse = false;
$dom->resolveExternals = false;
$dom->substituteEntities = false;
$dom->loadHTML( $html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
$list = $dom->getElementsByTagName('a');
while ($list->length > 0) {
$p = $list->item(0);
$p->parentNode->removeChild($p);
}
$html_new = $dom->saveHTML();
echo htmlentities($html);
echo '<br><br><hr><br>';
echo htmlentities($html_new);
... does not work unless I wrap $html in a <div>, but then I have <div> and <i>also</i> </div>. I could use substr to strip the first 5 and last 6 characters off the result, but that's just stupid, and my face is already too sore from all the face-palming I've endured trying to figure out the above.
Any advice on how to strip all tags out of a string without using regex, or resorting to facepalmy hacks?
Based on Niet the Dark Absol's comment, my solution was to simply wrap my code snippet in a div, and then use substr to remove it. Seems like an acceptable workaround for working with valid inline HTML snippets (and not the entire DOM) via DOMDocument.
$html = '<a href="#">LINK1</a> - and <i>also</i> <a href="#">LINK2</a>';
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->validateOnParse = false;
$dom->resolveExternals = false;
$dom->substituteEntities = false;
$dom->loadHTML( '<div>'.$html.'</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD );
$list = $dom->getElementsByTagName('a');
while ($list->length > 0) {
$p = $list->item(0);
$p->parentNode->removeChild($p);
}
// saveHTML() appends a trailing newline, so trim it before cutting off <div> and </div>
$result = substr(trim($dom->saveHTML()), 5, -6);
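A possible alternative to the substr step is to serialize the wrapper's children one by one, so the wrapper element never reaches the output. This is a rough sketch under the assumption that the snippet is wrapped in a single div as above; the href="#" values and the variable names are placeholders for illustration.

```php
<?php
// Hypothetical sample snippet; the href values are placeholders.
$html = '<a href="#">LINK1</a> - and <i>also</i> <a href="#">LINK2</a>';

$dom = new DOMDocument();
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// The node list is live, so keep removing the first <a> until none remain.
$links = $dom->getElementsByTagName('a');
while ($links->length > 0) {
    $a = $links->item(0);
    $a->parentNode->removeChild($a);
}

// With LIBXML_HTML_NOIMPLIED the wrapper <div> is the document element.
$wrapper = $dom->documentElement;

// Serialize only the wrapper's children, leaving the wrapper itself out.
$result = '';
foreach ($wrapper->childNodes as $child) {
    $result .= $dom->saveHTML($child);
}

echo htmlentities($result), PHP_EOL;
```

This avoids hard-coding the wrapper's length, so it keeps working if the wrapper tag or its attributes change.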

Displaying the results of php document parser

Can you echo the results of a document parser or do you have to first create an array to display the results? Anyway, when running the code, nothing appears (no output or errors), and I have tried both methods. Could possibly be a site issue but I have tried a few others and get the same result.
<?php
$ebayquery ='halo';
$ebayhtml = 'https://www.ebay.com/sch/i.html_from=R40&_trksid=p2380057.m570.l1311.R6.TR12.TRC2.A0.H0.X.TRS0&_nkw=' . $ebayquery . '&_sacat=0';
$ebayresults = array();
$document = new \DOMDocument('1.0', 'UTF-8');
$internalErrors = libxml_use_internal_errors(true);
$document->loadHTML($ebayhtml);
libxml_use_internal_errors($internalErrors);
$xpath = new DOMXpath($document);
$links = $xpath->query('//h3[@id="lvtitle"]/a');
foreach($links as $a) {
echo $a->nodeValue;
}
?>
There are a couple of problems with the code. The first is that loadHTML() takes a string containing the HTML, not a filename or URI. So you first have to read the web page and pass its contents in (I've used file_get_contents() here).
The second is that the XPath was looking for any <h3> tag with an id attribute of lvtitle, but the page only has elements whose class attribute is lvtitle. I've updated the XPath expression to use the class instead.
$ebayquery ='halo';
$ebayhtml = 'https://www.ebay.com/sch/i.html_from=R40&_trksid=p2380057.m570.l1311.R6.TR12.TRC2.A0.H0.X.TRS0&_nkw=' . $ebayquery . '&_sacat=0';
$ebayresults = array();
$document = new \DOMDocument('1.0', 'UTF-8');
$internalErrors = libxml_use_internal_errors(true);
$ebayhtml = file_get_contents($ebayhtml);
$document->loadHTML($ebayhtml);
libxml_use_internal_errors($internalErrors);
$xpath = new DOMXpath($document);
$links = $xpath->query('//h3[@class="lvtitle"]/a');
print_r($links);
foreach($links as $a) {
echo $a->nodeValue.PHP_EOL;
}
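The class-based query can be checked offline before pointing it at a live site. The following is only a sketch: the sample markup is invented to mimic the structure described above (an <h3> with class lvtitle wrapping a link), and the real page may differ.

```php
<?php
// Hypothetical markup imitating the structure described in the answer.
$sampleHtml = '<html><body>'
    . '<h3 class="lvtitle"><a href="/item/1">Halo 3</a></h3>'
    . '<h3 class="lvtitle"><a href="/item/2">Halo Reach</a></h3>'
    . '<h3 class="other"><a href="/item/3">Not a result</a></h3>'
    . '</body></html>';

$document = new DOMDocument();

// Keep parser warnings out of the output, then restore the previous setting.
$previous = libxml_use_internal_errors(true);
$document->loadHTML($sampleHtml);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($document);

// Collect the link text of every <h3 class="lvtitle"> entry.
$titles = [];
foreach ($xpath->query('//h3[@class="lvtitle"]/a') as $a) {
    $titles[] = trim($a->nodeValue);
}

print_r($titles);
```

When fetching a real page, file_get_contents() returns false on failure, so it is worth checking its return value before passing it to loadHTML().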

Echo HTML code, which is retrieved from a external page in php

We have this code
$page = file_get_contents('http://example.aspx?a=14&c=14213&med=0');
$doc = new DOMDocument();
$doc->loadHTML($page);
$divs = $doc->getElementsByTagName('table');
foreach($divs as $div) {
// Loop through the tables looking for one with an id of "Table2"
// Then echo out its contents
if ($div->getAttribute('id') === 'Table2') {
echo $div->childNodes;
}
}
As you can see, the code runs, but it outputs plain text because of how childNodes behaves. We need to output the HTML markup of "Table2" instead of plain text.
How can I do this?
Solved, with this code
$dom = new DOMDocument();
$data = file_get_contents('http://example.aspx?a=14&c=14213&med=0');
$dom->loadHTML($data); // $data is your html code, grab it using file_get_contents or cURL.
$xpath = new DOMXPath($dom);
$div = $xpath->query('//table[@id="Table2"]');
$div = $div->item(0);
echo $dom->saveXML($div);
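Since the goal is HTML output, saveHTML() with a node argument may be a closer fit than saveXML(), which serializes using XML rules (for example, empty elements come out self-closed). A small sketch on invented sample markup, with a check for the case where no matching table exists:

```php
<?php
// Hypothetical sample page; the real markup comes from the remote URL.
$data = '<html><body>'
    . '<table id="Table1"><tr><td>skip</td></tr></table>'
    . '<table id="Table2"><tr><td>cell</td></tr></table>'
    . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($data);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$table = $xpath->query('//table[@id="Table2"]')->item(0);

// item(0) returns null when nothing matched, so guard before serializing.
$markup = $table !== null ? $dom->saveHTML($table) : '';

echo $markup, PHP_EOL;
```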

DOM Parser grabbing href of <a> tag by class="Decision"

I'm working with a DOM parser and I'm having issues. I'm basically trying to grab the href of only those <a> tags that have the class 'thumbnail '. I've been trying to print the links on the screen and still get no results. Any help is appreciated. I also turned on error_reporting(E_ALL); and still nothing.
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$classId = "thumbnail ";
$div = $html->find('a#'.$classId);
echo $div;
I also tried this but still had the same result of NOTHING:
include('simple_html_dom.php');
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
// grab all the on the page
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
$ret = $html->find('a[class=thumbnail]');
echo $ret;
You were almost there:
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.reddit.com/r/funny');
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a[contains(concat(' ',normalize-space(@class),' '),' thumbnail ')]");
var_dump($hrefs);
Gives:
class DOMNodeList#28 (1) {
public $length =>
int(25)
}
25 matches, I'd call it success.
This code would probably work:
$html = file_get_contents('http://www.reddit.com/r/funny');
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$hyperlinks = $xpath->query('//a[@class="thumbnail"]');
foreach($hyperlinks as $hyperlink) {
echo $hyperlink->getAttribute('href'), '<br>';
}
If you're using simple_html_dom, why are you doing all these superfluous things? It already wraps the resource in everything you need: http://simplehtmldom.sourceforge.net/manual.htm
include('simple_html_dom.php');
// set up:
$html = new simple_html_dom();
// load from URL:
$html->load_file('http://www.reddit.com/r/funny');
// find those <a> elements:
$links = $html->find('a[class=thumbnail]');
// done.
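// find() returns an array of elements, so print each href
foreach ($links as $link) {
echo $link->href, PHP_EOL;
}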
Tested it and made some changes. This works perfectly too.
<?php
// load the url and set up an array for the links
$dom = new DOMDocument();
@$dom->loadHTMLFile('http://www.reddit.com/r/funny');
$links = array();
// loop thru all the A elements found
foreach($dom->getElementsByTagName('a') as $link) {
$url = $link->getAttribute('href');
$class = $link->getAttribute('class');
// Check if the URL is not empty and if the class contains thumbnail
if(!empty($url) && strpos($class,'thumbnail') !== false) {
array_push($links, $url);
}
}
// Print results
print_r($links);
?>
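The answers above differ in how strictly they match the class. A plain strpos() check also accepts longer names such as thumbnail-large, and @class="thumbnail" rejects elements that carry additional classes, while the concat/normalize-space expression matches thumbnail as a whole token. The sketch below illustrates the token match on invented markup (the class names and hrefs are made up):

```php
<?php
// Hypothetical markup with three differently classed links.
$sampleHtml = '<html><body>'
    . '<a class="thumbnail may-blank" href="/a">A</a>'
    . '<a class="thumbnail-large" href="/b">B</a>'
    . '<a class="title" href="/c">C</a>'
    . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($sampleHtml);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Pad the class list with spaces so ' thumbnail ' only matches a whole token.
$query = "//a[contains(concat(' ', normalize-space(@class), ' '), ' thumbnail ')]";

$hrefs = [];
foreach ($xpath->query($query) as $a) {
    $hrefs[] = $a->getAttribute('href');
}

print_r($hrefs);
```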

PHP HTML DOM, XPATH - weird characters?

Assume $html_dom contains a page that has HTML entities like  . In the output below, I get an output like this  .
$html_dom = new DOMDocument();
@$html_dom->loadHTML($html_doc);
$xpath = new DOMXPath($html_dom);
$query = '//div[@class="foo"]/div/p';
$my_foos = $xpath->query($query);
foreach ($my_foos as $my_foo)
{
echo html_entity_decode($my_foo->nodeValue);
die;
}
How do I handle this properly so that I don't get weird characters? I tried the following with no success:
$html_doc = mb_convert_encoding($html_doc, 'HTML-ENTITIES', 'UTF-8');
$html_dom = new DOMDocument();
$html_dom->resolveExternals = TRUE;
@$html_dom->loadHTML($html_doc);
$xpath = new DOMXPath($html_dom);
$query = '//div[@class="foo"]/div/p';
$my_foos = $xpath->query($query);
foreach ($my_foos as $my_foo)
{
echo html_entity_decode($my_foo->nodeValue);
die;
}
mb_convert_encoding was a good idea, but it does not work as expected because DOMDocument seems to be a little bit buggy when it comes to encoding.
Moving the mb_convert_encoding to the actual node output did the trick.
$html_dom = new DOMDocument();
$html_dom->resolveExternals = TRUE;
@$html_dom->loadHTML($html_doc);
$xpath = new DOMXPath($html_dom);
$query = '//div[@class="foo"]/div/p';
$my_foos = $xpath->query($query);
foreach ($my_foos as $my_foo)
{
echo mb_convert_encoding($my_foo->nodeValue, 'HTML-ENTITIES', 'UTF-8');
die;
}
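One likely explanation for the stray character: nodeValue is always returned as UTF-8, and a no-break space entity decodes to the two bytes C2 A0. If the output is then displayed as ISO-8859-1, the first byte can show up as a stray accented letter. The sketch below, on invented sample markup, makes the bytes visible and shows one way to normalize them; whether replacing the no-break space is acceptable depends on the use case.

```php
<?php
// Hypothetical sample standing in for $html_doc.
$html_doc = '<div class="foo"><div><p>foo&nbsp;bar</p></div></div>';

$html_dom = new DOMDocument();
libxml_use_internal_errors(true);
$html_dom->loadHTML($html_doc);
libxml_clear_errors();

$xpath = new DOMXPath($html_dom);
$p = $xpath->query('//div[@class="foo"]/div/p')->item(0);

// The entity has been decoded to a UTF-8 no-break space (bytes C2 A0).
$value = $p->nodeValue;
echo bin2hex($value), PHP_EOL;

// One option: swap the no-break space for an ordinary space.
$clean = str_replace("\u{00A0}", ' ', $value);
echo $clean, PHP_EOL;
```

Another option is to leave the text untouched and declare the output encoding instead, for example with header('Content-Type: text/html; charset=utf-8'). Note also that passing 'HTML-ENTITIES' to mb_convert_encoding is deprecated as of PHP 8.2.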
