I am trying to scrape some old pages with DOM and present them in a modern design.
I have a problem with the encoding: the content is in French.
I am using this code to get the content I want. There are two types of content, "Categories" and "Data":
$html = new DOMDocument();
$html->validateOnParse = true;
@$html->loadHTML($page);
$xpath = new DOMXPath($html);
$table = $xpath->query("//*[@style='background: white']")->item(0);
Then I process the content. First I pass the categories to a function that converts them to IDs:
function category_to_id($category) {
    $categories = array('Forêts','Assurance','Aéronautique','Equipement ','Autre');
    foreach ($categories as $id => $cat) {
        if(trim($cat) == trim($category)) {
            return $id + 1;
        }
    }
}
Then I store everything in a MySQL database.
My first problem is that my function works only for categories without special characters, like Assurance.
The second is that when I look in the database, I find the data stored with garbled accented characters instead of Travaux d'électricité.
I tried adding $html->encoding = 'utf-8'; but that didn't change anything.
What am I doing wrong, and how can I fix it?
DOMDocument::loadHTML() does not assume UTF-8 by default (input without a declared charset is treated as ISO-8859-1), so you should convert the page before loading it:
$html->loadHTML(mb_convert_encoding($page, 'HTML-ENTITIES', 'UTF-8'));
Alternatively, you could utf8_decode your string:
echo category_to_id(utf8_decode("Travaux d'électricité"));
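A minimal sketch of the same idea for newer PHP versions, where both utf8_decode() and the 'HTML-ENTITIES' target are deprecated (as of PHP 8.2). It assumes $page holds UTF-8 bytes and that this file is saved as UTF-8. The sample markup and the array_search-based lookup are illustrative, not the asker's real page or function.

```php
<?php
// Hypothetical sample standing in for the scraped $page (assumed to be UTF-8).
$page = '<html><body><table style="background: white">'
      . '<tr><td>Aéronautique</td></tr></table></body></html>';

// Turn every non-ASCII character into a numeric entity, so the parser
// no longer has to guess the encoding of the input.
$ascii = mb_encode_numericentity($page, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

libxml_use_internal_errors(true);   // collect parser warnings instead of using @
$html = new DOMDocument();
$html->loadHTML($ascii);
libxml_clear_errors();

$xpath = new DOMXPath($html);
$cell = $xpath->query("//*[@style='background: white']//td")->item(0);
$category = trim($cell->textContent); // DOM methods return UTF-8 strings

// Same lookup as category_to_id() in the question, written with array_search.
function category_to_id(string $category): ?int
{
    $categories = ['Forêts', 'Assurance', 'Aéronautique', 'Equipement', 'Autre'];
    $index = array_search(trim($category), $categories, true);
    return $index === false ? null : $index + 1;
}

echo category_to_id($category), PHP_EOL;
```

Because the DOM hands back UTF-8, the comparison only works if the category literals in the source file are UTF-8 as well, and the MySQL connection should also be set to a UTF-8 charset before inserting.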
The source of this problem is that I run ads on my website. My content is mainly HTML stored in a database, so I decided to place "In-Text Ads", that is, ads that are not in a fixed zone.
My solution was to split the content by paragraphs and place the text ad in the middle of the p tags. Since I use CKEditor to generate the content, I assumed images, blockquotes, and other tags would be nested inside p tags (foolish of me), and I now realize that images and blockquotes disappeared from my posts. Next, I changed my code to split on * instead of the p tag. I celebrated too soon, because now I get a lot of duplicate content: for example, if I have one image, I now get the same image 4 times, and the same happens with all other tags. I'm not sure where these duplicates come from, but I think it has something to do with nested HTML. I have looked for a solution for hours, and I'm asking here in case somebody can help me solve this headache.
Here is my code:
//In a helper file
function splitByHTMLTagName(string $string, string $tagName = 'p')
{
    // The heredoc lines stay flush left so this also parses on PHP < 7.3.
    $text = <<<TEXT
$string
TEXT;
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $nodes = [];
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $text);
    foreach ($dom->getElementsByTagName($tagName) as $node) {
        array_push($nodes, $dom->saveHTML($node));
    }
    libxml_clear_errors();
    return $nodes;
}
//In my view
$text = nl2br($database['content']);
$nodes = splitByHTMLTagName($text, '*');
//Using var_dump($nodes); here shows the duplicates are here already.
$nodes_count = count($nodes);
$show_ad_at = -1;
$was_added = false;
if($nodes_count % 2 == 0 ){
    $show_ad_at = $nodes_count /2;
}else if ($nodes_count == 1 || $nodes_count < 3){
    $show_ad_at = -1; //add later
}else if ($nodes_count > 3 && $nodes_count % 2 != 0){
    $show_ad_at = ceil($nodes_count/2);
}
for($i = 0; $i<count($nodes); $i++){
    if(!$was_added && $i == $show_ad_at){
        $was_added = true;
        ?>
        <div>
        <script></script><!--This script is provided to me, it adds the ad where it is placed, I don't show the full script, It has nothing to do with the duplicates problem-->
        </div>
        <?php
    }
    echo $nodes[$i]; //print the node that comes from $nodes array where the duplicates already exist
}
if(!$was_added){
    $was_added = true;
    ?>
    <div>
    <script></script><!--This script is provided to me, it adds the ad where it is placed, I don't show the full script, It has nothing to do with the duplicates problem-->
    </div>
    <?php
}
What can I do?
Thanks in advance.
P.S. #1: I use CodeIgniter as my PHP framework.
P.S. #2: My ads provider does not offer "In-Text ads" as a feature like Google does.
It seems you are printing the ad block inside an if statement.
If I haven't misunderstood, your code looks like this:
foreach ... {
    if (strpos($html_line, "In-Text Ads") !== FALSE) {
        print($ads_html);
    }
}
I think you should use str_replace() instead of print()-like functions, if you are currently using something like print() when outputting the value...
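On the duplicates themselves: getElementsByTagName('*') returns every element in the document, nested ones included, and saveHTML($node) already serializes a node together with its children. A nested element is therefore emitted once inside its parent and once more on its own. A minimal sketch of one way around that, assuming the ad only needs to sit between top-level blocks; splitTopLevelNodes is a hypothetical name, not part of the asker's code:

```php
<?php
// Hypothetical helper: returns the HTML of each top-level node only,
// so nested elements are emitted exactly once (inside their parent).
function splitTopLevelNodes(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    libxml_clear_errors();

    $nodes = [];
    // Without extra flags, loadHTML wraps a fragment in <html><body>.
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $nodes;
    }
    foreach ($body->childNodes as $child) {
        // Skip whitespace-only text between elements.
        if ($child->nodeType === XML_TEXT_NODE && trim($child->textContent) === '') {
            continue;
        }
        $nodes[] = $dom->saveHTML($child);
    }
    return $nodes;
}

// Illustrative content, not taken from the asker's database.
$content = '<p>One</p><blockquote><p>Two</p></blockquote><p>Three</p>';
$nodes = splitTopLevelNodes($content);
$show_ad_at = intdiv(count($nodes), 2); // index of the middle block
echo count($nodes), PHP_EOL;
```

The existing loop in the view can then walk $nodes unchanged and print the ad block when $i equals $show_ad_at.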
Example with a MediaWiki link: https://www.visionduweb.eu/wiki/index.php?title=Utiliser_PHP
View the source code and locate the table of contents ("sommaire") of this MediaWiki page.
I am looking for a way to parse the source code and extract the HTML of this table of contents.
I tried $domExemple = $xpath->query("//ul/li"); but it returns too many results, and they are poorly formatted.
I tried $domExemple = $xpath->query("//ul/li[@class='toclevel-1 tocsection-1']"); which gives me the result, but how do I get every toclevel and tocsection without having to specify the number (1, 2, 3, ...)?
In this example I only get the text content, not the HTML content.
I would prefer to retrieve the HTML content.
I believe you can simplify your xpath expression using the syntax defined here:
How can I match on an attribute that contains a certain string?
Try something like this:
$results = $xpath->query('//ul/li[contains(@class, "toclevel-") and contains(@class, "tocsection-")]');
foreach ($results as $li) {
    // to get html of $li, import it into a fresh DOMDocument and run saveHTML
    $newdoc = new DOMDocument();
    $cloned = $li->cloneNode(true);
    $newdoc->appendChild($newdoc->importNode($cloned, true));
    echo $newdoc->saveHTML();
}
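A shorter variant, as a sketch: DOMDocument::saveHTML() accepts a node argument, so the HTML of each match can be taken straight from the original document without building a second one. The markup below is a made-up stand-in for the MediaWiki table of contents, not the real page source.

```php
<?php
// Hypothetical stand-in for the table of contents markup.
$source = '<ul>'
        . '<li class="toclevel-1 tocsection-1"><a href="#A"><span>A</span></a></li>'
        . '<li class="toclevel-2 tocsection-2"><a href="#B"><span>B</span></a></li>'
        . '<li class="other">not part of the table of contents</li>'
        . '</ul>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($source);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$items = $xpath->query(
    '//ul/li[contains(@class, "toclevel-") and contains(@class, "tocsection-")]'
);

$fragments = [];
foreach ($items as $li) {
    // Passing the node serializes just that element, markup included.
    $fragments[] = $doc->saveHTML($li);
}

echo implode(PHP_EOL, $fragments), PHP_EOL;
```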
I know how to use XPath to echo text from another website by selecting on things like a div id or class, using the code below. But I don't know how to do it under more precise conditions, for example when trying to scrape and echo a bit of text that has no unique identifier such as a div id.
The code below outputs the scraped data.
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://www.nbcnews.com/business');
$xpath = new DOMXPath($doc);
$query = "//div[@class='market']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
    echo trim($entry->textContent); // use `trim` to eliminate spaces
}
In the source code below, for example, I want to pull the value "21,271.97". But there's no unique tag for it, no div id. Is it possible to pull this data by identifying a keyword in the <p> that never changes, for example "DJIA all time"?
<p>DJIA All Time, Record-High Close: <font color="#0000FF">June 9,
2017</font>
(<font color="#FF0000"><b bgcolor="#FFFFCC"><font face="Verdana, Arial,
Helvetica, sans-serif" size="2">21,271.97</font></b></font>)</p>
I am wondering if I could replace $query = "//div[@class='market']"; with something along the lines of:
$query = "//p['DJIA all time']";
Could this be possible?
I also wonder whether a loop with something like $query = "//p[='DJIA']"; could work, though I don't know how to use that exactly.
Thanks!!
It would be good to have a play with an online XPath tester. I use https://www.freeformatter.com/xpath-tester.html#ad-output
$query = "//p[contains(text(),'DJIA')]";
Although if you use the page you're after, I've found that the value seems to be the first record for...
$query = "//span[contains(@class,'market_price')]";
But the idea is the same in both cases: contains(source, value) matches a set of nodes. In the first case text() is the text of the node; the second looks for the specific class definition.
Try to use below XPath expression:
//p[contains(text(), "DJIA All Time")]//b/font
Considering provided link (http://www.nbcnews.com/business) you can get required text with
//span[text()="DJIA"]/following-sibling::span[@class="market_item market_price"]
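As a rough, self-contained check of the keyword approach, here is a sketch that runs against a trimmed copy of the <p> snippet from the question rather than the live page. It matches on "." (the whole string value of the paragraph) instead of text(), which is my own substitution so the match does not depend on which text node carries the phrase. Note that contains() is case-sensitive, so "DJIA all time" would not match "DJIA All Time".

```php
<?php
// Trimmed copy of the paragraph from the question (some attributes shortened).
$markup = '<p>DJIA All Time, Record-High Close: <font color="#0000FF">June 9, 2017</font> '
        . '(<font color="#FF0000"><b><font size="2">21,271.97</font></b></font>)</p>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Find the paragraph by its fixed keyword, then descend to the value.
$node = $xpath->query('//p[contains(., "DJIA All Time")]//b/font')->item(0);
$value = $node !== null ? trim($node->textContent) : null;

echo $value, PHP_EOL;
```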
I'm trying to load an HTML page by using a URL. This is what I'm doing now to find the count of images on a page:
$html = "http://stackoverflow.com/";
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('*');
$count = 0;
foreach ($tags as $tag) {
    if (strcmp($tag->tagName, "img") == 0) {
        $count++;
    }
}
echo $count;
I know this isn't an efficient way to do this; I just set it up as an example. Each time, the count is 0, but there are images on the page, which leads me to believe the page isn't loading correctly. What am I doing wrong? Thanks.
HTML tag names are case-insensitive, and some DOM implementations report them in upper case, so you can avoid the issue by using strcasecmp instead of strcmp.
Or avoid both problems by doing it properly:
$count = $doc->getElementsByTagName('img')->length;
From the docs
DOMDocument::loadHTML — Load HTML from a string
Its signature is quite clear about this, too:
public bool DOMDocument::loadHTML ( string $source [, int $options = 0 ] )
You could try using DOMDocument::loadHTMLFile, or simply get the markup of the given URL using file_get_contents or a cURL request (whichever works best for you).
And please don't use the error-suppression operator @ (of death). If something emits a notice, warning, or error, there's a problem. Don't ignore it, fix it!
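Putting those points together, a minimal sketch: pass markup (not a URL) to loadHTML, collect parser warnings with libxml_use_internal_errors instead of suppressing them with @, and read the length of the node list directly. countImages is a hypothetical helper name, and the inline sample is there only so the example runs without network access.

```php
<?php
// Hypothetical helper: counts <img> elements in a string of HTML markup.
function countImages(string $markup): int
{
    libxml_use_internal_errors(true);  // keep warnings inspectable via libxml_get_errors()
    $doc = new DOMDocument();
    $doc->loadHTML($markup);           // expects markup, not a URL
    libxml_clear_errors();
    return $doc->getElementsByTagName('img')->length;
}

// Inline sample so the sketch runs offline.
echo countImages('<div><img src="a.png"><p>text <img src="b.png"></p></div>'), PHP_EOL;

// With a real page, fetch the markup first (needs allow_url_fopen enabled):
// $markup = file_get_contents('http://stackoverflow.com/');
// if ($markup !== false) { echo countImages($markup), PHP_EOL; }
```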
I generate a lot of posts in WordPress from an XML file. The problem: accented characters.
The header of the feed is:
<?xml version="1.0" encoding="ISO-8859-15"?>
Here is the complete feed: http://flux.netaffiliation.com/rsscp.php?maff=177053821BA2E13E910D54
My site is in UTF-8.
So I use the function utf8_encode, but that does not solve the problem: the accents are still garbled.
Does anyone have an idea?
EDIT 04-10-2011 18:02 (French time):
Here is my code:
/**
 * Parse an RSS feed from NetAffiliation and convert each item to a post.
 * @param string $flux external link
 * @return bool
 */
private function parseFluxNetAffiliation($flux)
{
    $content = file_get_contents($flux);
    $content = iconv("iso-8859-15", "utf-8", $content);
    $xml = new DOMDocument;
    $xml->loadXML($content);
    //get the first link : http://www.netaffiliation.com
    $link = $xml->getElementsByTagName('link')->item(0);
    //echo $link->textContent;
    //we get all items and create a multidimensional array
    $items = $xml->getElementsByTagName('item');
    $offers = array();
    //we walk the items
    foreach($items as $item)
    {
        $childs = $item->childNodes;
        //we walk the children
        foreach($childs as $child)
        {
            $offers[$child->nodeName][] = $child->nodeValue;
        }
    }
    unset($offers['#text']);
    //we create one article for each offer
    $nbrPosts = count($offers['title']);
    if($nbrPosts <= 0)
    {
        echo self::getFeedback("Le flux ne contient aucune offre",'error');
        return false;
    }
    $i = 0;
    while($i < $nbrPosts)
    {
        // Create post object
        $description = '<p>'.$offers['description'][$i].'</p><p>'.$offers['link'][$i].'</p>';
        $my_post = array(
            'post_title' => $offers['title'][$i],
            'post_content' => $description,
            'post_status' => 'publish',
            'post_author' => 1,
            'post_category' => array(self::getCatAffiliation())
        );
        // Insert the post into the database
        wp_insert_post($my_post);
        $i++;
    }
    echo self::getFeedback("Le flux a généré {$nbrPosts} article(s) depuis le flux NetAffiliation dans la catégorie affiliation",'updated');
    return false;
}
All the posts are generated but... the accented chars are ugly. You can see the result here: http://monsieur-mode.com/test/
There are plenty of difficulties to master when switching between different encodings. Also, encodings that use more than one byte per character (so-called multibyte encodings) such as UTF-8, which is used by WordPress, deserve special attention in PHP.
First, make sure that all the files you create are saved with the same encoding they will be served with. For example, make sure the encoding you choose in the "Save as..." dialog is the same one you send in the HTTP Content-Type header.
Second, you need to verify that the input has the same encoding as the file you want to deliver. In your case, the input file is encoded as ISO-8859-15, so you'll need to convert it to UTF-8 using iconv().
Third, you must know that PHP's core string functions work on bytes and do not natively understand multibyte encodings such as UTF-8. Functions such as htmlentities() can produce strange characters when the encoding they assume does not match your data. For many of these functions there are multibyte alternatives, which are prefixed with mb_. If your encoding is UTF-8, check your files for such functions and replace them if necessary.
For more information about these topics, see Wikipedia on variable-width encodings, and the relevant page in the PHP manual.
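To make the third point concrete, a small sketch comparing a byte-oriented function with its mb_ counterparts on a UTF-8 string. The word is written with \u{...} escapes (PHP 7+) so the result does not depend on how this file is saved.

```php
<?php
// "été" as UTF-8: each é takes two bytes, t takes one.
$word = "\u{e9}t\u{e9}";

$bytes = strlen($word);                  // counts bytes
$chars = mb_strlen($word, 'UTF-8');      // counts characters
$first = mb_substr($word, 0, 1, 'UTF-8'); // first character, not first byte
$upper = mb_strtoupper($word, 'UTF-8');  // case mapping that understands é

echo $bytes, ' bytes, ', $chars, ' characters', PHP_EOL;
echo $first, ' ', $upper, PHP_EOL;
```

By contrast, substr($word, 0, 1) would return only the first byte of the é, which is not valid UTF-8 on its own.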
By default, most applications work with UTF-8 data and output UTF-8 content. WordPress is no exception and certainly works on a UTF-8 basis.
I would simply not convert any information when printing, but instead change your header to UTF-8 instead of ISO-8859-15.
If your incoming XML data is ISO-8859-15, use iconv() to convert it:
$stream = file_get_contents("stream.xml");
$stream = iconv("iso-8859-15", "utf-8", $stream);
mb_convert_encoding() saved my life.
Here is my solution:
$content = preg_replace('/ encoding="ISO-8859-15"/is','',$content);
$content = mb_convert_encoding($content, "UTF-8", "ISO-8859-15");
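A likely reason the declaration matters: if the bytes are converted to UTF-8 but the XML header still says ISO-8859-15, loadXML() decodes them a second time and the accents come out garbled. A minimal sketch that converts the bytes and rewrites the declaration together; feedToUtf8 is a hypothetical helper name and the sample feed is made up, with the ISO-8859-15 bytes written as \x escapes.

```php
<?php
// Hypothetical helper: converts an XML string to UTF-8 and updates its declaration.
function feedToUtf8(string $xml, string $from = 'ISO-8859-15'): string
{
    // Name the source encoding explicitly rather than relying on mbstring's default.
    $utf8 = mb_convert_encoding($xml, 'UTF-8', $from);
    // Keep the declaration in step with the actual bytes.
    return preg_replace('/(<\?xml[^>]*encoding=")[^"]*(")/i', '${1}UTF-8${2}', $utf8, 1);
}

// Made-up feed: \xE9 is "é" and \xA4 is the euro sign in ISO-8859-15.
$raw = "<?xml version=\"1.0\" encoding=\"ISO-8859-15\"?>"
     . "<rss><title>Collection \xE9t\xE9 \xA4</title></rss>";

$xml = new DOMDocument();
$xml->loadXML(feedToUtf8($raw));
$title = $xml->getElementsByTagName('title')->item(0)->textContent;

echo $title, PHP_EOL;
```

Another option is to skip the manual conversion altogether and hand the untouched ISO-8859-15 content to loadXML(), since the parser honors the declared encoding and DOM returns UTF-8 strings.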