Simple_Html_Dom: how to parse Chinese characters - php

I would like to crawl product data from the Taobao site.
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
</head>
<body>
<?php
include_once('simple_html_dom.php');
$target_url = "http://item.taobao.com/item.htm?spm=a2106.m893.1000384.54.61Q4Fp&id=37676614376&_u=fm86qe4d813&scm=1029.newlist-0.1.50006843&ppath=&sku=&ug=#detail";
$html = new simple_html_dom();
$html->load_file($target_url);
foreach ($html->find('h3[class=tb-main-title]') as $post) {
echo html_entity_decode($post, ENT_QUOTES, "ISO-8859-1") . "<br />";
}
?>
</body>
</html>
But it displays the product title like this:
2014��ЬŮʿ�������¿��ϸ��ƽ���ļ��¿����ϴ���ƽ����Ь��

The page is not encoded as UTF-8, so the text has to be converted before you echo it. To avoid the garbled output, use the iconv function. Consider this example:
include 'simple_html_dom.php';
$target_url = "http://item.taobao.com/item.htm?spm=a2106.m893.1000384.54.61Q4Fp&id=37676614376&_u=fm86qe4d813&scm=1029.newlist-0.1.50006843&ppath=&sku=&ug=#detail";
$contents = file_get_contents($target_url);
$html = str_get_html($contents);
foreach($html->find('h3[class=tb-main-title]') as $post) {
$text = $post->innertext;
$text = iconv('gb2312', 'utf-8', $text);
echo $text;
// 2014拖鞋女士人字拖新款豹纹细带平底夏季新款凉拖大码平底拖鞋潮
}
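As a minimal sketch of the conversion step on its own, the snippet below wraps iconv in a small helper. The function name toUtf8 and its fallback behaviour are assumptions for illustration and are not part of simple_html_dom; the default source encoding follows the answer above, and the sample bytes are the GB2312 encoding of 中文.

```php
<?php
// Hypothetical helper: convert a byte string from a legacy encoding to UTF-8.
// Returns the input unchanged if iconv cannot convert it.
function toUtf8(string $bytes, string $sourceEncoding = 'GB2312'): string
{
    $converted = @iconv($sourceEncoding, 'UTF-8', $bytes);
    return $converted === false ? $bytes : $converted;
}

// "\xD6\xD0\xCE\xC4" is the GB2312 byte sequence for the two characters 中文.
$gbBytes = "\xD6\xD0\xCE\xC4";
echo toUtf8($gbBytes), PHP_EOL;
```

If some titles contain characters outside GB2312, passing 'GBK' as the second argument may be worth trying, since GBK is a superset of GB2312.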

Related

How to include file in PHP with user defined variable

I am trying to insert an included file's output with a string replacement, but the result contains the raw PHP source instead of the evaluated output.
analytic.php
<?php echo "<title> Hello world </title>"; ?>
head.php
<?php include "analytic.php"; ?>
index.php
$head = " <head> </head>";
$headin = file_get_contents('head.php');
$head = str_replace("<head>", "<head>". $headin, $head);
echo $head;
Output I am getting:
<head><?php include "analytic.php"; ?> </head>
Output I need:
<head><title> Hello world </title> </head>
Note: Please do not recommend including analytic.php directly in index.php. head.php contains important code, so analytic.php has to be merged into head.php first, and the result then goes into index.php.
To get the desired output, evaluate the file's contents and capture what it prints:
function getEvaluatedContent($include_files) {
$content = file_get_contents($include_files);
ob_start();
eval("?>$content");
$evaluatedContent = ob_get_contents();
ob_end_clean();
return $evaluatedContent;
}
$headin = getEvaluatedContent('head.php');
$head = " <head> </head>";
$head = str_replace("<head>", "<head>". $headin, $head);
echo $head;
The output is now the evaluated result rather than the file's source:
<head><title> Hello world </title> </head>
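A possible alternative that avoids eval is to include the file inside an output buffer. The sketch below is only an illustration: the helper name renderFile and the temporary directory are assumptions made to keep the example self-contained, and the temporary head.php uses __DIR__ to locate analytic.php.

```php
<?php
// Hypothetical helper: run a PHP file and return what it prints as a string.
function renderFile(string $path): string
{
    ob_start();
    include $path;
    return (string) ob_get_clean();
}

// Recreate the two files from the question in a temporary directory.
$dir = sys_get_temp_dir() . '/include_demo_' . uniqid();
mkdir($dir);
file_put_contents($dir . '/analytic.php', '<?php echo "<title> Hello world </title>"; ?>');
file_put_contents($dir . '/head.php', '<?php include __DIR__ . "/analytic.php"; ?>');

// Same replacement as in the question, but with the evaluated output.
$head = " <head> </head>";
$headin = renderFile($dir . '/head.php');
$head = str_replace("<head>", "<head>" . $headin, $head);
echo $head, PHP_EOL;
```

Because include runs the file through the normal PHP loader, nested includes inside head.php are executed as usual.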
I think your approach is pretty basic (you are trying to programmatically edit the template script itself, right?), but anyway:
// Read the template into an array of lines.
$file = file('absolute/path/to/file.php');
foreach ($file as $line => $code) {
    // str_contains() requires PHP 8; only the first <head> tag is changed.
    if (str_contains($code, '<head>')) {
        $file[$line] = str_replace('<head>', '<head>' . $headin, $code);
        break;
    }
}
// Write the modified lines back to the same file.
file_put_contents('absolute/path/to/file.php', $file);

Reading and encoding html

I am trying to read and display the content of the title (contained in a h1 tag) from many HTML files. These files are all in the same folder.
This is what the html files look like :
<!DOCTYPE html PUBLIC '-//W3C//DTD HTML 4.01//EN'>
<html>
<head>
<title>A title</title>
<style type='text/css'>
... Styles here ...
</style>
</head>
<body>
<h1>Être aidant</h1>
<p>En général, les aidants doivent équilibrer...</p>
... more tags ...
</body>
I have tried to display the content from the H1 tag with this PHP script :
<?php
foreach (glob("test/*.html") as $file) {
$file_handle = fopen($file, "r");
$doc = new DOMDocument();
$doc->loadHTMLfile($file);
$title = $doc->getElementsByTagName('h1');
if ( $title && 0<$title->length ) {
$title = $title->item(0);
$content = $doc->savehtml($title);
echo $content;
}
fclose($file_handle);
}
?>
But the output contains wrong characters. For the example file, the output is :
Ãtre aidant
How can I achieve this output?
Être aidant
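You should declare the charset in the <head> of your HTML documents so that DOMDocument does not have to guess the encoding:
<meta charset="utf-8">
Alternatively, convert the output. The UTF-8 bytes have been treated as ISO-8859-1 along the way, so change echo $content to echo utf8_decode($content); to reverse that step. Note that utf8_encode() and utf8_decode() are deprecated as of PHP 8.2.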
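If the HTML files cannot be edited, one commonly used workaround is to turn every non-ASCII character into a numeric entity before parsing, so the parser's encoding guess no longer matters. The sketch below assumes the files are UTF-8 and that the mbstring extension is available; the function name firstHeading is hypothetical.

```php
<?php
// Hypothetical helper: return the text of the first <h1> in a UTF-8 HTML string.
function firstHeading(string $utf8Html): ?string
{
    // Replace every character above ASCII with a numeric entity (e.g. &#202;).
    $ascii = mb_encode_numericentity($utf8Html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $doc->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $h1 = $doc->getElementsByTagName('h1')->item(0);
    return $h1 === null ? null : $h1->textContent; // textContent is UTF-8
}

$sample = "<html><head><title>A title</title></head><body><h1>Être aidant</h1></body></html>";
echo firstHeading($sample), PHP_EOL;
```

The page that echoes the result still has to be served as UTF-8, otherwise the browser will misread the correct bytes.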

Unable to extract og tags from webpage?

Here is the code that I am using at this point
$file = array_rand($files);
$filename = "http://example.com/".$files[$file];
echo $filename;
libxml_use_internal_errors(true);
$c = file_get_contents($filename);
$d = new DomDocument();
$d->loadHTML($c);
$xp = new domxpath($d);
foreach ($xp->query("//meta[@name='og:title']") as $el) {
echo $el->getAttribute("content");
}
foreach ($xp->query("//meta[@name='og:image']") as $el) {
echo $el->getAttribute("content");
}
$filename has the correct URL, but the content of og:image and og:title is not echoed. Why?
EDIT
This is the typical organization of my webpages
<?php require_once("headertop.php")?>
<meta property="og:image" content="url" />
<meta property="og:title" content="content here." />
<meta property="og:description" content="description here." />
<title>Page title</title>
<?php require_once("headerbottom.php")?>
EDIT 2
From one answer I understood this. I have to use
$rootNamespace = $d->lookupNamespaceUri($d->namespaceURI);
$xpath->registerNamespace('og', $rootNamespace);
and then use
<meta property="og:image" content="url" />
Am I right?
This should work just fine:
<?php
$html = new DOMDocument();
@$html->loadHTML(file_get_contents('http://www.imdb.com/title/tt0117500/'));
foreach($html->getElementsByTagName('meta') as $meta) {
if(strpos($meta->getAttribute('property'), 'og') !==false) {
echo $meta->getAttribute('content') . '<br/>';
}
}
?>
'og' is a namespace, and so it's not going to get pulled in that fashion. You'll need to define that namespace for your DOMXPath object:
http://php.net/manual/en/domxpath.registernamespace.php
Edit: Here is an example I threw together using VICE's homepage. I pulled the Facebook OpenGraph XML namespace from their Developers site.
<?php
error_reporting(E_ERROR);
$html = file_get_contents("http://www.vice.com/");
$doc = new DomDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
$xp->registerNamespace('og', 'http://ogp.me/ns#');
print_r($xp->query("//meta[@name='og:title']")->item(0)->getAttribute('content'));
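For the markup shown in the question, the tags carry a property attribute rather than name, and the value og:title is plain attribute text, so an XPath query on @property should match without registering a namespace. A minimal sketch, assuming the page markup from the question's edit; the function name openGraphTags is hypothetical:

```php
<?php
// Hypothetical helper: collect all <meta property="og:..."> tags of a page.
function openGraphTags(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $tags = [];
    // Match on the property attribute; its value is ordinary text.
    foreach ($xpath->query("//meta[starts-with(@property, 'og:')]") as $meta) {
        $tags[$meta->getAttribute('property')] = $meta->getAttribute('content');
    }
    return $tags;
}

$sample = '<html><head><meta property="og:image" content="url" />'
    . '<meta property="og:title" content="content here." />'
    . '<title>Page title</title></head><body></body></html>';
print_r(openGraphTags($sample));
```

The original query looked at @name, which is why it returned nothing for pages that only set property.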

How to match and add a class name using preg_replace?

I'm trying to match the class attribute of <html> tag and to add a class name using preg_replace().
Here is what I tried so far:
$content = '<!DOCTYPE html><html lang="en" class="dummy"><head></head><body></body></html>';
$pattern = '/< *html[^>]*class *= *["\']?([^"\']*)/i';
if(preg_match($pattern, $content, $matches)){
$content = preg_replace($pattern, '<html class="$1 my-custom-class">', $content);
}
echo htmlentities($content);
But, I got only this returned:
<!DOCTYPE html><html class="dummy my-custom-class">"><head></head><body></body></html>
The attribute lang="en" is dropped and the tag ends with a duplicated ">">. Please help me.
Please try this code; it works well :)
<?php
$content = '<!DOCTYPE html><html lang="en" class="dummy"><head></head><body></body></html>';
$pattern = '/(<html.*class="([^"]+)"[^>]*>)/i';
$callback_fn = 'process';
$content=preg_replace_callback($pattern, $callback_fn, $content);
function process($matches) {
$matches[1]=str_replace($matches[2],$matches[2]." # My Own Class", $matches[1]);
return $matches[1];
}
echo htmlentities($content);
?>
For the regex approach, remove the " *" after "<" in the pattern.
Use this pattern:
/<html[^>]*class *= *["\']?([^"\']*)/i
I suggest using a DOM parser to parse the HTML:
<?php
libxml_use_internal_errors(true);
$html="<!DOCTYPE html><html lang='en' class='dummy'><head></head><body></body></html>";
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('html') as $node) {
$node->setAttribute('class','dummy my-custom-class');
}
$html=$dom->saveHTML();
echo $html;
OUTPUT:
<!DOCTYPE html>
<html lang="en" class="dummy my-custom-class"><head></head><body></body></html>
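The DOM answer above writes the full class list by hand. A slightly more general sketch reads the existing class attribute and appends the new name only if it is missing; the function name addHtmlClass is an assumption for illustration.

```php
<?php
// Hypothetical helper: add a class to the <html> element, keeping existing ones.
function addHtmlClass(string $html, string $class): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $dom->documentElement; // the <html> element
    // Split the current class attribute into individual names.
    $classes = preg_split('/\s+/', $root->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
    if (!in_array($class, $classes, true)) {
        $classes[] = $class;
    }
    $root->setAttribute('class', implode(' ', $classes));

    return $dom->saveHTML();
}

$content = '<!DOCTYPE html><html lang="en" class="dummy"><head></head><body></body></html>';
echo addHtmlClass($content, 'my-custom-class');
```

Other attributes such as lang are left untouched because only the class attribute is rewritten.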

Make relative links into absolute ones

I am requesting the source code of a website like this:
<? $txt = file_get_contents('http://stats.pingdom.com/qmwwuwoz2b71/522741');
echo $txt; ?>
But I would like to replace the relative links with absolute ones! Basically,
<img src="/images/legend_15s.png"/> and <img src='/images/legend_15s.png'/>
should be replaced by
<img src="http://domain.com/images/legend_15s.png"/>
and
<img src='http://domain.com/images/legend_15s.png'/>
respectively. How can I do this?
This can be achieved with the following:
<?php
$input = file_get_contents('http://stats.pingdom.com/qmwwuwoz2b71/522741');
$domain = 'http://stats.pingdom.com/';
$rep['/href="(?!https?:\/\/)(?!data:)(?!#)/'] = 'href="'.$domain;
$rep['/src="(?!https?:\/\/)(?!data:)(?!#)/'] = 'src="'.$domain;
$rep['/@import[\n+\s+]"\//'] = '@import "'.$domain;
$rep['/@import[\n+\s+]"\./'] = '@import "'.$domain;
$output = preg_replace(
array_keys($rep),
array_values($rep),
$input
);
echo $output;
?>
Which will output links as follows:
/something
will become,
http://stats.pingdom.com//something
And
../something
will become,
http://stats.pingdom.com/../something
But it will not edit "data:image/png;" or anchor tags.
I'm pretty sure the regular expressions can be improved though.
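One way to avoid the regular expressions altogether is to rewrite the attributes through DOMDocument. The sketch below is a simplified illustration under stated assumptions: the function name absolutizeUrls and the tag list are hypothetical, the sample markup is ASCII only, and paths are simply appended to the base, so segments like ../ are not resolved.

```php
<?php
// Hypothetical helper: prefix relative href/src values with a base URL.
function absolutizeUrls(string $html, string $baseUrl): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence markup warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $base = rtrim($baseUrl, '/');
    $targets = ['a' => 'href', 'link' => 'href', 'img' => 'src', 'script' => 'src'];
    foreach ($targets as $tag => $attr) {
        foreach ($doc->getElementsByTagName($tag) as $node) {
            $value = $node->getAttribute($attr);
            // Skip empty values, URLs with a scheme (http:, data:, ...),
            // protocol-relative URLs and in-page anchors.
            if ($value === '' || preg_match('~^(?:[a-z][a-z0-9+.-]*:|//|#)~i', $value)) {
                continue;
            }
            $node->setAttribute($attr, $base . '/' . ltrim($value, '/'));
        }
    }
    return $doc->saveHTML();
}

$sample = '<html><body><img src="/images/legend_15s.png"/>'
    . '<a href="http://example.org/x">x</a><a href="#top">top</a></body></html>';
echo absolutizeUrls($sample, 'http://domain.com/');
```

This handles single and double quoted attributes alike, since the parser normalises them before the values are read.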
This code replaces only the links and images:
<? $txt = file_get_contents('http://stats.pingdom.com/qmwwuwoz2b71/522741');
$txt = str_replace(array('href="', 'src="'), array('href="http://stats.pingdom.com/', 'src="http://stats.pingdom.com/'), $txt);
echo $txt; ?>
I have tested it and it works :)
UPDATED
Here it is done with a regular expression, which works better:
<? $txt = file_get_contents('http://stats.pingdom.com/qmwwuwoz2b71/522741');
$domain = "http://stats.pingdom.com";
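// Skip values that are already absolute (http/https), protocol-relative, data: URIs or anchors.
$txt = preg_replace("/(href|src)=\"(?!https?:|\/\/|data:|#)\/?/", "$1=\"$domain/", $txt);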
echo $txt; ?>
Done :D
You don't need to rewrite the links in PHP. You can use the HTML <base> tag and echo the fetched page inside the body, so the browser resolves the relative URLs for you:
Example:
<!doctype html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Document</title>
<base href="http://yourdomain.com/">
</head>
<body>
<? $txt = file_get_contents('http://stats.pingdom.com/qmwwuwoz2b71/522741');
echo $txt; ?>
</body>
</html>
All relative URLs in the page are then resolved against that base URL, so the href should point at the site the links belong to.
