I am scraping data from a Telugu site.
When I get a string such as "Suriya’s ‘24’ in legal tangle", the curly quotes are not recognized by the PHP functions and are converted into different characters (Issue Link).
Code:
//
include "simple_html_dom.php";
// Get news from telugu site
$url = "http://www.123telugu.com/category/mnews";
$html = file_get_html($url);
$divs = $html->find('div.leading');
$result = array();
$status = FALSE;
$i = 0;
foreach ($divs as $d) {
$status = TRUE;
$title = $d->find('a', 0)->plaintext;
$result[$i]['Title'] = $title;
$link = $d->find('a', 0)->href;
$result[$i]['Link'] = $link;
$title = trim(mysql_real_escape_string($title)); // code for title
$html = file_get_html($link);
// code for image
$image = '';
foreach ($html->find('div.post-content') as $im) {
$image = $im->find('img', 0)->src; // code for image
}
$image = trim(str_replace('//', '', $image));
$result[$i]['Image'] = $image;
// code for content
$content = '';
foreach ($html->find('div.post-content p') as $co) {
$content.= $co->plaintext; // code for content
}
$result[$i]['Content'] = $content;
$i++;
}
echo json_encode(array('Status' => $status, 'Data' => $result));
Adding the following tag at the top of the page should solve the issue:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Solution:
$string= iconv('utf-8', 'us-ascii//TRANSLIT', $string);
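htmlspecialchars_decode() might be the function that you are looking for. Just run the final output from the scraper through this function. Note that it only decodes the special entities (&amp;, &quot;, &#039;, &lt; and &gt;); for other entities, such as encoded curly quotes, html_entity_decode() is the function to use.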
Check out: http://php.net/htmlspecialchars_decode
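As a minimal sketch of the decoding step, assuming the scraped title arrives with the curly quotes as HTML entities (the question does not show the raw string, so the sample value below is hypothetical):

```php
<?php
// Hypothetical raw title, assuming the quotes were scraped as HTML entities.
$raw = 'Suriya&#8217;s &lsquo;24&rsquo; in legal tangle';

// Decode named and numeric entities into real UTF-8 characters.
$decoded = html_entity_decode($raw, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// JSON_UNESCAPED_UNICODE keeps the quotes readable instead of \u2019 escapes.
echo json_encode(array('Title' => $decoded), JSON_UNESCAPED_UNICODE), "\n";
```

If the page that displays the JSON is not served as UTF-8, the quotes can still look wrong even after decoding, which is what the meta tag suggestion above addresses.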
I'm trying to scrape some information from a web page with the Simple HTML Dom Parser. Elements that contain a <script> tag are throwing my counters off.
The tag looks like:
<div id="result-title-2" class="offerList-item-description-title">
<script type="text/javascript">
document.write(getContents('##UD9Jj\>2?E:4 9:;23'));
</script>Romantic hijab
</div>
I either need to be able to get the contents or make my programme skip it.
This is how I am currently grabbing all of my Elements:
foreach($html->find('.offerList-item') as $element)
{
$count++;
foreach($element->find('.offerList-item-image img') as $image)
{
//$images[] = '<img src="'.$image->src.'">'.'<br>';//$img->src;
$images[] = $image->src;//$img->src;
}
foreach($element->find('.offerList-item-description-title') as $title)
{
$titles[] = $title->innertext;
}
//foreach($element->find('.priceRange-from') as $price) {
foreach($element->find('.priceRange-from')as $price){
$pound = $price->find('text',1);
$number = $price->find('text',2);
$prices[] = $pound.' '.$number;
}
foreach($element->find('.offerList-itemWrapper') as $compare) //Get store links
{
$links[] = $idealo.$compare->getAttribute('href');
}
}
Download the HTML page to local disk first, then deal with the script tags:
$url2 = "https://www.idealo.co.uk/mscat.html?q=Dashboard+Cleaner";
//Code to get the file...
$data2 = file_get_contents($url2);
//save as?
$filename = "test.html";
//save the file...
$fh = fopen($filename,"w");
fwrite($fh,$data2);
fclose($fh);
After this code, try the scrape again on the saved file, and finally neutralize the script tags in each title by wrapping them in HTML comments:
$target = array('<script type="text/javascript">', '</script>');
$convert = array('<!--<script type="text/javascript">', '</script>-->');
$result = str_replace($target, $convert, $title);
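An alternative to commenting the tags out is to remove the script elements before reading the text. The following is only a sketch using PHP's built-in DOMDocument rather than Simple HTML Dom; the function name textWithoutScripts is hypothetical, and the sample markup is ASCII only (loadHTML needs a charset hint for anything else).

```php
<?php
// Hypothetical helper: returns the visible text of an HTML fragment
// after dropping every <script> element it contains.
function textWithoutScripts($fragment)
{
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $dom = new DOMDocument();
    $dom->loadHTML('<div>' . $fragment . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list first; removing nodes while iterating it skips entries.
    $scripts = iterator_to_array($dom->getElementsByTagName('script'));
    foreach ($scripts as $script) {
        $script->parentNode->removeChild($script);
    }

    return trim($dom->getElementsByTagName('body')->item(0)->textContent);
}

$sample = <<<'HTML'
<div id="result-title-2" class="offerList-item-description-title">
<script type="text/javascript">
document.write(getContents('##UD9Jj\>2?E:4 9:;23'));
</script>Romantic hijab
</div>
HTML;

echo textWithoutScripts($sample), "\n";
```

Applied to each title's innertext, this keeps one title per element, so the counters stay aligned.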
I am trying to return a JSON array after I parse an RSS feed.
This is my code:
<?php
header('Content-Type: application/json');
$feed = new DOMDocument();
//http://www.espnfc.com/rss
//http://www.football365.com/topical-top-10/rss
$feed->load('http://www.espnfc.com/rss');
$json = array();
$items = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('item');
$json['item'] = array();
$i = 0;
foreach($items as $item) {
$i = $i+1;
$title = $item->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
$link = $item->getElementsByTagName('link')->item(0)->firstChild->nodeValue;
$img = $item->getElementsByTagName('enclosure')->item(0)->attributes->getNamedItem('url')->value;
//$img = $item;echo($url);
$json['item'][] = array("title"=>str_replace(array("\n", "\r", "\t","'"), ' ', $title),"link"=>str_replace(array("\n", "\r", "\t","'"), ' ', $link),"img"=>str_replace(array("\n", "\r", "\t","'"), ' ', $img));
}
print_r($json['item'][0]);
//echo json_encode($json['item']);
?>
After iterating over all the items, I would finally like to echo them as the result:
echo json_encode($json['item']);
The problem is that nothing shows up in the browser. But when I move this line into the foreach block, it shows a result (with redundancy, of course).
Some of the items don't have an <enclosure> tag, so the script gets an error when it tries to access the url attribute. You need to check for this.
$enclosures = $item->getElementsByTagName('enclosure');
if ($enclosures->length) {
$img = $enclosures->item(0)->getAttribute('url'); // reuse the list fetched above
} else {
$img = '';
}
Your code returns the request status "Status Code:500 Internal Server Error".
You can easily see this in the network tab of your browser's developer tools.
This is because the 3rd post has no image.
<?php
// Json Header
header('Content-Type: application/json');
// Get Feed
$feed = new DOMDocument();
$feed->load('http://www.espnfc.com/rss');
// Get Items
$items = $feed->getElementsByTagName('channel')->item(0)->getElementsByTagName('item');
// My json object
$json = array();
$json['item'] = array();
// For each item
foreach($items as $item){
// Get title
$title = $item->getElementsByTagName('title')->item(0)->firstChild->nodeValue;
// Get link
$link = $item->getElementsByTagName('link')->item(0)->firstChild->nodeValue;
// Get image if it exist
$img = $item->getElementsByTagName('enclosure');
if($img->length>0){
$img = $img->item(0)->attributes->getNamedItem('url')->value;
} else {
$img = "";
}
array_push($json['item'], array(
"title" => preg_replace('/(\n|\r|\t|\')/', ' ', $title),
"link" => preg_replace('/(\n|\r|\t|\')/', ' ', $link),
"img" => preg_replace('/(\n|\r|\t|\')/', ' ', $img)
));
}
echo json_encode($json['item']);
?>
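The missing-enclosure case can be reproduced without the network. This is a small sketch that assumes a hand-written two-item feed (the example.com URLs are placeholders) and uses loadXML() in place of load():

```php
<?php
// Hypothetical feed: the second item has no <enclosure>, like the failing post.
$rss = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel>
<item><title>First</title><link>http://example.com/1</link>
<enclosure url="http://example.com/1.jpg" type="image/jpeg" length="0"/></item>
<item><title>Second</title><link>http://example.com/2</link></item>
</channel></rss>
XML;

$feed = new DOMDocument();
$feed->loadXML($rss);

$out = array();
foreach ($feed->getElementsByTagName('item') as $item) {
    // item(0) returns null when the tag is absent, so test before dereferencing.
    $enclosure = $item->getElementsByTagName('enclosure')->item(0);
    $out[] = array(
        'title' => $item->getElementsByTagName('title')->item(0)->textContent,
        'link'  => $item->getElementsByTagName('link')->item(0)->textContent,
        'img'   => $enclosure ? $enclosure->getAttribute('url') : '',
    );
}

echo json_encode($out), "\n";
```

Without the null check, the second item triggers the error that stops the script before json_encode() is reached.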
Thanks for taking the time to read my post... I'm trying to extract some information from my website using Simple HTML Dom...
I have it reading from the HTML source ok, now I'm just trying to extract the information that I need. I have a feeling I'm going about this in the wrong way... Here's my script...
<?php
include_once('simple_html_dom.php');
// create doctype
$dom = new DOMDocument("1.0");
// display document in browser as plain text
// for readability purposes
//header("Content-Type: text/plain");
// create root element
$xmlProducts = $dom->createElement("products");
$dom->appendChild($xmlProducts);
$html = file_get_html('http://myshop.com/small_houses.html');
$html .= file_get_html('http://myshop.com/medium_houses.html');
$html .= file_get_html('http://myshop.com/large_houses.html');
//Define my variable for later
$product['image'] = '';
$product['title'] = '';
$product['description'] = '';
foreach($html->find('img') as $src){
if (strpos($src->src,"http://myshop.com") === false) {
$src->src = "http://myshop.com/$src->src";
}
$product['image'] = $src->src;
}
foreach($html->find('p[class*=imAlign_left]') as $description){
$product['description'] = $description->innertext;
}
foreach($html->find('span[class*=fc3]') as $title){
$product['title'] = $title->innertext;
}
echo $product['img'];
echo $product['description'];
echo $product['title'];
?>
I put echo's on the end for sake of testing...but I'm not getting anything... Any pointers would be a great HELP!
Thanks
Charles
file_get_html() returns an HTML DOM object, and you cannot meaningfully concatenate objects. Although the object has a __toString method, concatenating turns the result into a plain string, which is more than likely corrupt in some way and no longer has a find() method. Try the following:
<?php
include_once('simple_html_dom.php');
// create doctype
$dom = new DOMDocument("1.0");
// display document in browser as plain text
// for readability purposes
//header("Content-Type: text/plain");
// create root element
$xmlProducts = $dom->createElement("products");
$dom->appendChild($xmlProducts);
$pages = array(
'http://myshop.com/small_houses.html',
'http://myshop.com/medium_houses.html',
'http://myshop.com/large_houses.html'
);
foreach($pages as $page)
{
$product = array();
$source = file_get_html($page);
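foreach($source->find('img') as $src)
{
// Prefix relative image paths with the site root; keep absolute ones unchanged
if (strpos($src->src,"http://myshop.com") === false)
{
$src->src = "http://myshop.com/$src->src";
}
$product['image'] = $src->src;
}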
foreach($source->find('p[class*=imAlign_left]') as $description)
{
$product['description'] = $description->innertext;
}
foreach($source->find('span[class*=fc3]') as $title)
{
$product['title'] = $title->innertext;
}
//debug purposes!
echo "Current Page: " . $page . "\n";
print_r($product);
echo "\n\n\n"; //Clear separator
}
?>
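Why the concatenation fails can be shown without the library. The class below is a hypothetical stand-in for a parsed page, not the real simple_html_dom class; it only illustrates what .= does to any object that defines __toString:

```php
<?php
// Hypothetical stand-in for a parsed page object.
class FakePage
{
    private $html;

    public function __construct($html)
    {
        $this->html = $html;
    }

    public function __toString()
    {
        return $this->html;
    }

    public function find($needle)
    {
        return substr_count($this->html, $needle);
    }
}

$page = new FakePage('<img src="a.png">');
$page .= new FakePage('<img src="b.png">'); // both operands are cast to strings

var_dump(is_object($page)); // bool(false)
var_dump(is_string($page)); // bool(true)
// $page->find('img') would now be a fatal error: find() called on a string.
```

Looping over the URLs and parsing each page separately, as in the answer above, keeps every page as an object that still supports find().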
Searching Stack Overflow, I found an answer for my need, but I can't figure out exactly how to use it. If someone could give me a hint, it would be appreciated!
Here's my need: I'm using WordPress and I would like to add automatic IDs to <...> tags. I found this answer by "mario":
If you have a coherent input like that, then you can use regular expressions. In this case it's both very acceptable and simple:
$html = preg_replace_callback("#<(h[1-6])>(.*?)</\\1>#", "retitle", $html);
function retitle($match) {
list($_unused, $h2, $title) = $match;
$id = strtolower(strtr($title, " .", "--"));
return "<$h2 id='$id'>$title</$h2>"; }
The id conversion needs a bit more work. And to make the regex more reliable, the inner text match pattern (.*?) could be written as ([^<>]*), for example.
H2 tag auto ID in php string
So I tried to apply this to my script, but it doesn't work well at all. Here is my code:
<?php
$html = get_the_content();
$html = preg_replace_callback("#<(h[1-6])>(.*?)</\\1>#", "retitle", $html);
function retitle($match) {
list($_unused, $h2, $title) = $match;
$id = strtolower(strtr($title, " .", "--"));
return "<$h2 id='$id'>$title</$h2>";
}
if(have_posts()) : while(have_posts()) : the_post(); // Checks that the content exists
echo $html;
endwhile;
endif;
?>
Don't use regex to solve that problem. Use DOMDocument instead:
if (empty($content)) return '';
$dom = new DomDocument();
libxml_use_internal_errors(true);
$html = '<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>
<body>'.$content.'</body>
</html>';
$dom->loadHtml($html);
$hTAGs = $dom->getElementsByTagName($tag);
foreach ($hTAGs as $hTAG) {
if (!$hTAG->hasAttribute('id')) {
$title = $hTAG->nodeValue;
$id = iconv('UTF-8', 'ASCII//TRANSLIT', $title);
$id = preg_replace('/[^a-zA-Z0-9\s-]/', '', $id); // hyphen last, so it is always a literal
$hTAG->setAttribute('id', $id);
}
}
$content = '';
$children = $dom->getElementsByTagName('body')->item(0)->childNodes;
foreach ($children as $child) {
$content .= $dom->saveXml($child);
}
return $content;
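The snippet above relies on $content and $tag being supplied by an enclosing function that is not shown. A self-contained sketch of the same idea follows; the function name addHeadingIds, its default tag and the slug rules are assumptions for illustration, not part of the original answer:

```php
<?php
// Hypothetical helper: gives every heading of the given tag a slug-style id.
function addHeadingIds($content, $tag = 'h2')
{
    if ($content === '') {
        return '';
    }

    $previous = libxml_use_internal_errors(true); // tolerate imperfect markup
    $dom = new DOMDocument();
    $dom->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head>'
        . '<body>' . $content . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName($tag) as $heading) {
        if (!$heading->hasAttribute('id')) {
            // Lowercase, then collapse anything that is not a-z or 0-9 into one hyphen.
            $slug = strtolower(trim($heading->textContent));
            $slug = trim(preg_replace('/[^a-z0-9]+/', '-', $slug), '-');
            $heading->setAttribute('id', $slug);
        }
    }

    // Serialize only the children of <body>, dropping the wrapper added above.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo addHeadingIds('<h2>Hello World</h2><p>Text</p><h2 id="keep">Other</h2>'), "\n";
```

In WordPress this would be called on the post content inside the loop, for example on the value returned by get_the_content(), rather than before the_post() has run.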
Never, ever use RegEx for HTML, ok? Just accept this. Or read the numerous posts on here explaining why not.
DOMDocument is ugly and evil. Use simple_html_dom instead, it's much simpler:
include 'simple_html_dom.php';
$html = str_get_html('<h2>hello</h2><h3>world</h3><h2 id="123">how r ya</h2>'); // double quotes inside, so the PHP string stays valid
$h2s = $html->find("h2");
foreach($h2s as $h2)
{
if(!$h2->hasAttribute("id")) $h2->id = "title";
}
echo $html->save();
From a PHP script I'm downloading an RSS feed like this:
$fp = fopen('http://news.google.es/news?cf=all&ned=es_ve&hl=es&output=rss','r')
or die('Error reading RSS data.');
The feed is a Spanish news feed. After downloading the file, I parse all the info into one variable that holds only the content of the <description> tag of every <item>. The issue is that when I echo the variable, all the text comes out with what looks like an HTML encoding:
echo($result); // this print: el ministerio pãºblico investigarã¡ la publicaciã³n en la primera pã¡gina
I could write a HUGE switch statement that searches for every such character and replaces it with the corresponding one, like ã¡ for Á and so on, but is there no way to do this with a single function? Or even better, is there no way to download the content to $fp without the HTML encoding? Thanks!
Actual code:
<?php
$acumula="";
$insideitem = false;
$tag = '';
$title = '';
$description = '';
$link = '';
function startElement($parser, $name, $attrs) {
global $insideitem, $tag, $title, $description, $link;
if ($insideitem) {
$tag = $name;
} elseif ($name == 'ITEM') {
$insideitem = true;
}
}
function endElement($parser, $name) {
global $insideitem, $tag, $title, $description, $link, $acumula;
if ($name == 'ITEM') {
$acumula = $acumula . (trim($title)) . "<br>" . (trim($description));
$title = '';
$description = '';
$link = '';
$insideitem = false;
}
}
function characterData($parser, $data) {
global $insideitem, $tag, $title, $description, $link;
if ($insideitem) {
switch ($tag) {
case 'TITLE':
$title .= $data;
break;
case 'DESCRIPTION':
$description .= $data;
break;
case 'LINK':
$link .= $data;
break;
}
}
}
$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, 'startElement', 'endElement');
xml_set_character_data_handler($xml_parser, "characterData");
$fp = fopen('http://news.google.es/news?cf=all&ned=es_ve&hl=es&output=rss','r')
or die('Error reading RSS data.');
while ($data = fread($fp, 4096)) {
xml_parse($xml_parser, $data, feof($fp))
or die(sprintf('XML error: %s at line %d',
xml_error_string(xml_get_error_code($xml_parser)),
xml_get_current_line_number($xml_parser)));
}
//echo $acumula;
fclose($fp);
xml_parser_free($xml_parser);
echo($acumula); // THIS IS $RESULT!
?>
EDIT
Since you're already using the XML parser, you're guaranteed the encoding is UTF-8.
If your page is encoded in ISO-8859-1, or even ASCII, you can do this to convert:
$result = mb_convert_encoding($result, "HTML-ENTITIES", "UTF-8");
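On PHP 8.2 and later, the HTML-ENTITIES pseudo-encoding of mb_convert_encoding() is deprecated. A sketch of an equivalent conversion with mb_encode_numericentity() is shown below; the helper name encodeNonAscii is hypothetical, and the mbstring extension is assumed to be loaded:

```php
<?php
// Hypothetical helper: turns every non-ASCII character of a UTF-8 string
// into a numeric HTML entity, leaving ASCII (including markup) untouched.
function encodeNonAscii($utf8)
{
    // Map format: [first code point, last code point, offset, mask].
    $map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

// "\u{00F3}" is the letter o with an acute accent, written as an escape.
echo encodeNonAscii("publicaci\u{00F3}n<br>"), "\n";
```

The resulting string is pure ASCII, so it displays the same whether the page is served as ISO-8859-1 or UTF-8.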
Use a library that handles this for you, e.g. the DOM extension or SimpleXML. Example:
$d = new DOMDocument();
$d->load('http://news.google.es/news?cf=all&ned=es_ve&hl=es&output=rss');
//now all the data you get will be encoded in UTF-8
Example with SimpleXML:
$url = 'http://news.google.es/news?cf=all&ned=es_ve&hl=es&output=rss';
if ($sxml = simplexml_load_file($url)) {
echo htmlspecialchars($sxml->channel->title); //UTF-8
}
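To show how SimpleXML could replace the handler functions from the question, here is a sketch that builds the same title and description string per item. It assumes an inline sample feed in place of the Google News URL and a script file saved as UTF-8:

```php
<?php
// Hypothetical sample feed standing in for the downloaded RSS.
$rss = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"><channel><title>Noticias</title>
<item><title>Titular</title>
<description>El ministerio público investigará la publicación</description></item>
</channel></rss>
XML;

$lines = array();
$sxml = simplexml_load_string($rss);
if ($sxml !== false) {
    foreach ($sxml->channel->item as $item) {
        // Casting a SimpleXMLElement to string yields its text as UTF-8.
        $lines[] = trim((string) $item->title) . '<br>' . trim((string) $item->description);
    }
}

echo implode("\n", $lines), "\n";
```

The text stays UTF-8 from parsing to output, so the page that echoes it needs to declare UTF-8 as well.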
You can use PHP's DOMDocument to strip the HTML encoding tags,
and PHP's encoding conversion functions to change the encoding of this string.