Simple HTML DOM - different result than expected - PHP

I'm trying to retrieve info from a webpage using simple_html_dom, like this:
<?php
include_once('dom/simple_html_dom.php');
$urlpart = "http://w2.brreg.no/motorvogn/";
$url = "http://w2.brreg.no/motorvogn/heftelser_motorvogn.jsp?regnr=BR15597";
$html = file_get_html($url);
foreach ($html->find('a') as $element) {
    if (preg_match('*dagb*', $element)) {
        $result = $urlpart . $element->href;
        $resultcontent = file_get_contents($result);
        echo $resultcontent;
    }
}
?>
The $result variable first gives me this URL:
http://w2.brreg.no/motorvogn/dagbokutskrift.jsp?dgbnr=2011365320&embnr=0&regnr=BR15597
When I access the above URL in my browser, I get the content I expect.
When I retrieve the content into $resultcontent, I get a different result: a page that says "Invalid input" in Norwegian.
Any ideas why?

foreach ($html->find('a') as $element) {
    if (preg_match('*dagb*', $element)) {
        $result = $urlpart . $element->href;
        $resultcontent = file_get_contents(html_entity_decode($result));
        echo $resultcontent;
    }
}
This should do the trick.

The problem is with your URL query parameters.
http://w2.brreg.no/motorvogn/dagbokutskrift.jsp?dgbnr=2011365320&embnr=0&regnr=BR15597
The ampersands in the href come out of the parsed HTML still entity-encoded (a browser will even render the bare '&reg' as the ® symbol), so the URL you pass to file_get_contents() is not the URL the server expects, and the server rejects the request.
You can wrap the URL in html_entity_decode() before fetching it:
$resultcontent = file_get_contents(html_entity_decode($result));
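As a quick illustration (hypothetical values, assuming the parsed href still carries &amp;-encoded ampersands):
$result = "http://w2.brreg.no/motorvogn/dagbokutskrift.jsp?dgbnr=2011365320&amp;embnr=0&amp;regnr=BR15597";
echo html_entity_decode($result);
// prints http://w2.brreg.no/motorvogn/dagbokutskrift.jsp?dgbnr=2011365320&embnr=0&regnr=BR15597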

Related

PHP SimplePie Error: $item->get_enclosure() always returns true

I'm trying to build a news reader using the PHP SimplePie library. I get the image from a feed using this code:
if ($enclosure = $item->get_enclosure()) {
    $imageLink = $enclosure->get_link();
    echo "<img src=\"$imageLink\">";
}
When I fetch an RSS feed that doesn't have an enclosure, it echoes an image tag with the following source:
src="//?#"
The above code works fine with feeds that have enclosures.
I also tried this code:
if ($enclosure = $item->get_enclosure()) {
    if ($imageLink = $enclosure->get_link()) {
        echo "<img src=\"$imageLink\">";
    }
}
Can someone tell me what I am doing wrong in these snippets?
It seems the $imageLink value is //?#, so if you do
if ($imageLink = $enclosure->get_link())
the result is true, since any non-empty string is truthy.
Check the exact value when there is no enclosure, and then change the condition accordingly, e.g.:
$imageLink = $enclosure->get_link();
if ($imageLink !== "//?#") {
    echo "<img src=\"$imageLink\">";
}
You can check the exact value using
if ($enclosure = $item->get_enclosure()) {
    $imageLink = $enclosure->get_link();
    var_dump($imageLink);
}
Check whether $imageLink is assigned a value anywhere else in your code; that could well be the error. Use print_r or var_dump at each step to find exactly where your code assigns that value to the variable.

PHP Web Scraping And JSON or Array Output

I'm experimenting with scraping Amazon in PHP, but I don't know what I am doing wrong: I can't access all the data I scraped. Here is my code:
<?php
include_once('simple_html_dom.php'); // file_get_html() requires the Simple HTML DOM library (path may vary)
$url = 'https://www.amazon.com/s/ref=nb_sb_ss_c_1_9?url=search-alias%3Daps&field-keywords=most+sold+items+on+amazon&sprefix=most+sold%2Caps%2C435&crid=348CE8G406XVG&rh=i%3Aaps%2Ck%3Amost+sold+items+on+amazon';
$html = file_get_html($url);
foreach ($html->find('h2[class=a-size-medium]') as $element) {
    echo "<li>" . $element->plaintext . "</li><br>";
}
?>
The foreach loop outputs the plain text correctly, but I want to store the plain text in a variable or array. The problem is that when I do that and print the result, I only get the last string. I have done lots of research but can't find what I'm doing wrong. Any help will be appreciated. Here is what I'm trying:
<?php
include_once('simple_html_dom.php');
$url = 'https://www.amazon.com/s/ref=nb_sb_ss_c_1_9?url=search-alias%3Daps&field-keywords=most+sold+items+on+amazon&sprefix=most+sold%2Caps%2C435&crid=348CE8G406XVG&rh=i%3Aaps%2Ck%3Amost+sold+items+on+amazon';
$hold = array();
$html = file_get_html($url);
foreach ($html->find('h2[class=a-size-medium]') as $element) {
    $hold = $element->plaintext;
}
print_r($hold);
?>
The second snippet outputs only the last string of the plain text, which is: "Rubbermaid LunchBlox Side Container Kit, 2-Pack, 1806176". I also tried encoding and decoding the plain text, but nothing changed. What am I doing wrong?
Instead of overwriting $hold with a string on each iteration, append new elements to the array:
$hold[] = $element->plaintext;
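For completeness, a minimal sketch of the corrected loop, reusing the URL and selector from the question:
$hold = array();
$html = file_get_html($url);
foreach ($html->find('h2[class=a-size-medium]') as $element) {
    $hold[] = $element->plaintext; // append instead of overwrite
}
print_r($hold); // now prints every heading, not just the last one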

PHP - Parsing strings

I'm trying to extract data from the anchor URLs of a webpage, i.e.:
require 'simple_html_dom.php';
$html = file_get_html('http://www.example.com');
foreach ($html->find('a') as $element) {
    $href = $element->href;
    $name = $surname = $id = 0;
    parse_str($href);
    echo $name;
}
The problem is that it doesn't work for some reason. All URLs are in the following form:
name=James&surname=Smith&id=2311245
Now, the strange thing is, if I execute
echo $href;
I get the desired output. However, that string won't parse for some reason, and it has a length of 43 according to the strlen() function. If, however, I pass 'name=James&surname=Smith&id=2311245' directly as the parse_str() argument, it works just fine. What could be the problem?
I'm going to guess that your target page is actually one of the rare pages that correctly encodes & as &amp; in its links. Example:
<a href="somepage.php?name=James&amp;surname=Smith&amp;id=3211245">
That would also explain the strlen() result: the two &amp; encodings add 8 characters to the 35-character decoded string, giving 43. To parse this string, you first need to unescape the &amp; entities. You can do this with a simple str_replace if you like.
Presuming the links are absolute, you just need the query string. You can use parse_url to get it, then parse_str with an out parameter to get an array:
$html = file_get_html('http://www.example.com');
foreach ($html->find('a') as $element) {
    // unescape the &amp; entities first (str_replace or html_entity_decode both work)
    $href = html_entity_decode($element->href);
    $url_components = parse_url($href);
    parse_str($url_components['query'], $out);
    var_dump($out);
}
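A quick demonstration of the out parameter, using the sample query string from the question:
parse_str('name=James&surname=Smith&id=2311245', $out);
print_r($out);
// Array ( [name] => James [surname] => Smith [id] => 2311245 )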

How to get data from a URL link into our website

I have a problem with my website. I want to get all the flight schedule data from another website. I looked at its source code and found the URL it uses to process the data. Can somebody tell me how to fetch the data from that URL and then display it on our website with PHP?
You can do it with the file_get_contents() function, which returns the HTML of the provided URL. Then use an HTML parser to get the required data.
$html = file_get_contents("http://website.com");
$dom = new DOMDocument();
$dom->loadHTML($html);
$nodes = $dom->getElementsByTagName('h3');
foreach ($nodes as $node) {
    echo $node->nodeValue . "<br>"; // prints the text of each <h3> tag
}
Another way to extract the data is with preg_match_all():
$html = file_get_contents($_REQUEST['url']);
// specify the class whose contents you want to capture
preg_match_all('/<div class="swrapper">(.*?)<\/div>/s', $html, $matches);
foreach ($matches[1] as $node) {
    echo $node . "<br><br><br>";
}
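Since regex scraping breaks easily when the markup changes, here is a sketch of the same extraction with DOMXPath (the swrapper class comes from the snippet above; the URL is a placeholder):
libxml_use_internal_errors(true); // real-world HTML is rarely well-formed
$dom = new DOMDocument();
$dom->loadHTML(file_get_contents("http://website.com"));
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//div[@class="swrapper"]') as $node) {
    echo $node->textContent . "<br>";
}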
Use file_get_contents
Sample code
<?php
$homepage = file_get_contents('http://www.google.com/');
echo $homepage;
?>
Yes, sure... Use the file_get_contents($url) function to get the source code of the target page, or use cURL if you prefer, and scrape all the data you need with preg_match_all().
Note: if the target URL is https:// and your PHP build lacks OpenSSL support, file_get_contents() will fail, so use cURL to get the source code in that case.
Example
http://stackoverflow.com/questions/2838253/php-curl-preg-match-extract-text-from-xhtml
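For reference, a minimal cURL sketch (the URL is a placeholder):
$ch = curl_init('https://website.com/schedule');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);
// $html now holds the page source, ready for DOMDocument or preg_match_all()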

SimpleXML feed showing blank arrays - how do I get the content out?

I'm trying to get the image out of an RSS feed using SimpleXML, parsing the data out via an array and back into the foreach loop...
In the source the [description] array is shown as blank, though I've managed to pull it out using another loop. However, I can't for the life of me work out how to pull in the next array, and subsequently the image for each post!
Help?
you can view my progress here: http://dev.thebarnagency.co.uk/tfolphp.php
here's the original feed: feed://feeds.feedburner.com/TheFutureOfLuxury?format=xml
$xml_feed_url = 'http://feeds.feedburner.com/TheFutureOfLuxury?format=xml';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $xml_feed_url);
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$xml = curl_exec($ch);
curl_close($ch);

function produce_XML_object_tree($raw_XML) {
    libxml_use_internal_errors(true);
    try {
        $xmlTree = new SimpleXMLElement($raw_XML);
    } catch (Exception $e) {
        // Something went wrong.
        $error_message = 'SimpleXMLElement threw an exception.';
        foreach (libxml_get_errors() as $error_line) {
            $error_message .= "\t" . $error_line->message;
        }
        trigger_error($error_message);
        return false;
    }
    return $xmlTree;
}

$feed = produce_XML_object_tree($xml);
print_r($feed);

foreach ($feed->channel->item as $item) {
    // $desc = $item->description;
    echo 'link<br>';
    foreach ($item->description as $desc) {
        echo $desc;
    }
}
thanks
You can use
wp_remote_get( $url, $args );
which I found here: http://dynamicweblab.com/2012/09/10-useful-wordpress-functions-to-reduce-your-development-time
You can get more details about this function at http://codex.wordpress.org/Function_API/wp_remote_get
Hope this helps.
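If the site runs WordPress, a hedged sketch of fetching the feed that way (feed URL taken from the question):
$response = wp_remote_get('http://feeds.feedburner.com/TheFutureOfLuxury?format=xml');
if (!is_wp_error($response)) {
    $xml = wp_remote_retrieve_body($response); // raw XML string
    $feed = simplexml_load_string($xml);
}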
I'm not entirely clear what your problem is here - the code you provided appears to work fine.
You mention "the image for each post", but I can't see any images specifically labelled in the XML. What I can see is that inside the HTML in the content node of the XML, there is often an <img> tag. As far as the XML document is concerned, this entire blob of HTML is just one string delimited with the special tokens <![CDATA[ and ]]>. If you get this string into a PHP variable (using (string)$item->content), you can then find a way of extracting the <img> tag from inside it - but note that the HTML is unlikely to be valid XML.
The other thing to mention is that SimpleXML is not, as you repeatedly refer to it, an array - it is an object, and a particularly magic one at that. Everything you do to the SimpleXML object - foreach ( $nodeList as $node ), isset($node), count($nodeList), $node->childNode, $node['attribute'], etc - is actually a function call, often returning another SimpleXML object. It's designed for convenience, so in many cases writing what seems natural will be more helpful than inspecting the object.
For instance, since each item has only one description you don't need the inner foreach loop - the following will all have the same effect:
foreach ($item->description as $desc) { echo $desc; } (loop over all child elements with tag name description)
echo $item->description[0]; (access the first description child node specifically)
echo $item->description; (access the first/only description child node implicitly; this is why you can write $feed->channel->item and it would still work if there was a second channel element, it would just ignore it)
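To actually pull the image out of that HTML blob, one option is a small regex sketch (hedged: the description node is taken from the question's loop; some feeds carry the HTML in content:encoded instead, and regexes on HTML are inherently brittle):
$htmlBlob = (string)$item->description;
if (preg_match('/<img[^>]+src="([^"]+)"/i', $htmlBlob, $m)) {
    echo '<img src="' . $m[1] . '">'; // first image in the post, if any
}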
I had an issue where simplexml_load_file was returning some array sections blank as well, even though they contained data when viewing the source URL directly.
It turns out the data was there, but because it was CDATA it was not being displayed properly.
Is this perhaps the same issue the OP was having?
Anyway, my solution was this. Initially I used:
$feed = simplexml_load_file($rss_url);
And I got an empty description back, like this:
[description] => SimpleXMLElement Object
(
)
But then I found a solution in the comments on the PHP.net site, saying I needed to use LIBXML_NOCDATA:
https://www.php.net/manual/en/function.simplexml-load-file.php
$feed = simplexml_load_file($rss_url, "SimpleXMLElement", LIBXML_NOCDATA);
After making this change, I got description like this:
[description] => My description text!
