Hi there,
I would like to create a homepage and automatically display the ads of mobile.de on it.
For this there is an API from mobile.de:
https://services.mobile.de/manual/search-api.html
I have the right access data and when I start the call via the browser:
https://services.mobile.de/search-api/search?customerNumber=503300
I get this result:
<search:search-result xmlns:seller="http://services.mobile.de/schema/seller" xmlns:ad="http://services.mobile.de/schema/ad" xmlns:search="http://services.mobile.de/schema/search" xmlns:financing="http://services.mobile.de/schema/common/financing-1.0" xmlns:resource="http://services.mobile.de/schema/resource" xmlns:error="http://services.mobile.de/schema/common/error-1.0">
<search:total>4</search:total>
<search:page-size>20</search:page-size>
<search:current-page>1</search:current-page>
<search:max-pages>1</search:max-pages>
<search:ads>
<ad:ad key="266399529" url="https://services.mobile.de/search-api/ad/266399529">
<ad:creation-date value="2018-11-19T07:53:58+01:00"/>
<ad:modification-date value="2018-11-19T07:53:58+01:00"/>
<ad:detail-page url="https://suchen.mobile.de/auto-inserat/porsche-997-gt3-rs-ruf-4-0-einzelst%C3%BCck-allrad-solms/266399529.html?source=api"/>
<ad:vehicle>
Looks good to me!
Now I would like to loop over the individual ads, and that is where the problems start.
The individual ads are grouped by this line:
<ad:ad key="266399529" url="https://services.mobile.de/search-api/ad/266399529">
Through my long years of experience and especially through the Internet, I have come to the following code:
error_reporting(E_ALL);
ini_set('display_errors', true);
$process = curl_init("https://services.mobile.de/search-api/search?customerNumber=503300");
curl_setopt($process, CURLOPT_HTTPHEADER, array('Content-Type: application/xml'));
curl_setopt($process, CURLOPT_HEADER, 0);
curl_setopt($process, CURLOPT_USERPWD, "username:password");
curl_setopt($process, CURLOPT_TIMEOUT, 30);
curl_setopt($process, CURLOPT_RETURNTRANSFER, TRUE);
$return = curl_exec($process);
curl_close($process);
$xml = simplexml_load_string($return);
$ns = $xml->children('http://services.mobile.de/schema/ad');
foreach($ns as $ad) {
$attributes = $ad->attributes();
$key = (string) $attributes['key'];
var_dump($key);
}
Unfortunately I get no output at all: an empty page with no error message.
The problem is that you have another element in between your root node and the <ad:ad> element. You need to go via the <search:ads> element...
$ns = $xml->children('http://services.mobile.de/schema/search')->ads
->children('http://services.mobile.de/schema/ad');
To access the details of the ads, you need to again look at the structure and see what elements you want and what namespace they are in. So for the text of the category element of each ad, you can use a loop and...
$ns = $xml->children('http://services.mobile.de/schema/search')->ads
->children('http://services.mobile.de/schema/ad');
foreach($ns as $ad) {
foreach ( $ad->vehicle as $vehicle ) {
echo (string)$vehicle->category[0]
->children("http://services.mobile.de/schema/resource")
->{'local-description'}.PHP_EOL;
}
}
A couple of things to note here: the <resource:local-description> element is in a different namespace, which is why the code calls ->children() again with that other namespace URI. Also, because the name contains a -, you have to access it as ->{'local-description'} to make it a valid property name.
Lastly, since each of these expressions returns the element it points to (a SimpleXMLElement object), you should cast the result to a string (using (string) at the start) to make sure you end up with just the text of the element.
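To try the namespace handling without calling the API, here is a minimal self-contained sketch. The sample XML is made up for illustration (only its nesting follows the response shown in the question), and the function name adCategories is an assumption, not part of any library. It calls ->children() with an explicit namespace URI at every level, which is more verbose than the loop above but makes each namespace switch visible.

```php
<?php
// Hypothetical sample data; only the element nesting mirrors the real response.
$sampleXml = <<<'XML'
<search:search-result xmlns:search="http://services.mobile.de/schema/search" xmlns:ad="http://services.mobile.de/schema/ad" xmlns:resource="http://services.mobile.de/schema/resource">
<search:ads>
<ad:ad key="1001"><ad:vehicle><ad:category key="Cabrio"><resource:local-description>Cabrio/Roadster</resource:local-description></ad:category></ad:vehicle></ad:ad>
<ad:ad key="1002"><ad:vehicle><ad:category key="Limousine"><resource:local-description>Limousine</resource:local-description></ad:category></ad:vehicle></ad:ad>
</search:ads>
</search:search-result>
XML;

// Returns one ['key' => ..., 'category' => ...] entry per <ad:ad> element.
function adCategories(string $xmlString): array
{
    $nsSearch   = 'http://services.mobile.de/schema/search';
    $nsAd       = 'http://services.mobile.de/schema/ad';
    $nsResource = 'http://services.mobile.de/schema/resource';

    $xml = simplexml_load_string($xmlString);
    if ($xml === false) {
        return []; // not well-formed XML
    }

    $result = [];
    // Root -> <search:ads> -> every <ad:ad>
    foreach ($xml->children($nsSearch)->ads->children($nsAd) as $ad) {
        // The key attribute has no namespace, so attributes() needs no argument.
        $key = (string) $ad->attributes()->key;
        foreach ($ad->children($nsAd)->vehicle as $vehicle) {
            foreach ($vehicle->children($nsAd)->category as $category) {
                $result[] = [
                    'key'      => $key,
                    'category' => (string) $category->children($nsResource)->{'local-description'},
                ];
            }
        }
    }
    return $result;
}

print_r(adCategories($sampleXml));
```

With the real API response you would pass the string returned by curl_exec() instead of $sampleXml.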
As an alternative you might also use an xpath expression using the namespace prefix:
//search:search-result/search:ads/ad:ad
For example:
$ads = $xml->xpath('//search:search-result/search:ads/ad:ad');
foreach ($ads as $ad) {
$key = (string)$ad->attributes()->key;
}
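If you would rather not depend on the prefixes the document happens to declare, the prefixes can also be registered explicitly before running the query. The following is a sketch under that assumption: the short prefixes s and a, the function name adKeys and the sample XML are all made up for illustration; registerXPathNamespace() and xpath() are the standard SimpleXML methods.

```php
<?php
// Hypothetical sample data with the same nesting as the API response.
$sampleXml = <<<'XML'
<search:search-result xmlns:search="http://services.mobile.de/schema/search" xmlns:ad="http://services.mobile.de/schema/ad">
<search:ads>
<ad:ad key="1001"/>
<ad:ad key="1002"/>
</search:ads>
</search:search-result>
XML;

// Returns the key attribute of every <ad:ad> element as a list of strings.
function adKeys(string $xmlString): array
{
    $xml = simplexml_load_string($xmlString);
    if ($xml === false) {
        return [];
    }

    // Bind our own prefixes to the namespace URIs; the URIs are what matter.
    $xml->registerXPathNamespace('s', 'http://services.mobile.de/schema/search');
    $xml->registerXPathNamespace('a', 'http://services.mobile.de/schema/ad');

    $ads = $xml->xpath('/s:search-result/s:ads/a:ad');
    if (!is_array($ads)) {
        return []; // the query could not be evaluated
    }

    $keys = [];
    foreach ($ads as $ad) {
        $keys[] = (string) $ad->attributes()->key;
    }
    return $keys;
}

print_r(adKeys($sampleXml));
```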
I've been trying to write a simple PHP script to pull data from an ISBN database site, and for some reason I've had nothing but issues with file_get_contents. I've managed to get something working now, but I would still like to know why the original version failed.
The code below does not populate $page with any content, so the preg_match calls that follow find nothing. Does anyone know what is stopping it?
$links = array ('
http://www.isbndb.com/book/2009_cfa_exam_level_2_schweser_practice_exams_volume_2','
http://www.isbndb.com/book/uniform_investment_adviser_law_exam_series_65','
http://www.isbndb.com/book/waterworks_a02','
http://www.isbndb.com/book/winning_the_toughest_customer_the_essential_guide_to_selling','
http://www.isbndb.com/book/yale_daily_news_guide_to_fellowships_and_grants'
); // array of URLs
foreach ($links as $link)
{
$page = file_get_contents($link);
#print $page;
preg_match("#<h1 itemprop='name'>(.*?)</h1>#is",$page,$title);
preg_match("#<a itemprop='publisher' href='http://isbndb.com/publisher/(.*?)'>(.*?)</a>#is",$page,$publisher);
preg_match("#<span>ISBN10: <span itemprop='isbn'>(.*?)</span>#is",$page,$isbn10);
preg_match("#<span>ISBN13: <span itemprop='isbn'>(.*?)</span>#is",$page,$isbn13);
echo '<tr>
<td>'.$title[1].'</td>
<td>'.$publisher[2].'</td>
<td>'.$isbn10[1].'</td>
<td>'.$isbn13[1].'</td>
</tr>';
#exit();
}
My guess is that your URLs are not the direct ones. The proper URLs do not have the www. part: if you request any of yours and inspect the returned headers, you'll see that you are redirected (HTTP 301) to another URL.
In my opinion the best way to handle this is to use cURL and call curl_setopt with the options CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS.
Of course you should also trim your URLs beforehand, just to be sure that stray whitespace is not the problem.
Example here:
$curl = curl_init();
foreach ($links as $link) {
curl_setopt($curl, CURLOPT_URL, $link);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($curl, CURLOPT_MAXREDIRS, 5); // max 5 redirects
$result = curl_exec($curl);
if (! $result) {
continue; // if $result is empty or false - ignore and continue;
}
// do what you need to do here
}
curl_close($curl);
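The trimming step mentioned above matters here because each string in the question's array opens its quote before a line break, so every URL except possibly the first starts with a newline. A minimal sketch of the cleanup, assuming the helper name cleanUrls (made up for this example):

```php
<?php
// Strips surrounding whitespace from each URL and drops entries that end up empty.
function cleanUrls(array $links): array
{
    $clean = [];
    foreach ($links as $link) {
        $url = trim($link); // removes spaces, tabs and line breaks at both ends
        if ($url !== '') {
            $clean[] = $url;
        }
    }
    return $clean;
}

// Same layout as in the question: the quote opens before the line break.
$links = array ('
http://www.isbndb.com/book/waterworks_a02','
http://www.isbndb.com/book/uniform_investment_adviser_law_exam_series_65'
);

foreach (cleanUrls($links) as $link) {
    echo $link, PHP_EOL;
}
```

The cleaned list can then be fed to the cURL loop above in place of $links.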
I use this code to get the elements of the left navigation bar:
function parseInit($url) {
$ch = curl_init();
$timeout = 0;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$data = parseInit("https://www.smile-dental.de/index.php");
$data = preg_replace('/<(d[ldt])( |>)/smi', '<div data-type="$1"$2', $data);
$data = preg_replace('/<\/d[ldt]>/smi', '</div>', $data);
$html = new simple_html_dom();
$html = $html->load($data);
But I have run into the following problem.
If I use this syntax to get the elements: $html->find("div[data-type=dd].level2"), I get ALL elements with the data attributes DT, DD and DL and the class name LEVEL2. If I use the other syntax: $html->find("div.level2[data-type=dd]"), I get ALL elements with the data attribute DD, but with the class names LEVEL1, LEVEL2, LEVEL3 and so on.
Could you explain to me what the problem is? Thanks in advance!
P.S.: All DT, DL and DD elements were changed by regexp into DIV elements with the appropriate data attributes, because this parser counts the number of these elements incorrectly.
REGEXes are not made to manipulate HTML, DOM parsers are... And simple_html_dom you're using can do it easily...
The following code will do what you want just fine (check comments):
$data = parseInit("https://www.smile-dental.de/index.php");
// Create a DOM object
$html = new simple_html_dom();
$html = $html->load($data);
// Find all tags to replace
$nodes = $html->find('dt, dd, dl');
// Loop through every node and make the wanted changes
foreach ($nodes as $key => $node) {
// Get the original tag's name
$originalTag = $node->tag;
// Replace it with the new tag
$node->tag = 'div';
// Set a new attribute with the original tag's name
$node->{'data-type'} = $originalTag;
}
// Clear DOM variable
$html->clear();
unset($html);
Here it is in action.
Now, for multiple attributes filtering, you can use either of the following methods:
foreach ( $html->find("div.level2") as $key => $node) {
if ( $node->{'data-type'} == 'dt' ) {
# code...
}
}
OR (courtesy of h0tw1r3):
// array containing all the filtered nodes
$dts = array_filter($html->find('div.level2'), function($node){return $node->{'data-type'} == 'dt';});
Please read the MANUAL for more details...
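If the combined selectors keep misbehaving, the same two-condition filter can also be written with PHP's built-in DOMDocument and DOMXPath. This is a sketch of that swapped-in technique, not of the simple_html_dom API; the function name findDivTexts and the sample markup are made up, and the function assumes $type and $class are simple tokens without quotes.

```php
<?php
// Returns the text of every <div> that has data-type=$type AND the class $class.
function findDivTexts(string $html, string $type, string $class): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // The concat/normalize-space idiom matches one class token among several.
    $query = sprintf(
        "//div[@data-type='%s' and contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $type,
        $class
    );

    $texts = [];
    foreach ($xpath->query($query) as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// Hypothetical markup after the dl/dt/dd tags were turned into divs.
$sampleHtml = '<div data-type="dt" class="level2">A</div>'
            . '<div data-type="dd" class="level2">B</div>'
            . '<div data-type="dd" class="level1">C</div>'
            . '<div data-type="dd" class="level2 active">D</div>';

print_r(findDivTexts($sampleHtml, 'dd', 'level2'));
```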
After struggling for 3 hours at trying to do this on my own, I have decided that it is either not possible or not possible for me to do on my own. My question is as follows:
How can I scrape the numbers in the attached image using PHP to echo them in a webpage?
Image URL: http://gyazo.com/6ee1784a87dcdfb8cdf37e753d82411c
Please help. I have tried almost everything, from using cURL, to using a regex, to trying an xPath. Nothing has worked the right way.
I only want the numbers by themselves in order for them to be isolated, assigned to a variable, and then echoed elsewhere on the page.
Update:
http://youtube.com/exonianetwork - The URL I am trying to scrape.
/html/body[@class='date-20121213 en_US ltr ytg-old-clearfix guide-feed-v2 site-left-aligned exp-new-site-width exp-watch7-comment-ui webkit webkit-537']/div[@id='body-container']/div[@id='page-container']/div[@id='page']/div[@id='content']/div[@id='branded-page-default-bg']/div[@id='branded-page-body-container']/div[@id='branded-page-body']/div[@class='channel-tab-content channel-layout-two-column selected blogger-template ']/div[@class='tab-content-body']/div[@class='secondary-pane']/div[@class='user-profile channel-module yt-uix-c3-module-container ']/div[@class='module-view profile-view-module']/ul[@class='section'][1]/li[@class='user-profile-item '][1]/span[@class='value']
The xPath I tried, which didn't work for some unknown reason. No exceptions or errors were thrown, and nothing was displayed.
Perhaps a simple XPath would be easier to manipulate and debug.
Here's a Short Self-Contained Correct Example (watch for the space at the end of the class name):
#!/usr/bin/env php
<?php
$url = "http://youtube.com/exonianetwork";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html)
{
print "Failed to fetch page. Error handling goes here";
}
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings from malformed HTML
$xpath = new DOMXPath($dom);
$profile_items = $xpath->query("//li[@class='user-profile-item ']/span[@class='value']");
if ($profile_items->length === 0) {
print "No values found\n";
} else {
foreach ($profile_items as $profile_item) {
printf("%s\n", $profile_item->textContent);
}
}
?>
Execute:
% ./scrape.php
57
3,593
10,659,716
113,900
United Kingdom
If you are willing to try a regex again, this pattern should work:
!Network Videos:</span>\r\n +<span class=\"value\">([\d,]+).+Views:</span>\r\n +<span class=\"value\">([\d,]+).+Subscribers:</span>\r\n +<span class=\"value\">([\d,]+)!s
It captures the numbers with their embedded commas, which then need to be stripped out. I'm not familiar with PHP, so I cannot give you more complete code.
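Stripping the commas in PHP could look like the sketch below. The helper name parseCount is made up, and the sample strings are simply the values from the script output shown earlier; in practice they would come from the capture groups of preg_match.

```php
<?php
// Turns a formatted count such as "10,659,716" into an integer.
function parseCount(string $formatted): int
{
    return (int) str_replace(',', '', $formatted);
}

// Stand-ins for the three capture groups of the pattern above.
$captured = ['57', '3,593', '10,659,716'];
$counts = array_map('parseCount', $captured);

echo $counts[2], PHP_EOL; // prints 10659716
```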
I am using the following code to parse a DOM document, but at the end I get this error:
"google.ac" is null or not an object
line 402
char 1
My guess is that line 402 contains a tag and a lot of ";" characters.
How can I fix this?
<?php
//$ch = curl_init("http://images.google.com/images?q=books&tbm=isch/");
// create a new cURL resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://images.google.com/images?q=books&tbm=isch/");
curl_setopt($ch, CURLOPT_HEADER, 0);
// grab URL and pass it to the browser
$data = curl_exec($ch);
curl_close($ch);
$dom = new DOMDocument();
$dom->loadHTML($data);
//@$dom->saveHTMLFile('newfolder/abc.html')
$dom->loadHTML('$data');
// find all ul
$list = $dom->getElementsByTagName('ul');
// get few list items
$rows = $list->item(30)->getElementsByTagName('li');
// get anchors from the table
$links = $list->item(30)->getElementsByTagName('a');
foreach ($links as $link) {
echo "<fieldset>";
$links = $link->getElementsByAttribute('imgurl');
$dom->saveXML($links);
}
?>
There are a few issues with the code:
You should set the cURL option CURLOPT_RETURNTRANSFER in order to capture the output, like this: curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);. By default the output is sent straight to the browser, so in the code above $data will only ever be TRUE or FALSE (http://www.php.net/manual/en/function.curl-exec.php).
$dom->loadHTML('$data'); is incorrect and not needed: the single quotes make it parse the literal text $data rather than the downloaded page.
The way the 'li' and 'a' tags are read may not be correct, because $list->item(30) always points to the 31st <ul> element (the index is zero-based), whether or not that is the list you want.
Anyways, coming to the fixes. I'm not sure if you checked the HTML returned by the CURL request but it seems different from what we discussed in the original post. In other words, the HTML returned by CURL does not contain the required <ul> and <li> elements. It instead contains <td> and <a> elements.
Add-on: I'm not entirely sure why the HTML for the same page differs between the browser and PHP, but here is an explanation that I think fits. The page uses JavaScript that renders some of the HTML dynamically on page load. That dynamic HTML is visible in the browser but not to PHP, which only receives the initial markup. Hence, I assume the <ul> and <li> tags are generated dynamically. Anyway, that isn't our concern for now.
Therefore, you should modify your code to parse the <a> elements and then read the image URLs. This code snippet might help:
<?php
$ch = curl_init(); // create a new cURL resource
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://images.google.com/images?q=books&tbm=isch/");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
$data = curl_exec($ch); // fetch the page and return it as a string
curl_close($ch);
$dom = new DOMDocument();
@$dom->loadHTML($data); // @ suppresses warnings from malformed HTML
$listA = $dom->getElementsByTagName('a'); // read all <a> elements
foreach ($listA as $itemA) { // loop through each <a> element
if ($itemA->hasAttribute('href')) { // check if it has an 'href' attribute
$href = $itemA->getAttribute('href'); // read the value of 'href'
if (preg_match('/^\/imgres\?/', $href)) { // check that 'href' should begin with "/imgres?"
$qryString = substr($href, strpos($href, '?') + 1);
parse_str($qryString, $arrHref); // read the query parameters from 'href' URI
echo '<br>' . $arrHref['imgurl'] . '<br>';
}
}
}
I hope the above makes sense. Please note that this parsing may fail if Google changes its HTML.
I need to encode only part of the $delete path: only the @ in the email address and the # in the property name. I know how to use urlencode on the whole thing, but not on just those parts. The script loops through the properties, and most of them include # in the name. Help modifying the code so that this works would be greatly appreciated!
The delete:
$delete = "http://admin:12345@192.168.245.133/@api/deki/DELETE:users/$user_id/properties/%s";
Here you can see $user_id. This will be an email address, BUT the @ symbol needs to be encoded.
The property name that follows at the very end has a # in it, which also needs to be encoded. For example, one property name is userprofile#external.created_date
Here is the code so far:
<?php
$user_id="john_smith@ourwiki.com";
$url=('http://admin:12345@192.168.245.133/@api/deki/users/=john_smith@ourwiki.com/properties');
$xmlString=file_get_contents($url);
$delete = "http://admin:12345@192.168.245.133/@api/deki/DELETE:users/$user_id/properties/%s";
$xml = new SimpleXMLElement($xmlString);
function curl_fetch($url,$username,$password,$method='DELETE')
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, $method);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // returns output as a string instead of echoing it
curl_setopt($ch,CURLOPT_USERPWD,"$username:$password"); // if your server requires basic auth do this
return curl_exec($ch);
}
foreach($xml->property as $property) {
$name = $property['name']; // the name is stored in the attribute
curl_fetch(sprintf($delete, $name),'admin','12345');
}
?>
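Have you tried this? str_replace(array('@', '#'), array('%40', '%23'), $string); Note that the search and replacement arrays come first and the subject string last. Apply it only to the user id and the property name, not to the whole URL, or the @ that separates the credentials from the host will be encoded as well.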
The urlencode function does not allow you to limit it to a subset of characters.
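One way around that limitation is to encode each variable path segment on its own and leave the rest of the URL untouched. The sketch below uses rawurlencode() for this. The function name buildPropertyUrl and the URL template are assumptions for illustration (the template leaves out the credentials, since curl_fetch() already passes them via CURLOPT_USERPWD), so adjust the path to whatever your API actually expects.

```php
<?php
// Fills a URL template that has two %s placeholders, percent-encoding
// only the user id and the property name.
function buildPropertyUrl(string $template, string $userId, string $propertyName): string
{
    return sprintf($template, rawurlencode($userId), rawurlencode($propertyName));
}

// Hypothetical template; the fixed parts of the path are left as they are.
$template = 'http://192.168.245.133/@api/deki/users/=%s/properties/%s';

echo buildPropertyUrl($template, 'john_smith@ourwiki.com', 'userprofile#external.created_date'), PHP_EOL;
// prints http://192.168.245.133/@api/deki/users/=john_smith%40ourwiki.com/properties/userprofile%23external.created_date
```

Inside the foreach loop, the property name read from the XML would be cast to a string first, for example (string) $property['name'], before being passed in.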