I've been trying to write a simple script in PHP to pull off data from a ISBN database site. and for some reason I've had nothing but issues using the file_get_contents command.. I've managed to get something working for this now, but would just like to see if anyone knows why this wasn't working?
The below would not populate the $page with any information so the preg matches below failed to get any information. If anyone knows what the hell was stopping this would be great?
$links = array ('
http://www.isbndb.com/book/2009_cfa_exam_level_2_schweser_practice_exams_volume_2','
http://www.isbndb.com/book/uniform_investment_adviser_law_exam_series_65','
http://www.isbndb.com/book/waterworks_a02','
http://www.isbndb.com/book/winning_the_toughest_customer_the_essential_guide_to_selling','
http://www.isbndb.com/book/yale_daily_news_guide_to_fellowships_and_grants'
); // array of URLs
foreach ($links as $link)
{
$page = file_get_contents($link);
#print $page;
preg_match("#<h1 itemprop='name'>(.*?)</h1>#is",$page,$title);
preg_match("#<a itemprop='publisher' href='http://isbndb.com/publisher/(.*?)'>(.*?)</a>#is",$page,$publisher);
preg_match("#<span>ISBN10: <span itemprop='isbn'>(.*?)</span>#is",$page,$isbn10);
preg_match("#<span>ISBN13: <span itemprop='isbn'>(.*?)</span>#is",$page,$isbn13);
echo '<tr>
<td>'.$title[1].'</td>
<td>'.$publisher[2].'</td>
<td>'.$isbn10[1].'</td>
<td>'.$isbn13[1].'</td>
</tr>';
#exit();
}
My guess is you have wrong (not direct) URLs. Proper ones should be without the www. part - if you fire any of them and inspect the returned headers, you'll see that you're redirected (HTTP 301) to another URL.
The best way to do it in my opinion is to use cURL among curl_setopt with options CURLOPT_FOLLOWLOCATION and CURLOPT_MAXREDIRS.
Of course you should trim your urls beforehands just to be sure it's not the problem.
Example here:
$curl = curl_init();
foreach ($links as $link) {
curl_setopt($curl, CURLOPT_URL, $link);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, TRUE);
curl_setopt($curl, CURLOPT_MAXREDIRS, 5); // max 5 redirects
$result = curl_exec($curl);
if (! $result) {
continue; // if $result is empty or false - ignore and continue;
}
// do what you need to do here
}
curl_close($curl);
Related
there,
I would like to create a homepage and automatically display the ads of mobile.de on it.
For this there is an API from mobile.de:
https://services.mobile.de/manual/search-api.html
I have the right access data and when I start the call via the browser:
https://services.mobile.de/search-api/search?customerNumber=503300
I get this result:
<search:search-result xmlns:seller="http://services.mobile.de/schema/seller" xmlns:ad="http://services.mobile.de/schema/ad" xmlns:search="http://services.mobile.de/schema/search" xmlns:financing="http://services.mobile.de/schema/common/financing-1.0" xmlns:resource="http://services.mobile.de/schema/resource" xmlns:error="http://services.mobile.de/schema/common/error-1.0">
<search:total>4</search:total>
<search:page-size>20</search:page-size>
<search:current-page>1</search:current-page>
<search:max-pages>1</search:max-pages>
<search:ads>
<ad:ad key="266399529" url="https://services.mobile.de/search-api/ad/266399529">
<ad:creation-date value="2018-11-19T07:53:58+01:00"/>
<ad:modification-date value="2018-11-19T07:53:58+01:00"/>
<ad:detail-page url="https://suchen.mobile.de/auto-inserat/porsche-997-gt3-rs-ruf-4-0-einzelst%C3%BCck-allrad-solms/266399529.html?source=api"/>
<ad:vehicle>
Looks good to me!
Now I would like to go through the individual ads and there are problems.
The individual ads are grouped by this line:
<ad:ad key="266399529" url="https://services.mobile.de/search-api/ad/266399529">
Through my long years of experience and especially through the Internet, I have come to the following code:
error_reporting(E_ALL);
ini_set('display_errors', true);
$process = curl_init("https://services.mobile.de/search-api/search?customerNumber=503300");
curl_setopt($process, CURLOPT_HTTPHEADER, array('Content-Type: application/xml'));
curl_setopt($process, CURLOPT_HEADER, 0);
curl_setopt($process, CURLOPT_USERPWD, "username:password");
curl_setopt($process, CURLOPT_TIMEOUT, 30);
curl_setopt($process, CURLOPT_RETURNTRANSFER, TRUE);
$return = curl_exec($process);
curl_close($process);
$xml = simplexml_load_string($return);
$ns = $xml->children('http://services.mobile.de/schema/ad');
foreach($ns as $ad) {
$attributes = $ad->attributes();
$key = (string) $attributes['key'];
var_dump($key);
}
Unfortunately I get exactly nothing as an answer, an empty page without error message.
The problem is that you have another element in between your root node and the <ad:ad> element. You need to go via the <search:ads> element...
$ns = $xml->children('http://services.mobile.de/schema/search')->ads
->children('http://services.mobile.de/schema/ad');
To access the details of the ads, you need to again look at the structure and see what elements you want and what namespace they are in. So for the text of the category element of each ad, you can use a loop and...
$ns = $xml->children('http://services.mobile.de/schema/search')->ads
->children('http://services.mobile.de/schema/ad');
foreach($ns as $ad) {
foreach ( $ad->vehicle as $vehicle ) {
echo (string)$vehicle->category[0]
->children("http://services.mobile.de/schema/resource")
->{'local-description'}.PHP_EOL;
}
}
A couple of things with this is that the <resource:local-description> element is in a different namespace, which is why it uses the ->children() with this other namespace. Also as the name contains a -, you have to access it using ->{'local-description'} to make it a valid name.
Lastly - as all this will return the element it points to, you should cast it to a string ( using (string) at the start) to make sure you end up with just the text from the element.
As an alternative you might also use an xpath expression using the namespace prefix:
//search:search-result/search:ads/ad:ad
For example:
$ads = $xml->xpath('//search:search-result/search:ads/ad:ad');
foreach ($ads as $ad) {
$key = (string)$ad->attributes()->key;
}
I want to get the whole element <article> which represents 1 listing but it doesn't work. Can someone help me please?
containing the image + title + it's link + description
<?php
$url = 'http://www.polkmugshot.com/';
$content = file_get_contents($url);
$first_step = explode( '<article>' , $content );
$second_step = explode("</article>" , $first_step[3] );
echo $second_step[0];
?>
You should definitely be using curl for this type of requests.
function curl_download($url){
// is cURL installed?
if (!function_exists('curl_init')){
die('cURL is not installed!');
}
$ch = curl_init();
// URL to download
curl_setopt($ch, CURLOPT_URL, $url);
// User agent
curl_setopt($ch, CURLOPT_USERAGENT, "Set your user agent here...");
// Include header in result? (0 = yes, 1 = no)
curl_setopt($ch, CURLOPT_HEADER, 0);
// Should cURL return or print out the data? (true = retu rn, false = print)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Timeout in seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
// Download the given URL, and return output
$output = curl_exec($ch);
// Close the cURL resource, and free system resources
curl_close($ch);
return $output;
}
for best results for your question. Combine it with HTML Dom Parser
use it like:
// Find all images
foreach($output->find('img') as $element)
echo $element->src . '<br>';
// Find all links
foreach($output->find('a') as $element)
echo $element->href . '<br>';
Good Luck!
I'm not sure I get you right, But I guess you need a PHP DOM Parser. I suggest this one (This is a great PHP library to parser HTML codes)
Also you can get whole HTML code like this:
$url = 'http://www.polkmugshot.com/';
$html = file_get_html($url);
echo $html;
Probably a better way would be to parse the document and run some xpath queries over it afterwards, like so:
$url = 'http://www.polkmugshot.com/';
$xml = simplexml_load_file($url);
$articles = $xml->xpath("//articles");
foreach ($articles as $article) {
// do sth. useful here
}
Read about SimpleXML here.
extract the articles with DOMDocument. working example:
<?php
$url = 'http://www.polkmugshot.com/';
$content = file_get_contents($url);
$domd=#DOMDocument::loadHTML($content);
foreach($domd->getElementsByTagName("article") as $article){
var_dump($domd->saveHTML($article));
}
and as pointed out by #Guns , you'd better use curl, for several reasons:
1: file_get_contents will fail if allow_url_fopen is not set to true in php.ini
2: until php 5.5.0 (somewhere around there), file_get_contents kept reading from the connection until the connection was actually closed, which for many servers can be many seconds after all content is sent, while curl will only read until it reaches content-length HTTP header, which makes for much faster transfers (luckily this was fixed)
3: curl supports gzip and deflate compressed transfers, which again, makes for much faster transfer (when content is compressible, such as html), while file_get_contents will always transfer plain
After struggling for 3 hours at trying to do this on my own, I have decided that it is either not possible or not possible for me to do on my own. My question is as follows:
How can I scrape the numbers in the attached image using PHP to echo them in a webpage?
Image URL: http://gyazo.com/6ee1784a87dcdfb8cdf37e753d82411c
Please help. I have tried almost everything, from using cURL, to using a regex, to trying an xPath. Nothing has worked the right way.
I only want the numbers by themselves in order for them to be isolated, assigned to a variable, and then echoed elsewhere on the page.
Update:
http://youtube.com/exonianetwork - The URL I am trying to scrape.
/html/body[#class='date-20121213 en_US ltr ytg-old-clearfix guide-feed-v2 site-left-aligned exp-new-site-width exp-watch7-comment-ui webkit webkit-537']/div[#id='body-container']/div[#id='page-container']/div[#id='page']/div[#id='content']/div[#id='branded-page-default-bg']/div[#id='branded-page-body-container']/div[#id='branded-page-body']/div[#class='channel-tab-content channel-layout-two-column selected blogger-template ']/div[#class='tab-content-body']/div[#class='secondary-pane']/div[#class='user-profile channel-module yt-uix-c3-module-container ']/div[#class='module-view profile-view-module']/ul[#class='section'][1]/li[#class='user-profile-item '][1]/span[#class='value']
The xPath I tried, which didn't work for some unknown reason. No exceptions or errors were thrown, and nothing was displayed.
Perhaps a simple XPath would be easier to manipulate and debug.
Here's a Short Self-Contained Correct Example (watch for the space at the end of the class name):
#!/usr/bin/env php
<?
$url = "http://youtube.com/exonianetwork";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
if (!$html)
{
print "Failed to fetch page. Error handling goes here";
}
curl_close($ch);
$dom = new DOMDocument();
#$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$profile_items = $xpath->query("//li[#class='user-profile-item ']/span[#class='value']");
if ($profile_items->length === 0) {
print "No values found\n";
} else {
foreach ($profile_items as $profile_item) {
printf("%s\n", $profile_item->textContent);
}
}
?>
Execute:
% ./scrape.php
57
3,593
10,659,716
113,900
United Kingdom
If you are willing to try a regex again, this pattern should work:
!Network Videos:</span>\r\n +<span class=\"value\">([\d,]+).+Views:</span>\r\n +<span class=\"value\">([\d,]+).+Subscribers:</span>\r\n +<span class=\"value\">([\d,]+)!s
It captures the numbers with their embedded commas, which would then need to be stripped out. I'm not familiar with PHP, so cannot give you more complete code
Sort of a weird question.
From 4shared video site, I get the embed code like the following:
<embed src="http://www.4shared.com/embed/436595676/acfa8f75" width="420" height="320" allowfullscreen="true" allowscriptaccess="always"></embed>
Now, if I access the url in that embed src, the video is loaded up and the URL of the page is changed with information about the video.
I am wondering if there is any way for me to access that info using PHP? I tried file_get_contents but it gives me lots of weird characters.
So, can I use PHP to load the embed url and get the information present in the address bar?
Thanks for all your help! :)
Yes, e.g. with the curl-library of php. This one will handle the redirect-headers from the server, which result in the new/real url of the video.
Here's a sample code:
<?php
// create a new cURL resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.4shared.com/embed/436595676/acfa8f75");
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_NOBODY, 1);
// we want to further handle the content, so return it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// grab URL and pass it to the browser
$result = curl_exec($ch);
// did we get a good result?
if (!$result)
die ("error getting url");
// if we got a redirection http-code, split the content in
// lines and search for the Location-header.
$location = null;
if ((int)(curl_getinfo($ch, CURLINFO_HTTP_CODE)/100) == 3) {
$lines = explode("\n", $result);
foreach ($lines as $line) {
list($head, $value) = explode(":", $line, 2);
if ($head == 'Location') {
$location = trim($value);
break;
}
}
}
if ($location == null)
die("no redirect found in header");
// close cURL resource, and free up system resources
curl_close($ch);
// your location is now in here.
var_dump($location);
?>
I'm starting to help a friend who runs a website with small bits of coding work, and all the code required will be PHP. I am a C# developer, so this will be a new direction.
My first stand-alone task is as follows:
The website is informed of a new species of fish. The scientific name is entered into, say, two input controls, one for the genus (X) and another for the species (Y). These names will need to be sent to a website in the format:
http://www.fishbase.org/Summary/speciesSummary.php?genusname=X&speciesname=Y&lang=English
Once on the resulting page, there are further links for common names and synonyms.
What I would like to be able to do is to find these links, and call the URL (as this will contain all the necessary parameters to get the particular data) and store some of it.
I want to save data from both calls and, once completed, convert it all into xml which can then be uploaded to the website's database.
All I'd like to know is (a) can this be done, and (b) how difficult is it?
Thanks in advance
Martin
If I understand you correctly you want your script to download a page and process the downloaded data. If so, the answers are:
a) yes
b) not difficult
:)
Oke... here some more information: I would use the CURL extension, see:
http://php.net/manual/en/book.curl.php
<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "example.com");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$output = curl_exec($ch);
curl_close($ch);
?>
I used a thing called snoopy (http://sourceforge.net/projects/snoopy/) 4 years a go.
I took about 500 customers profiles from a website that published them in a few hours.
a) Yes
b) Not difficult when have experience.
Google for CURL first, or allow_url_fopen.
file_get_contents() will do the job:
$data = file_get_contents('http://www.fishbase.org/Summary/speciesSummary.php?genusname=X&speciesname=Y&lang=English');
// Отправить URL-адрес
function send_url($url, $type = false, $debug = false) { // $type = 'json' or 'xml'
$result = '';
if (function_exists('curl_init')) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec($ch);
curl_close($ch);
} else {
if (($content = #file_get_contents($url)) !== false) $result = $content;
}
if ($type == 'json') {
$result = json_decode($result, true);
} elseif ($type == 'xml') {
if (($xml = #simplexml_load_file($result)) !== false) $result = $xml;
}
if ($debug) echo '<pre>' . print_r($result, true) . '</pre>';
return $result;
}
$data = send_url('http://ip-api.com/json/212.76.17.140', 'json', true);