I'm currently in the process of converting all of my Beautiful Soup code into PHP, just to get used to PHP. However, I've run into a bit of a problem: my PHP code only works when the wiki page has 'External links' after 'Original run' in the HTML (such as the True Detective wiki page). I just found out that this won't always be the case, because there may not always be an 'External links' section. Is there any way to convert my Beautiful Soup code into PHP using the same technique my Beautiful Soup code uses?
import requests, re
from bs4 import BeautifulSoup

def get_date(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    # Find the infobox table and return the "Original run" cell,
    # with any parenthesised notes stripped out.
    date = soup.find_all("table", {"class": "infobox"})
    for item in date:
        dates = item.find_all("th")
        for item2 in dates:
            if item2.text == "Original run":
                test2 = item2.find_next("td").text.encode("utf-8")
                mysub = re.sub(r'\([^)]*\)', '', test2)
                return mysub
And here is my PHP code currently:
<?php
// Defining the basic cURL function
function curl($url) {
$ch = curl_init(); // Initialising cURL
curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
$data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch); // Closing cURL
return $data; // Returning the data from the function
}
?>
<?php
// Defining the basic scraping function
function scrape_between($data, $start, $end){
$data = stristr($data, $start); // Stripping all data from before $start
$data = substr($data, strlen($start)); // Stripping $start
$stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
$data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
return $data; // Returning the scraped data from the function
}
?>
<?php
$scraped_page = curl("http://en.wikipedia.org/wiki/The_Walking_Dead_(TV_series)"); // Downloading the Wikipedia page into the variable $scraped_page
$scraped_data = scrape_between($scraped_page, "<table class=\"infobox vevent\" style=\"width:22em\">", "</table>"); // Scraping the downloaded data in $scraped_page for the contents of the infobox table
$original_run = mb_substr($scraped_data, strpos($scraped_data, "Original run")-2, strpos($scraped_data, "External links") - strpos($scraped_data, "Original run")-2);
echo $original_run;
?>
Have you considered simply using the Wikipedia API? Autogenerated wiki markup is generally terrible to deal with and may change at any time.
Additionally, instead of trying to regex-parse HTML in PHP, use the phpQuery library (installable with Composer); then you can simply search for the selector table.infobox.vevent.
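Here is a minimal sketch of that idea with phpQuery, reusing the curl() function above. The page URL and the "Original run" label come from the question; treat the exact traversal calls as an assumption to check against phpQuery's jQuery-style API:
<?php
require 'vendor/autoload.php'; // assumes phpQuery was installed via Composer

$html = curl("http://en.wikipedia.org/wiki/The_Walking_Dead_(TV_series)");
$doc = phpQuery::newDocumentHTML($html); // parse the page once

// Walk the infobox header cells and grab the <td> next to "Original run",
// mirroring the Beautiful Soup find_next("td") technique.
foreach (pq('table.infobox.vevent th') as $th) {
    if (trim(pq($th)->text()) === 'Original run') {
        $original_run = trim(pq($th)->next()->text());
        echo preg_replace('/\([^)]*\)/', '', $original_run); // strip "(...)" notes
        break;
    }
}
?>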
Related
I have a PHP script that loads this webpage to extract some data from its tables.
The following methods failed to get its table contents:
Using file_get_contents:
$document = file_get_contents("http://www.webpage.com/");
print_r($document);
Using cURL:
$document = curl_init('http://www.webpage.com/');
curl_setopt($document, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($document);
print_r($html);
Using loadHTMLFile:
$document = new DOMDocument(); // a DOMDocument instance is needed before loadHTMLFile
$document->loadHTMLFile('http://www.webpage.com/');
print_r($document);
I'm not an expert in PHP; except for the first method, the others are copied from Stack Overflow answers.
What am I doing wrong?
And how do they block some content from loading?
Not the answer you're likely to want to hear, but none of the methods you describe will evaluate JavaScript and other browser resources as a normal browser client would. Instead, each of those methods retrieves the contents of only the file you've specified. A quick glance at the site you're targeting clearly shows this table in question being populated as the result of an AJAX call, which none of the methods you've tried are able to evaluate.
You'll need to lean on a library or script that has the capability for this type of emulation; namely laravel/dusk, the PHP bindings for Selenium webdriver, or something similar.
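As a rough illustration (not a drop-in solution), here is what fetching the rendered page could look like with the php-webdriver Selenium bindings. It assumes a Selenium/ChromeDriver server is already running at localhost:4444 and that the package was installed via Composer:
require 'vendor/autoload.php';

use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\WebDriverBy;
use Facebook\WebDriver\WebDriverExpectedCondition;

// Drive a real browser so the AJAX call that fills the table actually runs.
$driver = RemoteWebDriver::create('http://localhost:4444/wd/hub', DesiredCapabilities::chrome());
$driver->get('http://www.webpage.com/');

// Wait (up to 10 seconds) until a table shows up in the DOM.
$driver->wait(10)->until(
    WebDriverExpectedCondition::presenceOfElementLocated(WebDriverBy::cssSelector('table'))
);

$html = $driver->getPageSource(); // the HTML after JavaScript has run
$driver->quit();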
This is what I did to scrape data from a webpage using PHP cURL:
// Defining the basic cURL function
function curl($url) {
$ch = curl_init(); // Initialising cURL
curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
$data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch); // Closing cURL
return $data; // Returning the data from the function
}
// Defining the basic scraping function
function scrape_between($data, $start, $end){
$data = stristr($data, $start); // Stripping all data from before $start
$data = substr($data, strlen($start)); // Stripping $start
$stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
$data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
return $data; // Returning the scraped data from the function
}
$target_url = "https://www.somesite.com";
$scraped_website = curl($target_url);
$data_set_1 = scrape_between($scraped_website, "%before%", "%after%");
$data_set_2 = scrape_between($scraped_website, "%before%", "%after%");
%before% and %after% are pieces of markup that always appear on the webpage immediately before and after the data you wish to grab. They could be div tags or some other HTML tags that are unique to the data you want.
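For example, if the data always sits inside a uniquely named div (the id here is purely hypothetical):
$price = scrape_between($scraped_website, '<div id="room-price">', '</div>');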
So maybe look into using cURL to imitate the same AJAX request that the site is using? When I searched for that, this is what I found:
Mimicking an ajax call with Curl PHP
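A rough sketch of what that could look like; the endpoint URL, POST fields and headers below are placeholders, so copy the real ones from your browser's network tab:
// Hypothetical helper: POST to the site's AJAX endpoint the way the browser does.
function curl_ajax($url, array $postFields = []) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postFields));
    curl_setopt($ch, CURLOPT_HTTPHEADER, [
        'X-Requested-With: XMLHttpRequest', // many AJAX endpoints check for this header
        'Accept: application/json',
    ]);
    $response = curl_exec($ch);
    curl_close($ch);
    return $response;
}

// Placeholder endpoint and parameters; replace with what the network tab shows.
$json = curl_ajax('http://www.webpage.com/ajax/table-data.php', ['page' => 1]);
$rows = json_decode($json, true);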
I want to get the whole <article> element, which represents one listing and contains the image, title, its link, and description, but it doesn't work. Can someone help me please?
<?php
$url = 'http://www.polkmugshot.com/';
$content = file_get_contents($url);
$first_step = explode( '<article>' , $content );
$second_step = explode("</article>" , $first_step[3] );
echo $second_step[0];
?>
You should definitely be using cURL for this type of request.
function curl_download($url){
// is cURL installed?
if (!function_exists('curl_init')){
die('cURL is not installed!');
}
$ch = curl_init();
// URL to download
curl_setopt($ch, CURLOPT_URL, $url);
// User agent
curl_setopt($ch, CURLOPT_USERAGENT, "Set your user agent here...");
// Include the header in the result? (1 = yes, 0 = no)
curl_setopt($ch, CURLOPT_HEADER, 0);
// Should cURL return or print out the data? (true = return, false = print)
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Timeout in seconds
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
// Download the given URL, and return output
$output = curl_exec($ch);
// Close the cURL resource, and free system resources
curl_close($ch);
return $output;
}
For best results for your question, combine it with an HTML DOM parser such as Simple HTML DOM. Load the downloaded page into the parser first, then use it like:
$output = curl_download('http://www.polkmugshot.com/');
$html = str_get_html($output); // parse the downloaded HTML with Simple HTML DOM
// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';
// Find all links
foreach($html->find('a') as $element)
    echo $element->href . '<br>';
Good Luck!
I'm not sure I understand you correctly, but I guess you need a PHP DOM parser. I suggest this one (it's a great PHP library for parsing HTML code).
You can also get the whole HTML code like this:
$url = 'http://www.polkmugshot.com/';
$html = file_get_html($url);
echo $html;
Probably a better way would be to parse the document and run some xpath queries over it afterwards, like so:
$url = 'http://www.polkmugshot.com/';
$xml = simplexml_load_file($url);
$articles = $xml->xpath("//article");
foreach ($articles as $article) {
// do sth. useful here
}
Read about SimpleXML here.
Extract the articles with DOMDocument. Working example:
<?php
$url = 'http://www.polkmugshot.com/';
$content = file_get_contents($url);
$domd = @DOMDocument::loadHTML($content);
foreach($domd->getElementsByTagName("article") as $article){
var_dump($domd->saveHTML($article));
}
And as pointed out by @Guns, you'd better use cURL, for several reasons:
1: file_get_contents will fail if allow_url_fopen is not set to true in php.ini
2: until around PHP 5.5.0, file_get_contents kept reading from the connection until the connection was actually closed, which for many servers can be many seconds after all content is sent, while cURL only reads until it reaches the Content-Length HTTP header, which makes for much faster transfers (luckily this was fixed)
3: cURL supports gzip and deflate compressed transfers, which again makes for much faster transfers (when content is compressible, such as HTML), while file_get_contents will always transfer plain; see the sketch below
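A minimal sketch of point 3, turning on compressed transfers in a plain cURL download:
$ch = curl_init('http://www.polkmugshot.com/');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// An empty string makes cURL advertise every encoding it supports
// (gzip, deflate) and decompress the response automatically.
curl_setopt($ch, CURLOPT_ENCODING, '');
$content = curl_exec($ch);
curl_close($ch);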
I want to scrape some information from a webpage. It uses a table layout structure.
I want to extract the third table inside the nested table layout, which contains a series of nested tables, each publishing a result. But the code is not working:
include('simple_html_dom.php');
$url = 'http://exams.keralauniversity.ac.in/Login/index.php?reslt=1';
$html = file_get_html($url);
$result =$html->find("table", 2);
echo $result;
I used cURL to extract the website, but the problem is that its tags are out of order, so it cannot be extracted using Simple HTML DOM.
function curl($url) {
$ch = curl_init(); // Initialising cURL
curl_setopt($ch, CURLOPT_URL,$url); // Setting cURL's URL option with the $url variable passed into the function
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
$data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch); // Closing cURL
return $data; // Returning the data from the function
}
function scrape_between($data, $start, $end){
$data = stristr($data, $start); // Stripping all data from before $start
$data = substr($data, strlen($start)); // Stripping $start
$stop = stripos($data, $end); // Getting the position of the $end of the data to scrape
$data = substr($data, 0, $stop); // Stripping all data from after and including the $end of the data to scrape
return $data; // Returning the scraped data from the function
}
$scraped_page = curl($url); // Executing our curl function to scrape the webpage at $url and return the results into the $scraped_page variable
$scraped_data = scrape_between($scraped_page, ' </html>', '</table></td><td></td></tr>
</table>');
echo $scraped_data;
$myfile = fopen("newfile.html", "w") or die("Unable to open file!");
fwrite($myfile, $scraped_data);
fclose($myfile);
How do I scrape the result and save the PDF?
Simple HTML DOM can't handle that HTML, so first switch to this library (Advanced HTML DOM, the advanced_html_dom.php required below).
Then do:
require_once('advanced_html_dom.php');
$dom = file_get_html('http://exams.keralauniversity.ac.in/Login/index.php?reslt=1');
$rows = array();
foreach($dom->find('tr.Function_Text_Normal:has(td[3])') as $tr){
$row['num'] = $tr->find('td[2]', 0)->text;
$row['text'] = $tr->find('td[3]', 0)->text;
$row['pdf'] = $tr->find('td[3] a', 0)->href;
if(preg_match_all('/\d+/', $tr->parent->find('u', 0)->text, $m)){
list($row['day'], $row['month'], $row['year']) = $m[0];
}
// uncomment next 2 lines to save the pdf
// $filename = preg_replace('/.*\//', '', $row['pdf']);
// file_put_contents($filename, file_get_contents($row['pdf']));
$rows[] = $row;
}
var_dump($rows);
Here is some sample code:
<?php
// Defining the basic cURL function
function curl($url) {
$ch = curl_init(); // Initialising cURL
curl_setopt($ch, CURLOPT_URL, $url); // Setting cURL's URL option with the $url variable passed into the function
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Setting cURL's option to return the webpage data
$data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable
curl_close($ch); // Closing cURL
return $data; // Returning the data from the function
}
?>
<?php
$scraped_website = curl("http://www.example.com"); // Executing our curl function to scrape the webpage http://www.example.com and return the results into the $scraped_website variable
$result = substr($scraped_website, 11, 7); // change the values 11, 7 to target the table
echo $result;
?>
I just want to get the name of the 'channel' tag, i.e. CHANNEL. The script works fine when I use it to parse the RSS from Google, but when I use it for some other provider it gives the output '#text' instead of 'channel', which is the intended output. The following is my script; please help me out.
$url = 'http://ibnlive.in.com/ibnrss/rss/sports/cricket.xml';
$get = perform_curl($url);
$xml = new DOMDocument();
$xml -> loadXML($get['remote_content']);
$fetch = $xml -> documentElement;
$gettitle = $fetch -> firstChild -> nodeName;
echo $gettitle;
function perform_curl($rss_feed_provider_url){
$url = $rss_feed_provider_url;
$curl_handle = curl_init();
// Do we have a cURL session?
if ($curl_handle) {
// Set the required CURL options that we need.
// Set the URL option.
curl_setopt($curl_handle, CURLOPT_URL, $url);
// Set the HEADER option. We don't want the HTTP headers in the output.
curl_setopt($curl_handle, CURLOPT_HEADER, false);
// Set the FOLLOWLOCATION option. We will follow if location header is present.
curl_setopt($curl_handle, CURLOPT_FOLLOWLOCATION, true);
// Instead of using WRITEFUNCTION callbacks, we are going to receive the remote contents as a return value for the curl_exec function.
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, true);
// Try to fetch the remote URL contents.
// This function will block until the contents are received.
$remote_contents = curl_exec($curl_handle);
// Do the cleanup of CURL.
curl_close($curl_handle);
$remote_contents = utf8_encode($remote_contents);
$handle = @simplexml_load_string($remote_contents);
$return_result = array();
if(is_object($handle)){
$return_result['handle'] = true;
$return_result['remote_content'] = $remote_contents;
return $return_result;
}
else{
$return_result['handle'] = false;
$return_result['content_error'] = 'INVALID RSS SOURCE, PLEASE CHECK IF THE SOURCE IS A VALID XML DOCUMENT.';
return $return_result;
}
} // End of if ($curl_handle)
else{
$return_result['curl_error'] = 'CURL INITIALIZATION FAILED.';
return false;
}
}
It gives the output '#text' instead of 'channel' because $fetch->firstChild->nodeType is 3, which is a TEXT_NODE, i.e. just some text. You could select the channel by
echo $fetch->getElementsByTagName('channel')->item(0)->nodeName;
and
$gettitle = $fetch -> firstChild -> nodeValue;
var_dump($gettitle);
gives you
string(5) "
"
i.e. the spaces and newline symbol which happen to appear between the XML tags due to formatting.
P.S.: the RSS feed at your link fails validation at http://validator.w3.org/feed/
Take a look at the XML - it's been pretty printed with whitespace so it is being parsed correctly. The first child of the root node is a text node. I'd suggest using SimpleXML if you want an easier time of it, or use XPath queries on your DomDocument to obtain the tags of interest.
Here's how you'd use SimpleXML
$xml = new SimpleXMLElement($get['remote_content']);
print $xml->channel[0]->title;
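And here is a minimal sketch of the XPath route mentioned above, reusing $get['remote_content'] from the question:
$dom = new DOMDocument();
$dom->loadXML($get['remote_content']);
$xpath = new DOMXPath($dom);
// The first <channel> element, regardless of the whitespace text nodes around it.
$channel = $xpath->query('//channel')->item(0);
echo $channel->nodeName; // "channel"
echo $xpath->query('title', $channel)->item(0)->nodeValue; // the feed title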
I am looking to display the HTML of another webpage inside my website.
Take this scenario:
I have a website that checks the availability of a hotel. But instead of hosting that hotel's images on my server, I simply cURL a specific page on the hotel's website that contains their images.
Can I grab anything from the HTML and display it on my website, using their HTML code but only the div(s) or images that I want to display?
I'm using this code, sourced from:
http://davidwalsh.name/download-urls-content-php-curl
For practice and argument's sake, let's try to display Google's logo from their homepage.
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$returned_content = get_data('http://www.google.com');
echo '<base href="http://www.google.com/" />';
echo $returned_content;
Thanks to @alex I have started to play with DOMDocument from PHP's library. However, I have hit a snag.
function get_data($url)
{
$ch = curl_init();
$timeout = 5;
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch,CURLOPT_CONNECTTIMEOUT,$timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$url = "www.abc.net.au";
$html = get_data($url);
$dom = new DOMDocument;
#$dom->loadHTML($html);
$logo = $dom->getElementById("abcLogo");
var_dump($logo);
Returns: object(DOMElement)[2]
How do I parse this further, or simply print/echo the contents of the div with that ID?
Yes, run the resulting HTML through something like DOMDocument to extract the portions you require.
Once you have found a DOM element, it can be a bit tricky to get the HTML of the element itself (rather than just its contents).
You can get the XML value of a single element very easily with DOMDocument::saveXML:
echo $dom->saveXML($logo);
This may be good enough for you. I believe there is a change coming that will add this functionality to saveHTML as well.
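If you only want the inner markup of the element (its contents without the wrapping tag), one common workaround is to serialise its child nodes individually; a small sketch building on the $dom and $logo from the question:
$innerHtml = '';
foreach ($logo->childNodes as $child) {
    $innerHtml .= $dom->saveXML($child); // serialise each child of the div on its own
}
echo $innerHtml;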
echo $logo->nodeValue should work, because you can only have one element per ID!