I need to scrape this HTML page using PHP ...
http://www.cittadellasalute.to.it/index.php?option=com_content&view=article&id=6786:situazione-pazienti-in-pronto-soccorso&catid=165:pronto-soccorso&Itemid=372
... I need to extract the numbers for the rows "Rosso", "Giallo", Verde" and "Bianco" (note that these numbers are dynamic so they can change when you refresh the page but it doesn't matter....).
I've seen that these rows are inside some IFrames (for example ... http://listeps.cittadellasalute.to.it/?id=01090201 ), and the values are loaded using an ajax request (for examples http://listeps.cittadellasalute.to.it/gtotal.php?id=01090101).
Are there some solutions to scrape directly (I'd like to avoid to parse singular jsons ....), these values from the original HTML page using PHP and $xpath->query?
Suggestions / examples?
I think the problem is that the values aren't in the original page, they are built once the page is loaded. So you would need to use something which will honour all the Javascript functionality (i.e. Selinium webdriver) which is a bit overkill for what you want to do (I assume). Much easier to directly process the IFrame.
You could extract the URL's of the IFrames from the original page ...
$url = "http://www.cittadellasalute.to.it/index.php?option=com_content&view=article&id=6786:situazione-pazienti-in-pronto-soccorso&catid=165:pronto-soccorso&Itemid=372";
$pageContents = file_get_contents($url);
$page = simplexml_load_string($pageContents, "SimpleXMLElement", LIBXML_NOERROR | LIBXML_ERR_NONE);
$ns = $page->getDocNamespaces();
$page->registerXPathNamespace('def', array_values($ns)[0]);
$iframes = $page->xpath("//def:iframe");
foreach ( $iframes as $frame ) {
echo "iframe:".$frame['src'].PHP_EOL;
}
Which gives (just now)
iframe:http://listeps.cittadellasalute.to.it/?id=01090101
iframe:http://listeps.cittadellasalute.to.it/?id=01090201
iframe:http://listeps.cittadellasalute.to.it/?id=01090301
iframe:http://listeps.cittadellasalute.to.it/?id=01090302
You can then process these pages.
Related
I am trying to scrape a website in order to get latitude and longitude for counties in the us(there are 3306 thus why I am trying to do it through code and not manually)
I am using the code below
function GetLatitude($countyName,$stateShortName){
//Create DOM from url
$page = file_get_contents("https://www.mapdevelopers.com/geocode_tool.php?$countyName,$stateShortName");
$doc = new DOMDocument();
$doc->loadHTML($page);
$node = $doc->getElementById("display_lat");
var_dump($doc);
}
GetLatitude("Guilford County","NC");
This returns nothing but if I change the url to get without the parameters like "https://www.mapdevelopers.com/geocode_tool.php" then I can see that $doc now has some information in it but that is not useful because the value I need (latitude) is dependent upon the parameters passed into the url.
How do I solve this issue?
EDIT:
Based on the suggestion to encode the parameters I changed my code to this and now the document contains information but appears as though it is ignoring the parameters
<?
function GetLatitude($countyName,$stateShortName){
$countyName = urlencode($countyName);
$stateShortName = urlencode($stateShortName);
//Create DOM from url
$page = file_get_contents("https://www.mapdevelopers.com/geocode_tool.php?address=$countyName,$stateShortName");
$doc = new DOMDocument();
$doc->loadHTML($page);
$node = $doc->getElementById("display_lat");
var_dump($doc);
}
GetLatitude("Clarke County","AL");
?>
Your issue is that the latitude information etc isn't present on page load, and java script puts it there
You're going to have a hard time trying to run a webpage with JS and scraping it from PHP without something in the middle, maybe re-try this project with something like puppet or phantomjs so you can run your script against a real browser.
Searching the page there is a ajax request to https://www.mapdevelopers.com/data.php
Sending a POST or GET request will give you the response you are looking for
for security reasons we need to disable a php/mysql for a non-profit site as it has a lot of vulnerabilities. It's a small site so we want to just rebuild the site without database and bypass the vulnerability of an admin page.
The website just needs to stay alive and remain dormant. We do not need to keep updating the site in future so we're looking for a static-ish design.
Our current URL structure is such that it has query strings in the url which fetches values from the database.
e.g. artist.php?id=2
I'm looking for a easy and quick way change artist.php so instead of fetching values from a database it would just include data from a flat html file so.
artist.php?id=1 = fetch data from /artist/1.html
artist.php?id=2 = fetch data from /artist/2.html
artist.php?id=3 = fetch data from /artist/3.html
artist.php?id=4 = fetch data from /artist/4.html
artist.php?id=5 = fetch data from /artist/5.html
The reason for doing it this way is that we need to preserve the URL structure for SEO purposes. So I do not want to use the html files for the public.
What basic php code would I need to achieve this?
To do it exactly as you ask would be like this:
$id = intval($_GET['id']);
$page = file_get_contents("/artist/$id.html");
In case $id === 0 there was something else besides numbers in the query parameter. You could also have the artist information in an array:
// datafile.php
return array(
1 => "Artist 1 is this and that",
2 => "Artist 2..."
)
And then in your artist.php
$data = include('datafile.php');
if (array_key_exists($_GET['id'], $data)) {
$page = $data[$_GET['id']];
} else {
// 404
}
HTML isn't your best option, but its cousin is THE BEST for static data files.
Let me introduce you to XML! (documentation to PHP parser)
XML is similar to HTML as structure, but it's made to store data rather than webpages.
If instead your html pages are already completed and you just need to serve them, you can use the url rewriting from your webserver (if you're using Apache, see mod_rewrite)
At last, a pure PHP solution (which I don't recommend)
<?php
//protect from displaying unwanted webpages or other vulnerabilities:
//we NEVER trust user input, and we NEVER use it directly without checking it up.
$valid_ids = array(1,2,3,4,5 /*etc*/);
if(in_array($_REQUEST['id'])){
$id = $_REQUEST['id'];
} else {
echo "missing artist!"; die;
}
//read the html file
$html_page = file_get_contents("/artist/$id.html");
//display the html file
echo $html_page;
What I want to accomplish might be a little hardcore, but I want to know if it's possible:
The question:
My question is the same as PHP-Retrieve content from page, but I want to use it on multiple pages.
The situation:
I'm using a website about TV shows. All the TV shows have the same URL and then the name of the show:
http://bierdopje.com/shows/NAME_OF_SHOW
On every show page, there's a line which tells you if the show is cancelled or still running. I want to retrieve that line to make an overview of the cancelled shows (the website only supports an overview of running shows, so I want to make an extra functionality).
The real question:
How can I tell DOM to retrieve all the shows and check for the status of the show?
(http://bierdopje.com/shows/*).
The Note:
I understand that this process may take a while because it is reading the whole website (or is it too much data?).
use this code to fetch only the links from the single website.
include_once('simple_html_dom.php');
$html = file_get_html('http://www.couponrani.com/');
// Find all links
foreach($html->find('a') as $element)
echo $element->href . '<br>';
I use phpquery to fetch data from a web page, like jQuery in Dom.
For example, to get the list of all shows, you can do this :
<?php
require_once 'phpQuery/phpQuery/phpQuery.php';
$doc = phpQuery::newDocumentHTML(
file_get_contents('http://www.bierdopje.com/shows')
);
foreach (pq('.listing a') as $key => $a) {
$url = pq($a)->attr('href'); // will give "/shows/07-ghost"
$show = pq($a)->text(); // will give "07 Ghost"
}
Now you can process all shows individualy, make a new phpQuery::newDocumentHTML for each show and with an selector extract the information you need.
Get the status of a show
$html = file_get_contents('http://www.bierdopje.com/shows/alcatraz');
$doc = phpQuery::newDocumentHTML($html);
$status = pq('.content>span:nth-child(6)')->text();
I'm trying to make a recent news like functionality for my site. For this i've made a web crawler and have being able to collect links from a page up till now by doing the following
$dom = new domDocument;
#$dom->loadHTML(file_get_contents($url));
$dom->preserveWhiteSpaces = false;
$linksToStore = $dom->getElementsByTagName('a');
foreach($linksToStore as $tag){
$links[$tag->getAttribute('href')]= $tag->childNodes->item(0)->nodeValue;
}
how can i get contents from the pages pointed by those links related to a particular domain which in my case is 'Medical'??
Use this http://simplehtmldom.sourceforge.net/ library to extract contents from the page. The selector works same as of jQuery, which makes it very familier and efficient to extract the contents.
Also, check this http://davidwalsh.name/php-notifications to know more
i am new to Javascript and i have created the code below it works fine no problem at all however i want to know what is i want to pull the image dynamically using php and javascript from mysql database how can i refactor my code bellow. thanks in advance for your contribution.
var myimage = document.getelementById("mainImage");
var imageArray =["images/overlook.jpg","images/garden.jpg","images/park.jpg"];
var imageIndex =0;
function changeimage(){
myimage.setAttribute("src",imageArray[imageIndex]);
imageIndex++;
if(imageIndex >= imageArray.length){
imageIndex = 0;
}
setInterval(changeimage, 5000);
One of several options.
Query the database for the column with the URL of the images.
$query = mysql_query("SELECT url FROM images");
Then something like this to get an array out of it:
$images = array();
while($row = mysql_fetch_array($query)){
$images[] = $row['url'];
}
Then generate this string (that you use in the Javascript provided):
var imageArray = ["images/overlook.jpg","images/garden.jpg","images/park.
using the array you retrieved from the database. You could use json_encode in PHP for this if you don't want to mess around with error prone string building.
$imagesAsJsonArray = json_encode($images);
Echo it. Done.
Not the most elegant of solutions. But it gives you something to play with. Check out a few PHP tutorials online and you'll soon get the hang of it.
Two choices:
Using PHP when your page is created, put an array of images in the page and use page-level javascript to cycle among them.
Using Ajax in the page, call from the page to the server to get the next image and then use client-side javascript to make that returned image visible on the page.