$page = $curl->post($baseUrl.'/submit.php', array('url'=>$address,'phase'=>'1','randkey'=>$randKey[0],'id'=>'c_1'));
$exp = explode('recaptcha_image',$page);
The id recaptcha_image is not found, although if I echo $page; the webpage is displayed and, surprisingly, even the recaptcha div (with the image itself). cURL shouldn't load the recaptcha image, but somehow it does; yet when I try to find the div, it is not there. Is there a way to capture the URL of the recaptcha image?
You'll want to use an HTML parser like the PHP Simple HTML DOM Parser.
Something like this should then work:
<?php
include("simple_html_dom.php"); // PHP Simple HTML DOM Parser

$page = $curl->post($baseUrl.'/submit.php', array('url'=>$address,'phase'=>'1','randkey'=>$randKey[0],'id'=>'c_1'));
$html = new simple_html_dom();
$html->load($page);
$ret = $html->find('script[src^=http://api.recaptcha.net/]', 0);
$src = $ret->src;
// I'm not sure how you fetch a URL with your library, so this might or might not work
$page = $curl->get($src);
preg_match("%challenge\ :\ '([a-zA-Z0-9-_]*)',%", $page, $matches);
$img = "http://api.recaptcha.net/image?c=".$matches[1];
?>
This first fetches the page, parses it for the recaptcha script URL, then opens that URL to read the challenge token, which is appended to the image URL. The final image URL will be in the $img variable.
My goal is to scrape search results with PHP Simple HTML DOM Parser,
which has been working fine for me. But every day or two, Google changes its HTML structure and my code stops working.
Here's my code that was working before:
include("simple_html_dom.php");
$data = file_get_contents('https://www.google.com/search?q=stackoverflow');
$html = str_get_html($data);
$i=0;
$linkObjs = $html->find('h3[class=r] a');
foreach ($linkObjs as $linkObj) {
$i++;
$url = trim($linkObj->href);
$trim = substr($url, 0, 7);
if ($trim=="/url?q=") {
$url = substr($url, 7);
}
$trim_2 = stripos($url, '&sa=U');
if ($trim_2 != false) {
$url = substr($url, 0, $trim_2);
}
echo "$i:".$url.'<br>';
}
They usually change the class names and tag names, along with the HTML link structure.
I had the same problem. Try
$linkObjs = $html->find('div[class=jfp3ef] a');
and it will work again.
I had a similar experience. When I search Google from the ordinary user interface, the URLs of the "hit" pages still show up in an <a> tag (of course) inside a div with class 'r'. But when I run my scraping program with the exact same search terms and parameters, the 'r' changes to 'kCrYT'. I changed that in my code and got the program working again. (Yay!)
But I suspect the class will change regularly when Google detects that someone is submitting the search programmatically. So this might not be a permanent solution.
Maybe I could add a little extra code that determines which class name is currently in use, so that my program could adapt to these changes automatically; a rough sketch of that idea follows.
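Something like this might do it, using the same parser. It's untested and assumes Google keeps wrapping each result link in a parent element with a (rotating) class name; the "/url?q=" prefix on result hrefs is the stable hook.
include("simple_html_dom.php");

$data = file_get_contents('https://www.google.com/search?q=stackoverflow');
$html = str_get_html($data);

// Discover the class Google is currently using: result links are
// recognisable by their "/url?q=" prefix even when class names rotate.
$currentClass = null;
foreach ($html->find('a') as $a) {
    if (strpos($a->href, '/url?q=') === 0 && $a->parent() && $a->parent()->class) {
        $currentClass = $a->parent()->class; // e.g. "jfp3ef" or "kCrYT"
        break;
    }
}

// Fall back to the last known class if detection failed.
$linkObjs = $html->find('div[class=' . ($currentClass ?: 'kCrYT') . '] a');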
I am trying to scrape a website in order to get the latitude and longitude for counties in the US (there are 3306, which is why I am trying to do it through code and not manually).
I am using the code below:
function GetLatitude($countyName, $stateShortName) {
    // Create DOM from URL
    $page = file_get_contents("https://www.mapdevelopers.com/geocode_tool.php?$countyName,$stateShortName");
    $doc = new DOMDocument();
    $doc->loadHTML($page);
    $node = $doc->getElementById("display_lat");
    var_dump($doc);
}
GetLatitude("Guilford County","NC");
This returns nothing, but if I change the URL to drop the parameters ("https://www.mapdevelopers.com/geocode_tool.php"), then I can see that $doc now has some information in it. That doesn't help, though, because the value I need (latitude) depends on the parameters passed in the URL.
How do I solve this issue?
EDIT:
Based on the suggestion to encode the parameters, I changed my code to the following. Now the document contains information, but it appears as though the parameters are being ignored:
<?php
function GetLatitude($countyName, $stateShortName) {
    $countyName = urlencode($countyName);
    $stateShortName = urlencode($stateShortName);

    // Create DOM from URL
    $page = file_get_contents("https://www.mapdevelopers.com/geocode_tool.php?address=$countyName,$stateShortName");
    $doc = new DOMDocument();
    $doc->loadHTML($page);
    $node = $doc->getElementById("display_lat");
    var_dump($doc);
}
GetLatitude("Clarke County","AL");
?>
Your issue is that the latitude information etc. isn't present on page load; JavaScript puts it there.
You're going to have a hard time running a webpage's JavaScript and scraping the result from PHP without something in the middle. You might retry this project with something like Puppeteer or PhantomJS so you can run your script against a real browser.
Searching the page, there is an AJAX request to https://www.mapdevelopers.com/data.php.
Sending a POST or GET request to that endpoint will give you the response you are looking for; a sketch follows.
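A minimal cURL sketch of that idea. The field names sent to data.php here ("operation" and "address") are guesses, not confirmed: open your browser's network tab, find the AJAX call, and copy the real parameters.
// Replicate the page's AJAX call directly. NOTE: "operation" and
// "address" are hypothetical field names; copy the real ones from
// the request your browser sends.
$ch = curl_init('https://www.mapdevelopers.com/data.php');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => http_build_query(array(
        'operation' => 'geocode',             // hypothetical
        'address'   => 'Guilford County, NC', // hypothetical
    )),
));
$response = curl_exec($ch);
curl_close($ch);

// The endpoint presumably returns JSON; decode and inspect it.
var_dump(json_decode($response, true));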
I need to scrape this HTML page using PHP ...
http://www.cittadellasalute.to.it/index.php?option=com_content&view=article&id=6786:situazione-pazienti-in-pronto-soccorso&catid=165:pronto-soccorso&Itemid=372
... I need to extract the numbers for the rows "Rosso", "Giallo", "Verde" and "Bianco" (note that these numbers are dynamic, so they can change when you refresh the page, but that doesn't matter ...).
I've seen that these rows are inside some iframes (for example ... http://listeps.cittadellasalute.to.it/?id=01090201 ), and the values are loaded via an AJAX request (for example http://listeps.cittadellasalute.to.it/gtotal.php?id=01090101).
Are there ways to scrape these values directly from the original HTML page using PHP and $xpath->query (I'd like to avoid parsing the individual JSON responses ...)?
Suggestions / examples?
I think the problem is that the values aren't in the original page; they are built once the page is loaded. So you would need something which honours all the JavaScript functionality (i.e. Selenium WebDriver), which is a bit overkill for what you want to do (I assume). It's much easier to process the iframes directly.
You could extract the URLs of the iframes from the original page ...
$url = "http://www.cittadellasalute.to.it/index.php?option=com_content&view=article&id=6786:situazione-pazienti-in-pronto-soccorso&catid=165:pronto-soccorso&Itemid=372";
$pageContents = file_get_contents($url);
$page = simplexml_load_string($pageContents, "SimpleXMLElement", LIBXML_NOERROR | LIBXML_ERR_NONE);
$ns = $page->getDocNamespaces();
$page->registerXPathNamespace('def', array_values($ns)[0]);
$iframes = $page->xpath("//def:iframe");
foreach ( $iframes as $frame ) {
echo "iframe:".$frame['src'].PHP_EOL;
}
Which gives (just now)
iframe:http://listeps.cittadellasalute.to.it/?id=01090101
iframe:http://listeps.cittadellasalute.to.it/?id=01090201
iframe:http://listeps.cittadellasalute.to.it/?id=01090301
iframe:http://listeps.cittadellasalute.to.it/?id=01090302
You can then process these pages; for example:
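Since each iframe fills itself from the gtotal.php endpoint mentioned in the question, one option is to skip the iframe HTML entirely and call that endpoint with the same id. A sketch, reusing $iframes from above; the shape of gtotal.php's response is an assumption, so inspect one response first and adjust the parsing.
foreach ($iframes as $frame) {
    // Pull the ?id=... value out of the iframe URL ...
    parse_str(parse_url((string)$frame['src'], PHP_URL_QUERY), $params);

    // ... and query the AJAX endpoint the iframe itself would call.
    $data = file_get_contents("http://listeps.cittadellasalute.to.it/gtotal.php?id=" . $params['id']);
    echo $params['id'] . ": " . $data . PHP_EOL;
}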
Using the PHP GD image library, I have successfully output an image with text taken from URL parameters (a, b, c).
I need to be able to send these images to the Facebook sharing URL so that they can be shared on social media.
https://www.facebook.com/sharer/sharer.php?u=http://example.com/script.php?a=1&b=2&c=3
However, the sharing link does not seem to accept my PHP parameters. When I test the URL, it pulls the image but does not send any numbers, resulting in no text carried through.
Is there a way to save the complete image with parameters and have it sent to the Facebook sharing URL? I am doing this through a link embedded in an email, so it cannot use anything more complicated than basic HTML.
You'll likely need to encode your URL so that the ?, = and & aren't read by Facebook's PHP script.
See here for details of encoding.
? is %3F, = is %3D and & is %26
So your URL would be:
https://www.facebook.com/sharer/sharer.php?u=http://example.com/script.php%3Fa%3D1%26b%3D2%26c%3D3
Note: I've not tested this as I don't want to post to facebook :)
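If you'd rather not write the percent-escapes by hand, PHP can build the same thing with rawurlencode() (equally untested against Facebook):
// Encode the inner URL so its ?, = and & survive as data inside the
// sharer's "u" parameter.
$target   = 'http://example.com/script.php?a=1&b=2&c=3';
$shareUrl = 'https://www.facebook.com/sharer/sharer.php?u=' . rawurlencode($target);
echo $shareUrl;
// => https://www.facebook.com/sharer/sharer.php?u=http%3A%2F%2Fexample.com%2Fscript.php%3Fa%3D1%26b%3D2%26c%3D3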
So I did eventually end up solving this. After giving up on pushing a PHP image to Facebook with URL parameters included, I tried placing the image into an email. This worked well in nearly every client EXCEPT Gmail. I had to convert the URL parameters to get around the Gmail image proxy, which allowed the image to be displayed AND also happened to make it usable in the Facebook sharer. Double hooray!
The original way I had it set up was to link the PHP image with URL parameters and use $_GET to read them onto the image:
script-wrong.php
$var1 = $_GET['a'];
$var2 = $_GET['b'];
$var3 = $_GET['c'];
The correct way to do this is the following:
script.php
$uri = $_SERVER['REQUEST_URI'];
$path = substr($uri, strpos($uri, "a=")); // start of your URL parameters
$delim = 'abc=&'; // every character used in the parameter names and separators (a, b, c, =, &)

// Split the path on those characters, collecting the values in between.
$tok = strtok($path, $delim);
$tokens = array();
while ($tok !== false) {
    array_push($tokens, $tok);
    $tok = strtok($delim);
}

$var1 = $tokens[0];
$var2 = $tokens[1];
$var3 = $tokens[2];
What this does is look at the URL and split it on the specified characters ($delim), so that whatever falls between them ends up as tokens in an array; those token values are then placed on the image. For what it's worth, a variant using parse_str() is sketched below.
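A tighter variant of the same trick, offered as a sketch: when the parameters ride in the path (no "?"), PHP exposes them as $_SERVER['PATH_INFO'], and parse_str() can split them without manual tokenising. Whether PATH_INFO is populated depends on your server configuration.
// "/a=1&b=2&c=3" arrives in PATH_INFO when the parameters follow the
// script name as a path segment rather than a query string.
$pathInfo = isset($_SERVER['PATH_INFO']) ? $_SERVER['PATH_INFO'] : '';
parse_str(ltrim($pathInfo, '/'), $params);

$var1 = isset($params['a']) ? $params['a'] : null;
$var2 = isset($params['b']) ? $params['b'] : null;
$var3 = isset($params['c']) ? $params['c'] : null;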
Here is how I set up my php image to display in the email:
<img src="http://example.com/script.php/a=1&b=2&c=3">
And my share URL:
https://www.facebook.com/sharer/sharer.php?u=http://example.com/script.php/a=1%26b=2%26c=3
I am trying to find the next page's link from a particular page (I'll call that particular page the "current page" here). The current page my program uses is
http://en.wikipedia.org/wiki/Category:1980_births
The next page link which I am extracting from the current page is the one below:
http://en.wikipedia.org/w/index.php?title=Category:1980_births&pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages
But when file_get_contents() loads the next page link, it gets the current page's contents ...
The code is
<?php
$string = file_get_contents("http://en.wikipedia.org/wiki/Category:1980_births"); // get the contents of the current page
preg_match_all("/\(previous page\) \(<a href=\"(.*)\" title/", $string, $matches); // extract the next-page link from the current page contents
$next_page_link = $matches[1][0]; // keep only the first match

// The extracted link contains only the path, not the domain, so prepend it;
// this has no impact on the problem statement.
$next_page_link = "http://en.wikipedia.org" . $next_page_link;

$string1 = file_get_contents($next_page_link);
echo $next_page_link;
echo $string1;
?>
As per the code, $string1 should hold the next page link's content, but instead it just gets the current page's content.
In the source of the original website, the links have entity-encoded ampersands (see Do I encode ampersands in <a href…>?). The browser decodes them normally when you click the anchor, but your scraping code does not. Compare
http://en.wikipedia.org/ ... &amp;pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages
versus
http://en.wikipedia.org ... &pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages
This malformed query string is what you are in fact passing into file_get_contents. You can convert the entities back to regular ampersands like this:
// $next_page_link = $matches[1][0];
$next_page_link = html_entity_decode($matches[1][0]);