I'm trying to parse a curl response to get a reCAPTCHA image. Sadly my regex experience isn't that good. How can I find the src link in this markup using regex?
<img id="recaptcha_challenge_image" alt="reCAPTCHA-Bild" height="57" width="300" src="https://www.google.com/recaptcha/api/image?c=03AHJ_VuvfHOCmjv_2G6NuZOIpbC-S7FGqqg-WM4ytRKoMD7oeXGHV_fvqim2YJSw3qkD7XcSxSlCOLAeBd4vy0S7KTrLiL7xSwc2YvGnc5Q9EUXllGa-56zXarU7Pq4btSzHn0anQPToXtBgDBdu69P-SwUpnzs1q08w9GTL8TEtpMAYYcW1hAXqpQEnPQO_h5tmtEDyr9WlgvVVTNcMwDsTJOnjsGmH4kg&th=,skNXQ2Kw6Nd0dqSGIn-2Ei0m-GKXNcjwAAAAcaAAAADlawORO7hNRBFlts00ZWNVrY7qTydBWIpzMqbUQsFjKKE1a8QhNJQqFgffS8OED15Kp3u0DRvaI7SCeuXwN6hkhGhMqCqgpe9cY2VyxbkFK8pRpnNcz5gnlz43NZsAyBZmvlX-BmLj6Hgq-6DzW0pWn3WFpGmM2UhRNudpy91Vyg7OwtD0xNhMGubBALOcAmR35Em3TsWYVh9mCL37V0nJ2xEBjjBQqCtn1Yoz5fLgxd3xp7229BR1zIremO1wJMCVljmT5uxYHyArQsN5PYqF-C7u4NkOSKS6PmkRTKAdV1NihEGAKxREfCsttZbVOY7MOVBF4zemvUDwIwA8pMJrnbbwwwXEinA2-w1NNS0tCxrubtqKCxu2EQsHRQ37V5NzI-Y_0qYMdxDa3wuXdS2ojVsjwXzVnBbYnG7LFJ1BJlZev4lugPtZh2siHolIbmHT8L1z3MMk0DEH0QnhNkd9x6e66tRyqYs7PsoteXmqa76sbqb945WtI5jsiJY4wo9yuKtGH03HmxdqhftgOk9OM6Gjjvhu1lxfW8tkOhehrGD5Td7z0L5fywtXmexRSlEQ5B4_OA3LEVmoCMUjW18GXDBj_lZPjAQ-mp6zV4a19ht88ilWfFTanLZ6d9FKsRrdwlNIS6cDzVBKT90mgXKARhibHrrSXujgo1l-gDbJ0o6xJBqSIugP155OVvwhJHW_ofOnvBuxgbvvsvOfskyGcFdnPoBIwrK-47AHx4H2jryUbCc3wLAtOcUicS_I2PRxKSUuUmYUk-bQq00scg0mDoI6QlD12pkPvmNA_QDyPqKjv5z8fc5HLVIAqFdBFdbWImHFKku0clxNX_qebl7r-C7e7LNBTngIFRdtFzAX_VjZHqRouemq2y89UA30WP65JSzzbUPt-z-tb6eKW3QD0eOlm28YkbYib9mdl85bIy61rS8bCHtFuKlcTMSzZyqMJhH25faKCTPkkXHhkPnO7IkMEmyll3LA5kjkc9RwTWFgF64RqLC-BqLscVi0GbVcCodMSVy1-kRGRqPr2ZaMwbLJSJq94Dy7reaex9rgiWEfpM0jEj_b1UeGUAEENhcPM3N63bPF3_F39H1YX3oBve4UXURVo7JkU2-C6o1HmB7Xr74JMEpPl8Vj1zImRk7SSB8Z6KEGv8Nj2f2Pq0hgaiehokt9I1JpnFprXtlEQW3vvgDZa01jYb6kmKVJNdaq0mIuvg">
My actual code:
<?php
class get_recaptcha {
function __construct(){
$this->website = 'http://registration.zwinky.com/registration/register.jhtml';
}
function curl_post(){
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $this->website);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
function parse_content(){
preg_match('regexFilter', $this->curl_post(), $match);
$captcha_image = $match[1];
return $captcha_image;
}
}
$get_captcha = new get_recaptcha();
$captcha_response = $get_captcha->parse_content();
echo $captcha_response;
?>
Could anyone maybe help me setting up the regex filter for this?
It appears that the captcha is inside an iframe, so I think you would need to request the iframe's URL to get the captcha. Whilst not a RegEx solution, this does get the desired information. If the URL of the iframe changes, you'd need to adapt the code to read the iframe's src URL first. Hope it helps.
$url='http://www.google.com/recaptcha/api/noscript?k=6Ldx8OgSAAAAAOQu76OwUC1XwCxpEZU576k0gHIR';
$html=file_get_contents( $url );
$dom=new DOMDocument;
$dom->loadHTML($html);
$col=$dom->getElementsByTagName('img');
// print the src of every image found in the document
foreach($col as $n) echo $n->getAttribute('src'), PHP_EOL;
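To narrow this down to the one image from the question, here is a minimal offline sketch. The helper names find_image_src and find_image_src_regex are made up for illustration, and the c=EXAMPLE token is a placeholder, not a real challenge. The regex variant is included because the question asked for one; it assumes the id attribute appears before src inside the same tag, which happens to be true for the markup shown.

```php
<?php
// Sample markup modeled on the question; the "c=EXAMPLE" token is a placeholder.
$html = '<div><img id="recaptcha_challenge_image" alt="reCAPTCHA-Bild" '
      . 'height="57" width="300" '
      . 'src="https://www.google.com/recaptcha/api/image?c=EXAMPLE"></div>';

// Hypothetical helper: return the src of the <img> with the given id, or null.
function find_image_src(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Assumes $id contains no double quotes.
    $attr = $xpath->query('//img[@id="' . $id . '"]/@src')->item(0);
    return $attr === null ? null : $attr->nodeValue;
}

// Regex variant: only works while id comes before src inside the same tag.
function find_image_src_regex(string $html, string $id): ?string
{
    $pattern = '/<img\b[^>]*\bid="' . preg_quote($id, '/') . '"[^>]*\bsrc="([^"]*)"/i';
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

echo find_image_src($html, 'recaptcha_challenge_image'), PHP_EOL;
echo find_image_src_regex($html, 'recaptcha_challenge_image'), PHP_EOL;
```

In the class from the question, the regex pattern would take the place of 'regexFilter' in preg_match, with $match[1] holding the URL.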
I tried to extract the download URLs from a web page.
The code I tried is below:
function getbinaryurl ($url)
{
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
$value1 = curl_exec($curl);
curl_close($curl);
$start = preg_quote('<script type="text/x-component">', '/');
$end = preg_quote('</script>', '/');
$rx = preg_match("/$start(.*?)$end/", $value1, $matches);
var_dump($matches);
}
$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);
This way I am getting the tag info, not the content inside the script tag. How do I get the content inside?
The expected result is:
https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip,
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.6.exe,
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.6.msi
I am very new to writing regular expressions. Can anyone help me, please?
Instead of using regex, using DOMDocument and XPath allows you to have more control of the elements you select.
Although XPath can be difficult (same as regex), this can look more intuitive to some. The code uses //script[@type="text/x-component"][contains(text(), "macURL")] which, broken down, is
//script = any script node
[@type="text/x-component"] = which has an attribute called type with the specific value
[contains(text(), "macURL")] = whose text contains the string macURL
The query() method returns a list of matches, so loop over them. The content is JSON, so decode it and output the values...
function getbinaryurl ($url)
{
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FRESH_CONNECT, true);
$value1 = curl_exec($curl);
curl_close($curl);
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($value1);
libxml_use_internal_errors(false);
$xp = new DOMXPath($doc);
$srcs = $xp->query('//script[@type="text/x-component"][contains(text(), "macURL")]');
foreach ( $srcs as $src ) {
$content = json_decode( $src->textContent, true);
echo $content['params']['macURL'] . PHP_EOL;
echo $content['params']['windowsURL'] . PHP_EOL;
echo $content['params']['enterpriseURL'] . PHP_EOL;
}
}
$url = "https://www.sourcetreeapp.com/download-archives";
getbinaryurl($url);
which outputs
https://product-downloads.atlassian.com/software/sourcetree/ga/Sourcetree_4.0.1_234.zip
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourceTreeSetup-3.3.8.exe
https://product-downloads.atlassian.com/software/sourcetree/windows/ga/SourcetreeEnterpriseSetup_3.3.8.msi
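For trying the query without hitting the network, here is a small sketch that runs the same XPath against an inline document. The HTML string and the example.com URL are invented stand-ins for the real page, and the is_array/isset guard is an addition for the case where a matching script does not hold the expected JSON.

```php
<?php
// Invented stand-in for the downloaded page: one ordinary script, one component script.
$html = '<html><body>'
      . '<script type="text/javascript">var x = 1;</script>'
      . '<script type="text/x-component">{"params":{"macURL":"https://example.com/app.zip"}}</script>'
      . '</body></html>';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xp = new DOMXPath($doc);
$urls = [];
// Attribute tests in XPath use "@", as in @type.
foreach ($xp->query('//script[@type="text/x-component"][contains(text(), "macURL")]') as $script) {
    $content = json_decode($script->textContent, true);
    // Skip scripts whose body is not the JSON shape we expect.
    if (is_array($content) && isset($content['params']['macURL'])) {
        $urls[] = $content['params']['macURL'];
    }
}

print_r($urls);
```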
I am attempting to convert an image URL provided by the Facebook API into base64 format with cURL.
The API provides a URL like this:
https://fbcdn-sphotos-g-a.akamaihd.net/hphotos-ak-xfp1/v/t1.0-9/p180x540/72099_736078480783_68792122_n.jpg?oh=f3698c5eed12c1f2503b147d221f39d1&oe=54C5BA4E&__gda__=1418090980_c7af12de6b0dd8abe752f801c1d61e0d
The issue is that the URL only works with the oh, oe and __gda__ parameters included in the URL string; there is no direct image URL. Removing the params sends you to a Facebook error page.
With the parameterized URL my curl_exec is not getting correct image data. Is there a way to get the base64 data from Facebook, or is there something I can do to get at the plain image URL given the parameterized URL?
This is what my decode script looks like:
header('Access-Control-Allow-Origin: *');
$url = $_GET['url'];
try {
$c = curl_init($url);
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 3);
$result = curl_exec($c);
curl_close ($c);
if(false===$result) {
echo 'fail';
} else {
$base64 = "data:image/jpeg;charset=UTF-8;base64,".base64_encode($result);
echo $base64;
}
} catch ( \ErrorException $e ) {
echo 'fail';
}
To address your specific problem, your script is likely failing because the required oh, oe, __gda__ parameters are getting separated during the GET request and therefore are not included in $_GET['url'].
Make sure you're using a URL-encoded string so any unencoded & characters aren't handled as delimiters. Then just decode the string before passing it on to cURL.
...
$url = urldecode($_GET['url']);
...
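To see the splitting in isolation, here is a minimal sketch that uses parse_str as a stand-in for the way PHP fills $_GET. The example.com URL and its parameter values are placeholders. It shows the calling side: encoding the image URL with rawurlencode before it is appended as the url parameter.

```php
<?php
// Placeholder image URL that carries its own query parameters.
$imageUrl = 'https://example.com/photo.jpg?oh=abc&oe=def&__gda__=123';

// Unencoded: the inner "&" characters end the "url" value early.
parse_str('url=' . $imageUrl, $broken);

// Encoded: the whole image URL travels as one value.
parse_str('url=' . rawurlencode($imageUrl), $intact);

var_dump($broken['url']);               // the URL cut off after "oh=abc"
var_dump(isset($broken['oe']));         // true: "oe" became a separate parameter
var_dump($intact['url'] === $imageUrl); // true: nothing was lost
```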
For anyone curious, you can still load any Facebook image from any one of their legacy CDNs without needing the new parameters:
https://scontent-a-iad.xx.fbcdn.net/hphotos-frc3/
https://scontent-b-iad.xx.fbcdn.net/hphotos-frc3/
https://scontent-c-iad.xx.fbcdn.net/hphotos-frc3/
Just append the original image filename to the URL et voila.
Disclaimer: I have no idea how long this little trick will work for so don't use it on anything important in production.
Maybe this won't help much but it seems that the original picture (ending with _o) does not need gda nor oe oh parameters
to get the original profile picture you can do:
var username_or_id = "name.lastname" //Example
get_url ("http://graph.facebook.com/$username_or_id/picture?width=9999")
hth
I had a similar problem. My solution:
$url = urldecode($url);
return base64_encode(file_get_contents($url));
Where the URL is to Graph API:
https://graph.facebook.com/$user_id/picture?width=160
(You probably also want to check whether file_get_contents actually returned something.)
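A minimal sketch of that check, assuming a made-up helper name fetch_as_base64. It reads from a temporary local file so it runs without network access; in the real case $source would be the Graph API URL.

```php
<?php
// Hypothetical helper: base64-encode whatever $source points to, or null on failure.
function fetch_as_base64(string $source): ?string
{
    // "@" hides the warning file_get_contents raises when the read fails.
    $bytes = @file_get_contents($source);
    if ($bytes === false || $bytes === '') {
        return null; // nothing usable came back
    }
    return base64_encode($bytes);
}

// Local stand-in for the remote image.
$tmp = tempnam(sys_get_temp_dir(), 'img');
file_put_contents($tmp, 'abc');

var_dump(fetch_as_base64($tmp));              // string(4) "YWJj"
var_dump(fetch_as_base64($tmp . '.missing')); // NULL

unlink($tmp);
```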
You just need to set CURLOPT_SSL_VERIFYPEER to false, as the URL from Facebook is https and not http. Alternatively, you could request the URL without SSL by replacing https with http.
Try the code below
$url = 'https://fbcdn-sphotos-g-a.akamaihd.net/hphotos-ak-xfp1/v/t1.0-9/p180x540/72099_736078480783_68792122_n.jpg?oh=f3698c5eed12c1f2503b147d221f39d1&oe=54C5BA4E&__gda__=1418090980_c7af12de6b0dd8abe752f801c1d61e0d';
try {
$c = curl_init($url);
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_CONNECTTIMEOUT, 3);
/***********************************************/
// you need the curl ssl_opt_verifypeer
curl_setopt($c, CURLOPT_SSL_VERIFYPEER, false);
/***********************************************/
$result = curl_exec($c);
curl_close ($c);
if(false===$result) {
echo 'fail';
} else {
$base64 = '<img alt="Embedded Image" src="data:image/jpeg;charset=UTF-8;base64,'.base64_encode($result).'"/>';
echo $base64;
}
}
catch ( \ErrorException $e ) {
echo 'fail';
}
Sort of a weird question.
From 4shared video site, I get the embed code like the following:
<embed src="http://www.4shared.com/embed/436595676/acfa8f75" width="420" height="320" allowfullscreen="true" allowscriptaccess="always"></embed>
Now, if I access the url in that embed src, the video is loaded up and the URL of the page is changed with information about the video.
I am wondering if there is any way for me to access that info using PHP? I tried file_get_contents but it gives me lots of weird characters.
So, can I use PHP to load the embed url and get the information present in the address bar?
Thanks for all your help! :)
Yes, e.g. with the curl-library of php. This one will handle the redirect-headers from the server, which result in the new/real url of the video.
Here's a sample code:
<?php
// create a new cURL resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.4shared.com/embed/436595676/acfa8f75");
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_NOBODY, 1);
// we want to further handle the content, so return it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// execute the request; the response headers come back as a string
$result = curl_exec($ch);
// did we get a good result?
if (!$result)
die ("error getting url");
// if we got a redirection http-code, split the content in
// lines and search for the Location-header.
$location = null;
if ((int)(curl_getinfo($ch, CURLINFO_HTTP_CODE)/100) == 3) {
$lines = explode("\n", $result);
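foreach ($lines as $line) {
// skip the status line and blank lines, which hold no "name: value" pair
if (strpos($line, ':') === false) continue;
list($head, $value) = explode(":", $line, 2);
// header names are case-insensitive
if (strcasecmp(trim($head), 'Location') == 0) {
$location = trim($value);
break;
}
}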
}
if ($location == null)
die("no redirect found in header");
// close cURL resource, and free up system resources
curl_close($ch);
// your location is now in here.
var_dump($location);
?>
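The header-scanning step can be tried on its own with a canned response. This is a sketch only: find_location_header is a made-up name and the header text is invented, not captured from 4shared.

```php
<?php
// Hypothetical helper: return the Location header value from a raw header block, or null.
function find_location_header(string $rawHeaders): ?string
{
    foreach (preg_split('/\r\n|\n|\r/', $rawHeaders) as $line) {
        if (strpos($line, ':') === false) {
            continue; // status line or blank line
        }
        [$name, $value] = explode(':', $line, 2);
        // Header names are case-insensitive, so "location" must match too.
        if (strcasecmp(trim($name), 'Location') === 0) {
            return trim($value);
        }
    }
    return null;
}

// Invented response headers for illustration.
$headers = "HTTP/1.1 302 Found\r\n"
         . "Server: example\r\n"
         . "location: http://example.com/video/abc\r\n"
         . "\r\n";

echo find_location_header($headers), PHP_EOL; // http://example.com/video/abc
```

Limiting explode to two parts matters here, because the URL in the value contains colons of its own.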
Recently our host disabled allow_url_fopen, and it seems simplehtmldom needs it turned on. I saw a workaround on this site, simplehtmldom.sourceforge.net...aq.htm#hosting: "Use curl to get the page, then call "str_get_dom" to create DOM object". Still no luck, though. Can you tell me whether I did it properly, or am I missing something?
<?php
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'www.weather.bm/');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
$str = curl_exec($curl);
curl_close($curl);
$html= str_get_html($str);
?>
<?php
$element = $html->find("div");
$element[66]->class = "mwraping66";
foreach($html->find('.mwraping66 img') as $e)
$doc = phpQuery::newDocumentHTML( $e ); $containers = pq('.mwraping66', $doc);
foreach ( $containers as $container ) { $div = pq('img', $container);
$div->eq(1)->removeAttr('style')->addClass('thumbnail')->html( pq( 'img', $div->eq(1))->removeAttr('height')->removeAttr('width')->removeAttr('alt') );
} print $doc;
?>
<?php
$element = $html->find("div");
$element[31]->class = "mwraping31";
foreach($html->find('.mwraping31') as $e)
echo $e->plaintext;
?>.................................
compared to:
<?php
include('simple_html_dom.php');
include ('phpQuery.php');
// Create DOM from URL
$html = file_get_html('www.weather.bm/');
?>
<?php
$element = $html->find("div");
$element[66]->class = "mwraping66";
foreach($html->find('.mwraping66 img') as $e).....
Thank you for your help
I know this is too late to answer, but I found similar questions and answers on this forum; this is the link: Using simple html dom. I am not sure whether this will answer your query, because I am also new to DOM. Try this modified simple_html_dom.php file: http://webarto.com/82/php-simple-html-dom-curl. It uses curl instead of file_get_contents. This file is working for me, and its usage is the same as the original simple_html_dom.php.
preg has always been a tool I like, but I can't figure out for the life of me whether what I want to do is possible, let alone how to do it.
What I want is for preg_match to return a div's innerHTML. The problem is that the div I'm trying to read has more divs in it, and my pattern keeps closing on the first closing tag it finds.
Here is my actual code:
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
preg_match('% <div id="torrent_details">(.*)</div> %six', $data, $match);
print_r($match);
This has been updated for TomcatExodus's help
Live at :: http://megatorrentz.com/beta/details.php?hash=98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6
<?php
$scrape_address = "http://isohunt.com/torrent_details/133831593/98e034bd6382e0f4ecaa9fe2b5eac01614edc3c6?tab=summary";
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
$domd = new DOMDocument();
libxml_use_internal_errors(true);
$domd->loadHTML($data);
libxml_use_internal_errors(false);
$div = $domd->getElementById("torrent_details");
if ($div) {
$dom2 = new DOMDocument();
$dom2->appendChild($dom2->importNode($div, true));
echo $dom2->saveHTML();
} else {
echo "Has no element with the given ID\n";
}
Using regular expressions often leads to problems when parsing markup documents.
XPath version - independent of the source layout. The only thing you need is a div with that id.
<?php
$dom = new DOMDocument();
@$dom->loadHTMLFile($url);
$xp = new DOMXPath($dom);
$result = $xp->query("//*[@id = 'torrent_details']");
$div=$result->item(0);
if($result->length){
$out =new DOMDocument();
$out->appendChild($out->importNode($div, true));
echo $out->saveHTML();
}else{
echo "No such id";
}
?>
And this is the fix for Maerlyn's solution. It didn't work because getElementById() wants a DTD with the id attribute specified. I mean, you can always build a document with "apple" as the record id, so you need something that says "id" is really the id for this tag.
<?php
$domd = new DOMDocument();
$domd->validateOnParse = true;
@$domd->loadHTML($data);
//this doesn't work as the DTD is not specified
//or the specified id attribute is not the attributed called "id"
//$div = $domd->getElementById("torrent_details");
/*
* workaround found here: https://fosswiki.liip.ch/display/BLOG/GetElementById+Pitfalls
* set the "id" attribute as the real id
*/
$elements = $domd->getElementsByTagName('div');
if (!is_null($elements)) {
foreach ($elements as $element) {
//try-catch needed because of elements with no id
try{
$element->setIdAttribute('id', true);
}catch(Exception $e){}
}
}
//now it works
$div = $domd->getElementById("torrent_details");
//Print its content or error
if ($div) {
$dom2 = new DOMDocument();
$dom2->appendChild($dom2->importNode($div, true));
echo $dom2->saveHTML();
} else {
echo "Has no element with the given ID\n";
}
?>
Both of the solutions work for me.
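Both versions above print the div including its own wrapper tag. If only the innerHTML is wanted, as the question asked, one way is to serialize the child nodes one by one. This is a sketch on an inline sample; inner_html_by_id is a made-up helper name and the markup is invented.

```php
<?php
// Hypothetical helper: return the inner HTML of the element with the given id, or null.
function inner_html_by_id(string $html, string $id): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // real pages are rarely valid HTML
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // XPath lookup avoids the getElementById() DTD issue; assumes $id has no double quotes.
    $node = $xpath->query('//*[@id="' . $id . '"]')->item(0);
    if ($node === null) {
        return null;
    }

    // Serialize each child, which leaves out the wrapper element itself.
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    return $inner;
}

// Invented sample with nested divs, the case that trips up the regex.
$html = '<div id="torrent_details"><div class="a">one</div><div class="b">two</div></div>';
echo inner_html_by_id($html, 'torrent_details'), PHP_EOL;
```

This prints the two nested divs without the torrent_details wrapper; exact whitespace between them can vary with the libxml version.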
You can do this:
/<div[^>]*>(.*)<\/div>/i
Which would give you the largest possible innerHTML.
You cannot. I will not link to the famous question, because I dislike the pointless drivel on top. But still regular expressions are unfit to match nested structures.
You can use some trickery, but this is neither reliable, nor necessarily fast:
preg_match_all('#<div id="1">((<div>.*?</div>|.)*?)</div>#ims'
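To make the nesting problem concrete, here is a small sketch on invented markup. Neither the lazy nor the greedy quantifier can tell which closing tag belongs to the outer div.

```php
<?php
// Invented sample: the div we want contains a nested div and is followed by a sibling.
$html = '<div id="a"><div>inner</div> tail</div><div id="b">other</div>';

// Lazy: stops at the first </div>, which belongs to the nested div.
preg_match('#<div id="a">(.*?)</div>#s', $html, $lazy);

// Greedy: runs to the last </div> in the input, swallowing the sibling too.
preg_match('#<div id="a">(.*)</div>#s', $html, $greedy);

echo $lazy[1], PHP_EOL;   // <div>inner
echo $greedy[1], PHP_EOL; // <div>inner</div> tail</div><div id="b">other
```

The wanted result, `<div>inner</div> tail`, lies between the two, which is why a parser that tracks nesting is the safer tool.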
Your regex had a problem due to the /x flag not matching the opening div. And you used a wrong assertion notation.
preg_match_all('% <div \s+ id="torrent_details">(?<innerHtml>.*)</div> %six', $html, $match);
echo $match['innerHtml'];
That one will work, but you should only need preg_match, not preg_match_all. If the pages are written well, there should only be one instance of id="torrent_details" on the given page.
I'm retracting my answer. This will not work properly. Use DOM for navigating the document.
Haha, did it with a bit of tampering. Thanks for the DOMDocument idea; I just had to use SimpleXML:
$ch = curl_init($scrape_address);
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, '1');
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_ENCODING, "");
$data = curl_exec($ch);
$doc = new DOMDocument();
libxml_use_internal_errors(false);
$doc->strictErrorChecking = FALSE;
libxml_use_internal_errors(true);
$doc->loadHTML($data);
$xml = simplexml_import_dom($doc);
print_r($xml->body->table->tr->td->table[2]->tr->td[0]->span[0]->div);