Right now, I have the current php code:
<?php
include('simple_html_dom.php');
# set up the request parameters
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'https://www.google.com/search?q=sport+news');
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_MAXREDIRS, 0);
$result = curl_exec($curl);
curl_close($curl);
echo $result;
?>
When this code is run, it returns a Google page with the search results for "sport news". However, when you try to click on any of these links, it redirects you to 'localhost:/--url--'. How do I prevent cURL from redirecting to localhost and instead redirect to the actual site?
I am currently using WampServer for testing.
This happens because Google's result page is using relative URLs in the links.
<a href="/url?q=https://www.bbc.co.uk/sport/43634915&sa=U&ved=2ahUKEwjX (...)
Notice that the href starts with /, not with a domain such as href="https://foobar.com/url?q=.
Therefore, the links will use the hostname of the page serving the results.
The reason you get localhost in the results when clicking them is that you are serving this code from localhost.
One solution could be to use the DOMDocument PHP extension to parse links, and add a hostname, so that the result links are absolute, rather than relative.
For example:
// Ignore HTML errors in the fetched markup
libxml_use_internal_errors(true);
// Instantiate parser
$dom = new DOMDocument;
// Load HTML into DOM document parser (loadHTML, not loadXML:
// real-world pages are rarely valid XML)
$dom->loadHTML($result);
// Select anchor tags
$links = $dom->getElementsByTagName('a');
// Iterate through all links
foreach ($links as $link) {
    // Get relative link value
    $relativePath = $link->getAttribute('href');
    // Check if this is a relative link
    if (substr($relativePath, 0, 1) === '/') {
        // Prepend Google domain (no trailing slash, since
        // $relativePath already starts with one)
        $link->setAttribute('href', "https://google.com" . $relativePath);
    }
}
echo $dom->saveHTML();
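Another option, assuming the fetched markup contains a literal <head> tag (a sketch, not tested against Google's actual output), is to inject a <base> element instead of rewriting each link; the browser then resolves every relative URL against Google instead of localhost:
// Sketch: let the browser resolve relative links against Google.
// Assumes the page contains a plain "<head>" tag.
$result = str_replace('<head>', '<head><base href="https://www.google.com/">', $result);
echo $result;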
I'm trying to parse the title from a redirected page. Here is my code:
<?php
include_once('simple_html_dom.php');
$link="https://duckduckgo.com/?q=!ducky+google";
$html = file_get_html($link);
foreach ($html->find('title') as $text){
echo $text->plaintext."<br/>";
}
?>
The result should be "Google". Thanks
I'm not sure I 100% understood your request, but here are a few things to help you move on!
Three things:
The "!" in $link redirects you to Google. Delete it if you want to access the DuckDuckGo result page.
simple_html_dom can't access the DuckDuckGo result page. Did you try to echo $html to see what you get? I tried and was blocked by a captcha... you'll need to figure out how to bypass it. Then, and only then, will you have access to the titles.
Finally, your titles are h2 elements... it might be easier to reach the h2 tags with the parser.
Does this help?
If you find a way to bypass the captcha, let me know! I'm interested :)
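If you want to experiment with the first and third points, here is a minimal sketch: fetch the result page with cURL using a browser-like User-Agent (no guarantee this gets past the captcha), then parse the h2 tags with simple_html_dom's str_get_html(). The User-Agent string is just an example:
<?php
include_once('simple_html_dom.php');

$ch = curl_init('https://duckduckgo.com/?q=google'); // "!" removed, per the first point
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
// Example browser User-Agent; may or may not avoid the captcha
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)');
$body = curl_exec($ch);
curl_close($ch);

$html = str_get_html($body); // parse the string we already fetched
if ($html) {
    foreach ($html->find('h2') as $h2) { // the titles are h2 elements
        echo $h2->plaintext . "<br/>";
    }
}
?>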
First use cURL to find the URL of the redirected page, then use Simple DOM:
<?php
$url = "https://duckduckgo.com/?q=!ducky+google";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // Must be set to true so that PHP follows any "Location:" header
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$a = curl_exec($ch); // $a will contain the response headers and body
$url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL); // This is what you need: the last effective URL
curl_close($ch);
echo $url; // your URL here
?>
And now use your code:
<?php
include_once('simple_html_dom.php');
$html = file_get_html($url);
foreach ($html->find('title') as $text){
echo $text->plaintext."<br/>";
}
?>
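As a side note, the cURL call already downloaded the final page, so you can skip the second download that file_get_html() performs. Since CURLOPT_HEADER is true, the response headers are prepended to the body; CURLINFO_HEADER_SIZE tells you how much to strip. A self-contained sketch:
<?php
include_once('simple_html_dom.php');

$ch = curl_init("https://duckduckgo.com/?q=!ducky+google");
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$a = curl_exec($ch);
// Total size of all headers received (including those of any redirects)
$headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
curl_close($ch);

$html = str_get_html(substr($a, $headerSize)); // parse without re-downloading
if ($html) {
    foreach ($html->find('title') as $text) {
        echo $text->plaintext . "<br/>";
    }
}
?>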
I have code that gets the playlist link from the page source, then redirects to that playlist link. Through this PHP file, the playlist runs in the VLC program via a user agent. It worked without problems, but now it shows me 403 Forbidden:
image: https://i.postimg.cc/7Yw1Fs2f/image.png
Yet when I copy the playlist link and put it directly inside VLC with the user agent, it works without any problems:
image: https://i.postimg.cc/RVd1477G/image.png
Please help me check the code:
<?php
$html = file_get_contents("http://wssfree.com/WSSphp/wssbeinsports1/wssbeinsports1.php");
preg_match_all(
'/(http.*?wmsAuthSign\=[^\&\">]+)/',
$html,
$posts, // will contain the article data
PREG_SET_ORDER // formats data into an array of posts
);
foreach ($posts as $post) {
$link = $post[0];
header('Location: ' . $link);
exit;
}
?>
and user agent = freeapppsss
Try this to get the HTML code:
<?php
$url = "http://wssfree.com/WSSphp/wssbeinsports1/wssbeinsports1.php";
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, false);
$html = curl_exec($curl);
curl_close($curl);
echo '<textarea>'.$html.'</textarea>';
You can then use the variable $html to pull out the URL you want.
I do note that you have a foreach loop with an exit; inside, which stops the script on the first item, so you don't need preg_match_all, just preg_match.
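Also, since you said the stream works in VLC only with the user agent set, one plausible cause of the 403 is the server rejecting PHP's default user agent (cURL sends none unless you set one). A sketch of the same request sending the user agent from your question, combined with the single preg_match:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_HEADER, false);
// Send the same user agent that works in VLC
curl_setopt($curl, CURLOPT_USERAGENT, 'freeapppsss');
$html = curl_exec($curl);
curl_close($curl);

// One match is enough, since the script redirects immediately
if (preg_match('/(http.*?wmsAuthSign\=[^\&\">]+)/', $html, $m)) {
    header('Location: ' . $m[1]);
    exit;
}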
I am doing data scraping on a website, where I'm trying to get all product URLs from a category page. I am not sure why simple_html_dom isn't returning the product URLs from the category page. Here is my PHP code.
// Require simplehtmldom
require_once 'includes/simplehtmldom_1_5/simple_html_dom.php';
// Category page URL
$srcurl = 'http://www.lastcall.com/Hers/Womens-Apparel/Dresses/Cocktail/cat11210008_cat5900001_cat6150001/c.cat#userConstrainedResults=true&refinements=&page=1&pageSize=120&sort=&definitionPath=/nm/commerce/pagedef_rwd/template/EndecaDriven&locationInput=&radiusInput=100&onlineOnly=&allStoresInput=false&rwd=true&catalogId=cat11210008';
$html = file_get_html($srcurl); // get DOM from URL or file
file_get_html() wasn't returning any HTML elements from "lastcall" (it was working fine for other websites' URLs), so I used PHP's cURL like this:
// Line 1 to 4 same
// $html = file_get_html($srcurl); // get DOM from URL or file
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $srcurl);
curl_setopt($curl, CURLOPT_REFERER, $srcurl);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$str = curl_exec($curl);
curl_close($curl);
$html = new simple_html_dom(); // Create a DOM object
$html->load($str); // Load HTML from a string
echo $html; // Disply data on test page
After using cURL I am getting only the header and footer data from the above URL, but the page is not displaying the products block from which I can actually extract all the product links. I just need help with displaying the products block; later I can implement pattern matching to get the product links. Thanks in advance.
I am working on CodeIgniter (PHP) and I am trying to implement a "POST COMMENT" functionality. I want to ask how to fetch the heading and an image from a specific link when you type that link in a textarea.
$post=$_POST['post']; //post is the name of text area that contains link
$result=file_get_contents($post);
echo $result;
If I enter www.google.com in my textarea, it takes me to that site rather than returning the main image and heading.
You should extract the URL from the $_POST['post'] content, since you're assuming it contains a link.
After that, write a function to get the HTML:
function getHTML($url, $timeout)
{
    $ch = curl_init($url); // initialize curl with given url
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set useragent
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to execute
    curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
    $data = @curl_exec($ch); // was "return #curl_exec(...)": # starts a comment in PHP, so nothing ran
    curl_close($ch);
    return $data;
}
Get the heading:
$html=getHTML("http://www.website.com",10);
$res = preg_match("/<title>(.*)<\/title>/siU", $html, $matches);
if (!$res)
return null;
// Clean up title: remove EOL's and excessive whitespace.
$title = preg_replace('/\s+/', ' ', $matches[1]);
$title = trim($title);
echo $title;
Similarly, you can extract images from the HTML content. It depends on how many images you need: if you need a single image, you can extract it as above using preg_match; if you need multiple, you can use the solution below.
You can also use the simple_html_dom library to extract images like this:
include_once("simple_html_dom.php");
$html=getHTML("http://www.website.com",10);
// Find all images on webpage
foreach($html->find("img") as $element)
echo $element->src . '<br>';
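For the "main image" the original question asks about, many pages declare a preview image in an Open Graph meta tag (og:image); checking that before falling back to the first <img> is often more reliable. A sketch using the same parser (not every site provides og:image):
include_once("simple_html_dom.php");
$html = str_get_html(getHTML("http://www.website.com", 10));
if ($html) {
    // Prefer the page's declared preview image, if any
    $mainImage = null;
    foreach ($html->find('meta') as $meta) {
        if ($meta->property === 'og:image') {
            $mainImage = $meta->content;
            break;
        }
    }
    // Otherwise fall back to the first image on the page
    if ($mainImage === null && ($first = $html->find('img', 0))) {
        $mainImage = $first->src;
    }
    echo $mainImage;
}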
I am trying to retrieve the content of web pages and check if a page contains certain error keywords I am monitoring. (Instead of manually loading each URL every time to check on the sites, I hope to do this programmatically and flag errors when they occur.)
I have tried XMLHttpRequest. I am able to get the HTML content, like what I see when I "view source" on the page. But the pages I monitor run on SharePoint and the webparts are dynamically generated. I believe if an error occurs when loading these parts, I would not be able to flag it, as the HTML I pull will not contain the errors, just the usual paths to the webparts.
cURL seems to do the same. I just read about DOMDocument and I was wondering if DOMDocument processes the code or just breaks the HTML into a hierarchical structure.
I only wish to have the content of the URL (like what you get when you save a website as .txt in IE, not the HTML). Or if I can further process the HTML then it would be good too. How can I do that? Any help will be really appreciated. :)
Why do you want to strip the HTML? It's better to use it!
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$data = curl_exec($ch);
curl_close($ch);
libxml_use_internal_errors(true); // suppress warnings from real-world, non-valid HTML
$oDom = new DOMDocument();
$oDom->loadHTML($data);
// Go through the DOM and look for the error (it's similar whether it'd be
// <p class="error">error message</p> or whatever)
$errors = $oDom->getElementsByTagName("error"); // or however you get errors
foreach ($errors as $error) {
    if (strstr($error->nodeValue, 'SOME ERROR')) {
        echo 'SOME ERROR occurred';
    }
}
If you don't want to do that, you can just do:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
$data = curl_exec($ch);
curl_close($ch);
if(strstr($data, 'SOME_ERROR')) {
echo 'SOME ERROR occurred';
}