I'm unable to scrape data from a few websites using cURL.
I'm using cURL to scrape websites from URLs. It works well for about 80% of the URLs I use, but some URLs don't seem to be "scrapeable". For example, when I try to scrape https://www.nextdoorhub.com/ and https://www.atknsn.com/, it doesn't work: the page stays blank and in the end no result is returned.
This is my code:
<center>
<br/>
<form method="post" name="scrap_form" id="scrap_form" action="scrape_data.php">
<b>Enter Website URL To Scrape Data:</b>
<input type="input" name="website_url" id="website_url">
<input type="submit" name="submit" value="Submit" >
</form>
</center>
<?php
error_reporting(E_ALL ^ E_NOTICE );
$website_url = $_POST['website_url'];
$result = scrapeWebsiteData($website_url);
function scrapeWebsiteData($website_url){
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, $website_url);
curl_setopt($curl, CURLOPT_HEADER, 0);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_BINARYTRANSFER,1);
$result = curl_exec($curl);
curl_close($curl);
return $result;
}
$regextit = '/<div id="case_textlist">(.*?)<\/div>/s'; // the pattern needs an opening delimiter to match the closing one
preg_match_all($regextit, $result, $list);
/* echo "<pre>";
print_r($list[1]); die; */
$regex = '/[\'" >\t^]([^\'" \n\r\t]+\.(jpe?g|bmp|gif|png))[\'" <\n\r\t]/i';
preg_match_all($regex, $result, $url_matches);
$count = count($url_matches[1]);
// set the local path of image
$local_path = 'C:\udeytech\htdocs\tests\images\\';
for($i=0; $i<$count; $i++)
{
preg_match_all('!.*?/!', $url_matches[1][$i], $matches);
$last_part = end($matches[0]);
// match the image file name: the last path segment ending in .jpg|jpeg|gif|png
preg_match('!' . preg_quote($last_part, '!') . '(.*?\.(jpg|jpeg|gif|png))!', $url_matches[1][$i], $matche);
$secons_part = $matche[0];
$info = pathinfo($secons_part);
$image_name = $info['basename'];
//save image url in a variable
$image_url = $url_matches[1][$i];
$image_path = scrapeWebsiteData($image_url);
$file_open = fopen($local_path.$image_name, 'w');
fwrite($file_open, $image_path);
fclose($file_open);
}
?>
Have you tried loading either of these sites in your browser and looking at the responses?
nextdoorhub uses Angular and atknsn appears to rely heavily on jQuery. Long story short, these sites need to run JavaScript to render the full HTML you intend to scrape.
PHP + cURL alone won't cut it, because cURL only downloads the initial HTML and never executes scripts. Look at threads that discuss scraping Angular sites and they will point you in the right direction. (Hint: you need a tool that executes JavaScript, for example a headless browser driven from node.js.)
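One way to check whether a page falls into this category is to look at the HTML that cURL actually returns. The snippet below is only a rough heuristic, not a reliable detector: the function name `looksJavaScriptRendered`, the list of framework markers, and the 200-character threshold are all assumptions chosen for illustration.

```php
<?php
// Heuristic only: guesses whether a page relies on client-side rendering.
// The marker list and the 200-character threshold are assumptions, not a standard.
function looksJavaScriptRendered(string $html): bool
{
    // Framework markers that commonly appear in client-rendered page shells.
    $markers = ['ng-app', 'ng-view', '<app-root', 'data-reactroot'];
    foreach ($markers as $marker) {
        if (stripos($html, $marker) !== false) {
            return true;
        }
    }
    // Drop script/style blocks, then all tags, and measure the visible text left over.
    $withoutCode = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    $visibleText = trim(strip_tags((string) $withoutCode));
    return strlen($visibleText) < 200;
}

// Two invented sample documents: an empty Angular-style shell and a static page.
$shell  = '<html><body ng-app="hub"><div ng-view></div><script src="app.js"></script></body></html>';
$static = '<html><body><h1>Prices</h1><p>' . str_repeat('Plain server-rendered text. ', 20) . '</p></body></html>';

var_dump(looksJavaScriptRendered($shell));   // bool(true)
var_dump(looksJavaScriptRendered($static));  // bool(false)
```

If the check returns true for the string that `curl_exec()` gave you, the content you are after is probably built in the browser and will not appear in the cURL response.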
Related
Bear with my inexperience here, but can anyone point me in the right direction on how to change the PHP script below so that each variable parsed from the XML file (title, link, description, etc.) is sent via a POST request instead of just being output to an HTML page?
<?php
$html = "";
// the API token goes at the end of this URL, after "token="
$url = "http://api.brightcove.com/services/library?command=search_videos&any=tag:SMGV&output=mrss&media_delivery=http&sort_by=CREATION_DATE:DESC&token=";
$xml = simplexml_load_file($url);
$namespaces = $xml->getNamespaces(true); // get namespaces
for($i = 0; $i < 80; $i++){
$title = $xml->channel->item[$i]->video;
$link = $xml->channel->item[$i]->link;
$title = $xml->channel->item[$i]->title;
$pubDate = $xml->channel->item[$i]->pubDate;
$description = $xml->channel->item[$i]->description;
$titleid = $xml->channel->item[$i]->children($namespaces['bc'])->titleid;
$html .= "<h3>$title</h3>$description<p>$pubDate<p>$link<p>Video ID: $titleid<p>
<iframe width='480' height='270' src='http://link.brightcove.com/services/player/bcpid3742068445001?bckey=AQ~~,AAAABvaL8JE~,ufBHq_I6FnyLyOQ_A4z2-khuauywyA6P&bctid=$titleid&autoStart=false' frameborder='0'></iframe><hr/>";/* this embed code is from the youtube iframe embed code format but is actually using the embedded Ooyala player embedded on the Campus Insiders page. I replaced any specific guid (aka video ID) numbers with the "$guid" variable while keeping the Campus Insider Ooyala publisher ID, "eb3......fad" */
}
echo $html;
?>
@V.Radev Here's another PHP script using cURL that I think will work with the API I'm trying to send data to:
<?PHP
$url = 'http://api.brightcove.com/services/post';
//open connection
$ch = curl_init($url);
//set the url, number of POST vars, POST data
curl_setopt($ch,CURLOPT_POST, 1);
curl_setopt($ch,CURLOPT_POSTFIELDS, '$title,$descripton,$url' . stripslashes($_POST['$title,$description,$url']));
curl_setopt($ch,CURLOPT_RETURNTRANSFER, true);
// Enable for Charles debugging
//curl_setopt($ch,CURLOPT_PROXY, '127.0.0.1:8888');
$result = curl_exec($ch);
curl_close($ch);
print $result;
?>
My question is, how can I pass the variables from my feed parsing script (title, description, URL) to this new script?
I have this code from Brightcove, can I just output the variables from my parser script and send to this PHP script so that the data goes to the API?
<?php
// This code example uses the PHP Media API wrapper
// For the PHP Media API wrapper, visit http://docs.brightcove.com/en/video-cloud/open-source/index.html
// Include the BCMAPI Wrapper
require('bc-mapi.php');
// Instantiate the class, passing it our Brightcove API tokens (read, then write)
$bc = new BCMAPI(
'[[READ_TOKEN]]',
'[[WRITE_TOKEN]]'
);
// Create an array of meta data from our form fields
$metaData = array(
'name' => $_POST['bcVideoName'],
'shortDescription' => $_POST['bcShortDescription']
);
// Move the file out of 'tmp', or rename
rename($_FILES['videoFile']['tmp_name'], '/tmp/' . $_FILES['videoFile']['name']);
$file = '/tmp/' . $_FILES['videoFile']['name'];
// Create a try/catch
try {
// Upload the video and save the video ID
$id = $bc->createMedia('video', $file, $metaData);
echo 'New video id: ';
echo $id;
} catch(Exception $error) {
// Handle our error
echo $error;
die();
}
?>
POST is a request method used to access a specific page or resource. With echo you are sending data back, which means you are responding to a request, not making one. In that page you can only add response headers; the page itself is accessed with a request method such as POST, GET, PUT, etc. To send data to an API you have to issue a new request from your script.
Edit for the API request, as mentioned in the comments:
$curl = curl_init('your api url');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_POST, true);
curl_setopt($curl, CURLOPT_POSTFIELDS, $your_data_to_send);
$result_from_api = curl_exec($curl);
curl_close($curl);
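To connect this to the feed parser, the value passed to CURLOPT_POSTFIELDS can be built from the parsed item. The sketch below is one possible way to do it; the helper names `buildPostBody` and `postToApi` and the field names `title`, `description` and `url` are assumptions, since the fields the API actually expects are not shown in the question.

```php
<?php
// Hypothetical helper: turns one parsed feed item into a URL-encoded POST body.
// The field names are assumptions; use whatever the target API expects.
function buildPostBody(array $item): string
{
    return http_build_query([
        'title'       => (string) $item['title'],
        'description' => (string) $item['description'],
        'url'         => (string) $item['link'],
    ], '', '&');
}

// Sends the body with cURL and returns the raw response, or false on failure.
function postToApi(string $apiUrl, string $body)
{
    $curl = curl_init($apiUrl);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_POST, true);
    curl_setopt($curl, CURLOPT_POSTFIELDS, $body);
    $result = curl_exec($curl);
    curl_close($curl);
    return $result;
}

// Invented sample item; in the parser these values come from $xml->channel->item[$i].
$item = ['title' => 'Game recap & highlights', 'description' => 'Week 1', 'link' => 'http://example.com/v?id=7'];
$body = buildPostBody($item);
echo $body, "\n";
// prints: title=Game+recap+%26+highlights&description=Week+1&url=http%3A%2F%2Fexample.com%2Fv%3Fid%3D7

// postToApi('your api url', $body); // call this inside the loop once the real endpoint is known
```

The casts to string matter because SimpleXML returns objects, not strings, for each element.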
I have been searching for days now for a script or other solution that could help me extract specific information about companies. I want to collect the name, city, and province (Dutch) of each company. Nothing more.
At first I was thinking I could cURL the page and then use "if...then".
I found a script to get the page.
Now I want to get the information that is between specific HTML tags.
Is that possible?
Could someone please tell me where to look? In what direction?
Thanks!
EDIT:
I use the following code to get the HTML page:
function get_data($url) {
$ch = curl_init();
$timeout = 5;
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch);
return $data;
}
$returned_content = get_data('http://www.detelefoongids.nl/rijschool/zuid-holland/3-1/?what=rijschool&where=Zuid+Holland&page=2&splitType=regular&sortBy=relevance&collapsing=true&mostDominantHeading=Auto-rijscholen');
echo $returned_content;
The page at that URL contains the information I want. For example, the name of the company (let's use the first result): Dubbeldam BV Autorijschool Piet
And the location (city name): Barendrecht. These two I want to store in the database.
But how?
My preference is to use the preg_match() and preg_match_all() to parse the required fields from the html document using regex. For example:
$html = '<b>Name: </b><div id="xyz">alex</div>';
preg_match('|<b>Name:\s*</b><div id="xyz">(.*?)</div>|', $html, $m);
print "Name: $m[1]";
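For a results page with many companies, the same idea extends to preg_match_all(). The following is a minimal sketch under stated assumptions: the markup and the second company are invented for illustration, and the real page's tags will differ, so the pattern has to be adapted to the actual HTML.

```php
<?php
// Sample markup is invented for illustration; the real page's tags will differ.
$html = '<div class="item"><h4>Dubbeldam BV Autorijschool Piet</h4><span class="city">Barendrecht</span></div>'
      . '<div class="item"><h4>Example Rijschool</h4><span class="city">Rotterdam</span></div>';

// PREG_SET_ORDER groups the captures per match, so each entry describes one company.
preg_match_all('|<h4>(.*?)</h4><span class="city">(.*?)</span>|s', $html, $matches, PREG_SET_ORDER);

$companies = [];
foreach ($matches as $m) {
    $companies[] = ['name' => trim($m[1]), 'city' => trim($m[2])];
}
print_r($companies);
```

Each element of `$companies` is then ready to be inserted into the database as one row.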
I have found a solution. Please feel free to edit / tune the script :)
I solved it with the Simple HTML DOM library:
$adres = 'http://www.izee.nl';
require_once 'simple_html_dom.php'; //file SIMPLE HTML DOM
$html = file_get_html($adres); //the address I want to "strip"
// code from the Simple HTML DOM
foreach($html->find('div.infoData') as $school) {
$item['schoolnaam'] = $school->find('h4/a[itemprop=name]', 0)->plaintext;
$item['schoolplace'] = $school->find('span.city', 0)->plaintext;
$scholen[] = $item;
$data = array_filter($scholen);
//connection with the database
$con = mysqli_connect("localhost","username","password","db_schools");
if(mysqli_connect_errno()) {
echo 'There is something really bad going on...: ' . mysqli_connect_error();
exit();
}
//put stripped info in the database; trim and escape the values before embedding them in the SQL
$naam = mysqli_real_escape_string($con, trim($item['schoolnaam']));
$place = mysqli_real_escape_string($con, trim($item['schoolplace']));
$result = mysqli_query($con,"INSERT INTO tbl_scholen (schoolnamen,schoolplaces) VALUES('$naam', '$place')");
}
print_r($scholen);
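If you would rather not depend on a third-party file, PHP's built-in DOM extension can do the same extraction. This is a sketch, not a drop-in replacement: the markup below is a made-up stand-in that only reuses the class and attribute names from the script above, and it assumes the DOM extension is enabled.

```php
<?php
// Made-up markup that reuses the class names from the Simple HTML DOM script.
$html = '<div class="infoData"><h4><a itemprop="name"> Dubbeldam BV Autorijschool Piet </a></h4>'
      . '<span class="city">Barendrecht</span></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid; keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$scholen = [];
foreach ($xpath->query('//div[contains(@class, "infoData")]') as $school) {
    // Relative queries (leading dot) search only inside the current div.
    $name = $xpath->query('.//a[@itemprop="name"]', $school)->item(0);
    $city = $xpath->query('.//span[contains(@class, "city")]', $school)->item(0);
    if ($name !== null && $city !== null) {
        $scholen[] = [
            'schoolnaam'  => trim($name->textContent),
            'schoolplace' => trim($city->textContent),
        ];
    }
}
print_r($scholen);
```

The null check skips blocks where either element is missing, which avoids the fatal error that `->plaintext` on a missing node would cause.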
I'm creating a system that uses an online compiler. IDEONE offers this feature (through a web service), but charges for a high volume of compilations.
So I'm trying to use codepad instead, but it doesn't have a web service... codepad has an initial page, and when its submit button is clicked, apparently the same page loads (the form's action is "/")...
I'm using cURL to load the page, but I'm getting "Internal Server Error". This is my code: pastebin Code. I'm using 000webhost; I don't know if I did something wrong or if my web server doesn't support it.
Have you tried removing the TIMEOUT, or maybe extending it?
Try this maybe:
<html>
<div align="center">
<form action="compilador.php" method="POST">
<textarea id="source" name="source"></textarea>
<input type="submit" value="Enviar" />
<?php
if(isset($_POST['source']) && $_POST['source'] != "")
{
$ch = curl_init();
/**
* Set the URL of the page or file to download.
*/
curl_setopt($ch, CURLOPT_URL, "http://codepad.org");
/**
* Ask cURL to return the contents in a variable instead of simply echoing them to the browser.
*/
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$post_data=$_POST;
$post_data['lang'] = 'C';
$post_data['private'] = True;
$post_data['run'] = True;
// URL-encode keys and values so characters such as & or + in the source code survive the POST
$post_string = http_build_query($post_data);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
$result = curl_exec ($ch);
/**
* Close cURL session
*/
curl_close($ch);
echo "<br /><br />RESULT: {".$result."}";
}
?>
</form>
</div>
</html>
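One detail worth checking when posting source code: characters such as & and + have a special meaning in a form-encoded body, so the values need URL-encoding. The sketch below illustrates the difference with an invented C snippet; it uses parse_str() only to simulate how a receiving server would decode the body.

```php
<?php
// Invented sample input containing characters that are special in a POST body.
$post_data = ['source' => 'int main(){ return a&&b ? 1+1 : 0; }', 'lang' => 'C'];

// Manual concatenation leaves & and + unescaped, so the receiver splits the code apart.
$naive = 'source=' . $post_data['source'] . '&lang=' . $post_data['lang'];
parse_str($naive, $seenNaive);

// http_build_query() percent-encodes every value.
$encoded = http_build_query($post_data, '', '&');
parse_str($encoded, $seenEncoded);

var_dump($seenNaive['source'] === $post_data['source']);    // bool(false)
var_dump($seenEncoded['source'] === $post_data['source']);  // bool(true)
```

Passing an array directly to CURLOPT_POSTFIELDS also avoids the problem, but cURL then sends the body as multipart/form-data, which not every form handler accepts.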
Sort of a weird question.
From 4shared video site, I get the embed code like the following:
<embed src="http://www.4shared.com/embed/436595676/acfa8f75" width="420" height="320" allowfullscreen="true" allowscriptaccess="always"></embed>
Now, if I access the url in that embed src, the video is loaded up and the URL of the page is changed with information about the video.
I am wondering if there is any way for me to access that info using PHP? I tried file_get_contents but it gives me lots of weird characters.
So, can I use PHP to load the embed url and get the information present in the address bar?
Thanks for all your help! :)
Yes, for example with PHP's cURL library. It lets you read the redirect headers sent by the server, which contain the new/real URL of the video.
Here's some sample code:
<?php
// create a new cURL resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.4shared.com/embed/436595676/acfa8f75");
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_NOBODY, 1);
// we want to further handle the content, so return it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// execute the request; the response headers are returned as a string
$result = curl_exec($ch);
// did we get a good result?
if (!$result)
die ("error getting url");
// if we got a redirection http-code, split the content in
// lines and search for the Location-header.
$location = null;
if ((int)(curl_getinfo($ch, CURLINFO_HTTP_CODE)/100) == 3) {
$lines = explode("\n", $result);
foreach ($lines as $line) {
// header names are case-insensitive; the status line and blank lines contain no "Location:" prefix
if (stripos($line, 'Location:') === 0) {
$location = trim(substr($line, strlen('Location:')));
break;
}
}
}
if ($location == null)
die("no redirect found in header");
// close cURL resource, and free up system resources
curl_close($ch);
// your location is now in here.
var_dump($location);
?>
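The header search can be pulled into a small function so it can be tried without a network request. This is a sketch: the function name `findLocationHeader` and the sample response headers are invented for illustration.

```php
<?php
// Returns the Location header value from a raw header block, or null if absent.
function findLocationHeader(string $rawHeaders): ?string
{
    foreach (preg_split('/\r\n|\n/', $rawHeaders) as $line) {
        // Header names are case-insensitive; the status line and blank lines never match.
        if (stripos($line, 'Location:') === 0) {
            return trim(substr($line, strlen('Location:')));
        }
    }
    return null;
}

// Sample response headers, invented for illustration.
$raw = "HTTP/1.1 302 Found\r\nServer: example\r\nlocation: http://www.example.com/video/abc.html\r\n\r\n";
var_dump(findLocationHeader($raw));
var_dump(findLocationHeader("HTTP/1.1 200 OK\r\nServer: example\r\n\r\n"));
```

As far as I know, `curl_getinfo($ch, CURLINFO_REDIRECT_URL)` can return the same value without manual parsing on reasonably recent PHP versions, so it is worth trying that first.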
I need any link in an "a href=" tag to be fetched via cURL when it is clicked. I can't hard-code these links as they come from a dynamic site, so they could be anything. How would I achieve this?
Thanks
Edit: Let me explain more. I have an app on my PC that uses a web front end. It catalogs files and gives you options to rename, delete, etc. I want to add a public view; however, if I put it online as is, anyone can delete or rename files. If I cURL the pages I can remove the menu bars and editing options by using a different CSS. That part all works. The only part that isn't working is that when I click a link on the page, it takes me to the original link address, which defeats the point because the menu bars are back. I need the clicked links to be fetched via cURL as well. Hope that makes more sense.
Here is my code, which fetches the original page with cURL and changes the CSS reference to point to my own CSS. It points the JavaScript to the original server, as I don't need to change that. I now need the "a href" links on the page, when clicked, to be fetched by cURL instead of going to the original destination:
<?php
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, 'http://192.168.0.14:8081/home/');
curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
$curl_response = curl_exec($ch);
curl_close($ch);
//Change link url
$link = $curl_response;
$linkgo = '/sickbeard_public';
$linkfind = 'href="';
$linkreplace = 'href="' . $linkgo ;
$link = str_replace($linkfind, $linkreplace, $link);
//Change js url
$js = $link;
$jsgo = 'http://192.168.0.14:8081';
$jsfind = 'src="';
$jsreplace = 'src="' . $jsgo ;
$js = str_replace($jsfind, $jsreplace, $js);
//Fix on page link errors
$alink = $js;
$alinkgo = 'http://192.168.0.14:8081/';
$alinkfind = 'a href="/sickbeard_public/';
$alinkreplace = 'a href="' . $alinkgo ;
$alink = str_replace($alinkfind, $alinkreplace, $alink);
//Echo page back
echo $alink;
?>
You could grab all the URLs using a regular expression:
// insert general warning about how parsing HTML using regex is evil :-)
// preg_match_all() is needed here; preg_match() stops after the first link
preg_match_all('/href="([^"]+)"/', $html, $matches);
$urls = $matches[1];
// Now just loop through the array and fetch the URLs with cURL...
While I can't imagine why you would do that I think you should use ajax.
Attach an event on every a tag and send them to a script on your server where the magic of curl would happen.
Anyway you should explain why you need to fetch data with curl.
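A server-side variation on the same idea is to rewrite every href before echoing the page, so each link already points at your own script. The sketch below is one possible approach, not a complete proxy: the script name `cURL_wrapper.php`, the `target_url` parameter, and the sample link are assumptions, and the pattern also rewrites `<link href="...">` tags, which you may want to exclude.

```php
<?php
// Rewrites every href="..." so the link goes through a wrapper script.
// 'cURL_wrapper.php' and the 'target_url' parameter name are assumptions.
function rewriteLinks(string $html, string $wrapper = 'cURL_wrapper.php'): string
{
    return preg_replace_callback(
        '/href="([^"]+)"/i',
        function (array $m) use ($wrapper): string {
            // rawurlencode() keeps the original URL intact inside the query string.
            return 'href="' . $wrapper . '?target_url=' . rawurlencode($m[1]) . '"';
        },
        $html
    );
}

// Invented sample link for illustration.
$page = '<a href="/home/displayShow?show=42">Show</a>';
echo rewriteLinks($page), "\n";
// prints: <a href="cURL_wrapper.php?target_url=%2Fhome%2FdisplayShow%3Fshow%3D42">Show</a>
```

The wrapper script then reads `$_GET['target_url']`, fetches that address with cURL, and applies the same rewriting to the response before echoing it.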
As far as I understand your question, you need to get the contents of a URL via cURL, so here is a solution:
<a href="..." id="my_link">Click here to get via curl</a>
Then attach an event to the above <a> tag, e.g. in jQuery:
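$("#my_link").click(function(e){
e.preventDefault(); // stop the browser from following the original link
var target_url = $(this).attr("href");
// Send an ajax call to one of your own pages, e.g. cURL_wrapper.php, with target_url as a GET parameter
$.get("cURL_wrapper.php", { target_url: target_url }, function(html){ /* show the returned html on the page */ });
});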
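Then in cURL_wrapper.php do the following:
<?php
// Get the $target_url from $_GET; validate it against the hosts you allow before fetching
$target_url = $_GET['target_url'];
$ch = curl_init($target_url);
// return the response as a string instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
$response = curl_exec($ch);
curl_close($ch);
echo $response;
?>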