My function:
function raspislinks($url)
{
$chs = curl_init($url);
curl_setopt($chs, CURLOPT_URL, $url);
curl_setopt($chs, CURLOPT_COOKIEFILE, 'cookies.txt'); // set cookies, part one
curl_setopt($chs, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($chs, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36 OPR/29.0.1795.60");
curl_setopt($chs, CURLOPT_SSL_VERIFYPEER, 0); // do not verify the SSL certificate
curl_setopt($chs, CURLOPT_SSL_VERIFYHOST, 0); // do not verify the SSL certificate's host
curl_setopt($chs, CURLOPT_COOKIEJAR, 'cookies.txt'); // set cookies, part two
$htmll = curl_exec($chs);
$pos = strpos($htmll, '<strong><em><font color="green"> <h1>');
$htmll = substr($htmll, $pos);
$pos = strpos($htmll, '<!-- </main>-->');
$htmll = substr($htmll, 0, $pos);
$htmll = end(explode('<strong><em><font color="green"> <h1>', $htmll));
$htmll = str_replace('<a href ="', '<a href ="https://nfbgu.ru/timetable/fulltime/', $htmll);
$GLOBALS['urls'];
preg_match_all("/<[Aa][ \r\n\t]{1}[^>]*[Hh][Rr][Ee][Ff][^=]*=[ '\"\n\r\t]*([^ \"'>\r\n\t#]+)[^>]*>/", $htmll, $urls);
curl_close($chs);
}
How can I use the variable $urls outside the function? It is an array.
"return $urls" is not working, or am I doing something wrong? Please help.
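If you load a value into $GLOBALS['urls'] inside the function, you can then use $urls in code outside the function.
The $GLOBALS array holds one entry for each variable available in the global scope, so once $GLOBALS['urls'] is assigned a value, that value can also be referenced as $urls. Note that the bare statement $GLOBALS['urls']; in your code assigns nothing; pass $GLOBALS['urls'] to preg_match_all() as the output argument instead.
For example: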
function raspislinks($url) {
...
//$GLOBALS['urls'];
preg_match_all("/<[Aa][ \r\n\t]{1}[^>]*[Hh][Rr][Ee][Ff][^=]*=[ '\"\n\r\t]*([^ \"'>\r\n\t#]+)[^>]*>/",
$htmll,
$GLOBALS['urls']
);
}
raspislinks('google.com');
foreach ( $urls as $url) {
}
A simpler way would be to put the data in a local variable and return it from the function:
function raspislinks($url) {
...
//$GLOBALS['urls'];
preg_match_all("/<[Aa][ \r\n\t]{1}[^>]*[Hh][Rr][Ee][Ff][^=]*=[ '\"\n\r\t]*([^ \"'>\r\n\t#]+)[^>]*>/",
$htmll,
$t
);
return $t;
}
$urls = raspislinks('google.com');
foreach ( $urls as $url) {
}
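One detail worth keeping in mind with the return approach: preg_match_all() fills a nested array, where index 0 holds the full matches and index 1 holds the first capture group. The sketch below returns only the captured URLs. The helper name extractLinks and the simplified pattern are assumptions made for illustration, not the regex from the question.

```php
<?php
// Hypothetical helper: returns every href value found in an HTML string.
function extractLinks(string $html): array
{
    // Simplified pattern (an assumption, not the original regex):
    // capture the quoted value of each href attribute inside an <a> tag.
    preg_match_all('/<a\s[^>]*href\s*=\s*["\']([^"\']+)["\']/i', $html, $matches);

    // $matches[0] holds the full matches, $matches[1] the captured URLs.
    return $matches[1];
}

$urls = extractLinks('<a href="/one">1</a> <a class="x" href=\'/two\'>2</a>');
foreach ($urls as $url) {
    echo $url, PHP_EOL;
}
```

Returning the list keeps the function free of side effects, so the caller decides which variable name receives the result.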
Related
I am using two functions to get the video URL for playback.
1. To extract the TikTok video with the watermark:
public function getDetails()
{
$url = $this->url;
$resp = $this->getContent($url);
$check = explode("\"contentUrl\":\"", $resp);
if (count($check) > 1) {
$video = explode("\"", $check[1])[0];
$videoWithoutWaterMark = $this->WithoutWatermark($url);
$thumb = explode("\"", explode("\"thumbnailUrl\":[\"", $resp)[1])[0];
$username = explode("/", explode("@", explode("\"", explode("\"url\":\"", $resp)[1])[0])[1])[0];
$result = [
'video'=>$video,
'withoutWaterMark'=>$videoWithoutWaterMark,
'user'=>$username,
'thumb'=>$thumb,
'error'=>false,
'message'=>false
];
}
else
{
$result = [
'video'=>false,
'withoutWaterMark'=>false,
'user'=>false,
'thumb'=>false,
'error'=>true,
'message'=>"Please double check your url and try again."
];
}
return $result;
}
private function cUrl($url)
{
$user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$result = curl_exec($ch);
curl_close($ch);
return $result;
}
And the other function, which gets the video URL without the watermark, is:
private function WithoutWatermark($url)
{
// video id, for example 6795008547961752326
$dd = explode("video/",$url);
$url = "https://api2.musical.ly/aweme/v1/playwm/?video_id=".$dd[1];
return $url;
}
Please help me find the TikTok video id, or any other way to create a download link for the video without the watermark. How can I find the video id so that I can use it to build a download link such as "https://api2.musical.ly/aweme/v1/playwm/?video_id=v09044b90000bpfdj5q91d8vtcnie6o0"?
Your function WithoutWatermark doesn't work.
If you have a URL like tiktok.com/@user/video/123456,
then you can fetch it with cURL:
$data = cUrl($url);
You'll get a page from TikTok, from which you can extract the video URL with a regex:
https://v16.muscdn.com/123etc
Then make another cURL request to that URL. The response is binary, and inside it a regex can find something like vid:yourvideoid
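As a small illustration of the regex step, here is a sketch that pulls the numeric id following "/video/" out of a page URL. The helper name extractVideoId is hypothetical. Note that this numeric id is the one visible in the URL; judging by the example link in the question, the video_id parameter uses a different format, so this covers only the URL part.

```php
<?php
// Hypothetical helper: pull the numeric id that follows "/video/" in a URL.
function extractVideoId(string $url): ?string
{
    // \d+ stops at the first non-digit, so a trailing query string is ignored.
    if (preg_match('~/video/(\d+)~', $url, $m)) {
        return $m[1];
    }
    return null; // no id found in this URL
}

echo extractVideoId('https://www.tiktok.com/@user/video/6795008547961752326?lang=en'), PHP_EOL;
```

Compared with explode("video/", $url), the pattern does not carry query parameters along with the id.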
I'm trying to get reviews from Google Business. The goal is to fetch the page via cURL and then read the value of the element whose jsaction attribute contains pane.rating.moreReviews.
How can I fix the code below so that the cURL request works?
function curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36');
$html = curl_exec($ch);
curl_close($ch);
return $html;
}
$html = curl("https://www.google.com/maps?cid=12909283986953620003");
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$finder = new DomXPath($DOM);
$classname = 'pane.rating.moreReviews';
$nodes = $finder->query("//*[contains(@jsaction, '$classname')]");
foreach ($nodes as $node) {
$check_reviews = $node->nodeValue;
$ses_key = preg_replace('/[^0-9]+/', '', $check_reviews);
}
// result should be: 166
echo $ses_key;
If I run var_dump($html);, I get:
string(348437) " "
And this number is changing on each page refresh.
Get Google-Reviews with PHP cURL & without API Key
How to find the CID - If you have the business open in Google Maps:
Do a search in Google Maps for the business name
Make sure it’s the only result that shows up.
Replace http:// with view-source: in the URL
Click CTRL+F and search the source code for “ludocid”
CID will be the numbers after “ludocid\u003d” and till the last number
or use this tool: https://ryanbradley.com/tools/google-cid-finder/
Example
ludocid\\u003d16726544242868601925\
HINT: Use the class ".quote" in your CSS to style the output
The PHP
<?php
/*
💬 Get Google-Reviews with PHP cURL & without API Key
=====================================================
How to find the CID - If you have the business open in Google Maps:
- Do a search in Google Maps for the business name
- Make sure it’s the only result that shows up.
- Replace http:// with view-source: in the URL
- Click CTRL+F and search the source code for “ludocid”
- CID will be the numbers after “ludocid\\u003d” and till the last number
or use this tool: https://pleper.com/index.php?do=tools&sdo=cid_converter
Example
-------
```TXT
ludocid\\u003d16726544242868601925\
```
> HINT: Use the class ".quote" in your CSS to style the output
###### Copyright 2019 Igor Gaffling
*/
$cid = '16726544242868601925'; // The CID you want to see the reviews for
$show_only_if_with_text = false; // true OR false
$show_only_if_greater_x = 0; // 0-4
$show_rule_after_review = false; // true OR false
$show_blank_star_till_5 = false; // true OR false (used in the loop below; was never defined)
/* ------------------------------------------------------------------------- */
$ch = curl_init('https://www.google.com/maps?cid='.$cid);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla / 5.0 (Windows; U; Windows NT 5.1; en - US; rv:1.8.1.6) Gecko / 20070725 Firefox / 2.0.0.6");
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
$result = curl_exec($ch);
curl_close($ch);
$pattern = '/window\.APP_INITIALIZATION_STATE(.*);window\.APP_FLAGS=/ms';
if ( preg_match($pattern, $result, $match) ) {
$match[1] = trim($match[1], ' =;'); // fix json
$reviews = json_decode($match[1]);
$reviews = ltrim($reviews[3][6], ")]}'"); // fix json
$reviews = json_decode($reviews);
//$customer = $reviews[0][1][0][14][18];
//$reviews = $reviews[0][1][0][14][52][0];
$customer = $reviews[6][18]; // NEW IN 2020
$reviews = $reviews[6][52][0]; // NEW IN 2020
}
if (isset($reviews)) {
echo '<div class="quote"><strong>'.$customer.'</strong><br>';
foreach ($reviews as $review) {
if ($show_only_if_with_text == true and empty($review[3])) continue;
if ($review[4] <= $show_only_if_greater_x) continue;
for ($i=1; $i <= $review[4]; ++$i) echo '⭐'; // RATING
if ($show_blank_star_till_5 == true)
for ($i=1; $i <= 5-$review[4]; ++$i) echo '☆'; // RATING
echo '<p>'.$review[3].'<br>'; // TEXT
echo '<small>'.$review[0][1].'</small></p>'; // AUTHOR
if ($show_rule_after_review == true) echo '<hr size="1">';
}
echo '</div>';
}
Source: https://github.com/gaffling/PHP-Grab-Google-Reviews
Please try the code below:
$html = curl("https://maps.googleapis.com/maps/api/place/details/json?cid=12909283986953620003&key=<google_apis_key>", "Mozilla 5.0");
$datareview = json_decode($html, true); // decode the whole response into an associative array
Ex. : http://meetingwords.com/QiIN1vaIuY
It should work for you.
First create an API key in the Google developer console:
https://developers.google.com/maps/documentation/embed/get-api-key
I am receiving an invalid cookie string when trying to capture the cookie using file_get_contents and cURL. The cookie received when browsing directly in the browser is valid/active, but the cookie captured via file_get_contents or cURL seems to be invalid.
I am trying to capture it with file_get_contents like this:
$context = array(
'http' => array(
'method' => 'GET',
'header' => array('Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/63.0.3239.84 Chrome/63.0.3239.84 Safari/537.36'),
)
);
$cxContext = stream_context_create($context);
file_get_contents($url, false, $cxContext);
$cookies = array();
foreach ($http_response_header as $hdr) {
if (preg_match('/^Set-Cookie:\s*([^;]+)/', $hdr, $matches)) {
$cookies = $matches[1];
}
}
return $cookies;
I tried playing around with this by setting different headers, but the cookie returned is always either expired or simply invalid.
Through a browser, however, the cookie I get is always valid.
Has anyone faced a similar problem? I don't know how to tackle this issue.
There are several unanswered questions from my above comment, but I'll share this bit of code for example purposes. It's what I've used in the past as a base class for browser emulation using cURL:
<?php
if(!function_exists("curl_init")) { throw new Exception("CurlBrowser requires the cURL extension, which is not enabled!"); }
class CurlBrowser
{
public $userAgent = "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)";
/*
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:9.0.1) Gecko/20100101 Firefox/9.0.1
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:25.0) Gecko/20100101 Firefox/25.0
*/
public $cookiesFile = null;
public $proxyURL = null;
public $saveLastOutput = "";
public $caBundle = "cacert.pem";
public $httpHeaders = array();
public function __construct($UseCookies = true)
{
if(is_bool($UseCookies) && $UseCookies)
{
$this->cookiesFile = dirname(__FILE__)."/cookies.txt";
}
elseif(is_string($UseCookies) && ($UseCookies != ""))
{
$this->cookiesFile = $UseCookies;
}
}
public function SetCustomHTTPHeaders($arrHeaders)
{
$this->httpHeaders = $arrHeaders;
}
public function SetProxy($proxy)
{
$this->proxyURL = $proxy;
}
public function Get($url)
{
return $this->_request($url);
}
public function Post($url,$data = array())
{
return $this->_request($url,$data);
}
private function _request($form_url,$data = null)
{
$ch = curl_init($form_url);
// CA bundle
$caBundle = $this->caBundle;
if(file_exists($caBundle))
{
// Detect and convert relative path to absolute path
if(basename($caBundle) == $caBundle)
{
$caBundle = getcwd() . DIRECTORY_SEPARATOR . $caBundle;
}
// Set CA bundle
curl_setopt($ch, CURLOPT_CAINFO, $caBundle);
}
// Cookies
if($this->cookiesFile !== null)
{
curl_setopt($ch, CURLOPT_COOKIEFILE, $this->cookiesFile);
curl_setopt($ch, CURLOPT_COOKIEJAR, $this->cookiesFile);
}
// User Agent
curl_setopt($ch, CURLOPT_USERAGENT, $this->userAgent);
// Misc
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_ENCODING, "gzip, deflate");
// Optional proxy
if($this->proxyURL !== null)
{
curl_setopt($ch, CURLOPT_PROXY, $this->proxyURL);
}
// Custom HTTP headers
if(count($this->httpHeaders))
{
curl_setopt($ch, CURLOPT_HTTPHEADER, $this->httpHeaders);
}
// POST data
if($data !== null)
{
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $data);
}
// Run operation
$result = curl_exec($ch);
if($result === false)
{
throw new Exception(curl_error($ch));
}
else
{
if(!empty($this->saveLastOutput))
{
file_put_contents($this->saveLastOutput,$result);
}
return $result;
}
}
}
?>
You'd use it like so:
<?php
$browser = new CurlBrowser();
$html = $browser->Get("https://....");
...etc...
My guess is that you're simply missing a cookie jar in your original code, but that's mostly a gut feeling, since we don't have all of your problem code at this time.
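Separately from the cookie jar, the loop in the question assigns $cookies = $matches[1] on every match, so only the last Set-Cookie header survives. A minimal sketch of collecting all of them follows. The helper name parseSetCookies and the sample header lines are made up for illustration; the function expects an array of raw header lines such as $http_response_header.

```php
<?php
// Hypothetical helper: collect name => value pairs from raw response header lines.
function parseSetCookies(array $headerLines): array
{
    $cookies = [];
    foreach ($headerLines as $line) {
        // Header names are case-insensitive; keep only "name=value" before the first ";".
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=([^;]*)/i', $line, $m)) {
            $cookies[$m[1]] = $m[2];
        }
    }
    return $cookies;
}

// Sample header lines, invented for this example.
$cookies = parseSetCookies([
    'HTTP/1.1 200 OK',
    'Set-Cookie: session=abc123; Path=/; HttpOnly',
    'set-cookie: lang=en; Max-Age=3600',
]);
print_r($cookies);
```

Attributes such as Path or Max-Age are dropped here on purpose; a cookie jar handled by cURL keeps track of them for you.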
I finally got my script to work, but the search (done via AJAX) takes a long time. Basically, when a keyword is entered, the script searches the page and captures the titles, URLs, and thumbnails of all the videos. The problem came when I also wanted the tags, which are only shown inside each video page, so I have to visit every video to capture them. The only way I could think of was to add a second loop inside the loop that captures the found videos, that is:
For each video found -> capture title, thumbnail, URL -> with the captured URL -> go to that URL and capture its tags.
The code I use is basically the following. I need to know whether there is another method to speed up the searches, either by optimizing the code or by taking a different approach:
My parse function:
<?php
function dlPage($href) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HTTPHEADER, array("Accept-Language: en-US")); // CURLOPT_HEADER is a boolean flag; request headers go in CURLOPT_HTTPHEADER
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $href);
curl_setopt($curl, CURLOPT_REFERER, $href);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$dom = new simple_html_dom();
// Load HTML from a string
$dom->load($str);
return $dom;
}
?>
My script:
$buscartag = str_replace(' ', '+', $_POST['buscartag']);
$urlparse = "https://example.com/?k=".$buscartag;
$paginas = rand(0, 50);
$html = dlPage($urlparse."&p=".$paginas);
$counter = 0;
foreach($html->find('div.video-box') as $videos) {
if ($videos) {
$titulo = $videos->find('div.video-box>p[!class]>a[!class]',0)->attr['title'];
$pathvideo = str_replace('_', '', $videos->attr['id']);
$link = "https://www.example.com/".$pathvideo."/";
$thumb = $videos->find('div.thumb',0)->innertext;
// HERE IS MY SECOND LOOP, FOR THE TAGS: one extra request per video
$gettags2 = array();
$html_tags = file_get_html($link);
foreach ($html_tags->find('a.nu') as $gettags){
$gettags2[] = $gettags->innertext;
}
// $idvideo and $urlimagen were never defined; $pathvideo and $thumb are assumed to be the intended values
if (!empty($titulo) && !empty($link) && !empty($pathvideo) && !empty($thumb)){
$counter++;
//here will echo all variables
}
}
}
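One possible way to cut the waiting time, sketched under the assumption that the per-video requests are the bottleneck, is to collect all the video links first and download them concurrently with PHP's curl_multi_* functions instead of one by one. The helper name fetchAll is hypothetical, and the timeout value is arbitrary.

```php
<?php
// Hypothetical helper: download several URLs concurrently, returning url => body.
function fetchAll(array $urls): array
{
    $multi = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 30); // arbitrary limit for this sketch
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            curl_multi_select($multi, 1.0); // wait for activity instead of busy-looping
        }
    } while ($running > 0 && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $url => $ch) {
        $bodies[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);
    return $bodies;
}
```

With this shape, the outer loop would only gather each $link, a single call like $pages = fetchAll($links) would fetch them together, and each body could then be parsed for its a.nu tags. Whether this helps in practice depends on how the target site handles many simultaneous requests.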
I'm using php and cURL to scrape a web page, but this web page is poorly designed (as in no classes or ids on tags), so I need to search for specific text, then go to the tag holding it (ie <p>) then move to the next child (or next <p>) and get the text.
There are various things I need to get from the page, some also being the text within an <a onclick="get this stuff here">. So basically I feel that I need to use cURL to scrape the source code to a php variable, then I can use php to kind of parse through and find the stuff I need.
Does this sound like the best method to do this? Does anyone have any pointers or can demonstrate how I can put source code from cURL into a variable?
Thanks!
EDIT (Working/Current Code) -----------
<?php
class Scrape
{
public $cookies = 'cookies.txt';
private $user = null;
private $pass = null;
/*Data generated from cURL*/
public $content = null;
public $response = null;
/* Links */
private $url = array(
'login' => 'https://website.com/login.jsp',
'submit' => 'https://website.com/LoginServlet',
'page1' => 'https://website.com/page1',
'page2' => 'https://website.com/page2',
'page3' => 'https://website.com/page3'
);
/* Fields */
public $data = array();
public function __construct ($user, $pass)
{
$this->user = $user;
$this->pass = $pass;
}
public function login()
{
$this->cURL($this->url['login']);
if($form = $this->getFormFields($this->content, 'login'))
{
$form['login'] = $this->user;
$form['password'] =$this->pass;
// echo "<pre>".print_r($form,true);exit;
$this->cURL($this->url['submit'], $form);
//echo $this->content;//exit;
}
//echo $this->content;//exit;
}
// NEW TESTING
public function loadPage($page)
{
$this->cURL($this->url[$page]);
echo $this->content;//exit;
}
/* Scan for form */
private function getFormFields($data, $id)
{
if (preg_match('/(<form.*?name=.?'.$id.'.*?<\/form>)/is', $data, $matches)) {
$inputs = $this->getInputs($matches[1]);
return $inputs;
} else {
return false;
}
}
/* Get Inputs in form */
private function getInputs($form)
{
$inputs = array();
$elements = preg_match_all('/(<input[^>]+>)/is', $form, $matches);
if ($elements > 0) {
for($i = 0; $i < $elements; $i++) {
$el = preg_replace('/\s{2,}/', ' ', $matches[1][$i]);
if (preg_match('/name=(?:["\'])?([^"\'\s]*)/i', $el, $name)) {
$name = $name[1];
$value = '';
if (preg_match('/value=(?:["\'])?([^"\']*)/i', $el, $value)) {
$value = $value[1];
}
$inputs[$name] = $value;
}
}
}
return $inputs;
}
/* Perform curl function to specific URL provided */
public function cURL($url, $post = false)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 30);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13");
// "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_VERBOSE, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $this->cookies);
curl_setopt($ch, CURLOPT_COOKIEFILE, $this->cookies);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 120);
curl_setopt($ch, CURLOPT_TIMEOUT, 120);
if($post) //if post is needed
{
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
}
curl_setopt($ch, CURLOPT_URL, $url);
$this->content = curl_exec($ch);
$this->response = curl_getinfo( $ch );
$this->url['last_url'] = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
curl_close($ch);
}
}
$sc = new Scrape('user','pass');
$sc->login();
$sc->loadPage('page1');
echo "<h1>TESTTESTEST</h1>";
$sc->loadPage('page2');
echo "<h1>TESTTESTEST</h1>";
$sc->loadPage('page3');
echo "<h1>TESTTESTEST</h1>";
(Note: credit to @Ramz, "scrape a website with secured login".)
You can divide your problem into several parts:
1. Retrieving the data from the data source.
For that, you can use cURL or file_get_contents(), depending on your requirements. Code examples are everywhere, e.g. http://php.net/manual/en/function.file-get-contents.php and http://php.net/manual/en/curl.examples-basic.php
2. Parsing the retrieved data.
For that, I would start by looking into "PHP Simple HTML DOM Parser". You can use it to extract data from an HTML string. http://simplehtmldom.sourceforge.net/
3. Building and generating the output.
This is simply a question of what you want to do with the data that you have extracted. For example, you can print it, reformat it, or store it in a database or file.
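For the parsing step on a page without classes or ids, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath rather than the Simple HTML DOM library named above (a swapped-in technique, chosen because it needs no extra download). The sample markup and the label text "Price:" are invented for illustration.

```php
<?php
// Sample markup (made up for illustration): no classes or ids to hook into.
$html = '<html><body>'
      . '<p>Price:</p><p>42 USD</p>'
      . '<a onclick="openItem(7)">details</a>'
      . '</body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // real pages are rarely valid; silence parser warnings
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// The first <p> that follows the <p> containing the label text.
$valueNode = $xpath->query("//p[contains(., 'Price:')]/following-sibling::p[1]")->item(0);
$value = $valueNode !== null ? trim($valueNode->textContent) : null;

// The onclick attribute of every <a> that has one.
$onclicks = [];
foreach ($xpath->query('//a[@onclick]') as $a) {
    $onclicks[] = $a->getAttribute('onclick');
}

echo $value, PHP_EOL;
print_r($onclicks);
```

The same two queries work on a string returned by curl_exec() when CURLOPT_RETURNTRANSFER is set, which is how the page source ends up in a PHP variable.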
I suggest you use a ready-made scraper. I use Goutte (https://github.com/FriendsOfPHP/Goutte), which lets me load website content and traverse it in the same way you do with jQuery. For example, if I want the content of the <div id="content">, I call filter('#content')->text() on the crawler returned by $client->request().
It even allows me to find and 'click' on links and submit forms to retrieve and process the content.
It makes life so much easier than using cURL or file_get_contents() and working your way through the HTML manually.