PHP - Simple Html Dom load multiple pages speed - php

I finally got my script working, but the search (triggered via AJAX) takes a long time. Given a keyword, it loads the results page and captures the title, URL, and thumbnail of each video. The tags, however, appear only on each video's own page, so I have to request every video page to capture them. The only approach I could think of was a second loop nested inside the loop over the found videos:
For each video found -> capture title, thumbnail, URL -> request that URL -> capture its tags.
Below is essentially the code I use. Is there any way to speed up the search, either by optimizing this code or by taking a different approach?
My parse function:
<?php
function dlPage($href) {
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HTTPHEADER, array("Accept-Language: en-US"));
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $href);
curl_setopt($curl, CURLOPT_REFERER, $href);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.4 (KHTML, like Gecko) Chrome/5.0.375.125 Safari/533.4");
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$dom = new simple_html_dom();
// Load HTML from a string
$dom->load($str);
return $dom;
}
?>
My script:
$buscartag = str_replace(' ', '+', $_POST['buscartag']);
$urlparse = "https://example.com/?k=".$buscartag;
$paginas = rand(0, 50);
$html = dlPage($urlparse."&p=".$paginas);
$counter = 0;
foreach ($html->find('div.video-box') as $videos) {
if ($videos) {
$titulo = $videos->find('div.video-box>p[!class]>a[!class]', 0)->attr['title'];
$pathvideo = str_replace('_', '', $videos->attr['id']);
$link = "https://www.example.com/".$pathvideo."/";
$thumb = $videos->find('div.thumb', 0)->innertext;
// Second loop: request the video page and collect its tags
$gettags2 = array();
$html_tags = file_get_html($link);
foreach ($html_tags->find('a.nu') as $gettags) {
$gettags2[] = $gettags->innertext;
}
// $idvideo and $urlimagen were never assigned; use the values captured above
if (!empty($titulo) && !empty($link) && !empty($pathvideo) && !empty($thumb)) {
$counter++;
// here will echo all variables
}
}
}
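One common way to cut the wait is to request the video pages concurrently rather than one at a time. The following is a minimal sketch using PHP's curl_multi functions; the helper name fetchAll and the timeout value are assumptions for illustration, not part of the original script. It returns the raw HTML keyed like the input array, and each body could then be passed to str_get_html() from simple_html_dom to pull out the a.nu tags.

```php
<?php
// Hypothetical helper: download several URLs in parallel with curl_multi.
// Returns an array of response bodies using the same keys as $urls.
function fetchAll(array $urls, int $timeout = 15): array
{
    $multi = curl_multi_init();
    $handles = [];

    foreach ($urls as $key => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
        curl_multi_add_handle($multi, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running > 0) {
            // Wait for activity instead of spinning in a busy loop.
            curl_multi_select($multi, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    $bodies = [];
    foreach ($handles as $key => $ch) {
        // A failed transfer yields an empty or null body.
        $bodies[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
    }

    // The handles are released when they go out of scope.
    return $bodies;
}
```

With this approach the first loop would only collect titles, thumbnails, and links; a single fetchAll($links) call would then download every video page before the tags are parsed. Whether the target site tolerates many simultaneous requests is something to check separately.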

Related

How to select specific text from a string generated by a PHP script?

I've been trying to scrape an HLS stream from Twitch using two PHP scripts. The first one uses cURL to call a Python script that returns the HLS URL and outputs the resulting string as plain text. The second (the one that isn't working) is supposed to extract the M3U8 URL and make it playable.
First script (extract.php)
<?php
header('Content-Type: text/plain; charset=utf-8');
$url = "https://pwn.sh/tools/streamapi.py?url=twitch.tv/cgtn_live_russian&quality=1080p60";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
//for debug only!
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$resp = curl_exec($curl);
curl_close($curl);
var_dump($resp);
$undesirable = array("}");
$cleanurl = str_replace($undesirable, "", $resp);
echo substr($cleanurl, 39, 898);
?>
This script (let's call it extract.php) works, and it returns (in plain text) the same information the Python script would return, which is this:
string(904) "{"success": true, "urls": {"1080p60": "https://video-weaver.fra05.hls.ttvnw.net/v1/playlist/[token].m3u8"}}"
Second script (play.php)
<?php
$opts = array(
'http'=>array(
'method'=>"GET",
'header'=>"Referer: https://myserver.com/\r\n" .
"User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:51.0) Gecko/20100101 Firefox/51.0"
));
$html = file_get_contents("extract.php");
preg_match_all(
'/(http.*?\.m3u8[^&">]+)/',
$html,
$posts, // will contain the article data
PREG_SET_ORDER // formats data into an array of posts
);
foreach ($posts as $post) {
$link = $post[0];
header("Location: $link");
}
?>
This second script (let's call it play.php) should theoretically return the M3U8 file (without string(904) "{"success": true, "urls": {"1080p60":) and make it able to be played in a media player, such as VLC, but it doesn't return anything.
Can someone tell me what's wrong? Did I make a syntax or regex error when making these PHP files or is the second file not working because of the other elements of the string?
Thanks in advance.
I think you can rely on the regex to get the URL out instead of trying to clean the string manually. The other way would be to use json_decode().
Note that file_get_contents("extract.php") reads the file's PHP source from disk rather than running it, so the regex never sees the cURL response. The idea is to define a variable in extract.php, in this case $resp; echoing the value, as you do now, does not make it available to the parent script.
You can then reference that variable in play.php once extract.php has been included.
<?php
//extract.php
$resp = '';
$url = "https://pwn.sh/tools/streamapi.py?url=twitch.tv/cgtn_live_russian&quality=1080p60";
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
//for debug only!
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$resp = curl_exec($curl);
curl_close($curl);
//play.php
include('./extract.php');
//$resp is set in extract.php
preg_match_all(
'/(http.*?\.m3u8)/',
$resp,
$posts, // will contain the article data
PREG_SET_ORDER // formats data into an array of posts
);
foreach ($posts as $post) {
$link = $post[0];
}
header("Location: $link");
die();
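The json_decode() route mentioned above could look roughly like this. It is a sketch that assumes the response keeps the shape shown in the question ("success" plus a "urls" map keyed by quality); the function name extractPlaylistUrl is made up for the example.

```php
<?php
// Hypothetical helper: pull the playlist URL for one quality out of the
// JSON returned by the stream API. Returns null if it cannot be found.
function extractPlaylistUrl(string $json, string $quality): ?string
{
    $data = json_decode($json, true);

    // json_decode() returns null on malformed input.
    if (!is_array($data) || empty($data['success'])) {
        return null;
    }

    return $data['urls'][$quality] ?? null;
}
```

In play.php this would replace the preg_match_all() call: pass $resp and '1080p60', and only send the Location header when the result is not null.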

simple_html_dom: 403 Access denied

I implemented this function in order to parse HTML pages using two different "methods".
As you can see both are using the very handy class called simple_html_dom.
The difference is that the first method also uses cURL to load the HTML, while the second does not.
Both methods work fine on many pages, but I'm struggling with this specific call:
searchThroughDOM('https://fr.shopping.rakuten.com/offer/buy/3458931181/new-york-1997-4k-ultra-hd-blu-ray-blu-ray-bonus-edition-boitier-steelbook.html', 'simple_html_dom');
In both cases, I end up with a 403 access denied response.
Did I do something wrong?
Or is there another method in order to avoid this type of denial?
function searchThroughDOM ($url, $method)
{
echo '$url = '.$url.'<br>'.'$method = '.$method.'<br><br>';
$time_start = microtime(true);
switch ($method) {
case 'curl':
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_REFERER, $url);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36');
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($str);
break;
case 'simple_html_dom':
$html = new simple_html_dom();
$html->load_file($url);
break;
}
$collection = $html->find('h1');
foreach($collection as $x => $x_value) {
echo 'x = '.$x.' => value = '.$x_value.'<br>';
}
$html->save('result.htm');
$html->clear();
$time_end = microtime(true);
echo 'Elapsed Time (DOM) = '.($time_end - $time_start).'<br><br>';
}
From my point of view, there is nothing wrong with simple_html_dom itself.
You can remove the simple_html_dom part of the code and test with cURL alone, which I assume is the source of the problem.
There are many reasons why cURL might fail on a given page.
First, I can see you set
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
You should also try setting CURLOPT_SSL_VERIFYHOST to false.
Second, check your cURL version to see whether it is too old.
Third, if neither of the above works, try enabling cookies; with cookies disabled, the website may detect that the request comes from a machine rather than a real person.
Lastly, if all of the above fail, try another library or even file_get_contents.
cURL is not your only option, although of course it is the most powerful one.
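The cookie suggestion above could be tried along these lines. This is only a sketch: the function names, the cookie file path, and the header values are assumptions, and nothing here guarantees that a site which answers 403 will accept the request. The second function also returns the HTTP status so the 403 can be detected before any parsing.

```php
<?php
// Hypothetical helper: cURL options that keep cookies between requests
// and send a few browser-like headers.
function browserLikeOptions(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_ENCODING       => '',          // accept any encoding cURL supports
        CURLOPT_COOKIEJAR      => $cookieFile, // write received cookies here
        CURLOPT_COOKIEFILE     => $cookieFile, // send stored cookies back
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html,application/xhtml+xml',
            'Accept-Language: fr-FR,fr;q=0.9,en;q=0.8',
        ],
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
    ];
}

// Hypothetical helper: perform the request and report status and body.
function fetchWithStatus(string $url, string $cookieFile): array
{
    $ch = curl_init();
    curl_setopt_array($ch, browserLikeOptions($url, $cookieFile));
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return ['status' => $status, 'body' => $body === false ? '' : $body];
}
```

If the status is still 403 with cookies and headers in place, the site is probably blocking automated clients by other means, and the body should not be handed to simple_html_dom.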

How to get reviews from Google Business using CURL PHP

I'm trying to get reviews from Google Business. The goal is to fetch the page via cURL and then read the value from the element whose jsaction attribute contains pane.rating.moreReviews.
How can I fix the code below so the cURL request works?
function curl($url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.94 Safari/537.36');
$html = curl_exec($ch);
curl_close($ch);
return $html;
}
$html = curl("https://www.google.com/maps?cid=12909283986953620003");
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$finder = new DomXPath($DOM);
$classname = 'pane.rating.moreReviews';
$nodes = $finder->query("//*[contains(@jsaction, '$classname')]");
foreach ($nodes as $node) {
$check_reviews = $node->nodeValue;
$ses_key = preg_replace('/[^0-9]+/', '', $check_reviews);
}
// result should be: 166
echo $ses_key;
If I run var_dump($html);, I get:
string(348437) " "
And this number is changing on each page refresh.
Get Google-Reviews with PHP cURL & without API Key
How to find the CID - If you have the business open in Google Maps:
Do a search in Google Maps for the business name
Make sure it’s the only result that shows up.
Replace http:// with view-source: in the URL
Click CTRL+F and search the source code for “ludocid”
CID will be the numbers after “ludocid\u003d” and till the last number
or use this tool: https://ryanbradley.com/tools/google-cid-finder/
Example
ludocid\\u003d16726544242868601925\
HINT: Use the class ".quote" in your CSS to style the output
The PHP
<?php
/*
💬 Get Google-Reviews with PHP cURL & without API Key
=====================================================
How to find the CID - If you have the business open in Google Maps:
- Do a search in Google Maps for the business name
- Make sure it’s the only result that shows up.
- Replace http:// with view-source: in the URL
- Click CTRL+F and search the source code for “ludocid”
- CID will be the numbers after “ludocid\\u003d” and till the last number
or use this tool: https://pleper.com/index.php?do=tools&sdo=cid_converter
Example
-------
```TXT
ludocid\\u003d16726544242868601925\
```
> HINT: Use the class ".quote" in you CSS to style the output
###### Copyright 2019 Igor Gaffling
*/
$cid = '16726544242868601925'; // The CID you want to see the reviews for
$show_only_if_with_text = false; // true OR false
$show_only_if_greater_x = 0; // 0-4
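$show_rule_after_review = false; // true OR false
$show_blank_star_till_5 = false; // true OR false (used below but not defined in the snippet as posted)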
/* ------------------------------------------------------------------------- */
$ch = curl_init('https://www.google.com/maps?cid='.$cid);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla / 5.0 (Windows; U; Windows NT 5.1; en - US; rv:1.8.1.6) Gecko / 20070725 Firefox / 2.0.0.6");
curl_setopt($ch, CURLOPT_TIMEOUT, 60);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookies.txt');
$result = curl_exec($ch);
curl_close($ch);
$pattern = '/window\.APP_INITIALIZATION_STATE(.*);window\.APP_FLAGS=/ms';
if ( preg_match($pattern, $result, $match) ) {
$match[1] = trim($match[1], ' =;'); // fix json
$reviews = json_decode($match[1]);
$reviews = ltrim($reviews[3][6], ")]}'"); // fix json
$reviews = json_decode($reviews);
//$customer = $reviews[0][1][0][14][18];
//$reviews = $reviews[0][1][0][14][52][0];
$customer = $reviews[6][18]; // NEW IN 2020
$reviews = $reviews[6][52][0]; // NEW IN 2020
}
if (isset($reviews)) {
echo '<div class="quote"><strong>'.$customer.'</strong><br>';
foreach ($reviews as $review) {
if ($show_only_if_with_text == true and empty($review[3])) continue;
if ($review[4] <= $show_only_if_greater_x) continue;
for ($i=1; $i <= $review[4]; ++$i) echo '⭐'; // RATING
if ($show_blank_star_till_5 == true)
for ($i=1; $i <= 5-$review[4]; ++$i) echo '☆'; // RATING
echo '<p>'.$review[3].'<br>'; // TEXT
echo '<small>'.$review[0][1].'</small></p>'; // AUTHOR
if ($show_rule_after_review == true) echo '<hr size="1">';
}
echo '</div>';
}
Source: https://github.com/gaffling/PHP-Grab-Google-Reviews
Please try the code below:
$html = curl("https://maps.googleapis.com/maps/api/place/details/json?cid=12909283986953620003&key=<google_apis_key>", "Mozilla 5.0");
$datareview = json_decode($html);// get all data in array
Ex. : http://meetingwords.com/QiIN1vaIuY
It should work for you.
Create a Google API key from the Google developer console:
https://developers.google.com/maps/documentation/embed/get-api-key
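Once the JSON has been decoded, the rating and reviews can be read from it. The sketch below assumes the response follows the Place Details layout (a "status" field and a "result" object with "rating", "user_ratings_total", and a "reviews" list); the function name summarizeReviews is made up for the example.

```php
<?php
// Hypothetical helper: reduce a Place Details JSON response to the
// overall rating, the review count, and a flat list of reviews.
// Returns null when the JSON is invalid or the status is not "OK".
function summarizeReviews(string $json): ?array
{
    $data = json_decode($json, true);
    if (!is_array($data) || ($data['status'] ?? '') !== 'OK') {
        return null;
    }

    $result = $data['result'] ?? [];
    $reviews = [];
    foreach ($result['reviews'] ?? [] as $review) {
        $reviews[] = [
            'author' => $review['author_name'] ?? '',
            'rating' => $review['rating'] ?? null,
            'text'   => $review['text'] ?? '',
        ];
    }

    return [
        'rating'  => $result['rating'] ?? null,
        'total'   => $result['user_ratings_total'] ?? null,
        'reviews' => $reviews,
    ];
}
```

Here summarizeReviews($html)['total'] would correspond to the review count the question is after, provided the API returns that field for the place.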

Manipulate dom with php to scrape data

I am currently trying to manipulate the DOM through PHP to extract the view count from an FB video page. The code below was working until recently, but now it doesn't find the node that contains the view count. This information is inside a div with the id fbPhotoPageMediaInfo. What would be the best way to manipulate the DOM through PHP to get the views of an FB video page?
private function _callCurl($url)
{
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Linux; Android 5.0.1; SAMSUNG-SGH-I337 Build/LRX22C; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/42.0.2311.138 Mobile Safari/537.36');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 20);
curl_setopt($ch, CURLOPT_URL, $url);
$response = curl_exec($ch);
$http = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
return array(
$http,
$response,
);
}
function test()
{
$url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
$request = $this->_callCurl($url);
if ($request[0] == 200) {
$dom = new DOMDocument();
@$dom->loadHTML($request[1]);
$elm = $dom->getElementById('fbPhotoPageMediaInfo');
if (isset($elm->nodeValue)) {
$views = preg_replace('/[^0-9]/', '', $elm->nodeValue);
} else {
$views = null;
}
} else {
echo "Error!";
}
return isset($views) ? $views : null;
}
Here is what I've determined...
If you var_dump() on $request you can see that it's giving you a 302 code (redirect) rather than a 200 (ok).
Changing CURLOPT_FOLLOWLOCATION to true or commenting it out entirely makes the error go away, but now we're getting a different page from the one expected.
I ran the following to see where I was being redirected to:
$htm = file_get_contents("https://www.facebook.com/TaylorSwift/videos/10153665021155369/");
var_dump($htm);
This gave me a page saying I was using an outdated browser, and needed to update it. So apparently Facebook doesn't like the User Agent.
I updated it as follows:
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/44.0.2');
That appears to solve the problem.
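For the parsing step itself, here is a small sketch of reading a number out of an element by id with DOMDocument and DOMXPath. It works on an HTML string, so it can be checked without a network request. The function name extractViews is an assumption, and the id is expected to contain no double quotes because it is placed directly into the XPath expression.

```php
<?php
// Hypothetical helper: find the element with the given id and return
// only the digits of its text content, or null if nothing is found.
function extractViews(string $html, string $id): ?string
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query(sprintf('//*[@id="%s"]', $id));
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    $digits = preg_replace('/[^0-9]/', '', $nodes->item(0)->nodeValue);

    return $digits === '' ? null : $digits;
}
```

Calling extractViews($request[1], 'fbPhotoPageMediaInfo') returns null when the node is missing, which makes it easier to notice that the page markup has changed.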
Personally I prefer to use Simplehtmldom.
FB, like other high-traffic sites, updates its source to help prevent scraping. You may have to adjust your node search in the future.
<?php
$ua = "Mozilla/5.0 (Windows NT 5.0) AppleWebKit/5321 (KHTML, like Gecko) Chrome/13.0.872.0 Safari/5321"; // must be a valid User Agent
ini_set('user_agent', $ua);
require_once('simplehtmldom/simple_html_dom.php'); // http://simplehtmldom.sourceforge.net/
Function Scrape_FB_Views($url) {
IF (!filter_var($url, FILTER_VALIDATE_URL) === false) {
// Create DOM from URL
$html = file_get_html($url);
IF ($html) {
IF (($html->find('span[class=fcg]', 3))) { // 4th instance of span with fcg class
$text = trim($html->find('span[class=fcg]', 3)->plaintext); // get content of span as plain text
$result = preg_replace('/[^0-9]/', '', $text); // replace all non-numeric characters
}ELSE{
$result = "Node is no longer valid.";
}
}ELSE{
$result = "Could not get HTML.";
}
}ELSE{
$result = "URL is invalid.";
}
return $result;
}
$url = "https://www.facebook.com/TaylorSwift/videos/10153665021155369/";
echo("<p>".Scrape_FB_Views($url)."</p>");
?>

copy particular div from Flipkart.com web scraping using Curl and Php

I want to copy the data contained in a particular div from a Flipkart product page and display it.
<table cellspacing="0" class="specTable">
///// contains /////
</table>
The number of tables varies: some pages have 10 tables with this class and some have more. How can I get the values from all of them?
I also want to get a specific specsValue. Is that possible?
<td class="specsKey">Brand</td><td class="specsValue">Apple</td>
Web page address: http://www.flipkart.com/apple-iphone-6/p/itme8ra5z7yx5c9j?pid=MOBEYHZ2JHVFHFBG
Sample code
<?php
$url = "http://dl.flipkart.com/dl/apple-iphone-6/p/itme8ra5z7yx5c9j?pid=MOBEYHZ2JHVFHFBG";
$response = getPriceFromFlipkart($url);
echo json_encode($response);
/* Returns the response in JSON format */
function getPriceFromFlipkart($url) {
$curl = curl_init($url);
curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 10.10; labnol;) ctrlq.org");
curl_setopt($curl, CURLOPT_FAILONERROR, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($curl);
curl_close($curl);
$regex = '/<meta itemprop="price" content="([^"]*)"/';
preg_match($regex, $html, $price);
$regex = '/<h1[^>]*>([^<]*)<\/h1>/';
preg_match($regex, $html, $title);
$regex = '/data-src="([^"]*)"/i';
preg_match($regex, $html, $image);
if ($price && $title && $image) {
$response = array("price" => $price[1], "title" => $title[1], "image" => $image[1]);
} else {
$response = array("status" => "404", "error" => "We could not find the product details on Flipkart $url");
}
return $response;
}
?>
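For the specTable part of the question, one possible approach is to parse the HTML with DOMDocument and collect every specsKey/specsValue pair with XPath, however many tables the page has. This is a sketch that assumes the markup looks like the snippet in the question; the function name extractSpecs is made up, and it only helps if the tables are present in the HTML that cURL actually receives.

```php
<?php
// Hypothetical helper: collect key => value pairs from every table whose
// class contains "specTable", pairing each specsKey cell with the
// specsValue cell that follows it in the same row.
function extractSpecs(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $specs = [];

    $keyCells = $xpath->query('//table[contains(@class, "specTable")]//td[contains(@class, "specsKey")]');
    foreach ($keyCells as $keyCell) {
        $valueCell = $xpath
            ->query('following-sibling::td[contains(@class, "specsValue")][1]', $keyCell)
            ->item(0);
        if ($valueCell !== null) {
            $specs[trim($keyCell->textContent)] = trim($valueCell->textContent);
        }
    }

    return $specs;
}
```

A single value can then be read by key, for example extractSpecs($html)['Brand'] ?? null. If two tables repeat the same key, the later one overwrites the earlier one in this sketch.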
Flipkart has since changed its interface; you can now fetch the product price and other details through the Flipkart API.
I'm currently using their API as well.
However, I also want to fetch the product details with the cURL code below. If anyone is doing this without problems, please share what else I need to add to fetch the product page content. When I debug it with curl_getinfo(), it returns 301 Moved Permanently with status code 0.
$curl_handle=curl_init();
curl_setopt($curl_handle,CURLOPT_URL,<flipkart_url>);
curl_setopt($curl_handle,CURLOPT_CONNECTTIMEOUT,100);
curl_setopt($curl_handle,CURLOPT_RETURNTRANSFER,1);
curl_setopt($curl_handle, CURLOPT_REFERER, 'http://www.flipkart.com/');
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
$str = curl_exec($curl_handle);
$html = new simple_html_dom();
$html->load($str);
