scraping "next pages" using PHP, cURL, simplehtmldom

I finally got my scraper working (somewhat), but now I'd like to know how I can automatically go to the next page and scrape the same info from there. I'm using cURL to copy the entire page (otherwise I get a 500 error). Here's my code:
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "http://example.com/results.asp?&j=t&page_no=1");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// $output contains the output string
$html = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
// print $html . "\n";
require 'simple_html_dom.php';
$dom = new simple_html_dom();
$dom->load($html);
foreach($dom->find("div[#id='schoolsearch'] tr") as $data){
$tds = $data->find("td");
if(count($tds)==3){
$record = array(
'school' => $tds[1]->plaintext,
'city' => $tds[2]->plaintext
);
print json_encode($record) . "\n";
file_put_contents('schools.csv', json_encode($record) . "\n", FILE_APPEND);
}
}
?>
It's not perfect, but it's what works right now! Anyone know how I can move to the next pages?

Wrap it in a loop:
$maxPages = 10;
for ($i = 1; $i <= $maxPages; $i++) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://example.com/results.asp?&j=t&page_no=$i");
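// ... the remaining cURL options, curl_exec(), curl_close() and the parsing code from the question go here ...
}
You'll need to tidy it up a bit more, for example to avoid including simple_html_dom.php on every iteration, but you get the idea.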
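A slightly fuller sketch of the same loop is shown below. The helper names pageUrl(), fetchPage() and scrapePages() are made up for illustration, and stopping at the first page that yields no records is an assumption about how the site behaves past its last page, not something confirmed by the question.

```php
<?php
// Hypothetical helper: builds the URL for one results page.
function pageUrl(string $base, int $page): string
{
    return $base . '&page_no=' . $page;
}

// Fetches one page with cURL; returns null if the request fails.
function fetchPage(string $url): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html === false ? null : $html;
}

// Walks pages 1..$maxPages and merges the records from each one.
// Stops early when a fetch fails or a page yields no records (assumed end of results).
function scrapePages(string $base, int $maxPages, callable $fetch, callable $parse): array
{
    $all = [];
    for ($page = 1; $page <= $maxPages; $page++) {
        $html = $fetch(pageUrl($base, $page));
        if ($html === null) {
            break;
        }
        $records = $parse($html);
        if (count($records) === 0) {
            break;
        }
        $all = array_merge($all, $records);
    }
    return $all;
}
```

In the question's setting, $fetch would be 'fetchPage' and $parse would be a function wrapping the simple_html_dom code from the question so that it returns the $record arrays instead of printing them. With that arrangement, require 'simple_html_dom.php' runs once at the top of the script rather than once per page.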

Related

PHP Simple HTML Dom parser returns 0

I use PHP Simple HTML DOM Parser to get some elements of a page. Unfortunately, the result I get is 0 or 1. I would like to get the innerHTML instead.
And here is my code:
<?php
include('simple_html_dom.php');
// We take the url we want to scrape
$URL = 'https://www.legifrance.gouv.fr/affichTexte.do?cidTexte=JORFTEXT000033011065&dateTexte=20160821';
// Curl init
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $URL);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec ($ch);
curl_close($ch);
// We get the html
$html = new simple_html_dom();
$html->load($result);
// Find all article blocks
foreach($html->find('div.data') as $article) {
$item['title'] = $article->find('.titreSection', 0) ->plaintext;
$resultat[] = "<p>" + $item['title']."</p></br>";
}
include 'vue_scrap.php';
?>
Here is the code of my view:
foreach ($resultat as $result){
echo $result;
}
Thank you for your help.

In fact, I just made a mistake on this line, where I used + (arithmetic addition) instead of . (string concatenation):
$resultat[] = "<p>" + $item['title']."</p></br>";
The correct version is:
$resultat[] = "<p>".$item['title']."</p></br>";
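To make the difference concrete, here is a minimal sketch of building the same list with the concatenation operator. The function name buildParagraphs() and the sample titles are invented for illustration. The call to htmlspecialchars() is an extra precaution that is not in the original code, and the fragment ends in a plain br tag rather than the closing-style one used above.

```php
<?php
// Builds one HTML fragment per title.
// "." joins strings; "+" asks PHP for numeric addition, which is why the
// original line produced a number instead of markup.
function buildParagraphs(array $titles): array
{
    $out = [];
    foreach ($titles as $title) {
        // Escape the text so characters such as & or < cannot break the markup.
        $out[] = '<p>' . htmlspecialchars(trim($title)) . '</p><br>';
    }
    return $out;
}

foreach (buildParagraphs(['Article 1 ', 'Titre A & B']) as $fragment) {
    echo $fragment, "\n";
}
```

In the scraper, the array passed in would be the plaintext values collected from .titreSection.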

Optimizing PHP Code and Storing JSON response to parse it faster?

I'm trying to figure out why this PHP code takes so long to run and output its results.
For example, this is my apitest.php, and here is my PHP code:
<?php
function getRankedMatchHistory($summonerId,$serverName,$apiKey){
$k;
$d;
$a;
$timeElapsed;
$gameType;
$championName;
$result;
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://".$serverName.".api.pvp.net/api/lol/".$serverName."/v2.2/matchhistory/".$summonerId."?api_key=".$apiKey);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$response = curl_exec($ch);
curl_close($ch);
$matchHistory = json_decode($response,true); // Is the Whole JSON Response saved at $matchHistory Now locally as a variable or is it requested everytime $matchHistory is invoked ?
for ($i = 9; $i >= 0; $i--){
$farm1 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["minionsKilled"];
$farm2 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["neutralMinionsKilled"];
$farm3 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["neutralminionsKilledTeamJungle"];
$farm4 = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["neutralminionsKilledEnemyJungle"];
$elapsedTime = $matchHistory["matches"][$i]["matchDuration"];
settype($elapsedTime, "integer");
$elapsedTime = floor($elapsedTime / 60);
$k = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["kills"];
$d = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["deaths"];
$a = $matchHistory["matches"][$i]["participants"]["0"]["stats"]["assists"];
$championIdTmp = $matchHistory["matches"][$i]["participants"]["0"]["championId"];
$championName = call_user_func('getChampionName', $championIdTmp); // calls another function to resolve championId into championName
$gameType = preg_replace('/[^A-Za-z0-9\-]/', ' ', $matchHistory["matches"][$i]["queueType"]);
$result = (($matchHistory["matches"][$i]["participants"]["0"]["stats"]["winner"]) == "true") ? "Victory" : "Defeat";
echo "<tr>"."<td>".$gameType."</td>"."<td>".$result."</td>"."<td>".$championName."</td>"."<td>".$k."/".$d."/".$a."</td>"."<td>".($farm1+$farm2+$farm3+$farm4)." in ". $elapsedTime. " minutes". "</td>"."</tr>";
}
}
?>
What I'd like to know is how to make the page output faster. It takes around 10 to 15 seconds to output the results, which makes the browser think the website is dead, as with a 500 Internal Server Error or something similar.
Here is a simple demonstration of how long it can take: Here
As you might have noticed, yes, I'm using the Riot API, which sends its response as JSON.
Here is an example of the response that this function handles: Here
What I thought of was creating a temporary file called temp.php at the start of the cURL function, saving the whole response there, and then reading the variables from it to speed up the process. After reading the variables, the script would delete temp.php, freeing up disk space and increasing the speed.
But I have no idea how to do that in PHP only.
By the way, I just started using PHP today, so I'd prefer some explanation with the answers if possible.
Thanks for your precious time.

Try benchmarking like this:
// start the timer
$start_curl = microtime(true);
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "https://".$serverName.".api.pvp.net/api/lol/".$serverName."/v2.2/matchhistory/".$summonerId."?api_key=".$apiKey);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
// debugging
curl_setopt($ch, CURLOPT_VERBOSE, true);
// start another timer
$start = microtime(true);
$response = curl_exec($ch);
echo 'curl_exec() in: '.(microtime(true) - $start).' seconds<br><br>';
// start another timer
$start = microtime(true);
curl_close($ch);
echo 'curl_close() in: '.(microtime(true) - $start).' seconds<br><br>';
// how long did the entire CURL take?
echo 'CURLed in: '.(microtime(true) - $start_curl).' seconds<br><br>';
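On the question in the code comment: once json_decode() returns, $matchHistory is an ordinary PHP array held in memory, so reading from it does not trigger another request. If the timers show that curl_exec() is where the time goes, one option is to cache the raw response on disk for a short while. The sketch below assumes a hypothetical helper cachedJson() with a caller-chosen cache file and lifetime. It is one possible shape for the temp-file idea from the question, not a definitive implementation.

```php
<?php
// Returns the decoded JSON, calling $fetch only when no fresh cache file exists.
// $cacheFile and $ttlSeconds are chosen by the caller (hypothetical values).
function cachedJson(string $cacheFile, int $ttlSeconds, callable $fetch): ?array
{
    // Reuse the cached copy while it is younger than the time-to-live.
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $ttlSeconds) {
        $cached = json_decode(file_get_contents($cacheFile), true);
        if (is_array($cached)) {
            return $cached;
        }
    }

    // Otherwise perform the slow request and validate what came back.
    $raw = $fetch();
    if (!is_string($raw)) {
        return null;
    }
    $data = json_decode($raw, true);
    if (!is_array($data)) {
        return null;
    }

    // Store the raw JSON for the next request; LOCK_EX avoids interleaved writes.
    file_put_contents($cacheFile, $raw, LOCK_EX);
    return $data;
}
```

Here $fetch would be a closure wrapping the cURL call from getRankedMatchHistory(), and the cache file name would normally include the summoner ID so that different players do not share one file. Unlike the temp.php idea, the file is kept between requests, since deleting it right after reading would force a new API call every time.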

Reading URL's and parse information

I have a text file with 5000 lines of URLs. What I'm trying to do is open every URL and extract every link that page contains.
My problem is that for the first line the script opens the URL and tells me how many links it has with no problem, but for the rest of the URLs in the file nothing is shown. The array looks like this:
Array
(
)
Array
(
)
My code:
$homepage = file_get_contents('***mytxt file****');
$pathComponents = explode(",", trim($homepage)); //line breaker
//echo "<pre>";print_r($pathComponents);echo "</pre>";
$count_nlines = count($pathComponents);
for ($i=0;$i<3;$i++) {
$request_url = $pathComponents[$i];
//echo $request_url . "<br>";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $request_url); // The url to get links from
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // We want to get the response
$result = curl_exec($ch);
$regex='|<a.*?href="(.*?)"|';
preg_match_all($regex,$result,$parts);
$links=$parts[1];
echo "<pre>";print_r($links);echo "</pre>";
curl_close($ch);
}
Any ideas?!

It looks like you're looping through the wrong thing. Try changing this:
for ($i=0;$i<3;$i++) {
To this:
for ($i = 0; $i < count($pathComponents); $i++) {
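Another thing worth checking is how the file is split. The question describes one URL per line but explodes on commas, so every entry after the first may carry a leading line break, which is enough to make cURL fail. The sketch below assumes two hypothetical helpers: parseUrlList() splits on any mix of whitespace and commas (which assumes the URLs themselves contain no commas), and extractLinks() reuses the regex from the question.

```php
<?php
// Splits raw text into clean URLs, whether they are separated by
// line breaks, commas, or both. Empty entries and duplicates are dropped.
function parseUrlList(string $text): array
{
    $parts = preg_split('/[\s,]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    return array_values(array_unique($parts));
}

// Extracts href values with the same pattern the question uses.
function extractLinks(string $html): array
{
    preg_match_all('|<a.*?href="(.*?)"|', $html, $parts);
    return $parts[1];
}
```

With these in place, the loop becomes foreach (parseUrlList($homepage) as $request_url), which also removes the need to count the entries by hand.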

php - xml data parsing

I am working on getting some xml data into a php variable so I can easily call it in my html webpage. I am using this code below:
$ch = curl_init();
$timeout = 5;
$url = "http://www.dictionaryapi.com/api/v1/references/collegiate/xml/define?key=0b03f103-f6a7-4bb1-9136-11ab4e7b5294";
$definition = simplexml_load_file($url);
echo $definition->entry[0]->def;
However my results are: .
I am not sure what I am doing wrong and I have followed the php manual, so I am guessing it is something obvious but I am just not understanding it correctly.
The actual XML results from that link used in cURL are visible by clicking the link below; I did not post them because they are rather long:
http://www.dictionaryapi.com/api/v1/references/sd3/xml/test?key=9d92e6bd-a94b-45c5-9128-bc0f0908103d

<?php
$ch = curl_init();
$timeout = 5;
$url = "http://www.dictionaryapi.com/api/v1/references/collegiate/xml/define?key=0b03f103-f6a7-4bb1-9136-11ab4e7b5294";
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
$data = curl_exec($ch);
curl_close($ch); // you were missing a semicolon
$definition = new SimpleXMLElement($data);
echo '<pre>';
print_r($definition->entry[0]->def);
echo '</pre>';
// this returns the SimpleXML Object
// to get parts, you can do something like this...
foreach($definition->entry[0]->def[0] as $entry) {
echo $entry[0] . "<br />";
}
// which returns
transitive verb
14th century
1 a
:to determine or identify the essential qualities or meaning of
b
:to discover and set forth the meaning of (as a word)
c
:to create on a computer
2 a
:to fix or mark the limits of :
b
:to make distinct, clear, or detailed especially in outline
3
:
intransitive verb
:to make a
Working Demo
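For experimenting without calling the API, the same traversal can be tried on an inline string. The XML below is a made-up sample only loosely shaped like the dictionary response, and definitions() is a hypothetical helper. A plausible reason the original echo printed almost nothing is that casting a SimpleXML element to a string returns only that element's own text, not the text of its children, so the child elements have to be read individually.

```php
<?php
// Made-up sample, loosely shaped like the dictionary response (an assumption).
$xml = <<<XML
<entry_list>
  <entry id="define">
    <ew>define</ew>
    <def>
      <date>14th century</date>
      <dt>:to determine or identify the essential qualities or meaning of</dt>
      <dt>:to discover and set forth the meaning of (as a word)</dt>
    </def>
  </entry>
</entry_list>
XML;

// Returns the text of every <dt> inside the first entry's <def>.
function definitions(string $xml): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false || !isset($doc->entry[0]->def)) {
        return [];
    }
    $out = [];
    foreach ($doc->entry[0]->def->dt as $dt) {
        // Cast each node to a string to get its text content.
        $out[] = trim((string) $dt);
    }
    return $out;
}

foreach (definitions($xml) as $line) {
    echo $line, "\n";
}
```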

php strange looping problem

Sorry for the long code, I'm really losing it.
This code is supposed to get a list of URLs through POST from a textarea, with a line break between each URL. The script should download each URL, go through the HTML and take some links, then follow those links, get some data and echo it out.
For some reason, it looks as if I'm running getDetails() only once, as I'm getting only one set of results.
I have checked multiple times that the foreach loop takes each URL separately, and that part is working.
Can anyone spot the problem?
require_once('simple_html_dom.php');
function getDetails($html) {
$dom = new simple_html_dom;
$dom->load($html);
$title = $dom->find('h1', 0)->find('a', 0);
foreach($dom->find('span[style="color:#333333"]') as $element) {
$address = $element->innertext;
}
$address = str_replace("<br>"," ",$address);
$address = str_replace(","," ",$address);
$title->innertext = str_replace(","," ",$title->innertext);
if ($address == "") {
$exp = explode("<strong><strong>",$html);
$exp2 = explode("</strong>",$exp[1]);
$address = $exp2[0];
}
echo $title->innertext . "," . $address . "<br>";
}
function getHtml($Url) {
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $Url);
curl_setopt($ch, CURLOPT_REFERER, "http://www.google.com/");
curl_setopt($ch, CURLOPT_USERAGENT, "MozillaXYZ/1.0");
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$output = curl_exec($ch);
curl_close($ch);
return $output;
}
function getdd($u) {
$html = getHtml($u);
$dom = new simple_html_dom;
$dom->load($html);
foreach($dom->find('a') as $element) {
if (strstr($element->href,"display_one.asp")) {
$durls[] = $element->href;
}
}
return $durls;
}
if (isset($_POST['url'])) {
$urls = explode("\n",$_POST['url']);
foreach ($urls as $u) {
$durls2 = getdd($u);
$durls2 = array_unique($durls2);
foreach ($durls2 as $durl) {
$d = getHtml("http://www.example.co.il/" . $durl);
getDetails($d);
}
}
}

You're only assigning the last element in the loop, it looks like. You'll need to concatenate. Something like $address .= $element->innertext; inside the loop (note the .= instead of =).
edit: unless I'm mistaking what it's supposed to be doing. I think I may've been focusing on the wrong part of the code.
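If accumulating every matching span is indeed what is wanted, the idea can be sketched as follows. joinFragments() is a hypothetical helper, and the clean-up mirrors the two str_replace() calls in getDetails(), with whitespace collapsing added as an assumption about the desired output.

```php
<?php
// Joins every fragment instead of keeping only the last one.
function joinFragments(array $fragments): string
{
    $address = ''; // initialise before the loop so an empty list is safe
    foreach ($fragments as $fragment) {
        $address .= $fragment . ' '; // ".=" appends; "=" would overwrite
    }
    // Same replacements as getDetails(): <br> and commas become spaces.
    $address = str_replace(['<br>', ','], ' ', $address);
    // Collapse runs of whitespace left behind by the replacements.
    return trim(preg_replace('/\s+/', ' ', $address));
}
```

Initialising $address before the loop also avoids an undefined variable when the page has no matching span, which the original getDetails() does not guard against.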

When you parse HTML with DOMDocument, whether you load it with $dom->loadHTMLFile() or $dom->loadHTML(), you should also call libxml_use_internal_errors(true) beforehand so that improperly formatted HTML does not trigger a stream of parser warnings.
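As a rough illustration of that advice, the link-collecting step from getdd() could be written with DOMDocument as below. findLinks() is a hypothetical name, and this swaps simple_html_dom for DOMDocument, so it is an alternative to the question's code rather than a fix for it.

```php
<?php
// Returns the unique hrefs that contain $needle, tolerating malformed markup.
function findLinks(string $html, string $needle): array
{
    if ($html === '') {
        return []; // loadHTML() rejects empty input
    }

    // Collect parser errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the earlier setting

    $links = []; // always an array, even when nothing matches
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if (strpos($href, $needle) !== false) {
            $links[] = $href;
        }
    }
    return array_values(array_unique($links));
}
```

Because $links starts as an empty array, the caller can pass the result straight to a foreach even when a page has no display_one.asp links, which is a case where the original getdd() returns an undefined variable.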
