I am trying to skip the iteration when innertext is empty, but it puts in a default value instead.
This is my code:
if (strip_tags($result[$c]->innertext) == '') {
$c++;
continue;
}
This is the output:
Thanks
EDIT2: I ran
var_dump($result[$c]->innertext)
and I got this:
how can I fix it please?
EDIT3: This is my code. I extract the names of the teams and the results this way, but the last part does not work well when we have postponed matches:
<?php
include('../simple_html_dom.php');
function getHTML($url, $timeout)
{
    $ch = curl_init($url); // initialize curl with given url
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set useragent
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to execute
    curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    return @curl_exec($ch);
}
$response=getHTML("https://www.betexplorer.com/soccer/japan/j3-league/results/",10);
$html = str_get_html($response);
$titles = $html->find("a[class=in-match]"); // name match
$result = $html->find("td[class=h-text-center]/a"); // result match
$c=0; $b=0; $o=0; $z=0; $h=0; // set counters
foreach ($titles as $match) { // get all data
    $match_status = $result[$h++];
    if (strip_tags($result[$c]->innertext) == 'POSTP.') { // bypass postponed match, but it doesn't work anymore
        $c++;
        continue;
    }
    list($num1, $num2) = explode(':', $result[$c++]->innertext); // <- explode
    $num1 = intval($num1);
    $num2 = intval($num2);
    $num3 = ($num1 + $num2);
    $risultato = ($num1 . '-' . $num2);
    list($home, $away) = explode(' - ', $titles[$z++]->innertext); // <- explode
    $home = strip_tags($home);
    $away = strip_tags($away);
    $matchunit = $home . ' - ' . $away;
    echo "<tr><td class='rtitle'>".
         "<td> ".$matchunit. "</td> / " . // name match
         "<td class='first-cell'>" . $risultato . "</td> " .
         "</td></tr><br/>";
} // close foreach
?>
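Aside: the empty-string test from the first snippet can fail because innertext often still contains whitespace or `&nbsp;` entities, so `strip_tags(...) == ''` never matches. A minimal sketch of a normalising helper (the name cell_text is mine, not from the question):

```php
<?php
// Normalise a scraped cell before comparing: strip tags, decode entities
// such as &nbsp;, then trim ordinary and non-breaking spaces.
function cell_text(string $html): string
{
    $text = strip_tags($html);
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim($text, " \t\n\r\0\x0B\u{00A0}");
}

foreach (['&nbsp;', 'POSTP.', '<b> 2:1 </b>'] as $cell) {
    $text = cell_text($cell);
    if ($text === '' || $text === 'POSTP.') {
        continue; // skip empty cells and postponed matches
    }
    echo $text, "\n"; // prints only: 2:1
}
```

With a strict `===` comparison against the normalised text, both empty and postponed cells are skipped reliably.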
When you scrape a website's content you will always be dependent on changes made to it in the future.
That said, I would use PHP's native libxml DOM extension instead,
by doing the following:
<?php
function getHTML($url, $timeout)
{
    $ch = curl_init($url); // initialize curl with given url
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER["HTTP_USER_AGENT"]); // set useragent
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // write the response to a variable
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects if any
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout); // max. seconds to execute
    curl_setopt($ch, CURLOPT_FAILONERROR, 1); // stop when it encounters an error
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, false);
    return @curl_exec($ch);
}
$response=getHTML("https://www.betexplorer.com/soccer/japan/j3-league/results/",10);
// Create the new DOM document
$doc = new DOMDocument();
// Load HTML from a string, and disable xml error handling
$doc->loadHTML($response, LIBXML_NOERROR);
// Create a new DOMXPath object
$xpath = new DOMXPath($doc);
// Evaluate the given XPath expression and get all tr's except the first row of the main table
$row = $xpath->query('//table[@class="table-main js-tablebanner-t js-tablebanner-ntb"]/tr[position()>1]');
echo '<table>';
// Parse the values in each row
foreach ($row as $tr) {
    // Only get the first 2 td's
    $col = $xpath->query('td[position()<3]', $tr);
    // Do not show POSTP and Round values
    if (!str_contains($tr->nodeValue, 'POSTP') && !str_contains($tr->nodeValue, 'Round')) {
        echo '<tr><td>'.$col->item(0)->nodeValue.'</td><td>'.$col->item(1)->nodeValue.'</td></tr>';
    }
}
echo '</table>';
You obtain:
<tr><td>Nagano - Tegevajaro Miyazaki</td><td>3:2</td></tr>
<tr><td>YSCC - Toyama</td><td>1:2</td></tr>
...
Related
After transferring my previously working code from PHP 5.6 to a new host and server running PHP 7.2, I'm now getting this Fatal error: Uncaught Error: Call to a member function find() on array .... How do I fix this?
<?php
// Get Source
$html = file_get_html('URL');
// Get needed table
$table = $html->find('table',1);
// Find each row, starting with the 2nd, and echo the cells
foreach ($table->find('tr') as $rowNumber => $row) {
    if ($rowNumber < 1) continue;
    $cell = $row->find('td', 0)->plaintext;
    echo $cell;
    $cell2 = $row->find('td', 1)->plaintext;
    echo $cell2;
}
?>
UPDATE
So it seems the source of the error is file_get_html, which doesn't work reliably with PHP 7.
I've found two workarounds:
1) Through curl
// curl connection to the HTML file
$base = 'FULL PATH';
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $base);
curl_setopt($curl, CURLOPT_REFERER, $base);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$str = curl_exec($curl);
curl_close($curl);
// Create a DOM object
$html = new simple_html_dom();
// Load HTML from a string
$html->load($str);
2) Another through str_get_html
$html = str_get_html(file_get_contents('RELATIVE PATH'));
I guess the second is better?
Just check the file URL on line 3.
$html = file_get_html('URL');
Try to use absolute URL if you are using a relative one.
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
<?php
for ($x = 0; $x <= 25; $x++) {
    $ch = curl_init("https://uk.trustpilot.com/review/example.com?languages=all&page=$x");
    //curl_setopt($ch, CURLOPT_POST, true);
    //curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    //curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30); // timeout in seconds
    $trustpilot = curl_exec($ch);
    // Check if any error occurred
    if (curl_errno($ch))
    {
        die('Fatal error occurred');
    }
}
?>
This code will get all 25 pages of reviews for example.com. What I then want to do is put all the results into a JSON array or something.
I attempted the code below in order to maybe retrieve all of the names:
<?php
$trustpilot = preg_replace('/\s+/', '', $trustpilot); //This replaces any spaces with no spaces
$first = explode( '"name":"' , $trustpilot );
$second = explode('"' , $first[1] );
$result = preg_replace('/[^a-zA-Z0-9-.*_]/', '', $second[0]); //Don't allow special characters
?>
This is clearly a lot harder than I anticipated. Does anyone know how I could get all of the reviews into JSON (or something similar) for however many pages I choose? In this case I chose 25 pages' worth of reviews.
Thanks!
Do not parse HTML with regex.
Use DOMDocument and DOMXPath to parse it instead. Also, you create a new curl handle for each page but never close them, which is a resource/memory leak in your code. It is also a waste of CPU: you could keep re-using the same curl handle over and over instead of creating a new one for each page. Pro tip: this HTML compresses rather well, so use CURLOPT_ENCODING to download the pages compressed,
e.g:
<?php
declare(strict_types = 1);
header("Content-Type: text/plain;charset=utf-8");
$ch = curl_init();
curl_setopt($ch, CURLOPT_ENCODING, ''); // enables compression
$reviews = [];
for ($x = 0; $x <= 25; $x++) {
    curl_setopt($ch, CURLOPT_URL, "https://uk.trustpilot.com/review/example.com?languages=all&page=$x");
    // curl_setopt($ch, CURLOPT_POST, true);
    // curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30); // timeout in seconds
    $trustpilot = curl_exec($ch);
    // Check if any error occurred
    if (curl_errno($ch)) {
        die('fatal error: curl_exec failed, ' . curl_errno($ch) . ": " . curl_error($ch));
    }
    $domd = @DOMDocument::loadHTML($trustpilot);
    $xp = new DOMXPath($domd);
    foreach ($xp->query("//article[@class='review-card']") as $review) {
        $id = $review->getAttribute("id");
        $reviewer = $xp->query(".//*[@class='content-section__consumer-info']", $review)->item(0)->textContent;
        $stars = $xp->query('.//div[contains(@class,"star-item")]', $review)->length;
        $title = $xp->query('.//*[@class="review-info__body__title"]', $review)->item(0)->textContent;
        $text = $xp->query('.//*[@class="review-info__body__text"]', $review)->item(0)->textContent;
        $reviews[$id] = array(
            'reviewer' => mytrim($reviewer),
            'stars' => $stars,
            'title' => mytrim($title),
            'text' => mytrim($text)
        );
    }
}
curl_close($ch);
echo json_encode($reviews, JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE | (defined("JSON_UNESCAPED_LINE_TERMINATORS") ? JSON_UNESCAPED_LINE_TERMINATORS : 0) | JSON_NUMERIC_CHECK);

function mytrim(string $text): string
{
    return preg_replace("/\s+/", " ", trim($text));
}
output:
{
    "4d6bbf8a0000640002080bc2": {
        "reviewer": "Clement Skau Århus, DK, 3 reviews",
        "stars": 5,
        "title": "Godt fundet på!",
        "text": "Det er rigtig fint gjort at lave et example domain. :)"
    }
}
There is only one review for the URL you listed, and 4d6bbf8a0000640002080bc2 is the website's internal ID (probably a SQL DB id) for that review.
I have a very weird issue with curl and URLs defined inside an array.
I have an array of URLs and I want to perform an HTTP GET on those URLs with curl:
for ($i = 0, $n = count($array_station); $i < $n; $i++)
{
    $station = curl_init();
    curl_setopt($station, CURLOPT_VERBOSE, true);
    curl_setopt($station, CURLOPT_URL, $array_station[$i]);
    curl_setopt($station, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($station, CURLOPT_FOLLOWLOCATION, true);
    $response = curl_exec($station);
    curl_close($station);
}
If I define my $array_station as below
$array_station=array("http://www.example.com","http://www.example2.com");
the curl code above works flawlessly. But when $array_station is built as below (I scan a directory for a specific filename, then clean up the path into a URL), curl does not work: no error is shown and nothing happens.
$di = new RecursiveDirectoryIterator(__DIR__,RecursiveDirectoryIterator::SKIP_DOTS);
$it = new RecursiveIteratorIterator($di);
$array_station=array();
$i=0;
foreach ($it as $file) {
    if (pathinfo($file, PATHINFO_FILENAME) == "db_insert") {
        $string = str_replace('/web/htdocs/', 'http://', $file.PHP_EOL);
        $string2 = str_replace('/home', '', $string);
        $array_station[$i] = $string2;
        $i++;
    }
}
Do you have any ideas? I'm giving up :-(
I'm on mobile right now so I cannot test it, but why are you appending a newline (PHP_EOL) to the URL? Try removing the newline, or trim() the URL at the end.
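As a sketch of that cleanup (the path fragments below are made-up examples in the question's style, not the asker's real paths):

```php
<?php
// Build the URL from the file path WITHOUT appending PHP_EOL,
// then trim() defensively in case the path carries stray whitespace.
$file = '/web/htdocs/www.example.com/home/stations/db_insert';
$string = str_replace('/web/htdocs/', 'http://', $file); // no PHP_EOL here
$string2 = str_replace('/home', '', $string);
$url = trim($string2);
echo $url, "\n"; // http://www.example.com/stations/db_insert
```

A URL with a trailing newline is exactly the kind of thing curl silently chokes on, which matches the "no error, nothing happens" symptom.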
Add the lines of code below.
If there is a curl error, it will report the error number.
If the request is made, it will show the HTTP request and response headers: the request info is in $info and the response header is in $head.
for ($i = 0, $n = count($array_station); $i < $n; $i++)
{
    $station = curl_init();
    curl_setopt($station, CURLOPT_VERBOSE, true);
    curl_setopt($station, CURLOPT_URL, $array_station[$i]);
    curl_setopt($station, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($station, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($station, CURLOPT_HEADER, true);
    curl_setopt($station, CURLINFO_HEADER_OUT, true);
    $response = curl_exec($station);
    $head = $info = ''; // avoid undefined variables on the error path
    if (curl_errno($station)) {
        $response .= 'Retrieve Base Page Error: ' . curl_error($station);
    }
    else {
        $skip = intval(curl_getinfo($station, CURLINFO_HEADER_SIZE));
        $head = substr($response, 0, $skip);
        $response = substr($response, $skip);
        $info = var_export(curl_getinfo($station), true);
    }
    echo $head;
    echo $info;
    curl_close($station);
}
I am trying to get data from a URL using curl, and I've made a recursive function for this. I get the data successfully, but the problem I am facing is that when no result is found for the curl call, the page shows me nothing: only a blank page, no error at all. I've used var_dump() to test the response too, but found nothing.
Here is my recursive function:
function recursive_get_scrap($offset, $page_size, $urls, $original_array) {
    ini_set('max_execution_time', 1800);
    $of_set = $offset;
    $pg_size = $page_size;
    $off_sets = 'offset=' . $of_set . '&page_size=' . $pg_size;
    $url = $urls . $off_sets;
    $last_correct_array = $original_array;
    $ch1 = curl_init();
    // Disable SSL verification
    curl_setopt($ch1, CURLOPT_SSL_VERIFYPEER, false);
    // Will return the response; if false it prints the response
    curl_setopt($ch1, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch1, CURLOPT_HEADER, 0);
    curl_setopt($ch1, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch1, CURLOPT_URL, $url);
    // Execute
    $result2 = curl_exec($ch1);
    $info = curl_getinfo($ch1);
    if (curl_errno($ch1))
    {
        echo 'error:' . curl_error($ch1);
        //return $last_correct_array;
    }
    // Closing
    curl_close($ch1);
    if (!$result2 || strlen(trim($result2)) == 0 || $result2 == false) {
        echo 'no array';
    }
    if (isset($result2) && !empty($result2)) {
        echo 'in recursive function <br>';
        $a1 = json_decode($original_array, true);
        $a2 = json_decode($result2, true);
        $temp_array = array_merge_recursive($a1, $a2);
        $last_correct_array = $temp_array;
        $offset += 100;
        $page_size = 100;
        recursive_get_scrap($offset, $page_size, $urls, json_encode($last_correct_array));
    }
}
Now, all I want is that when nothing comes back from the curl call, the 'no array' message should be displayed.
Use this option to curl_setopt():
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
This will make curl_exec return the data instead of outputting it.
To see if it was successful you can then check $result and also
curl_error().
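For example (a sketch; the .invalid hostname is deliberately unresolvable, to demonstrate the error path):

```php
<?php
// With CURLOPT_RETURNTRANSFER set, curl_exec() returns the response body
// on success and false on failure, so the result can be checked directly.
$ch = curl_init('http://host.invalid/'); // .invalid never resolves
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$result = curl_exec($ch);
if ($result === false) {
    // curl_error() describes what went wrong (DNS failure, timeout, ...)
    echo 'curl failed: ' . curl_error($ch) . "\n";
} else {
    echo 'got ' . strlen($result) . " bytes\n";
}
curl_close($ch);
```

Note the strict `=== false` check: a successful request that returns an empty body would otherwise be indistinguishable from a failure.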
I am trying to add the results from the while loop into an XML request, but it is not showing up correctly. Below I first create a function, then I try to use that function inside the XML request in the curl call. It is hard to explain and easier to see:
// Getting room groups
function roomGroups() {
    $i = 2;
    while (isset($_GET['ad'.$i])) {
        $adult = $_GET['ad'.$i];
        $child = $_GET['ch'.$i];
        $childAge = $_GET['ch'.$i];
        echo "<RoomGroup><Room><numberOfAdults>".$adult."</numberOfAdults><numberOfChildren>".$child."</numberOfChildren><childAges>".$childAge."</childAges></Room></RoomGroup>";
        $i++;
    }
}
// Room availability request
$ch1 = curl_init();
$fp1 = fopen('room_request.xml', 'w');
curl_setopt($ch1, CURLOPT_URL, "http://api.ean.com/ean-services/rs/hotel/v3/avail?cid=379849&minorRev=13&apiKey=4sr8d8bsn75tpcuja6ypx5g3&locale=en_US&currencyCode=USD&customerIpAddress=67.20.125.193&customerUserAgent=Mozilla/5.0+(Windows+NT+6.1)+AppleWebKit/535.11+(KHTML,+like+Gecko)+Chrome/17.0.963.79+Safari/535.11&customerSessionId=0ABAA856-8502-E913-6982-E2210F904B72&xml=<HotelRoomAvailabilityRequest><hotelId>".$hid."</hotelId><arrivalDate>".$arrivalDate."</arrivalDate><departureDate>".$departingDate."</departureDate><RoomGroup><Room><numberOfAdults>".$adults."</numberOfAdults><numberOfChildren>".$children."</numberOfChildren><childAges>".$ages."</childAges></Room></RoomGroup>".roomGroups()."<includeDetails>true</includeDetails></HotelRoomAvailabilityRequest>");
curl_setopt($ch1, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch1, CURLOPT_HTTPHEADER, array('Accept: application/xml'));
curl_setopt($ch1, CURLOPT_HEADER, 0);
curl_setopt($ch1, CURLOPT_FILE, $fp1);
$val1 = curl_exec($ch1);
curl_close($ch1); // Close curl session
fclose($fp1); // Close file handle
$avail = simplexml_load_file('room_request.xml');
//The url that is passing the data looks like this: /hotel_request.php?hid=370111&d=05/02/2012&a=04/30/2012&r=3&ad2=2&ch2=22=16,17,&ad3=3&ch3=33=7,13,16,&
I get nothing back. Any help on this would be GREATLY appreciated!
roomGroups() doesn't return anything; it just echoes to the screen.
Try something like this:
function roomGroups() {
    $i = 2;
    $ret = '';
    while (isset($_GET['ad'.$i])) {
        $adult = $_GET['ad'.$i];
        $child = $_GET['ch'.$i];
        $childAge = $_GET['ch'.$i];
        $ret .= "<RoomGroup><Room><numberOfAdults>".$adult."</numberOfAdults><numberOfChildren>".$child."</numberOfChildren><childAges>".$childAge."</childAges></Room></RoomGroup>";
        $i++;
    }
    return $ret;
}