PHP Website Crawler Data Extraction Multiple Loop Error 404 - php

I am looking to crawl multiple gig listing websites to compile an ultimate listing guide with links back to the original websites.
A lot of these websites don't have an API, so I have to use a rather crudely put together PHP script to extract the data I require (e.g. date, venue, country, etc.).
Most sites have a fairly easy to use directory of gigs, but on certain sites, they require manually inputting information to get "relevant" shows to you.
So to get around this, I created a loop that worked on the basis of:
page.php?id=$counter+1
So it finds the last inserted gig into the db and carries on getting data for the next 100 or so.
BUT this only works on the condition that the gigs on the site will continue numerically accurately, and of course, they don't due to cancellations etc.
This leaves me with the wonderful
Warning: file_get_contents(http://www.domain.com/show/page.php?id=123456) [function.file-get-contents]: failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found in...
How is it possible to create a loop that will be able to skip these errors and carry on rather than just sitting on them?
Below is the entire code (Limit of +5 at the moment for testing)
include_once('simple_html_dom.php');
$cntqry = mysql_query("SELECT * FROM `gigs_tbl` ORDER BY `counter` DESC LIMIT 1");
$cntnum = mysql_num_rows($cntqry);
if($cntnum!=0)
{
$cntget = mysql_fetch_assoc($cntqry);
$start = $cntget['counter'];
}
else {
$start = 10767799;
}
$counter = 0;
$limit = $start +5;
for($start; $start < $limit; $start++) {
$counter = $start + 1;
$target_url = "http://www.domain.com/show/page.php?id=$counter";
$html = new simple_html_dom();
$html->load_file($target_url);
foreach($html->find('div[class=vevent]') as $showrow){
$artist = strip_tags($showrow->find('h2',0));
$genre = strip_tags($html->find('span[class=genre]',0));
$venue = strip_tags($showrow->find('span[class=location]',0));
$street = strip_tags($html->find('span[itemprop=streetAddress]',0));
$locality = strip_tags($html->find('span[itemprop=addressLocality]',0));
$postcode = strip_tags($html->find('span[itemprop=postalCode]',0));
$country = strip_tags($html->find('span[itemprop=addressRegion]',0));
$originalDate = strip_tags($html->find('meta[itemprop=startDate]',0)->content);
$newDate = date("U", strtotime($originalDate));
// INSERT
mysql_query("INSERT INTO `gigs_tbl` VALUES('','$counter','$newDate','$venue','$street','$locality','$postcode','$country','$genre','$artist','reverbnation')");
}
}
Ten virtual high fives to anyone who can guess which website is causing this issue ;)

find() returns an empty result if nothing is found (NULL when an index is passed, an empty array otherwise)... So, a way to do what you want is to exploit this :)
Since you didn't provide the real link, here's an example explaining how:
$start = 'u';
for($start; $start < 'x'; $start++) {
// The only correct url is => http://sourceforge.net/p/mingw/bugs/
$target_url = "http://sourceforge.net/p/ming".$start."/bugs/";
echo "<br/> Link: $target_url";
// @ suppresses the errors when the page doesn't exist
$data_string = @file_get_contents($target_url);
$html = new simple_html_dom();
// Load HTML from a string
$html->load($data_string);
// find() without an index returns an empty array if nothing is found
$search_elements = $html->find('#nav_menu_holder h1');
if($search_elements) {
echo "<br/> Page FOUND. Title => " . $search_elements[0]->plaintext;
}
else {
echo "<br/> Page NOT FOUND !!";
}
echo "<hr>";
// Clear DOM object
$html->clear();
unset($html);
}
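The same skip-and-continue idea can be sketched without a DOM library. This is a minimal sketch rather than the original poster's code: `fetchOrNull()` and `crawlRange()` are hypothetical helper names, and the fetcher is passed in as a callable so the loop can be tried without network access.

```php
<?php
// Hypothetical helper: returns the body, or null if the request failed (e.g. HTTP 404).
function fetchOrNull(string $url): ?string
{
    // @ silences the warning; a failed request makes file_get_contents() return false.
    $body = @file_get_contents($url);
    return $body === false ? null : $body;
}

// Hypothetical helper: tries $count consecutive ids after $lastId and skips the ones that fail.
function crawlRange(int $lastId, int $count, callable $fetch): array
{
    $pages = [];
    for ($id = $lastId + 1; $id <= $lastId + $count; $id++) {
        $body = $fetch($id);
        if ($body === null) {
            continue; // missing gig (cancelled, deleted): move on to the next id
        }
        $pages[$id] = $body;
    }
    return $pages;
}

// Stub fetcher for illustration only: pretend id 3 does not exist.
$stub = function (int $id): ?string {
    return $id === 3 ? null : "page $id";
};
$pages = crawlRange(0, 5, $stub);
```

In real use the callable would wrap `fetchOrNull()` with the site's URL pattern, and the returned bodies would be handed to the parser one by one.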

Related

I am getting really long wait times getting results using Google's Distance Matrix

I am building a small project to help find locations for a user and give the distance between the two. I found some code using the Distance Matrix API that works like a charm, that is, until I started including more locations to measure from the user. The code below worked great for three or four locations, but with 20 or more it takes up to eight or nine seconds. After spending a few hours reading about other projects that use the Distance Matrix, I believe this should not be the case: it should be able to produce hundreds of results in seconds. By results I mean the distance from point A to point B, yet I am getting bogged down with only 20 queries.
Am I utilizing the distance matrix in an incorrect way?
Note: $origin is defined outside the scope of the loop and holds the user's address.
Any help or recommendations would be greatly appreciated.
foreach ($AllC as $key=>$item){
echo $item;
$userQueryResult2 = mysqli_query($conn2, "SELECT id, addressP, cityP, StateP, zipP, milesTravel FROM `Performers` WHERE id = $item");
while($row = mysqli_fetch_array($userQueryResult2)){
$finaelId = $row['id'];
$AdP = $row['addressP'];
$CityP = $row['cityP'];
$StateP = $row['StateP'];
$ZipP = $row['zipP'];
$miles = $row['milesTravel'];
$origin = "$AD, $city $State $zip";
$destination = "$AdP, $CityP $StateP $ZipP";
$distance_data = file_get_contents('https://maps.googleapis.com/maps/api/distancematrix/json?units=imperial&origins='.urlencode($origin).'&destinations='.urlencode($destination).'&key=Hidden');
$distance_arr = json_decode($distance_data);
if ($distance_arr->status=='OK') {
$destination_addresses = $distance_arr->destination_addresses[0];
$origin_addresses = $distance_arr->origin_addresses[0];
} else {
//Need to Send to an error page
echo "<p>The request was Invalid. Please Contact Support</p>";
exit();
}
if ($origin_addresses=="" or $destination_addresses=="") {
//Need to send to an error page
echo "<p>Destination or origin address not found</p>";
exit();
}
// Get the elements as array
$elements = $distance_arr->rows[0]->elements;
$distance = $elements[0]->distance->text;
$duration = $elements[0]->duration->text;
echo "From: ".$origin_addresses."<br/> To: ".$destination_addresses."<br/> Distance: <strong>".$distance ."</strong><br/>";
echo "Duration: <strong>".$duration."</strong>";
echo "<br>";
if($distance > $miles){
array_push($tooFar, $finaelId);
} else {
array_push($alldone, $finaelId);
}
}
}
//}
echo $countall;
exit();
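One likely cause of the slowdown is that the loop above makes one HTTP request per performer. The Distance Matrix API accepts several destinations in a single request, separated by `|`, so the addresses can be batched. A minimal sketch, assuming the destination addresses have already been collected into an array; `buildDistanceMatrixUrl()` is a hypothetical helper, the addresses are made up, and the key is a placeholder:

```php
<?php
// Hypothetical helper: builds one Distance Matrix request for many destinations.
function buildDistanceMatrixUrl(string $origin, array $destinations, string $apiKey): string
{
    $query = http_build_query([
        'units'        => 'imperial',
        'origins'      => $origin,
        // Multiple destinations are separated by a pipe character.
        'destinations' => implode('|', $destinations),
        'key'          => $apiKey,
    ]);
    return 'https://maps.googleapis.com/maps/api/distancematrix/json?' . $query;
}

// Made-up addresses for illustration only.
$url = buildDistanceMatrixUrl(
    '1 Main St, Springfield IL 62701',
    ['2 Oak Ave, Decatur IL 62521', '3 Elm Rd, Peoria IL 61602'],
    'YOUR_API_KEY'
);
```

The response then carries one entry per destination in `rows[0]->elements`, in the order the destinations were sent. The API limits the number of elements per request, so a long list would need to be sent in chunks, for example with `array_chunk()`.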

Fetch users from Azure AD

I can't figure out why my loop isn't working at all. I have successfully connected to my client's directory and I am able to fetch some users. I have followed the PHP instructions, but that tutorial doesn't include an example for fetching all users, only the default page size of 100 users.
I am aware of the skipToken (explained here), but for some reason I have not been able to get it to work with my loop.
Basically, I first define an array with two sub-arrays.
$myArray = array();
$myArray['skipToken'] = "";
$myArray['users'] = array();
Then I perform the first fetch so I can get the skipToken and the batch of users that comes with it.
require_once("GraphServiceAccessHelper.php");
$users = GraphServiceAccessHelper::getFeed('users');
Pushing values into already existing arrays.
$myArray['skipToken'] = $users->{'odata.nextLink'};
$myArray['users'][] = $users->{'value'};
Now they are filled with information. Now it's time to loop!
for($i = 0; $i < 2; $i++){
if($myArray['skipToken'] != ""){
$skipToken = $myArray['skipToken'];
require_once("GraphServiceAccessHelper.php");
$users = GraphServiceAccessHelper::getNextFeed('users', $skipToken);
$myArray['skipToken'] = $users->{'odata.nextLink'};
$myArray['users'][] = $users->{'value'};
}
}
The console reports an error that points to the line in the loop where skipToken is set:
Notice: Undefined property: stdClass::$odata.nextLink
$myArray['skipToken'] = $users->{'odata.nextLink'};
Okay, I figured it out.
First I had to remove everything before the actual token.
$skipToken = $users->{'odata.nextLink'};
$skipToken = substr($skipToken, strpos($skipToken, "=") + 1);
Then, inside the loop, use that to get the new skipToken and do the same as above:
$new = GraphServiceAccessHelper::getNextFeed('users', $skipToken);
if(isset($new->{'odata.nextLink'})){
$skipToken = $new->{'odata.nextLink'};
} else{
break;
}
$skipToken = substr($skipToken, strpos($skipToken, "=") + 1);
$myArray['tokens'] = $skipToken;
$myArray['users'][] = $new->{'value'};
By checking whether 'odata.nextLink' exists, I can easily stop the while loop, since the last page doesn't contain 'odata.nextLink'.
if(isset($new->{'odata.nextLink'})){
$skipToken = $new->{'odata.nextLink'};
} else{
break;
}
I am appending each array of 100 users to another array that I can easily use outside PHP.
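Put together, the paging logic described in this answer could look roughly like the following. It is a sketch under stated assumptions: `fetchAllUsers()` and `extractSkipToken()` are hypothetical names, the two callables stand in for `GraphServiceAccessHelper::getFeed()` and `GraphServiceAccessHelper::getNextFeed()`, and the stub pages and link format are made up. Unlike the answer, it merges all pages into one flat list instead of appending each page as a sub-array.

```php
<?php
// Extracts the token from a nextLink: everything after the first "=".
function extractSkipToken(string $nextLink): string
{
    return substr($nextLink, strpos($nextLink, '=') + 1);
}

// Hypothetical helper: $getFirst() returns the first page, $getNext($token) the following ones.
function fetchAllUsers(callable $getFirst, callable $getNext): array
{
    $users = [];
    $page = $getFirst();
    while (true) {
        $users = array_merge($users, $page->value);
        if (!isset($page->{'odata.nextLink'})) {
            break; // the last page carries no nextLink
        }
        $page = $getNext(extractSkipToken($page->{'odata.nextLink'}));
    }
    return $users;
}

// Stub pages for illustration only; the link format is invented.
$seenToken = null;
$getFirst = function () {
    return (object) ['value' => ['alice', 'bob'], 'odata.nextLink' => 'users?$skiptoken=abc'];
};
$getNext = function (string $token) use (&$seenToken) {
    $seenToken = $token;
    return (object) ['value' => ['carol']];
};
$all = fetchAllUsers($getFirst, $getNext);
```

Checking `isset()` before reading `odata.nextLink` is what avoids the "Undefined property" notice on the last page.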

SimpleXML to get specific data from an XML file

I am making a plugin for JomSocial that will display billing information for a logged-in user, based on an XML file. I have made good headway creating the plugin; I just can't seem to get the syntax right for the PHP statements that populate data in various places on the page. Here is the XML file:
<Inquiry>
<Billing>
<Version>4.5.1</Version>
<startTime><![CDATA[4/15/2014 11:09 PM]]></startTime>
<endTime><![CDATA[4/15/2014 11:12 PM]]></endTime>
<Date>20140415</Date>
<MemberId ID="0ESING">
<BillingInfo>
<StatementEndDate>20140430</StatementEndDate>
<BillingSubAccount>
</BillingSubAccount>
<BalanceForward>628.32</BalanceForward>
<BalanceDue>372</BalanceDue>
<Payments>-300</Payments>
</MemberId>
<MemberId ID="F00421">
</BillingInfo>
<BillingInfo>
<StatementEndDate>20140430</StatementEndDate>
<BillingSubAccount>
</BillingSubAccount>
<BalanceForward>1158.36</BalanceForward>
<BalanceDue>93.45</BalanceDue>
<Payments>-1158.36</Payments>
Here is the PHP so far:
$user =& CFactory::getRequestUser();
$cuser = CFactory::getUser();
$owner = CFactory::getUser($row->user->id);
$ptype = $cuser->getProfileType();
$billingid = $owner->getInfo('FIELD_BILLINGID');
$lastname = $owner->getInfo('FIELD_FAMILYNAME');
$uname = $cuser->username;
$memid = $cuser->id;
$name = $cuser->getDisplayName();
$isMine = COwnerHelper::isMine($cuser->id, $user->id);
$config = CFactory::getConfig();
$source = file_get_contents('data/201404.xml');
$xml = new SimpleXMLElement($source);
$balance = $xml->Billing->MemberId->BillingInfo->BalanceDue;
$BalanceForward = $xml->Billing->MemberId->BillingInfo->BalanceForward;
$Payments = $xml->Billing->MemberId->BillingInfo->Payments;
ob_start();
if( $isMine ) {
if($ptype == '2') {
if(strcasecmp($uname, $billingid) == 0) {
Then, in page to call the fields:
<?php echo "<div>Balance Due: $". $balance ." | Balance Forward: $" . $BalanceForward . " | Payment: $" . $Payments . "</div>"; ?>
This pulls in the first record of the XML file. I was trying something like this for hours:
$source = file_get_contents('data/201404.xml');
$xml = new SimpleXMLElement($source);
$balance = $xml->Billing->MemberId[.$uname.]->BillingInfo->BalanceDue;
$BalanceForward = $xml->Billing->MemberId[.$uname.]->BillingInfo->BalanceForward;
$Payments = $xml->Billing->MemberId[.$uname.]->BillingInfo->Payments;
to no avail. I would like to 'pull' the child node from the XML where the MemberId ID= "yadayada" is equal to the $uname. I hope I am being clear, this is my first post on Stackoverflow!
Square bracket notation accesses an attribute by its name, so you are asking for MemberId['0ESING'], which isn't right because the attribute is named ID.
You can iterate over the members to find the match like so:
foreach($xml->Billing->MemberId as $member){
if($member['ID'] == $uname){
$balance = $member->BillingInfo->BalanceDue;
$BalanceForward = $member->BillingInfo->BalanceForward;
$Payments = $member->BillingInfo->Payments;
}
}
Why are you not using simplexml_load_file() instead? It saves you the hassle of loading the file manually and passing it to a SimpleXMLElement. Furthermore, I think your issues arise because the XML file itself is invalid: it needs exactly one root element, but instead seems to contain zero (or multiple, depending on how one reads the file).
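The attribute match can be wrapped in a small function that also copes with a missing member. This is a minimal sketch that assumes a well-formed version of the file, with each `BillingInfo` nested inside its `MemberId`; `findBillingInfo()` is a hypothetical helper and the inline XML is a trimmed stand-in for `data/201404.xml`.

```php
<?php
// Trimmed, well-formed stand-in for the billing file (an assumption about its intended shape).
$source = <<<XML
<Inquiry>
  <Billing>
    <MemberId ID="0ESING">
      <BillingInfo>
        <BalanceForward>628.32</BalanceForward>
        <BalanceDue>372</BalanceDue>
        <Payments>-300</Payments>
      </BillingInfo>
    </MemberId>
    <MemberId ID="F00421">
      <BillingInfo>
        <BalanceForward>1158.36</BalanceForward>
        <BalanceDue>93.45</BalanceDue>
        <Payments>-1158.36</Payments>
      </BillingInfo>
    </MemberId>
  </Billing>
</Inquiry>
XML;

// Hypothetical helper: returns the BillingInfo node for a member, or null if there is no match.
function findBillingInfo(SimpleXMLElement $xml, string $memberId): ?SimpleXMLElement
{
    foreach ($xml->Billing->MemberId as $member) {
        // Cast the attribute to string so the comparison is strict.
        if ((string) $member['ID'] === $memberId) {
            return $member->BillingInfo;
        }
    }
    return null;
}

$xml = new SimpleXMLElement($source);
$info = findBillingInfo($xml, 'F00421');
$balance = $info !== null ? (string) $info->BalanceDue : null;
```

Returning null for an unknown ID lets the page show a "no billing data" message instead of silently displaying the first record.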

Codeigniter post view count function -

Hello, I'm trying to make a post view count function based on the CodeIgniter framework, and what I have done so far works great as far as I can tell. The function is complete and everything works fine.
But I would like some experts to tell me whether this is the correct usage and whether any of it will harm my page loading. The function is accessed through a jQuery get call on every click on the post.
First I check whether there is a row for the IP address and whether it was inserted more than 24 hours ago. If so, I delete the current row and then check again, because the previous select still remembers the last IP address, and insert again with the new datetime.
Another question: should I run a cleanup job every week or so for all IP addresses?
Here is my code:
function show_count($post_id){
$ip = $this->input->ip_address();
$this->db->select('ip_address,data');
$this->db->from('post_views');
$this->db->where('ip_address',$ip);
$query = $this->db->get();
foreach ($query->result() as $row){
$ip_address = $row->ip_address;
$data = $row->data;
}
if(isset($ip_address) && time() >= strtotime($data) + 86400){
$this->db->where('ip_address',$ip);
$this->db->delete('post_views');
}
$this->db->select('ip_address');
$this->db->from('post_views');
$this->db->where('ip_address',$ip);
$query = $this->db->get();
foreach ($query->result() as $row){
$ip_address_new = $row->ip_address;
}
if(!isset($ip_address_new)){
$date = new DateTime('now', new DateTimeZone('Europe/Skopje'));
$this->db->set('views', 'views+ 1', false);
$this->db->where('post_id',$post_id);
$this->db->update('posts');
$data = array(
'ip_address'=>$ip,
'data'=>$date->format("Y-m-d H:i:s")
);
$this->db->insert('post_views',$data);
}
}
Thanks, any suggestions will be appreciated.
Instead of doing lots of queries to increment unique views on your posts, you should set and read cookies, and have a fallback method if cookies are not enabled.
$post_id = "???";
if(isset($_COOKIE['read_articles'])) {
//grab the JSON encoded data from the cookie and decode into PHP Array
$read_articles = json_decode($_COOKIE['read_articles'], true);
if(isset($read_articles[$post_id]) AND $read_articles[$post_id] == 1) {
//this post has already been read
} else {
//increment the post view count by 1 using update queries
$read_articles[$post_id] = 1;
//set the cookie again with the new article ID set to 1
setcookie("read_articles",json_encode($read_articles),time()+60*60*24);
}
} else {
//hasn't read an article in 24 hours?
//increment the post view count
$read_articles = Array();
$read_articles[$post_id] = 1;
setcookie("read_articles",json_encode($read_articles),time()+60*60*24);
}
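The cookie logic above can be separated from `setcookie()` and the database so it is easier to reason about. A minimal sketch, assuming the cookie holds a JSON object keyed by post id as in the answer; `registerView()` is a hypothetical helper name.

```php
<?php
// Hypothetical helper: decides whether a view should be counted and returns
// the updated cookie payload. $cookieJson is the raw cookie value, or null if unset.
function registerView(?string $cookieJson, string $postId): array
{
    $read = $cookieJson !== null ? json_decode($cookieJson, true) : [];
    if (!is_array($read)) {
        $read = []; // malformed cookie: start over
    }
    $alreadyRead = isset($read[$postId]);
    $read[$postId] = 1;
    return [!$alreadyRead, json_encode($read)];
}

// First visit to post 42 counts, a repeat visit does not, a different post does.
[$count1, $cookie] = registerView(null, '42');
[$count2, $cookie] = registerView($cookie, '42');
[$count3, $cookie] = registerView($cookie, '7');
```

The caller would pass `$_COOKIE['read_articles'] ?? null`, run the update query only when the first element is true, and then write the second element back with `setcookie()` and a 24 hour expiry.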

ExpressionEngine: how to get the path of a page given its entry_id (with the structure plug-in)

I'm trying to build an extension that would create pages for automatic redirections of short URLs, and to make it short, I need to get the path of a page given its entry_id.
Say, I have a page with the path: http://server.tld/index.php/path/to/my/page
But, in the code, I only know the entry_id of this page.
If I look on the exp_channel_titles table, I can get the url_title field. But it will only contain "page". And I'd like to get "/path/to/my/page". And there doesn't seem to be any API for this.
Do you know how I could proceed?
Thanks a lot.
I can't remember exactly where in the documentation it is, but I think your issue is coming from the fact that the page uris are not retrieved directly from the database.
They are instead located in the ExpressionEngine global configuration variables. I've been able to do a URL lookup by entry_id using the following code:
Note: This assumes you are using structure, pages module, etc.
<?php
$this->EE =& get_instance(); // Get global configuration variable
$site_id = $this->EE->config->item('site_id'); // Get site id (MSM safety)
$site_pages = $this->EE->config->item('site_pages'); // Get pages array
/** The array is indexed as follows:
* $site_pages[{site_id}]['url'] = '{site_url}'
* $site_pages[{site_id}]['uris'][{entry_id}] = '{page_uri}'
**/
$page_url = $site_pages[$site_id]['uris'][$entry_id];
?>
EDIT:
I initially stated that the uris are not in the database which is not strictly speaking true... Pages are actually stored as a hashed string in exp_sites.site_pages indexed by site id.
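Since not every entry has a page, the lookup is safer with a fallback. A minimal sketch, assuming an array shaped like the `site_pages` item described in this answer; `pageUriForEntry()` is a hypothetical helper and the sample data is made up.

```php
<?php
// Hypothetical helper: looks up a page URI in an array shaped like the
// site_pages config item; returns null if the entry has no page.
function pageUriForEntry(array $sitePages, int $siteId, int $entryId): ?string
{
    return $sitePages[$siteId]['uris'][$entryId] ?? null;
}

// Made-up sample data in the shape described in the answer.
$sitePages = [
    1 => [
        'url'  => 'http://server.tld/index.php/',
        'uris' => [12 => '/path/to/my/page', 13 => '/path/to'],
    ],
];
$uri = pageUriForEntry($sitePages, 1, 12);
```

The `??` operator avoids an "undefined index" notice when either the site id or the entry id is missing.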
I did not find anything better than the following code:
//
//
// Don't look the following code just yet . . .
//
//
// You'll be pulling your hair out of your head. Just read me first.
// Following is a huge SQL query that assumes that pages are not nested
// more than 9 levels deep. This is actually a hard limitation of EE, and I'm using
// that to get the information I need in only one query instead of nine.
//
// The following code is just to get the path of the current entry we are
// editing, so the redirect page will know where to redirect. And I did
// not find any function or API to get this information. Too bad.
//
// If you find any solution to that, please answer
// http://stackoverflow.com/questions/8245405/expressionengine-how-to-get-the-path-of-a-page-given-its-entry-id-with-the-str
// it might save other people all the trouble.
// S
//
// P
//
// O
//
// I
//
// L
//
// E
//
// R
// First, we get all the entry_id of all the elements that are parent to
// the current element in the structure table (created by the Structure
// plugin).
$q = $this->EE->db->query(
"SELECT
S1.entry_id AS entry9,
S2.entry_id AS entry8,
S3.entry_id AS entry7,
S4.entry_id AS entry6,
S5.entry_id AS entry5,
S6.entry_id AS entry4,
S7.entry_id AS entry3,
S8.entry_id AS entry2,
S9.entry_id AS entry1
FROM
exp_structure AS S1,
exp_structure AS S2,
exp_structure AS S3,
exp_structure AS S4,
exp_structure AS S5,
exp_structure AS S6,
exp_structure AS S7,
exp_structure AS S8,
exp_structure AS S9
WHERE
S1.entry_id = $entry_id AND
S1.parent_id = S2.entry_id AND
S2.parent_id = S3.entry_id AND
S3.parent_id = S4.entry_id AND
S4.parent_id = S5.entry_id AND
S5.parent_id = S6.entry_id AND
S6.parent_id = S7.entry_id AND
S7.parent_id = S8.entry_id AND
S8.parent_id = S9.entry_id");
// Then, we construct a similar query to get all the url_title attributes
// for these pages.
$path = array();
$sql = array("SELECT" => "SELECT", "FROM" => " FROM", "WHERE" => " WHERE");
$j = 1;
for($i = 1; $i <= 9; ++$i){
$id = $q->row("entry$i");
if($id > 0){
$sql['SELECT'] .= " CT$j.url_title AS title$j,";
$sql['FROM'] .= " exp_channel_titles as CT$j,";
$sql['WHERE'] .= " CT$j.entry_id = $id AND";
$j++;
}
}
$sql['SELECT'] = rtrim($sql['SELECT'], ",");
$sql['FROM'] = rtrim($sql['FROM'], ",");
$sql['WHERE'] = preg_replace("/ AND$/", "", $sql['WHERE']);
$sql = $sql['SELECT'] . $sql['FROM'] . $sql['WHERE'];
$q = $this->EE->db->query($sql);
// Finally, we can construct the path for the current page
$path = "/";
for($i = 1; $i < $j; ++$i){
$path .= $q->row("title$i") . '/';
}
//
// Blood and bloody ashes.
//
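An alternative to the nine-way self-join is to load the structure rows once and walk up the parent chain in PHP. This is only a sketch under assumptions: `buildPath()` is a hypothetical helper, `$rows` is assumed to map each entry_id to its parent_id and url_title (for example from exp_structure joined with exp_channel_titles), and top-level pages are assumed to have a parent_id that is not itself a key in `$rows`.

```php
<?php
// Hypothetical helper: $rows maps entry_id => ['parent_id' => int, 'url_title' => string].
function buildPath(array $rows, int $entryId): string
{
    $segments = [];
    $seen = [];
    while (isset($rows[$entryId]) && !isset($seen[$entryId])) {
        $seen[$entryId] = true; // guard against cycles in bad data
        array_unshift($segments, $rows[$entryId]['url_title']);
        $entryId = $rows[$entryId]['parent_id'];
    }
    return $segments ? '/' . implode('/', $segments) . '/' : '/';
}

// Made-up rows for illustration only.
$rows = [
    1 => ['parent_id' => 0, 'url_title' => 'path'],
    2 => ['parent_id' => 1, 'url_title' => 'to'],
    3 => ['parent_id' => 2, 'url_title' => 'my'],
    4 => ['parent_id' => 3, 'url_title' => 'page'],
];
$path = buildPath($rows, 4);
```

This works for any nesting depth and keeps the trailing slash that the query-based version produces.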
May I suggest asking the Structure devs via their support forum?
