simple_html_dom.php - php

I am using "simple_html_dom.php" to scrape data from Wikipedia. If I run the code on scraperwiki.com, it fails with exit status 139, and if I run the same code on my XAMPP server, the server hangs.
I have a set of links, and I'm trying to get the Literacy value from each of the pages.
If I run the code with one link, there is no problem and it returns the expected result.
If I try to get the data from all the pages in one go, I hit the problem above.
The code is:
<?php
$test=array
(
0 => "http://en.wikipedia.org/wiki/Andhra_Pradesh",
1 => "http://en.wikipedia.org/wiki/Arunachal_Pradesh",
2 => "http://en.wikipedia.org/wiki/Assam",
3 => "http://en.wikipedia.org/wiki/Bihar",
4 => "http://en.wikipedia.org/wiki/Chhattisgarh",
5 => "http://en.wikipedia.org/wiki/Goa",
for($ix=0;$ix<=9;$ix++){
$content = file_get_html($test[$ix]);
$tables = $content ->find('#mw-content-text table',0);
foreach ($tables ->children() as $child1) {
foreach($child1->find('th a') as $ele){
if($ele->innertext=="Literacy"){
foreach($child1->find('td') as $ele1){
echo $ele1->innertext;
}}} }}
Please tell me where I am going wrong. Is it a memory problem? Is there some XAMPP configuration I need to change?

<?php
require 'simple_html_dom.php';
$test = array(
0 => "http://en.wikipedia.org/wiki/Andhra_Pradesh",
1 => "http://en.wikipedia.org/wiki/Arunachal_Pradesh",
2 => "http://en.wikipedia.org/wiki/Assam",
3 => "http://en.wikipedia.org/wiki/Bihar",
4 => "http://en.wikipedia.org/wiki/Chhattisgarh",
5 => "http://en.wikipedia.org/wiki/Goa");
for($ix=0;$ix<count($test);$ix++){ // "<", not "<=": the last valid index is count($test) - 1
$content = file_get_html($test[$ix]);
$tables = $content ->find('#mw-content-text table',0);
foreach ($tables ->children() as $child1) {
foreach($child1->find('th a') as $ele){
if($ele->innertext=="Literacy"){
foreach($child1->find('td') as $ele1){
echo $ele1->innertext;
}
}
}
}
$content->clear();
}
?>
However, that is a lot of URLs to fetch in a single request. You may get a fatal "maximum execution time exceeded" error, or you may get error 324.
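One way to keep each iteration cheap and to survive a failed download is to fetch every page separately and pull the value out with a small helper. The following is only a minimal sketch: it swaps simple_html_dom for PHP's bundled DOM extension, and extractLiteracy() is a hypothetical name. It assumes the page has a table row whose header link reads "Literacy", as in the code above.

```php
<?php
// Sketch only: uses PHP's bundled DOM extension instead of simple_html_dom.
// extractLiteracy() is a hypothetical helper, not part of either library.
function extractLiteracy(string $html): ?string
{
    if ($html === '') {
        return null; // DOMDocument::loadHTML() rejects empty input
    }
    $previous = libxml_use_internal_errors(true); // collect parser warnings silently
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }
    $xpath = new DOMXPath($doc);
    // Cells of the row whose header link text is exactly "Literacy".
    $cells = $xpath->query("//tr[th//a[normalize-space(.)='Literacy']]/td");
    if ($cells === false || $cells->length === 0) {
        return null;
    }
    return trim($cells->item(0)->textContent);
}

// Usage (not executed here): fetch each page separately, skip failures.
// foreach ($test as $url) {
//     $html = @file_get_contents($url);
//     if ($html === false) { continue; }
//     echo $url, ': ', extractLiteracy($html) ?? 'not found', "\n";
// }
```

Because the helper works on a plain string, it can be tried on a saved copy of one page before running it against the whole list.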

Related

code needs to loop over minimum 2000 times in php foreach

I have a foreach loop that will run at least 2000 iterations:
foreach ($giftCardSchemeData as $keypreload => $preload) {
for ($i=0; $i <$preload['quantity'] ; $i++) {
$cardid = new CarddetailsId($uuidGenerator->generate());
$cardnumber = self::getCardNumber();
$cardexistencetype = ($key == "giftCardSchemeData") ? "Physical" : "E-Card" ;
$giftCardSchemeDataDb = array('preload' => array('value' => $preload['value'], 'expirymonths' => $preload['expiryMonths']));
$otherdata = array('cardnumber' => $cardnumber, 'cardexistencetype' => $cardexistencetype, 'isgiftcard' => true , 'giftcardamount' => $preload['value'],'giftCardSchemeData' => json_encode($giftCardSchemeDataDb), 'expirymonths' => $preload['expiryMonths'], 'isloyaltycard' => false, 'loyaltypoints' => null,'loyaltyCardSchemeData' => null, 'loyaltyRedeemAmount' => null, 'pinnumber' => mt_rand(100000,999999));
$output = array_merge($data, $otherdata);
// var_dump($output);
$carddetailsRepository = $this->get('oloy.carddetails.repository');
$carddetails = $carddetailsRepository->findByCardnumber($cardnumber);
if (!$carddetails) {
$commandBus->dispatch(
new CreateCarddetails($cardid, $output)
);
} else {
self::generateCardFunctionForErrorException($cardid, $output, $commandBus);
}
}
}
I have five foreach loops like the one above in total. Each time the function is called, all five loops run before the response is returned. This takes a long time, and PHP's maximum execution time is exceeded.
Is there any way to send the response first and then run the loops on the server side, so that the maximum execution time is not an issue? I also need to optimize the foreach.
Also, in Symfony I tried using try/catch for the existence check in the above code, but it returns an "Entity closed" error. I have temporarily used an existence check against the DB, but that needs optimizing too.
There seems to be a lot wrong (or to be optimized) with this code, but let's focus on your questions:
First, I think this shouldn't run in code that is triggered by a visitor.
You should separate two processes:
1. A cron job that generates everything that must be generated and saves it to a database. The cron job can take as much time as it needs. Look at Symfony's Console component.
2. A page that only displays the generated info by fetching it from the database and passing it to a Twig template.
However, looking at the code you posted, I think it can be greatly optimized as is. You seem to have a foreach loop that fetches variable data, and inside it a for loop that does not seem to generate much variability at all.
So most of the code inside the for loop is now being executed over and over again without making any actual changes.
Here is a concept that would give much higher performance. Of course, since I don't know the actual context of your code, you will have to "fix it".
$carddetailsRepository = $this->get('oloy.carddetails.repository');
$cardexistencetype = ($key == "giftCardSchemeData") ? "Physical" : "E-Card";
foreach ($giftCardSchemeData as $keypreload => $preload) {
$cardnumber = self::getCardNumber();
$carddetails = $carddetailsRepository->findByCardnumber($cardnumber);
$giftCardSchemeDataDb = array('preload' => array(
'value' => $preload['value'],
'expirymonths' => $preload['expiryMonths'],
));
$otherdata = array(
'cardnumber' => $cardnumber,
'cardexistencetype' => $cardexistencetype,
'isgiftcard' => true,
'giftcardamount' => $preload['value'],
'giftCardSchemeData' => json_encode($giftCardSchemeDataDb),
'expirymonths' => $preload['expiryMonths'],
'isloyaltycard' => false,
'loyaltypoints' => null,
'loyaltyCardSchemeData' => null,
'loyaltyRedeemAmount' => null,
'pinnumber' => 0, // placeholder, overwritten for each card in the loop below
);
$output = array_merge($data, $otherdata);
for ($i=0; $i <$preload['quantity'] ; $i++) {
$cardid = new CarddetailsId($uuidGenerator->generate());
$output['pinnumber'] = mt_rand(100000,999999);
if (!$carddetails) {
$commandBus->dispatch(
new CreateCarddetails($cardid, $output)
);
} else {
self::generateCardFunctionForErrorException($cardid, $output, $commandBus);
}
}
}
Also: if this code triggers any database inserts or updates, you don't want to trigger them on each iteration. Start some kind of database transaction and flush the queries every X iterations instead.
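The "flush every X iterations" idea can be sketched without tying it to a particular database layer. processInBatches() below is a hypothetical helper, and the two closures are in-memory stand-ins. In a Doctrine-based project the flush callback might call the entity manager's flush() and clear(), but that is an assumption about your setup.

```php
<?php
// Hypothetical helper: runs $handle for every item and calls $flush once per
// batch of $batchSize items, plus once at the end for any remainder.
function processInBatches(iterable $items, int $batchSize, callable $handle, callable $flush): int
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('Batch size must be at least 1.');
    }
    $pending = 0;
    $flushes = 0;
    foreach ($items as $item) {
        $handle($item);
        $pending++;
        if ($pending === $batchSize) {
            $flush();
            $flushes++;
            $pending = 0;
        }
    }
    if ($pending > 0) {
        $flush(); // write out the last, partially filled batch
        $flushes++;
    }
    return $flushes;
}

// Example with in-memory stand-ins for the database calls.
$queued = array();
$written = array();
$flushCount = processInBatches(
    range(1, 7),
    3,
    function ($card) use (&$queued) { $queued[] = $card; },
    function () use (&$queued, &$written) {
        $written[] = $queued; // a real flush would commit the queued inserts here
        $queued = array();
    }
);
echo $flushCount, " flushes\n"; // 3 flushes: batches of 3, 3 and 1
```

The batch size is a trade-off: larger batches mean fewer round trips, but more work is lost if one of them fails.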

How to Parse the ajax script in DOM Parser?

I am parsing scores from http://sports.in.msn.com/football-world-cup-2014/south-africa-v-brazil/1597383
I am able to parse all the attributes, but I am not able to parse the time.
I used:
$homepages = file_get_html("http://sports.in.msn.com/football-world-cup-2014/south-africa-v-brazil/1597383");
$teama = $homepages->find('span[id="clock"]');
Kindly help me
Since that particular site loads the values dynamically (through an AJAX request), you can't really parse the value from the initial page load.
<span id="clock"></span> // this tends to be empty on initial load
Normal scraping:
$homepages = file_get_contents("http://sports.in.msn.com/football-world-cup-2014/south-africa-v-brazil/1597383");
$doc = new DOMDocument();
@$doc->loadHTML($homepages);
$xpath = new DOMXPath($doc);
$query = $xpath->query("//span[@id='clock']");
foreach($query as $value) {
echo $value->nodeValue; // the traversal is correct, but this will be empty
}
My suggestion is that, instead of scraping the page, you fetch the value through a request as well, since it is a time (as the match goes on, it keeps changing until the game has ended). You can use the same request the page itself makes:
$url = 'http://sports.in.msn.com/liveplayajax/SOCCERMATCH/match/gsm/en-in/1597383';
$contents = file_get_contents($url);
$data = json_decode($contents, true);
echo '<pre>';
print_r($data);
echo '</pre>';
Should yield something like (a part of it actually):
[2] => Array
(
[Code] =>
[CommentId] => -1119368663
[CommentType] => manual
[Description] => FULL-TIME: South Africa 0-5 Brazil.
[Min] => 90'
[MinExtra] => (+3)
[View] =>
[ViewHint] =>
[ViewIndex] => 0
[EditKey] =>
[TrackingValues] =>
[AopValue] =>
)
You should get the 90' by using foreach. Consider this example:
foreach($data['Commentary']['CommentaryItems'] as $key => $value) {
if(stripos($value['Description'], 'FULL-TIME') !== false) {
echo $value['Min'];
break;
}
}
Should print: 90'

PHP Web scraping of Javascript generated contents

I am stuck with a scraping task in my project.
I want to grab the data from the link in $html: all the table content in the tr and td elements. Here I am trying to grab the links, but it only shows javascript: self.close()
<?php
include("simple_html_dom.php");
$html = file_get_html('http://www.areacodelocations.info/allcities.php?ac=201');
foreach($html->find('a') as $element)
echo $element->href . '<br>';
?>
Usually, this kind of page loads a bunch of JavaScript (jQuery, etc.), which then builds the interface and retrieves the data to be displayed from a data source.
So what you need to do is open that page in Firefox or similar, with a tool such as Firebug, in order to see what requests are actually being made. If you're lucky, you will find it directly in the list of XHR requests. As in this case:
http://www.govliquidation.com/json/buyer_ux/salescalendar.js
Notice that this course of action may infringe on some license or terms of use. Clear this with the webmaster/data source/copyright owner before proceeding: detecting and forbidding this kind of scraping is very easy, and identifying you is probably only slightly less so.
Anyway, if you issue the same call in PHP, you can directly scrape the data (provided there is no session/authentication issue, as seems the case here) with very simple code:
<?php
$url = "http://www.govliquidation.com/json/buyer_ux/salescalendar.js";
$json = file_get_contents($url);
$data = json_decode($json);
?>
This yields a data object that you can inspect and convert to CSV with a simple loop.
stdClass Object
(
[result] => stdClass Object
(
[events] => Array
(
[0] => stdClass Object
(
[yahoo_dur] => 11300
[closing_today] => 0
[language_code] => en
[mixed_id] => 9297
[event_id] => 9297
[close_meridian] => PM
[commercial_sale_flag] => 0
[close_time] => 01/06/2014
[award_time_unixtime] => 1389070800
[category] => Tires, Parts & Components
[open_time_unixtime] => 1388638800
[yahoo_date] => 20140102T000000Z
[open_time] => 01/02/2014
[event_close_time] => 2014-01-06 17:00:00
[display_event_id] => 9297
[type_code] => X3
[title] => Truck Drive Axles @ Killeen, TX
[special_flag] => 1
[demil_flag] => 0
[google_close] => 20140106
[event_open_time] => 2014-01-02 00:00:00
[google_open] => 20140102
[third_party_url] =>
[bid_package_flag] => 0
[is_open] => 1
[fda_count] => 0
[close_time_unixtime] => 1389045600
You retrieve $data->result->events, use fputcsv() on its items converted to array form, and Bob's your uncle.
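As a rough sketch of that loop: eventsToCsv() is a hypothetical helper, and the inline JSON is a small stand-in with a few fields from the dump above so that the example runs without a network call. It assumes every event is a flat object, and it uses the keys of the first event as the header row.

```php
<?php
// Hypothetical helper: writes decoded event objects to an open stream as CSV,
// using the keys of the first event as the header row.
function eventsToCsv(array $events, $stream): int
{
    $rows = 0;
    $header = null;
    foreach ($events as $event) {
        $row = (array) $event; // stdClass -> associative array
        if ($header === null) {
            $header = array_keys($row);
            fputcsv($stream, $header, ',', '"', '\\');
        }
        // Emit values in header order; keys missing from this event become empty.
        $line = array();
        foreach ($header as $key) {
            $line[] = isset($row[$key]) ? $row[$key] : '';
        }
        fputcsv($stream, $line, ',', '"', '\\');
        $rows++;
    }
    return $rows;
}

// Inline sample standing in for the downloaded JSON.
$json = '{"result":{"events":[{"event_id":"9297","type_code":"X3","open_time":"01/02/2014"}]}}';
$data = json_decode($json);
$out = fopen('php://memory', 'w+'); // swap for fopen('output.csv', 'w') to write a file
$count = eventsToCsv($data->result->events, $out);
rewind($out);
$csvText = stream_get_contents($out);
fclose($out);
echo $csvText;
```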
In the case of the second site, you have a table with several TR elements, and you want to catch the first two TD children of each TR.
By inspecting the source code you see something like this:
<tr>
<td> Allendale</td>
<td> Eastern Time
</td>
</tr>
<tr>
<td> Alpine</td>
<td> Eastern Time
</td>
So you just grab all the TR's
<?php
include("simple_html_dom.php");
$html = file_get_html('http://www.areacodelocations.info/allcities.php?ac=201');
$fp = fopen('output.csv', 'w');
if (!$fp) die("Cannot open output CSV - permission problems maybe?");
foreach($html->find('tr') as $tr) {
$csv = array(); // Start empty. A new CSV row for each TR.
// Now find the TD children of $tr. They will make up a row.
foreach($tr->find('td') as $td) {
// Get the TD's innertext (the raw value still needs cleaning)
$csv[] = $td->innertext;
}
fputcsv($fp, $csv);
}
fclose($fp);
?>
You will notice that the CSV text is "dirty". That is because the actual text is:
<td> Alpine</td>
<td> Eastern Time[CARRIAGE RETURN HERE]
</td>
So to have "Alpine" and "Eastern Time", you have to replace
$csv[] = $td->innertext;
with something like
$csv[] = trim(
html_entity_decode (
$td->innertext,
ENT_COMPAT | ENT_HTML401,
'UTF-8'
)
);
Check out the PHP man page for html_entity_decode() about character set encoding and entity handling. The above ought to work -- and an ought and fifty cents will get you a cup of coffee :-)

PHP ob_start skeleton only working first time

I have browsed around the suggested titles and found some answers but nothing that really worked, and so I turn to you...
I have a function that uses ob_start() to call a file from which the contents are used as a skeleton. Once the contents have been retrieved I use ob_end_clean().
I only seem to be getting output from the first time I call the function and nothing more afterwards. I have included a code dump in case I am doing something wrong.
I have also included a sample of what is returned from my database call ($dl->select ...)
I have also made sure that data is indeed being passed back from the database where I am expecting it to.
Array
(
[0] => Array
(
[rev_id] => 7
[rev_temp_tree_id] => 2
[rev_tree_id] =>
[rev_status] =>
[rev_last_updated_by] => 0
[rev_authorized_by] =>
[rev_date_updated] => 1334600174
[rev_date_reviewed] =>
[rev_update_type] => 1
[temp_tree_id] => 2
[temp_tree_bag_size] => 250
[temp_tree_botanical_id] =>
[temp_tree_stem_size] => 0
[temp_tree_crown] => 0
[temp_tree_price] => 0
[temp_tree_height] => 0
[temp_tree_plant_date] => 0
[temp_tree_review_id] =>
[temp_tree_comments] =>
[temp_tree_marked_move] => 0
[temp_tree_initial_location] =>
[temp_tree_coord] =>
[temp_tree_name] => TEST
[temp_tree_sale_status] => 0
[temp_tree_open_ground] =>
[temp_tree_block] => 0
[temp_tree_row] => 0
)
)
and the code...
<?php
function print_trees($trees){
$return = '';
ob_start();
include_once('skeletons/tree.html');
$tree_skeleton= ob_get_contents();
ob_end_clean();
$search=array("[tree_id]", "[tree_name]", "[Classes]", "[rev_id]");
foreach($trees as $t){
$replace=array($t['temp_tree_id'], $t['temp_tree_name'].' ['.$t['temp_tree_id'].']', 'temp_tree', $t['rev_id']);
$return.=str_replace($search,$replace,$tree_skeleton);
}
return $return;
}
switch ($_GET['mode']){
case 'trees' :
$db_status = '';
$new_trees = $dl->select('tree_review AS tr LEFT JOIN temp_tree_trees AS tt ON tr.rev_temp_tree_id=tt.temp_tree_id', 'tr.rev_update_type="1"');
echo '<h2>New Trees</h2>';
if($dl->totalrows>0){
echo print_trees($new_trees);
}
else{
echo 'no new trees for review';
}
echo '<br /><br />';
$new_trees = $dl->select('tree_review AS tr LEFT JOIN temp_tree_trees AS tt ON tr.rev_tree_id=tt.temp_tree_id', 'tr.rev_update_type="2"');
echo '<h2>Updated Trees</h2>';
if($dl->totalrows>0){
echo print_trees($new_trees);
}
else{
echo 'no update trees for review';
}
echo '<br /><br />';
$new_trees = $dl->select('tree_review AS tr LEFT JOIN temp_tree_trees AS tt ON tr.rev_tree_id=tt.temp_tree_id', 'tr.rev_update_type="3"');
echo '<h2>Moved Trees</h2>';
if($dl->totalrows>0){
echo print_trees($new_trees);
}
else{
echo 'no moved trees for review';
}
echo '<br /><br />';
$new_trees = $dl->select('tree_review AS tr LEFT JOIN temp_tree_trees AS tt ON tr.rev_tree_id=tt.temp_tree_id', 'tr.rev_update_type="4"');
echo '<h2>Duplicated Trees</h2>';
if($dl->totalrows>0){
echo print_trees($new_trees);
}
else{
echo 'no duplicated trees for review';
}
break;
}
?>
Any help would be appreciated.
Thanks in advance.
I believe it could be one of two things:
You might be having trouble with the fact that you're representing a multi-dimensional array as a string in the HTML file, and then attempting to operate on that with a string replace. You might be better off representing the file as a data structure (XML, JSON) and parsing apart that way - this will let you skip output buffering entirely.
Alternately, I'm not sure if $new_trees is an array of objects or something else. If it's an array of objects, the foreach() loop isn't going to work correctly i.e. it should be $t->temp_tree_id vs. $t['temp_tree_id']
Thanks for your comments.
I found out what the problem was, and it was rather stupid actually. The file that I am importing as a skeleton is included using include_once, so when I call the function again the file is not included a second time and the output buffer comes back empty.
@minitech I updated my code as you suggested, thanks.
@gadhra The content in the skeleton file is plain HTML, and I'm using keywords to replace the content from the DB into the HTML. I should have attached the HTML along with my code.
Thanks again :)
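For anyone hitting the same include_once behaviour: the simplest fix is to use include instead. Since the skeleton is plain HTML, another option is to read it as a string and skip output buffering altogether. The sketch below illustrates that second option; load_skeleton() is a hypothetical helper, and a temporary file stands in for skeletons/tree.html.

```php
<?php
// Hypothetical helper: returns the skeleton's contents on every call.
// The file is read once and cached, so repeated calls stay cheap.
function load_skeleton(string $path): string
{
    static $cache = array();
    if (!isset($cache[$path])) {
        $contents = file_get_contents($path);
        if ($contents === false) {
            throw new RuntimeException("Cannot read skeleton: $path");
        }
        $cache[$path] = $contents;
    }
    return $cache[$path];
}

// Demo with a temporary file standing in for skeletons/tree.html.
$path = tempnam(sys_get_temp_dir(), 'skel');
file_put_contents($path, '<li id="[tree_id]">[tree_name]</li>');
$search = array('[tree_id]', '[tree_name]');
$first = str_replace($search, array('2', 'TEST'), load_skeleton($path));
$second = str_replace($search, array('3', 'OAK'), load_skeleton($path));
unlink($path);
echo $first, "\n", $second, "\n";
```

Note that file_get_contents() does not execute any PHP inside the file, so this only fits skeletons that are static markup.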

getResources-Snippet with specified resourceIDs does not work

I have a TV named "Kategorie" and I want a list of all resources underneath a specified resource (27) grouped by this TV. Nothing special imho. This is my approach:
<?php
$resource_ids = $modx->getTree(27,1); # => 27 is the Container with the desired resources
# We need the 'Kategorie'-TV of the resources
$cat = $modx->getObject('modTemplateVar', array('name' => 'Kategorie'));
# Map Resources to Category
$resources_by_cat = array();
foreach ($resource_ids as $id) {
$resources_by_cat[$cat->getValue($id)][] = $id;
}
# Iterate over categories and output the resources
foreach ($resources_by_cat as $cat => $ids) {
$joined_ids = join(",",$ids); # => eg "33,34,56"
print "<h2>".$cat."</h2>";
print '<table class="references">';
print '
<tr>
<th class="title">Titel</th>
<th class="author">Von</th>
</tr>
';
print $modx->runSnippet('getResources', array(
"resources" => $joined_ids,
"includeTVs" => "1",
"tpl" => "referenceRow"
));
print '</table>';
}
?>
… which looks fine to me but throws this error to me:
[2011-01-05 12:26:24] (ERROR @ /index.php) Error 42000 executing statement: Array ( [0] => 42000 [1] => 1064 [2] => You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '1,2,15,18,27,23,30,3,4,22,24,26,47,5,6,7,8,9,10,11,12,14,13,17,16,19,20,49,50,21' at line 1 )
Does anyone know what's going on here? Or is there a better approach to my goal?
UPDATE
I updated to the most recent version of getResources. Now I don't get that error message. Yet it does not work. But the "parents" option does not work either.
I used $modx->getDocument($id) instead and it works now as expected.
foreach($ids as $rid) {
$doc = $modx->getDocument($rid);
var_dump($doc);
// real output trimmed
}
