PHP Web scraping of Javascript generated contents [duplicate] - php

I am stuck with a scraping task in my project.
I want to grab the data from the page loaded into $html: all the table content in the tr and td elements. Here I am trying to grab the links, but it only shows javascript: self.close()
<?php
include("simple_html_dom.php");
$html = file_get_html('http://www.areacodelocations.info/allcities.php?ac=201');
foreach($html->find('a') as $element)
echo $element->href . '<br>';
?>

Usually, this kind of page loads a bunch of JavaScript (jQuery, etc.), which then builds the interface and retrieves the data to be displayed from a data source.
So what you need to do is open that page in Firefox or similar, with a tool such as Firebug, in order to see which requests are actually being made. If you're lucky, you will find the data source directly in the list of XHR requests, as in this case:
http://www.govliquidation.com/json/buyer_ux/salescalendar.js
Notice that this course of action may infringe on some license or terms of use. Clear this with the webmaster/data source/copyright owner before proceeding: detecting and forbidding this kind of scraping is very easy, and identifying you is probably only slightly less so.
Anyway, if you issue the same call in PHP, you can directly scrape the data (provided there is no session/authentication issue, as seems the case here) with very simple code:
<?php
$url = "http://www.govliquidation.com/json/buyer_ux/salescalendar.js";
$json = file_get_contents($url);
$data = json_decode($json);
?>
This yields a data object that you can inspect and convert to CSV with a simple loop.
stdClass Object
(
[result] => stdClass Object
(
[events] => Array
(
[0] => stdClass Object
(
[yahoo_dur] => 11300
[closing_today] => 0
[language_code] => en
[mixed_id] => 9297
[event_id] => 9297
[close_meridian] => PM
[commercial_sale_flag] => 0
[close_time] => 01/06/2014
[award_time_unixtime] => 1389070800
[category] => Tires, Parts & Components
[open_time_unixtime] => 1388638800
[yahoo_date] => 20140102T000000Z
[open_time] => 01/02/2014
[event_close_time] => 2014-01-06 17:00:00
[display_event_id] => 9297
[type_code] => X3
[title] => Truck Drive Axles # Killeen, TX
[special_flag] => 1
[demil_flag] => 0
[google_close] => 20140106
[event_open_time] => 2014-01-02 00:00:00
[google_open] => 20140102
[third_party_url] =>
[bid_package_flag] => 0
[is_open] => 1
[fda_count] => 0
[close_time_unixtime] => 1389045600
You retrieve $data->result->events, use fputcsv() on its items converted to array form, and Bob's your uncle.
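The conversion described above can be sketched as follows. This is a minimal sketch, not the feed's real schema: the inline JSON is a trimmed stand-in for the live response, events_to_csv() is a hypothetical helper name, and it assumes every event is a flat object with the same keys.

```php
<?php
// Trimmed stand-in for the decoded feed; the real response has many more fields.
$json = '{"result":{"events":[
  {"event_id":9297,"title":"Truck Drive Axles # Killeen, TX","open_time":"01/02/2014","close_time":"01/06/2014"}
]}}';
$data = json_decode($json);

/**
 * Hypothetical helper: write a list of stdClass events to an open stream as CSV.
 * The first event's keys are used as the header row.
 * Returns the number of data rows written.
 */
function events_to_csv(array $events, $fp)
{
    $rows = 0;
    foreach ($events as $event) {
        $row = (array) $event; // stdClass -> associative array
        if ($rows === 0) {
            fputcsv($fp, array_keys($row), ',', '"', '\\'); // header row
        }
        fputcsv($fp, array_values($row), ',', '"', '\\');
        $rows++;
    }
    return $rows;
}

// php://memory keeps the sketch self-contained; use fopen('output.csv', 'w') for a real file.
$fp = fopen('php://memory', 'w+');
$written = events_to_csv($data->result->events, $fp);
rewind($fp);
$csv = stream_get_contents($fp);
fclose($fp);

echo $csv;
```

If some events carry nested values, the plain (array) cast is not enough and those fields would need to be flattened first.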

In the case of the second site, you have a table with several TR elements, and you want to catch the first two TD children of each TR.
By inspecting the source code you see something like this:
<tr>
<td> Allendale</td>
<td> Eastern Time
</td>
</tr>
<tr>
<td> Alpine</td>
<td> Eastern Time
</td>
So you just grab all the TR's
<?php
include("simple_html_dom.php");
$html = file_get_html('http://www.areacodelocations.info/allcities.php?ac=201');
$fp = fopen('output.csv', 'w');
if (!$fp) die("Cannot open output CSV - permission problems maybe?");
foreach($html->find('tr') as $tr) {
$csv = array(); // Start empty. A new CSV row for each TR.
// Now find the TD children of $tr. They will make up a row.
foreach($tr->find('td') as $td) {
// Get the TD's innertext (still raw at this point; see the cleanup below)
$csv[] = $td->innertext;
}
fputcsv($fp, $csv);
}
fclose($fp);
?>
You will notice that the CSV text is "dirty". That is because the actual text is:
<td> Alpine</td>
<td> Eastern Time[CARRIAGE RETURN HERE]
</td>
So to have "Alpine" and "Eastern Time", you have to replace
$csv[] = $td->innertext;
with something like
$csv[] = trim(
html_entity_decode (
$td->innertext,
ENT_COMPAT | ENT_HTML401,
'UTF-8'
)
);
Check out the PHP man page for html_entity_decode() about character set encoding and entity handling. The above ought to work -- and an ought and fifty cents will get you a cup of coffee :-)
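For reference, the cleanup can be wrapped in a small function and tried on the cell contents quoted above. This is only a sketch: clean_cell() is a hypothetical name, and the handling of &nbsp; is an assumption about what the leading blank in each cell might be.

```php
<?php
/**
 * Hypothetical helper: turn a TD's innertext into a clean CSV value.
 */
function clean_cell($innertext)
{
    // Decode entities such as &amp; and &nbsp; into real characters.
    $text = html_entity_decode($innertext, ENT_COMPAT | ENT_HTML401, 'UTF-8');
    // &nbsp; decodes to U+00A0, which trim() does not strip by default.
    $text = str_replace("\xC2\xA0", ' ', $text);
    // Collapse whitespace runs (including the stray line break) into one space.
    $text = preg_replace('/\s+/', ' ', $text);
    return trim($text);
}

echo clean_cell(" Eastern Time\r\n"), "\n";
echo clean_cell("&nbsp;Alpine"), "\n";
```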

Related

How to Parse the ajax script in DOM Parser?

I am parsing scores from http://sports.in.msn.com/football-world-cup-2014/south-africa-v-brazil/1597383
I am able to parse all the attributes, but I can't parse the time.
I used:
$homepages = file_get_html("http://sports.in.msn.com/football-world-cup-2014/south-africa-v-brazil/1597383");
$teama = $homepages->find('span[id="clock"]');
Kindly help me
Since that particular site loads the values dynamically (through an AJAX request), you can't really parse the value from the initial page load.
<span id="clock"></span> // this tends to be empty initial load
Normal scraping:
$homepages = file_get_contents("http://sports.in.msn.com/football-world-cup-2014/south-africa-v-brazil/1597383");
$doc = new DOMDocument();
@$doc->loadHTML($homepages);
$xpath = new DOMXPath($doc);
$query = $xpath->query("//span[@id='clock']");
foreach($query as $value) {
echo $value->nodeValue; // the traversal is correct, but this will be empty
}
My suggestion is that, instead of scraping the page, you fetch the value through a request of your own, since it is a time that keeps changing until the game has ended. You can simply reuse the request the site itself makes:
$url = 'http://sports.in.msn.com/liveplayajax/SOCCERMATCH/match/gsm/en-in/1597383';
$contents = file_get_contents($url);
$data = json_decode($contents, true);
echo '<pre>';
print_r($data);
echo '</pre>';
Should yield something like (a part of it actually):
[2] => Array
(
[Code] =>
[CommentId] => -1119368663
[CommentType] => manual
[Description] => FULL-TIME: South Africa 0-5 Brazil.
[Min] => 90'
[MinExtra] => (+3)
[View] =>
[ViewHint] =>
[ViewIndex] => 0
[EditKey] =>
[TrackingValues] =>
[AopValue] =>
)
You should get the 90' by using foreach. Consider this example:
foreach($data['Commentary']['CommentaryItems'] as $key => $value) {
if(stripos($value['Description'], 'FULL-TIME') !== false) {
echo $value['Min'];
break;
}
}
Should print: 90'
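The same lookup can be tried without hitting the live endpoint. In this sketch the inline JSON is a trimmed, made-up sample shaped like the output above (the first commentary item is invented for illustration), and minute_of() is a hypothetical helper name.

```php
<?php
// Trimmed sample shaped like the response above; real responses carry more keys.
$contents = <<<'JSON'
{"Commentary":{"CommentaryItems":[
  {"Description":"Goal! Brazil lead.","Min":"10'","MinExtra":""},
  {"Description":"FULL-TIME: South Africa 0-5 Brazil.","Min":"90'","MinExtra":"(+3)"}
]}}
JSON;

/**
 * Hypothetical helper: return the 'Min' value of the first commentary item
 * whose description contains $needle (case-insensitive), or null if none does.
 */
function minute_of(array $data, $needle)
{
    if (!isset($data['Commentary']['CommentaryItems'])) {
        return null; // unexpected response shape
    }
    foreach ($data['Commentary']['CommentaryItems'] as $item) {
        if (stripos($item['Description'], $needle) !== false) {
            return $item['Min'];
        }
    }
    return null;
}

$data = json_decode($contents, true);
echo minute_of($data, 'FULL-TIME'), "\n";
```

Checking for the keys first avoids notices if the endpoint returns an error page or a different structure.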

simple_html_dom.php

I am using "simple_html_dom.php" to scrape data from Wikipedia. If I run the code on scraperwiki.com it fails with exit status 139, and if I run the same code on my XAMPP server, the server hangs.
I have a set of links.
I'm trying to get the Literacy value from each of the pages.
If I run the code with one link there is no problem and it returns the expected result.
If I try to get the data from all the pages in one go, I run into the problem above.
The code is:
<?php
$test=array
(
0 => "http://en.wikipedia.org/wiki/Andhra_Pradesh",
1 => "http://en.wikipedia.org/wiki/Arunachal_Pradesh",
2 => "http://en.wikipedia.org/wiki/Assam",
3 => "http://en.wikipedia.org/wiki/Bihar",
4 => "http://en.wikipedia.org/wiki/Chhattisgarh",
5 => "http://en.wikipedia.org/wiki/Goa",
for($ix=0;$ix<=9;$ix++){
$content = file_get_html($test[$ix]);
$tables = $content ->find('#mw-content-text table',0);
foreach ($tables ->children() as $child1) {
foreach($child1->find('th a') as $ele){
if($ele->innertext=="Literacy"){
foreach($child1->find('td') as $ele1){
echo $ele1->innertext;
}}} }}
Please tell me where I am going wrong. Is there a memory problem? Is there some XAMPP configuration I need to change?
<?php
require 'simple_html_dom.php';
$test = array(
0 => "http://en.wikipedia.org/wiki/Andhra_Pradesh",
1 => "http://en.wikipedia.org/wiki/Arunachal_Pradesh",
2 => "http://en.wikipedia.org/wiki/Assam",
3 => "http://en.wikipedia.org/wiki/Bihar",
4 => "http://en.wikipedia.org/wiki/Chhattisgarh",
5 => "http://en.wikipedia.org/wiki/Goa");
for($ix=0;$ix<count($test);$ix++){
$content = file_get_html($test[$ix]);
$tables = $content ->find('#mw-content-text table',0);
foreach ($tables ->children() as $child1) {
foreach($child1->find('th a') as $ele){
if($ele->innertext=="Literacy"){
foreach($child1->find('td') as $ele1){
echo $ele1->innertext;
}
}
}
}
$content->clear();
}
?>
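Note the $content->clear() call, which frees each parsed document before the next one is loaded. Even so, this is a lot of URLs to fetch in a single request: you may get a fatal "maximum execution time exceeded" error, or you may get error 324.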

PHP ob_start skeleton only working first time

I have browsed around the suggested titles and found some answers but nothing that really worked, and so I turn to you...
I have a function that uses ob_start() to call a file from which the contents are used as a skeleton. Once the contents have been retrieved I use ob_end_clean().
I only seem to be getting output from the first time I call the function and nothing more afterwards. I have included a code dump in case I am doing something wrong.
I have also included a sample of what is returned from my database call ($dl->select ...)
I have also made sure that data is indeed being passed back from the database where I am expecting it to.
Array
(
[0] => Array
(
[rev_id] => 7
[rev_temp_tree_id] => 2
[rev_tree_id] =>
[rev_status] =>
[rev_last_updated_by] => 0
[rev_authorized_by] =>
[rev_date_updated] => 1334600174
[rev_date_reviewed] =>
[rev_update_type] => 1
[temp_tree_id] => 2
[temp_tree_bag_size] => 250
[temp_tree_botanical_id] =>
[temp_tree_stem_size] => 0
[temp_tree_crown] => 0
[temp_tree_price] => 0
[temp_tree_height] => 0
[temp_tree_plant_date] => 0
[temp_tree_review_id] =>
[temp_tree_comments] =>
[temp_tree_marked_move] => 0
[temp_tree_initial_location] =>
[temp_tree_coord] =>
[temp_tree_name] => TEST
[temp_tree_sale_status] => 0
[temp_tree_open_ground] =>
[temp_tree_block] => 0
[temp_tree_row] => 0
)
)
and the code...
<?php
function print_trees($trees){
$return = '';
ob_start();
include_once('skeletons/tree.html');
$tree_skeleton= ob_get_contents();
ob_end_clean();
$search=array("[tree_id]", "[tree_name]", "[Classes]", "[rev_id]");
foreach($trees as $t){
$replace=array($t['temp_tree_id'], $t['temp_tree_name'].' ['.$t['temp_tree_id'].']', 'temp_tree', $t['rev_id']);
$return.=str_replace($search,$replace,$tree_skeleton);
}
return $return;
}
switch ($_GET['mode']){
case 'trees' :
$db_status = '';
$new_trees = $dl->select('tree_review AS tr LEFT JOIN temp_tree_trees AS tt ON tr.rev_temp_tree_id=tt.temp_tree_id', 'tr.rev_update_type="1"');
echo '<h2>New Trees</h2>';
if($dl->totalrows>0){
echo print_trees($new_trees);
}
else{
echo 'no new trees for review';
}
echo '<br /><br />';
$new_trees = $dl->select('tree_review AS tr LEFT JOIN temp_tree_trees AS tt ON tr.rev_tree_id=tt.temp_tree_id', 'tr.rev_update_type="2"');
echo '<h2>Updated Trees</h2>';
if($dl->totalrows>0){
echo print_trees($new_trees);
}
else{
echo 'no update trees for review';
}
echo '<br /><br />';
$new_trees = $dl->select('tree_review AS tr LEFT JOIN temp_tree_trees AS tt ON tr.rev_tree_id=tt.temp_tree_id', 'tr.rev_update_type="3"');
echo '<h2>Moved Trees</h2>';
if($dl->totalrows>0){
echo print_trees($new_trees);
}
else{
echo 'no moved trees for review';
}
echo '<br /><br />';
$new_trees = $dl->select('tree_review AS tr LEFT JOIN temp_tree_trees AS tt ON tr.rev_tree_id=tt.temp_tree_id', 'tr.rev_update_type="4"');
echo '<h2>Duplicated Trees</h2>';
if($dl->totalrows>0){
echo print_trees($new_trees);
}
else{
echo 'no duplicated trees for review';
}
break;
}
?>
Any help would be appreciated.
Thanks in advance.
I believe it could be one of two things:
You might be having trouble with the fact that you're representing a multi-dimensional array as a string in the HTML file, and then attempting to operate on that with a string replace. You might be better off representing the file as a data structure (XML, JSON) and parsing apart that way - this will let you skip output buffering entirely.
Alternately, I'm not sure if $new_trees is an array of objects or something else. If it's an array of objects, the foreach() loop isn't going to work correctly i.e. it should be $t->temp_tree_id vs. $t['temp_tree_id']
thanks for your comments.
I found out what the problem was, it was rather stupid actually. The file that I am importing as a skeleton is included using include_once, so once I try to call it again it won't let me.
@minitech I updated my code as you suggested, thanks.
@gadhra the content in the skeleton file is plain HTML and I'm using keywords to replace the content from the db into the HTML. I should have attached the HTML along with my code.
Thanks again :)
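The include_once behaviour described above is easy to reproduce in isolation. A minimal sketch, assuming a temporary file as a stand-in for skeletons/tree.html; the two loader functions are hypothetical names:

```php
<?php
// Temporary stand-in for skeletons/tree.html (plain HTML with placeholders).
$skeleton = tempnam(sys_get_temp_dir(), 'tree');
file_put_contents($skeleton, '<li id="[tree_id]">[tree_name]</li>');

function load_with_include_once($file)
{
    ob_start();
    include_once $file; // skipped if this file was already included
    return ob_get_clean();
}

function load_with_include($file)
{
    ob_start();
    include $file;      // runs every time it is called
    return ob_get_clean();
}

$first  = load_with_include_once($skeleton); // the skeleton markup
$second = load_with_include_once($skeleton); // empty string: nothing was output
$third  = load_with_include($skeleton);      // the skeleton markup again

var_dump($first, $second, $third);
unlink($skeleton);
```

Since the skeleton is plain HTML, reading it with file_get_contents() would also work and avoids output buffering altogether.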

trying format json a certain way in PHP using array from mysql

I am trying to build a restful web service for my website. I have a php mysql query using the following code:
function mysql_fetch_rowsarr($result, $taskId, $num, $count){
$got = array();
if(mysql_num_rows($result) == 0)
return $got;
mysql_data_seek($result, 0);
while ($row = mysql_fetch_assoc($result)) {
$got[]=$row;
}
print_r($got);
print_r(json_encode($got));
return $got;
}
which returns the following from the print_r($got) call in the code above
Array ( [0] => Array ( [show] => Blip TV Photoshop Users TV [region] => UK [url] => http://blip.tv/photoshop-user-tv/rss [resourceType] => RSS / Atom feed [plugin] => Blip TV ) [1] => Array ( [show] => TV Highlights [region] => UK [url] => http://feeds.bbc.co.uk/iplayer/highlights/tv [resourceType] => RSS / Atom feed [plugin] => iPlayer (UK) ) )
Here is the json it returns:
[{"show":"Blip TV Photoshop Users TV","region":"UK","url":"http:\/\/blip.tv\/photoshop-user-tv\/rss","resourceType":"RSS \/ Atom feed","plugin":"Blip TV"},{"show":"TV Highlights","region":"UK","url":"http:\/\/feeds.bbc.co.uk\/iplayer\/highlights\/tv","resourceType":"RSS \/ Atom feed","plugin":"iPlayer (UK)"}]
I am using the following code to add some items to the array then convert it to json and return the json.
$got=array(array("resource"=>$taskId,"requestedSize"=>$num,"totalSize"=>$count,"items"),$got);
using the following code to convert it to json and return it.
$response->body = json_encode($result);
return $response;
this gives me the following json.
[{"resource":"video","requestedSize":2,"totalSize":61,"0":"items"},[{"show":"Blip TV Photoshop Users TV","region":"UK","url":"http:\/\/blip.tv\/photoshop-user-tv\/rss","resourceType":"RSS \/ Atom feed","plugin":"Blip TV"},{"show":"TV Highlights","region":"UK","url":"http:\/\/feeds.bbc.co.uk\/iplayer\/highlights\/tv","resourceType":"RSS \/ Atom feed","plugin":"iPlayer (UK)"}]]
The consumers of the API want the JSON in the following format, and I cannot figure out how to get it to come out this way. I have searched and tried everything I can find and still cannot get it. And I have not even started on the XML formatting.
{"resource":"video", "returnedSize":2, "totalSize":60,"items":[{"show":"Blip TV Photoshop Users TV","region":"UK","url":"http://blip.tv/photoshop-user-tv/rss","resourceType":"RSS / Atom feed","plugin":"Blip TV"},{"show":"TV Highlights","region":"UK", "url":"http://feeds.bbc.co.uk/iplayer/highlights/tv","resourceType":"RSS / Atom feed","plugin":"iPlayer (UK)"}]}
I appreciate any and all help with this. I have set up a copy of the database with read-only access and can give all the source code if that will help. I will warn you that I am just now learning PHP; I learned to program in BASIC and Fortran 77, so the PHP is pretty messy and, I would guess, pretty bloated.
OK, the JSON encoding question above was answered. The API consumers also want the special character "/" not to be escaped, since it is part of a URL. I tried "JSON_UNESCAPED_SLASHES" in the json_encode call and got the following error.
json_encode() expects parameter 2 to be long
Your $result line should look like
$result=array(
"resource"=>$taskId,
"requestedSize"=>$num,
"totalSize"=>$count,
"items" => $got
);
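Put together with the follow-up about slashes, the whole response can be sketched as below. build_response() is a hypothetical helper name and the item is trimmed sample data. JSON_UNESCAPED_SLASHES is available from PHP 5.4 onwards; the "expects parameter 2 to be long" error would be consistent with an older PHP version (or with passing the constant's name as a quoted string), but that is an assumption about the asker's setup.

```php
<?php
/**
 * Hypothetical helper: wrap the fetched rows in the envelope the API consumers expect.
 */
function build_response($taskId, $num, $count, array $got)
{
    return array(
        'resource'      => $taskId,
        'requestedSize' => $num,
        'totalSize'     => $count,
        'items'         => $got, // string key => rows, not a bare "items" value
    );
}

// Trimmed sample row standing in for the database result.
$got = array(
    array(
        'show'   => 'TV Highlights',
        'region' => 'UK',
        'url'    => 'http://feeds.bbc.co.uk/iplayer/highlights/tv',
    ),
);

$result = build_response('video', 1, 61, $got);

// Pass the constant itself (no quotes) as the second argument; requires PHP 5.4+.
$json = json_encode($result, JSON_UNESCAPED_SLASHES);
echo $json, "\n";
```

Because the outer array uses string keys it is encoded as a JSON object, while the sequentially indexed $got becomes a JSON array.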

How to make a PHP XML/RPC call to UPC database

Hi, I am working on a barcode reader project. When I call upcdatabase from my PHP script it gives me errors. I use the PHP example provided by www.upcdatabase.com.
The code is:
<?php error_reporting(E_ALL);
ini_set('display_errors', true);
require_once 'XML/RPC.php';
$rpc_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'; // Set your rpc_key here
$upc='0639382000393';
// Setup the URL of the XML-RPC service
$client = new XML_RPC_Client('/xmlrpc', 'http://www.upcdatabase.com');
$params = array( new XML_RPC_Value( array(
'rpc_key' => new XML_RPC_Value($rpc_key, 'string'),
'upc' => new XML_RPC_Value($upc, 'string'),
), 'struct'));
$msg = new XML_RPC_Message('lookup', $params);
$resp = $client->send($msg);
if (!$resp)
{
echo 'Communication error: ' . $client->errstr;
exit;
}
if(!$resp->faultCode())
{
$val = $resp->value();
$data = XML_RPC_decode($val);
echo "<pre>" . print_r($data, true) . "</pre>";
}else{
echo 'Fault Code: ' . $resp->faultCode() . "\n";
echo 'Fault Reason: ' . $resp->faultString() . "\n";
}
?>
When I look up $upc='0639382000393' on the UPC database website it works fine, but when I run this script in the browser it gives the following error:
Array
(
[status] => fail
[message] => Invalid UPC length
)
Unfortunately, their API appears rather short on documentation.
There are three types of codes the site mentions on the Item Lookup page:
13 digits for an EAN/UCC-13
12 digits for a Type A UPC code, or
8 digits for a Type-E (zero-suppressed) UPC code.
Right after the page mentions those three types, it also says,
Anything other than 8 or 12 digits is not a UPC code!
The 13-digit EAN/UCC-13 is a superset of UPC. It includes valid UPCs, but it has many other values that are not valid UPCs.
From the Wikipedia article on EAN-13:
If the first digit is zero, all digits in the first group of six are encoded using the patterns used for UPC, hence a UPC barcode is also an EAN-13 barcode with the first digit set to zero.
Having said that, when I removed the leading zero from $upc, it worked as expected. Apparently the Item Lookup page has logic to remove the leading zero, while the API does not.
Array
(
[upc] => 639382000393
[pendingUpdates] => 0
[status] => success
[ean] => 0639382000393
[issuerCountryCode] => us
[found] => 1
[description] => The Teenager's Guide to the Real World by BYG Publishing
[message] => Database entry found
[size] => book
[issuerCountry] => United States
[noCacheAfterUTC] => 2011-01-22T14:46:15
[lastModifiedUTC] => 2002-08-23T23:07:36
)
Alternatively, instead of setting the upc param, you can set the original 13-digit value to the ean param and it will also work.
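The two workarounds can be combined in a small helper that picks the parameter from the scanned code's length. This is a sketch based only on the behaviour observed above: lookup_param() is a hypothetical name, and the API may apply rules that are not documented here.

```php
<?php
/**
 * Hypothetical helper: choose the lookup parameter for a scanned code.
 * Returns array(parameter name, value), or null for unsupported lengths.
 */
function lookup_param($code)
{
    $code = preg_replace('/\D/', '', $code); // keep digits only
    $len  = strlen($code);

    if ($len === 13) {
        // An EAN-13 with a leading zero is a 12-digit UPC-A in disguise.
        return $code[0] === '0'
            ? array('upc', substr($code, 1))
            : array('ean', $code);
    }
    if ($len === 12 || $len === 8) {
        return array('upc', $code); // Type A or Type E UPC
    }
    return null; // neither a UPC nor an EAN-13
}

list($name, $value) = lookup_param('0639382000393');
echo $name, ' => ', $value, "\n";
```

The returned pair would replace the hard-coded 'upc' entry in the struct passed to XML_RPC_Message.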
