I created a little script that imports WordPress posts from an XML file:
if (isset($_POST['wiki_import_posted'])) {
    // Get uploaded file
    $file = file_get_contents($_FILES['xml']['tmp_name']);
    $file = str_replace('&', '&', $file);

    // Get and parse XML
    $data = new SimpleXMLElement($file, LIBXML_NOCDATA);

    foreach ($data->RECORD as $key => $item) {
        // Build post array
        $post = array(
            'post_title' => $item->title,
            ........
        );

        // Insert new post
        $id = wp_insert_post($post);
    }
}
The problem is that my XML file is really big, and when I submit the form, the browser just hangs for a couple of minutes.
Is it possible to display some messages during the import, like a dot after each item is imported?
Unfortunately, no, not easily. Especially if you're building this on top of the WP framework, you'll find it's not worth your while at all. When you're interacting with a PHP script, you are sending a request and awaiting a response. However long it takes that PHP script to finish processing and start sending output is usually how long it takes the client to start seeing a response.
There are a few things to consider if what you want is for output to start showing as soon as possible (i.e. as soon as the first echo or output statement is reached).
Turn off output buffering so that output begins sending immediately.
Output whatever you want inside the loop that indicates the progress you want to know about (a sketch follows below).
Note that if you're doing this with an AJAX request, the content may not be immediately available to insert into the DOM via your XMLHttpRequest object. Also note that some browsers (IE, for example) do their own buffering before content is displayed to the user.
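If you do go that route, here is a minimal sketch of those two points applied to the import loop from the question (the ini_set call and the buffer checks are defensive, since this depends on your server configuration):

@ini_set('zlib.output_compression', 'off');
// Flush any open output buffers and echo directly from now on.
while (ob_get_level() > 0) {
    ob_end_flush();
}
ob_implicit_flush(true);

foreach ($data->RECORD as $item) {
    $post = array(
        'post_title' => (string) $item->title,
        // ... the rest of your post fields
    );
    wp_insert_post($post);

    // One dot per imported item, pushed to the browser immediately.
    echo '.';
    flush();
}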
Some suggestions you may want to look into to speed up your script, however:
Why are you doing str_replace('&', '&', $file) on a large file? You realize that has a cost with no benefit, right? You've accomplished nothing, and if you meant to replace the HTML entity &amp; then you probably have some of your logic very wrong. Encoding is something you want to let the XML parser handle.
You can use curl_multi instead of file_get_contents to perform multiple HTTP requests concurrently and save time if you are transferring a lot of files. It will be much faster since it's non-blocking I/O.
You should use DOMDocument instead of SimpleXML, and a DOMXPath query can get you your array much faster than what you're currently doing. It's a much nicer interface than SimpleXML, and I always recommend it over SimpleXML since in most cases SimpleXML makes things incredibly difficult for no good reason. Don't let the name fool you.
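For illustration only, a rough sketch of that approach against the RECORD structure from the question (the title child comes from your code; everything else is assumed):

// Parse the uploaded XML with DOMDocument instead of SimpleXML.
$doc = new DOMDocument();
$doc->loadXML($file, LIBXML_NOCDATA);

// One XPath query fetches every RECORD element in a single pass.
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//RECORD') as $record) {
    $post = array(
        'post_title' => $xpath->evaluate('string(title)', $record),
        // ... the rest of your post fields
    );
    wp_insert_post($post);
}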
Related
Here's my problem:
function parse_xml_to_json($url) {
    $fileContents = file_get_contents($url);
    $simpleXml = simplexml_load_string($fileContents, null, LIBXML_NOCDATA);
    $json = json_encode($simpleXml);
    return $json;
}
$jsonto = parse_xml_to_json('myxmlfile.html');
echo $jsonto;
Essentially I need to take an XML file from an external source and loop through it to display some data nicely.
I created a function that gets the content from the external URL (file_get_contents), then I turn the XML string into an object (I use LIBXML_NOCDATA as a parameter because the file contains CDATA sections), right after that I turn the object into JSON, and as the very last step I echo the result.
So far so good, it works, but I'm wondering whether I can do anything in case the XML file contains a malicious script or something else.
Are simplexml_load_string and then json_encode enough to protect against a malicious script or invalid XML?
Your code is prone to a Denial of Service (DoS) attack.
$fileContents = file_get_contents($url);
This can blow your memory limit, or come close to it while taking a long time (the server you request the data from stalls in the middle after providing a lot of content, and then sends only a few bytes every couple of seconds). Your script will then "hang" while consuming that memory.
If the script can then be triggered by another HTTP request multiple times, this can exhaust your server's resources (the echo statement suggests this is entirely possible).
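As a sketch of one way to limit the exposure, you can cap both the time and the number of bytes you are willing to read (the exact limits below are placeholders):

function parse_xml_to_json($url) {
    // Give up if the remote server stalls for more than 10 seconds.
    $context = stream_context_create(array(
        'http' => array('timeout' => 10),
    ));

    // Read at most ~2 MB: offset 0, the length argument caps the download size.
    $fileContents = file_get_contents($url, false, $context, 0, 2 * 1024 * 1024);
    if ($fileContents === false) {
        return false;
    }

    $simpleXml = simplexml_load_string($fileContents, null, LIBXML_NOCDATA);
    return ($simpleXml === false) ? false : json_encode($simpleXml);
}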
I want to use PHP to simultaneously download data from 2 URLs via simplexml_load_file, but the script must wait until all data is gathered before it goes ahead and processes the rest of the code.
$url1 = "http://www.example.com/api1";
$request1 = simplexml_load_file($url1);
$url2 = 'http://www.example.com/api2';
$request2 = simplexml_load_file("compress.zlib://$url2", NULL, TRUE);
echo 'finished';
I want all the data to be completely downloaded before printing the word finished.
How would you edit the script above to accomplish that?
Fetching URLs directly while opening "files" with functions such as simplexml_load_file is intended as a short-cut for simple cases where you don't need things like non-blocking / asynchronous I/O.
Your script as written will wait for everything to download before printing the word "finished", but it will also wait for the response from http://www.example.com/api1 to finish downloading before starting the request to http://www.example.com/api2.
You will need to break your problem down:
Download the contents of two URLs, in parallel (or more accurately "asynchronously"). Your result will be two strings.
Parse each of those strings using simplexml_load_string.
The most popular HTTP library for PHP is Guzzle, but you should be able to find many alternatives, and guides to writing your own using the built-in cURL functions if you search for terms like "PHP asynchronous HTTP" or "PHP parallel HTTP requests".
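For illustration, a sketch of the hand-rolled cURL route using your two URLs (the CURLOPT_ENCODING line is an assumption based on the compress.zlib:// wrapper in your code):

$urls = array(
    'http://www.example.com/api1',
    'http://www.example.com/api2',
);

// Step 1: download both URLs asynchronously with curl_multi.
$mh = curl_multi_init();
$handles = array();
foreach ($urls as $key => $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_ENCODING, ''); // accept and decode gzip/deflate responses
    curl_multi_add_handle($mh, $ch);
    $handles[$key] = $ch;
}

// Run both transfers at the same time and wait until they are all finished.
do {
    $status = curl_multi_exec($mh, $running);
    if ($running) {
        curl_multi_select($mh);
    }
} while ($running && $status === CURLM_OK);

// Step 2: parse each downloaded string.
$requests = array();
foreach ($handles as $key => $ch) {
    $requests[$key] = simplexml_load_string(curl_multi_getcontent($ch));
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

echo 'finished';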
Before this question gets stamped as a duplicate, I am sorry! I've read ALL the duplicate questions and, if anything, they have confused me even more. So maybe this question is slightly different.
I've written a little JavaScript library that makes AJAX calls and fetches and parses information from the Facebook Graph API.
This enables me to pretty much show all my page statuses on my web page. However, I'm just about to launch, and I have done as much testing as I can.
Still, I'm sure errors will occur, and I've written many error catches, blah blah blah.
What I want to do is save all my errors in an XML file.
So when an error occurs, I want the JavaScript to load the XML file from the server, add the errors, then save the changes.
I know how to load the XML doc using XMLHttpRequest, and I'm sure I can figure out how to modify the XML just by using DOM manipulation.
All I really want to know is: how do I save these changes? Does it save automatically?
Or do I have to "somehow" pass the updated XML version to PHP and get that to save it?
I'm not quite sure how to go about it.
I would use MySQL and PHP, but that means "somehow" passing the error information to PHP and then saving it.
However, I'd much prefer XML, seeing as I'm the only person that will be reading the XML file.
Thanks very much.
Alex
Or do I have to "somehow" pass the updated XML version to PHP and get that to save it?
Yes, you'll want to use an XML HTTP request to send the XML DOM to the server where PHP can save it:
function postXML(xmlDOM, postURL, fileName, folderPath){
    try{
        // Create the XML HTTP request object (standard first, ActiveX fallback for old IE)
        var oXMLReq = window.XMLHttpRequest
            ? new XMLHttpRequest()
            : new ActiveXObject("MSXML2.XMLHTTP.3.0");

        // Issue a synchronous HTTP POST
        oXMLReq.open("POST", postURL, false);

        // Set HTTP request headers
        if(fileName != null){ oXMLReq.setRequestHeader("uploadFileName", fileName); } // What should the file be named when saved on the server?
        if(folderPath != null){ oXMLReq.setRequestHeader("uploadDir", folderPath); } // What folder should the file be saved in on the server?

        // Serialize and send the XML (MSXML exposes .xml, a standard DOM needs XMLSerializer)
        var payload = (xmlDOM.xml !== undefined)
            ? xmlDOM.xml
            : new XMLSerializer().serializeToString(xmlDOM);
        oXMLReq.send(payload);

        return oXMLReq.responseText;
    }catch(e){
        return "postXML failed - check network connection to server";
    }
}
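On the server side, postURL needs to point at a small PHP script that reads the raw request body and writes it to disk. A rough sketch (the file name below is hypothetical, and in real use you should whitelist the target directory and file name instead of trusting the request headers):

<?php
// save-errors.php (hypothetical name): receives the XML posted by postXML()
$xml = file_get_contents('php://input');

// Custom request headers show up as $_SERVER['HTTP_*'] entries.
$fileName = isset($_SERVER['HTTP_UPLOADFILENAME'])
    ? basename($_SERVER['HTTP_UPLOADFILENAME'])
    : 'errors.xml';
$folder = isset($_SERVER['HTTP_UPLOADDIR'])
    ? $_SERVER['HTTP_UPLOADDIR']
    : 'logs';

file_put_contents(rtrim($folder, '/') . '/' . $fileName, $xml);
echo 'saved';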
I made a simple parser that saves all the images per page using Simple HTML DOM and a get-image class, but I had to make a loop inside the loop in order to go page by page, and I think something is just not optimized in my code, as it is very slow and always times out or exceeds the memory limit. Could someone have a quick look at the code? Maybe you'll see something really stupid that I did.
Here is the code without libraries included...
$pageNumbers = array(); // Array to hold the number of pages to parse
$url = 'http://sitename/category/'; // target url
$html = file_get_html($url);

// Detect the paginator class and push each page number into the array
// to find out how many pages to parse.
foreach($html->find('td.nav .str') as $pn){
    array_push($pageNumbers, $pn->innertext);
}

// Initialize the get image class
$image = new GetImage;
$image->save_to = $pfolder.'/'; // save-to folder, value from the POST request

// Read the pages array and parse all images per page.
foreach($pageNumbers as $ppp){
    $target_url = 'http://sitename.com/category/'.$ppp; // Construct the page URL from the array
    $target_html = file_get_html($target_url); // Read the page HTML to find all images inside

    // Final loop to find and save each image per page.
    foreach($target_html->find('img.clipart') as $element) {
        $image->source = url_to_absolute($target_url, $element->src);
        $get = $image->download('curl'); // download via cURL
        echo 'saved '.url_to_absolute($target_url, $element->src).'<br />';
    }
}
Thank you.
I suggest making a function to do the actual simple html dom processing.
I usually use the following 'template'... note the 'clean up memory' section.
Apparently there is a memory leak in PHP 5... at least I read that someplace.
function scraping_page($iUrl)
{
    // create HTML DOM
    $html = file_get_html($iUrl);

    // get text elements
    $aObj = $html->find('img');

    // do something with the element objects

    // clean up memory (prevent memory leaks in PHP 5)
    $html->clear(); // **** very important ****
    unset($html);   // **** very important ****

    return; // also can return something: array, string, whatever
}
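Applied to the loop from the question, the part that matters is freeing each page's DOM before moving on to the next one, roughly like this:

foreach ($pageNumbers as $ppp) {
    $target_url  = 'http://sitename.com/category/' . $ppp;
    $target_html = file_get_html($target_url);

    foreach ($target_html->find('img.clipart') as $element) {
        $image->source = url_to_absolute($target_url, $element->src);
        $image->download('curl');
    }

    // Clean up this page's DOM before loading the next page.
    $target_html->clear();
    unset($target_html);
}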
Hope that helps.
You are doing quite a lot here; I'm not surprised the script times out. You download multiple web pages, parse them, find images in them, and then download those images... How many pages, and how many images per page? Unless we're talking very small numbers, this is to be expected.
I'm not sure what your question really is, given that, but I'm assuming it's "how do I make this work?". You have a few options; it really depends what this is for. If it's a one-off hack to scrape some sites, ramp up the memory and time limits, maybe chunk the work up a little, and next time write it in something more suitable ;)
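For the one-off case, "ramping up the limits" is just a couple of lines at the top of the script (the values are placeholders, pick ones your server can afford):

ini_set('memory_limit', '512M'); // placeholder value
set_time_limit(0);               // 0 = no execution time limit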
If this is something that happens server-side, it should probably be happening asynchronously to user interaction - i.e. rather than the user requesting some page, which has to do all this before returning, this should happen in the background. It wouldn't even have to be PHP, you could have a script running in any language that gets passed things to scrape and does it.
I am using a JSON object in my PHP file, but I don't want the JSON object to be displayed in the source code, as it increases my page size a lot.
This is what I'm doing in PHP:
$json = new Services_JSON();
$arr = array();
$qs = mysql_query("my own query");
while($obj = mysql_fetch_object($qs))
{
    $arr[] = $obj;
}
$total = sizeof($arr);
$jsn_obj = '{"abc":'.$json->encode($arr).',"totalrow":"'.$total.'"}';
and this is the JavaScript:
echo '<script language=\'javascript\'>
var dataref = new Object();
dataref = eval('.$jsn_obj.');
</script>';
but I want to hide the $jsn_obj object's value from my source. How can I do that? Please help!
I'm not sure there's a way around your problem, other than to change your mind about whether it's a problem at all (it's not, really).
You can't use the JSON object in your page if you don't output it. The only other way to get the object would be to make a separate AJAX request for it. If you did it that way, you're still transferring the exact same number of bytes that you would have originally, but now you've added the overhead of an extra HTTP request (which will be larger than it would have been originally, since there are now HTTP headers on the transfer). This way would also be slower on your page load, since you'd have to load the page, then send the AJAX request and run the result.
There are much better ways to manage the size of your pages. JSON is just text, so you should look into a server-side solution to compress your content, like mod_deflate. mod_deflate works beautifully on dynamic PHP output as well as static pages. If you don't have control over your web server, you could use PHP's built-in zlib compression.
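If you do end up using PHP's zlib compression rather than mod_deflate, a minimal sketch is to wrap the script's output in the built-in gzip handler before anything is echoed:

// Compress everything this script outputs, unless zlib compression is already on.
if (!ini_get('zlib.output_compression')) {
    ob_start('ob_gzhandler');
}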
Instead of writing the JSON data directly to the document, you can use an XMLHttpRequest, or a library like jQuery, to load the JSON data at runtime.
It depends largely on your JSON data. If the data you're printing inline in the HTML is huge, you might want to consider using AJAX to load the JSON data. That is assuming you want your page to load faster, even without the data.
If the data isn't that big, try to keep it inline, without making extra HTTP requests. To speed up your page, try using YSlow to see what other areas you could optimize.