I am trying to find a way to search through a page in PHP and replace the names of form elements.
I guess I should explain: I'm doing a job for a friend, and I want to make an easy database updater that is robust and can withstand adding elements without the person knowing much about PHP or databases.
In short, I want to search through a form and replace every name="%name%" with the respective database table key names, so I can use a simple foreach loop to update the table.
So I was looking at the DOMDocument class to open an HTML page and replace every form name inside, in order, with the corresponding table keys, but I wasn't sure if I can open a PHP page with loadHTMLFile or not. And if I could open a PHP page, would opening itself cause an infinite loop? Or would it just parse the HTML as if it were looking at client-side HTML?
Is there any way to do what I want? If not, that's OK, I'll just make it a little less awesome, but I was just wondering.
It's perfectly doable.
DOMDocument is probably the ideal (native) tool for this task, but you'll want to look into the DOMDocument::loadHTML() method rather than loadHTMLFile().
To get the processed PHP page into a string, you can request the page with cURL, file_get_contents() or a similar alternative. This involves making an additional HTTP request and adding specific control logic to avoid an endless loop.
A better alternative might be to use output buffering. Here is a simple example I have at hand of how to replace the contents of the <title> tag:
<?php
ob_start();
echo '<title>Original Title</title>';
/* get and delete current buffer && start a new buffer */
if ((($html = ob_get_clean()) !== false) && (ob_start() === true))
{
echo preg_replace('~<title>([^<]*)</title>~i', '<title>NEW TITLE</title>', $html, 1);
}
?>
I am using preg_replace(), but you shouldn't have any problems adapting it to use DOMDocument nodes. It's also worth noting that the ob_start() call must come before any headers / contents are sent to the browser; this includes session cookies and so on.
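For comparison, a DOMDocument version of the same title replacement might look like this (a sketch on a literal HTML string, not tied to any particular page):

```php
<?php
$html = '<html><head><title>Original Title</title></head><body></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Replace the text of the first <title> element, if one exists.
$titles = $dom->getElementsByTagName('title');
if ($titles->length > 0) {
    $titles->item(0)->nodeValue = 'NEW TITLE';
}

echo $dom->saveHTML();
```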
This should get you going, let me know if you need any more help.
A generic DOMDocument example:
<?php
ob_start(); // This must be the very first thing.
echo '<html>'; // Start of HTML.
echo '...'; // Your inputs and so on.
echo '</html>'; // End of HTML.
// Final processing, the $html variable will hold all output so far.
if ((($html = ob_get_clean()) !== false) && (ob_start() === true))
{
$dom = new DOMDocument();
$dom->loadHTML($html); // load the output HTML
/* your specific search and replace logic goes here */
echo $dom->saveHTML(); // output the replaced HTML
}
?>
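For the specific search-and-replace logic, here is a minimal sketch (the key list and the form markup are hypothetical placeholders standing in for your table keys and buffered output):

```php
<?php
// Hypothetical list of table column names - substitute your real keys.
$keys = array('first_name', 'last_name', 'email');

$dom = new DOMDocument();
// Example markup standing in for the buffered form output.
$dom->loadHTML('<form><input name="%name%"><input name="%name%"><input name="%name%"></form>');

// Rename each input, in document order, to the matching table key.
$i = 0;
foreach ($dom->getElementsByTagName('input') as $input) {
    if (isset($keys[$i])) {
        $input->setAttribute('name', $keys[$i]);
    }
    $i++;
}

echo $dom->saveHTML();
```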
Let's say you want to parse the DOM with PHP. You can easily achieve this using the DOMDocument class.
However, in order to do so, you would need to load some HTML using loadHTML or loadHTMLFile and provide the functions with a string containing HTML (or a file path in the case of loadHTMLFile).
As an example, if you just wanted to get an element with a specific ID (in PHP, not JavaScript) within your own page, what can you do?
If you have PHP code generating the page, you could use the output buffer to generate the page in memory, edit the generated page and then flush it to the browser. You can only change the DOM before the browser gets it.
You could do the following:
ob_start(); // Should be called before any output is generated
// ... PHP code that outputs HTML ...
$generated_html = ob_get_clean(); // Store generated HTML to string
// Load and manipulate HTML
$doc = new DOMDocument();
$doc->loadHTML($generated_html);
// ... Manipulate the generated HTML ...
echo $doc->saveHTML(); // echo the modified HTML
However, since you are generating the HTML, it would make more sense to change whatever you need before it's generated, to reduce processing time.
If you want to change the HTML of a page which is already shown in the browser you'll need another way (such as JS/AJAX) since at that point PHP can't possibly access the DOM.
The getElementById method can be invoked on the DOMDocument instance with an id string to get the element:
$element = $testDOMDocument->getElementById('test-id');
I am grabbing the contents from Google with PHP. How can I search $page for elements with the id of "#lga" and echo out another property? Say #lga is an image; how would I echo out its source?
No, I'm not going to do this with Google; Google is strictly an example and testing page.
<body><img id="lga" src="snail.png" /></body>
I want to find the element with the id "lga" and echo out its source, so with the above code I would want to echo out "snail.png".
This is what I'm using and how I'm storing what I found:
<?php
$url = "https://www.google.com/";
$page = file($url);
foreach($page as $part){
}
?>
You can achieve this using the built-in DOMDocument class. This class allows you to work with HTML in a structured manner rather than parsing plain text yourself, and it's quite versatile:
$dom = new DOMDocument();
$dom->loadHTML($html);
To get the src attribute of the element with the id lga, you could simply use:
$imageSrc = $dom->getElementById('lga')->getAttribute('src');
Note that DOMDocument::loadHTML will generate warnings when it encounters invalid HTML. The method's doc page has a few notes on how to suppress these warnings.
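For example, one common way to silence those warnings is libxml's internal error collection (a sketch, with a deliberately malformed snippet as input):

```php
<?php
$html = '<p>Unclosed paragraph<div>bad nesting</p></div>';

$dom = new DOMDocument();

// Route libxml warnings into an internal buffer instead of PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom->loadHTML($html);

// Optionally inspect the collected warnings, then clear the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore the previous error-handling mode.
libxml_use_internal_errors($previous);
```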
Also, if you have control over the website you are parsing the HTML from, it might be more appropriate to have a dedicated script to serve the information you are after. Unless you need to parse exactly what's on a page as it is served, extracting data from HTML like this could be quite wasteful.
I'm trying to take an existing php file which I've built for a page of my site (blue.php), and grab the parts I really want with some xPath to create a different version of that page (blue-2.php).
I've been successful in pulling in my existing .php file with
$documentSource = file_get_contents("http://mysite.com/blue.php");
I can alter an attribute, and have my changes reflected correctly within blue-2.php, for example:
$xpath->query("//div[@class='walk']");
foreach ($xpath->query("//div[@class='walk']") as $node) {
$source = $node->getAttribute('class');
$node->setAttribute('class', 'run');
}
With my current code, I'm limited to making changes like in the example above. What I really want to be able to do is remove/exclude certain divs and other elements from showing on my new php page (blue-2.php).
By using echo $doc->saveHTML(); at the end of my code, it appears that everything from blue.php is included in blue-2.php's output, when I only want to output certain elements, while excluding others.
So the essence of my question is:
Can I parse an entire page using $documentSource = file_get_contents("http://mysite.com/blue.php");, and pick and choose (include and exclude) which elements show on my new page, with xPath? Or am I limited to only making modifications to the existing code like in my 'div class walk/run' example above?
Thank you for any guidance.
I've tried this, and it just throws errors:
$xpath->query("//img[@src='blue.png']")->remove();
What part of the documentation made you think remove is a method of DOMNodeList? Use DOMNode::removeChild:
foreach ($xpath->query("//img[@src='blue.png']") as $node) {
$node->parentNode->removeChild($node);
}
I would suggest browsing a bit through all the classes & functions of the DOM extension (which is not PHP-only, BTW) to get a feel for what to find where.
On a side note: it would probably be much more resource-efficient to add a switch to your original blue.php that produces the different output, because this solution (an extra HTTP request, plus a full DOM load & manipulation) has a LOT of unneeded overhead compared to that.
I made a simple parser for saving all images per page with Simple HTML DOM and the GetImage class, but I had to make a loop inside the loop in order to go page by page, and I think something is just not optimized in my code, as it is very slow and always times out or exceeds the memory limit. Could someone have a quick look at the code and maybe spot something really stupid that I did?
Here is the code without libraries included...
$pageNumbers = array(); //Array to hold number of pages to parse
$url = 'http://sitename/category/'; //target url
$html = file_get_html($url);
//Detect the paginator class and push each page number into an array, to find out how many pages to parse
foreach($html->find('td.nav .str') as $pn){
array_push($pageNumbers, $pn->innertext);
}
// initializing the get image class
$image = new GetImage;
$image->save_to = $pfolder.'/'; // save to folder, value from post request.
//Start reading pages array and parsing all images per page.
foreach($pageNumbers as $ppp){
$target_url = 'http://sitename.com/category/'.$ppp; //Here i construct a page from an array to parse.
$target_html = file_get_html($target_url); //Reading the page html to find all images inside next.
//Final loop to find and save each image per page.
foreach($target_html->find('img.clipart') as $element) {
$image->source = url_to_absolute($target_url, $element->src);
$get = $image->download('curl'); // download with cURL
echo 'saved'.url_to_absolute($target_url, $element->src).'<br />';
}
}
Thank you.
I suggest making a function to do the actual Simple HTML DOM processing.
I usually use the following 'template'... note the 'clean up memory' section.
Apparently there is a memory leak in PHP 5... at least I read that somewhere.
function scraping_page($iUrl)
{
// create HTML DOM
$html = file_get_html($iUrl);
// get the elements (img tags, in this example)
$aObj = $html->find('img');
// do something with the element objects
// clean up memory (prevent memory leaks in PHP 5)
$html->clear(); // **** very important ****
unset($html); // **** very important ****
return; // also can return something: array, string, whatever
}
Hope that helps.
You are doing quite a lot here; I'm not surprised the script times out. You download multiple web pages, parse them, find images in them, and then download those images... how many pages, and how many images per page? Unless we're talking very small numbers, this is to be expected.
I'm not sure what your question really is, given that, but I'm assuming it's "how do I make this work?". You have a few options; it really depends what this is for. If it's a one-off hack to scrape some sites, ramp up the memory and time limits, maybe chunk the work up a little, and next time write it in something more suitable ;)
If this is something that happens server-side, it should probably be happening asynchronously to user interaction - i.e. rather than the user requesting some page, which has to do all this before returning, this should happen in the background. It wouldn't even have to be PHP, you could have a script running in any language that gets passed things to scrape and does it.
Hello friends, I am running 2 applications on the same server, and I want to import some content of Application-1 into Application-2 and, similarly, some content of Application-2 into Application-1.
Currently this is how I am getting it:
<?php
$app1 = file_get_contents('http://localhost/cms');
file_put_contents('cms.html', $app1);
$app2 = file_get_contents('http://localhost/blog');
file_put_contents('blog.html', $app2);
?>
Like this I am saving these 2 files with the HTML of those applications.
Now let's say I have
a <div id="header"> in cms.html, which I want to import into the <div id="header"> of Application-2's index.php file,
and
a <div id="footer"> in blog.html, which I want in the <div id="footer"> of Application-1's index.php file.
How do I do this?
UPDATE: The two applications are Joomla and WordPress.
My template is a simple one without any JavaScript, so kindly suggest a PHP solution unless it is compulsory to load JavaScript; I want to keep it simple and light.
Please also note that I have just started learning programming, hence it will be very difficult for me to understand anything too technical, so I request simple language. Hope you understand.
If my approach is wrong at the very first place, then please feel free to suggest a better method of doing this task.
Kindly help.
UPDATE 2: Please note my issue is not getting content from the other application. My issue is getting a fragment of the HTML file that is being generated.
If the files reside on the same local server, there is no need to use file_get_contents with an HTTP URL. Doing so performs an HTTP request, which adds unnecessary latency.
Unless you want the files to be processed by their respective applications, you can simply load the contents of the HTML files from their file path, e.g.
$content = file_get_contents('/path/to/application1/html/file.htm');
If those files contain PHP that needs to be evaluated, use
include 'path/to/application1/html/file.htm';
instead. Note that using include will put PHP into HTML mode at the beginning of the target file and resume PHP mode again at the end, so you must enclose any PHP code in the file with the appropriate <?php opening tag.
If you need to capture the output of the include call in a variable, wrap it into
ob_start();
include '…'; // this can also take a URL
$content = ob_get_clean();
Output buffering captures everything the include prints into $content instead of sending it to the browser.
If you need to work on the HTML inside the files, use a proper HTML parser.
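For example, here is a sketch of pulling just the <div id="header"> fragment out of a captured HTML string with DOMDocument (the markup is a made-up stand-in for the real page):

```php
<?php
// Stand-in for the HTML captured from the other application.
$html = '<html><body><div id="header"><h1>Site</h1></div><div id="content">...</div></body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// loadHTML registers id attributes, so getElementById works here.
$header = $dom->getElementById('header');

// Passing a node to saveHTML() serializes only that fragment (PHP >= 5.3.6).
$fragment = ($header !== null) ? $dom->saveHTML($header) : '';
echo $fragment;
```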
Here's what I would suggest to you:
Create header.php and footer.php, each containing the content that you want to have in both of your headers / footers.
Inside your <div id="header">, add a <?php include "header.php" ?> in all your files which you want to share the same header. Likewise, do the same thing for the footer.
Unless you really have to, trying to parse through parts of another page to find the appropriate tags will make what you're trying to do unnecessarily complicated.
-- Edit --
Since you're working with CMSs, there aren't actually files sitting on your system to parse through. The HTML you see in a browser is created at the time someone requests it.
To do what you're describing, you could use php's cURL library to access the page, then parse through the html to pull out the tags you're looking for. However, this method will be extremely slow, since you're basically requiring the user to load 2 pages rather than 1.
You may want to look into the backend for joomla / wordpress - see how the footer is stored in the database, and put in code to access it from the other application.
Well, I guess I found a better way of doing it myself, so I am posting it here as an answer for the sake of others who are looking for a similar solution.
Write a class which will read between the markers and get that bit.
<?php
class className {
private $_content;
private $_Top;
private $_Bottom;
public function __construct(
$url, $markerStartTop = null,
$markerEndTop = null,
$markerStartBottom = null,
$markerEndBottom = null){
$this->_content = file_get_contents($url);
$this->_renderTop($markerStartTop, $markerEndTop);
$this->_renderBottom($markerStartBottom, $markerEndBottom);
}
public function renderTop(){
return $this->_Top;
}
private function _renderTop($markerStart = null, $markerEnd = null){
$markerStart = (is_null($markerStart)) ? '<!-- Start Top Block -->' : (string)$markerStart;
$markerEnd = (is_null($markerEnd)) ? '<!-- End Top Block -->' : (string)$markerEnd;
$TopStart = stripos($this->_content, $markerStart);
$TopEnd = stripos($this->_content, $markerEnd);
$this->_Top = substr($this->_content, $TopStart, $TopEnd-$TopStart);
}
public function renderBottom(){
return $this->_Bottom;
}
private function _renderBottom($markerStart = null, $markerEnd = null){
$markerStart = (is_null($markerStart)) ? '<!-- Start Bottom Block -->' : (string)$markerStart;
$markerEnd = (is_null($markerEnd)) ? '<!-- End Bottom Block -->' : (string)$markerEnd;
$BottomStart = stripos($this->_content, $markerStart);
$BottomEnd = stripos($this->_content, $markerEnd);
$this->_Bottom = substr($this->_content, $BottomStart, $BottomEnd-$BottomStart);
}
}
?>
Call it:
<?php
$parts = new className('url/to/the/application');
echo $parts->renderTop();
?>
This method fetches exactly the fragment of the content that you marked between the desired markers.
This assumes well-formed XHTML and PHP 5:
Obtain the desired HTML via some method, e.g. a cURL request.
Use SimpleXML to parse the XML into a DOM, then XPath to extract the desired nodes.
Emit the extracted nodes as text.
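Those steps might look like this with SimpleXML (the markup and XPath expression are placeholders; step 1 would normally be a cURL request):

```php
<?php
// Step 1: obtain well-formed XHTML (a literal string stands in for a cURL fetch).
$xhtml = '<html><body><div id="header"><p>Hello</p></div></body></html>';

// Step 2: parse it and run an XPath query for the nodes you want.
$xml = simplexml_load_string($xhtml);
$nodes = $xml->xpath('//div[@id="header"]');

// Step 3: emit the extracted nodes as text.
foreach ($nodes as $node) {
    echo $node->asXML();
}
```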