Including/excluding content with XPath/DOM in PHP

I'm trying to take an existing PHP file which I've built for a page of my site (blue.php), and grab the parts I really want with some XPath to create a different version of that page (blue-2.php).
I've been successful in pulling in my existing .php file with:
$documentSource = file_get_contents("http://mysite.com/blue.php");
I can alter an attribute, and have my changes reflected correctly within blue-2.php, for example:
$xpath->query("//div[#class='walk']");
foreach ($xpath->query("//div[#class='walk']") as $node) {
$source = $node->getAttribute('class');
$node->setAttribute('class', 'run');
With my current code, I'm limited to making changes like in the example above. What I really want to be able to do is remove/exclude certain divs and other elements from showing on my new php page (blue-2.php).
By using echo $doc->saveHTML(); at the end of my code, it appears that everything from blue.php is included in blue-2.php's output, when I only want to output certain elements, while excluding others.
So the essence of my question is:
Can I parse an entire page using $documentSource = file_get_contents("http://mysite.com/blue.php");, and pick and choose (include and exclude) which elements show on my new page, with XPath? Or am I limited to only making modifications to the existing code, like in my 'div class walk/run' example above?
Thank you for any guidance.

I've tried this, and it just throws errors:
$xpath->query("//img[#src='blue.png']")->remove();
What part of the documentation did make you think remove is a method of DOMNodeList? Use DOMNode::removeChild
foreach($xpath->query("//img[#src='blue.png']") as $node){
$node->parentNode->removeChild($node);
}
I would suggest browsing a bit through all classes & functions from the DOM extension (which is not PHP-only BTW), to get a bit of a feel what to find where.
On a side note: is probably very more resource efficient if you could get a switch in your original blue.php resulting in the different output, because this solution (extra http-request, full DOM load & manipulation) has a LOT of unneeded overhead compared to that.
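Putting the pieces together, here is a minimal sketch of the whole pick-and-choose workflow, using only the URL and XPath expressions already shown above; loadHTML() is silenced with @ here because real-world HTML is rarely perfectly well-formed:
<?php
// Load the rendered page (the extra HTTP request mentioned above).
$documentSource = file_get_contents("http://mysite.com/blue.php");

$doc = new DOMDocument();
@$doc->loadHTML($documentSource);   // suppress warnings from imperfect HTML
$xpath = new DOMXPath($doc);

// Exclude: remove every element you do not want in blue-2.php.
foreach ($xpath->query("//img[@src='blue.png']") as $node) {
    $node->parentNode->removeChild($node);
}

// Modify: keep other elements but change their attributes.
foreach ($xpath->query("//div[@class='walk']") as $node) {
    $node->setAttribute('class', 'run');
}

// Output whatever is left.
echo $doc->saveHTML();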

Related

How to change a specific XML tag value, where the tag has an id, using PHP

I am trying to change a value in an XML file using PHP. I am loading the XML file into an object like this:
if (file_exists('../XML/example.xml')) {
    $example = simplexml_load_file('../XML/example.xml');
} else {
    exit("can't load the file");
}
Then, once it is loaded, I am changing values within tags by assigning them the contents of another variable, like this:
$example->first_section->second_section->third_section->title = $var['data'];
Then, once I've made the necessary changes, the file is saved. So far this process is working well, but I have now hit a stumbling block.
I want to change a value within a particular tag in my XML file, which has an id. In the XML file it looks like this:
<first_section>
    <second_section>
        <third_section id="2">
            <title>Mrs</title>
        </third_section>
    </second_section>
</first_section>
How can I change this value using syntax similar to what I've been using? Doing:
$example->first_section->second_section->third_section id="2" ->title = $var['data']
doesn't work as the syntax is wrong.
I've been scanning through Stack Overflow and all over the net for an example of doing it this way, but have come up empty.
Is it possible to target and change a value in an xml like this, or do I need to change the way I am amending this file?
Thanks.
Some dummy code, as the XML you provided is surely not the original:
$xml = simplexml_load_file('../XML/example.xml');
$section = $xml->xpath("//third_section[@id='2']")[0];
// runs a query on the XML tree
// xpath() always gives back an array, so pick the first one directly
$section["id"] = "3";
// check if it has indeed changed
echo $xml->asXML();
As @Muhammed M. already said, check the SimpleXML documentation for more information. See the corresponding demo on ideone.com.
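Since the question actually asks about changing the <title> text rather than the id attribute, here is the same dummy approach applied to the title; the file path, element names, and $var['data'] are taken from the question:
$xml = simplexml_load_file('../XML/example.xml');

// xpath() returns an array of SimpleXMLElement objects; take the first match.
$matches = $xml->xpath("//third_section[@id='2']");
if (!empty($matches)) {
    // Assigning to the child element replaces its text content.
    $matches[0]->title = $var['data'];
}

// Write the modified document back to the same file.
$xml->asXML('../XML/example.xml');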
Figured it out after much messing around. Thanks to your contributions, I did indeed need to use XPath. However, the reason it wasn't working for me was that I wasn't specifying the entire path to the node I wanted to edit.
For example, after loading the XML file into an object ($xml):
foreach($xml->xpath("/first_section/second_section/third_section[#id='2']") as $entry ) {
$entry->title = "mr";
}
This will work, because the whole path to the node is included in the expression.
But in our examples above, e.g.:
foreach ($xml->xpath("//third_section[@id='2']") as $entry) {
    $entry->title = "mr";
}
This wouldn't work for me, even though it was my understanding that the double // makes it drill down, and I assumed XPath would search the whole XML structure and return the node where id=2. After hours of testing, it appears that isn't the case here. You must include the entire path to the node. As soon as I did that, it worked.
Also, on a side note: $section = $xml->xpath("//third_section[@id='2']")[0]; is incorrect syntax here. You don't need to specify the index "[0]" at the end. Including it flags up Dreamweaver's syntax checker, and ignoring Dreamweaver and uploading anyway broke the code (dereferencing a function's return value with [0] requires PHP 5.4+, which may be why). All you need is:
$section = $xml->xpath(" entire path to node in here [@id='2']");
Thanks for helping and suggesting xpath. It works very well... once you know how to use it.
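Putting it together, a minimal sketch of the full-path approach described above, including writing the change back to the file; the paths and values are the ones from the question:
$example = simplexml_load_file('../XML/example.xml');

// Address the node by its full path, filtered on the id attribute.
foreach ($example->xpath("/first_section/second_section/third_section[@id='2']") as $entry) {
    $entry->title = $var['data'];
}

// Persist the change back to the file.
$example->asXML('../XML/example.xml');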

Get pixel coordinates of HTML/DOM elements using PHP

I am working on a web crawler/site analyzer in PHP. What I need to do is extract some tags from an HTML file and compute some attributes (such as image size, for example). I can easily do this using a DOM parser, but I also need to find the pixel coordinates and size of an HTML/DOM tree element (say I have a div and I need to know which area it covers and at which coordinate it starts). I can define a standard screen resolution, that is not a problem for me, but I need to retrieve the pixel coordinates automatically, using a server-side PHP script (or by calling some Java app from the console or something similar, if needed).
From what I understand, I need a headless browser in PHP that would simulate/render a web page, from which I can retrieve the pixel coordinates I need. Would you recommend an open-source solution for that? Some code snippets would also be useful, so I don't install the solution and then notice it does not provide pixel coordinates.
PS: I see that the people who answered missed the point of the question, which means I did not explain well that I need this solution to work COMPLETELY server-side. Say I use a crawler and it feeds HTML pages to my script. I could launch it from a browser, but also from the console (like 'php myScript.php').
Maybe you can set the coordinates as some kind of metadata inside your tag using JavaScript:
$("element").attr("data-coordinates", $("element").offset().top + "," + $("element").offset().left);
Then you have to request the page with PHP:
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('element');
foreach ($tags as $tag) {
    echo $tag->getAttribute('data-coordinates'); // this will print the coordinates of each tag
}
A headless browser is overkill for what you're trying to achieve. Just use cookies to store whatever you want.
So any time you get some piece of information, such as an X,Y coordinate, scroll position, etc., in JavaScript, simply send it to a PHP script that makes a cookie out of it with some unique string index.
Eventually, you'll have a large array of cookie data that will be directly available to any PHP or javascript file, and you can do anything you'd like with it at that point.
For example, if you wanted to just store stuff in sessions, you could do:
jQuery:
// save whatever you want from javascript
// note: probably better to POST, since we're not getting anything really, just showing quick example
$.get('save-attr.php?attr=xy_coord&value=300,550');
PHP:
// this will be the save-attr.php file
session_start();
$_SESSION[$_GET['attr']] = $_GET['value'];
// now any other script can get this value like so:
$coordinates = $_SESSION['xy_coord'];
// where $coordinates would now equal "300,550"
Simply continue this pattern for whatever you need access to in PHP.
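If you did want actual cookies rather than sessions (as suggested at the start of this answer), the same save-attr.php could use setcookie() instead. A minimal sketch; the attr/value parameter names are the ones from the example above:
<?php
// save-attr.php (cookie variant): store each reported value under its own cookie name.
if (isset($_GET['attr'], $_GET['value'])) {
    setcookie($_GET['attr'], $_GET['value'], time() + 86400, '/'); // keep it for a day
}

// On any later request, PHP can read it back:
$coordinates = isset($_COOKIE['xy_coord']) ? $_COOKIE['xy_coord'] : null; // e.g. "300,550"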

Parsing a WordPress XML file in PHP

I'm migrating a big WordPress page to a custom CMS. I need to extract information from a big (20 MB+) XML file exported from WordPress.
I don't have any experience with XML under PHP and I don't know how to start reading the file.
The WordPress file contains structures like this:
<excerpt:encoded><![CDATA[Encoded text here]]></excerpt:encoded>
and I don't know how to handle this in PHP.
You are probably going to do fine with simplexml:
$xml = simplexml_load_file('big_xml_file.xml');
foreach ($xml->element as $el) {
    echo $el->name;
}
See php.net for more info
Unfortunately, your XML example didn't come through.
PHP5 ships with two extensions for working with XML - DOM and "SimpleXML".
Generally speaking, I recommend looking into SimpleXML first since it's the more accessible library of the two.
For starters, use "simplexml_load_file()" to read an XML file into an object for further processing.
You should also check out the "SimpleXML basic examples page on php.net".
I don't have any experience in XML under PHP
Take a look at simplexml_load_file() or DOMDocument.
<excerpt:encoded><![CDATA[Encoded text here]]></excerpt:encoded>
This should not be a problem for the XML parser. However, you will have a problem with the content exported by WordPress. For example, it can contain WordPress shortcodes, which will come across in their raw format instead of expanded.
Better Approach
Determine if what you are migrating to supports an export from WordPress feature. Many other systems do - Drupal, Joomla, Octopress, etc.
Although Adam is absolutely right, his answer needs a bit more detail. Here's a simple script that should get you going:
$xmlfile = simplexml_load_file('yourxmlfile.xml');
foreach ($xmlfile->channel->item as $item) {
    var_dump($item->xpath('title'));
    var_dump($item->xpath('wp:post_type'));
}
simplexml_load_file() is the way to go for creating an object, but you will also need to use XPath, as WordPress uses namespaces. If I remember correctly, SimpleXML does not handle namespaces well, or at all.
$xml = simplexml_load_file( $file );
$xml->xpath('/rss/channel/wp:category');
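If that xpath() call comes back empty, the wp: prefix may need to be registered on the SimpleXML object first. A minimal sketch, assuming a WXR 1.2 export; check the xmlns:wp attribute in your own export file for the exact namespace URI:
$xml = simplexml_load_file($file);

// Register the WordPress export namespace before querying prefixed nodes.
$xml->registerXPathNamespace('wp', 'http://wordpress.org/export/1.2/');

foreach ($xml->xpath('/rss/channel/wp:category') as $category) {
    var_dump($category->asXML());
}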
I would recommend looking at what WordPress uses for importing the files.
https://github.com/WordPress/WordPress/blob/master/wp-admin/includes/class-wp-importer.php

Is there a simple way to get and manipulate nested <div> tags with PHP

First off, I'm far from awesome with PHP - having only a basic familiarity with it - but I'm looking for a way to manipulate the contents of nested divs with PHP. This is a basic site for a local non-profit food bank that will allow them to post events for their clientele.
For example, the file I want to parse and work with has this structure (consider this the complete file though there may be more than 2 entries at any point in time):
<div class="event">
<div class="eventTitle">title text</div>
<div class="eventContent">event content</div>
</div>
<div class="event">
<div class="eventTitle">title2</div>
<div class="eventContent">event content2</div>
</div>
My thoughts are to parse it (what's the best way?) and build a multidimensional array of all divs with class="event", and the nested contents of each. However, up to this point all my attempts have ended in failure.
The point of this is to allow the user (a non-technical food bank admin) to add, edit, and delete these structures. I have the code working to add the structures, but am uncertain how I would re-open the file at a later date to then edit and/or delete select instances of the "event" divs and their nested contents. It seems like it should be an easy task, but I just can't wrap my head around the search results I have found online.
I have tried some stuff with preg_match(), getElementById(), and getElementsByTagName(). I'd really like to help this organization out, but I'm at the point where I have to defer to my betters for advice on how to solve the task at hand.
Thanks in advance.
To Clarify:
This is for their website, hosted on an external service by a provider that does not allow them to host a DB or provide FTP/SFTP/SSH access to the server for regular maintenance. The plan is to get the site up there once and, from then on, have it maintained via an insecure (no other options at this point) URL.
Can anyone provide sample PHP syntax to parse the above HTML and create a multidimensional array of the div tags? As I mentioned, I have attempted to thumb my way through it, but have been unsuccessful. I know what I need to do, I just get lost in the syntax.
I.e. this is what I've come up with to do this, but it doesn't seem to work, and I don't have a strong enough understanding of PHP to see exactly why it does not.
<?php
$doc = new DOMDocument();
$doc->load('events.php');
$events = array();

foreach ($doc->getElementsByTagName('div') as $node) {
    // looks at each <div> tag and creates an array from the other named tags below // hopefully...
    $edetails = array(
        'title' => $node->getElementsByTagName('eventTitle')->item(0)->nodeValue,
        'desc'  => $node->getElementsByTagName('eventContent')->item(0)->nodeValue
    );
    array_push($events, $edetails);
}

foreach ($events as &$edetails) {
    // walk through the $events array and write out the appropriate information.
    echo $edetails['title'] . "<br>";
    echo $edetails['desc'] . "<br>";
}

print_r($events); // this is currently empty and not being populated
?>
Error:
PHP Warning: DOMDocument::load(): Extra content at the end of the document in /var/www/html/events.php, line: 7 in /var/www/html/test.php on line 4
Looking at this now, I realize this would never work because it is looking for tags named eventTitle and eventContent, not classes. :(
I would use a "database", whether it's an SQLite database or a simple text file (which seems sufficient for your needs), and use PHP scripts to manipulate that file, build the required HTML to manage the text/database file, and display the contents.
That would be a lot easier than using DOM manipulation to add / edit / remove events.
By the way, I would probably look for a sponsor, get a decent hosting provider and use a real database...
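For illustration, a minimal sketch of the SQLite flavour of that suggestion using PDO; the events.sqlite filename, table name, and columns are invented for this example:
<?php
// Open (or create) the events database; requires the pdo_sqlite extension.
$db = new PDO('sqlite:' . __DIR__ . '/events.sqlite');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE IF NOT EXISTS events (id INTEGER PRIMARY KEY, title TEXT, content TEXT)');

// Add an event.
$stmt = $db->prepare('INSERT INTO events (title, content) VALUES (?, ?)');
$stmt->execute(array('title text', 'event content'));

// List events as the same simple HTML structure used above.
foreach ($db->query('SELECT id, title, content FROM events') as $row) {
    echo '<div class="event">';
    echo '<div class="eventTitle">' . htmlspecialchars($row['title']) . '</div>';
    echo '<div class="eventContent">' . htmlspecialchars($row['content']) . '</div>';
    echo '</div>';
}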
If you want to keep using the "php" file you have (which I think is needlessly complex), the reasons your current code fails are:
1) The load() method of DOMDocument is designed for XML and expects a well-formed file. The workaround for this would be either to use the loadHTMLFile() method, or to wrap everything in a parent element.
2) The looping fails because getElementsByTagName() is looking for tags - so the outermost loop gets 6 different divs in your current example (the two parent event divs plus their eventTitle and eventContent children).
3) The inner loops fail of course, as you're again using getElementsByTagName(). Note that the tag names are all still 'div'; what you're really trying to search on is the value of the 'class' attribute. In theory, you could work around this by putting in a lot of logic using things like hasChildNodes() and/or getAttribute() (a rough sketch of that workaround follows this answer).
Alternatively, you could restructure using valid XML, rather than this weird hybrid you're trying to use - if you do that, you could use DOMDocument to write out the file, as well as read it. Probably overkill, unless you're looking to learn how to use the PHP DOM libraries and XML.
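A rough sketch of the getAttribute() workaround from point 3, assuming events.php contains exactly the fragment shown in the question; the fragment is wrapped in a single parent element so DOMDocument gets well-formed input:
<?php
$doc = new DOMDocument();
// Wrap the fragment in one parent element so it parses cleanly.
@$doc->loadHTML('<div id="root">' . file_get_contents('events.php') . '</div>');

$events = array();
foreach ($doc->getElementsByTagName('div') as $node) {
    if ($node->getAttribute('class') !== 'event') {
        continue; // skip the wrapper and the eventTitle/eventContent divs at this level
    }
    $edetails = array('title' => '', 'desc' => '');
    foreach ($node->getElementsByTagName('div') as $child) {
        if ($child->getAttribute('class') === 'eventTitle') {
            $edetails['title'] = $child->nodeValue;
        } elseif ($child->getAttribute('class') === 'eventContent') {
            $edetails['desc'] = $child->nodeValue;
        }
    }
    $events[] = $edetails;
}

print_r($events);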
As others have mentioned, I'd change the format of events.php into something besides a bunch of divs. Since a database isn't an option, I'd probably go for a pipe-delimited file, something like:
title text|event content
title2|event content2
The code to parse this would be much simpler, something along the lines of:
<?php
$events = array();
$filename = 'events.txt';
if (file_exists($filename)) {
    $lines = file($filename);
    foreach ($lines as $line) {
        list($title, $desc) = explode('|', $line);
        $event = array('title' => $title, 'desc' => $desc);
        $events[] = $event; // better way of adding one element to an array than array_push (http://php.net/manual/en/function.array-push.php)
    }
}
print_r($events);
?>
Note that this code reads the whole file into memory, so if they have too many events or super long descriptions, this could get unwieldy, but should work fine for hundreds, even thousands, of events or so.
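Since the question also asks about adding and deleting events, here is a small sketch of writing the same pipe-delimited format back out (reusing the events.txt name from above, and assuming titles and descriptions never contain | or newlines):
<?php
// $events is the array built by the parsing code above.
$events[] = array('title' => 'New event', 'desc' => 'New description'); // add an event
unset($events[0]);                                                      // delete the first event

// Rewrite the whole file from the in-memory array.
$lines = array();
foreach ($events as $event) {
    $lines[] = $event['title'] . '|' . $event['desc'];
}
file_put_contents('events.txt', implode("\n", $lines) . "\n");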

Extract text from a DIV that occurs on multiple pages on a website, then output to .txt?

Just to note from the start, the content is uncopyrighted and I would like to automate the process of acquiring the text for the purpose of a project.
I'd like to extract the text from a particular, recurring DIV (which has its own 'class', in case that makes it easier) sitting on each page of a simply designed website.
There is a single archive page on the site with a list of all of the pages containing the content I would like.
The site is www.zenhabits.net
I imagine this could be achieved with some sort of script, but have no idea where to start.
I appreciate any help.
-Nathan.
This is pretty straightforward.
Firstly, get all the links from this site, and throw them all into an array:
set_time_limit(0);       // this could take a while...
ignore_user_abort(true); // in case the browser times out

$links = array();
$html_output = file_get_contents("http://zenhabits.net/archives/");

# -- Do a preg_match on the HTML, and grab all links:
if (preg_match_all('/<a href=\"http:\/\/zenhabits.net\/(.*)\">/', $html_output, $matches)) {
    # -- Append data to the array
    foreach ($matches[1] as $secLink) {
        $links[] = "http://zenhabits.net/" . $secLink;
    }
}
I tested this for you, and:
//first 3 are returning something weird, but you don't need them - so I shall remove them xD
unset($links[0]);
unset($links[1]);
unset($links[2]);
Now that's all done, time to go through all of THOSE links (in the array $links) and take their content:
$contentFromPage = array();
foreach ($links as $contLink) {
    $html_output_c = file_get_contents($contLink);
    if (preg_match('|<div class=\"post\">(.*)</div>|s', $html_output_c, $c_matches)) {
        # -- Append data to the array
        echo "data found <br>";
        $contentFromPage[] = $c_matches[1];
    } else {
        echo "no content found in: $contLink -- <br><br><br>";
    }
} // end of foreach
I've basically just written a whole crawler script for you..
And now, loop over the content array and do whatever you want with it (here we shall put it into a text file):
// $contentFromPage now contains all of the div class="post" content (in an array) - so do what you want with it
foreach ($contentFromPage as $content) {
    # -- We need a name for each text file --
    $textName = rand() . "_content_" . rand() . ".txt"; // we'll just use some numbers and text

    // define the file path (where you want the txt file to be saved)
    $path = "../"; // we'll just put it in a folder above the script
    $full_path = $path . $textName;

    // now save the file..
    file_put_contents($full_path, $content);
    // and that's it
} // end of foreach
You may also use the Simple HTML DOM Parser script to extract the content. This is a very useful script that I have used for 1.6 years. You can download it from http://simplehtmldom.sourceforge.net/. It is well documented with examples. I hope this helps you solve your problem.
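For comparison, a minimal sketch of the same extraction using Simple HTML DOM, assuming the library has been downloaded as simple_html_dom.php and reusing the $links array and div class="post" from the answer above:
<?php
require_once 'simple_html_dom.php'; // from http://simplehtmldom.sourceforge.net/

// $links is the array of post URLs built by the crawler above.
foreach ($links as $contLink) {
    $html = file_get_html($contLink);            // fetch and parse the remote page
    foreach ($html->find('div.post') as $post) {
        // plaintext strips the tags; use ->innertext to keep the HTML instead.
        file_put_contents(rand() . '_content_' . rand() . '.txt', $post->plaintext);
    }
    $html->clear();                              // free memory between pages
}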
