How to parse PCDATA and child element separately with PHP DOM?

I'm trying to parse the XML of a DTBook, which contains levels (1, 2 and 3) that in turn contain p-tags. I'm doing this with PHP DOM. Link to XML
Inside some of these p-tags there are noteref-tags. I can get hold of those, but the only results I'm able to get have the noteref content appearing either before the p-tag's text or after it. I need the noterefs to appear inside the p-tag; in other words, where they actually are supposed to be.
<p>Special education for the ..... <noteref class="endnote" idref="fn_5"
id="note5">5</noteref>. Interest ..... 19th century <noteref class="endnote"
idref="fn_6" id="note6">6</noteref>.</p>
This is the code I've got for the p-tag now. Before this, I'm looping through the DTBook to get to the p-tag. That works fine.
if($level1->tagName == "p") {
echo "<p>".$level1->nodeValue;
$noterefs = $level1->childNodes;
foreach($noterefs as $noteref) {
if($noteref->nodeType == XML_ELEMENT_NODE) {
echo "<span><b>".$noteref->nodeValue."</b></span>";
}
}
echo "</p><br>";
}
These are the results I get:
Special education for the ..... 5. Interest ..... 19th century 6.56
56Special education for the ..... 5. Interest ..... 19th century 6.
I also want the p-tag to not display what's inside the noteref-tag. That should be done by the noteref-tag (only).
So, does anybody know what could possibly be done to fix these things? It feels like I've both googled and tried almost everything.

DOMNode->nodeValue (which for a DOMElement in PHP is the same as DOMNode->textContent) contains the complete text content of the node itself and all its descendant nodes. Or, to put it a little more simply: it contains the complete content of the node, but with all tags removed.
What you probably want to try is something like the following (untested):
if($level1->tagName == "p") {
echo "<p>";
// loop through all childNodes, not just noteref elements
foreach($level1->childNodes as $childNode) {
// you could also use if() statements here, of course
switch($childNode->nodeType) {
// if it's just text
case XML_TEXT_NODE:
echo $childNode->nodeValue;
break;
// if it's an element
case XML_ELEMENT_NODE:
echo "<span><b>".$childNode->nodeValue."</b></span>";
break;
}
}
echo "</p><br>";
}
Be aware though that this is still rather flimsy. For instance: if any other elements, besides <noteref> elements, show up in the <p> elements, they will also be wrapped in <span><b> elements.
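One way to make this less flimsy is to check the tag name before wrapping. The following is a minimal, self-contained sketch of that idea, not a tested drop-in: the sample string is a shortened version of the paragraph from the question, and the function name renderParagraph is made up for illustration.

```php
<?php
// Shortened sample modelled on the <p> element from the question.
$xml = '<p>Special education <noteref class="endnote" idref="fn_5" id="note5">5</noteref>'
     . '. Interest <noteref class="endnote" idref="fn_6" id="note6">6</noteref>.</p>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$p = $doc->documentElement;

// Hypothetical helper: renders one <p>, keeping each noteref where it occurs.
function renderParagraph(DOMElement $p): string
{
    $out = '<p>';
    foreach ($p->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE || $child->nodeType === XML_CDATA_SECTION_NODE) {
            // Plain character data: output as-is (escaped for HTML).
            $out .= htmlspecialchars($child->nodeValue);
        } elseif ($child->nodeType === XML_ELEMENT_NODE && $child->tagName === 'noteref') {
            // Only noteref elements get the <span><b> wrapper.
            $out .= '<span><b>' . htmlspecialchars($child->nodeValue) . '</b></span>';
        } elseif ($child->nodeType === XML_ELEMENT_NODE) {
            // Any other element: fall back to its text content, unwrapped.
            $out .= htmlspecialchars($child->textContent);
        }
        // Comments and processing instructions are skipped.
    }
    return $out . '</p>';
}

$html = renderParagraph($p);
echo $html, "\n";
// For comparison, nodeValue flattens everything into one string:
echo $p->nodeValue, "\n";
```

The first echo keeps the note numbers in place inside the paragraph, while the second shows why the original code printed them twice: nodeValue of the p element already includes the text of its noteref children.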
Hopefully I've at least given you a clue as to why your result <p> elements showed the contents of the child elements as well.
As a side note: if what you want to achieve is transform the contents of an XML document into HTML or perhaps some other XML structure, it might pay off to look into XSLT. Be aware though that the learning curve could be steep.

Related

Adding a class to all English text in HTML?

The requirement is to add an englishText class around all English words on a page. The problem is similar to this, but the JavaScript solutions won't work for me. I require a PHP example to solve this problem. For example, if you have this:
<p>Hello, 你好</p>
<div>It is me, 你好</div>
<strong>你好, how are you</strong>
Afterwards I need to end with:
<p><span class="englishText">Hello</span>, 你好</p>
<div><span class="englishText">It is me</span>, 你好</div>
<strong>你好, <span class="englishText">how are you</span></strong>
There are more complicated cases, such as:
<strong>你好, TEXT?</strong>
<div>It is me, 你好</div>
This should become:
<strong>你好, <span class="englishText">TEXT?</span></strong>
<div><span class="englishText">It is me</span>, 你好</div>
But I think I can sort out these edge cases once I know how to actually iterate over the document correctly.
I can't use javascript to solve this because:
This needs to work on browsers that don't support javascript
I would prefer to have the classes in place on page load so there isn't any delay in rendering the text in the correct font.
I figured the best way to iterate over the document would be using PHP Simple HTML DOM Parser.
But the problem is that if I try this:
foreach ($html->find('div') as $element)
{
// make changes here
}
My concern is that the following case will cause chaos:
<div>
Hello , 你好
<div>Hello, 你好</div>
</div>
As you can see, it's going to go into the first div and then if I process that node, I will be processing the node within that too.
Any ideas how to get around this and only select the nodes for processing once?
UPDATE
I realise now that what I effectively need is a recursive way to iterate over HTML elements with the ability to change them as I iterate over them.

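You should traverse the siblings; that way you won't get into trouble with such cases.
Something like this:
<?php
foreach ($html->find('div') as $element)
{
// next_sibling() returns a single node (or null), so follow the chain
for ($sibling = $element->next_sibling(); $sibling !== null; $sibling = $sibling->next_sibling()) {
// plaintext is a property, not a method
echo $sibling->plaintext."\n";
}
}
?>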
Or, a much easier way in my opinion:
1. Change every <*> to "\n"."<*>" with preg_replace().
2. Make an array of lines, like $lines = explode("\n",$html_string);
3. Strip the tags from each line:
foreach($lines as $line){
$text = strip_tags($line);
echo $text;
}
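As an alternative to both suggestions, here is a rough sketch that uses PHP's built-in DOM extension instead of Simple HTML DOM (a swapped-in technique, not what the question uses). Selecting the text nodes with XPath visits every piece of text exactly once, however deeply the divs are nested. The sample input is well-formed, so it is loaded as XML; the regular expression is only an assumed definition of "English text" and would need tuning for real pages.

```php
<?php
// Hypothetical input, modelled on the nested-div case from the question.
$src = '<div>Hello , 你好<div>Hello, 你好</div></div>';

$doc = new DOMDocument();
$doc->loadXML($src);        // the sample is well-formed, so the XML parser suffices
$doc->encoding = 'UTF-8';   // ask for UTF-8 output when serialising

$xpath = new DOMXPath($doc);
// Snapshot the text nodes first, because the loop below modifies the tree.
$textNodes = iterator_to_array($xpath->query('//text()'));

// Assumption: "English" means runs of ASCII letters with inner spaces,
// optionally followed by a question mark.
$pattern = '/([A-Za-z](?:[A-Za-z ]*[A-Za-z])?\??)/u';

foreach ($textNodes as $text) {
    // With one capture group, odd indexes hold the matched English runs.
    $parts = preg_split($pattern, $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
    $fragment = $doc->createDocumentFragment();
    foreach ($parts as $i => $part) {
        if ($part === '') {
            continue;
        }
        if ($i % 2 === 1) {
            $span = $doc->createElement('span');
            $span->setAttribute('class', 'englishText');
            $span->appendChild($doc->createTextNode($part));
            $fragment->appendChild($span);
        } else {
            $fragment->appendChild($doc->createTextNode($part));
        }
    }
    // Swap the original text node for the mixed text/span fragment.
    $text->parentNode->replaceChild($fragment, $text);
}

echo $doc->saveXML($doc->documentElement), "\n";
```

For real-world HTML that is not well-formed, loadHTML() would be needed instead of loadXML(), and character encoding then requires extra care.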

How to combine the text node of 2 pieces of extracted data using Goutte/Domcrawler

I've been trying to figure out how to combine two pieces of extracted text into a single result (array). In this case, the title and subtitle of a variety of books.
<td class="item_info">
<span class="item_title">Carrots Like Peas</span>
<em class="item_subtitle">- And Other Fun Facts</em>
</td>
The closest I've been able to get is:
$holds = $crawler->filter('span.item_title,em.item_subtitle');
Which I've managed to output with the following:
$holds->each(function ($node) {
echo '<pre>';
print $node->text();
echo '</pre>';
});
And results in
<pre>Carrots Like Peas</pre>
<pre>- And Other Fun Facts</pre>
Another problem is that not all the books have subtitles, so I need to avoid combining two titles together.
How would I go about combining those two into a single result (or array)?

In my case, I took a roundabout way to get where I wanted to be. I stepped back one level in the DOM to the td tag and grabbed everything and dumped it into the array.
I realized that DomCrawler's documentation had the example code to place the text nodes into an array.
$items_out = $crawler->filter('td.item_info')->each(function (Crawler $node, $i) {
return $node->text();
});
I'd tried to avoid capturing the td because authors were also included in those cells. After even more digging, I was able to strip the authors from the array with the following:
foreach ($items_out as &$items) {
$items = substr($items,0, strpos($items,' - by'));
}
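One caveat with the loop above: strpos() returns false when ' - by' does not occur, and as far as I can tell substr() then receives a length of 0, so an entry without an author would be cut down to an empty string. A small sketch of a guarded version follows; the function name stripAuthor and the sample strings (including the author name) are made up for illustration.

```php
<?php
// Hypothetical helper: cut off the author suffix only if the marker is present.
function stripAuthor(string $item, string $marker = ' - by'): string
{
    $pos = strpos($item, $marker);
    // strpos() returns false when the marker is missing; keep the item unchanged then.
    return $pos === false ? $item : substr($item, 0, $pos);
}

// Invented sample data; the real array comes from the crawler.
$items_out = [
    'Carrots Like Peas - And Other Fun Facts - by Jane Doe',
    'Untitled Notes',
];

$clean = array_map('stripAuthor', $items_out);
print_r($clean);
```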
Just took me five days to get it all sorted out. Now onto the next problem!

As per the Goutte documentation, Goutte utilizes the Symfony DomCrawler component. Information on adding content to a DomCrawler object can be found at Symfony DomCrawler - Adding Content
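For the original goal of pairing each title with its optional subtitle, the idea is to iterate over the table cells and look up both pieces relative to each cell. The sketch below uses PHP's built-in DOMDocument and DOMXPath rather than DomCrawler, so that it runs without extra packages; the second cell (a book without subtitle) is invented sample data.

```php
<?php
// Sample markup: the first cell is from the question, the second is hypothetical.
$html = '<table><tr>'
      . '<td class="item_info"><span class="item_title">Carrots Like Peas</span> '
      . '<em class="item_subtitle">- And Other Fun Facts</em></td>'
      . '<td class="item_info"><span class="item_title">Plain Title</span></td>'
      . '</tr></table>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$books = [];
foreach ($xpath->query('//td[@class="item_info"]') as $td) {
    // Look up title and subtitle relative to the current cell only.
    $title    = $xpath->query('.//span[@class="item_title"]', $td)->item(0);
    $subtitle = $xpath->query('.//em[@class="item_subtitle"]', $td)->item(0);
    $books[] = [
        'title'    => $title ? trim($title->textContent) : '',
        // Not every book has a subtitle, so this may be null.
        'subtitle' => $subtitle ? trim($subtitle->textContent) : null,
    ];
}

print_r($books);
```

Because the lookup is scoped to one cell at a time, a missing subtitle cannot cause two titles to be combined.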

Parsing XML document with PHP using 'foreach' loop

I'm new to PHP, MySQL and XML... and have been trying to wrap my head around classes, objects, arrays and loops. I'm working on a parser that extracts data from an XML file, then stores it in a database. A fun and delightfully frustrating challenge to work on during the Christmas holiday.
Before posting this question I went over the PHP 5.x documentation and W3C, and also searched quite a bit around Stack Overflow.
Here's the code...
> XML:
<alliancedata>
<server>
<name>irrelevant</name>
</server>
<alliances>
<alliance>
<alliance id="101">Knock Out</alliance>
<roles>
<role>
<role id="1">irrelevant</role>
</role>
</roles>
<relationships>
<relationship>
<proposedbyalliance id="102" />
<acceptedbyalliance id="101" />
<relationshiptype id="4">NAP</relationshiptype>
<establishedsince>2014-12-27T18:01:34.130</establishedsince>
</relationship>
<relationship>
<proposedbyalliance id="101" />
<acceptedbyalliance id="103" />
<relationshiptype id="4">NAP</relationshiptype>
<establishedsince>2014-12-27T18:01:34.130</establishedsince>
</relationship>
<relationship>
<proposedbyalliance id="104" />
<acceptedbyalliance id="101" />
<relationshiptype id="4">NAP</relationshiptype>
<establishedsince>2014-12-27T18:01:34.130</establishedsince>
</relationship>
</relationships>
</alliance>
</alliancedata>
> PHP:
$xml = simplexml_load_file($alliances_xml); // $alliances_xml = path to file
// die(var_dump($xml));
// var_dump prints out the entire unparsed xml file.
foreach ($xml->alliances as $alliances) {
// Alliance info
$alliance_id = mysqli_real_escape_string($dbconnect, $alliances->alliance->alliance['id']);
$alliance_name = mysqli_real_escape_string($dbconnect,$alliances->alliance->alliance);
// Diplomacy info
$proposed_by_alliance_id = mysqli_real_escape_string($dbconnect,$alliances->alliance->relationships->relationship->proposedbyalliance['id']);
$accepted_by_alliance_id = mysqli_real_escape_string($dbconnect,$alliances->alliance->relationships->relationship->acceptedbyalliance['id']);
$relationship_type_id = mysqli_real_escape_string($dbconnect,$alliances->alliance->relationships->relationship->relationshiptype['id']);
$established_date = mysqli_real_escape_string($dbconnect,$alliances->alliance->relationships->relationship->establishedsince);
// this is my attempt to echo every result
echo "Alliance ID: <b>$alliance_id</b> <br/>";
echo "Alliance NAME: <b>$alliance_name</b> <br/>";
echo "Diplomacy Proposed: <b>$proposed_by_alliance_id</b> <br/>";
echo "Diplomacy Accepted: <b>$accepted_by_alliance_id</b> <br/>";
echo "Diplomacy Type: <b>$relationship_type_id</b> <br/>";
echo "Date Accepted: <b>$established_date</b> <br/>";
echo "<hr/>";
}
> interpreter output:
Alliance ID: 1
Alliance NAME: Knock Out
Diplomacy Proposed: 102
Diplomacy Accepted: 101
Diplomacy Type: 4
Date Accepted: 2011-10-24T05:08:35.830
I don't understand why the loop simply stops after parsing the first row of data. My best guess is that my code is not telling PHP what to do after the first values are parsed.
Honestly I have no idea how to explain this in words, so here's a visual representation.
First row is interpreted as
--->$alliance_id
--->$alliance_name
--->$proposed_by_alliance_id
--->$accepted_by_alliance_id
--->$relationship_type_id
--->$established_date
then for the next <relationship> subnodes the following happens...
---> ?? _(no data)_
---> ?? _(no data)_
--->$proposed_by_alliance_id
--->$accepted_by_alliance_id
--->$relationship_type_id
--->$established_date
Since I'm not telling PHP to add $alliance_id and $alliance_name to every iteration of the <relationship> subnode, the interpreter simply decides to abort the foreach operation.
As I mentioned above, I'm new to both PHP and Stackoverflow and I really appreciate any help or wisdom you can share. Thank you in advance.

You write that you have trouble debugging your traversal of an XML document with SimpleXML.
The first puzzle you come across is that your foreach iterates only once:
foreach ($xml->alliances as $alliances) {
You find that hard to accept. However, if we take the XML from your question and actually look at how many <alliances> elements the document has, we can see that SimpleXML is doing the right thing here:
there is exactly one (1) <alliances> element inside the document element.
$xml->alliances has one (1) iteration.
$xml->alliances->count() gives int(1)
That this agrees with the XML can easily be verified as well. The commented-out dead code in your question's example suggests that you were using var_dump to see whether or not the XML loads. You don't have to: if simplexml_load_file does not return false, the document was loaded (if you test for a falsy value instead, the document was either not loaded or empty).
So if you want to ensure the document has loaded, just check the return value and throw an exception in case there was a problem.
To check which XML a SimpleXMLElement contains, you shouldn't use var_dump either. Instead, output the XML. As the XML can be quite large at this point, take only the first 256 bytes, for example; that normally gives a good picture:
echo substr($xml->alliances->asXML(), 0, 256), "\n";
<alliances>
<alliance>
<alliance id="1">Harmless?</alliance>
<foundedbyplayerid id="10"/><alliancecapitaltownid id="14646"/>
<allianceticker>H?</allianceticker>
<foundeddatetime>2010-02-25T14:18:07.867</foundeddatetime>
<alliancecapitallastmoved>2012-01-19T17:42
^^^^^^^^^
This directly shows that you're iterating over the element(s) named alliances, which exist only once in the document. That matches your observation that the foreach body runs only once.
With this really basic debugging you can draw the following conclusions:
It is observed that the foreach iterates only once (1).
The foreach has been told to iterate over elements named alliances.
As there is only one (1) iteration, there has to be only one (1) alliances element.
Counting the alliances elements, the result is one.
Therefore it is confirmed that there is only one (1) alliances element.
So obviously you're iterating over the wrong element(s).
This outline of the error finding is rather extensive; it is meant to show you the many points at which you could have improved both your code and its error checking, and especially the places where you can start trouble-shooting. The question remains why you weren't able to spot this already. Another answer here had already pointed out that you were iterating over the wrong element(s). However, it did not spell this out, but only hinted at it in code:
[...] change your for loop from foreach ($xml->alliances->alliance as $alliance) { to foreach ($xml->alliance as $alliance) {
and that's all
Source
Sure it's weak, as this only gives code but doesn't answer any of your (programming) question(s).
After finding the cause, let's cure this step by step
So after finding out that it's the wrong element, it's easy to fix that: iterate over the right elements.
This can be done by applying incremental changes to your code.
First of all the correct element needs to be chosen:
foreach ($xml->alliances->alliance as $alliances) {
This will immediately make your code spit out a lot of errors, many for each iteration. And there are many iterations. So you can already say that this little change moved things in the right direction: instead of one iteration, there are now many more.
But before fixing the mess of newly introduced errors and warnings, first take care of the code you just changed. The next thing is to rename the variable $alliances to $alliance (your editor should support you with that, either by search and replace (often CTRL+R) or by offering a refactoring command named "rename variable" (e.g. SHIFT+F6 in PhpStorm)). Afterwards that line looks like this (the following lines change as well, but I don't show them):
foreach ($xml->alliances->alliance as $alliance) {
And it's still not ready. As $xml->alliances->alliance is a bit bulky, let's move it out and give it a more descriptive variable name: $alliances:
$alliances = $xml->alliances->alliance;
foreach ($alliances as $alliance) {
The next step is to correct an error you made. For some reason that is not clear to me, you pass all data through mysqli_real_escape_string(). Even if you intend to pass the data to a database later on, this is the wrong place to call that function. First extract the data; that function is called later, in preparation for the database insert operation, which is a different part of your application.
I just replaced all occurrences of "mysqli_real_escape_string($dbconnect," with "trim(" so that finally - after proper indentation - the code has changed to this:
$alliances = $xml->alliances->alliance;
foreach ($alliances as $alliance) {
// Alliance info
$alliance_id = trim($alliance->alliance->alliance['id']);
$alliance_name = trim($alliance->alliance->alliance);
// Diplomacy info
$proposed_by_alliance_id = trim($alliance->alliance->relationships->relationship->proposedbyalliance['id']);
$accepted_by_alliance_id = trim($alliance->alliance->relationships->relationship->acceptedbyalliance['id']);
$relationship_type_id = trim($alliance->alliance->relationships->relationship->relationshiptype['id']);
$established_date = trim($alliance->alliance->relationships->relationship->establishedsince);
Thanks to the better named variables it now is pretty visible where the many
Notice: Trying to get property of non-object
warnings come from: the many calls to $alliance->alliance-> are simply redundant. Remember that you originally iterated over the wrong elements; this is the counterpart. Because you started from the wrong elements, you had to repeat the error on every line, otherwise you could not have extracted any data at all. Think about this for a second. It also means that the earlier you verify that the code actually does what you intend, the fewer little problems get introduced.
Good thing here again is that this is easy to fix by replacing all "$alliance->alliance->" with "$alliance->":
$alliances = $xml->alliances->alliance;
foreach ($alliances as $alliance) {
// Alliance info
$alliance_id = trim($alliance->alliance['id']);
$alliance_name = trim($alliance->alliance);
// Diplomacy info
$proposed_by_alliance_id = trim($alliance->relationships->relationship->proposedbyalliance['id']);
$accepted_by_alliance_id = trim($alliance->relationships->relationship->acceptedbyalliance['id']);
$relationship_type_id = trim($alliance->relationships->relationship->relationshiptype['id']);
$established_date = trim($alliance->relationships->relationship->establishedsince);
Running the code again now shows that the iteration works and that obtaining the information from each alliance element works perfectly fine as well. There are still errors, because, as you already say in your question, you wonder not only about the iteration but also about further traversing the relationships:
Alliance ID ......: 1
Alliance NAME ....: Harmless?
Diplomacy Proposed: 454
Diplomacy Accepted: 1
Diplomacy Type ...: 4
Date Accepted ...: 2011-10-24T05:08:35.830
-------------------------------------------------
[4x Notice: Trying to get property of non-object]
Alliance ID ......: 2
Alliance NAME ....: Danger
Diplomacy Proposed:
Diplomacy Accepted:
Diplomacy Type ...:
Date Accepted ...:
-------------------------------------------------
...
The error messages correspond to the following four lines:
$proposed_by_alliance_id = trim($alliance->relationships->relationship->proposedbyalliance['id']);
$accepted_by_alliance_id = trim($alliance->relationships->relationship->acceptedbyalliance['id']);
$relationship_type_id = trim($alliance->relationships->relationship->relationshiptype['id']);
$established_date = trim($alliance->relationships->relationship->establishedsince);
Which means that, again, you need to apply the trouble-shooting steps outlined at the very beginning of my answer, this time to this section of your code.
Here is the code example so far:
$xml = simplexml_load_file($alliances_xml); // $alliances_xml = path to file
if (!$xml) {
throw new UnexpectedValueException(
sprintf("Unable to load XML or it was empty. Filename given was %s", var_export($alliances_xml, true))
);
}
$alliances = $xml->alliances->alliance;
// limit to two iterations for debugging
$alliances = new LimitIterator(new IteratorIterator($alliances), 0, 2);
foreach ($alliances as $alliance) {
// Alliance info
$alliance_id = trim($alliance->alliance['id']);
$alliance_name = trim($alliance->alliance);
// Diplomacy info
$proposed_by_alliance_id = trim($alliance->relationships->relationship->proposedbyalliance['id']);
$accepted_by_alliance_id = trim($alliance->relationships->relationship->acceptedbyalliance['id']);
$relationship_type_id = trim($alliance->relationships->relationship->relationshiptype['id']);
$established_date = trim($alliance->relationships->relationship->establishedsince);
// this is my attempt to echo every result
echo "Alliance ID ......: $alliance_id\n";
echo "Alliance NAME ....: $alliance_name\n";
echo "Diplomacy Proposed: $proposed_by_alliance_id\n";
echo "Diplomacy Accepted: $accepted_by_alliance_id\n";
echo "Diplomacy Type ...: $relationship_type_id\n";
echo "Date Accepted ...: $established_date\n";
echo "-------------------------------------------------\n";
}
Please note that I'm using the command line to execute the PHP code, as it's much faster than going via the browser over a webserver. I also do not need to write HTML just to have nicely formatted output.
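Applying the same reasoning to the remaining notices suggests that the relationship elements need their own inner loop, and that alliances without a relationships element have to be skipped. The following is a minimal sketch under those assumptions: the XML is a trimmed-down copy of the structure from the question, loaded from a string, and the second alliance (without relationships) as well as the array keys are invented for illustration.

```php
<?php
// Trimmed sample following the question's structure; the second alliance is hypothetical.
$xmlString = <<<XML
<alliancedata>
  <alliances>
    <alliance>
      <alliance id="101">Knock Out</alliance>
      <relationships>
        <relationship>
          <proposedbyalliance id="102" />
          <acceptedbyalliance id="101" />
          <relationshiptype id="4">NAP</relationshiptype>
          <establishedsince>2014-12-27T18:01:34.130</establishedsince>
        </relationship>
        <relationship>
          <proposedbyalliance id="101" />
          <acceptedbyalliance id="103" />
          <relationshiptype id="4">NAP</relationshiptype>
          <establishedsince>2014-12-27T18:01:34.130</establishedsince>
        </relationship>
      </relationships>
    </alliance>
    <alliance>
      <alliance id="105">No Relations</alliance>
    </alliance>
  </alliances>
</alliancedata>
XML;

$xml = simplexml_load_string($xmlString);
if ($xml === false) {
    throw new UnexpectedValueException('Unable to load XML.');
}

$rows = [];
foreach ($xml->alliances->alliance as $alliance) {
    $allianceId   = (string) $alliance->alliance['id'];
    $allianceName = trim((string) $alliance->alliance);

    // Not every alliance has relationships; skip those that do not.
    if (!isset($alliance->relationships)) {
        continue;
    }
    // Inner loop: one row per <relationship>, repeating the alliance info.
    foreach ($alliance->relationships->relationship as $relationship) {
        $rows[] = [
            'alliance_id'   => $allianceId,
            'alliance_name' => $allianceName,
            'proposed_by'   => (string) $relationship->proposedbyalliance['id'],
            'accepted_by'   => (string) $relationship->acceptedbyalliance['id'],
            'type_id'       => (string) $relationship->relationshiptype['id'],
            'established'   => (string) $relationship->establishedsince,
        ];
    }
}

print_r($rows);
```

Casting to string turns each SimpleXMLElement into a plain value, which is usually what later code (such as a database insert) expects.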

I made phpfiddle of your code, tested, working.
http://phpfiddle.org/main/code/7agg-si3f
You need to remove
<server>
<name>Epic1</name>
</server>
and add </alliances> to the end, since it's reporting invalid xml
after that change your for loop from foreach ($xml->alliances->alliance as $alliance) {
to foreach ($xml->alliance as $alliance) {
and that's all

How do you access Simple DOM selectors?

I can access some of the 'class' items with a
$ret = $html->find('articleINfo'); and then print the first key of the returned array.
However, there are other tags I need like span=id"firstArticle_0" and I cannot seem to find it.
$ret = $html->find('#span=id[ etc ]');
In some cases something is returned, but it's not an array, or is an array with empty keys.
Unfortunately I cannot use var_dump to see the object, since var_dump produces 1000 pages of unreadable junk. The code looks like this.
<div id="articlething">
<p class="byline">By Lord Byron and Alister Crowley</p>
<p>
<span class="location">GEORGIA MOUNTAINS, Canada</span> |
<span class="timestamp">Fri Apr 29, 2011 11:27am EDT</span>
</p>
</div>
<span id="midPart_0"></span><span class="mainParagraph"><p><span class="midLocation">TUSCALOOSA, Alabama</span> - Who invented cheese? Everyone wants to know. They held a big meeting. Tom Cruise is a scientologist. </p>
</span><span id="midPart_1"></span><p>The president and his family visited Chuck-e-cheese in the morning </p><span id="midPart_2"></span><p>In Russia, 900 people were lost in the balls.</p><span id="midPart_3">

Simple HTML DOM can be used easily to find a span with a specific class.
If you want all spans with class=location then:
// create HTML DOM
$html = file_get_html($iUrl);
// get text elements
$aObj = $html->find('span[class=location]');
Then do something like:
foreach($aObj as $key=>$oValue)
{
echo $key.": ".$oValue->plaintext."<br />";
}
It worked for me using your example; my output was:
label=span, class=location: Found 1
0: GEORGIA MOUNTAINS, Canada
Hope that helps... and please Simple HTML DOM is great for what it does and easy to use once you get the hang of it. Keep trying and you will have a number of examples that you just use over and over again. I've scraped some pretty crazy pages and they get easier and easier.
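If installing a library is not an option, the same kind of lookup, including a span selected by its id, can be sketched with PHP's built-in DOMDocument and DOMXPath. This is an alternative technique, not Simple HTML DOM's API; the markup is a shortened copy of the sample from the question.

```php
<?php
// Shortened copy of the markup from the question.
$markup = '<div id="articlething"><p>'
        . '<span class="location">GEORGIA MOUNTAINS, Canada</span> | '
        . '<span class="timestamp">Fri Apr 29, 2011 11:27am EDT</span>'
        . '</p></div>'
        . '<span id="midPart_0"></span>'
        . '<p>The president and his family visited Chuck-e-cheese in the morning </p>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($markup);
$xpath = new DOMXPath($doc);

// Select by class attribute.
$locations = $xpath->query('//span[@class="location"]');
foreach ($locations as $key => $span) {
    echo $key . ': ' . $span->textContent . "\n";
}

// Select by id attribute.
$midPart = $xpath->query('//span[@id="midPart_0"]');
echo 'midPart_0 found: ' . $midPart->length . "\n";
```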

Try using this. Worked for me very well and extremely easy to use. http://code.google.com/p/phpquery/

The docs on the PHP Simple DOM parser are spotty on deciphering Open Graph meta tags. Here's what seems to work for me:
<?php
// grab the contents of the page
$summary = file_get_html($url);
// Get image possibilities (for example)
$img = array();
// First, if the webpage has an og:image meta tag, it's easy:
if ($summary->find('meta[property=og:image]')) {
foreach ($summary->find('meta[property=og:image]') as $e) {
$img[] = $e->attr['content'];
}
}
?>

Is there a way to optimise finding text items on a page (not regex)

After seeing several threads rubbishing the regexp method of finding a term to match within an HTML document, I've used the Simple HTML DOM PHP parser (http://simplehtmldom.sourceforge.net/) to get the bits of text I'm after, but I want to know if my code is optimal. It feels like I'm looping too many times. Is there a way to optimise the following loop?
//Get the HTML and look at the text nodes
$html = str_get_html($buffer);
//First we match the <body> tag as we don't want to change the <head> items
foreach($html->find('body') as $body) {
//Then we get the text nodes, rather than any HTML
foreach($body->find('text') as $text) {
//Then we match each term
foreach ($terms as $term) {
//Match to the terms within the text nodes
$text->outertext = str_replace($term, '<span class="highlight">'.$term.'</span>', $text->outertext);
}
}
}
For example, would it make a difference to check whether I have any matches before I start the loop?

You don't need the outer foreach loop; there's generally only one body tag in a well-formed document. Instead, just use $body = $html->find('body',0);.
However, since a loop with only a single iteration is essentially equivalent in run time to not looping at all, it probably won't have much performance impact either way. So in reality, you really just have 2 nested loops even in your original code, not 3.
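The innermost loop over the terms can also be folded into a single call. The sketch below builds a term-to-markup map once and applies it with strtr(), which makes one pass over the string and does not re-scan text it has already replaced. It works on a plain string here; in the original code the same call would presumably be applied to $text->outertext. The sample terms and sentence are made up for illustration.

```php
<?php
// Hypothetical search terms and sample text.
$terms = ['cheese', 'balls'];
$sample = 'Who invented cheese? 900 people were lost in the balls.';

// Build the replacement map once, outside any loop over text nodes.
$map = [];
foreach ($terms as $term) {
    $map[$term] = '<span class="highlight">' . $term . '</span>';
}

// strtr() with an array replaces all keys in one pass and never
// touches the markup it has just inserted.
$highlighted = strtr($sample, $map);
echo $highlighted, "\n";
```

Unlike calling str_replace() once per term, this avoids a later term accidentally matching inside a span that an earlier term inserted (for example a term such as "highlight").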

Speaking out of ignorance, does find take arbitrary XPath expressions? If it does, you can fold the two outer loops into one:
foreach($html->find('body/text') as $body) {
...
}
