I am using DOMXPath to query nodes in an HTML document whose content I would like to extract.
I have the following HTML document:
<p class="data">
Immediate Text
<br>
Text In Second Line
<br>
E-Mail:
<script>Some Script Tag</script>
<a href="#">
<script>Another Script Tag</script>
Some Link In Third Line
</a>
<br>
Text In Last Line
</p>
I would like to receive the following result:
Immediate Text\r\nText In Second Line\r\nE-Mail: Some Link In Third Line\r\nText In Last Line
So far I have the following PHP code:
#...
libxml_use_internal_errors(true);
$dom = new \DOMDocument();
if(!$dom->loadHTML($html)) {
#...
}
$xpath = new \DOMXPath($dom);
$result = $xpath->query("(//p[@class='data'])[1]/text()[not(parent::script)]");
Problems:
It does not include the child nodes' texts.
It does not include line breaks.
The child axis (the single / in /text()) returns only direct children of the current context node. To get all descendants, use the descendant axis (//) instead.
To get both the text nodes and the <br> elements, use //node() and filter further by node type (to keep text nodes) or by name (to keep elements named br):
(//p[@class='data'])[1]//node()[self::text() or self::br][not(parent::script)]
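As a rough sketch of how the nodes returned by that expression could be assembled into the requested string, the snippet below starts a new line at every <br> and joins the trimmed text nodes in between with a single space. The variable names ($lines, $result) and the space-joining rule are assumptions made for illustration, not part of the original answer.

```php
<?php
// Sample document copied from the question.
$html = <<<'HTML'
<p class="data">
Immediate Text
<br>
Text In Second Line
<br>
E-Mail:
<script>Some Script Tag</script>
<a href="#">
<script>Another Script Tag</script>
Some Link In Third Line
</a>
<br>
Text In Last Line
</p>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// All descendant text nodes and <br> elements, skipping script contents.
$nodes = $xpath->query(
    "(//p[@class='data'])[1]//node()[self::text() or self::br][not(parent::script)]"
);

$lines = [''];
foreach ($nodes as $node) {
    if ($node instanceof DOMElement) {
        // The only elements the query can return are <br>: start a new line.
        $lines[] = '';
        continue;
    }
    $text = trim($node->nodeValue);
    if ($text === '') {
        continue; // whitespace-only text node
    }
    $last = count($lines) - 1;
    $lines[$last] = $lines[$last] === '' ? $text : $lines[$last] . ' ' . $text;
}

$result = implode("\r\n", $lines);
echo $result;
```

Trimming each text node is a deliberate simplification: it discards the line feeds that only exist because of how the source HTML is laid out.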
I want to target a tags with class genre within parent div with id test:
<div id="test">
<a class="genre">hello</a>
<a class="genre">hello2</a>
</div>
So far, I can get all the genre a tags:
$xpath = new DOMXPath($doc);
$elements = $xpath->query('//a[@class="genre"]');
... but I want to adjust //a[@class="genre"] so I only target the ones within the test div.
I don't understand why you did not write it yourself, because your expression already uses all the XPath elements you need. Or maybe I have misunderstood your question.
$elements = $xpath->query('//div[@id="test"]/a[@class="genre"]');
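For illustration, here is a small self-contained run of that expression. The second div (id "other") and its link are made-up sample data, added only to show that links outside the test div are not matched.

```php
<?php
// The "other" div is hypothetical sample data, not from the question.
$html = '<div id="test"><a class="genre">hello</a><a class="genre">hello2</a></div>'
      . '<div id="other"><a class="genre">skip me</a></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Only <a class="genre"> elements that are direct children of div#test.
$elements = $xpath->query('//div[@id="test"]/a[@class="genre"]');

foreach ($elements as $a) {
    echo $a->nodeValue, PHP_EOL;
}
```

If the links may be nested deeper inside the div, //div[@id="test"]//a[@class="genre"] would match them at any depth.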
I need to loop through a bunch of HTML code and remove the <a> </a> tags from all links which DON'T include the data attribute data-link="keepLink"
Here is an example of body value I need to modify:
<p><a data-link=\"keepLink\" href=\"[1|9999|16|191967|256]\">Daily Racing Link</a></p>\r\n<br>\n <strong>OFFER – Get up to a £400 deposit bonus when you sign up with Fanduel.</strong>
After the modification I need it to look like (so the offer link is removed):
<p><a data-link=\"keepLink\" href=\"[1|9999|16|191967|256]\">Daily Racing Link</a></p>\r\n<br>\n <strong>OFFER – Get up to a £400 deposit bonus when you sign up with Fanduel.</strong>
So far I have managed to remove the opening tag of any link that doesn't include a data-link="keepLink" attribute. But the closing </a> is still present.
Here is the regex I have used:
$result["body_value"] = preg_replace('/<a (?![^>]*data-link="keepLink").*?>/i', '', $result["body_value"]);
So the new body value looks like:
<p><a data-link=\"keepLink\" href=\"[1|9999|16|191967|256]\">Daily Racing Link</a></p>\r\n<br>\n <strong>OFFER – Get up to a £400 deposit bonus when you sign up with Fanduel</a>.</strong>
The DOMDocument extension is available by default in PHP. It is presumably faster and is designed exactly for what you are trying to achieve. You can use it to load your document and search for any links whose data-link attribute is not set to keepLink, like this:
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.example.com'); // load the file
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[not(@data-link=\'keepLink\')]'); // search for links that do not have the 'data-link' attribute set to 'keepLink'
foreach($nodes as $element){
$textInside = $element->nodeValue; // get the text inside the link
$parentNode = $element->parentNode; // save parent node
$parentNode->replaceChild(new DOMText($textInside), $element); // remove the element
}
$myNewHTML = $dom->saveHTML(); // see http://php.net/manual/ro/domdocument.savehtml.php for limitations such as auto-adding of doc-type
echo $myNewHTML;
Proof of concept: https://3v4l.org/ejatQ.
Please bear in mind that this will take only the text values inside the elements without a data-link='keepLink' attribute value.
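If markup inside the removed links has to survive, one possible variation is to move each link's child nodes in front of it and then drop the empty <a>. This is only a sketch: the sample HTML (including the <em> element and the #keep / #offer targets) is invented for the example.

```php
<?php
// Hypothetical sample markup; the second link contains nested markup.
$html = '<p><a data-link="keepLink" href="#keep">Daily Racing Link</a></p>'
      . '<strong><a href="#offer">OFFER <em>now</em></a>.</strong>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Copy the result into a plain array so removing nodes cannot disturb iteration.
$links = iterator_to_array($xpath->query("//a[not(@data-link='keepLink')]"));

foreach ($links as $a) {
    // Move every child (text and elements alike) in front of the link...
    while ($a->firstChild !== null) {
        $a->parentNode->insertBefore($a->firstChild, $a);
    }
    // ...then remove the now-empty link itself.
    $a->parentNode->removeChild($a);
}

echo $dom->saveHTML();
```

Compared with replacing the link by its nodeValue, this keeps elements such as <em> or <strong> that were nested inside the link.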
If you are set on regex and don't want to use a parser, try this:
<a (?!data-link=)[^>]*>((?!<\/a>).*?)<\/a>
Replace each match with $1 to keep your link text.
See https://regex101.com/r/wKQk4p/2
Please say if you need any further explanation.
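A hedged sketch of how such a replacement might look with preg_replace. Note that the pattern below is a variation, not the one from the answer: it borrows the lookahead from the question so that data-link="keepLink" is detected anywhere inside the opening tag, not only as the first attribute. The sample input is invented.

```php
<?php
// Hypothetical sample input with one link to keep and one to unwrap.
$body = '<p><a data-link="keepLink" href="#k">Keep</a></p> '
      . '<strong><a href="#o">Offer</a>.</strong>';

// <a ...> without data-link="keepLink" anywhere in the tag, lazily matched up to
// the nearest </a>; group 1 captures the link text.
$pattern = '~<a (?![^>]*data-link="keepLink")[^>]*>(.*?)</a>~is';

$result = preg_replace($pattern, '$1', $body);
echo $result;
```

Like any regex approach to HTML, this assumes well-behaved markup (double-quoted attributes, no nested links, no > inside attribute values).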
I am attempting to get various elements inside of an li as shown below. I am pretty new to this so I may not be using the most efficient methods but this is where I have started...
EXAMPLE CODE SIMPLIFIED....
<li id='entry_0' title='09879879'>
<div ....>
<h2> The title text would go here </h2>
<span class='entrySize' ....> 20oz </span>
<span class='entryPrice' ....> $32.09 </span>
<span class='anotherEntry' ....> More Data I need To Grab </span>
.......
</div>
</li>
<li> .... With same structure as above .... 100's of entries like this </li>
I know how to pull individual parts separately but having trouble grasping how to do it grouped within a portion of the html.
$filename = "directory/file.html";
$html = file_get_html($filename);
for($i=0; $i<=count(entryNumber);$i++)
{
$li_id = "entry_".$i;
foreach($html->find('li[id='.$li_id.']') as $li) {
echo $li->innertext;
}
}
So this gets me the content of the list item tag with the id number as the unique attribute. I would like to grab the h2 text, entrySize, entryPrice etc. as I iterate through the list item tags. What I don't understand is, once I have the list item's content, how I can parse its inner tags and attributes. There may be other parts of the full HTML document that have tags with the same id or class as these, so I am breaking the document down into portions and then looking to parse one section at a time.
I would also like to pull the title attribute out of the li tag.
I hope my explanation makes sense.
You should probably use a DOM parser. PHP comes bundled with one, and there are many others you could use.
http://php.net/dom
PHP Simple HTML DOM Parser
<?php
$html = file_get_contents($page); // $page is the path or URL of the document
$doc = new DOMDocument();
$doc->loadHTML($html);
// now find what you need
$items = $doc->getElementsByTagName('li');
foreach ($items as $item) {
    $id = $item->getAttribute('id');
    if (strpos($id, 'entry_') !== false) {
        // found matching li, grab its children
    }
}
Use this as a baseline; we can't write all the code for you. Check out the PHP docs to finish this :) From what I have so far, you need to follow the docs to make it grab the child values and handle them.
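One way the "grab its children" step could be filled in is with DOMXPath queries evaluated relative to each li, so that same-named classes elsewhere in the document are ignored. This is a sketch under assumptions: the surrounding <ul>, the second entry, and the $rows array layout are invented for the example.

```php
<?php
// Sample data modelled on the question; the second <li> is made up.
$html = <<<'HTML'
<ul>
<li id='entry_0' title='09879879'>
<div>
<h2> The title text would go here </h2>
<span class='entrySize'> 20oz </span>
<span class='entryPrice'> $32.09 </span>
</div>
</li>
<li id='entry_1' title='12345678'>
<div>
<h2> Another title </h2>
<span class='entrySize'> 12oz </span>
<span class='entryPrice'> $9.99 </span>
</div>
</li>
</ul>
HTML;

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = [];
foreach ($xpath->query("//li[starts-with(@id, 'entry_')]") as $li) {
    // The second argument makes each expression relative to the current <li>.
    $rows[] = [
        'id'    => $li->getAttribute('id'),
        'title' => $li->getAttribute('title'),
        'name'  => trim($xpath->evaluate('string(.//h2)', $li)),
        'size'  => trim($xpath->evaluate("string(.//span[@class='entrySize'])", $li)),
        'price' => trim($xpath->evaluate("string(.//span[@class='entryPrice'])", $li)),
    ];
}

print_r($rows);
```

Because every inner query starts with a dot and is passed the li as context node, there is no need to know how many entries exist or to loop over numeric ids.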
I tried all the solutions posted on this question. Although it is similar to my question, its solutions aren't working for me.
I am trying to get the plain text that is outside of <b> but inside the <div id="maindiv">.
<div id=maindiv>
<b>I don't want this text</b>
I want this text
</div>
$part is the object that contains <div id="maindiv">.
Now I tried this:
$part->find('!b')->innertext;
The code above is not working. When I tried this
$part->plaintext;
it returned all of the plain text like this
I don't want this text I want this text
I read the official documentation, but I didn't find anything to resolve this.
Query:
$selector->query('//div[@id="maindiv"]/text()[2]')
Explanation:
// - selects nodes regardless of their position in the tree
div - selects elements whose node name is 'div'
[@id="maindiv"] - selects only those divs having the attribute id="maindiv"
/ - sets focus to the div element
text() - selects only text nodes
[2] - selects the second text node (the first is whitespace)
Note! The actual position of the text node may depend on
your preserveWhitespace setting.
Manual: http://www.php.net/manual/de/class.domdocument.php#domdocument.props.preservewhitespace
Example:
$html = <<<EOF
<div id="maindiv">
<b>I dont want this text</b>
I want this text
</div>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXpath($doc);
$node = $selector->query('//div[@id="maindiv"]/text()[2]')->item(0);
echo trim($node->nodeValue); // I want this text
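Since the index of the wanted text node can shift with whitespace handling, a possible alternative is to select every direct text child that is not whitespace-only, using normalize-space() as a filter. The sketch below is an assumption about what fits this case, not part of the original answer; $parts and $text are illustrative names.

```php
<?php
$html = <<<'HTML'
<div id="maindiv">
<b>I dont want this text</b>
I want this text
</div>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXPath($doc);

// Direct text children of the div that contain something other than whitespace.
$query = '//div[@id="maindiv"]/text()[normalize-space()]';

$parts = [];
foreach ($selector->query($query) as $textNode) {
    $parts[] = trim($textNode->nodeValue);
}

$text = implode(' ', $parts);
echo $text;
```

Text inside <b> is never selected because text() here only looks at direct children of the div.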
Remove the <b> first:
$part->find('b', 0)->outertext = '';
echo $part->innertext; // I want this text
$imgs = $xpath->query('//img');
$i = 0;
foreach($imgs as $img) {
$before = new DOMText($captions[$i]);
$img->parentNode->insertBefore($before);
$i++;
}
I want to insert some text before each img tag, but I can't make this work. The text is sometimes inserted in the wrong place. How can I solve this?
Try:
$img->parentNode->insertBefore($before, $img);
Adding $img as an argument (a 'reference node') to the insert function explicitly tells insert where in the hierarchy you want to insert the new node. Otherwise the docs simply say "appended to the children", which means the new node will be the last one of the parent's children.
e.g.
<span> <-- parentNode
<b>This is some text before the image</b>
<img ...> <--- $img node
<i>This is some text after the image</i>
</span>
Without the extra argument, you get:
<span>
<b>...</b>
<img ...>
<i>...</i>
new text node here <-- appended to parent Node's children, e.g. last
</span>
WITH the argument, you get:
<span>
<b>...</b>
new text node here <--the "before" position
<img ...>
<i>...</i>
</span>
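A minimal runnable version of the fix, for reference. The sample markup, the file name a.png and the caption text are placeholders, and createTextNode() is used here instead of new DOMText() purely as a matter of preference.

```php
<?php
// Placeholder markup and caption, invented for this example.
$html = '<span><b>Text before the image</b><img src="a.png"><i>Text after the image</i></span>';
$captions = ['Caption for a.png'];

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$i = 0;
foreach ($xpath->query('//img') as $img) {
    $before = $dom->createTextNode($captions[$i]);
    // The second argument is the reference node: insert directly before the image.
    $img->parentNode->insertBefore($before, $img);
    $i++;
}

// Fetch the image again to inspect its neighbours.
$img = $xpath->query('//img')->item(0);
echo $img->previousSibling->nodeValue;
```

With the reference node supplied, the caption lands between the <b> and the <img>, and the <i> stays the last child of the span.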