XPath in PHP: Get all text nodes, except navigation

I’m writing a custom parser/data extractor for some pretty shitty HTML.
Changing the HTML is out of the question.
I will spare you the details of the hoops I’ve had to jump through but I’ve now come pretty close to my original goal. I’m using a combination of DOMDocument getElementsByTagName, regular expression replace (I know, I know...), and XPath queries.
I need to get all the text out of the body of the document. I would like for the navigation to remain a separate entity, at least in the abstract. Here’s what I’m doing now:
$contentnodes = $xpath->query("//body//*[not(self::a)]/text()|//body//ul/li/a");
foreach ($contentnodes as $contentnode) {
    $type = $contentnode->nodeName;
    $content = $contentnode->nodeValue;
    $output[] = array($type, $content);
}
This works, except that of course it treats all of the links on the page differently, and I only want it to do that to the navigation.
What XPath syntax can I use in the first part of that query, before the |, to get all the text nodes under body except those inside ul > li > a?
Please note that I cannot rely on the presence of p tags or h1 tags or anything sensible like that to make educated guesses about content.
Thanks
Update: @hr_117’s answer below works. I’ve also found that you can chain multiple not() predicates like so:
//body//text()[not(parent::a/parent::li/parent::ul)][not(parent::h1)]

You may try something like this:
//body//text()[not(parent::a/parent::li/parent::ul)]|//body//ul/li/a

//body//*[not(self::a/parent::li/parent::ul)]/text()[normalize-space()]|//body//ul/li/a
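For anyone who wants to try the expression end to end, here is a minimal sketch. The sample markup is made up for illustration, and it combines the first expression above with the normalize-space() predicate from the second so that whitespace-only text nodes are skipped.

```php
<?php
// Hypothetical sample markup standing in for the real page.
$html = <<<HTML
<html><body>
<ul><li><a href="/home">Home</a></li><li><a href="/about">About</a></li></ul>
<div>Intro text with an <a href="/x">inline link</a> inside.</div>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // messy HTML would otherwise emit warnings
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Left of the |: every non-blank text node whose parent is NOT a ul > li > a link.
// Right of the |: the navigation links themselves, returned as <a> elements.
$query = '//body//text()[normalize-space()][not(parent::a/parent::li/parent::ul)]'
       . ' | //body//ul/li/a';

$output = [];
foreach ($xpath->query($query) as $node) {
    // nodeName is "#text" for text nodes and "a" for the navigation links.
    $output[] = [$node->nodeName, trim($node->nodeValue)];
}

print_r($output);
```

The inline link inside the div comes back as plain "#text", while the two navigation links come back as "a", which is the separation the question asks for.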

Related

Simple_html_dom: How to remove a 'div' tag with a particular 'class' attribute?

How can I remove all the div tags with the class attribute equal to info in a PHP page? That is, the <div class="info">...</div> tags need to be removed from an HTML string.
I have seen this question, but I want to remove elements with the class equal to info.
I would like to use Simple HTML DOM (rather than any other library) because I am already using it in the program.
There are multiple ways to select elements by class. Check out the manual.
$html->find('div[class=info]');
$html->find('div.info');
$html->find('.info'); // not just divs
Then using this answer, you can remove the elements by setting outertext to be blank. In a loop it might look like this:
foreach ($html->find('div.info') as $node) {
    $node->outertext = '';
}
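If Simple HTML DOM is not available, the same removal can be sketched with PHP's built-in DOM extension instead. This is a different technique from the one above, not the Simple HTML DOM API, and the input string is a made-up example.

```php
<?php
// Hypothetical input string; any HTML would do.
$html = '<div id="wrap"><div class="info">remove me</div><p>keep me</p>'
      . '<div class="box info">remove me too</div></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Match "info" as a whole class token: class="box info" matches,
// class="information" would not.
$nodes = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " info ")]'
);

// Snapshot the list before mutating the tree, then detach each match.
foreach (iterator_to_array($nodes) as $node) {
    $node->parentNode->removeChild($node);
}

// saveHTML() serializes the whole document, including the html/body
// wrappers that loadHTML() adds around a fragment.
$result = $doc->saveHTML();
echo $result;
```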
Edit: this is a good article which seems to cover most aspects of manipulation and common pitfalls (including deleting data).

How to do this string replacement in PHP

I have a string and within that string are some links of the format
<a href="URL">Text</a>
I want to replace that entire section with a different piece of markup
The problem is that while I can get the overall structure of the markup to be replaced, and also the URL, it's not so easy for me to get the "Text". If I knew the entire link, I might do something like
str_replace($each_link, $my_new_markup, $the_original_string);
and iterate through each link, but I can't, because I can't know exactly what $each_link is going to be.
Is there any way to look for something like this? I am thinking it must have something to do with REGEX but I am totally hopeless at it, and I don't even know if that's the right place to start.
[WILDCARD of some kind]
You could look at a class like Simple HTML DOM Parser, which you can use to cycle through elements, search for a specific inner HTML or other attribute, and then change it.
The code would look something like this:
foreach ($html->find('a') as $element) {
    if ($element->innertext == $needle) {
        $element->innertext = $my_new_markup;
    }
}
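The loop above only changes the link text. Since the question asks to replace the whole link while reusing its URL and text, here is a rough sketch of that using PHP's built-in DOMDocument rather than Simple HTML DOM. The input string and the span/data-url replacement markup are invented for illustration.

```php
<?php
// Hypothetical input; the real string would come from your application.
$original = '<p>See <a href="http://example.com/a">first</a> and '
          . '<a href="http://example.com/b">second</a>.</p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($original);
libxml_clear_errors();

// getElementsByTagName() returns a live list, so copy it to an array
// before replacing nodes.
foreach (iterator_to_array($doc->getElementsByTagName('a')) as $link) {
    $url  = $link->getAttribute('href'); // the URL, whatever it is
    $text = $link->textContent;          // the "Text" part, whatever it is

    // Hypothetical replacement markup: <span data-url="...">Text</span>
    $replacement = $doc->createElement('span');
    $replacement->setAttribute('data-url', $url);
    $replacement->appendChild($doc->createTextNode($text));

    $link->parentNode->replaceChild($replacement, $link);
}

$result = $doc->saveHTML();
echo $result;
```

No wildcard is needed here because the parser hands you each link's URL and text directly.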

How can I display and format HTML code within the page?

I want to display HTML code just like what you see here.
<textarea><script id="ff">gdgdgs</script></textarea>
I want it displayed without altering the page, and shown nicely within a box like this.
How is this achieved?
I think the best way is to actually have a look and see how Stack Overflow does it! :)
If you right click on your code box in Chrome and select inspect element, it'll show you the markup for that box. It's so useful to be able to do this, obviously not to rip people off, but learn how other people put websites together, and how they achieve cool effects like code boxes! :)
Interestingly enough though, if you simply right click on the page and go to view source, you'll see something slightly different:
<pre><code>&lt;textarea&gt;&lt;script id="ff"&gt;gdgdgs&lt;/script&gt;&lt;/textarea&gt;
</code></pre>
So we can see here that this is what the markup for that box looks like before the page has loaded and any JavaScript has run. When the page starts to load on the client side, some JavaScript runs which takes the above markup and transforms it to look like the markup you see when you right-click on the code box and inspect it in Chrome. Doing this gives you a real-time view of the HTML on the page:
<pre class="lang-php prettyprint">
<code>
<span class="tag"><textarea><script</span>
<span class="pln"></span>
<span class="atn">id</span>
<span class="pun">=</span>
<span class="atv">"ff"</span>
<span class="tag">></span>
<span class="pln">gdgdgs</span>
<span class="tag"></script></textarea></span>
<span class="pln"><br></span>
</code>
</pre>
So if you have a look, you can see the transformed code uses a pre tag. This basically says that anything inside it should be treated as literal text, or in other words: keep line breaks and spaces where I left them!
As well as using the pre tag to wrap the code, you can also see that they use different CSS classes. This is to achieve the color coding you can see.
They also use a code tag which, as far as I can see, is very similar to pre, only it makes your markup a bit clearer by saying that within this tag you should expect to see code. It's probably semantic more than anything, like the HTML article tag. In most browsers, it'll also change the font for text inside the code tag to monospace, which is a bit more code-like! :)
You can go further into this and see exactly what their CSS classes look like; from this you can start to build a mental picture of how their markup and CSS work together to produce their nice code boxes.
Of course, if you don't want to roll this functionality yourself, you can use someone else's framework to achieve this. SyntaxHighlighter, for example, is widely used and recommended.
With Syntax Highlighter, you simply reference the Syntax Highlighter CSS and javascript, and then only need to wrap your code in one pre tag to get it working, something like below:
<pre class="brush: xml">
<textarea><script id="ff">gdgdgs</script></textarea>
</pre>
It might be worth a look!
Hope this helps! :)
You could use &gt; for > and &lt; for <.
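If the page is generated by PHP, the escaping does not have to be done by hand. A small sketch, assuming the snippet arrives as a PHP string (the variable names are made up):

```php
<?php
// The markup we want to show literally, taken from the question.
$snippet = '<textarea><script id="ff">gdgdgs</script></textarea>';

// htmlspecialchars() turns <, >, & and quotes into HTML entities,
// so the browser displays the markup instead of interpreting it.
$escaped = htmlspecialchars($snippet, ENT_QUOTES, 'UTF-8');

// Wrap it in pre/code so whitespace is kept and a monospace font is used.
$box = '<pre><code>' . $escaped . '</code></pre>';

echo $box, PHP_EOL;
```

Here $escaped contains &lt;textarea&gt;&lt;script id=&quot;ff&quot;&gt;gdgdgs&lt;/script&gt;&lt;/textarea&gt;, which is the kind of entity-encoded text the answers above describe.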
This website here can help you with your particular problem. It converts your tags/HTML/JavaScript to ASCII codes. If you need a function, here it is. It converts each character of the passed string to its zero-padded, three-digit character code. The ASCII code is escaped and treated as text by the browser. You can later use the generated ASCII and add it to the box.
function stringToAscii(s)
{
    var ascii = "";
    for (var i = 0; i < s.length; i++)
    {
        // Zero-pad each character code to three digits
        var c = "" + s.charCodeAt(i);
        while (c.length < 3)
            c = "0" + c;
        ascii += c;
    }
    return ascii;
}
Use the Encoded Version like this:
<textarea>
<script id="ff">
gdgdgs
</script>
</textarea>
Is this what you mean?
<textarea><script id="ff">gdgdgs</script></textarea>
Look up HTML entities.
Yeah, just include it like:
$(document).ready(function () {
    var a = '<textarea><script id="ff">gdgdgs</scrip' + 't></textarea>';
    $("div").css('background', 'red').text(a);
});
I use the <xmp> element.

Regex to alter img attributes in Wordpress Filter

I have developed a custom theme for a photographer client and need to implement lazy loading of the images. The blog is horribly slow because of the number of images he currently has, even when only five posts are shown. To do this I'm using the JAIL jQuery plugin, but I need to modify the image tags for it to work properly: I have to replace the src attribute with a placeholder and set a data-href attribute to the source URL. I can't find a solution that works properly inside a WordPress filter (I'm filtering the_content in the posts). Does anyone know how I could accomplish this?
The standard Stack Overflow cliché for these questions is that you should use a DOM parser. That is actually correct, but not always feasible for output manipulation because of the performance cost.
To accomplish what you want you could try:
$html = preg_replace_callback(
'#(<img\s[^>]*src)="([^"]+)"#',
"callback_img", $html);
Then define a callback like this:
function callback_img($match) {
    list(, $img, $src) = $match;
    return "$img=\"placeholder\" data-href=\"$src\" ";
}
Note that this regex is only workable if all your image links follow this scheme consistently (they all should be using double quotes for example).
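As a rough variation on the same idea, the sketch below also accepts single-quoted src attributes. The function name lazyload_images and the placeholder.gif file name are assumptions for illustration; the WordPress registration is left as a comment so the snippet runs on its own.

```php
<?php
/**
 * Swap each img src for a placeholder and move the real URL to data-href.
 * Hypothetical helper; the placeholder file name is an assumption.
 */
function lazyload_images($html, $placeholder = 'placeholder.gif')
{
    return preg_replace_callback(
        // 1: everything up to and including src=   2: the quote character
        // 3: the original URL, ended by the same quote character (\2)
        '#(<img\b[^>]*?\ssrc=)(["\'])(.*?)\2#i',
        function ($m) use ($placeholder) {
            $q = $m[2];
            return $m[1] . $q . $placeholder . $q
                 . ' data-href=' . $q . $m[3] . $q;
        },
        $html
    );
}

// In a theme's functions.php this would be hooked into the content filter:
// add_filter('the_content', 'lazyload_images');

echo lazyload_images('<img class="x" src="a.jpg" alt="">'), PHP_EOL;
```

Unquoted src values are still not handled, so the caveat about consistent markup applies here too.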

Using PHP PCRE to fetch div content

I'm trying to fetch data from a div (based on its id), using PHP's PCRE. The goal is to fetch the div's contents based on its id, using recursion / depth to get everything inside it. The main problem here is getting the other divs inside the "main div", because the regex stops at the first </div> it finds after the initial <div id="test">.
I've tried many different approaches, and none of them worked. The best solution, in my opinion, is to use the recursion construct (?R), but I never got it to work properly.
Any ideas?
Thanks in advance :D
You'd be much better off using some form of DOM parser - regex really isn't suited to this problem. If all you want is basic HTML dom parsing, something like simplehtmldom would be right up your alley. It's trivial to install (just include a single PHP file) and trivial to use (2-3 lines will do what you need).
include('simple-html-dom.php');
$dom = str_get_html($bunchofhtmlcode);
$testdiv = $dom->find('div#test',0); // 0 for the first occurrence
$testdiv_contents = $testdiv->innertext;
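If installing a library is not an option, roughly the same thing can be sketched with PHP's built-in DOMDocument and DOMXPath. This is a swapped-in technique, not simplehtmldom, and the sample HTML is made up. The parser handles the nesting, so no recursion is needed.

```php
<?php
// Hypothetical input with nested divs inside the target div.
$html = '<div id="test">outer <div>inner <div>deep</div></div> tail</div>'
      . '<div id="other">x</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$div = $xpath->query('//div[@id="test"]')->item(0); // null if not found

// Build the inner HTML by serializing each child of the target div.
$inner = '';
if ($div !== null) {
    foreach ($div->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}

echo $inner, PHP_EOL;
```

The serializer may insert line breaks between block elements, so the output is not guaranteed to match the input byte for byte.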
