PHP: walk the HTML DOM and apply a callback to each node's plain text content

Scenario:
I need to apply a php function to the plain text contained inside HTML tags, and show the result, maintaining the original tags (with their original attributes).
Visualize:
Take this:
<p>Some text here pointing to the moon and that's it</p>
Return this:
<p>
phpFunction('Some text here pointing to the ')
phpFunction('moon')
phpFunction(' and that\'s it')
</p>
What I should do:
Use a PHP html parser (instead of using regexp) and iterate over every tag, applying the callback to the node text content.
Problem:
If I have, for example, an <a> tag inside a <p> tag, the text content of the parent <p> tag consists of two separate plain text parts, and the PHP callback should treat them separately.
Question:
How should I approach this in a clean and smooth way?
Thanks for your time, all the best.

In the end, I decided to use regex instead of including an external library.
For the sake of simplicity:
$expectedOutput = preg_replace_callback(
    '/>(.*)</U',
    function ($matches) {
        // $matches[1] is the text captured between ">" and "<"
        return '>' . doStuff($matches[1]) . '<';
    },
    $fromInput
);
This will look for everything between > and <, which is, indeed, what I was looking for.
Of course any suggestion/comment is still welcome.
Peace.
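For comparison, here is a minimal sketch of the parser-based approach the question describes, using PHP's bundled DOM extension rather than an external library. The helper name mapTextNodes and the wrapper id fragment-root are made up for illustration. Each text node is visited on its own, so the text before and after an inner <a> tag reaches the callback as separate pieces.

```php
<?php
// Hypothetical helper: applies $callback to every text node of an HTML fragment.
function mapTextNodes(string $html, callable $callback): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate imperfect markup quietly
    // The XML declaration is a common trick to make libxml treat the input as UTF-8.
    // The wrapper <div> gives the fragment a single root that can be found again.
    $doc->loadHTML('<?xml encoding="utf-8"?><div id="fragment-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);

    // Take a snapshot of the text nodes first, then rewrite each one.
    foreach (iterator_to_array($xpath->query('.//text()', $root)) as $textNode) {
        $textNode->nodeValue = $callback($textNode->nodeValue);
    }

    // Serialize only the children of the wrapper, so the wrapper itself is dropped.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$result = mapTextNodes(
    '<p>Some text here pointing to the <a href="#">moon</a> and that\'s it</p>',
    'strtoupper'
);
echo $result, "\n";
```

The tags and their attributes pass through untouched because only DOMText nodes are modified. Note that the parser may normalize the markup slightly when it is serialized again.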

Related

PHP create hashtag links but NOT inside existent html code

I have this regex to convert hash tags into clickable links:
$text = preg_replace('/(?<!\S)#([0-9a-zA-Z]+)/m', '#$1', $text, $maximum_tags);
The problem is that it also converts if my text contains html tags with styling:
<span style="color:#CCC;">FOO</span>
It would try to make a hashtag out of that #CCC , how should I modify the regex to only work outside of html tags ? E.g in plain text areas.
Making preg_replace respect existing markup takes a lot of lexical effort. You should prefer to handle such content as a DOM (using ext/dom and its DOMDocument class architecture): fetch the DOMText nodes and replace the hashtags inside them with links in the existing DOM.
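A rough sketch of that DOM approach could look like the following. The function name linkHashtags, the wrapper id fragment-root and the /tag/ URL scheme are assumptions for illustration, and the $maximum_tags limit from the question is left out. Only text nodes are inspected, so a value such as #CCC inside a style attribute is never considered.

```php
<?php
// Hypothetical helper: turns #hashtags in text nodes into links.
function linkHashtags(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="utf-8"?><div id="fragment-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);

    // Visit text nodes only, and skip text that already sits inside a link.
    $nodes = iterator_to_array($xpath->query('.//text()[not(ancestor::a)]', $root));
    foreach ($nodes as $text) {
        // Split into plain pieces and hashtags; the capture group keeps the hashtags.
        $parts = preg_split('/((?<!\S)#[0-9a-zA-Z]+)/', $text->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no hashtag in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) { // odd indexes hold the captured hashtags
                $a = $doc->createElement('a');
                $a->setAttribute('href', '/tag/' . substr($part, 1)); // assumed URL scheme
                $a->appendChild($doc->createTextNode($part));
                $fragment->appendChild($a);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $text->parentNode->replaceChild($fragment, $text);
    }

    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$linked = linkHashtags('<span style="color:#CCC;">FOO</span> hello #php and #dom');
echo $linked, "\n";
```

One limitation of this sketch: the (?<!\S) lookbehind only sees the current text node, so a hashtag that directly follows a closing tag is treated as if it started a new word.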

Is there any way to rename or hide only one HTML tag?

Preface: I cannot rename the source tags or edit their IDs. Any changes to the tags must happen after they have been fetched.
What I'm doing: using file_get_contents in PHP, I am requesting data from a remote site. This data is just two <p> tags. I need to hide or rename the second of the two <p> tags.
Is this possible with PHP or jQuery?
What I'm working with:
<p>Hello my name is test</p><p>I like studying geology.</p>
If you need to hide second text, you can do this with Jquery:
$('p:eq(1)').hide();
Jsfiddle
You could try a php string replace
$new_string = str_replace('</p><p>','',file_get_contents('somecontent'));
If you need to do it before the HTML is rendered, you need to parse the contents, remove or replace the second p tag, and build the new content.
Here is a DOM parser Simple HTML DOM Parser
Find similar questions below
How to match second <a> tag in this string
How to add attribute to first P tag using PHP regular expression?
Or you can do it after rendering HTML as rNix suggested.
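If no external parser is wanted, the same thing can be sketched with PHP's built-in DOMDocument. The function name hideSecondParagraph and the wrapper id fragment-root are made up for this example. It hides the second <p> with an inline style, and the commented line shows how to remove it instead.

```php
<?php
// Hypothetical helper: hides the second <p> of a fetched HTML fragment.
function hideSecondParagraph(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML('<?xml encoding="utf-8"?><div id="fragment-root">' . $html . '</div>');
    libxml_clear_errors();

    $paragraphs = $doc->getElementsByTagName('p');
    if ($paragraphs->length > 1) {
        $second = $paragraphs->item(1); // item() is zero-based
        $second->setAttribute('style', 'display:none');
        // To drop the tag entirely instead of hiding it:
        // $second->parentNode->removeChild($second);
    }

    // Serialize the children of the wrapper <div> only.
    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// In the question the input would come from file_get_contents().
$hidden = hideSecondParagraph('<p>Hello my name is test</p><p>I like studying geology.</p>');
echo $hidden, "\n";
```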

Highlight piece of HTML code with matching text

I need to find text in HTML and highlight it without corrupting the HTML that is being highlighted.
Text to Replace:
This is text 2. This is Text 3.
HTML:
This is text 1. <p>
This is <span>Text 2</span>. This <div>is</div> text 3.
</p> This is Text 4.
Desired Output:
This is text 1.<p>
<strong class="highlight">This is <span>Text 2</span>. This <div>is</div> text 3. </strong>
</p> This is Text 4.
EDIT: Sorry, if I was not able to explain properly.
I need to highlight a portion of an HTML document (in PHP or JavaScript) if the string I am searching for matches text in the HTML.
But remember that the text in the document may not be identical to the search string; it may contain some extra HTML.
For example if i am searching for this string "This is text.", it should be matched with "This is text.", "<anyhtmltag>This</anyhtmltag> is text.", "This <anyhtmltag>is</anyhtmltag> text.", "This<anyhtmltag> is text</anyhtmltag>." and so on.
You need to be more specific: do you want to do this server-side (for example, with PHP returning HTML to the browser that already contains the highlighted output) or client-side (for example, with jQuery finding and highlighting something in the HTML returned by the server)?
It seems to me that you asked the question without searching the net first; finding a suitable client-side solution for jQuery took me around ten seconds. The three most relevant search results were on StackExchange and in the jQuery documentation itself.
Find text using jQuery?
Find text string using jQuery?
jQuery .wrap() function description
Here is a very brief example:
<script>
// :contains() matches an element's text content, so the search string must not contain tags
$('div:contains("This is Text 2. This is text 3.")').wrap('<strong class="highlight"></strong>');
</script>
It finds what you are looking for and wraps it in whatever element you pass to .wrap().
This works when the text you want to find is inside some div, which is why I used $('div:contains. If you want to search the whole page, you can use $('*:contains instead.
This is example for jQuery and client-side highlighting. For PHP (server-side) version, do some little searching on either Google or StackOverflow and you'll for sure find many examples.
EDIT: As for your updated question: if you are using a text box to enter what you want to search for, you can of course use something like this:
<script>
$("#mysearchbox").change(function () {
    $('div:contains("' + $(this).val() + '")').wrap('<strong class="highlight"></strong>');
});
</script>
and define your search box somewhere else, for example like this:
<input id="mysearchbox">
This way, you're attaching a change handler to your search box. It fires when the value is committed (for example when the box loses focus), and it should find and highlight whatever you entered.
Note that these examples are written from memory. I don't have access to jQuery from where I'm writing, so I can't check whether there are any mistakes in what I've just written. You need to test it yourself.
By using JQuery you can add classes to elements like this...
The HTML:
<p id='myParagraph'>I need highlighting!</p>
The JQuery:
$(document).ready(function(){
$('#myParagraph').addClass('highlight');
});
I would have a look at where you can put certain HTML elements, as it is not advisable to be putting things like div tags in p tags.
I hope this helps!
UPDATED
Okay, well you can use JQuery to wrap tags around your code.
If you need to remove the tags you can use PHP's strip_tags() function. This might help with comparing the text string without the HTML formatting; obviously this happens on the server, before the page has loaded in the browser. I'm not sure of a JavaScript equivalent.
The wrap will allow you to get from your HTML to your Desired Output - That said, I would seriously consider the structure of your HTML to make sure it is the best it can be... might make life easier.
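As a small sketch of the strip_tags() comparison: the function name containsIgnoringTags is made up, and lowercasing plus whitespace collapsing are my own assumptions about what should count as a match. It only tells you whether the text occurs once the markup is ignored; it does not tell you where to put the highlight wrapper.

```php
<?php
// Hypothetical helper: does $needle occur in $html once the tags are ignored?
function containsIgnoringTags(string $html, string $needle): bool
{
    // Collapse runs of whitespace and ignore case (assumed matching rules).
    $normalize = function (string $s): string {
        return strtolower(trim(preg_replace('/\s+/', ' ', $s)));
    };
    return strpos($normalize(strip_tags($html)), $normalize($needle)) !== false;
}

$html = "This is text 1. <p>\nThis is <span>Text 2</span>. This <div>is</div> text 3.\n</p> This is Text 4.";
$found = containsIgnoringTags($html, 'This is text 2. This is Text 3.');
$missing = containsIgnoringTags($html, 'This is text 5.');
var_dump($found, $missing);
```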

PHP preg_replace to replace img tags and move them within html text

I'm developing a mobile website whose content comes from an external XML feed, and I'm having trouble with the tags. They come with style attributes, which I think would be easy to remove using preg_replace in PHP before showing the content. The problem comes when an img tag is found within text, something like: "Hel<img .../>lo My name is Alfred<br/>". If I just remove the style attribute (which usually carries a float), the image breaks up the text, making it horrible to read.
My solution is: using preg_replace, I "clean" all image tags, BUT then I need to take those tags and place them after the next <br/>, </p>, etc. (every tag that ends a paragraph). I think it will at least make the page more readable and organized.
The problem: I don't know how to get the index of each img tag just after I have cleaned it, and then find the next end of paragraph to place it there.
Example-->
before:
Hell<img .../>o my name is Alfred.<br/>
<p>I come <img .../>from England</p>
after:
Hello my name is Alfred<br/>
<img .../>
<p>I come from England</p>
<img .../>
Thanks in advance.
EDIT ---
My doubt is: if I found an img tag (<img />) in text (maybe using preg_replace, because I first needed to find a img tag, verify its attributes and change them if necessary), how do I get the index inside the whole string (by whole string I mean the whole html document read as a string) so I could take the whole tag and move it to the next end of paragraph?
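preg_replace won't give you the position of the match (preg_match and preg_match_all can report offsets with the PREG_OFFSET_CAPTURE flag, but then you have to splice the string yourself). On the other hand, you can do the replacement with preg_replace_callback and pile up the matches to print them afterwards.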
Example:
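function noimages($match)
{
    // Collects every <img> tag removed from $match, in document order.
    $stack = array();
    // U makes .* lazy; i ignores case; x ignores whitespace in the pattern.
    $img_regex = '%<img.*(/>|</img>)%ixU';
    $noimages = preg_replace_callback(
        $img_regex,
        function ($imgtag) use (&$stack) {
            array_push($stack, $imgtag[0]); // remember the tag
            return '';                      // and delete it from the text
        },
        $match
    );
    return array($noimages, $stack);
}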
So for example:
$match = '<p>I\'m not <img src="yyy.jpg"/> interested on this <img src="zzz.jpg"></img> issue </p>';
list($withoutimg, $imgs) = noimages($match);
This returns the block without images in $withoutimg and an array with the two img tags in $imgs.
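Building on the same idea, here is a sketch that goes one step further and puts the collected images right after the next <br/> or </p>, which is what the question asks for. The function name moveImagesAfterParagraphEnd is made up, and the img pattern is a simplified assumption. The text is processed block by block, where a block is everything up to and including the next paragraph-ending tag.

```php
<?php
// Hypothetical helper: moves <img> tags to just after the next <br/> or </p>.
function moveImagesAfterParagraphEnd(string $html): string
{
    // Simplified pattern for <img ...>, <img .../> and <img ...></img>.
    $imgRegex = '%<img\b[^>]*>(?:\s*</img>)?%i';

    return preg_replace_callback(
        // Group 1: content of the block. Group 2: the tag that ends it.
        '%(.*?)(<br\s*/?>|</p>)%is',
        function (array $block) use ($imgRegex): string {
            preg_match_all($imgRegex, $block[1], $found);   // collect the images
            $text = preg_replace($imgRegex, '', $block[1]); // remove them from the text
            $imgs = $found[0] ? "\n" . implode("\n", $found[0]) : '';
            return $text . $block[2] . $imgs;               // re-emit them after the end tag
        },
        $html
    );
}

$before = "Hell<img src=\"a.jpg\"/>o my name is Alfred.<br/>\n<p>I come <img src=\"b.jpg\"/>from England</p>";
$after = moveImagesAfterParagraphEnd($before);
echo $after, "\n";
```

Images in trailing text that is not followed by a <br/> or </p> are left where they are, since no block ever matches them.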

Get Text Content of current URL in php

I am working on fetching the content of a URL.
I want to fetch ONLY the text content (text only) from this site:
http://en.wikipedia.org/wiki/Asia
How is this possible? I can already fetch the URL and its title using PHP.
I got the url title using the below code:
$url = getenv('HTTP_REFERER');
$file = file($url);
$file = implode("",$file);
//$get_description = file_get_contents($url);
if(preg_match("/<title>(.+)<\/title>/i",$file,$m))
$get_title = $m[1];
echo $get_title;
Could you please help me get the content?
Using file_get_contents I could only get the HTML code. Are there any other possibilities?
Thanks -
Haan
If you just want to get a textual version of a HTML page, then you will have to process it yourself. Fetch the HTML (as you seem to already know how to do) and then process it into plain text with PHP.
There are several approaches to doing this. The first is htmlspecialchars() which will escape all the HTML special characters. I don't imagine this is what you actually want but I thought I'd mention it for completeness.
The second approach is strip_tags(). This will remove all HTML tags from an HTML document. However, it doesn't validate the input it's working with; it just does a fairly simple text replace. This means you will end up with stuff in the textual representation that you might not want (such as the contents of the head section, or the innards of embedded JavaScript and stylesheets).
The other approach is to parse the downloaded HTML with DOMDocument. I've not written code for you (don't have time), but the general procedure would be as follows:
Load the HTML into a DOMDocument object
Get the document's body element and iterate over its children.
For each child, if the child in question is a text node, append it to an output string. If it isn't a text node, then iterate over its children as well to check if any of its children are text nodes (and if not then iterate over those child elements as well and so on). You might also want to check the type of the node further. For example, if you don't want javascript or css embedded in the output then you can check that the tag type is not STYLE or SCRIPT and just ignore it if it is.
The above description is most easily implemented as a recursive function (one that calls itself).
The end result should be a string that contains only the textual content of the downloaded page, with no markup.
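The procedure above could be sketched roughly like this. The function names extractText and htmlToText are made up, and collapsing whitespace at the end is my own addition rather than part of the description. Text from adjacent elements is simply concatenated, so you may want to insert separators for block-level tags.

```php
<?php
// Hypothetical helper: recursively collects the text nodes below $node.
function extractText(DOMNode $node): string
{
    if ($node->nodeType === XML_TEXT_NODE) {
        return $node->nodeValue;
    }
    // Ignore embedded JavaScript and CSS.
    if ($node->nodeType === XML_ELEMENT_NODE
        && in_array(strtolower($node->nodeName), ['script', 'style'], true)) {
        return '';
    }
    $text = '';
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            $text .= extractText($child); // recurse into child elements
        }
    }
    return $text;
}

// Hypothetical helper: returns the text of the document's body element.
function htmlToText(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world pages are rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    // Collapse runs of whitespace (an assumption, not required by the procedure).
    return trim(preg_replace('/\s+/', ' ', extractText($body)));
}

// In practice $page would come from file_get_contents($url).
$page = "<html><head><title>T</title><style>p{color:red}</style></head>\n"
      . "<body><h1>Asia</h1>\n<script>var x = 1;</script>\n"
      . "<p>Largest <b>continent</b>.</p></body></html>";
$plain = htmlToText($page);
echo $plain, "\n";
```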
EDIT: Forgot about strip_tags! I updated my answer to mention that as well. I left my DOMDocument approach in the answer though, because as the documentation for strip_tags states, it does no validation of the markup it's processing, whereas DOMDocument attempts to parse it (and a well-implemented DOMDocument-based text extraction can potentially be more robust).
Use file_get_contents to get the HTML content and then strip_tags to remove the HTML tags, thus leaving only the text.
