Get html comment element as string with SimpleHTMLDomParser in PHP - php

From the official manual I know that I can get all the comments with the following code:
// Find all comment (<!--...-->) blocks
$es = $html->find('comment');
But this creates an array of comment nodes. I want to get the content of the comments as a string. How could I do that?
I've tried with $es->plaintext, $es->innertext and $es->outertext.
Here is an example of what I want:
HTML:
...
<div id='a'>
<!-- Some text -->
</div>
...
PHP:
...
$content = $html->find('div[id=a]', 0)->find('comment', 0)->some_attr;
echo 'Content:'.$content;
Browser:
Content: Some text
Thanks in advance !

I've found the solution!
When we load HTML with SimpleHTMLDom, the comments (along with scripts and other things) are removed from the document and saved in an array called 'noise'.
We can get a comment, script, etc. by searching the whole list of noise entries for a substring, and there is a function to do that.
This is the solution:
$html->search_noise($subString);
So, in my own example, the solution can be:
1.- $comment = $html->search_noise('Some');
2.- $comment = $html->search_noise('text');
3.- $comment = $html->search_noise('me te');
4.- etc etc
The search_noise function returns the first noise entry that matches the pattern, so we have to be a little careful with the chosen substring.
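If switching libraries is an option, the same comment can be read without guessing a substring. This is only a sketch using PHP's built-in DOM extension (DOMDocument and DOMXPath), not SimpleHTMLDom; the $source string is made up to mirror the example from the question.

```php
<?php
// Sketch with PHP's built-in DOM extension, not SimpleHTMLDom.
// $source is a made-up stand-in for the page from the question.
$source = "<div id='a'>\n<!-- Some text -->\n</div>";

$dom = new DOMDocument();
// Suppress the warnings libxml raises for imperfect real-world markup.
@$dom->loadHTML($source);

$xpath = new DOMXPath($dom);
// comment() selects comment nodes; here only those directly inside div#a.
$comments = $xpath->query("//div[@id='a']/comment()");

// nodeValue holds the text between <!-- and -->, including the padding spaces.
$content = $comments->length > 0 ? trim($comments->item(0)->nodeValue) : '';
echo 'Content: ' . $content;
```

Because the XPath expression is scoped to the div, a comment elsewhere in the page that happens to contain the same words is not picked up.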

Related

get content inside html not working

I am trying to extract the HTML content from a website. I want only the content inside the <body> tags.
//$validlink is a link with .htm extension, source code is rather large
//contains 24,000 lines of html code
$thehtml = file_get_contents($validlink);
$thehtml = preg_match("/<body.*?>(.*?)<\/body>/is", $thehtml);
What else can I do? I am trying to insert this into a WordPress post, but $thehtml is empty for some odd reason. Could there be a timeout issue or something?
It can't be a timeout issue, because I noticed that if I output just file_get_contents($validlink); then for some reason BODY is not found.
Another possible solution would be to get just the content between the first div and the last div found in the document.
Get the positions of the opening and closing tags with strpos(), then use substr() with those positions to extract the content between them.
$thehtml = file_get_contents($validlink);
$thehtml = preg_match("/<body.*?>(.*?)<\/body>/is", $thehtml,$matches);
$thehtml = $matches[0];
Here is the correct code:
$thehtml = file_get_contents($validlink);
preg_match('/<body.*?>(.*?)<\/body>/is', $thehtml, $matches);
$thehtml = $matches[1];
But I suggest you use a DOM parser instead.
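As a rough illustration of the DOM parser route, here is a sketch with PHP's built-in DOMDocument. The $thehtml string is a made-up stand-in for the result of file_get_contents($validlink), and the variable $inner is introduced only for this example.

```php
<?php
// Made-up stand-in for file_get_contents($validlink).
$thehtml = '<html><head><title>t</title></head>'
         . '<body class="x"><div>Hello</div><p>World</p></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$dom->loadHTML($thehtml);
libxml_clear_errors();

$body = $dom->getElementsByTagName('body')->item(0);
$inner = '';
if ($body !== null) {
    // saveHTML($node) serializes a single node; joining the children
    // gives the content of <body> without the <body> tags themselves.
    foreach ($body->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
}
echo $inner;
```

Unlike the regex, this does not depend on how the body tag is written, and it makes the "BODY is not found" case explicit through the null check.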

How to get Wikipedia "clean" content?

I'm using the MediaWiki API to get content from Wikipedia pages.
I've written code that generates the following query (for example):
http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&rvsection=0&titles=hawaii
This retrieves only the leading paragraph from the Wikipedia page about Hawaii.
The problem is that, as you might notice, there are a lot of irrelevant substrings such as:
"[[Molokai|Moloka{{okina}}i]], [[Lanai|Lāna{{okina}}i]], [[Kahoolawe|Kaho{{okina}}olawe]], [[Maui]] and the [[Hawaii (island)|".
All those brackets [[]] are not relevant, and I wonder whether there is an elegant method to pull only 'clean' content from such pages?
Thanks in advance.
You can get a clean HTML text from Wikipedia with this query:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=hawaii
If you want just a plain text, without HTML, try this:
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=hawaii&explaintext
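For what it's worth, here is a sketch of how that query could be built and its reply unpacked in PHP. The function names buildExtractUrl and firstExtract are made up for this example, format=json and exintro are parameters added on top of the query above (exintro to limit the extract to the lead section), and $sample is a hand-written reply in the shape the API is assumed to return, not real output.

```php
<?php
// Build the extracts query; format=json and exintro are additions to the
// URL shown above (exintro limits the extract to the lead section).
function buildExtractUrl(string $title): string
{
    $params = [
        'action'      => 'query',
        'prop'        => 'extracts',
        'titles'      => $title,
        'exintro'     => 1,
        'explaintext' => 1,
        'format'      => 'json',
    ];
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params);
}

// Pull the extract out of a decoded reply; pages are keyed by page id,
// so we loop instead of guessing the key.
function firstExtract(array $response): ?string
{
    foreach ($response['query']['pages'] ?? [] as $page) {
        if (isset($page['extract'])) {
            return $page['extract'];
        }
    }
    return null;
}

// Hand-written sample in the assumed reply shape (not real API output).
$sample = '{"query":{"pages":{"123":{"pageid":123,"title":"Hawaii",'
        . '"extract":"Hawaii is a state."}}}}';

echo buildExtractUrl('hawaii'), "\n";
echo firstExtract(json_decode($sample, true)), "\n";
```

In real use the string passed to json_decode would come from fetching the built URL, for example with file_get_contents.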
Please try this:
$relevant = preg_replace('/\[\[.*?\]\]/', '', $string); // brackets escaped so that [[...]] is matched literally
EDIT: just found this - hope it is helpful

How to format text within a p tag using PHP DomDocument

I've seen many questions that are almost what I am looking to do, and they have led me to almost get it working, but not quite.
I want to format text as follows within a <p> tag which is within a div.
Page 1 of 2
So in ordinary HTML I used the b tag within the paragraph but can't seem to figure out how to do that with DomDocument. When I try to create an element like so
$pTag = $dom->createElement("p", "Page <b>1</b> of 2");
It outputs just that, without recognising the <b> as HTML. So I thought about it and came up with
$pTag->nodeValue .=
as a way to append a new element but that did no good. It didn't give me any errors but it didn't append the <b> tag either. This seems like something that should be simple but doesn't seem to be.
When I tried echo it outputted the text to the top of the screen, not where I wanted it.
I'd appreciate any advice.
Something like the following should work:
$bTag = $dom->createElement("b", "1");
$pTag = $dom->createElement("p");
$pTag->appendChild($dom->createTextNode("Page "));
$pTag->appendChild($bTag);
$pTag->appendChild($dom->createTextNode(" of 2"));
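To see the difference end to end, here is a small self-contained sketch. The surrounding div and the $escaped variable are added only for illustration; the node-building part is the same as in the answer above.

```php
<?php
$dom = new DOMDocument();
$div = $dom->createElement('div');
$dom->appendChild($div);

// Passing markup as the element value treats it as plain text,
// so the tags come out escaped when the node is serialized.
$escaped = $dom->createElement('p', 'Page <b>1</b> of 2');

// Building the pieces as separate nodes keeps <b> as a real element.
$pTag = $dom->createElement('p');
$pTag->appendChild($dom->createTextNode('Page '));
$pTag->appendChild($dom->createElement('b', '1'));
$pTag->appendChild($dom->createTextNode(' of 2'));
$div->appendChild($pTag);

echo $dom->saveHTML($escaped), "\n";   // the <b> shows up as &lt;b&gt;
echo $dom->saveHTML($div), "\n";       // the <b> is a real child of <p>
```

The same idea applies to any inline formatting: each tag becomes its own createElement call, and the text around it becomes text nodes.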

String replace the contents of a div

What I want to do:
I have a div with an id. Whenever ">" occurs I want to replace it with ">>". I also want to prefix the div with "You are here: ".
Example:
<div id="bbp-breadcrumb">Home > About > Contact</div>
Context:
My div contains breadcrumb links for bbPress, but I'm trying to match its format to a site-wide breadcrumb plugin that I'm using for WordPress. The div is produced by a PHP function call and output as HTML.
My question:
Do I use PHP or JavaScript to replace the symbols, and how do I go about getting the contents of the div in the first place?
Find the code that's generating the >, and either set the appropriate option (breadcrumb_separator or so) or modify the PHP code to change the separator.
Modifying supposedly static text with JavaScript is not only a maintenance nightmare, extremely brittle, and might lead to a strange rendering (as users see your site being modified if their system is slow), but will also not work in browsers without (or with disabled) JavaScript support.
You could use CSS to add the "You are here: " text:
#bbp-breadcrumb:before {
content: "You are here: ";
}
Browser support:
http://www.quirksmode.org/css/beforeafter_content.html
You could change the > to >> with JavaScript:
var htmlElement = document.getElementById('bbp-breadcrumb');
// innerHTML serializes a ">" in text as "&gt;", so match the entity; matching a bare ">" would also hit the brackets of any tags inside the div
htmlElement.innerHTML = htmlElement.innerHTML.split('&gt;').join('&gt;&gt;');
I don't recommend altering content like this; it is really hacky. You'd better change the output rendering of the breadcrumb plugin if possible. Within WordPress this should be doable.
You can use a regex to match the breadcrumb content, make the changes to it, and put it back in its context.
Check if this helps you:
$the_existing_html = 'something before<div id="bbp-breadcrumb">Home > About > Contact</div>something after'; // let's say this is your current html, with some context added
echo $the_existing_html, '<hr />'; // output, so that you can see the difference at the end
$pattern ='|<div(.*)bbp-breadcrumb(.*)>(.*)<\/div>|sU'; // find some text that is in a div that has "bbp-breadcrumb" somewhere in its attribute list
$all = preg_match_all($pattern, $the_existing_html, $matches); // match that pattern
$current_bc = $matches[3][0]; // get the text inside that div
$new_bc = 'You are here: ' . str_replace('>', '>>', $current_bc); // double each > separator (search for '&gt;' instead if the html contains the entity)
$the_final_html = str_replace($current_bc, $new_bc, $the_existing_html); // replace the initial breadcrumb with the new one
echo $the_final_html; // output to see where we got
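If the breadcrumb contains links, a regex or plain str_replace on the markup can also hit the > that closes each tag. Here is a sketch of the same change done on text nodes only, using PHP's built-in DOMDocument and DOMXPath; the sample $html with its links is made up, and $result is a name introduced just for this example.

```php
<?php
// Made-up breadcrumb markup with links, as bbPress might plausibly emit it.
$html = '<div id="bbp-breadcrumb"><a href="/">Home</a> &gt; '
      . '<a href="/about">About</a> &gt; Contact</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // keep parse warnings quiet
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@id="bbp-breadcrumb"]')->item(0);

$result = '';
if ($div !== null) {
    // Only text nodes are touched, so the > that closes each tag is never affected.
    foreach ($xpath->query('.//text()', $div) as $textNode) {
        $textNode->nodeValue = str_replace('>', '>>', $textNode->nodeValue);
    }
    // Prefix the breadcrumb with the label as a plain text node.
    $div->insertBefore($dom->createTextNode('You are here: '), $div->firstChild);
    $result = $dom->saveHTML($div);
}
echo $result;
```

Changing the separator in the plugin's own settings, as suggested above, is still the cleaner fix when that option exists.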

Find and replace problem

My website has 2 database tables. One of them is posts_table, which holds the posts, and the other one holds the videos.
At the moment I am getting the text, images, etc. from the post_table table as normal.
In my CMS, when we add a video, a shortcode is added:
[media id=487 width=660 height=440]
This shortcode automatically gets the link of a video from vid_table where the id is the same as in the shortcode.
So what I want is:
I need to do the same thing that the shortcode does. When a video is added in the CMS, the shortcode is shown in the post; I need to remove the shortcode and, in its place, play the video whose link is in vid_table.
I have some problems with my English, so if you don't understand, please tell me.
Any kind of help will be great.
Thank you.
EDITED: So I want to replace the whole media tag with a Flash player that plays the URL belonging to the ID in the media tag.
Bump! Can anyone help, please?
This is quite a sophisticated problem actually. I was bored and made a basic tag parser. Right now it has some problems:
HTML rendering should be implemented in a separated class (and a template engine such as Twig should do the rendering);
Tag parsing is way too naive and will probably give you unexpected results if a tag's syntax is incorrect;
[media] tag does not support IE. You would have to change the source itself (method TagParser::renderMedia())
Some features to note:
extra parameters will be rendered as attributes for [link] tag, e.g [link id=25 class=foo] will output example.
parameters may contain spaces if you quote them: [link id=25 class="foo bar"] will output example
If DataProvider::findById() does not return a 'content' in its array, the parser will output http://example.com
The code is too long to paste here, you can find it on gist. Just put each file in the directory specified by the first commented line and you should be set. Run example.php to see it in action. You can find out some more details about using this script by looking at the unit test.
What do you want exactly? You can get the media id out of the text using:
$text = 'some stuff [media id=468 width=660 height=440] more stuff';
// Capture only the digits after "id=" instead of trimming the match by hand
preg_match('/\[media id=(\d+)/', $text, $results);
$id = $results[1];
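Going one step further, the whole tag can be swapped for player markup in one pass with preg_replace_callback. This is a minimal sketch: the $videos array is a hypothetical stand-in for a lookup in vid_table, renderMediaTags is a made-up function name, and the embed markup is only a placeholder for whatever player the site uses.

```php
<?php
// Hypothetical stand-in for a lookup in vid_table: id => video URL.
$videos = [487 => 'http://example.com/videos/487.flv'];

// Replace every [media id=.. width=.. height=..] tag with player markup.
function renderMediaTags(string $text, array $videos): string
{
    $pattern = '/\[media\s+id=(\d+)\s+width=(\d+)\s+height=(\d+)\]/';
    return preg_replace_callback($pattern, function (array $m) use ($videos): string {
        $id = (int) $m[1];
        if (!isset($videos[$id])) {
            return '';   // unknown id: drop the shortcode
        }
        $url = htmlspecialchars($videos[$id], ENT_QUOTES);
        // Placeholder markup; swap in whatever player the site uses.
        return sprintf('<embed src="%s" width="%d" height="%d">', $url, $m[2], $m[3]);
    }, $text);
}

$post = 'some stuff [media id=487 width=660 height=440] more stuff';
$rendered = renderMediaTags($post, $videos);
echo $rendered;
```

In a real setup the array lookup would be a database query by id, ideally done once for all ids found in the post rather than once per tag.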
