Get first HTML element from a string - php

I was reading this article. This function that it includes:
<?php
function getFirstPara($string){
$string = substr($string,0, strpos($string, "</p>")+4);
return $string;
}
?>
...seems to return the first <p> found in the string. But how could I get the first HTML element (p, a, div, ...) in the string (a kind of :first-child in CSS)?

It's generally recommended to avoid string-parsing methods to interrogate HTML.
You'll find that HTML comes with so many edge cases and parsing quirks that, however clever you think you've been with your code, HTML will come along and whack you over the head with a string that breaks your tests.
I would highly recommend you use a PHP DOM parsing library (free and often included by default with PHP installs).
For example, DOMDocument:
$dom = new \DOMDocument;
$dom->loadHTML('<p>One</p><p>Two</p><p>Three</p>');
$elements = $dom->getElementsByTagName('body')->item(0)->childNodes;
print '<pre>';
var_dump($elements->item(0));
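
Note that the first child node of the body can be a text node (for example, text or whitespace before the first tag) rather than an element. A minimal sketch of skipping ahead to the first real element is below. The helper name firstElementHtml is hypothetical, and wrapping the input in <body> assumes it is a fragment rather than a full page with non-ASCII content.

```php
<?php
// Hypothetical helper: returns the outer HTML of the first element in an
// HTML fragment, or null when the fragment contains no element at all.
function firstElementHtml(string $html): ?string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect warnings instead of printing them
    $dom->loadHTML('<body>' . $html . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }
    foreach ($body->childNodes as $child) {
        // Skip text nodes and comments; stop at the first element.
        if ($child instanceof DOMElement) {
            // trim() because the serializer may append a newline after block elements
            return trim($dom->saveHTML($child));
        }
    }
    return null;
}
```

Called as firstElementHtml($string), this returns whichever element comes first, whatever its tag name.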

You could use strstr (http://php.net/strstr), much as the article does with substr and strpos.
First search for "<p>"; this gives you the full string from the first occurrence to the end:
$first = strstr($html, '<p>');
Then search for "</p>" in that result; this gives you all the HTML you don't want to keep (starting with the closing tag itself):
$second = strstr($first, '</p>');
Then remove the unwanted HTML and put the closing tag back:
$final = str_replace($second, "", $first) . '</p>';
The same method could be used to get the first child by looking for "<" and "</$" in the result from before. You will need to check the first char/word after the < to find the right end tag.
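
A rough sketch of that "read the tag name, then look for its end tag" idea is below. The function name firstElementByStrings is made up for illustration. It assumes the first element does not contain a nested element with the same tag name, and it knows nothing about comments, scripts or a ">" inside attribute values, so treat it as a demonstration of the approach rather than a robust parser.

```php
<?php
// Hypothetical helper: finds the first opening tag with a regex, then looks
// for the matching end tag with plain string functions.
function firstElementByStrings(string $html): ?string
{
    // Capture the name of the first opening tag, e.g. "div" in <div class="a">.
    if (!preg_match('/<([a-z][a-z0-9]*)\b[^>]*>/i', $html, $m, PREG_OFFSET_CAPTURE)) {
        return null; // no opening tag at all
    }
    $start   = $m[0][1];               // byte offset of the "<"
    $closing = '</' . $m[1][0] . '>';  // the end tag we need to find
    $end     = stripos($html, $closing, $start);
    if ($end === false) {
        // No end tag (e.g. <img> or <br>): return only the opening tag.
        return $m[0][0];
    }
    return substr($html, $start, $end - $start + strlen($closing));
}
```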

Related

How to replace 'innerHTML' of tag that has specific class (the nth occurrence) (using regex)?

What I am trying to achieve
I am trying to replace the 'innerHTML' of any (in my case) html tag, that has a specific class assigned to it, within a file_get_contents() string, without altering the other content. Later I will create a file (with file_put_contents()).
I am specifically trying to avoid the use of DOMDocuments, Xpath, simple_html_dom because these alter the formatting of a document.
The class markers are just a way to mark the elements in the source, like lightbox does. Marking with a class seemed most elegant, but maybe marking elements in a different way makes the solution easier? I doubt it will make a difference though.
The code should also match when:
when class="..." contains other classes
when innerHTML contains other tags
It is not necessary but it would be amazing if it even matches if:
There is php in class="..."
php inbetween class="..." and >
What I have tried
(in counter-chronological order)
1 - trying to work with the following function I've found in other SO answers and on php.net:
function preg_replace_nth($pattern, $replacement, $subject, $nth=1) {
return preg_replace_callback($pattern,
function($found) use (&$pattern, &$replacement, &$nth) {
$nth--;
if ($nth==0) return preg_replace($pattern, $replacement, reset($found) );
return reset($found);
},$subject ,$nth );
}
I am not a regex expert and in combination with the php functions it becomes, for me, very difficult, that's why I ask for help. (I've been working on this for an hour or 8.)
I tried feeding it the following regex pattern (with many small alterations):
'#(?<=class=\"classToMatch\".*?>).*?(?=</)#';
For the last 30 alterations it keeps returning:
Warning: preg_replace_callback(): Compilation failed: lookbehind assertion is not fixed length at offset xx
Things I realise that are perhaps problematic for regex:
I do not have the luxury of being able to look for a specific closing tag (e.g. </h2>) because the tag could be any element. If really necessary, maybe I should limit my request to <p>, <h(x)> and <a> elements.
I think dealing with nested elements might become problematic.
2 - working with simple_html_dom and DOMDocument
First I was delighted to see that it worked, but when I opened the source code of the edited document I was horrified because it deleted a lot of formatting.
This was the working code and should be fine for anyone working with html documents with little php and javascript.
$nth = 0; // nth occurrence (starts with 0)
$replaceWith = ''; // replacement string
$dom = new DOMDocument();
@$dom->loadHTMLFile("source.php"); // @ suppresses warnings about malformed markup
// find all elements with specific class
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' classname ')]");
if (!is_int($nodes->length) || $nodes->length < 1) die('No element found');
$nodeToChange = $nodes->item($nth);
$nodeToChange->removeChild($nodeToChange->firstChild);
$fragment = $dom->createDocumentFragment();
$fragment->appendXML($replaceWith);
$nodeToChange->appendChild($fragment);
$dom->saveHTMLFile("test.php");
3 - things with strpos etc. and I am currently considering returning to these functions.
The following regex might be helpful to you:
<(?<tag>\w*)\sclass=\"lent-editable\">(?<text>.*)</\k<tag>>
You will need to find the group name "text", which is the inner HTML you want to replace.
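
As a sketch of how named groups like these could be combined with preg_replace_callback to replace only the nth match, without running the file through a DOM parser: the function name replaceNthInner and its parameters are hypothetical, and the pattern is a loosened variant of the one above (other classes and other attributes allowed, lazy matching of the inner text). It assumes double-quoted class attributes and no nested element with the same tag name inside the match.

```php
<?php
// Hypothetical helper: replaces the inner HTML of the $nth element (1-based)
// whose class attribute contains $class, leaving all other text untouched.
function replaceNthInner(string $html, string $class, string $replacement, int $nth = 1): string
{
    $pattern = '#(<(?<tag>\w+)\b[^>]*\sclass="[^"]*(?<![\w-])'
        . preg_quote($class, '#')
        . '(?![\w-])[^"]*"[^>]*>)(?<text>.*?)(</\k<tag>>)#s';

    $count = 0;
    $result = preg_replace_callback(
        $pattern,
        function (array $m) use (&$count, $replacement, $nth) {
            $count++;
            // $m[1] is the opening tag, $m[4] the closing tag, $m[0] the whole match.
            return $count === $nth ? $m[1] . $replacement . $m[4] : $m[0];
        },
        $html
    );

    return $result ?? $html; // preg_replace_callback returns null on a regex error
}
```

Because the callback returns the original match for every occurrence except the nth, whitespace and formatting elsewhere in the file are left as they were.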

Count start and end html tags

I'm looking for a way to count html tags in a chunk of html using php. This may not be a full web page with a doctype body tags etc.
For example:
If I had something like this
$string = "
<div></div>
<div style='blah'></div>
<p>hello</p>
<p>its debbie mcgee
<p class='pants'>missing p above</p>
<div></div>";
I want to pass it to a function with a tag name such as
CheckHtml( $string, 'p' );
and I would like it to tell me the number of open <p> tags and the number of close p tags </p>. I don't want it to do anything fancy beyond that (no sneaky trying to fix it).
I have tried string counts on start tags such as <p, but that can too easily find things like <pre> and return wrong results.
I had a look at DOMDocument, but it doesn't seem to count close tags and always expects <html> tags (although I could work around this).
Any suggestions on what to use?
To get an accurate count, you can't use string matching or regex, because of the well-known problems of parsing HTML with regex.
Nor can you use the output of a standard parser, because that's a DOM consisting of elements and all the information about the tags that were in the HTML has been discarded. End tags will be inferred even for valid HTML, and even some start tags (e.g. html, head, body, tbody) can be inferred. Moreover things like the adoption agency algorithm can result in there being more elements than there were tags in the HTML mark-up. For example <b><i></b>x</i> will result in there being two i elements in the DOM. At the same time, end tags that can't be matched with start tags are simply discarded, as indeed can start and end tags that appear in the wrong place. (e.g. <caption> not in <table> or <legend> not in <fieldset>)
The only way I can think you could do this in any way reliably is this:
There's an open source PHP library for parsing HTML called html5lib.
In there, there's a file called Tokenizer.php and at the end of that file there's a function called emitToken. At this point, the parser has done all the work of figuring out all the HTML weirdnesses¹, and the $token parameter contains all the information about what kind of token has been recognised, including start and end tags.
You could take the library and modify it so that it counts up the start and end tag tokens at that point, and then exposes those totals to your application code at the end of the parse process.
¹: That is, it's figured out the weirdnesses related to your counting problem. It hasn't begun to figure out the tree construction weirdnesses.
You can use substr_count() to return the number of times the needle substring occurs in the haystack $string.
$open_tag_count = substr_count( $string, '<p' );
$close_tag_count = substr_count( $string, '</p>' );
Be aware that '<p' will also match <param and <pre, so you may need to modify your search to handle two specific cases separately:
$open_tag_count_without_attributes = substr_count( $string, '<p>' );
$open_tag_count_with_attributes = substr_count( $string, '<p ' );
$open_tag_count = $open_tag_count_without_attributes + $open_tag_count_with_attributes;
You may also wish to consider using preg_match_all(). Using a regular expression to parse HTML comes with a fairly substantial set of pitfalls, so use it with caution.
substr_count seems like a good bet.
EDIT: You'll have to use preg_match_all then.
I haven't tested this, but as an idea:
function checkHTML($string, $htmlTag) {
    $tag = preg_quote($htmlTag, '/');
    // preg_match_all returns the number of matches; preg_match stops at the first one.
    // \b keeps <p from also matching <pre or <param.
    $openTags = preg_match_all('/<' . $tag . '\b[^>]*>/i', $string);
    $closeTags = preg_match_all('/<\/' . $tag . '\s*>/i', $string);
    return array($openTags, $closeTags);
}
$numberOfParagraphTags = checkHTML($string, 'p');
echo('Open Tags:'.$numberOfParagraphTags[0].' Close Tags:'.$numberOfParagraphTags[1]);
For the chunk of HTML, try using the DOMDocument PHP class instead of a string. Then you can use methods such as getElementsByTagName() that will let you count the tags more easily and more accurately. To load your string into a DOMDocument, you could do something like this:
$doc = new DOMDocument();
$doc->loadHTML($string);
Then, to count your tags, do the following:
$tagList = $doc->getElementsByTagName($tag);
return $tagList->length;

preg_match_all how to remove img tag?

$str=<<<EOT
<img src="./img/upload_20571053.jpg" /><span>some word</span><div>some comtent</div>
EOT;
How can I remove the img tag with preg_match_all or some other way? Thanks.
I want to echo <span>some word</span><div>some comtent</div> (there may be other HTML tags, as in $str; just remove the img).
As many people said, you shouldn't do this with a regexp. Most of the examples you've seen to replace the image tags are naive and would not work in every situation. The regular expression to take into account everything (assuming you have well-formed XHTML in the first place), would be very long, very complex and very hard to understand or edit later on. And even if you think that it works correctly then, the chances are it doesn't. You should really use a parser made specifically for parsing (X)HTML.
Here's how to do it properly without a regular expression using the DOM extension of PHP:
// add a root node to the XHTML and load it
$doc = new DOMDocument;
$doc->loadXML('<root>'.$str.'</root>');
// create a xpath query to find and delete all img elements
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//img') as $node) {
$node->parentNode->removeChild($node);
}
// save the result
$str = $doc->saveXML($doc->documentElement);
// remove the root node
$str = substr($str, strlen('<root>'), -strlen('</root>'));
$str = preg_replace('#<img[^>]*>#i', '', $str);
$str = preg_replace("#\<img src\=\"(.+)\"(.+)\/\>#iU", '', $str);
echo $str;
?
In addition to @Karo96, I would go broader:
/<img[^>]*>/i
And:
$re = '/<img[^>]*>/i';
$str = preg_replace($re,'',$str);
This also assumes the html will be properly formatted. Also, this disregards the general rule that we should not parse html with regex, but for the sake of answering you I'm including it.
Perhaps you want preg_replace. It would then be: $str = preg_replace('#<img.+?>#is', '', $str), although it should be noted that for any non-trivial HTML processing you should use a proper parser, for example DOMDocument.
$noimg = preg_replace('/<img[^>]*>/','',$str);
Should do the trick.
Don't use regexes for this. Period. This isn't exactly parsing, and it might be trivial, but regexes aren't made for the DOM:
RegEx match open tags except XHTML self-contained tags
Just use DomDocument for instance.
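
A minimal sketch of the DOMDocument route for markup that is not well-formed XHTML (so loadXML would refuse it) is below. The function name removeImages is hypothetical, and the <body> wrapper assumes the input is a fragment. The serializer may insert newlines between block-level elements, so the output is not guaranteed to be byte-identical to the input minus the images.

```php
<?php
// Hypothetical helper: removes every <img> element from an HTML fragment.
function removeImages(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect warnings instead of printing them
    $dom->loadHTML('<body>' . $html . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list into an array first: removing nodes while
    // iterating the live list would skip some of them.
    foreach (iterator_to_array($dom->getElementsByTagName('img')) as $img) {
        $img->parentNode->removeChild($img);
    }

    // Serialize only the children of <body>, dropping the wrapper again.
    $out = '';
    foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```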

str_replace within certain html tags only

I have an HTML page loaded into a PHP variable and am using str_replace to swap certain words for other words. The only problem is that if one of these words appears in an important piece of code then the whole thing falls to bits.
Is there any way to only apply the str_replace function to certain html tags? Particularly: p,h1,h2,h3,h4,h5
EDIT:
The bit of code that matters:
$yay = str_ireplace($find, $replace , $html);
cheers and thanks in advance for any answers.
EDIT - FURTHER CLARIFICATION:
$find and $replace are arrays containing words to be found and replaced (respectively). $html is the string containing all the html code.
A good example of it falling to bits would be if I were to find and replace a word that occurred in e.g. the domain name. Say I wanted to replace the word 'hat' with 'cheese'. Any occurrence of an absolute path like
www.worldofhat.com/images/monkey.jpg
would be replaced with:
www.worldofcheese.com/images/monkey.jpg
So if the replacements could only occur in certain tags, this could be avoided.
Do not treat the HTML document as a mere string. Like you already noticed, tags/elements (and how they are nested) have meaning in an HTML page and thus, you want to use a tool that knows what to make of an HTML document. This would be DOM then:
Here is an example. First some HTML to work with
$html = <<< HTML
<body>
<h1>Germany reached the semi finals!!!</h1>
<h2>Germany reached the semi finals!!!</h2>
<h3>Germany reached the semi finals!!!</h3>
<h4>Germany reached the semi finals!!!</h4>
<h5>Germany reached the semi finals!!!</h5>
<p>Fans in Germany are totally excited over their team's 4:0 win today</p>
</body>
HTML;
And here is the actual code you would need to make Argentina happy
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//*[self::h1 or self::h2 or self::p]');
foreach( $nodes as $node ) {
$node->nodeValue = str_replace('Germany', 'Argentina', $node->nodeValue);
}
echo $dom->saveHTML();
Just add the tags whose content you want to replace to the XPath query. An alternative to using XPath would be to use DOMDocument::getElementsByTagName, which you might know from JavaScript:
$nodes = $dom->getElementsByTagName('h1');
In fact, if you know it from JavaScript, you might know a lot more of it, because DOM is actually a language agnostic API defined by the W3C and implemented in many languages. The advantage of XPath over getElementsByTagName is obviously that you can query multiple nodes in one go. The drawback is, you have to know XPath :)
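
One caveat with assigning to nodeValue on an element is that it replaces all of the element's children with plain text, so nested markup such as links inside a paragraph is lost. A sketch of a variant that only touches text nodes is below. The function name replaceInTags and its parameters are hypothetical, and the tag names are assumed to be simple trusted names such as p or h1, because they are placed directly into the XPath expression.

```php
<?php
// Hypothetical helper: applies str_ireplace() only to text nodes that sit
// inside one of the given tags; attributes (e.g. href) are never touched.
function replaceInTags(string $html, array $tags, $find, $replace): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Build e.g. //text()[ancestor::p or ancestor::h1]
    $conditions = array();
    foreach ($tags as $tag) {
        if (ctype_alnum($tag)) { // ignore anything that is not a plain tag name
            $conditions[] = 'ancestor::' . $tag;
        }
    }
    if (!$conditions) {
        return $html;
    }

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query('//text()[' . implode(' or ', $conditions) . ']') as $text) {
        // DOMText::$data holds the raw character data of the text node.
        $text->data = str_ireplace($find, $replace, $text->data);
    }
    return $dom->saveHTML();
}
```

With the question's example, replaceInTags($html, array('p', 'h1', 'h2', 'h3', 'h4', 'h5'), $find, $replace) would leave www.worldofhat.com/images/monkey.jpg alone when it appears in an attribute, because attributes are not text nodes.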

How to remove text between tags in php?

Despite using PHP for years, I've never really learnt how to use expressions to truncate strings properly... which is now biting me in the backside!
Can anyone provide me with some help truncating this? I need to chop out the text portion from the url, turning
text
into
$str = preg_replace('#(<a.*?>).*?(</a>)#', '$1$2', $str)
Using SimpleHTMLDom:
<?php
// example of how to modify anchor innerText
include('simple_html_dom.php');
// get DOM from URL or file
$html = file_get_html('http://www.example.com/');
//set innerText to null for each anchor
foreach($html->find('a') as $e) {
$e->innerText = null;
}
// dump contents
echo $html;
?>
What about something like this, considering you might want to re-use it with other hrefs :
$str = 'text';
$result = preg_replace('#(<a[^>]*>).*?(</a>)#', '$1$2', $str);
var_dump($result);
Which will get you :
string '' (length=24)
(I'm considering you made a typo in the OP ? )
If you don't need to match any other href, you could use something like :
$str = 'text';
$result = preg_replace('#().*?()#', '$1$2', $str);
var_dump($result);
Which will also get you :
string '' (length=24)
As a side note: for more complex HTML, don't try to use regular expressions. They work fine for this kind of simple situation, but for real-life HTML they generally don't help much: HTML is not quite "regular" enough to be parsed by regexes.
You could use substr() in combination with strpos(), even though this is not a very nice approach.
Check: PHP Manual - String functions
Another way would be to write a regular expression to match your criteria.
But in order to get your problem solved quickly the string functions will do...
EDIT: I underestimated the audience. ;) Go ahead with the regexes... ^^
You don't need to capture the tags themselves. Just target the text between the tags and replace it with an empty string. Super simple.
Demo of both techniques
Code:
$string = 'text';
echo preg_replace('/<a[^>]*>\K[^<]*/', '', $string);
// the opening tag--^^^^^^^^  ^^^^^-match everything before the end tag
//                          ^^-restart fullstring match
Output:
Or in fringe cases when the link text contains a <, use this: ~<a[^>]*>\K.*?(?=</a>)~
This avoids the expense of capture groups using a lazy quantifier, the fullstring restarting \K and a "lookahead".
Older & wiser:
If you are parsing valid html, you should use a dom parser for stability/accuracy. Regex is DOM-ignorant, so if there is a tag attribute value containing a >, my snippet will fail.
As a narrowly suited domdocument solution to offer some context:
$dom = new DOMDocument;
$dom->loadHTML($string, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD); // 2nd param suppresses the implied <html>/<body> wrapper and the DOCTYPE
$dom->getElementsByTagName('a')[0]->nodeValue = '';
echo $dom->saveHTML();
Just use strip_tags(); that would get rid of the tags and leave only the text between them.
