I want a preg_match code that will detect a given string and get its wrapping element.
I have a string and a html code like:
$string = "My text";
$html = "<div><p class='text'>My text</p><span>My text</span></div>";
So I need to create a function that returns the element wrapping the string, like:
$element = get_wrapper($string, $html);
function get_wrapper($str, $code){
//code here that has preg_match and return the wrapper element
}
The returned value will be an array, since there are two possible results: <p class='text'></p> and <span></span>.
Can anyone give me a regex pattern to get the HTML element that wraps the given string?
Thanks! Answers are greatly appreciated.
It's a bad idea to use regex for this task. You can use DOMDocument:
$oDom = new DOMDocument('1.0', 'UTF-8');
$oDom->loadXML("<div>" . $sHtml . "</div>"); // $sHtml is the HTML fragment to search
get_wrapper($s, $oDom);
Then walk the child nodes recursively:
function get_wrapper($s, $oDom) {
    foreach ($oDom->childNodes as $oItem) {
        if ($oItem->nodeValue == $s) {
            // needed tag - $oItem->nodeName
        } else {
            get_wrapper($s, $oItem);
        }
    }
}
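For what it's worth, here is a minimal sketch of the same idea that actually returns an array, as the question asks. It is only one way to do it: find_wrappers() is a made-up name, it uses loadHTML (which tolerates fragments) instead of loadXML, and it only looks at an element's direct text children so that the outer <div> is not reported.

```php
<?php
// find_wrappers() is a hypothetical helper name, not part of any library.
function find_wrappers(string $str, string $html): array
{
    $dom = new DOMDocument();
    // Suppress warnings about sloppy markup; loadHTML wraps fragments in html/body.
    @$dom->loadHTML($html);

    $wrappers = [];
    foreach ($dom->getElementsByTagName('*') as $el) {
        // Only direct text children count, so ancestors are not reported.
        foreach ($el->childNodes as $child) {
            if ($child instanceof DOMText && trim($child->nodeValue) === $str) {
                $wrappers[] = $el->nodeName;
                break;
            }
        }
    }
    return $wrappers;
}

$html  = "<div><p class='text'>My text</p><span>My text</span></div>";
$found = find_wrappers("My text", $html);
print_r($found);
```

If you need the elements themselves rather than their names, collecting $el instead of $el->nodeName should work the same way.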
A simple pattern would be the following, but it assumes a lot of things. Regexes shouldn't be used for this; you should look at something like the Simple HTML DOM parser, which is more intelligent.
Anyway, a regex that matches the wrapper tags and the surrounding HTML is as follows (the trailing g is the JavaScript-style global flag; PHP's preg functions do not accept it, so drop it and use preg_match_all there).
/[A-Za-z'= <]*>My text<[A-Za-z\/>]*/g
Even if regex is never the correct answer in the domain of DOM parsing, I came up with another (quite simple) solution:
(<[^>/]+?>)My String(</.+?>)
It works if the HTML is well formed (i.e. it has closing tags, < is escaped as &lt;, and so on). This way the first regex group holds the opening tag and the second the closing one.
How to remove all from page except text inside <p> tag?
Page:
This is text.
<div class="text">This is text in 'div' tag</div>
<p>This is text in 'p' tag</p>
Expected result:
This is text in 'p' tag
Greetings.
Basically, you'll have to parse the markup. PHP comes with a good parser in the form of the DOMDocument class, so that's really quite easy:
$dom = new DOMDocument;
$dom->loadHTML($htmlString);
Next, get all p tags:
$paragraphs = $dom->getElementsByTagName('p');
This method returns a DOMNodeList object, which implements the Traversable interface, so you can iterate over it like an array of DOMNode instances (DOMElement in this case):
$first = $paragraphs->item(0);//or $paragraphs[0] even
foreach ($paragraphs as $p) {
echo $p->textContent;//echo the inner text
}
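If you only want the paragraph elements that do not contain child elements, you can check for that. Note that hasChildNodes() also counts text nodes, so test for element descendants instead:
foreach ($paragraphs as $p) {
    if ($p->getElementsByTagName('*')->length === 0) {
        echo $p->textContent; // or $p->nodeValue
    }
}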
A closely related answer with some more links/info: How to split an HTML string into chunks in PHP?
You can easily do this with the native php strip_tags function like so:
strip_tags("<p>This is text in 'p' tag</p>");
This returns "This is text in 'p' tag", as you expected. NOTE: on its own this only helps once you have isolated the <p> markup. For the whole page you still need something else (e.g. a little bit of dirty RegExp) to drop the other elements, such as the div, together with their text, because strip_tags removes the tags but keeps their contents. The function takes one required argument, the string to strip the tags from, and an optional second argument listing the allowable tags that will not be stripped. For more information, see the strip_tags entry in the PHP manual. I hope you got the idea :)
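To make the behaviour of the second argument concrete, here is a small sketch run against the page from the question (the $page string is my own transcription of it). It suggests that strip_tags by itself keeps the text of every element, so it cannot isolate the <p> content without extra work.

```php
<?php
$page = "This is text.\n"
      . "<div class=\"text\">This is text in 'div' tag</div>\n"
      . "<p>This is text in 'p' tag</p>";

// Without a second argument every tag is removed, but all text is kept.
$all_text = strip_tags($page);

// With an allow-list, <p> survives while <div> is stripped (its text stays).
$keep_p = strip_tags($page, '<p>');

echo $all_text, "\n---\n", $keep_p, "\n";
```

In both results the loose text and the div text are still present, which is why the DOM-based answer above is the safer route here.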
I would like to use Simple HTML DOM to remove all images in an article so I can easily create a small snippet of text for a news ticker but I haven't figured out how to remove elements with it.
Basically I would:
1. Get content as HTML string
2. Remove all image tags from content
3. Limit content to x words
4. Output.
Any help?
There are no dedicated methods for removing elements. You just find all the img elements and then do
$e->outertext = '';
When you only clear the outer text, you delete the HTML content itself, but if you perform another find on the same elements they will still appear in the result. The reason is that the Simple HTML DOM object still keeps its internal structure for the element, only without its actual content. To really delete the element, reload the HTML as a string into the same variable. The object is then recreated without the deleted content.
Here is an example function:
public function removeNode($selector)
{
    foreach ($this->find($selector) as $node)
    {
        $node->outertext = '';
    }
    $this->load($this->save());
}
Put this function inside the simple_html_dom class and you're good.
I think you are having difficulties because you forgot to save (dump the internal DOM tree back into a string).
Try this:
$html = file_get_html("http://example.com");
foreach($html ->find('img') as $item) {
$item->outertext = '';
}
$html->save();
echo $html;
I could not figure out where to put the function so I just put the following directly in my code:
$html->load($html->save());
It basically locks changes made in the for loop back into the html per above.
The proposed solutions are quite expensive and practically unusable in a big loop or any other kind of repetition.
I prefer to use "soft deletes":
foreach ($html->find('somecondition') as $item) {
    if (somecheck) $item->setAttribute('softDelete', true); // <= set marker to check in further code
    $item->outertext = '';
    foreach ($foo as $bar) {
        if (!$bar->getAttribute('softDelete')) {
            // do something
        }
    }
}
This is working for me:
foreach($html->find('element') as $element){
$element = NULL;
}
Adding new answer since removeNode is definitely a better way of removing it:
$html->removeNode('img');
This method probably was not available when accepted answer was marked. You do not need to loop the html to find each one, this will remove them.
Use outerhtml instead of outertext:
<div id='your_div'>the contents of your div</div>
$your_div->outertext = '';
echo $your_div; // echoes <div id='your_div'></div>
$your_div->outerhtml = '';
echo $your_div; // echoes nothing
Try this:
$dom = new Dom();
$dom->loadStr($text);
foreach ($dom->find('element') as $element) {
$element->delete();
}
This works now:
$element->remove();
You can see the documentation for the method here.
Below I remove the HEADER and all SCRIPT nodes of the incoming url by using 2 different methods of the FIND() function. Remove the 2nd parameter to return an array of all matching nodes then just loop through the nodes.
$clean_html = file_get_html($url);
// Find and remove 1st instance of node.
$node = $clean_html->find('header', 0);
$node->remove();
// Find and remove all instances of the node.
$nodes = $clean_html->find('script');
foreach($nodes as $node) {
$node->remove();
}
How do I ignore HTML tags in this preg_replace?
I have a foreach function for a search, so if someone searches for "apple span" the preg_replace also applies a span to the span and the html breaks:
preg_replace("/($keyword)/i","<span class=\"search_hightlight\">$1</span>",$str);
Thanks in advance!
I suggest you base your function on DOMDocument and DOMXPath rather than on regular expressions. Even though those are quite powerful, you run into problems like the one you describe, which are not (always) easy to solve robustly with regular expressions.
The general saying is: Don't parse HTML with regular expressions.
It's a good rule to keep in mind and, although as with any rule it does not always apply, it's worth making up one's mind about it.
XPath allows you to find all texts that contain the search terms within text nodes only, ignoring all XML elements.
Then you only need to wrap those texts into the <span> and you're done.
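As a rough illustration of the "text nodes only" idea, here is a much simpler sketch than a full solution. It assumes the input is well-formed XML with a single root element, that the mbstring extension is available, and it only finds matches that sit inside one text node (it will not find a phrase that spans several tags). highlight_text_nodes() is a made-up name.

```php
<?php
// highlight_text_nodes() is a hypothetical helper for illustration only.
function highlight_text_nodes(string $xml, string $keyword): string
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    if ($keyword === '') {
        return $doc->saveXML($doc->documentElement);
    }
    $xp  = new DOMXPath($doc);
    $len = mb_strlen($keyword, 'UTF-8');

    // Copy the node list first, because the loop edits the tree.
    foreach (iterator_to_array($xp->query('//text()')) as $text) {
        while (($pos = mb_stripos($text->nodeValue, $keyword, 0, 'UTF-8')) !== false) {
            $match = $text->splitText($pos);   // node starting at the keyword
            $rest  = $match->splitText($len);  // node after the keyword

            $span = $doc->createElement('span');
            $span->setAttribute('class', 'search_hightlight');
            $match->parentNode->replaceChild($span, $match);
            $span->appendChild($match);

            $text = $rest; // keep scanning the remainder of the original text
        }
    }
    return $doc->saveXML($doc->documentElement);
}

$result = highlight_text_nodes('<p>apple <span>span</span> pie</p>', 'span');
echo $result;
```

Because only text nodes are searched, the existing <span> tag name is never touched; only the word "span" inside it gets wrapped.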
Edit: Finally some code ;)
First it makes use of XPath to locate elements that contain the search text. My query looks like this; it might be written better, I'm not a super XPath pro:
'//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..'
$search contains the text to search for, not containing any " (quote) character (this would break it, see Cleaning/sanitizing xpath attributes for a workaround if you need quotes).
This query will return all parents that contain textnodes which, put together, form a string that contains your search term.
As such a list is not easy to process further as-is, I created a TextRange class that represents a list of DOMText nodes. It is useful to do string-operations on a list of textnodes as if they were one string.
This is the base skeleton of the routine:
$str = '...'; # some XML
$search = 'text that span';
printf("Searching for: (%d) '%s'\n", strlen($search), $search);
$doc = new DOMDocument;
$doc->loadXML($str);
$xp = new DOMXPath($doc);
$anchor = $doc->getElementsByTagName('body')->item(0);
if (!$anchor)
{
throw new Exception('Anchor element not found.');
}
// search elements that contain the search-text
$r = $xp->query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r)
{
throw new Exception('XPath failed.');
}
// process search results
foreach($r as $i => $node)
{
$textNodes = $xp->query('.//child::text()', $node);
// extract $search textnode ranges, create fitting nodes if necessary
$range = new TextRange($textNodes);
$ranges = array();
while(FALSE !== $start = strpos($range, $search))
{
$base = $range->split($start);
$range = $base->split(strlen($search));
$ranges[] = $base;
}
// wrap every matching textnode
foreach($ranges as $range)
{
foreach($range->getNodes() as $node)
{
$span = $doc->createElement('span');
$span->setAttribute('class', 'search_hightlight');
$node = $node->parentNode->replaceChild($span, $node);
$span->appendChild($node);
}
}
}
For my example XML:
<html>
<body>
This is some <span>text</span> that span across a page to search in.
and more text that span</body>
</html>
It produces the following result:
<html>
<body>
This is some <span><span class="search_hightlight">text</span></span><span class="search_hightlight"> that span</span> across a page to search in.
and more <span class="search_hightlight">text that span</span></body>
</html>
This shows that this even allows finding text that is distributed across multiple tags. That's not easily possible with regular expressions at all.
You find the full code here: http://codepad.viper-7.com/U4bxbe (including the TextRange class that I have taken out of the answer's example).
It's not working properly on the viper codepad because of an older LIBXML version that site is using. It works fine for my LIBXML version 20707. I created a related question about this issue: XPath query result order.
A note of warning: This example uses binary string search (strpos) and the related offsets for splitting textnodes with the DOMText::splitText function. That can lead to wrong offsets, as the function needs the UTF-8 character offset. The correct method is to use mb_strpos to obtain the UTF-8 based value.
The example works anyway because it only makes use of US-ASCII, where byte offsets and UTF-8 character offsets are the same for the example data.
For a real-life situation, the $search string should be UTF-8 encoded and mb_strpos should be used instead of strpos:
while(FALSE !== $start = mb_strpos($range, $search, 0, 'UTF-8'))
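A tiny sketch of the difference, assuming the mbstring extension is loaded (the sample string is my own): as soon as a multi-byte character precedes the match, the byte offset and the character offset diverge.

```php
<?php
// "\u{00EF}" is the letter i with diaeresis, two bytes in UTF-8.
$haystack = "na\u{00EF}ve text that span";
$needle   = "text";

$byteOffset = strpos($haystack, $needle);                // counts bytes
$charOffset = mb_strpos($haystack, $needle, 0, 'UTF-8'); // counts characters

printf("strpos: %d, mb_strpos: %d\n", $byteOffset, $charOffset);
// prints "strpos: 7, mb_strpos: 6"
```

Passing the byte offset to a function that expects a character offset would therefore split one character too late in this string.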
I have an XML document from which I want to extract some data:
<tnt:results>
<tnt:result>
<Document id="id1">
<impact _blabla_ for="tree.def" name="Something has changed"
select="moreblabla">true</impact>
<impact _blabla_ for="plant.def" name="Something else has changed"
select="moreblabla">true</impact>
</Document>
</tnt:result>
</tnt:results>
In reality there are no newlines -- it's one continuous string, and there can be multiple <Document> elements. I want a regular expression that extracts:
id1
tree.def / plant.def
Something has changed / Something else has changed
I was able to come up with this code so far, but it only matches the first impact, rather than both of them:
preg_match_all('/<Document id="(.*)">(<impact.*for="(.*)".*name="(.*)".*<\/impact>)*<\/Document>/U', $response, $matches);
The other way to do it would be to match everything inside the Document element and pass it through a RegEx once more, but I thought I can do this with only one RegEx.
Thanks a lot in advance!
Just use DOM, it's easy enough:
$dom = new DOMDocument;
$dom->loadXML($xml_string);
$documents = $dom->getElementsByTagName('Document');
foreach ($documents as $document) {
    echo $document->getAttribute('id'); // id1
    $impacts = $document->getElementsByTagName('impact');
    foreach ($impacts as $impact) {
        echo $impact->getAttribute('for'); // tree.def
        echo $impact->getAttribute('name'); // Something has changed
    }
}
Don't use RegEx. Use an XML parser.
Really, if you have to worry about multiple Document elements and extracting all sorts of attributes, you're much better off using an XML parser or a query language like XPath.
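If you want to go the XPath route, a sketch could look like the following. The sample string is modelled on the question, but the xmlns:tnt URI and the trimmed attribute list are my own assumptions so that the string is well-formed XML; adjust them to the real document.

```php
<?php
// Assumed sample: namespace URI and attributes are placeholders.
$xml = '<tnt:results xmlns:tnt="http://example.com/tnt"><tnt:result>'
     . '<Document id="id1">'
     . '<impact for="tree.def" name="Something has changed">true</impact>'
     . '<impact for="plant.def" name="Something else has changed">true</impact>'
     . '</Document></tnt:result></tnt:results>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

$rows = [];
// Every <impact> directly below a <Document>, in document order.
foreach ($xpath->query('//Document/impact') as $impact) {
    $rows[] = [
        'id'   => $impact->parentNode->getAttribute('id'),
        'for'  => $impact->getAttribute('for'),
        'name' => $impact->getAttribute('name'),
    ];
}
print_r($rows);
```

Each row carries the id of its parent Document, so multiple Document elements are handled without a second pass.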
I have the following string:
<?foo?> <?bar?> <?baz?> hello world <?/?> <?/?> <?/?>
I need a regular expression to convert it into
<?foo?> <?bar?> <?baz?> hello world <?/baz?> <?/bar?> <?/foo?>
The following code works for non-recursive tags:
$x=preg_replace_callback('/.*?<\?\/\?>/',function($x){
return preg_replace('/(.*<\?([^\/][\w]+)\?>)(.*?)(<\?\/?\?>)/s',
'\1\3<?/\2?>',$x[0]);
},$str);
You can't do this with regular expressions. You need to write a parser!
So create a stack (an array where you add and remove items at the end; use array_push() and array_pop()).
Iterate through the tags, pushing known opening tags on the stack.
When you come to a closing tag, pop the stack and that will tell you the tag you need to close.
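The stack approach can be sketched in PHP roughly like this. Note that a regex is still used here, but only as a tokenizer to find the tags; the matching of closing tags to opening tags is done by the stack. close_tags() is a made-up name, and the sketch assumes tag names consist of word characters.

```php
<?php
// close_tags() is a hypothetical helper name.
function close_tags(string $input): string
{
    $stack = [];
    return preg_replace_callback(
        '/<\?(\/?)(\w*)\?>/',
        function (array $m) use (&$stack) {
            if ($m[1] === '/') {
                // Closing tag: name it after the most recently opened tag.
                $name = array_pop($stack);
                return $name === null ? $m[0] : "<?/{$name}?>";
            }
            // Opening tag: remember its name and leave it unchanged.
            array_push($stack, $m[2]);
            return $m[0];
        },
        $input
    );
}

$closed = close_tags('<?foo?> <?bar?> <?baz?> hello world <?/?> <?/?> <?/?>');
echo $closed;
```

An unmatched closing tag (empty stack) is left untouched in this sketch; you may prefer to raise an error there instead.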
For a recursive structure, make a recursive function. In some form of pseudo-code:
tags = ['<?foo?>', '<?bar?>', '<?baz?>']
// output consumed stream to 'output' and return the rest
function close_matching(line, output) {
for (tag in tags) {
if line.startswith(tag) {
output.append(tag)
line = close_matching(line.substring(tag.length()), output)
i = line.indexof('<')
... // check i for not found
output.append(line.substring(0, i))
j = line.indexof('>')
... // check j for error, and check what's between i,j is valid for close tag
output.append(closetag_for_tag(tag))
line = line.substring(j + 1)
}
}
return line;
}
This should give you a basic structure that works.