SIMPLE HTML DOM - how to ignore nested elements?

My HTML is as follows:
<span class="phone">
i want this text
<span class="ignore-this-one">01234567890</span>
<span class="ignore-this-two" >01234567890</span>
<a class="also-ignore-me">some text</a>
</span>
What I want to do is extract the 'i want this text' leaving all of the other elements behind. I've tried several iterations of the following, but none return the text I need:
$name = trim($page->find('span[class!=ignore^] a[class!=also^] span[class=phone]',0)->innertext);
Some guidance would be appreciated as the simple_html_dom section on filters is quite bare.

What about using PHP's preg_match (http://php.net/manual/en/function.preg-match.php)?
Try the following:
<?php
$html = <<<EOF
<span class="phone">
i want this text
<span class="ignore-this-one">01234567890</span>
<span class="ignore-this-two" >01234567890</span>
<a class="also-ignore-me">some text</a>
</span>
EOF;
$result = preg_match('#class="phone".*\n(.*)#', $html, $matches);
echo $matches[1];
?>
Regex explained:
Find the text class="phone", then continue to the end of the line, matching any characters with .*. Then move to the next line with \n and capture everything on that line by enclosing .* in parentheses.
The result is stored in the array $matches. $matches[0] holds the text matched by the whole regex, while $matches[1] holds the text captured by the parentheses.
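If the regex feels too tied to the exact line layout, another option is to read only the text nodes that sit directly under the outer span. This is a minimal sketch that swaps simple_html_dom for PHP's built-in DOMDocument and DOMXPath (my assumption, not what the question uses), reusing the markup from the question:

```php
<?php
$html = <<<EOF
<span class="phone">
i want this text
<span class="ignore-this-one">01234567890</span>
<span class="ignore-this-two" >01234567890</span>
<a class="also-ignore-me">some text</a>
</span>
EOF;

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// text() selects only the text nodes that are direct children of the
// span, so the nested <span> and <a> elements are skipped entirely.
$text = '';
foreach ($xpath->query('//span[@class="phone"]/text()') as $node) {
    $text .= $node->nodeValue;
}
$text = trim($text);

echo $text, "\n";
```

The whitespace-only text nodes between the nested elements are collected too, which is why the result is trimmed at the end.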

Related

how to do echo from a string, only from values that are between a specific stretch[href tag] of the string?

[PHP] I have a variable that holds a very large page source as a string. I want to echo only the strings I need for a project (dozens of them), which are the values inside the quotation marks of the href attribute,
but I only want the values that start with the letter n (news):
<a href="/news7044449/exclusive_news_sunday_"
In other words, I think the match has to start at: a href="/n
How can I discard all the other text in the variable and show only those values?
Note that there are other href values that start with other letters, such as 'p': href="/profiles... (these do not interest me).
$string = '</div><span class="news-hd-mark">HD</span></div><p>exclusive_news_sunday_</p><p class="metadata"><span class="bg">Czech AV<span class="mobile-hide"> - 5.4M Views</span>
- <span class="duration">7 min</span></span></p></div><script>xv.thumbs.preparenews(7044449);</script>
<div id="news_31720715" class="thumb-block "><div class="thumb-inside"><div class="thumb"><a href="/news31720715/my_sister_running_every_single_morning"><img src="https://static-hw.xnewss.com/img/lightbox/lightbox-blank.gif"';
I imagine something like this:
$removes_everything_except_values_from_the_href_tag_starting_with_the_letter_n = ('/something regex expresion I think /' or preg_match, substring?);
echo $string = str_replace($removes_everything_except_values_from_the_href_tag_starting_with_the_letter_n,'',$string);
expected output: /news7044449/exclusive_news_sunday_
NOTE: the source does not have to be a variable; the values could just as well be extracted from a .txt file.
Thanks.

I believe this will help.
<?php
$source = file_get_contents("code.html");
preg_match_all("/<a href=\"(\/n(?:.+?))\"[^>]*>/", $source, $results);
var_export( end($results) );

To get just the links out of the $results array from Valdeir's answer (the captured links are in $results[1]):
foreach ($results[1] as $r) {
echo $r;
// alt: to display them with an HTML break tag after each one
// echo $r."<br>\n";
}
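Putting the two answers together, here is a minimal self-contained sketch. The $source string is a shortened, hypothetical stand-in for the page source (the /profiles link is invented to show that it gets skipped), and the pattern uses [^"]* instead of .+? so a match can never run past the closing quote:

```php
<?php
// Hypothetical, shortened page source with two news links and one profile link.
$source = '<a href="/news7044449/exclusive_news_sunday_">one</a>'
        . '<a href="/profiles/someone">two</a>'
        . '<a href="/news31720715/my_sister_running_every_single_morning"><img src="a.gif"></a>';

// Capture only href values that start with /n.
preg_match_all('~<a\s+href="(/n[^"]*)"~', $source, $results);

// $results[0] holds the full matches, $results[1] the captured href values.
$links = $results[1];

foreach ($links as $link) {
    echo $link, "\n";
}
```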

How to search in string for <span> inside of <a> and unwrap it with PHP

What is the best way to reformat this HTML with PHP, so that the span tag ends up outside of the a tag?
This:
<p style="text-align:center">
<a href="#" target="_blank"><span style="color:green">Text Link</span></a>
</p>
To:
<p style="text-align:center">
<span style="color:green"><a href="#" target="_blank">Text Link</a></span>
</p>

You can make good use of the preg_match function:
$str = '<p style="text-align:center">
<a href="#" target="_blank"><span style="color:green">Text Link</span></a>
</p>';
if(preg_match('/^(\<p.*?\>).*(\<a.*?\>).*(\<span.*?\>)([0-9a-zA-Z ]*).*$/is', $str, $regs))
{
// $regs = [
// 0 => ... (original string)
// 1 => '<p style="text-align:center">',
// 2 => '<a href="#" target="_blank">',
// 3 => '<span style="color:green">',
// 4 => 'Text Link']
$newStr = $regs[1].$regs[3].$regs[2].$regs[4].'</a></span></p>';
}
You may have to change the [0-9a-zA-Z ]* in the regular expression to match your link text format.
If the input HTML spans multiple lines, you have to use the s modifier after the regular expression. I also used i (case insensitive) to be safe in case someone mixed <P> and <p> tags, and so on.
Note that this is a fairly specific solution. For a more general one you should load the HTML into a DOM and work with the nodes, but this is a simple and quick solution for this particular case.
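For the more general DOM route mentioned above, a rough sketch could look like the following. It assumes PHP's built-in DOMDocument and DOMXPath and the same one-paragraph input; the variable names are only illustrative:

```php
<?php
$str = '<p style="text-align:center"><a href="#" target="_blank">'
     . '<span style="color:green">Text Link</span></a></p>';

$doc = new DOMDocument();
// These flags stop libxml from wrapping the fragment in <html><body> and a doctype.
$doc->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($doc);

// Every <span> that is a direct child of an <a>.
foreach ($xpath->query('//a/span') as $span) {
    $a = $span->parentNode;

    // Move the span's children up into the <a>, just before the span.
    while ($span->firstChild) {
        $a->insertBefore($span->firstChild, $span);
    }

    // Detach the now-empty span, put it where the <a> was,
    // then move the <a> inside it.
    $a->removeChild($span);
    $a->parentNode->replaceChild($span, $a);
    $span->appendChild($a);
}

echo $doc->saveHTML();
```

Unlike the regex, this does not depend on the link text matching a character class, and it handles several links in one document.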

Clear in regular expression php

How can I replace a string inside some text and get that string back without the surrounding "pattern"?
For example, I am trying to replace %%some text%% with
<span class="spoiler">some text</span>
preg_replace("'%%[\w\s]+%%'siu",'<span class="spoiler">$0</span>',$description);

This will do what you are looking for:
$description = '%%some text%%';
$fixed_description = preg_replace("~%%([\w\s]+?)%%~siu",'<span class="spoiler">$1</span>',$description);
echo $fixed_description;
Output:
<span class="spoiler">some text</span>
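The difference between the two versions is the backreference: $0 is the whole match (markers included), while $1 is only what the parentheses captured. A small sketch to compare them side by side; the sample sentence is made up for illustration:

```php
<?php
// Hypothetical sample text with two spoilers.
$description = 'Plot: %%some text%% and %%more text%%';
$pattern = '~%%([\w\s]+?)%%~su';

// $0 is the whole match, so the %% markers stay in the output.
$withMarkers = preg_replace($pattern, '<span class="spoiler">$0</span>', $description);

// $1 is only the first capture group, so the markers are dropped.
$withoutMarkers = preg_replace($pattern, '<span class="spoiler">$1</span>', $description);

echo $withMarkers, "\n";
echo $withoutMarkers, "\n";
```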

Preg_replace only replaces first match

I'm relatively new to regular expressions and I'm having a problem with this one. I've searched this site and found nothing that works.
I want it to remove all <br /> between <div class='quote'> and </div>. The reason for this is that the whitespace is preserved anyway by the CSS and I want to remove any extra linebreaks the user puts into it.
For example, say I have this:
<div class='quote'>First line of text<br />
Second line of text<br />
Third line of text</div>
I've been trying to use this to remove both of the <br /> tags.
$TEXT = preg_replace("/(<div class='quote'>(.*?))<br \/>((.*?)<\/div>)/is","$1$3",$TEXT);
This works to an extent because the result is:
<div class='quote'>First line of text
Second line of text<br />
Third line of text</div>
However it won't remove the second <br />. Can someone help please? I figure it's probably something small I'm missing :)
Thanks!

If you want to clear all the <br /> tags inside one div block only, you first need to capture the content of that div block and then remove every <br /> from it.
Your regexp contains only one <br />, so each match replaces only one <br />.
You need something like this:
function clear_br($a)
{
return str_replace("<br />", "", $a[0]);
}
$TEXT = preg_replace_callback("/<div class='quote'>.*?<br \/>.*?<\/div>/is", "clear_br", $TEXT);

preg_replace does replace more than once: you did not pass a 4th argument (the limit), so it runs "without limit". It replaced only once here because your regex includes the wrapping <div>, so the whole pattern matched your string only once; the string contains only one such wrapping <div>.
Assuming we already have:
<div class='quote'>First line of text<br />
Second line of text<br />
Third line of text</div>
we can simply do something like:
$s = "<div class='quote'>First line of text<br />\nSecond line of text<br>\nThird line of text</div>";
echo preg_replace("{<br\s*/?>}", " ", $s);
The \s* allows optional whitespace, in case the tag is written <br/>. The /? makes the slash optional, in case it is <br>. If the system inserted those <br /> tags for you and you are sure they will always be in this form, you can use the simpler regex instead.
One word of caution: I would actually replace the tag with a space. For hello<br>world, an empty replacement would produce helloworld and merge two words into one.
(If you have not already extracted this <div ... > ... </div>, for example because the original content is a whole webpage, you would probably need to do that first with an HTML parser. A parser is safer because the content inside the outer <div>...</div> may itself contain <div> and </div>, possibly nested. If there is no <div> inside, it is easier to extract it with a regex alone.)
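Combining the two ideas so far (capture the quote block first, then strip every <br> variant inside it) might look like this sketch. It assumes there are no nested divs inside the quote block, and the extra <p> line is invented to show that a <br /> outside the block is left alone:

```php
<?php
$text = "<p>Intro<br />kept</p>\n"
      . "<div class='quote'>First line of text<br />\nSecond line of text<br>\nThird line of text</div>";

$result = preg_replace_callback(
    // Lazy .*? stops at the first </div>; s lets the dot cross newlines.
    "~<div class='quote'>.*?</div>~is",
    function (array $m): string {
        // Remove <br>, <br/> and <br /> inside the matched block only.
        return preg_replace('~<br\s*/?>~i', '', $m[0]);
    },
    $text
);

echo $result, "\n";
```

The replacement is an empty string here because each <br> in the sample is followed by a newline; use a space instead if the tags can sit directly between two words.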

A note on .*?: the ? after * makes the match lazy (as few characters as possible); it does not mean "zero or one time". Also, the replacement argument of preg_replace cannot call a function on $1, so use preg_replace_callback and do the clean-up inside the callback:
function clear_br($m){ return str_replace("<br />","",$m[0]); }
$TEXT = preg_replace_callback("/<div class='quote'>.*?<\/div>/s", "clear_br", $TEXT);

You have to be careful about how you capture the div that contains the br elements. Mr. 動靜能量 pointed out that you need to watch out for nested divs. My solution does not.
<?php
$subject ="
<div>yomama</div>
<div class='quote'>First line of text<br />
Second line of text<br />
Third line of text</div>
<div>hasamustache</div>
";
$result = preg_replace_callback( '#<div[^>]+class.*quote.*?</div>#s',
function ($matches) {
print_r($matches);
return preg_replace('#<br ?/?>#', '', $matches[0]);
}
, $subject);
echo "$result\n";
?>
# is used as the regex delimiter instead of the conventional /.
<div[^>]+ prevents the yomama div from being matched; <div.*class.*quote would have matched it, because with the s modifier (the dot also matches newlines) .* can run across lines.
quote.*? is a non-greedy match that prevents hasamustache</div> from being caught.
So the strategy is to match only the quote div in a string with newlines, and run a function on it that removes all br tags.
output:

How to grab the contents of HTML tags?

Hey so what I want to do is snag the content for the first paragraph. The string $blog_post contains a lot of paragraphs in the following format:
<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>
The problem I'm running into is that I am writing a regex to grab everything between the first <p> tag and the first closing </p> tag. However, it is grabbing the first <p> tag and the last closing </p> tag which results in me grabbing everything.
Here is my current code:
if (preg_match("/[\\s]*<p>[\\s]*(?<firstparagraph>[\\s\\S]+)[\\s]*<\\/p>[\\s\\S]*/",$blog_post,$blog_paragraph))
echo "<p>" . $blog_paragraph["firstparagraph"] . "</p>";
else
echo $blog_post;

Well, sysrqb will let you match anything in the first paragraph assuming there's no other html in the paragraph. You might want something more like this
<p>.*?</p>
Placing the ? after your * makes it non-greedy, meaning it will only match as little text as necessary before matching the </p>.
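To see the difference on the sample from the question, here is a short sketch that runs both quantifiers over the same string (the capture group is added only so the paragraph text can be read from index 1):

```php
<?php
$blog_post = '<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>';

// Greedy: .* runs to the last </p> in the string.
preg_match('~<p>(.*)</p>~s', $blog_post, $greedy);

// Lazy: .*? stops at the first </p> it can reach.
preg_match('~<p>(.*?)</p>~s', $blog_post, $lazy);

echo $greedy[1], "\n";
echo $lazy[1], "\n";
```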

If you use preg_match, use the "U" flag to make it ungreedy.
preg_match("/<p>(.*)<\/p>/U", $blog_post, $matches);
$matches[1] will then contain the first paragraph.

It would probably be easier and faster to use strpos() to find the position of the first <p> and the first </p>, then use substr() to extract the paragraph.
$paragraph_start = strpos($blog_post, '<p>');
$paragraph_end = strpos($blog_post, '</p>', $paragraph_start);
$paragraph = substr($blog_post, $paragraph_start + strlen('<p>'), $paragraph_end - $paragraph_start - strlen('<p>'));
Edit: Actually the regex in others' answers will be easier and faster... your big complex regex in the question confused me...

Using regular expressions for HTML parsing is never the right solution. You should be using XPath for this particular case:
$string = <<<XML
<post>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<p>Paragraph 3</p>
</post>
XML;
// <post> is a hypothetical wrapper: SimpleXMLElement needs a single root element
$xml = new SimpleXMLElement($string);
/* Find the first <p> */
$resultado = $xml->xpath('//p[1]');
echo $resultado[0]; // text content of the first paragraph
