Take a part of a big text - php

Let's say we have a string ($text)
I will help you out, if <b>you see this message and never forget</b> blah blah blah
I want to take text from "<b>" to "</b>" into a new string($text2)
How can this be done?
I appreciate any help I can get. Thanks!
Edit:
I want to take a code like this.
<embed type="application/x-shockwave-flash"></embed>

If you only want the first match and do not need to match something like <b class="...">, the following will work:
UPDATED for comment:
$text = "I will help you out, if <b>you see this message and never forget</b> blah blah blah";
$matches = array();
preg_match('#<b>.*?</b>#s', $text, $matches);
if ($matches) {
$text2 = $matches[0];
// Do something with $text2
}
else {
// The string wasn't found, so do something else.
}
But for something more complex, you really should parse it as DOM per Marc B.'s comment.
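If every <b>...</b> pair is needed rather than only the first, preg_match_all with a capture group is one option. This is only a sketch; the input string is made up for illustration:

```php
<?php
// Hypothetical sample input with two <b> elements.
$text = 'if <b>you see this</b> and <b>never forget</b> blah';

// Group 1 captures only the text between the tags; the s flag lets "." span newlines.
preg_match_all('#<b>(.*?)</b>#s', $text, $matches);

$withTags = $matches[0]; // full matches, tags included
$inner    = $matches[1]; // text between <b> and </b>

print_r($inner);
```

$matches[0] holds the matches with their tags and $matches[1] holds only the inner text, so either form can be copied into $text2.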

Use this bad mofo: http://fr2.php.net/domdocument
$dom = new DOMDocument();
$dom->loadHTML($text);
$xpath = new DOMXpath($dom);
$nodes = $xpath->query('//b');
Here you can either loop through each one, or if you know there is only one, just grab the value.
$text1 = $nodes->item(0)->nodeValue;
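The looping variant could look roughly like this. The sample string and the $found array are assumptions for illustration, and the @ is only there to silence warnings libxml may raise on sloppy markup:

```php
<?php
// Hypothetical sample input; any HTML fragment would do.
$text = 'if <b>you see this</b> and <b>never forget</b> blah';

$dom = new DOMDocument();
// Suppress warnings that libxml may raise on fragments or invalid markup.
@$dom->loadHTML($text);

$xpath = new DOMXPath($dom);
$found = array();
foreach ($xpath->query('//b') as $node) {
    $found[] = $node->nodeValue; // text content of each <b> element
}
print_r($found);
```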

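strip_tags($text, '<b>');
removes every tag except <b> and </b> but keeps all of the surrounding text, so on its own it does not extract the part between <b> and </b>.
Use it only if that is the behavior you are looking for.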

Related

php preg_match_all() how to get correct values in match-array

The following situation:
$text = "This is some <span class='classname'>example</span> text i'm writing to
demonstrate the <span class='classname otherclass'>problem</span> of this.<br />";
preg_match_all("|<[^>/]*(classname)(.+)>(.*)</[^>]+>|U", $text, $matches, PREG_PATTERN_ORDER);
I need an array ($matches) that holds "<span class='classname'>example</span>" in one field and "example" in another.
But what I get here is one field with "<span class='classname'>example</span>" and one with "classname".
It should also contain the values for the other matches, of course.
How can I get the right values?
You would be better off with a DOM parser, however this question is more to do with how capturing works in Regexes in general.
The reason you are getting classname as a match is because you are capturing it by putting () around it. They are completely unnecessary so you can just remove them. Similarly, you don't need them around .+ since you don't want to capture that.
If you had some group that you had to enclose in () as grouping rather than capturing, start the group with ?: and it won't be captured.
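As a rough illustration of the difference, here is a sketch that uses a non-capturing group so that only the inner text ends up in a numbered group. The pattern is my own simplification, not the asker's, and it assumes the spans are not nested and use single-quoted attributes:

```php
<?php
$text = "This is some <span class='classname'>example</span> text to "
      . "demonstrate the <span class='classname otherclass'>problem</span> of this.";

// (?:...) groups without capturing, so the only numbered group is the inner text.
// [^'] and [^<] keep each match inside a single tag.
$pattern = "~<span class='(?:classname)[^']*'>([^<]*)</span>~";

preg_match_all($pattern, $text, $matches, PREG_PATTERN_ORDER);
print_r($matches);
```

With PREG_PATTERN_ORDER, $matches[0] contains the full span elements and $matches[1] the text inside them.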
The safe/easy way:
$text = 'blah blah blah';
$dom = new DOMDocument();
$dom->loadHTML($text);
$xp = new DOMXPath($dom);
$nodes = $xp->query("//span[@class='classname']");
foreach($nodes as $node) {
    $innertext = $node->nodeValue;
    $html = $dom->saveHTML($node); // outer HTML (PHP 5.3.6+); see also http://stackoverflow.com/questions/2087103/innerhtml-in-phps-domdocument
}

Get the integer following part of a string?

I have a bunch of strings that may or may not have a substring similar to the following:
<a class="tag" href="http://www.yahoo.com/5"> blah blah ...</a>
I'm trying to retrieve the '5' at the end of the link (it isn't necessarily a one-digit number; it can be huge). But this string will vary. The text before and after the link will always be different. The only things that stay the same are the <a class="tag" href="http://www.yahoo.com/ prefix and the closing </a>.
You can do it using preg_match_all and <a class="tag" href="http:\/\/(.*)\/(\d+)"> regular expression.
Give parse_url() a try. Should be easy from there.
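A minimal sketch of the parse_url() suggestion, assuming the href value has already been pulled out of the markup (the $href variable is hypothetical):

```php
<?php
$href = 'http://www.yahoo.com/5';

// parse_url() with PHP_URL_PATH returns just the path component, here "/5".
$path = parse_url($href, PHP_URL_PATH);

// basename() returns the last path segment; cast it if an integer is needed.
$id = (int) basename($path);

var_dump($id);
```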
As you only need to retrieve the 5, it's pretty straightforward:
$r = preg_match_all('~\/(\d+)"~', $subject, $matches);
The number is then in the first capturing group, $matches[1].
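If the strings might contain other numbers followed by a double quote, a stricter pattern anchored on the fixed part of the link may be safer. This is a sketch, and the second link in the sample is made up for illustration:

```php
<?php
$subject = 'text <a class="tag" href="http://www.yahoo.com/5"> blah blah ...</a> more '
         . '<a class="tag" href="http://www.yahoo.com/1234"> x</a>';

// Match only links that start with the fixed prefix and capture the trailing digits.
preg_match_all('~<a class="tag" href="http://www\.yahoo\.com/(\d+)">~', $subject, $matches);

// Convert the captured digit strings to integers.
$ids = array_map('intval', $matches[1]);
print_r($ids);
```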
If you need more information like the link text, I would suggest you to use a HTML Parser for that:
require('Net/URL2.php');
$doc = new DOMDocument();
$doc->loadHTML('<a class="tag" href="http://www.yahoo.com/5"> blah blah ...</a>');
foreach ($doc->getElementsByTagName('a') as $link)
{
$url = new Net_URL2($link->getAttribute('href'));
if ($url->getHost() === 'www.yahoo.com') {
$path = $url->getPath();
printf("%s (from %s)\n", basename($path), $url);
}
}
Example Output:
5 (from http://www.yahoo.com/5)
I would go with "basename":
// prints passwd
print basename("/etc/passwd");
And to get the link you could use:
$xml = simplexml_load_string( '<a class="tag" href="http://www.yahoo.com/5"> blah blah ...</a>' );
$attr = $xml->attributes();
print $attr['href'];
And finally: If you don't know the whole structure of the string, use this:
$dom = new DOMDocument;
$dom->loadHTML( '<a class="tag" href="http://www.yahoo.com/5"> blah blah ...</a>asasasa<a class="tag" href="http://www.yahoo.com/6"> blah blah ...</a>' );
$nodes = $dom->getElementsByTagName('a');
foreach ($nodes as $node) {
print $node->getAttribute('href');
print basename( $node->getAttribute('href') );
}
This approach also copes with invalid HTML, because loadHTML() repairs the markup while parsing it.

How to remove link with preg_replace();?

I'm not sure how to explain this, so I'll show it on my code.
First and
Second and
Third
How can I delete the opening and closing <a> tags but not the rest?
I'm asking about preg_replace() specifically; I'm not looking for DOMDocument or other methods. I just want to see an example using preg_replace().
How is it achievable?
Only pick the groups you want to preserve:
$pattern = '~(<a[^>]*>)([^<]*)(</a>)~';
// 1: opening tag, 2: link text, 3: closing tag
$result = preg_replace($pattern, '$2', $subject);
You can find more examples on the preg_replace manual page.
Since you asked me in the comments to show any method of doing this, here it is.
$html =<<<HTML
First and
Second and
Third
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$elems = $xpath->query("//a[@class='delete']");
foreach ($elems as $elem) {
$elem->parentNode->removeChild($elem);
}
echo $dom->saveHTML();
Note that saveHTML() saves a complete document even if you only parsed a fragment.
As of PHP 5.3.6 you can add a $node parameter to specify the fragment it should return - something like $xpath->query("/*/body")[0] would work.
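One way to get only the fragment back is to serialize the children of <body> one by one. This is a sketch under the assumption that the input is a simple fragment like the hypothetical one below:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<p>First and <a class="delete" href="url2.html">Second</a> and Third</p>');

// loadHTML() wraps the fragment in <html><body>; fetch the <body> element.
$body = $dom->getElementsByTagName('body')->item(0);

// Serialize each child of <body> so the wrapper elements are left out.
$fragment = '';
foreach ($body->childNodes as $child) {
    $fragment .= $dom->saveHTML($child);
}
echo $fragment;
```

Passing a node to saveHTML() needs PHP 5.3.6 or later, as noted above.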
$pattern = '/<a (.*?)href=[\"\'](.*?)\/\/(.*?)[\"\'](.*?)>(.*?)<\/a>/i';
$new_content = preg_replace($pattern, '$5', $content);
$pattern = '/<a[^<>]*?class="delete"[^<>]*?>(.*?)<\/a>/';
$test = 'First and Second and Third';
echo preg_replace($pattern, '$1', $test)."\n";
$test = 'First and <b class="delete">seriously</b> and Third';
echo preg_replace($pattern, '$1', $test)."\n";
$test = 'First and <a class="delete" href="url2.html">Second</a> and Third';
echo preg_replace($pattern, '$1', $test)."\n";
preg_replace('#(.+?)#', '$1', $html_string);
It is important to understand this is not an ideal solution. First, it requires markup in this exact format. Second, if there were, say, a nested anchor tag (albeit unlikely) this would fail. These are some of the many reasons why Regular Expressions should not be used for parsing/manipulating HTML.

Remove all text within specific tags

I am interested in removing all the text within the following tags:
<p class="wp-caption-text">Remove this text</p>
Can anybody give me an idea of how this can be done in php?
Thank you very much
Get rid of the tag and content inside of it:
$content = preg_replace('/<p\sclass=\"wp\-caption\-text\">[^<]+<\/p>/i', '', $content);
or if you want to preserve the tags:
$content = preg_replace('/(<p\sclass=\"wp\-caption\-text\">)[^<]+(<\/p>)/i', '$1$2', $content);
As a slightly higher-level alternative to regular expressions, you can process the markup with DOM.
You can match all the nodes you're looking for with the XPath //p[@class="wp-caption-text"].
For example:
$doc = new DOMDocument();
$doc->loadHTML($yourHTMLasString);
$xpath = new DOMXPath($doc);
$query = '//p[@class="wp-caption-text"]';
$entries = $xpath->query($query);
foreach ($entries as $entry) {
$entry->textContent = '';
}
echo $doc->saveHTML();
Try this:
$string = '<p class="wp-caption-text">Remove this text</p>';
$pattern = '/(.*<p .*>).*(<\/p>.*)/';
$replacement = '$1$2';
echo preg_replace($pattern, $replacement, $string);
If it's always the same tag, you could simply search for that string, then use the resulting position to take the substring from there up to the closing tag.
Or you could use a regular expression; there are good ones posted here that can help you.
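The plain string-search idea could be sketched like this. The sample $content and the variable names are assumptions, and only the first occurrence is handled:

```php
<?php
$content  = 'Before <p class="wp-caption-text">Remove this text</p> after';
$openTag  = '<p class="wp-caption-text">';
$closeTag = '</p>';

$start = strpos($content, $openTag);
if ($start !== false) {
    $textStart = $start + strlen($openTag);         // first character after the opening tag
    $end = strpos($content, $closeTag, $textStart); // closing tag that follows it
    if ($end !== false) {
        // Keep everything up to the text and everything from the closing tag on.
        $content = substr($content, 0, $textStart) . substr($content, $end);
    }
}
echo $content;
```

This keeps the empty tag pair in place. To drop the tags as well, cut from $start to $end + strlen($closeTag) instead.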

Using regex to remove HTML tags

I need to convert
$text = 'We had <i>fun</i>. Look at this photo of Joe';
[Edit] There could be multiple links in the text.
to
$text = 'We had fun. Look at this photo (http://example.com) of Joe';
All HTML tags are to be removed and the href value from <a> tags needs to be added like above.
What would be an efficient way to solve this with regex? Any code snippet would be great.
First do a preg_replace to keep the link. You could use:
preg_replace('#<a[^>]*href="(.*?)"[^>]*>(.*?)</a>#i', '$2 ($1)', $str);
Then use strip_tags which will finish off the rest of the tags.
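Put together, the two steps might look like the sketch below. The link markup in the sample string is an assumption modeled on the expected output, and the pattern only handles double-quoted href attributes:

```php
<?php
// Hypothetical input; the <a> element is assumed from the desired result.
$str = 'We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe';

// Step 1: rewrite each link as "text (href)". Group 1 is the href, group 2 the link text.
$str = preg_replace('#<a\s[^>]*href="([^"]*)"[^>]*>(.*?)</a>#is', '$2 ($1)', $str);

// Step 2: strip_tags() removes whatever tags remain.
$text = strip_tags($str);

echo $text;
```

Because preg_replace replaces every match, multiple links in the text are handled in the same pass.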
Try an XML parser to replace any tag with its inner HTML and the a tags with their href attribute.
http://www.php.net/manual/en/book.domxml.php
The DOM solution:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//a[@href]') as $node) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
echo strip_tags($dom->saveHTML());
and the same without XPath:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $node) {
if($node->hasAttribute('href')) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
}
echo strip_tags($dom->saveHTML());
All it does is load any HTML into a DOMDocument instance. In the first case it uses an XPath expression, which is kinda like SQL for XML, and gets all links with an href attribute. It then creates a text node from the link's text and its href attribute and replaces the link with it. The second version just uses the DOM API and no XPath.
Yes, it's a few lines more than Regex but this is clean and easy to understand and it won't give you any headaches when you need to add additional logic.
I've done things like this using variations of substring and replace. I'd probably use regex today but you wanted an alternative so:
For the <i> tags, I'd do something like:
$text = str_replace("<i>", "", $text);
$text = str_replace("</i>", "", $text);
(My PHP is really rusty; the function is str_replace, which takes the search string first and the subject last. The idea is what I'm sharing.)
The <a> tag is a bit more tricky. But, it can be done. You need to find the point that <a starts and that the > ends with. Then you extract the entire length and replace the closing </a>
That might go something like:
$start = strpos($text, "<a"); // where the opening <a ...> tag starts
if ($start !== false) {
    $end = strpos($text, ">", $start); // where the opening tag ends
    $text = substr($text, 0, $start) . substr($text, $end + 1); // cut the opening tag out
    $text = str_replace("</a>", "", $text); // drop the closing tag
}
(Again, the idea is what I want to communicate. I hope the code fragments help, but they are untested and handle only the first link. There are also a lot of possible bugs in the snippets depending on your exact implementation and environment.)
Reference:
strpos - http://www.php.net/manual/en/function.strpos.php
str_replace - http://www.php.net/manual/en/function.str-replace.php
substr - http://php.net/manual/en/function.substr.php
It's also very easy to do with a parser:
# available from http://simplehtmldom.sourceforge.net
include('simple_html_dom.php');
# parse and echo
$html = str_get_html('We had <i>fun</i>. Look at this photo of Joe');
$a = $html->find('a');
$a[0]->outertext = "{$a[0]->innertext} ( {$a[0]->href} )";
echo strip_tags($html);
And that produces the code you want in your test case.
