I'm using PHP to get content from an external website.
I want to know if it's possible to find and replace strings from the output so I can make all links absolute.
I need to convert "/ and '/ to "$url/
If it's possible to do that, I can figure out how to do the rest. I don't know if it's possible though.
Thanks
For simple string replacement, use str_replace(), e.g.:
$html = str_replace(array("'/", '"/'), array("'$url/", '"' . $url . '/'), $html);
If you're after a more robust solution, I'd suggest loading the HTML string into a DOMDocument, looping over all the tags whose href starts with /, and changing the attribute of each before writing out the HTML.
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$anchors = $xpath->query('//*[starts-with(@href, "/")]');
foreach ($anchors as $anchor) {
$href = $anchor->getAttribute('href');
$anchor->setAttribute('href', $url . $href);
}
$html = $doc->saveHTML();
You'll probably want to do the same for tags with src attributes.
You could also use preg_replace(), though the DOMDocument parsing is the most robust.
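As a rough sketch of the DOMDocument approach extended to src attributes: the helper name makeLinksAbsolute and the sample markup are made up for illustration, and skipping protocol-relative URLs (//host/...) is an assumption about what you want.

```php
<?php
// Hypothetical helper: prefixes root-relative href/src values with $url.
function makeLinksAbsolute(string $html, string $url): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // real-world HTML is rarely valid
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors(false);

    $xpath = new DOMXPath($doc);
    // Attribute nodes whose value starts with "/" but not "//".
    $query = '//@href[starts-with(., "/") and not(starts-with(., "//"))]'
           . ' | //@src[starts-with(., "/") and not(starts-with(., "//"))]';
    foreach ($xpath->query($query) as $attr) {
        // Avoid a double slash if $url already ends with one.
        $attr->value = rtrim($url, '/') . $attr->value;
    }
    return $doc->saveHTML();
}

$html = '<p><a href="/about">About</a> <img src="/logo.png" alt=""> '
      . '<a href="http://other.example/">Elsewhere</a></p>';
echo makeLinksAbsolute($html, 'http://example.com');
```

Note that saveHTML() returns a full document, so the result is wrapped in html and body tags even if the input was only a fragment.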
Goal: Modify an HTML string that contains apostrophes (backticks) so that the text they wrap becomes inline code (like Stack Overflow does). At the same time, existing <code> blocks can also contain apostrophes, and those should stay unchanged.
Example:
<p>This is my `inline code`, it can be replaced and tag-wrapped.</p>
<p><code>This text contains `apostrophs`, but should `not` be changed.</code></p>
This is the regex I am using to convert all wrapping apostrophes to <code> elements:
// replace wrapping apostrophes with a <code> tag
$content = preg_replace('/(.+?)\`(.+?)\`/', '$1<code class="inlinecode">$2</code>', $content);
Required:
Change the regex so that it does not convert an apostrophe if it is within a <code> block.
Disclaimer: I tried for several hours to read the HTML string, use PHP's DOM parser, extract all nodes of type code, change their content, write them back, then found out that nodeValue is removing all HTML tags (especially the line breaks). Then tried several solutions found online, still not working... Now I am falling back to regex, even against the odds.
FYI, how I tried it the DOM way:
$code_blocks = $dom->getElementsByTagName('code');
foreach($code_blocks as $codenode) {
// nodeValue strips HTML tags, we need to hack
$nodevalue_html = $codenode->ownerDocument->saveXML($codenode);
// replace, i.e. custom-store each apostroph with '~~~APO~~~' so that they survive
$nodevalue_html = preg_replace('/`/', '~~~APO~~~', $nodevalue_html);
// $codenode->textValue = $nodevalue_html; // fail
// $codenode->nodeValue = $nodevalue_html; // fail
// ...
}
// html to string
$html_new = $dom->saveHTML();
$html_new = preg_replace('/~~~APO~~~/', '`', $html_new);
I wish I could use Markdown like Stack Overflow, but I still need to deal with HTML.
Using an XPath query to avoid text nodes that have a code element as ancestor:
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$xp = new DOMXPath($dom);
$textNodes = $xp->query('//text()[not(ancestor::code)][contains(.,"`")]');
foreach ($textNodes as $textNode) {
$parts = (function($text) { yield from explode('`', $text); })($textNode->nodeValue);
$frag = $dom->createDocumentFragment();
do {
$frag->appendChild($dom->createTextNode($parts->current()));
$parts->next();
if ( $parts->valid() ) {
$codeElt = $dom->createElement('code');
$codeElt->appendChild($dom->createTextNode($parts->current()));
$frag->appendChild($codeElt);
$parts->next();
}
} while ($parts->valid());
$textNode->parentNode->replaceChild($frag, $textNode);
}
echo $dom->saveHTML();
I believe the only way is to explode and reassemble the string:
$html_string = '....................'; // contains apostrophes and <code>...</code> blocks
$delim = "<code>";
$closing_tag = "</code>";
$explode = explode($delim, $html_string);
foreach($explode as &$ex) {
$closing_tag_pos = strpos($ex, $closing_tag);
if ($closing_tag_pos !== false) {
$pre_closing_tag = substr($ex, 0, $closing_tag_pos);
$post_closing_tag = substr($ex, $closing_tag_pos);
$ex = preg_replace('/`/', '~~~APO~~~', $pre_closing_tag) . $post_closing_tag; // protect only the text inside <code>
}
}
$mapped_html_string = implode($delim, $explode);
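To round this off, here is a minimal sketch of the whole protect, convert, restore cycle. It swaps the explode loop for preg_replace_callback, assumes <code> blocks are not nested, and reuses the ~~~APO~~~ placeholder from the question; the function name wrapInlineCode is made up.

```php
<?php
// Hypothetical helper: wraps `inline code` in <code>, leaving existing <code> blocks alone.
function wrapInlineCode(string $html): string
{
    // 1. Hide backticks inside existing <code>...</code> blocks behind a placeholder.
    $protected = preg_replace_callback(
        '~<code\b[^>]*>.*?</code>~is',
        function (array $m): string {
            return str_replace('`', '~~~APO~~~', $m[0]);
        },
        $html
    );
    // 2. Convert the remaining backtick pairs to <code> elements.
    $converted = preg_replace('/`([^`]+)`/', '<code class="inlinecode">$1</code>', $protected);
    // 3. Put the protected backticks back.
    return str_replace('~~~APO~~~', '`', $converted);
}

$html = '<p>This is my `inline code`.</p>'
      . '<p><code>Keep `these` as they are.</code></p>';
echo wrapInlineCode($html);
```

The placeholder only has to be a string that cannot occur in the real content, so pick something more unusual if ~~~APO~~~ could plausibly appear in your text.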
I would like to get back the number which is between span HTML tags. The number may change!
<span class="topic-count">
::before
"
24
"
::after
</span>
I've tried the following code:
preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);
But it doesn't work.
Entire code:
$result=array();
$page = 201;
while ($page>=1) {
$source = file_get_contents ("http://www.jeuxvideo.com/forums/0-27047-0-1-0-".$page."-0-counter-strike-global-offensive.htm");
preg_match_all("#<span class=\"topic-count\">(.*?)</span>#", $source, $nombre[$i]);
$result = array_merge($result, $nombre[$i][1]);
print("Page : ".$page ."\n");
$page-=25;
}
print_r ($nombre);
You can do this with
preg_match_all(
'#<span class="topic-count">[^\d]*(\d+)[^\d]*?</span>#s',
$html,
$matches
);
which would capture any digits before the end of the span.
However, note that this regex will only work for exactly this piece of html. If there is a slight variation in the markup, for instance, another class or another attribute, the pattern will not work anymore. Writing reliable regexes for HTML is hard.
Hence the recommendation to use a DOM parser instead, e.g.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.jeuxvideo.com/forums/0-27047-0-1-0-1-0-counter-strike-global-offensive.htm');
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
foreach ($xpath->evaluate('//span[contains(@class, "topic-count")]') as $node) {
if (preg_match_all('#\d+#s', $node->nodeValue, $topics)) {
echo $topics[0][0], PHP_EOL;
}
}
DOM will parse the entire page into a tree of nodes, which you can then query conveniently via XPath. Note the expression
//span[contains(@class, "topic-count")]
which will give you all the span elements with a class attribute containing the string topic-count. Then if any of these nodes contain a digit, echo it.
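One caveat with contains() is that it matches substrings, so a class such as topic-count-label would be picked up too. The sketch below uses a stricter, commonly used XPath idiom that matches topic-count as a whole class token. The sample markup is made up and only stands in for the fetched page.

```php
<?php
// Sample markup standing in for the fetched page (made up for illustration).
$html = '<div><span class="topic-count"> 24 </span>'
      . '<span class="topic-count-label">topics</span>'
      . '<span class="badge topic-count">1 337</span></div>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);
// Match "topic-count" as a whole class token, not as a substring.
$query = '//span[contains(concat(" ", normalize-space(@class), " "), " topic-count ")]';

$counts = [];
foreach ($xpath->query($query) as $node) {
    // Keep digits only, so "1 337" becomes 1337.
    $counts[] = (int) preg_replace('/\D+/', '', $node->nodeValue);
}
print_r($counts);
```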
I have been trying to get the inner text of an HTML tag from a URL (defimedia.info), but I get only one result. The code I tried is:
$html = file_get_contents("http://www.defimedia.info");
preg_match("'<h3>(.*?)<h3>'si", $html, $match);
echo($match[1]);
Even when I try to use foreach or $match[2], it does not work. Any help would certainly be appreciated.
You need the preg_match_all function, documented here: http://php.net/manual/en/function.preg-match-all.php
Try it like this:
<?php
$html = file_get_contents("http://www.defimedia.info");
preg_match_all('/<h3>(.*?)<\/h3>/si', $html, $match);
print_r($match);
?>
Regex is not the correct tool for parsing HTML/XML; you can use DOMDocument instead.
You can use DOMDocument like this:
$html = file_get_contents("http://www.defimedia.info");
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_use_internal_errors(false);
$h3s = $dom->getElementsByTagName('h3');
foreach ($h3s as $h3) {
echo $h3->nodeValue."<br>";
}
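Why did I use libxml_use_internal_errors(true)? Real-world pages are rarely valid HTML, and loadHTML() raises a PHP warning for every problem it finds; switching libxml to internal error handling keeps those warnings out of the output.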
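A small sketch of what that switch does, using deliberately sloppy markup that is made up for illustration. The collected errors can still be inspected with libxml_get_errors() before they are cleared; how many are reported depends on the libxml version.

```php
<?php
// Deliberately sloppy markup (made up): an unknown tag and an unclosed <h3>.
$html = '<h3>First</h3><foo>bar</foo><h3>Second';

libxml_use_internal_errors(true);      // collect parse errors instead of emitting warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
$errors = libxml_get_errors();         // array of LibXMLError objects
libxml_clear_errors();                 // free the collected errors
libxml_use_internal_errors(false);

$titles = [];
foreach ($dom->getElementsByTagName('h3') as $h3) {
    $titles[] = trim($h3->nodeValue);
}
print_r($titles);
echo count($errors), " parse problem(s) were collected silently\n";
```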
I need to find and replace all text matches in a case insensitive way, unless the text is within an anchor tag - for example:
<p>Match this text and replace it</p>
<p><a href="#">Don't match this text</a></p>
<p>We still need to match this text and replace it</p>
Searching for 'match this text' would only replace the first instance and last instance.
[Edit] As per Gordon's comment, it may be preferred to use DOMDocument in this instance. I'm not at all familiar with the DOMDocument extension, and would really appreciate some basic examples for this functionality.
Here is an UTF-8 safe solution, which not only works with properly formatted documents, but also with document fragments.
The mb_convert_encoding call is needed because loadHtml() seems to have a bug with UTF-8 encoding (see references 3 and 4 below).
The mb_substr is trimming the body tag from the output, this way you get back your original content without any additional markup.
<?php
$html = '<p>Match this text and replace it</p>
<p><a href="#">Don\'t match this text</a></p>
<p>We still need to match this text and replace itŐŰ</p>
<p><a href="#">This is a link <span>with <strong>don\'t match this text</strong> content</span></a></p>';
$dom = new DOMDocument();
// loadXml needs properly formatted documents, so it's better to use loadHtml, but it needs a hack to properly handle UTF-8 encoding
$dom->loadHtml(mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8"));
$xpath = new DOMXPath($dom);
foreach($xpath->query('//text()[not(ancestor::a)]') as $node)
{
$replaced = str_ireplace('match this text', 'MATCH', $node->wholeText);
$newNode = $dom->createDocumentFragment();
$newNode->appendXML($replaced);
$node->parentNode->replaceChild($newNode, $node);
}
// get only the body tag with its contents, then trim the body tag itself to get only the original content
echo mb_substr($dom->saveXML($xpath->query('//body')->item(0)), 6, -7, "UTF-8");
References:
1. find and replace keywords by hyperlinks in an html fragment, via php dom
2. Regex / DOMDocument - match and replace text not in a link
3. php problem with russian language
4. Why Does DOM Change Encoding?
I read dozens of answers in the subject, so I am sorry if I forgot somebody (please comment it and I will add yours as well in this case).
Thanks to Gordon and stillstanding for commenting on my other answer.
Try this one:
$dom = new DOMDocument;
$dom->loadHTML($html_content);
function preg_replace_dom($regex, $replacement, DOMNode $dom, array $excludeParents = array()) {
if (!empty($dom->childNodes)) {
foreach ($dom->childNodes as $node) {
if ($node instanceof DOMText &&
!in_array($node->parentNode->nodeName, $excludeParents))
{
$node->nodeValue = preg_replace($regex, $replacement, $node->nodeValue);
}
else
{
preg_replace_dom($regex, $replacement, $node, $excludeParents);
}
}
}
}
preg_replace_dom('/match this text/i', 'IT WORKS', $dom->documentElement, array('a'));
This is the stackless non-recursive approach using pre-order traversal of the DOM tree.
libxml_use_internal_errors(TRUE);
$dom=new DOMDocument('1.0','UTF-8');
$dom->substituteEntities=FALSE;
$dom->recover=TRUE;
$dom->strictErrorChecking=FALSE;
$dom->loadHTMLFile($file);
$root=$dom->documentElement;
$node=$root;
$flag=FALSE;
for (;;) {
if (!$flag) {
if ($node->nodeType==XML_TEXT_NODE &&
$node->parentNode->tagName!='a') {
$node->nodeValue=preg_replace(
'/match this text/is',
$replacement, $node->nodeValue
);
}
if ($node->firstChild) {
$node=$node->firstChild;
continue;
}
}
if ($node->isSameNode($root)) break;
if ($flag=$node->nextSibling)
$node=$node->nextSibling;
else
$node=$node->parentNode;
}
echo $dom->saveHTML();
libxml_use_internal_errors(TRUE); and the 3 lines of code after $dom=new DOMDocument; should be able to handle any malformed HTML.
$a='<p>Match this text and replace it</p>
<p><a href="#">Don\'t match this text</a></p>
<p>We still need to match this text and replace it</p>';
echo preg_replace('~match this text(?![^<]*</a>)~i','replacement',$a);
The negative lookahead ensures the replacement happens only if the next tag is not a closing link tag (</a>). It works fine with your example, though it won't work if you happen to use other tags inside your links.
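To make that limitation concrete, here is a small sketch with made-up sample markup: the second paragraph is protected, but the third is not, because the tag following the match is </b> rather than </a>.

```php
<?php
// Made-up sample: a plain paragraph, a simple link, and a link with a nested tag.
$a = '<p>Match this text</p>'
   . '<p><a href="#">Don\'t match this text</a></p>'
   . '<p><a href="#">Don\'t <b>match this text</b> either</a></p>';

// Replace only when the text up to the next tag is not followed by </a>.
$result = preg_replace('~match this text(?![^<]*</a>)~i', 'X', $a);
echo $result;
```

If your links can contain nested markup, a DOM-based approach that checks for an a ancestor is the safer choice.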
You can use PHP Simple HTML DOM Parser. It is similar to DOMDocument, but in my opinion it's simpler to use.
Here is the alternative in parallel with Netcoder's DomDocument solution:
function replaceWithSimpleHtmlDom($html_content, $search, $replace, $excludedParents = array()) {
require_once('simple_html_dom.php');
$html = str_get_html($html_content);
foreach ($html->find('text') as $element) {
if (!in_array($element->parent()->tag, $excludedParents))
$element->innertext = str_ireplace($search, $replace, $element->innertext);
}
return (string)$html;
}
I have just profiled this code against my DomDocument solution (which prints the exact same output), and the DomDocument version is (not surprisingly) way faster (~4ms against ~77ms).
<?php
$a = '<p>Match this text and replace it</p>
<p><a href="#">Don\'t match this text</a></p>
<p>We still need to match this text and replace it</p>
';
$res = preg_replace("#[^<a.*>]match this text#",'replacement',$a);
echo $res;
?>
This way works. I hope you really want a case-sensitive match, because this only matches the lowercase text.
HTML parsing with regexs is a huge challenge, and they can very easily end up getting too complex and taking up loads of memory. I would say the best way is to do this:
$html = preg_replace('/match this text/i', 'replacement text', $html);
$html = preg_replace('/(<a[^>]*>[^(<\/a)]*)replacement text(.*?<\/a)/is', '$1match this text$2', $html);
If your replacement text is something which might occur otherwise, you might want to add an intermediate step with some unique identifier.
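A sketch of the unique-identifier idea, in a slightly different order: instead of replacing everywhere and undoing it inside links, it parks each link behind a unique token first, replaces in what is left, and then restores the links. The function name and token format are made up, and it assumes links are not nested.

```php
<?php
// Hypothetical helper: case-insensitive replace everywhere except inside <a>...</a>.
function replaceOutsideLinks(string $html, string $search, string $replacement): string
{
    $kept = [];   // token => original link markup
    // 1. Swap every link for a unique token that cannot occur in normal HTML.
    $html = preg_replace_callback(
        '~<a\b[^>]*>.*?</a>~is',
        function (array $m) use (&$kept): string {
            $token = "\x00LINK" . count($kept) . "\x00";
            $kept[$token] = $m[0];
            return $token;
        },
        $html
    );
    // 2. Replace in the remaining text.
    $html = str_ireplace($search, $replacement, $html);
    // 3. Restore the untouched links.
    return strtr($html, $kept);
}

$html = '<p>Match this text</p>'
      . '<p><a href="#">Don\'t match this text</a></p>'
      . '<p>match this text too</p>';
echo replaceOutsideLinks($html, 'match this text', 'MATCH');
```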
I need to convert
$text = 'We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe';
to
$text = 'We had fun. Look at this photo (http://example.com) of Joe';
[Edit] There could be multiple links in the text.
All HTML tags are to be removed and the href value from <a> tags needs to be added like above.
What would be an efficient way to solve this with regex? Any code snippet would be great.
First do a preg_replace to keep the link. You could use:
$str = preg_replace('#<a href="(.*?)">(.*?)</a>#', '$2 ($1)', $str);
Then use strip_tags which will finish off the rest of the tags.
Try an XML parser to replace any tag with its inner HTML and the a tags with their href attribute.
http://www.php.net/manual/en/book.domxml.php
The DOM solution:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query('//a[@href]') as $node) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
echo strip_tags($dom->saveHTML());
and the same without XPath:
$dom = new DOMDocument;
$dom->loadHTML($html);
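// getElementsByTagName() returns a live list, so copy it before replacing nodes
foreach(iterator_to_array($dom->getElementsByTagName('a')) as $node) {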
if($node->hasAttribute('href')) {
$textNode = new DOMText(sprintf('%s (%s)',
$node->nodeValue, $node->getAttribute('href')));
$node->parentNode->replaceChild($textNode, $node);
}
}
echo strip_tags($dom->saveHTML());
All it does is load any HTML into a DomDocument instance. In the first case it uses an XPath expression, which is kinda like SQL for XML, and gets all links with an href attribute. It then creates a text node element from the innerHTML and the href attribute and replaces the link. The second version just uses the DOM API and no Xpath.
Yes, it's a few lines more than Regex but this is clean and easy to understand and it won't give you any headaches when you need to add additional logic.
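Since the question mentions multiple links, here is a sketch of the XPath variant wrapped in a function and run against two links. The function name, the wrapping div, and the second URL are made up for illustration, and it reads textContent instead of calling strip_tags() on the serialized document.

```php
<?php
// Hypothetical helper: strips all tags and appends each link's href in parentheses.
function linksToText(string $html): string
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    // Wrap the fragment so there is a single element to read the text from.
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_use_internal_errors(false);

    $xpath = new DOMXPath($dom);
    // XPath results are a static list, so replacing nodes while looping is safe.
    foreach ($xpath->query('//a[@href]') as $a) {
        $text = sprintf('%s (%s)', $a->textContent, $a->getAttribute('href'));
        $a->parentNode->replaceChild($dom->createTextNode($text), $a);
    }
    // textContent of the wrapper drops every remaining tag.
    return $dom->getElementsByTagName('div')->item(0)->textContent;
}

echo linksToText(
    'We had <i>fun</i>. Look at <a href="http://example.com">this photo</a>'
    . ' of <a href="http://example.org/joe">Joe</a>'
);
```

For non-ASCII input you would still need one of the usual loadHTML() encoding workarounds, which this sketch leaves out.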
I've done things like this using variations of substring and replace. I'd probably use regex today but you wanted an alternative so:
For the <i> tags, I'd do something like:
$text = str_replace("<i>", "", $text);
$text = str_replace("</i>", "", $text);
(My PHP is really rusty, so str_replace may not be exactly right, but the idea is what I'm sharing.)
The <a> tag is a bit more tricky. But, it can be done. You need to find the point that <a starts and that the > ends with. Then you extract the entire length and replace the closing </a>
That might go something like:
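$start = strpos($text, "<a");           // where the opening <a ...> tag starts
$end = strpos($text, ">", $start);      // where that opening tag ends
$text = substr($text, 0, $start) . substr($text, $end + 1); // cut the opening tag out
$text = str_replace("</a>", "", $text); // then drop the closing tag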
(I don't know if this will work, again the idea is what I want to communicate. I hope the code fragments help but they probably don't work "out of the box". There are also a lot of possible bugs in the code snippets depending on your exact implementation and environment)
Reference:
strrpos - http://www.php.net/manual/en/function.strrpos.php
replace - http://www.php.net/manual/en/function.str-replace.php
substr - http://php.net/manual/en/function.substr.php
It's also very easy to do with a parser:
# available from http://simplehtmldom.sourceforge.net
include('simple_html_dom.php');
# parse and echo
$html = str_get_html('We had <i>fun</i>. Look at <a href="http://example.com">this photo</a> of Joe');
$a = $html->find('a');
$a[0]->outertext = "{$a[0]->innertext} ( {$a[0]->href} )";
echo strip_tags($html);
And that produces the code you want in your test case.