Parsing HTML and replacing strings [duplicate] - php

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
I have a large quantity of partial HTML stored in a CMS database.
I'm looking for a way to go through the HTML and find any <a></a> tags that don't have a title and add a title to them based on the contents of the tags.
So if I had <a href="somepage">some text</a> I'd like to modify the tag to look like:
<a title="some text" href="somepage">some text</a>
Some tags already have a title and some anchor tags have nothing between them.
So far I've managed to make some progress with php and regex.
But I can't seem to get the contents of the anchors; it just displays either a 1 or a 0.
<?php
$file = "test.txt";
$handle = fopen("$file", "r");
$theData = fread($handle, filesize($file));
$line = explode("\r\n", $theData);
$regex = '/^.*<a ((?!title).)*$/'; //finds all lines that don't contain an anchor with a title
$regex2 = '/<a .*><\/a>/'; //finds all lines that have nothing between the anchors
$regex3 = '/<a.*?>(.+?)<\/a>/'; //finds the contents of the anchors
foreach ($line as $lines)
{
if (!preg_match($regex2, $lines) && preg_match($regex, $lines)){
$tags = $lines;
$contents = preg_match($regex3, $tags);
$replaced = str_replace("<a ", "<a title=\"$contents\" ", $lines);
echo $replaced ."\r\n";
}
else {
echo $lines. "\r\n";
}
}
?>
I understand regex is probably not the best way to parse HTML, so any help or alternative suggestions would be greatly appreciated.
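On the 1-or-0 symptom: preg_match returns the number of matches (1 or 0, or false on error), not the matched text. The captured groups are written to an optional third argument. A minimal sketch, using a made-up sample line in place of a line from test.txt:

```php
<?php
// Hypothetical sample line; the real input comes from test.txt in the question.
$line = '<p><a href="somepage">some text</a></p>';

// The return value only says whether the pattern matched.
// The captured anchor contents land in $matches[1].
$found = preg_match('/<a.*?>(.+?)<\/a>/', $line, $matches);
$contents = $found ? $matches[1] : '';

// htmlspecialchars() keeps quotes in the text from breaking the attribute.
$replaced = str_replace('<a ', '<a title="' . htmlspecialchars($contents) . '" ', $line);
echo $replaced, "\n";
```

This only explains the symptom; it does not make the regex approach robust against anchors that span lines or already carry other attributes in unexpected forms.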

Use PHP's built-in DOM parsing. Much more reliable than regex. Be aware that loading HTML into the PHP DOM will normalize it.
$doc = new DOMDocument();
@$doc->loadHTML($html); // suppress parsing errors with @
$links = $doc->getElementsByTagName('a');
foreach ($links as $link) {
if ($link->getAttribute('title') == '') {
$link->setAttribute('title', $link->nodeValue);
}
}
$html = $doc->saveHTML();
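Because the stored HTML is partial, the normalization mentioned above means saveHTML() on the whole document returns it wrapped in a doctype, html and body. One possible way around that, sketched here under the assumption that each stored value is a fragment that can safely sit inside a div (addMissingTitles is a hypothetical helper name, and skipping empty anchors is a choice made for this sketch):

```php
<?php
// Hypothetical helper; assumes $html is a fragment, not a full document.
function addMissingTitles(string $html): string
{
    libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
    $doc = new DOMDocument();
    // Wrap the fragment so it can be pulled back out after parsing.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('a') as $link) {
        $text = trim($link->textContent);
        // Assumption: anchors with nothing between the tags are left alone.
        if ($text !== '' && !$link->hasAttribute('title')) {
            $link->setAttribute('title', $text);
        }
    }

    // Serialize only the wrapper's children, dropping the html/body added by the parser.
    $wrapper = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$out = addMissingTitles('<a href="somepage">some text</a> <a href="other"></a>');
echo $out, "\n";
```

Non-ASCII content may need an explicit encoding hint before loading; that is left out of this sketch.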

If the markup were consistent, you could use a simplistic regex. But it'll fail if your anchors have classes or anything else besides href. Also, it doesn't correctly encode the title= attribute:
$html = preg_replace('#<(a\s+href="[^"]+")>([^<>]+)</a>#ims', '<$1 title="$2">$2</a>', $html);
Therefore phpQuery/querypath is likely the more robust approach:
$html = phpQuery::newDocument($html);
foreach ($html->find("a") as $a) {
if (empty(pq($a)->attr("title"))) { // iteration yields plain DOM nodes, so wrap them with pq()
pq($a)->attr("title", pq($a)->text());
}
}
print $html->getDocument();

Don't use regex to parse HTML. In PHP, use DOM.
Here's a simpler alternative: http://simplehtmldom.sourceforge.net/


Transforming <span style="font-weight:bold">some text</span> into <b>some text</b> in PHP [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
As the title says, if I have some HTML <p><span style="font-style:italic">abcde</span><span style="font-weight:bold">abcde</span></p>, I want to strip the style attributes and turn them into HTML tags, so that it becomes <p><i>abcde</i><b>abcde</b></p>. How can I do that in PHP?
I notice that when I open the html in CKEditor, this kind of transformation is done automatically. But I want to do it in backend PHP. Thanks.
$string = '<p><span style="font-style:italic;font-weight:bold">abcde</span><span style="font-weight:bold">abcde</span></p>';
$dom = new DOMDocument();
$dom->loadHTML($string);
$xp = new DOMXPath($dom);
$str = '';
$results = $xp->query('//span');
foreach ($results as $result) {
    // Split the style attribute into individual declarations
    $style_arr = explode(';', $result->getAttribute('style'));
    $style_template = '%s';
    foreach ($style_arr as $style_item) {
        // Normalize case and spaces so "font-weight: bold" also matches
        $style_item = str_replace(' ', '', strtolower($style_item));
        if ($style_item == 'font-style:italic') {
            $style_template = '<i>'.$style_template.'</i>';
        }
        if ($style_item == 'font-weight:bold') {
            $style_template = '<b>'.$style_template.'</b>';
        }
    }
    // nodeValue is plain text, so escape it before putting it back into markup
    $str .= sprintf($style_template, htmlspecialchars($result->nodeValue));
}
$str = '<p>'.$str.'</p>';
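The snippet above rebuilds the paragraph as a string, which drops any markup nested inside the spans. A variant that edits the tree in place might look like the following. It is a sketch, assuming the fragment has a single root element and that only font-style:italic and font-weight:bold need handling:

```php
<?php
$html = '<p><span style="font-style:italic">abcde</span><span style="font-weight:bold">abcde</span></p>';

$dom = new DOMDocument();
// These flags keep libxml from adding a doctype and html/body wrappers.
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xp = new DOMXPath($dom);

// Copy the result to an array before replacing nodes.
foreach (iterator_to_array($xp->query('//span[@style]')) as $span) {
    $style = strtolower(str_replace(' ', '', $span->getAttribute('style')));
    $tags = [];
    if (strpos($style, 'font-style:italic') !== false) {
        $tags[] = 'i';
    }
    if (strpos($style, 'font-weight:bold') !== false) {
        $tags[] = 'b';
    }
    if (!$tags) {
        continue; // no style we know how to translate
    }
    // Build the nested replacement elements, e.g. <i><b>...</b></i>.
    $outer = $inner = null;
    foreach ($tags as $tag) {
        $el = $dom->createElement($tag);
        if ($inner) {
            $inner->appendChild($el);
        } else {
            $outer = $el;
        }
        $inner = $el;
    }
    // Move the span's children (text and nested markup) into the innermost element.
    while ($span->firstChild) {
        $inner->appendChild($span->firstChild);
    }
    $span->parentNode->replaceChild($outer, $span);
}

$result = trim($dom->saveHTML());
echo $result, "\n";
```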
You can also output HTML from within PHP's opening and closing tags, like this:
<?php
echo "<h1>Here is Heading h1</h1>";
?>
Or put any HTML code in a quoted string after echo, like this:
<?php
echo "Your HTML code here";
?>
$output = preg_replace('/(<[^>]+) style=".*?"/i', '$1', $input);
This matches a < followed by one or more characters other than >, then a space and style="anything". The /i flag makes it match uppercase STYLE as well, and $1 puts back the part of the tag before the attribute; tags without style="" do not match and are left as they are. For single-quoted style='' use this:
(<[^>]+) style=("|').*?("|')
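A runnable sketch of that pattern, with one assumed tweak: the closing quote is written as the backreference \2, so that it must be the same character as the opening quote. The sample input is made up. Note that this removes the style attribute entirely rather than converting it to tags.

```php
<?php
// Hypothetical input covering double quotes, single quotes, and no style at all.
$input = '<p STYLE="color:red">a</p><span style=\'font-weight:bold\'>b</span><i>c</i>';

// \2 refers back to whichever quote character opened the attribute value.
$output = preg_replace('/(<[^>]+) style=("|\').*?\2/i', '$1', $input);

echo $output, "\n";
```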

Replace enclosing backticks with HTML tags but not inside <code> blocks

Goal: modify an HTML string that uses backticks to wrap inline code (as Stack Overflow does), while leaving <code> blocks, which can also contain backticks, unchanged.
Example:
<p>This is my `inline code`, it can be replaced and tag-wrapped.</p>
<p><code>This text contains `apostrophs`, but should `not` be changed.</code></p>
This is the regex I am using to convert all wrapping backticks to <code> elements:
// replace backticks with a wrapping <code> tag
$content = preg_replace('/(.+?)\`(.+?)\`/', '$1<code class="inlinecode">$2</code>', $content);
Required:
Change the regex so that it does not convert a backtick if it is within a <code> block.
Disclaimer: I tried for several hours to read the HTML string, use PHP's DOM parser, extract all nodes of type code, change their content, write them back, then found out that nodeValue is removing all HTML tags (especially the line breaks). Then tried several solutions found online, still not working... Now I am falling back to regex, even against the odds.
FYI, how I tried it the DOM way:
$code_blocks = $dom->getElementsByTagName('code');
foreach($code_blocks as $codenode) {
// nodeValue strips HTML tags, we need to hack
$nodevalue_html = $codenode->ownerDocument->saveXML($codenode);
// replace, i.e. custom-store each apostroph with '~~~APO~~~' so that they survive
$nodevalue_html = preg_replace('/`/', '~~~APO~~~', $nodevalue_html);
// $codenode->textValue = $nodevalue_html; // fail
// $codenode->nodeValue = $nodevalue_html; // fail
// ...
}
// html to string
$html_new = $dom->saveHTML();
$html_new = preg_replace('/~~~APO~~~/', '`', $html_new);
I wish I could use Markdown like Stack Overflow, but I still need to deal with HTML.
Using an XPath query to avoid text nodes that have a code element as ancestor:
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED);
$xp = new DOMXPath($dom);
$textNodes = $xp->query('//text()[not(ancestor::code)][contains(.,"`")]');
foreach ($textNodes as $textNode) {
$parts = (function($text) { yield from explode('`', $text); })($textNode->nodeValue);
$frag = $dom->createDocumentFragment();
do {
$frag->appendChild($dom->createTextNode($parts->current()));
$parts->next();
if ( $parts->valid() ) {
$codeElt = $dom->createElement('code');
$codeElt->appendChild($dom->createTextNode($parts->current()));
$frag->appendChild($codeElt);
$parts->next();
}
} while ($parts->valid());
$textNode->parentNode->replaceChild($frag, $textNode);
}
echo $dom->saveHTML();
I believe the only way is to explode and reassemble the string:
$html_string = '....................'; // contains apostrophes and <code>...</code> blocks
$delim = "<code>";
$closing_tag = "</code>";
$explode = explode($delim, $html_string);
foreach($explode as &$ex) {
$closing_tag_pos = strpos($ex, $closing_tag);
if ($closing_tag_pos !== false) {
$pre_closing_tag = substr($ex, 0, $closing_tag_pos);
$post_closing_tag = substr($ex, $closing_tag_pos);
$ex = str_replace('`', '~~~APO~~~', $pre_closing_tag) . $post_closing_tag; // mask backticks inside the <code> block only
}
}
$mapped_html_string = implode($delim, $explode);
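The same split-and-reassemble idea can be written without placeholders by letting preg_split keep the <code> blocks as separate pieces. This is only a sketch: wrapInlineCode is a hypothetical name, and it assumes <code> elements are not nested.

```php
<?php
// Hypothetical helper; assumes <code> blocks are not nested inside each other.
function wrapInlineCode(string $html): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the captured <code>...</code> blocks are kept,
    // so even indexes hold text outside code blocks and odd indexes hold the blocks.
    $parts = preg_split('#(<code\b[^>]*>.*?</code>)#is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = preg_replace('/`([^`]+)`/', '<code class="inlinecode">$1</code>', $part);
        }
    }
    return implode('', $parts);
}

$content = "<p>This is my `inline code`, it can be replaced and tag-wrapped.</p>\n"
         . "<p><code>This text contains `apostrophs`, but should `not` be changed.</code></p>";
$converted = wrapInlineCode($content);
echo $converted, "\n";
```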

grab text in the middle to a variable [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
PHP DOMDocument - get html source of BODY
I have the following code in a variable and am trying to grab everything in between the body tags (while keeping the p tags etc). What's the best way of doing this?
preg_match
strpos / substr
<head>
<title></title>
</head>
<body>
<p>Services Calls2</p>
</body>
Neither. You can use an XML parser, like DOMDocument:
$dom = new DOMDocument();
$dom->loadHTML($var);
$body = $dom->getElementsByTagName('body')->item(0);
$content = '';
foreach($body->childNodes as $child)
$content .= $dom->saveXML($child);
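For reference, a self-contained variant of this approach using the markup from the question. The one deliberate difference is saveHTML($child) instead of saveXML($child), so the pieces are serialized with HTML rules rather than XML rules; whether that matters depends on the content.

```php
<?php
// The markup from the question, held in a variable.
$var = "<head>\n<title></title>\n</head>\n<body>\n<p>Services Calls2</p>\n</body>";

$dom = new DOMDocument();
$dom->loadHTML($var);
$body = $dom->getElementsByTagName('body')->item(0);

$content = '';
foreach ($body->childNodes as $child) {
    // Serialize each child of <body>, tags included.
    $content .= $dom->saveHTML($child);
}
$content = trim($content);
echo $content, "\n";
```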
Try this, where $html holds the text:
$s = strpos($html, '<body>') + strlen('<body>');
$f = '</body>';
echo trim(substr($html, $s, strpos($html, $f) - $s));
I recommend using preg_match, because the content inside <p>Services Calls2</p> can change, and doing this with substr or strpos would require fairly convoluted code.
Example:
$a = '<h2><p>Services Calls2</p></h2>';
preg_match("/<p>(?:\w|\s|\d)+<\/p>/", $a, $ar);
var_dump($ar);
This regex only allows letters, digits, underscores and whitespace between the tags.

PHP parse HTML tags [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to parse and process HTML with PHP?
I'm pretty new to PHP.
I have the text of a body tag of some page in a string variable.
I'd like to know if it contains some tag <tag1>...</tag1>, where the tag name tag1 is given, and if so, take only that tag from the string.
How can I do that simply in PHP?
Thanks!!
You would be looking at something like this:
<?php
$content = "";
$doc = new DOMDocument();
$doc->loadHTMLFile("example.html"); // load() expects well-formed XML; loadHTMLFile() parses HTML
$items = $doc->getElementsByTagName('tag1');
if($items->length > 0) //Only if tag1 items are found
{
foreach ($items as $tag1)
{
// Do something with $tag1->nodeValue and save your modifications
$content .= $tag1->nodeValue;
}
}
else
{
$content = $doc->saveHTML();
}
echo $content;
?>
DOMDocument represents an entire HTML or XML document and serves as the root of the document tree. So you will have valid markup, and finding elements by tag name won't match anything inside comments.
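Since the question has the body text in a string and wants the tag itself, a string-based variant may be closer to what was asked. A sketch, assuming a made-up sample string and that only the first tag1 element is wanted:

```php
<?php
// Hypothetical sample; tag1 stands for whatever tag name is given.
$body = '<div><tag1>hello <b>world</b></tag1><p>other</p></div>';

libxml_use_internal_errors(true);   // tag1 is not a known HTML element, so libxml warns
$doc = new DOMDocument();
$doc->loadHTML($body);
libxml_clear_errors();

$items = $doc->getElementsByTagName('tag1');
// saveHTML($node) serializes just that node, including its own tags.
$result = $items->length > 0 ? $doc->saveHTML($items->item(0)) : null;

var_dump($result);
```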
Another possibility is regex.
$matches = null;
$returnValue = preg_match_all('#<li.*?>(.*?)</li>#', $html, $matches);
$matches[0][x] contains the whole matches such as <li class="small">list entry</li>, $matches[1][x] contains the inner HTML only, such as list entry.
Fast way:
Look for the index position of tag1 then look for the index position of /tag1. Then cut the string between those two indexes. Look up strpos and substr on php.net
Also this might not work if your string is too long.
$pos1 = strpos($bigString, '<tag1>');
$pos2 = strpos($bigString, '</tag1>');
$resultingString = substr($bigString, $pos1, $pos2 - $pos1 + strlen('</tag1>'));
This returns the tag together with its contents. Adjust the offsets if you only want the text inside it, and check that neither strpos call returned false.
(if you don't have comments with tag1 inside of them sigh)
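The fast way, wrapped in a function with the false checks in place. extractTag is a hypothetical name, and the sketch assumes the opening tag has no attributes and the tag is not nested in itself:

```php
<?php
// Hypothetical helper: returns the first <tag>...</tag> including the tags, or null.
function extractTag(string $haystack, string $tag): ?string
{
    $open = "<$tag>";
    $close = "</$tag>";

    $start = strpos($haystack, $open);
    if ($start === false) {
        return null;                       // opening tag not present
    }
    $end = strpos($haystack, $close, $start);
    if ($end === false) {
        return null;                       // no closing tag after the opening one
    }
    // substr() takes a start offset and a length, not an end offset.
    return substr($haystack, $start, $end - $start + strlen($close));
}

$bigString = '<p>before</p><tag1>hello</tag1><p>after</p>';
var_dump(extractTag($bigString, 'tag1'));
var_dump(extractTag($bigString, 'tag2'));
```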
The right way:
Look up html parsers

php: Extract text between specific tags from a webpage [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Best methods to parse HTML with PHP
I understand I should be using an HTML parser like PHP's DOMDocument (http://docs.php.net/manual/en/domdocument.loadhtml.php) or tagsoup.
How would I use DOMDocument to extract text between specific tags, for example get text between h1,h2,h3,p,table? It seems I can only do this for one tag at a time with getElementsByTagName.
Is there a better html parser for such task? Or how would I loop through the php domdocument?
You are correct: use DOMDocument, since regex is not a good idea for parsing HTML.
getElementsByTagName gives you a DOMNodeList that you can iterate over to get the text of all the found elements. So, your code could look something like:
$document = new \DOMDocument();
$document->loadHTML($html);
$tags = array ('h1', 'h2', 'h3', 'h4', 'p');
$texts = array ();
foreach($tags as $tag)
{
$elementList = $document->getElementsByTagName($tag);
foreach($elementList as $element)
{
$texts[$element->tagName][] = $element->textContent;
}
}
return $texts;
Note that you should probably have some error handling in there, and you will also lose the context of the texts, but you can probably edit this code as you see fit.
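If the lost context is a problem, one option is a single XPath query that selects all the wanted elements at once, which returns them in document order. A sketch with a made-up input string:

```php
<?php
// Hypothetical sample input.
$html = '<h1>Title</h1><p>Intro</p><h2>Section</h2><p>Body text</p>';

$document = new DOMDocument();
$document->loadHTML($html);
$xpath = new DOMXPath($document);

$texts = [];
// One query for all tags of interest keeps the elements in document order.
$query = '//*[self::h1 or self::h2 or self::h3 or self::p or self::table]';
foreach ($xpath->query($query) as $element) {
    $texts[] = [$element->tagName, trim($element->textContent)];
}

print_r($texts);
```

Each entry pairs the tag name with its text, so a heading can still be associated with the paragraphs that follow it.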
You can also do this with a regex.
preg_match_all('#<h1>([^<]*)</h1>#Usi', $html_string, $matches);
// $matches[1] holds the text captured between the tags
foreach ($matches[1] as $match)
{
// do something with $match
}
I am not sure what your source is, so I added a call to get the content via the URL.
$file = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($file);
$body = $doc->getElementsByTagName('body')->item(0); // getElementsByTagName returns a list; take the first node
$items = $body->getElementsByTagName('h1');
I am not sure of this part:
for ($i = 0; $i < $items->length; $i++) {
echo $items->item($i)->nodeValue . "\n";
}
Or:
foreach ($items as $item) {
echo $item->nodeValue . "\n";
}
Here is more info on nodeValue: http://docs.php.net/manual/en/function.domnode-node-value.php
Hope it helps!
