How to strip a HTML element from a text file with PHP?


I am cleaning up a mess created by the Adobe InDesign ePub export feature.
MY GOAL:
OPTION 1. I want to remove all span elements with class attribute CharOverride-7 but leave the other span elements.
OPTION 2. In some cases I want to replace the span.CharOverride-7 with a new element, such as i.
Note: my current manual and time-consuming approach is a mass search-and-replace, but the input text file is inconsistent (extra spaces and other artifacts).
The input text contains hundreds of p paragraphs which look like this:
<p class="2"><span class="CharOverride-7">A book title</span><span class="CharOverride-8">https://aaa.net</span><span class="CharOverride-7">.</span></p>
<p class="2"><span class="CharOverride-7">Another book title</span><span class="CharOverride-8">https://aaa.net/</span><span class="CharOverride-7">.</span></p>
The desired output should look like this:
OPTION ONE (removal of the element)
<p class="2">A book title<span class="CharOverride-8">https://aaa.net</span>.</p>
OPTION TWO (replace span.CharOverride-7 with an i element)
<p class="2"><i>A book title</i><span class="CharOverride-8">https://aaa.net</span><i>.</i></p>

For option one, this approach works using DOMDocument (https://www.php.net/manual/de/class.domdocument.php):
<?php
$yourHTML = '<p class="2"><span class="CharOverride-7">A book title</span><span class="CharOverride-8">https://aaa.net</span><span class="CharOverride-7">.</span></p>';
$dom = new DOMDocument();
$dom->loadHTML($yourHTML, LIBXML_HTML_NODEFDTD | LIBXML_HTML_NOIMPLIED );
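// Copy the live node list first, so that replacing nodes cannot make the loop skip elements
foreach (iterator_to_array($dom->getElementsByTagName('span')) as $span) {
    // getAttribute() returns an empty string when the span has no class attribute
    if ($span->getAttribute('class') === 'CharOverride-7') {
        $newelement = $dom->createTextNode($span->textContent);
        $span->parentNode->replaceChild($newelement, $span);
    }
}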
$ret = $dom->saveHTML();
// <p class="2">A book title<span class="CharOverride-8">https://aaa.net</span>.</p>
echo $ret;
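
Option two (replacing the span with an i element) can probably be handled the same way. The following is a minimal sketch, assuming the same sample paragraph as in the question; the variable names $italic and $optionTwo are illustrative only. It moves the children of each matching span into a new i element, so any nested markup should survive.

```php
<?php
$yourHTML = '<p class="2"><span class="CharOverride-7">A book title</span><span class="CharOverride-8">https://aaa.net</span><span class="CharOverride-7">.</span></p>';

$dom = new DOMDocument();
$dom->loadHTML($yourHTML, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Snapshot the live node list so replacing spans does not disturb the loop
foreach (iterator_to_array($dom->getElementsByTagName('span')) as $span) {
    if ($span->getAttribute('class') !== 'CharOverride-7') {
        continue;
    }
    $italic = $dom->createElement('i');
    // Move the children across so nested markup inside the span survives
    while ($span->firstChild !== null) {
        $italic->appendChild($span->firstChild);
    }
    $span->parentNode->replaceChild($italic, $span);
}

$optionTwo = trim($dom->saveHTML());
echo $optionTwo;
```

For a whole file, the string would come from file_get_contents() instead, and a fragment with several top-level paragraphs may need a wrapper element before loading.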

Here's a simple approach for you using preg_replace()...
<?php
$data = file_get_contents('[YOUR FILENAME HERE]');
$result1 = preg_replace('/<span class="CharOverride-7">(.*)<\/span>/U', '$1', $data);
//$result2 = preg_replace('/<span class="CharOverride-7">(.*)<\/span>/U', '<i>$1</i>', $data);
echo $result1;
// echo $result2;
// Overwrite your file here... (Beyond scope of this question)
Just use $result1 or $result2 at your leisure.
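
Since the question mentions extra spaces in the input, a slightly more tolerant pattern may help. This is only a sketch under assumptions: the sample string with irregular spacing is made up for illustration, and the pattern still expects class to be the only attribute and the span to contain no nested span elements.

```php
<?php
// Hypothetical sample with irregular spacing, standing in for the real file contents
$data = '<p class="2"><span  class = "CharOverride-7" >A book title</span>'
      . '<span class="CharOverride-8">https://aaa.net</span>'
      . '<span class="CharOverride-7">.</span></p>';

// \s+ and \s* tolerate extra whitespace; the s flag lets . match newlines,
// and .*? keeps each match as short as possible
$pattern = '/<span\s+class\s*=\s*"CharOverride-7"\s*>(.*?)<\/span>/s';

$removed  = preg_replace($pattern, '$1', $data);         // option one
$replaced = preg_replace($pattern, '<i>$1</i>', $data);  // option two

echo $removed, PHP_EOL;
echo $replaced, PHP_EOL;
```

If the spans can be nested or carry other attributes, a DOM-based approach is likely the safer choice.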

Related

How do you custom-format the first word/character from an html-markup MySQL field?

I did the following which works with simple text fields:
$field = "How are you doing?";
$arr = explode(' ',trim($field));
$first_word = $arr[0];
$balance = strstr("$field"," ");
It didn't work on the real data because the field contains HTML markup (perhaps an image, video, div, paragraph, etc.), so the markup got mixed in with the text.
I could possibly use strip_tags to strip out the html then obtain first word and reformat it, but then I would have to figure out how to add the html back into the data. I'm wondering if there is a php or custom function ready made for this purpose.

You can use DOMDocument to parse the HTML, modify the contents, and save it back as HTML. Also, finding the words is not always as simple as splitting on spaces, since not all languages delimit their words with spaces and not all words are necessarily delimited by spaces. For example, mother-in-law could be viewed as one word or as three, depending on how you define a word. Likewise, do you consider pancake one word or two (pan and cake)? One simple solution is to use IntlBreakIterator::createWordInstance(), which returns a break iterator implementing the Unicode standard for text segmentation, also known as UAX #29.
Here's an example of how you might go about implementing this:
$html = <<<'HTML'
<div>some sample text here</div>
HTML;
/* Let's extend DOMDocument to include a walk method that can traverse the entire DOM tree */
class MyDOMDocument extends DOMDocument {
    public function walk(DOMNode $node, $skipParent = false) {
        if (!$skipParent) {
            yield $node;
        }
        if ($node->hasChildNodes()) {
            foreach ($node->childNodes as $n) {
                yield from $this->walk($n);
            }
        }
    }
}
$dom = new MyDOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// Let's traverse the DOMTree to find the first text node
foreach ($dom->walk($dom->childNodes->item(0)) as $node) {
    if ($node->nodeName === "#text") {
        break;
    }
}
// Extract the first word from that text node
$iterator = IntlBreakIterator::createWordInstance();
$iterator->setText($node->nodeValue); // set the text in the word iterator
$it = $iterator->getPartsIterator(IntlPartsIterator::KEY_RIGHT);
foreach ($it as $offset => $word) {
    break;
}
// You can do whatever you want to $word here
$word .= "s"; // I'm going to append the letter s
// Write the modified word back, followed by the untouched rest of the text.
// The text node is already part of the DOM tree, so updating it in place is enough.
$unmodifiedString = substr($node->nodeValue, $offset);
$node->nodeValue = $word . $unmodifiedString;
// Save the HTML
$newHTML = $dom->saveHTML();
echo $newHTML;
Outputs
<div>somes sample text here</div>

PHP Simple Html Dom get the plain text of div, but avoiding all other tags

I use PHP Simple Html Dom to fetch some HTML. Given a DOM like the following, I need the plain text directly inside the div, without the p tags and their content (it should return only 111111). Can anyone help? Thanks in advance!
<div>
<p>00000000</p>
111111
<p>22222222</p>
</div>

It depends on what you mean by "avoiding the p tags".
If you just want to remove the tags, then just running strip_tags() on it should work for what you want.
If you actually want to return just "111111" (i.e. strip the tags and their contents), then strip_tags() isn't a viable solution. For that, something like this may work:
$myDiv = $html->find('div', 0); // index 0 returns the first matching div instead of an array
$children = $myDiv->children; // get an array of children
foreach ($children as $child) {
    $child->outertext = ''; // This removes the element, but MAY NOT remove it from the original $myDiv
}
echo $myDiv->innertext;
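
As an alternative that does not rely on Simple Html Dom, PHP's built-in DOMDocument and DOMXPath can select only the text nodes that are direct children of the div. This is a sketch of a different technique from the one above, using the sample markup from the question; the variable names are illustrative.

```php
<?php
$html = "<div>\n<p>00000000</p>\n111111\n<p>22222222</p>\n</div>";

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);
$text = '';
// div/text() selects only text nodes that are direct children of the div,
// so the text inside the p elements is never included
foreach ($xpath->query('//div/text()') as $textNode) {
    $text .= $textNode->nodeValue;
}

$plain = trim($text);
echo $plain;
```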

If your text is always at the same position, try this:
$html->find('text', 2)->plaintext; // should return 111111

Here is my solution
I want to get the Primary Text part only.
$title_obj = $article->find(".ofr-descptxt",0); //Store the Original Tree ie) h3 tag
$title_obj->children(0)->outertext = ""; //Unset <br/>
$title_obj->children(1)->outertext = ""; //Unset the last Span
echo $title_obj; //It has only first element
Edited:
If you get PHP errors, wrap the calls in an if/else, or try my lazy code:
($title_obj->children(0))?$title_obj->children(0)->outertext="":"";
($title_obj->children(1))?$title_obj->children(1)->outertext = "":"";

$wordlist = array("<p>", "</p>");
foreach ($wordlist as $word) {
    $string = str_replace($word, "", $string);
}

Replace all matches that are not part of HTML code

I have input such as:
<h2 class="role">He played and an important role</h2>
I need to replace the word role in the text, but not in the class attribute.
The tricky part is that the attribute may be class="group role something" or similar, so I essentially want to search only the real text and not the HTML, while still getting everything back.
I'm working in PHP and don't have a good starting point...

Better not to use the preg_ functions for parsing HTML; use the DOM:
$input = '<h2 class="role">He played and an important role</h2>';
$dom = new DOMDocument('1.0', 'utf-8');
$dom->preserveWhiteSpace = false; // set before loading; it has no effect on an already-loaded document
$dom->loadHTML($input);
$element = $dom->getElementsByTagName('h2'); // <--- change tag name as appropriate
$value = $element->item(0)->nodeValue;
// change $value here...
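
To replace the word in every text node while leaving attributes alone, one option is to query the text nodes with DOMXPath. The sketch below assumes the sample input from the question and the replacement string 'new role'; the word-boundary pattern is my own choice, not something the question specifies.

```php
<?php
$input = '<h2 class="role">He played and an important role</h2>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep warnings about imperfect markup quiet
$dom->loadHTML($input, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// //text() selects text nodes only, so attribute values such as class="role" are never touched
foreach ($xpath->query('//text()') as $textNode) {
    $textNode->nodeValue = preg_replace('/\brole\b/', 'new role', $textNode->nodeValue);
}

$output = trim($dom->saveHTML());
echo $output;
```

Because the regex only ever sees the contents of text nodes, it does not need to know anything about tags.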

It is better to use the DOM to manipulate HTML, but here is a regex solution.
The negative lookahead skips any match that is followed by > before the next <, that is, any match inside a tag.
$input = '<h2 class="role">He played and an important role</h2>';
$input = preg_replace( '/role(?![^<>]*>[^<>]*(?:<|$))/', 'new role', $input );
echo $input;
// <h2 class="role">He played and an important new role</h2>

PHP text to array and with key

I don't know regular expressions well and haven't succeeded in splitting a string into an array.
I have a string like:
<h5>some text in header</h5>
some other content, that belongs to header <p> or <a> or <img> inside.. not important...
<h5>Second text header</h5>
So what I am trying to do is split the text into an array where the KEY is the text from a header and the CONTENT is everything up to the next header, like:
array("some text in header" => "some other content, that belongs to header...", ...)

I would suggest looking at the PHP DOM extension: http://php.net/manual/en/book.dom.php. You can read or create a DOM from a document.

I've used this one and enjoyed it:
http://simplehtmldom.sourceforge.net/
You could do it with a regex as well, something like this:
/<h5>(.*?)<\/h5>(.*?)<h5>/s
But this just finds the first occurrence; you'll have to cut the string to get the next one.
Any way you cut it, I don't see a one-liner for you. Sorry.
Here's a crude sketch using explode():
$res = array();
foreach (explode("<h5>", $html) as $chunk) {
    // Skip anything before the first header, which has no closing tag
    if (strpos($chunk, "</h5>") === false) {
        continue;
    }
    list($key, $val) = explode("</h5>", $chunk, 2);
    $res[trim($key)] = trim($val);
}

Don't parse HTML with preg_match; use PHP's DOMDocument class instead.
Example:
<?php
$html= "<h5>some text in header</h5>
some other content, that belongs to header <p> or <a> or <img> inside.. not important...
<h5>Second text header</h5>";
// create a new DOM object
$dom = new DOMDocument('1.0', 'utf-8');
// discard white space (set before loading)
$dom->preserveWhiteSpace = false;
// load the HTML into the object
$dom->loadHTML($html);
$hFive = $dom->getElementsByTagName('h5');
echo $hFive->item(0)->nodeValue; // change the index to get the other h5 elements
?>
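
To build the header => content array the question asks for, one possibility is to walk the siblings that follow each h5 until the next h5. This is a minimal sketch under assumptions: the sample string is a shortened variant of the one in the question, and the wrapping div is added only so that the fragment has a single root element.

```php
<?php
$html = "<h5>some text in header</h5>
some other content, that belongs to header <p>a paragraph</p>
<h5>Second text header</h5>
content of the second header";

$dom = new DOMDocument();
libxml_use_internal_errors(true);
// The wrapper div is an assumption: it gives the fragment a single root element
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$result = array();
foreach ($dom->getElementsByTagName('h5') as $header) {
    $content = '';
    // Collect every sibling after this header until the next h5
    for ($node = $header->nextSibling; $node !== null; $node = $node->nextSibling) {
        if ($node->nodeName === 'h5') {
            break;
        }
        $content .= $dom->saveHTML($node);
    }
    $result[trim($header->textContent)] = trim($content);
}

print_r($result);
```

This assumes all h5 elements are siblings at the same level; headers nested at different depths would need a different traversal.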

Extract description in site with no meta tag description?

I need a PHP function that extracts a description from a site URL that has no meta description tag. Any ideas?
I have tried this function, but it doesn't work:
$content = file_get_contents($url);
function getExcerpt($content) {
$text = html_entity_decode($content);
$excerpt = array();
//match all tags
preg_match_all("|<[^>]+>(.*)]+>|", $text, $p, PREG_PATTERN_ORDER);
for ($x = 0; $x < sizeof($p[0]); $x++) {
if (preg_match('< p >i', $p[0][$x])) {
$strip = strip_tags($p[0][$x]);
if (preg_match("/\./", $strip))
$excerpt[] = $strip;
}
if (isset($excerpt[0])){
preg_match("/([^.]+.)/", $strip,$matches);
return $matches[1];
}
}
return false;
}
$excerpt = getExcerpt($content);

Parsing HTML with RegEx is almost always a bad idea. Thankfully PHP has libraries that can do the work for you. The following code uses DOMDocument to extract either the meta description or if one does not exist, the first 1000 characters in the page.
<?php
function getExcerpt($html) {
    $dom = new DOMDocument();
    // Parse the inputted HTML into a DOM
    $dom->loadHTML($html);
    $metaTags = $dom->getElementsByTagName('meta');
    // Check for a meta description and return it if it exists
    foreach ($metaTags as $metaTag) {
        if ($metaTag->getAttribute('name') === "description") {
            return $metaTag->getAttribute('content');
        }
    }
    // No meta description, extract an excerpt from the body
    // Get the body node
    $body = $dom->getElementsByTagName('body');
    $body = $body->item(0);
    // extract the contents
    $bodyText = $body->textContent;
    // collapse any line breaks
    $bodyText = preg_replace('/\s*\n\s*/', "\n", $bodyText);
    // collapse any more leftover spaces or tabs to single spaces
    $bodyText = preg_replace('/[ ]+/', ' ', $bodyText);
    // return the first 1000 chars
    return trim(substr($bodyText, 0, 1000));
}
$html = file_get_contents('test.html');
echo nl2br(getExcerpt($html));
You'll probably want to add a little more logic to it, some DOM traversal to try to find the content, or just some snippet near the middle of the text. As it is, this code will probably grab a bunch of unwanted stuff like the top of the page navigation etc.

You should first check whether a meta description is available. If so, display it; otherwise search for the <p> tags and display that text as the description (you might want to put a limit on the length of a paragraph, e.g. if its length is less than 30, search for the next paragraph). If there is no <p> tag, simply display the title as the description (that's how Facebook and Digg work).
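
The fallback chain described above could be sketched roughly as follows. The function name describePage and the $minLength parameter are hypothetical; the 30-character threshold simply follows the suggestion in this answer, and the sample page is made up for illustration.

```php
<?php
// Hypothetical helper: meta description first, then a long enough paragraph, then the title
function describePage(string $html, int $minLength = 30): ?string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // real-world pages are rarely valid HTML
    $dom->loadHTML($html);
    libxml_clear_errors();

    // 1. Prefer the meta description when present and non-empty
    foreach ($dom->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            $content = trim($meta->getAttribute('content'));
            if ($content !== '') {
                return $content;
            }
        }
    }

    // 2. Otherwise take the first paragraph that is long enough
    foreach ($dom->getElementsByTagName('p') as $p) {
        $text = trim(preg_replace('/\s+/', ' ', $p->textContent));
        if (strlen($text) >= $minLength) {
            return $text;
        }
    }

    // 3. Fall back to the page title, or null when there is none
    $title = $dom->getElementsByTagName('title')->item(0);
    return $title !== null ? trim($title->textContent) : null;
}

$page = '<html><head><title>Example page</title></head><body>'
      . '<p>Short.</p><p>This paragraph is long enough to serve as a description.</p>'
      . '</body></html>';
echo describePage($page), PHP_EOL;
```

Note that strlen() counts bytes, so for multibyte text mb_strlen() may be the better measure if the mbstring extension is available.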
