How to remove all kinds of non-breaking spaces with PHP

I am saving a string from an HTML file into my database.
I cannot get the string trimmed and cleaned of whitespace.
I created this simplified function to summarize the problem and what I've tried so far.
<?php
function get_content($html)
{
$dom = new DOMDocument();
$dom->loadHTML($html);
$div = $dom->getElementById('whitespace');
$content = $div->textContent;
# Goal: trim leading, trailing, and non-breaking space
$content = str_replace(' ','',$content);
$content = str_replace('U+00A0','',$content);
$content = str_replace('\u00a0','',$content);
$content = str_replace('\xa0','',$content);
$content = str_replace(chr(160),'',$content);
$content = trim($content);
return $content;
}
file_put_contents(
'trim.output',
get_content('<div id="whitespace"> TuffToTrim</div>'
));
?>
The output is:
      TuffToTrim
While I'd like it to be:
TuffToTrim
I'm kind of desperate at this point :) Any ideas?

Instead of
$content = str_replace(' ','',$content);
$content = str_replace('U+00A0','',$content);
$content = str_replace('\u00a0','',$content);
$content = str_replace('\xa0','',$content);
$content = str_replace(chr(160),'',$content);
$content = trim($content);
You should use
$content = preg_replace('/[\s]+/mu', '', $content);
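Note that the pattern above removes every whitespace character, including spaces inside the text. If only the leading and trailing whitespace should go, a minimal sketch could look like the following. It assumes the string is UTF-8 (which is what DOMDocument's textContent returns), where U+00A0 is the two bytes 0xC2 0xA0; that is also why chr(160) and the single-quoted '\xa0' in the question find nothing. The function name trim_unicode_space is made up for this illustration.

```php
<?php
// Hypothetical helper (not from the original post): trims ASCII whitespace
// and U+00A0 (non-breaking space) from both ends, leaving inner spaces alone.
function trim_unicode_space(string $text): string
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \x{00A0} matches the byte sequence 0xC2 0xA0.
    // preg_replace returns null on invalid UTF-8; fall back to the input then.
    return preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $text) ?? $text;
}

// Sample input is an assumption: non-breaking and regular spaces on both sides.
$sample = "\u{00A0} \u{00A0}Tuff To Trim\u{00A0} ";
$trimmed = trim_unicode_space($sample);
echo $trimmed, "\n";
```

The explicit \x{00A0} in the character class is redundant on builds where \s already covers Unicode whitespace under /u, but it keeps the intent visible.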

Convert the string to HTML entities first. Each non-breaking space then becomes the literal text &nbsp;, which str_replace can match.
$content = htmlentities($content, ENT_NOQUOTES, 'UTF-8');
$content = str_replace('&nbsp;', '', $content);

Related

Removing images from paragraph tags

I have the following code which pulls out the blockquote and puts my WordPress post content in <p> tags.
<?php
$content = preg_replace('/<blockquote>(.*?)<\/blockquote>/', '', get_the_content());
$content = wpautop($content); // Add paragraph-tags
$content = str_replace('<p></p>', '', $content); // remove empty paragraphs
echo $content;
?>
However, it also wraps the images in <p> tags, which I don't want.
Here is some code that should do it (not tested).
<?php
$content = preg_replace('/<blockquote>(.*?)<\/blockquote>/', '', get_the_content());
$content = wpautop($content); // Add paragraph-tags
$content = str_replace('<p></p>', '', $content); // remove empty paragraphs
$content = preg_replace('/<p>\s*(<a .*>)?\s*(<img .* \/>)\s*(<\/a>)?\s*<\/p>/iU', '\1\2\3', $content); // remove paragraphs around img tags
echo $content;
?>
After the str_replace line you could use this DOMDocument approach, which removes the img elements from the markup:
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false; // has no effect if set after loading
$dom->loadHTML($content);
$images = $dom->getElementsByTagName('img');
// Collect the nodes first: the node list is live and changes during removal
$removeList = array();
foreach ($images as $domElement) {
$removeList[] = $domElement;
}
foreach ($removeList as $toRemove) {
$toRemove->parentNode->removeChild($toRemove);
}
$content = $dom->saveHTML();
(P.S. This also gives you an alternative to preg_replace, not that it really matters.)
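The DOM loop above deletes the images altogether. If the goal is to keep each image and only take it out of its paragraph, one possible sketch is to move the img node in front of its parent p and drop the paragraph when nothing is left in it. The function name unwrap_images and the sample markup are assumptions for illustration; in WordPress the input would be the $content string from the question.

```php
<?php
// Hypothetical helper (not from the original answers): moves every <img>
// that sits directly inside a <p> to just before that paragraph.
function unwrap_images(string $html): string
{
    $dom = new DOMDocument();
    // loadHTML wraps fragments in <html><body>; @ silences libxml warnings.
    @$dom->loadHTML($html);

    // Copy the live node list into an array before changing the tree.
    $images = iterator_to_array($dom->getElementsByTagName('img'));
    foreach ($images as $img) {
        $parent = $img->parentNode;
        if ($parent->nodeName !== 'p') {
            continue; // only handle images directly inside a paragraph
        }
        // insertBefore moves an existing node rather than copying it.
        $parent->parentNode->insertBefore($img, $parent);
        // Remove the paragraph if it is now empty apart from whitespace.
        $hasElements = $parent->getElementsByTagName('*')->length > 0;
        if (!$hasElements && trim($parent->textContent) === '') {
            $parent->parentNode->removeChild($parent);
        }
    }

    // Serialize only the children of <body>, not the wrapper libxml added.
    $body = $dom->getElementsByTagName('body')->item(0);
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$result = unwrap_images('<p><img src="a.png"></p><p>Some text</p>');
echo $result, "\n";
```

Images wrapped in a link inside the paragraph are left untouched by this sketch, since their direct parent is the a element, not the p.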

PHP Preg Replace - Match String with Space - Wordpress

I'm trying to scan my WordPress content for:
<p><span class="embed-youtube">some iframed video</span></p>
and then change it into:
<p class="img_wrap"><span class="embed-youtube">some iframed video</span></p>
using the following code in the functions.php file of my theme:
$classes = 'class="img_wrap"';
$youtube_match = preg_match('/(<p.*?)(.*?><span class="embed-youtube")/', $content, $youtube_array);
if(!empty($youtube_match))
{
$content = preg_replace('/(<p.*?)(.*?><span class=\"embed-youtube\")/', '$1 ' . $classes . '$2', $content);
}
but for some reason I am not getting a match on my regex nor is the replace working. I don't understand why there isn't a match because the span with class embed-youtube exists.
UPDATE - HERE IS THE FULL FUNCTION
function give_attachments_class($content){
$classes = 'class="img_wrap"';
$img_match = preg_match("/(<p.*?)(.*?><img)/", $content, $img_array);
$youtube_match = preg_match('/(<p.*?)(.*?><span class="embed-youtube")/', $content, $youtube_array);
// $doc = new DOMDocument;
// @$doc->loadHTML($content); // load the HTML data
// $xpath = new DOMXPath($doc);
// $nodes = $xpath->query('//p/span[@class="embed-youtube"]');
// foreach ($nodes as $node) {
// $node->parentNode->setAttribute('class', 'img_wrap');
// }
// $content = $doc->saveHTML();
if(!empty($img_match))
{
$content = preg_replace('/(<p.*?)(.*?><img)/', '$1 ' . $classes . '$2', $content);
}
else if(!empty($youtube_match))
{
$content = preg_replace('/(<p.*?)(.*?><span class=\"embed-youtube\")/', '$1 ' . $classes . '$2', $content);
}
$content = preg_replace("/<img(.*?)src=('|\")(.*?).(bmp|gif|jpeg|jpg|png)(|\")(.*?)>/", '<img$1 data-original=$3.$4 $6>' , $content);
return $content;
}
add_filter('the_content','give_attachments_class');
Instead of using regex, make effective use of DOM and XPath to do this for you.
$doc = new DOMDocument;
@$doc->loadHTML($html); // load the HTML data
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//p/span[@class="embed-youtube"]');
foreach ($nodes as $node) {
$node->parentNode->setAttribute('class', 'img_wrap');
}
echo $doc->saveHTML();
Here is a quick and dirty REGEX I did for you. It finds the entire string starting with p tag, ending p tag, span also included etc. I also wrote it to include single or double quotes for you since you never know and also to include spaces in various places. Let me know how it works out for you, thanks.
(<p )+(class=)['"]+img_wrap+['"](><span)+[ ]+(class=)+['"]embed-youtube+['"]>[A-Za-z0-9='" ]+(</span></p>)
I have tested it on your code and a few other variations and it works for me.
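For the exact markup shown in the question, a much narrower pattern may already be enough. The sketch below only handles a bare <p> directly followed by the embed-youtube span; paragraphs that already carry attributes are left alone. The function name add_img_wrap_class is invented for this example.

```php
<?php
// Hypothetical helper (not from the original post): adds class="img_wrap"
// to a plain <p> that directly wraps the embed-youtube span.
function add_img_wrap_class(string $content): string
{
    // $1 re-inserts the captured start of the span unchanged.
    // preg_replace returns null on error; fall back to the input then.
    return preg_replace(
        '/<p>(\s*<span class="embed-youtube")/',
        '<p class="img_wrap">$1',
        $content
    ) ?? $content;
}

$in = '<p><span class="embed-youtube">some iframed video</span></p>';
$out = add_img_wrap_class($in);
echo $out, "\n";
```

Because the replacement is a no-op when nothing matches, the separate preg_match check from the question is not needed with this approach.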

PHP grabbing content between two strings

// get CONTENT from united domains footer
$content = file_get_contents('http://www.uniteddomains.com/index/footer/');
// remove spaces from CONTENT
$content = preg_replace('/\s+/', '', $content);
// match all tld tags
$regex = '#target="_parent">.(.*?)</a></li><li>#';
preg_match($regex, $source, $matches);
print_r($matches);
I want to match all of the TLDs.
Each TLD is preceded by target="_parent">. and followed by </a></li><li>
I want to end up with an array like array('africa', 'amsterdam', 'bnc', ...).
What am I doing wrong here?
NOTE: The second step to remove all the spaces is just to simplify things.
Here's a regular expression that will do it for that page.
\.\w+(?=</a></li>)
PHP
$content = file_get_contents('http://www.uniteddomains.com/index/footer/');
preg_match_all('/\.\w+(?=<\/a><\/li>)/m', $content, $matches);
print_r($matches);
Here are the results:
.africa, .amsterdam, .bcn, .berlin, .boston, .brussels, .budapest, .gent, .hamburg, .koeln, .london, .madrid, .melbourne, .moscow, .miami, .nagoya, .nyc, .okinawa, .osaka, .paris, .quebec, .roma, .ryukyu, .stockholm, .sydney, .tokyo, .vegas, .wien, .yokohama, .africa, .arab, .bayern, .bzh, .cymru, .kiwi, .lat, .scot, .vlaanderen, .wales, .app, .blog, .chat, .cloud, .digital, .email, .mobile, .online, .site, .mls, .secure, .web, .wiki, .associates, .business, .car, .careers, .contractors, .clothing, .design, .equipment, .estate, .gallery, .graphics, .hotel, .immo, .investments, .law, .management, .media, .money, .solutions, .sucks, .taxi, .trade, .archi, .adult, .bio, .center, .city, .club, .cool, .date, .earth, .energy, .family, .free, .green, .live, .lol, .love, .med, .ngo, .news, .phone, .pictures, .radio, .reviews, .rip, .team, .technology, .today, .voting, .buy, .deal, .luxe, .sale, .shop, .shopping, .store, .eus, .gay, .eco, .hiv, .irish, .one, .pics, .porn, .sex, .singles, .vin, .vip, .bar, .pizza, .wine, .bike, .book, .holiday, .horse, .film, .music, .party, .email, .pets, .play, .rocks, .rugby, .ski, .sport, .surf, .tour, .video
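The code in the question has two separate problems: it passes an undefined $source instead of $content, and preg_match stops after the first match. To get the names without the leading dot, as the question asks, a capture group can be used with preg_match_all. The sketch below runs on a small hard-coded sample instead of the live page; the sample markup is an assumption modelled on the description in the question.

```php
<?php
// Sample markup is an assumption, shaped like the footer described in the question.
$sample = '<li><a href="/a" target="_parent">.africa</a></li>'
        . '<li><a href="/b" target="_parent">.amsterdam</a></li>';

// Group 1 captures the TLD name after the dot; preg_match_all collects every hit.
preg_match_all('/target="_parent">\s*\.([\w-]+)\s*<\/a>/', $sample, $matches);

// $matches[0] holds the full matches, $matches[1] the captured names.
$tlds = $matches[1];
print_r($tlds);
```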
Using the DOM is cleaner:
$doc = new DOMDocument();
@$doc->loadHTMLFile('http://www.uniteddomains.com/index/footer/');
$xpath = new DOMXPath($doc);
$items = $xpath->query('/html/body/div/ul/li/ul/li[not(@class)]/a[@target="_parent"]/text()');
$result = '';
foreach($items as $item) {
$result .= $item->nodeValue; }
$result = explode('.', $result);
array_shift($result);
print_r($result);

Why am I having issues with variable scope in an external file function?

Within a function on my parent file, I am calling a function from an external php file. Here is my (simplified) code:
Parent file:
include "HelperFiles/htmlify.php";
function funcName(){
$description = "some sample text";
$description = htmlify($description, "code");
echo $description;
};
funcName();
htmlify.php file with called function:
$text = "";
function htmlify($text, $format){
if (is_array($_POST)) {
$html = ($_POST['text']);
} else {
$html = $text;
};
$html = str_replace("‘", "'", $html); //Stripping out stubborn MSWord curly quotes
$html = str_replace("’", "'", $html);
$html = str_replace("”", '"', $html);
$html = str_replace("“", '"', $html);
$html = str_replace("–", "-", $html);
$html = str_replace("…", "...", $html);
if ($format == "code"){
$html = str_replace(chr(149), "•",$html);
$html = str_replace(chr(150), "—",$html);
$html = str_replace(chr(151), "—",$html);
$html = str_replace(chr(153), "™",$html);
$html = str_replace(chr(169), "©",$html);
$html = str_replace(chr(174), "®",$html);
$trans = get_html_translation_table(HTML_ENTITIES);
$html = strtr($html, $trans);
$html = nl2br($html);
$html = str_replace("<br />", "<br>",$html);
$html = preg_replace ( "/(\s*<br>)/", "\n<br>", $html ); // seperate lines for each <br>
//$text = str_replace ( "&#", "&#", $text );
//return htmlspecialchars(stripslashes($text), ENT_QUOTES, "UTF-8");
return htmlspecialchars($html, ENT_QUOTES, "UTF-8");
}
else if ($format == "clean"){
return $html;
}
};
I'm getting the following error:
Notice: Undefined index: text in C:_Localhost_Tools\HelperFiles\htmlify.php on line 25
I've tried declaring the $text variable inside and outside of scope in multiple places but can not seem to get around this error (warning). Any help would be greatly appreciated! Thanks.
replace
if (is_array($_POST)) {
with
if (isset($_POST['text'])) {
and you should not get the warning anymore.
However, I would recommend removing this check altogether. The function should always use its parameter; anything else is confusing.
You can also remove the first line in htmlify.php, since it does nothing useful.
The error message reads undefined index, not undefined variable. Look at every place where you access an array with text as the key. $_POST['text'] is the likely culprit, and as far as I can tell nothing suggests you are actually dealing with $_POST data.
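The reason is_array($_POST) does not help is that $_POST is always an array, even when no form was submitted, so the check is always true and the missing key is read anyway. A minimal sketch of the isset() idea, reduced to just the input selection, is shown below; pick_input is a made-up stand-in for the relevant part of htmlify().

```php
<?php
// Hypothetical stand-in for the input selection inside htmlify().
function pick_input(string $text): string
{
    // The ?? operator behaves like isset(): it uses $_POST['text'] only
    // when that key exists and is not null, otherwise the argument.
    return $_POST['text'] ?? $text;
}

// On the command line (or on a plain GET request) $_POST is an empty array.
$fromArgument = pick_input('some sample text');
echo $fromArgument, "\n";
```

Dropping the $_POST lookup entirely and always using the parameter, as suggested above, avoids the hidden dependency on request data.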

open file and remove first tag

This would be a HTML file:
<li class="msgln">hello</li><li class="msgln">hi</li><li class="msgln">hey</li>
And php script:
$fp = fopen("file.html", 'a');
....
fclose($fp);
How can I remove the first <li class="msgln">hello</li>?
The content of each <li> changes dynamically.
This will work even if the first li would contain other nested lis:
<?php
$doc = new DOMDocument();
$doc->loadHTML('<li class="msgln">hello</li><li class="msgln">hi</li><li class="msgln">hey</li>
');
$root = $doc->documentElement;
$p = $doc->documentElement->childNodes->item(0)->childNodes;
$li = $doc->getElementsByTagName('li')->item(0);
$li->parentNode->removeChild($li);
$html = '';
foreach ($root->childNodes->item(0)->childNodes as $child) {
$html .= $doc->saveXML($child);
}
echo $html;
?>
using regex may cause unexpected results.
You can use preg_replace to achieve this:
$html = file_get_contents('file.html');
$html = preg_replace('#^<li[^>]*>[^<]+</li>#i', '', $html);
If the content of the file is exactly as described, you could use strip_tags(). Note that this strips every tag and keeps only the text:
$content = file_get_contents("file.html"); // mode 'a' is write-only, so read the file directly
$content = strip_tags($content);
Alternatively, you could use a regular expression, but this would be slower.
$content = file_get_contents("file.html"); // fread() on a handle opened with 'a' cannot read
// the limit of 1 removes only the first <li>...</li>
$text = preg_replace("/<li.+?>.+?<\/li>/is", "", $content, 1);
try this (without regex)
//string contains the file value
$string = '<li class="msgln">hello</li><li class="msgln">hi</li><li class="msgln">hey</li>';
$tag = '</li>';
$lis = explode($tag, $string);
if(count($lis) > 0) {
unset($lis[0]);
$string = implode($tag, $lis);
}
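None of the snippets above writes the result back to the file, and the 'a' mode from the question only allows appending, not reading. A possible end-to-end sketch reads the file, drops the first li, and rewrites the file. It assumes the first li contains no nested li elements, and it uses a temporary file in place of file.html; remove_first_li is a name invented here.

```php
<?php
// Hypothetical helper (not from the original answers): removes the first
// <li>...</li> from the file at $path and writes the result back.
function remove_first_li(string $path): void
{
    $html = file_get_contents($path);
    // Non-greedy match; the limit of 1 restricts the replacement to the
    // first occurrence. Assumes no nested <li> inside the first item.
    $html = preg_replace('#<li\b[^>]*>.*?</li>#is', '', $html, 1);
    file_put_contents($path, $html);
}

// Demo on a temporary file so that no real file.html is touched.
$path = tempnam(sys_get_temp_dir(), 'li');
file_put_contents(
    $path,
    '<li class="msgln">hello</li><li class="msgln">hi</li><li class="msgln">hey</li>'
);
remove_first_li($path);
$result = file_get_contents($path);
unlink($path);
echo $result, "\n";
```

If several processes write to the same file, some form of locking (for example the LOCK_EX flag of file_put_contents) would be worth considering.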
