I am working on code where I upload HTML and the same HTML is stored as content, with its first characters used as the title and SEO URL.
However, I cannot get clean plain text out of the HTML string to use as the title and SEO URL.
Below is my code to get the title from the HTML text:
$title = getplaintextintrofromhtml($str,100);
$title = str_replace(PHP_EOL, '', $title);
$title = str_replace(" "," ", $title);
$title = str_replace(str_split('\\/:*?"<>|,+=-'), '', $title);
$title = str_replace("'","", $title);
$title = str_replace("<br>","", $title);
$title = str_replace("\n","", $title);
$title = trim($title);
SEO URL:
$newurltitle=str_replace(" ","-",$title);
and the function:
function getplaintextintrofromhtml($html, $numchars) {
// Remove the HTML tags
$html = strip_tags($html);
// Convert HTML entities to single characters
$html = html_entity_decode($html, ENT_QUOTES, 'UTF-8');
// Make the string the desired number of characters
// Note that substr is not good as it counts by bytes and not characters
$html = mb_substr($html, 0, $numchars, 'UTF-8');
// Return the truncated text (no ellipsis is added here)
return $html;
}
Even after the code above, I still get titles containing newlines. I cannot figure out why: I am getting plain text, but the newlines are still there, so I cannot use the titles to build the SEO URL either.
You can use the following code to collapse newlines, carriage returns, tabs, and repeated spaces into single spaces:
$title = preg_replace('/\s+/', ' ', $title);
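A minimal sketch of how that line could fit into the whole title and SEO URL step. The helper names make_title and make_slug are made up for this example. One possible cause of the leftover line breaks is carriage returns ("\r") from Windows-style line endings, which the str_replace calls for "\n" and PHP_EOL do not remove on Linux; the \s class covers them. The sketch also cleans the whitespace before truncating, so the 100 characters are not spent on blank space.

```php
<?php
// Hypothetical helper: plain-text title from an HTML string.
function make_title(string $html, int $numChars = 100): string
{
    // Strip tags, then turn entities such as &amp; into real characters.
    $text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');

    // &nbsp; decodes to U+00A0, which \s does not match without the /u modifier.
    $text = str_replace("\u{00A0}", ' ', $text);

    // Collapse every run of whitespace (\r, \n, tabs, spaces) into one space.
    $text = trim(preg_replace('/\s+/', ' ', $text));

    // Truncate by characters, not bytes.
    return mb_substr($text, 0, $numChars, 'UTF-8');
}

// Hypothetical helper: SEO URL part from an already cleaned title.
function make_slug(string $title): string
{
    // Drop the same punctuation the question removes, plus the apostrophe.
    $slug = str_replace(str_split('\\/:*?"<>|,+=-\''), '', $title);

    // Removing punctuation can leave double spaces, so collapse again.
    return preg_replace('/\s+/', '-', trim($slug));
}

$title = make_title("<h1>Hello\r\n   World</h1>\n<p>More&nbsp;text</p>");
echo $title, "\n";            // Hello World More text
echo make_slug($title), "\n"; // Hello-World-More-text
```

The mbstring extension is assumed to be available, as it already is in the question's own function.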
The following native function of my script strips some content from my article, such as BBCodes, HTML tags, and http or https URLs. I'd like to modify it so that it strips everything except URLs that match "https://my.website.com/variablepath".
function _bbcode_strip($text)
{
static $patterns = array();
if ($this->tp_bbcode)
{
// use text inside [topicpreview] bbcode as the topic preview
if (preg_match('#\[(topicpreview[^\[\]]+)\].*\[/\1\]#Usi', $text, $matches))
{
$text = $matches[0];
}
}
$text = smiley_text($text, true); // display smileys as text :)
$text = ($this->tp_line_breaks ? str_replace("\n", '&#13;&#10;', $text) : $text); // preserve line breaks
// Loop through text stripping inner most nested BBCodes until all have been removed
$regex = '#\[(' . $this->strip_bbcodes . ')[^\[\]]+\]((?:(?!\[\1[^\[\]]+\]).)+)\[\/\1[^\[\]]+\]#Usi';
while(preg_match($regex, $text))
{
$text = preg_replace($regex, '', $text);
}
if (empty($patterns))
{
$patterns = array(
'#<!-- [lmw] --><a class="postlink[^>]*>(.*<\/a[^>]*>)?<!-- [lmw] -->#Usi', // Magic URLs
'#<[^>]*>(.*<[^>]*>)?#Usi', // HTML code
'#\[/?[^\[\]]+\]#mi', // Strip all bbcode tags
'#(http|https|ftp|mailto)(:|\&\#58;)\/\/[^\s]+#i', // Strip remaining URLs
'#"#', // Possible quotes from older board conversions
'#[\s]+#' // Multiple spaces
);
}
return trim(preg_replace($patterns, ' ', $text));
}
I tried working on this part of the script myself, the regexes that decide what to strip:
'#<!-- [lmw] --><a class="postlink[^>]*>(.*<\/a[^>]*>)?<!-- [lmw] -->#Usi', // Magic URLs
'#<[^>]*>(.*<[^>]*>)?#Usi', // HTML code
'#\[/?[^\[\]]+\]#mi', // Strip all bbcode tags
'#(http|https|ftp|mailto)(:|\&\#58;)\/\/[^\s]+#i', // Strip remaining URLs
'#"#', // Possible quotes from older board conversions
'#[\s]+#' // Multiple spaces
but I didn't manage to get the result I want, so I came here to ask for help. Thank you.
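One possible approach, sketched here on its own rather than inside _bbcode_strip, is to put a negative lookahead in front of the "Strip remaining URLs" pattern so that it cannot match where the protected prefix starts. The function name strip_urls_except and the sample text are made up for this example. It only protects URLs written with a literal ":"; a URL stored with the "&#58;" entity would need the prefix adapted.

```php
<?php
// Hypothetical helper, not part of the original script.
// Strips URLs from $text but keeps those that start with $keepPrefix.
function strip_urls_except(string $text, string $keepPrefix): string
{
    // (?!...) is a negative lookahead: the URL pattern may not start
    // at a position where the protected prefix begins.
    $pattern = '#(?!' . preg_quote($keepPrefix, '#') . ')'
             . '(?:https?|ftp|mailto)(?::|&\#58;)//\S+#i';

    $text = preg_replace($pattern, ' ', $text);

    // Collapse the gaps left behind, as the original '#[\s]+#' pattern does.
    return trim(preg_replace('#\s+#', ' ', $text));
}

echo strip_urls_except(
    'Docs: https://my.website.com/variablepath and http://example.com/page here',
    'https://my.website.com/'
), "\n";
// Docs: https://my.website.com/variablepath and here
```

In the $patterns array the same idea would mean prefixing the URL entry with (?!https://my\.website\.com/). The "Magic URLs" and "HTML code" patterns run first and are not changed by this sketch, so links wrapped in markup may still lose their tags.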
I want to pull the list (ul) element out of my WordPress post(s) so I can put it in a different location.
My current code strips the images and blockquotes and outputs just the text:
<?php
$content = preg_replace('/<blockquote>(.*?)<\/blockquote>/', '', get_the_content());
$content = preg_replace('/(<img [^>]*>)/', '', $content);
$content = wpautop($content); // Add paragraph-tags
$content = str_replace('<p></p>', '', $content); // remove empty paragraphs
echo $content;
?>
Just a friendly reminder: parsing HTML with regex is generally not recommended.
If you would like to do it anyway, you could try a pattern like this:
$pattern = '~<ul>(.*?)</ul>~s';
So in your code it would look like this:
preg_match_all('~<ul>(.*?)</ul>~s', $content, $ulElements);
And then, to remove the lists from the original string:
$content = preg_replace('~<ul>(.*?)</ul>~s', '', $content);
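For comparison, here is a rough sketch of the same extraction done with a parser instead of a regex. The function name extract_lists and the sample markup are made up for this example, and it assumes the post content is UTF-8 and does not contain unbalanced div tags. Unlike the plain <ul> pattern, it also finds lists that carry attributes.

```php
<?php
// Hypothetical helper: returns [content without lists, array of list HTML].
function extract_lists(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about HTML5 tags
    // The XML declaration is a common trick to make loadHTML read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><html><body><div>' . $html . '</div></body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // First <div> in document order is the wrapper added above.
    $wrapper = $doc->getElementsByTagName('div')->item(0);

    $lists = [];
    $xpath = new DOMXPath($doc);
    // Only top-level lists; nested lists travel with their parent list.
    foreach ($xpath->query('//ul[not(ancestor::ul)]') as $ul) {
        $lists[] = $doc->saveHTML($ul);
        $ul->parentNode->removeChild($ul);
    }

    // Serialize what is left inside the wrapper.
    $rest = '';
    foreach ($wrapper->childNodes as $child) {
        $rest .= $doc->saveHTML($child);
    }

    return [$rest, $lists];
}

[$rest, $lists] = extract_lists('<p>Intro</p><ul class="x"><li>a</li></ul><p>End</p>');
echo $rest, "\n";                // the paragraphs, without the list
echo implode("\n", $lists), "\n"; // the list markup, ready to print elsewhere
```

The serializer may insert line breaks between block elements, so the output should be treated as equivalent markup rather than a byte-for-byte copy of the input.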
Here I am creating a preview for a URL, which shows:
URL title
URL description (the title should not appear in this)
Here is my attempt.
<?php
function plaintext($html)
{
$plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#', ' ', $html);
// remove title
//$plaintext = preg_match('#<title>(.*?)</title>#', $html);
// remove comments and any content found in the comment area (strip_tags only removes the actual tags).
$plaintext = preg_replace('#<!--.*?-->#s', '', $plaintext);
// put a space between list items (strip_tags just removes the tags).
$plaintext = preg_replace('#</li>#', ' </li>', $plaintext);
// remove all script and style tags
$plaintext = preg_replace('#<(script|style)\b[^>]*>(.*?)</(script|style)>#is', "", $plaintext);
// remove br tags (missed by strip_tags)
$plaintext = preg_replace("#<br[^>]*?>#", " ", $plaintext);
// remove all remaining html
$plaintext = strip_tags($plaintext);
return $plaintext;
}
function get_title($html)
{
return preg_match('!<title>(.*?)</title>!i', $html, $matches) ? $matches[1] : '';
}
function trim_display($size,$string)
{
$trim_string = substr($string, 0, $size);
$trim_string = $trim_string . "...";
return $trim_string;
}
$url = "http://www.nextbigwhat.com/indian-startups/";
$data = file_get_contents($url);
//$url = trim_url(5,$url);
$title = get_title($data);
echo "title is ; $title";
$content = plaintext($data);
$Preview = trim_display(100,$content);
echo '<br/>';
echo "preview is: $Preview";
?>
The URL title appears correctly. But even though I excluded the title content from the description, it still appears there.
I used $plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#', ' ', $html); to exclude the title from the plain text.
The regex looks correct to me, yet it does not exclude the title content.
What is the problem here?
The output we get is:
title is ; Indian Startups Archives - NextBigWhat.com
preview is: Indian Startups Archives : NextBigWhat.com [whatever rest text]...
The text that appears in the title should not appear again in the preview. That's why I want to exclude it and display only the remaining text in the preview.
How can I solve this mystery?
If you look closely at the title and the preview, they're different. Let's look at the output for the fetched page.
echo plaintext($data);
Well, it seems it has two titles:
<title>
Indian Startups Archives : NextBigWhat.com</title>
and
<title>Indian Startups Archives - NextBigWhat.com</title>
So get_title is retrieving the second title, and plaintext leaves the first one alone. What's the difference between them? The line break! By default, the dot does not match newline characters, so your regex does not match a title that spans more than one line. That is exactly what the s modifier is for. Note that with s the greedy .* runs from the first <title to the last </title>; use .*? if each title should be removed separately.
tl;dr
Your regex is missing the s modifier; add it:
$plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#s', ' ', $html);
instead of
$plaintext = preg_replace('#([<]title)(.*)([<]/title[>])#', ' ', $html);
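The effect of the s modifier can be checked on a small stand-in for the page. The markup below is invented for this example and only mimics the two titles found above; the pattern uses the lazy .*? so that each title is removed on its own.

```php
<?php
// Made-up sample with two <title> elements; the first one contains a line break.
$html = "<head><title>\nFirst : Site</title><meta name=\"x\">"
      . "<title>Second - Site</title></head><body>Body text</body>";

// Without s, the dot stops at the newline, so only the second title is removed.
$withoutS = preg_replace('#<title\b[^>]*>.*?</title>#i', ' ', $html);

// With s, the dot also matches newlines, so both titles are removed.
$withS = preg_replace('#<title\b[^>]*>.*?</title>#is', ' ', $html);

var_dump(strpos($withoutS, 'First') !== false); // bool(true): first title survived
var_dump(strpos($withS, 'First') !== false);    // bool(false): first title removed
```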
I am having trouble understanding how to keep the Norwegian letters
"æ ø å" in this preg_replace function I got for turning forum titles into SEO URLs.
My website is rendered in "iso-8859-1".
How i want it: someurl.com/read=kjøp_og_salg
Currently looks like this: someurl.com/read=kj_p_og_salg
//----- The seo url function ------//
public function make_seo_name($title){
$title = preg_replace('/[\'"]/', '', $title);
$title = preg_replace('/[^a-zA-Z0-9]+/', '_', $title);
$title = strtolower(trim($title, '_'));
return $title;
}
I tried to utf8_encode/decode the $title before and after the preg_replace, but it didn't work.
Thank you for your time!
EDIT:
Solved. I fixed it with some help from "One Trick Pony" and ended up with this function.
public function make_seo_name($title){
$title = utf8_encode($title);
$title = preg_replace('/[\'"]/', '', $title);
$title = preg_replace('/[^a-zA-Z0-9\ø\å\æ]+/', '_', $title);
$title = strtolower(trim($title, '_'));
return $title;
}
Note: I did NOT need to change my header from "iso-8859-1" to "UTF-8".
The '/[^a-zA-Z0-9]+/' bit is a regular expression that matches any run of one or more characters other than a through z, A through Z, or 0 through 9. The basic syntax is described on Wikipedia.
preg_replace then replaces such characters with underscores.
You can add the extra characters you want to allow to this list:
$title = preg_replace('/[^a-zA-Z0-9æøå]+/', '_', $title);
Set the document encoding to utf-8 or iso-8859-1 and add the characters to the list like:
<head><meta charset="utf-8" /></head>
and
$title = preg_replace('/[^a-zA-Z0-9æøå]+/', '_', $title);
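If the title is handled as UTF-8, a variant along these lines may be more predictable. It is only a sketch: the name make_seo_name_utf8 is made up, the source file is assumed to be saved as UTF-8, and the mbstring extension is assumed to be installed. The u modifier tells PCRE to treat pattern and subject as UTF-8, so æ, ø and å count as single characters, and mb_strtolower also lowercases Æ, Ø and Å, which strtolower does not.

```php
<?php
// Hypothetical UTF-8 variant of make_seo_name().
function make_seo_name_utf8(string $title): string
{
    // Remove quotes entirely.
    $title = preg_replace('/[\'"]/u', '', $title);

    // Replace every run of other characters with one underscore.
    $title = preg_replace('/[^a-zA-Z0-9æøåÆØÅ]+/u', '_', $title);

    // Trim stray underscores, then lowercase with multibyte support.
    return mb_strtolower(trim($title, '_'), 'UTF-8');
}

echo make_seo_name_utf8('Kjøp og salg!'), "\n"; // kjøp_og_salg
```

For a title that arrives as ISO-8859-1, mb_convert_encoding($title, 'UTF-8', 'ISO-8859-1') could be applied first.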
I have a form which accepts HTML data, but we need only the text it contains, nothing else. Is there any particular way to extract the text out of the HTML in PHP?
Use strip_tags().
Surely it can be done:
Just look at this function and use it as you like:
function html2txt ($document)
{
    // Each pattern in $search is paired with the entry at the same index in $replace.
    // The replacement literals assume this file and the output are UTF-8.
    $search = array (
        "'<script[^>]*?>.*?</script>'si", // Strip out JavaScript code
        "'<!--[\s\S]*?--[ \t\n\r]*>'", // Strip out HTML comments (before the tags)
        "'<[\/\!]*?[^<>]*?>'si", // Strip out HTML tags
        "'([\r\n])[\s]+'", // Collapse white space after a line break
        "'&(quot|#34|#034|#x22);'i", // Replace HTML entities
        "'&(lt|#60|#060|#x3c);'i", // Added hexadecimal values
        "'&(gt|#62|#062|#x3e);'i",
        "'&(nbsp|#160|#xa0);'i",
        "'&(iexcl|#161);'i",
        "'&(cent|#162);'i",
        "'&(pound|#163);'i",
        "'&(copy|#169);'i",
        "'&(reg|#174);'i",
        "'&(deg|#176);'i",
        "'&(#39|#039|#x27);'",
        "'&(euro|#8364);'i", // Europe
        "'&a(uml|UML);'", // German
        "'&o(uml|UML);'",
        "'&u(uml|UML);'",
        "'&A(uml|UML);'",
        "'&O(uml|UML);'",
        "'&U(uml|UML);'",
        "'&szlig;'i",
        "'&(amp|#38|#038|#x26);'i", // Decode &amp; last so "&amp;lt;" is not decoded twice
    );
    $replace = array (
        "", // script
        "", // comments
        "", // tags
        " ", // white space
        "\"",
        "<",
        ">",
        " ",
        "¡",
        "¢",
        "£",
        "©",
        "®",
        "°",
        "'",
        "€",
        "ä",
        "ö",
        "ü",
        "Ä",
        "Ö",
        "Ü",
        "ß",
        "&",
    );
    $text = preg_replace($search, $replace, $document);
    return trim ($text);
}
You can parse the HTML using DOMDocument::loadHTMLFile and extract what you need.
$doc = new DOMDocument();
$doc->loadHTMLFile("data.html");
$metaTags = $doc->getElementsByTagName('meta');
// Process $metaTags
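Since the question asks for the text rather than the meta tags, here is a rough sketch of the same DOMDocument approach applied to a string. The function name dom_text and the sample markup are made up for this example, and UTF-8 input is assumed. Script and style elements are removed first because their contents would otherwise end up in textContent.

```php
<?php
// Hypothetical helper: visible text of an HTML string via DOMDocument.
function dom_text(string $html): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy markup
    // The XML declaration is a common trick to make loadHTML read UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Drop elements whose text is code, not content.
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//script | //style') as $node) {
        $node->parentNode->removeChild($node);
    }

    // Prefer the body; fall back to the whole document if there is none.
    $body = $doc->getElementsByTagName('body')->item(0);
    $text = $body !== null ? $body->textContent : $doc->textContent;

    // Collapse whitespace runs into single spaces.
    return trim(preg_replace('/\s+/u', ' ', $text));
}

echo dom_text('<p>Hello <b>world</b></p> <script>alert(1)</script> <p>Bye &amp; thanks</p>'), "\n";
```

One limitation to be aware of: textContent simply concatenates text nodes, so two block elements with no whitespace between them will have their text joined without a space.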