I have to convert an old website to a CMS. One of the challenges is that there are currently over 900 folders, each containing up to 9 text files. I need to combine the text files in each folder into one file and then use that file as the import into the CMS.
The file concatenation and import are working perfectly.
The challenge that I have is parsing some of the text in the text file.
The text file contains a url in the form of
Some text [http://xxxxx.com|About something] some more text
I am converting this with this code
if (substr ($line1, 0, 7) !=="Replace") {
$pattern = '/\\[/';
$pattern2 = '/\\]/';
$pattern3 = '/\\|/';
$replacement = '<a href="';
$replacement3 = '">';
$replacement2='</a><br>';
$subject = $line1;
$i=preg_replace($pattern, $replacement, $subject, -1 );
$i=preg_replace($pattern3, $replacement3, $i, -1 );
$i=preg_replace($pattern2, $replacement2, $i, -1 );
$line .= '<div class="'.$folders[$x].'">'.$i.'</div>' ;
}
It may not be the most efficient code, but it works, and as this is a one-off exercise, execution time is not an issue.
Now to the problem that I cannot seem to code around. Some of the urls in the text files are in this format
Some text [http://xxxx.com] some more text
The pattern matching above finds pattern and pattern2, but as there is no match for pattern3, the url is malformed in the output.
Regular expressions are not my forte. Is there a way to modify what I have above, or another way to get a correctly formatted url in my output? Or will I need to parse the output a second time, looking for the malformed urls and correcting them before writing to the output file?
You can use preg_replace_callback() to achieve this:
Find any string of the format [...]
Try to split them by the delimiter | using explode()
If the split array contains two pieces, the [...] string holds both the link href and the link anchor text
If not, the [...] string holds only the link href
Format and return the link
Code:
$input = <<<EOD
Some text [http://xxxxx.com|About something] some more text
Some text [http://xxxx.com] some more text
EOD;
$output = preg_replace_callback('#\[([^\]]+)\]#', function($m)
{
$parts = explode('|', $m[1]);
if (count($parts) == 2)
{
return sprintf('<a href="%s">%s</a>', $parts[0], $parts[1]);
}
else
{
return sprintf('<a href="%1$s">%1$s</a>', $m[1]);
}
}, $input);
echo $output;
Output:
Some text <a href="http://xxxxx.com">About something</a> some more text
Some text <a href="http://xxxx.com">http://xxxx.com</a> some more text
I'm outputting a string assembled from a few different parts, and some of those parts may or may not contain some HTML. If I apply ucfirst() to the string and there's HTML before the text to be displayed then the text doesn't get proper capitalization.
$output = $before_text . $text . $after_text;
So if I've got
$before_text = 'this is the lead into ';
$text = 'the rest of the sentence';
$after_text = '.';
then ucfirst() works fine and $output will be
This is the lead into the rest of the sentence.
But this
$before_text = '<p>';
$text = 'the sentence.';
$after_text = '</p>';
won't do anything. So I guess I need a function or regex to make its way to the first actual, regular text and then capitalize it. But I can't figure it out.
Use strip_tags() on $text and save the result in $temp: this gives you the text without any HTML.
Apply ucfirst() to $temp and call the result $temp_ucfirst: this gives you the capitalized text.
Use str_replace() to replace $temp in $text with $temp_ucfirst: this swaps the non-HTML text for the capitalized version.
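The str_replace() step only works when the stripped text appears as one contiguous run in the original string. As an alternative, here is a minimal sketch that skips any leading tags and capitalizes the first character of real text. The function name ucfirst_html is made up for this example, and it assumes the tags are simple (no ">" inside attribute values).

```php
<?php
// Hypothetical helper: capitalize the first text character of a string,
// skipping any tags and whitespace that come before it.
function ucfirst_html(string $html): string
{
    $result = preg_replace_callback(
        // Group 1: leading tags and whitespace. Group 2: first character of real text.
        '/^((?:\s*<[^>]*>)*\s*)([^<\s])/',
        function (array $m): string {
            return $m[1] . ucfirst($m[2]);
        },
        $html,
        1
    );

    // preg_replace_callback() returns null on a regex error; fall back to the input.
    return $result ?? $html;
}

$output = ucfirst_html('<p>' . 'the sentence.' . '</p>');
```

If the string contains no text at all (only tags), the pattern does not match and the input comes back unchanged.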
I'm trying to remove script tags from the source code using regular expression.
/<\s*script[^>]*[^\/]>(.*?)<\s*\/\s*script\s*>/is
But I ran into a problem when I need to remove code that sits inside other code.
I tested it at https://regex101.com/r/R6XaUT/1
How do I correctly create a regular expression so that it can cover all the code?
Sample text:
$text = '<b>sample</b> text with <div>tags</div>';
Result for strip_tags($text):
Output: sample text with tags
Result for strip_tags_content($text):
Output: text with
Result for strip_tags_content($text, '<b>'):
Output: <b>sample</b> text with
Result for strip_tags_content($text, '<b>', TRUE):
Output: text with <div>tags</div>
I hope someone finds this useful :)
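The definition of strip_tags_content() is not shown in this answer. The sketch below is one possible implementation that reproduces the outputs listed above; it is an assumption, not necessarily the original function. It is regex based, so nested elements with the same tag name will not be handled correctly.

```php
<?php
// Sketch of a strip_tags_content()-style helper: removes elements together
// with their content. $tags is a list such as '<b><i>'.
//   $invert = false: remove every element except the listed ones
//   $invert = true:  remove only the listed elements
function strip_tags_content(string $text, string $tags = '', bool $invert = false): string
{
    // Collect the tag names from the $tags string.
    preg_match_all('/<\s*(\w+)/', $tags, $m);
    $names = array_unique($m[1]);

    if (count($names) > 0) {
        $list = implode('|', $names);
        $pattern = $invert
            ? '@<(' . $list . ')\b[^>]*>.*?</\1>@si'
            : '@<(?!(?:' . $list . ')\b)(\w+)\b[^>]*>.*?</\1>@si';
        return preg_replace($pattern, '', $text);
    }

    // No tags listed: remove every element, unless inverted (then nothing is removed).
    return $invert ? $text : preg_replace('@<(\w+)\b[^>]*>.*?</\1>@si', '', $text);
}

$text = '<b>sample</b> text with <div>tags</div>';
$keepBold = strip_tags_content($text, '<b>');
```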
Simply use the PHP function strip_tags. See
http://php.net/manual/de/function.strip-tags.php
$string = "<div>hello</div>";
echo strip_tags($string);
Will output
hello
You also can provide a list of tags to keep.
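A short example of the allowed-tags argument, using a made-up input string. Note that strip_tags() removes only the tags themselves; the text between them stays, so for a script element the JavaScript source would remain in the output as plain text.

```php
<?php
$string = '<div><b>hello</b> world</div>';

// Second argument: tags that should survive, written as one string.
$kept = strip_tags($string, '<b>');

echo $kept; // <b>hello</b> world
```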
==
Another approach is this:
// Load a file into $html
$html = file_get_contents('scratch.html');
$matches = [];
// Also match tags that carry attributes, e.g. <div class="x">
preg_match_all('/<\/?([a-z][a-z0-9]*)\b[^>]*>/i', $html, $matches);
// Have a list of all Tags only once
$tags = array_unique($matches[1]);
// Find the script index and remove it
$scriptTagIndex = array_search("script", $tags);
if($scriptTagIndex !== false) unset($tags[$scriptTagIndex]);
// The tag list must be a string of the form <tagname1><tagname2>...
$allowedTags = array_map(function ($s) { return "<$s>"; }, $tags);
// Strip the HTML and keep all tags except the removed ones (script)
$noScript = strip_tags($html,join("", $allowedTags));
echo $noScript;
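Because strip_tags() keeps the text inside the tags it removes, the script source would still appear in $noScript as plain text. If the goal is to drop script elements together with their content, a DOM-based sketch like the following may be closer to what is needed. It assumes the DOM extension is available; the helper name remove_scripts is made up for this example. loadHTML() may also need an explicit encoding hint for non-ASCII input.

```php
<?php
// Hypothetical helper: remove <script> elements, including their content.
function remove_scripts(string $html): string
{
    $doc = new DOMDocument();

    // Real-world markup is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Copy the live node list into an array first, because removing nodes
    // while iterating the live list would skip elements.
    foreach (iterator_to_array($doc->getElementsByTagName('script')) as $script) {
        $script->parentNode->removeChild($script);
    }

    return $doc->saveHTML();
}

$clean = remove_scripts('<div><p>keep</p><script>alert(1)</script></div>');
```

The LIBXML_HTML_NOIMPLIED flag works best when the input has a single root element, as in the usage line above.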
So I currently have this...
<?php
$textblockwithformatedlinkstoecho = preg_replace('!(((f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '<a href="$1">$1</a>',
$origtextwithlinks);
echo $textblockwithformatedlinkstoecho;
?>
But, I would like to also shorten the clickable link to around 15 chars in length...
Example input text
I recommend you visit http://www.example.com/folder1/folder2/page3.html?longtext=ugsdfhsglshghsdghlsg8ysd87t8sdts8dtsdtygs9ysd908yfsd0fyu for more information.
Required output text
I recommend you visit example.com/fol... for more information.
You can use preg_replace_callback() to manipulate the matches.
Example:
$text = "I recommend you visit http://www.example.com/folder1/folder2/page3.html?longtext=ugsdfhsglshghsdghlsg8ys\d87t8sdts8\dtsdtygs9ysd908yfsd0fyu for more information.";
$fixed = preg_replace_callback(
'!(((f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i',
function($matches) {
// Get the fully matched url
$url = $matches[0];
// Do some magic for the link text, like only show the first 15 characters
$text = strlen($url) > 15
? substr($url, 0, 15) . '...'
: $url;
// Return the new html link
return '<a href="' . $url . '">' . $text . '</a>';
},
$text
);
echo $fixed;
You probably need to modify your regex though, since it doesn't match the \-characters you have in the query string in the url.
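To get a label like the "example.com/fol..." in the required output, the callback can also drop the scheme and a leading "www." before truncating. The sketch below does that. The function name shorten_links and the simplified URL pattern (scheme followed by any run of non-whitespace characters) are assumptions made for this example; punctuation directly after a URL would be swallowed by this pattern.

```php
<?php
// Hypothetical helper: turn URLs into links whose visible text is shortened.
function shorten_links(string $text, int $max = 15): string
{
    return preg_replace_callback(
        '!(?:f|ht)tps?://[^\s<]+!i',
        function (array $m) use ($max): string {
            $url = $m[0];

            // Visible label: URL without scheme and without a leading "www."
            $label = preg_replace('!^(?:f|ht)tps?://(?:www\.)?!i', '', $url);
            if (strlen($label) > $max) {
                $label = substr($label, 0, $max) . '...';
            }

            // Escape both parts so the generated HTML stays well-formed.
            return '<a href="' . htmlspecialchars($url) . '">'
                . htmlspecialchars($label) . '</a>';
        },
        $text
    );
}

$html = shorten_links('I recommend you visit http://www.example.com/folder1/folder2/page3.html?longtext=abc for more information.');
```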
What is the easiest way of applying highlighting of some text excluding text within OCCASIONAL tags "<...>"?
CLARIFICATION: I want the existing tags PRESERVED!
$t =
preg_replace(
"/(markdown)/",
"<strong>$1</strong>",
"This is essentially plain text apart from a few html tags generated with some
simplified markdown rules: <a href=markdown.html>[see here]</a>");
Which should display as:
"This is essentially plain text apart from a few html tags generated with some simplified markdown rules: see here"
... BUT NOT MESS UP the text inside the anchor tag (i.e. <a href=markdown.html> ).
I've heard the arguments of not parsing html with regular expressions, but here we're talking essentially about plain text except for minimal parsing of some markdown code.
Actually, this seems to work ok:
<?php
$item="markdown";
$t="This is essentially plain text apart from a few html tags generated
with some simplified markdown rules: <a href=markdown.html>[see here]</a>";
//_____1. apply emphasis_____
$t = preg_replace("|($item)|","<strong>$1</strong>",$t);
// "This is essentially plain text apart from a few html tags generated
// with some simplified <strong>markdown</strong> rules: <a href=
// <strong>markdown</strong>.html>[see here]</a>"
//_____2. remove emphasis if WITHIN opening and closing tag____
$t = preg_replace("|(<[^>]+?)(<strong>($item)</strong>)([^<]+?>)|","$1$3$4",$t);
// this preserves the text before ($1), after ($4)
// and inside <strong>..</strong> ($2), but without the tags ($3)
// "This is essentially plain text apart from a few html tags generated
// with some simplified <strong>markdown</strong> rules: <a href=markdown.html>
// [see here]</a>"
?>
A string like $item="odd|string" would cause some problems, but I won't be using that kind of string anyway... (probably needs htmlentities(...) or the like...)
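For search strings that contain regex metacharacters, preg_quote() is probably the better fit than htmlentities(). A minimal sketch with a made-up sample sentence, using "/" as the delimiter:

```php
<?php
$item = 'odd|string';
$t = 'An odd|string inside plain text.';

// preg_quote() escapes regex metacharacters such as "|"; the second
// argument additionally escapes the chosen delimiter.
$pattern = '/(' . preg_quote($item, '/') . ')/';

$t = preg_replace($pattern, '<strong>$1</strong>', $t);

echo $t; // An <strong>odd|string</strong> inside plain text.
```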
You could split the string into tag/no-tag parts using preg_split:
$parts = preg_split('/(<(?:[^"\'>]|"[^"<]*"|\'[^\'<]*\')*>)/', $str, -1, PREG_SPLIT_DELIM_CAPTURE);
Then you can iterate over the parts while skipping every odd part (i.e. the tag parts, which the capture group places at the odd indexes) and apply your replacement to the remaining text parts:
for ($i=0, $n=count($parts); $i<$n; $i+=2) {
$parts[$i] = preg_replace("/(markdown)/", "<strong>$1</strong>", $parts[$i]);
}
At the end put everything back together with implode:
$str = implode('', $parts);
But note that this is really not the best solution. It would be better to use a proper HTML parser such as PHP's DOM library. See for example these related questions:
Highlight keywords in a paragraph
Regex / DOMDocument - match and replace text not in a link
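A rough sketch of the DOM approach: walk the text nodes only and wrap each match in a strong element, so attribute values such as href are never touched. It assumes the DOM extension is available and ASCII (or otherwise correctly declared) input; the helper name highlight_text_nodes is made up for this example.

```php
<?php
// Hypothetical helper: wrap $word in <strong>, but only inside text nodes.
function highlight_text_nodes(string $html, string $word): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a div so the parser sees a single root element.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $pattern = '/(' . preg_quote($word, '/') . ')/';

    foreach (iterator_to_array($xpath->query('//text()')) as $node) {
        $pieces = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($pieces) < 2) {
            continue; // no match in this text node
        }

        $fragment = $doc->createDocumentFragment();
        foreach ($pieces as $i => $piece) {
            if ($i % 2 === 1) {
                // Odd indexes hold the captured matches.
                $strong = $doc->createElement('strong');
                $strong->appendChild($doc->createTextNode($piece));
                $fragment->appendChild($strong);
            } elseif ($piece !== '') {
                $fragment->appendChild($doc->createTextNode($piece));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize the children of the wrapper div, not the div itself.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$highlighted = highlight_text_nodes(
    'simplified markdown rules: <a href="markdown.html">[see here]</a>',
    'markdown'
);
```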
First replace the string wherever it follows a tag. To make sure text at the very start is also preceded by a tag, prepend a dummy tag:
$t=preg_replace("|(>[^<]*)(markdown)|i",'$1<strong>$2</strong>',"<null>$t");
Then delete the dummy tag:
$t=preg_replace("|<null>|",'',$t);
You could split your string into an array at every '<' or '>' using preg_split(), then loop through that array and replace only in the entries that are not tag content. Afterwards you combine your array into a string using implode().
This regex should strip all HTML opening and closing tags: /(<.*?>)+/
You can use it with preg_replace like this:
$test = "Hello <strong>World!</strong>";
$regex = "/(<.*?>)+/";
$result = preg_replace($regex,"",$test);
actually this is not very efficient, but it worked for me
$your_string = '...';
$search = 'markdown';
$left = '<strong>';
$right = '</strong>';
$left_Q = preg_quote($left, '#');
$right_Q = preg_quote($right, '#');
$search_Q = preg_quote($search, '#');
while(preg_match('#(>|^)[^<]*(?<!'.$left_Q.')'.$search_Q.'(?!'.$right_Q.')[^>]*(<|$)#isU', $your_string))
$your_string = preg_replace('#(^[^<]*|>[^<]*)(?<!'.$left_Q.')('.$search_Q.')(?!'.$right_Q.')([^>]*<|[^>]*$)#isU', '${1}'.$left.'${2}'.$right.'${3}', $your_string);
echo $your_string;
I have a paragraph of text in the following format:
text text text <age>23</age>. text text <hobbies>...</hobbies>
I want to be able to
1) Extract the text found between each <age> and <hobbies> tag found in the string. So for example, I would have an array called $ages which will contain all ages found between all the <age></age> tags, and then another array $hobbies which will have the text between the <hobbies></hobbies> tags found throughout the string.
2) Be able to replace the tags which are extracted with a marker, such as {age_444}, so e.g the above text would become
text text text {age_444}. text text {hobbies_555}
How can this be done?
//Extract the age
preg_match_all("#<age>(.*?)</age>#",$string,$match);
$ages=$match[1];
//Extract the hobby
preg_match_all("#<hobbies>(.*?)</hobbies>#",$string,$match);
$hobbies=$match[1];
//Replace the age
//Note: create_function() was removed in PHP 8.0, so closures are used here instead.
//The mysql_* functions were removed in PHP 7.0; port these calls to mysqli or PDO.
$agefn=function($match){$query=mysql_query("select ageid...where age=".$match[1]); return "<age>{age_".mysql_fetch_object($query)->ageid."}</age>";};
$string=preg_replace_callback("#<age>(.*?)</age>#",$agefn,$string);
//Replace the hobby
$hobfn=function($match){$query=mysql_query("select hobid...where hobby=".$match[1]); return "<hobbies>{hobbies_".mysql_fetch_object($query)->hobid."}</hobbies>";};
$string=preg_replace_callback("#<hobbies>(.*?)</hobbies>#",$hobfn,$string);
If your source document is a kind of well-formed XML (or if it can easily be brought into this shape at least), you can use XSLT/XSL-FO to transform your document.
Finding information enclosed by <> tags and rearranging or extracting it is one of the main features. You can use XSLT/XSL-FO stand-alone or from within various languages (Java, C, even Visual Basic).
What you need is your source document and a document describing the transformation rules. The rendering machine or library function will do the rest.
Hope that helps. Good luck
$string = '<age>23</age><hobbies>hobbietext</hobbies>';
$ages = array();
$hobbies = array();
// Skip the first chunk: it is the text before the first <age>
foreach(array_slice(explode('<age>', $string), 1) as $value)
{
$age = explode('</age>', $value);
// Only keep the chunk if a closing tag was found
if(count($age) > 1) $ages[] = $age[0];
}
// Same idea for <hobbies>
foreach(array_slice(explode('<hobbies>', $string), 1) as $value)
{
$hobbie = explode('</hobbies>', $value);
if(count($hobbie) > 1) $hobbies[] = $hobbie[0];
}
The final arrays are $hobbies and $ages.
After that you just replace the tags in your string like this:
foreach($ages as $key=>$value)
{
$string = str_replace('<age>'.$value.'</age>', '{age_'.$yourId.'}', $string);
}
foreach($hobbies as $key=>$value)
{
$string = str_replace('<hobbies>'.$value.'</hobbies>', '{hobbies_'.$yourId.'}', $string);
}
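Extraction and replacement can also happen in a single pass with preg_replace_callback(). The sketch below is one way to do it. Since the question does not say where the marker numbers come from, it assumes the id is simply the position of the value in its array; the sample hobby text is made up.

```php
<?php
$string = 'text text text <age>23</age>. text text <hobbies>chess</hobbies>';
$found = ['age' => [], 'hobbies' => []];

$string = preg_replace_callback(
    // \1 makes sure the closing tag matches the opening tag.
    '#<(age|hobbies)>(.*?)</\1>#s',
    function (array $m) use (&$found): string {
        // Store the extracted text under its tag name.
        $found[$m[1]][] = $m[2];

        // Assumption: the marker id is the index of the stored value.
        $id = count($found[$m[1]]) - 1;

        return '{' . $m[1] . '_' . $id . '}';
    },
    $string
);

$ages = $found['age'];
$hobbies = $found['hobbies'];

echo $string; // text text text {age_0}. text text {hobbies_0}
```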