Regex find the first word - php

I'm trying to use regex to add a span to the first word of content for a page, however the content contains HTML so I am trying to ensure just a word gets chosen. The content changes for every page.
Current script is:
preg_match('/(<(.*?)>)*/i',$page_content,$matches);
$stripped = substr($page_content,strlen($matches[0]));
preg_match('/\b[a-z]* \b/i',$stripped,$strippedmatch);
echo substr($page_content, 0, strlen($matches[0])).'<span class="h1">'.$strippedmatch[0].'</span>'.substr($stripped, strlen($strippedmatch[0]));
However if the $page_content is
<p><span class="title">This is </span> my title!</p>
Then my regex thinks the first word is "span" and adds the tags around that.
Is there any way to fix this? (or a better way to do it).

This seems to work...
(?<=\>)\b\w*\b|^\w*\b
If you wanna allow spaces in front also (remember to trim the resulting string):
(?<=>)\s*\b\w*\b|^\s*\w*\b

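If I understand you correctly, you want a tag around the first word that is not itself part of a tag.
With regex you could do that like this:
$code = preg_replace('/^((?:<.+?>\s*)+?)(\w+)/', '\1<span class="h1">\2</span>', $code);
This steps over the leading tags until it finds text outside them. The inner group is non-capturing so that \1 holds all of the leading tags; a repeated capturing group would keep only the last tag it matched.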

You shouldn't be using regex for this, but if you insist, you can try something like this:
<?php
$texts = array(
'<p><span class="title">This is </span> my title!</p>',
'<1> <2> <3> blah blah <4> <5> blah',
'garbage <1> <2> real stuff begins <3> <4>',
);
foreach ($texts as $text) {
print preg_replace('/(>\s*)(\w+)/', '\1{{\2}}', $text, 1)."\n";
}
?>
This prints:
<p><span class="title">{{This}} is </span> my title!</p>
<1> <2> <3> {{blah}} blah <4> <5> blah
garbage <1> <2> {{real}} stuff begins <3> <4>
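For the original goal (a span with class h1 around the first word), the same idea can be packaged as a small function. This is only a sketch: the name wrapFirstWord is made up for illustration, and it assumes the content starts with zero or more tags followed by a word, with no comments or attribute values containing a literal >.

```php
<?php
// Hypothetical helper, not part of any answer above.
function wrapFirstWord(string $html): string
{
    // Group 1: any leading tags and whitespace. Group 2: the first word.
    $pattern = '/^((?:\s*<[^>]+>)*\s*)(\w+)/';
    // The limit of 1 is redundant with the ^ anchor but makes the intent explicit.
    return preg_replace($pattern, '$1<span class="h1">$2</span>', $html, 1);
}

echo wrapFirstWord('<p><span class="title">This is </span> my title!</p>'), "\n";
// <p><span class="title"><span class="h1">This</span> is </span> my title!</p>
```

Because the tags are captured in one group, they are all written back unchanged in front of the new span.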

Related

How to echo only the values found inside a specific part (the href attribute) of a string?

[PHP] I have a variable that stores a very large page source as a string. I want to echo only the strings I'm interested in (there are dozens of them, and I need to extract them for a project). They sit inside the quotation marks of the href attribute,
but I only want to capture the values that start with the letter n (news):
<a href="/news7044449/exclusive_news_sunday_"
that is, only the part between the quotes:
/news7044449/exclusive_news_sunday_
So I think the match has to start from: a href="/n
How can I make the echo discard all the other text in the variable and show only those values?
Note that there are other href attributes whose values start with other letters, such as the letter 'p': href="/profiles... (These do not interest me.)
$string = '</div><span class="news-hd-mark">HD</span></div><p>exclusive_news_sunday_</p><p class="metadata"><span class="bg">Czech AV<span class="mobile-hide"> - 5.4M Views</span>
- <span class="duration">7 min</span></span></p></div><script>xv.thumbs.preparenews(7044449);</script>
<div id="news_31720715" class="thumb-block "><div class="thumb-inside"><div class="thumb"><a href="/news31720715/my_sister_running_every_single_morning"><img src="https://static-hw.xnewss.com/img/lightbox/lightbox-blank.gif"';
I imagine something like this:
$removes_everything_except_values_from_the_href_tag_starting_with_the_letter_n = ('/something regex expresion I think /' or preg_match, substring?);
echo $string = str_replace($removes_everything_except_values_from_the_href_tag_starting_with_the_letter_n,'',$string);
expected output: /news7044449/exclusive_news_sunday_
NOTE: the source does not have to be a variable; the values could just as well be extracted from a .txt file.
Thanks.
I believe this will help.
<?php
$source = file_get_contents("code.html");
preg_match_all("/<a href=\"(\/n(?:.+?))\"[^>]*>/", $source, $results);
var_export( end($results) );
To get just the links out of the $results array from Valdeir's answer, loop over the first capture group, $results[1] ($results itself is an array of arrays):
foreach ($results[1] as $r) {
    echo $r."\n";
    // alt: to display them with an HTML break tag after each one
    // echo $r."<br>\n";
}
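Put together, the extraction could look like the sketch below. The sample markup is made up (only the first href value comes from the question), and the pattern assumes the href values are wrapped in double quotes.

```php
<?php
// Made-up sample markup; only the shape of the href values matters.
$source = '<a href="/news7044449/exclusive_news_sunday_">A</a>'
        . '<a href="/profiles/someone">B</a>'
        . '<a class="x" href="/news123/another_story">C</a>';

// Capture href values that begin with /n; [^"]* stops at the closing quote.
preg_match_all('~<a\s[^>]*?href="(/n[^"]*)"~i', $source, $matches);

$links = $matches[1]; // group 1 only, without the surrounding markup
echo implode("\n", $links), "\n";
```

Using [^"]* instead of .+? keeps a match from running past the end of the attribute.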

preg_replace : getting a html tag inside an other html tag from BBCode

So I'm trying to make a php function to get HTML tags from a BBCode-style form. The fact is, I was able to get tags pretty easily with preg_replace. But I have some troubles when I have a bbcode inside the same bbcode...
Like this :
[blue]My [black]house is [blue]very[/blue] beautiful[/black] today[/blue]
So, when I "parse" it, I always end up with leftover BBCode for the inner blue tags. Something like:
My house is [blue]very[/blue] beautiful today
Everything is colored except for the blue tag inside the black tag inside the first blue tag.
How the hell can I do that?
For more information, here is what I tried:
Regex: "/\[blue\](.*)\[\/blue\]/si" or "/\[blue\](.*)\[\/blue\]/i"
Getting : "My house is [blue]very[/blue] beautiful today"
Regex : "/\[blue\](.*?)\[\/blue\]/si" or "/\[blue\](.*)\[\/blue\]/Ui"
Getting : "My house is [blue]very beautiful today[/blue]"
Do I have to loop the preg_replace? Isn't there a way to do it, regex-style, without looping the thing?
Thanks for your help. :)
It is true that you should not reinvent the wheel for production code and should rather choose well-tested plugins. However, if you are experimenting or working on pet projects, by all means, go ahead and experiment with things, have fun and gain important knowledge in the process.
With that said, you may try the following regex. I'll break it down for you below.
(\[(.*?)\])(.*)(\[/\2\])
Philosophy
While parsing markup like this, what you are actually seeking is to match tags with their pairs.
So, a clean approach would be to run a loop, capturing the outermost tag pair each time and replacing it.
With the regex above (note the greedy (.*) in the middle, which reaches for the last matching closing tag), the capture groups give you the following info on the first pass:
Opening tag (complete): [blue]
Opening tag (tag name): blue
Content between opening and closing tag: My [black]house is [blue]very[/blue] beautiful[/black] today
Closing tag: [/blue]
So, you can use $2 to determine the tag you are processing, and replace it with
<tag>$3</tag>
// or even
<$2>$3</$2>
Which will give you;
// in first iteration
<tag>My [black]house is [blue]very[/blue] beautiful[/black] today</tag>
// in second iteration
<tag>My <tag2>house is [blue]very[/blue] beautiful</tag2> today</tag>
// in third iteration
<tag>My <tag2>house is <tag3>very</tag3> beautiful</tag2> today</tag>
Code
$text = "[blue]My [black]house is [blue]very[/blue] beautiful[/black] today[/blue]";
function convert($input)
{
    $control = $input;
    while (true) {
        $input = preg_replace('~(\[(.*?)\])(.*)(\[/\2\])~s', '<$2>$3</$2>', $input);
        if ($control == $input) {
            break;
        }
        $control = $input;
    }
    return $input;
}
echo convert($text);
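If the tag names are known in advance and the input can be trusted to be properly nested, pairing the tags up is not strictly necessary: each opening and closing tag can be translated on its own in a single pass. The sketch below takes that different route (it is not the looping method described above). The function name, the colour whitelist and the span output are assumptions for illustration.

```php
<?php
// Hypothetical helper: translates colour tags one by one, without pairing them.
function bbcodeColors(string $text): string
{
    $colors = ['blue', 'black']; // assumed whitelist of allowed tag names
    $pattern = '~\[(/?)(' . implode('|', $colors) . ')\]~i';

    return preg_replace_callback($pattern, function (array $m): string {
        // $m[1] is "/" for a closing tag and "" for an opening one.
        return $m[1] === '/'
            ? '</span>'
            : '<span style="color:' . strtolower($m[2]) . '">';
    }, $text);
}

echo bbcodeColors('[blue]My [black]house[/black] today[/blue]'), "\n";
// <span style="color:blue">My <span style="color:black">house</span> today</span>
```

The trade-off is that nothing checks whether the tags are balanced, so an unclosed [blue] produces an unclosed span.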
As others mentioned, don't try to reinvent the wheel.
However, you could use a recursive approach:
<?php
$text = "[blue]My [black]house is [blue]very[/blue] beautiful[/black] today[/blue]";
$regex = '~(\[ ( (?>[^\[\]]+) | (?R) )* \])~x';
$replacements = array( "blue" => "<bleu>",
"black" => "<noir>",
"/blue" => "</bleu>",
"/black" => "</noir>");
$text = preg_replace_callback($regex,
function($match) use ($replacements) {
return $replacements[$match[2]];
},
$text);
echo $text;
# <bleu>My <noir>house is <bleu>very</bleu> beautiful</noir> today</bleu>
?>
Here, every colour tag is replaced by its French (just made it up) counterpart, see a demo on ideone.com. To learn more about recursive patterns, have a look at the PHP documentation on the subject.

How to wrap every word in spans with PHP?

I have some HTML paragraphs and I want to wrap every word in a <span>. Now I have
$paragraph = "This is a paragraph.";
$contents = explode(' ', $paragraph);
$i = 0;
$span_content = '';
foreach ($contents as $c){
$span_content .= '<span>'.$c.'</span> ';
$i++;
}
$result = $span_content;
The above code works just fine for normal cases, but sometimes $paragraph contains HTML tags, for example
$paragraph = "This is an image: <img src='/img.jpeg' /> This is a <a href='/abc.htm'/>Link</a>'";
How can I avoid wrapping the "words" inside HTML tags, so that the tags still work while all the other words get wrapped in spans? Thanks a lot!
Some (*SKIP)(*FAIL) mechanism?
<?php
$content = "This is an image: <img src='/img.jpeg' /> ";
$content .= "This is a <a href='/abc.htm'/>Link</a>";
$regex = '~<[^>]+>(*SKIP)(*FAIL)|\b\w+\b~';
$wrapped_content = preg_replace($regex, "<span>\\0</span>", $content);
echo $wrapped_content;
See a demo on ideone.com as well as on regex101.com.
To leave out the Link as well, you could go for:
(?:<[^>]+> # same pattern as above
| # or
(?<=>)\w+(?=<) # lookarounds with a word
)
(*SKIP)(*FAIL) # all of these alternatives shall fail
|
(\b\w+\b)
See a demo for this on regex101.com.
The short version is you really do not want to attempt this.
The longer version: if you are dealing with HTML then you need an HTML parser. You can't use regexes. Where it becomes even messier is that you are not starting with HTML but with an HTML fragment (which may or may not be well-formed). Hence you need to use an HTML parser to identify the non-HTML extents, separate them out and feed them into a secondary parser (which might well use regexes) for translation, then put the translated content back into the DOM before serializing the document.
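A minimal sketch of that parser-based route, using PHP's bundled DOM extension, might look like this. The function name and the wrap-root container id are invented for the example, and it assumes ASCII input (DOMDocument::loadHTML needs an explicit encoding hint for anything else).

```php
<?php
// Hypothetical helper: wraps every word found in text nodes in <span>.
function wrapWordsInSpans(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy fragments quietly
    // Assumed container id, used only to find the fragment again.
    $doc->loadHTML('<div id="wrap-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root = $xpath->query('//div[@id="wrap-root"]')->item(0);

    // Snapshot the text nodes first: the tree is modified inside the loop.
    $textNodes = iterator_to_array($xpath->query('.//text()', $root));
    foreach ($textNodes as $node) {
        // Split into words and the separators between them, keeping both.
        $parts = preg_split('/(\w+)/u', $node->nodeValue, -1,
            PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $part) {
            if (preg_match('/^\w+$/u', $part)) {
                $span = $doc->createElement('span');
                $span->appendChild($doc->createTextNode($part));
                $fragment->appendChild($span);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the container, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo wrapWordsInSpans('This is a <a href="/abc.htm">Link</a>'), "\n";
```

Only text nodes are touched, so tag names and attribute values such as the href are never candidates for wrapping.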

Stripping html tags using php

How can I strip HTML tags, except for the content inside the pre tag?
Code:
$content="
<div id="wrapper">
Notes
</div>
<pre>
<div id="loginfos">asdasd</div>
</pre>
";
When I use strip_tags($content, ''), the HTML inside the pre tag gets stripped too, but I don't want the HTML inside pre to be stripped.
Try :
echo strip_tags($text, '<pre>');
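You may do the following:
Use preg_replace with the 'e' modifier to replace the contents of the pre tags with placeholder strings like ###1###, ###2###, etc., while storing the contents in an array
Run strip_tags()
Run preg_replace with the 'e' modifier again to restore ###1###, etc. to the original contents.
A bit kludgy but should work. (Note that the 'e' modifier was deprecated in PHP 5.5 and removed in PHP 7.0; on current versions preg_replace_callback does the same job.)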
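<?php
$document = html_entity_decode($content);
// Each pattern uses ' as its delimiter.
$search = array("'<script[^>]*?>.*?</script>'si", "'<[/!]*?[^<>]*?>'si", "'([\r\n])[\s]+'",
    "'&(quot|#34);'i", "'&(amp|#38);'i", "'&(lt|#60);'i", "'&(gt|#62);'i", "'&(nbsp|#160);'i",
    "'&(iexcl|#161);'i", "'&(cent|#162);'i", "'&(pound|#163);'i", "'&(copy|#169);'i");
$replace = array("", "", "\\1", "\"", "&", "<", ">", " ", chr(161), chr(162), chr(163), chr(169));
$text = preg_replace($search, $replace, $document);
// The /e modifier was removed in PHP 7.0, so numeric entities are decoded with a callback instead.
$text = preg_replace_callback("'&#(\d+);'", function ($m) { return chr((int) $m[1]); }, $text);
echo $text;
?>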
$text = 'YOUR CODE HERE';
$org_text = $text;
// hide content within pre tags
$text = preg_replace( '/(<pre[^>]*>)(.*?)(<\/pre>)/is', '$1###pre###$3', $text );
// filter content
$text = strip_tags( $text, '<pre>' );
// insert back content of pre tags
if ( preg_match_all( '/(<pre[^>]*>)(.*?)(<\/pre>)/is', $org_text, $parts ) ) {
    foreach ( $parts[2] as $code ) {
        // a callback returns $code literally, so "$1" or "\1" inside it is not treated as a backreference
        $text = preg_replace_callback( '/###pre###/', function () use ( $code ) { return $code; }, $text, 1 );
    }
}
print_r( $text );
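The placeholder steps described in the answers above can also be sketched as one function built on preg_replace_callback. The function name and the ###PREn### placeholder format are made up for this example, and it assumes that pre blocks are not nested and that the placeholder text does not otherwise occur in the input.

```php
<?php
// Hypothetical helper: strips all tags except whole <pre>...</pre> blocks.
function stripTagsExceptPre(string $html): string
{
    $saved = [];

    // 1. Swap each <pre> block for a numbered placeholder and remember it.
    $html = preg_replace_callback(
        '~<pre\b[^>]*>.*?</pre>~is',
        function (array $m) use (&$saved): string {
            $key = '###PRE' . count($saved) . '###';
            $saved[$key] = $m[0];
            return $key;
        },
        $html
    );

    // 2. Strip every remaining tag.
    $html = strip_tags($html);

    // 3. Put the untouched <pre> blocks back.
    return str_replace(array_keys($saved), array_values($saved), $html);
}

$content = '<div id="wrapper">Notes</div><pre><div id="loginfos">asdasd</div></pre>';
echo stripTagsExceptPre($content), "\n";
// Notes<pre><div id="loginfos">asdasd</div></pre>
```

Restoring with str_replace rather than a regex replacement avoids any special handling of $ or backslashes inside the saved blocks.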
OK, that leaves you with only one choice: regular expressions. Nobody likes 'em, but they sure get the job done. First, replace the problematic text with something weird, like this:
$content = preg_replace("#<pre>(.+?)</pre>#s", "||k||", $content);
This will effectively swap your
<pre> blah, blah, blah....</pre>
for something else (the result has to be assigned back, and the s modifier lets . match the line breaks inside the pre block). Then call
$content = strip_tags($content);
After that, you can just put the original value back in place of ||k|| (or whatever you choose) and you'll get the desired result.
I think your content is not stored correctly in the $content variable: the inner double quotes end the string early.
Could you check by converting the inner double quotes to single quotes?
$content="
<div id='wrapper'>
Notes
</div>
<pre>
<div id='loginfos'>asdasd</div>
</pre>
";
strip_tags($content, '<pre>');
Could you please write out the full code? I understand the idea, but something goes wrong when I try it.

How to grab the contents of HTML tags?

Hey so what I want to do is snag the content for the first paragraph. The string $blog_post contains a lot of paragraphs in the following format:
<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>
The problem I'm running into is that I am writing a regex to grab everything between the first <p> tag and the first closing </p> tag. However, it is grabbing the first <p> tag and the last closing </p> tag which results in me grabbing everything.
Here is my current code:
if (preg_match("/[\\s]*<p>[\\s]*(?<firstparagraph>[\\s\\S]+)[\\s]*<\\/p>[\\s\\S]*/",$blog_post,$blog_paragraph))
echo "<p>" . $blog_paragraph["firstparagraph"] . "</p>";
else
echo $blog_post;
Well, sysrqb will let you match anything in the first paragraph assuming there's no other html in the paragraph. You might want something more like this
<p>.*?</p>
Placing the ? after your * makes it non-greedy, meaning it will only match as little text as necessary before matching the </p>.
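Applied to the sample from the question, a non-greedy match could look like this sketch. The s modifier is an addition here so that a paragraph spanning several lines still matches; nested markup inside the paragraph is captured as-is.

```php
<?php
$blog_post = '<p>Paragraph 1</p><p>Paragraph 2</p><p>Paragraph 3</p>';

// .*? is lazy, so it stops at the first </p>; the s flag lets . match newlines.
if (preg_match('~<p>(.*?)</p>~s', $blog_post, $m)) {
    $first = $m[1];
} else {
    $first = $blog_post; // no paragraph found: fall back to the whole post
}
echo '<p>' . $first . '</p>', "\n";
// <p>Paragraph 1</p>
```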
If you use preg_match, use the "U" flag to make it ungreedy.
preg_match("/<p>(.*)<\/p>/U", $blog_post, $matches);
$matches[1] will then contain the first paragraph.
It would probably be easier and faster to use strpos() to find the position of the first <p> and the first </p>, then use substr() to extract the paragraph.
$paragraph_start = strpos($blog_post, '<p>');
$paragraph_end = strpos($blog_post, '</p>', $paragraph_start);
$paragraph = substr($blog_post, $paragraph_start + strlen('<p>'), $paragraph_end - $paragraph_start - strlen('<p>'));
Edit: Actually the regex in others' answers will be easier and faster... your big complex regex in the question confused me...
Using regular expressions for HTML parsing is never the right solution. You should be using XPath for this particular case:
$string = <<<XML
<div>
<p>Paragraph 1</p>
<p>Paragraph 2</p>
<p>Paragraph 3</p>
</div>
XML;
$xml = new SimpleXMLElement($string);
/* Find the first <p>; the wrapping <div> is only there because XML needs a single root element */
$resultado = $xml->xpath('//p[1]');
echo $resultado[0];
