I have a string variable that contains a lot of HTML markup and I want to get the last <li> element from it.
I'm using something like:
$markup = "<body><div><li id='first'>One</li><li id='second'>Two</li><li id='third'>Three</li></div></body>";
preg_match('#<li(.*?)>(.*)</li>#ims', $markup, $matches);
$lis = "<li ".$matches[1].">".$matches[2]."</li>";
$total = explode("</li>",$lis);
$num = count($total)-2;
echo $total[$num]."</li>";
This works and I get the last <li> element printed. But I can't understand why I have to subtract 2 from the count of the array $total. Normally I would only subtract 1, since counting starts at index 0. What am I missing?
Is there a better way of getting the last <li> element from the string?
HTML is not regular, and so can't be parsed with a regular expression. Use a proper HTML parser.
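For what it's worth, here is a minimal sketch of the parser route using PHP's built-in DOM extension. It assumes the DOM extension is enabled and reuses the $markup string from the question; $items and $last are just illustrative names.

```php
<?php
$markup = "<body><div><li id='first'>One</li><li id='second'>Two</li><li id='third'>Three</li></div></body>";

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them; real-world markup is rarely clean.
libxml_use_internal_errors(true);
$doc->loadHTML($markup);
libxml_clear_errors();

// getElementsByTagName returns the elements in document order.
$items = $doc->getElementsByTagName('li');
$last = $items->item($items->length - 1); // null when there is no <li> at all

if ($last !== null) {
    echo $last->textContent, "\n";     // text inside the last <li>
    echo $doc->saveHTML($last), "\n";  // the last <li> element serialised back to markup
}
```

Because the parser builds a tree, this keeps working when the list items contain nested tags or extra attributes, which is where the regex and explode approaches tend to break.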
@OP, your requirement looks simple, so no need for parsers or regex.
$markup = "<body><div><li id='first'>One</li><li id='second'>Two</li><li id='third'>Three</li></div></body>";
$s = explode("</li>",$markup,-1);
$t = explode(">",end($s));
print end($t);
output
$ php test.php
Three
If you already know how to use jQuery, you could also take a look at phpQuery. It's a PHP library that allows you to easily access DOM elements, just like in jQuery.
From the PHP.net documentation:
If matches is provided, then it is filled with the results of search. $matches[0] will contain the text that matched the full pattern, $matches[1] will have the text that matched the first captured parenthesized subpattern, and so on.
$matches[0] is the complete match (not just the captured bits)
Your pattern has 2 capturing groups, so $matches holds three entries:
$matches[0]; // Contains the text matched by the full pattern
$matches[1]; // Contains the argument for the LI start-tag (.*?)
$matches[2]; // Contains the string contained by the LI tags (.*)
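On the count($total)-2 part of the question, a small sketch may help. It assumes $lis ends up holding the three list items back to back, which is what the greedy (.*) in the question's pattern produces for that $markup.

```php
<?php
// Assumed value of $lis after the question's preg_match and string concatenation.
$lis = "<li id='first'>One</li><li id='second'>Two</li><li id='third'>Three</li>";

$total = explode("</li>", $lis);

// The string ends with the delimiter, so explode() appends one empty element.
var_dump(count($total));                // three items plus the trailing empty string
var_dump($total[count($total) - 1]);    // the empty string after the final "</li>"
var_dump($total[count($total) - 2]);    // the last real item: "<li id='third'>Three"
```

So the extra subtraction comes from explode(), not from preg_match: the final array slot is the empty string that follows the last </li>.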
'Parsing' (X)HTML strings with regular expressions is hard and can be full of unexpected problems. Parsing anything more than simply tagged strings is not possible, because (X)HTML is not a regular language.
You could improve your regex by using (not tested):
#<li([^>]*)>(.+?)</li>#ims
strrpos — Find position of last occurrence of a char in a string
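A rough sketch of how strrpos could be applied here, assuming the last "<li" in the string really is the start of the last list item (a later "<link" tag, for example, would also match) and that its closing tag follows it:

```php
<?php
$markup = "<body><div><li id='first'>One</li><li id='second'>Two</li><li id='third'>Three</li></div></body>";

$lastLi = null;
$start = strrpos($markup, '<li');            // offset of the last "<li", or false
if ($start !== false) {
    $end = strpos($markup, '</li>', $start); // first "</li>" after that offset
    if ($end !== false) {
        // Include the closing tag itself in the extracted substring.
        $lastLi = substr($markup, $start, $end - $start + strlen('</li>'));
    }
}

echo $lastLi, "\n";
```

This avoids regular expressions entirely, but like the explode approach it knows nothing about nesting.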
I have 2 sets of tags on page, first is
{tip}tooltip text{/tip}
and second is
{tip class="someClass"}tooltip text{/tip}
I need to replace those with
<span class="sstooltip"><i>?</i><em>tooltip text</em></span>
I don't know how to deal with adding a new class to the <span> tag. (The tooltip class is always present.)
This is my regex /\{tip.*?(?:class="([a-z]+)")?\}(.*?)\{\/tip\}/.
I guess I need to check array indexes for the class value, but those are different depending on the {tip} tag version. Do I need two regular expressions, one for each version, or is there some way to extract and replace the class value?
php code:
$regex = "/\{tip.*?(?:class=\"([a-z]+)\")?\}(.*?)\{\/tip\}/";
$matches = null;
preg_match_all($regex, $article->text, $matches);
if (is_array($matches)) {
    foreach ($matches as $match) {
        $article->text = preg_replace(
            $regex,
            "<span class=tooltip \$1"."><i>?</i><em>"."\$2"."</em></span>",
            $article->text
        );
    }
}
Here's your answer (I've also made it a bit more robust):
{tip(?:\s+class\s*=\s*"([a-zA-Z\s]+)")?}([^{]*){\/tip}
PCRE (which PHP uses, if memory serves) will automatically pick up that the first capture group (which grabs the classes) is empty in the first case, and just substitute the empty string in the replacement. The second case is self-explanatory.
Your replacement code, then, will look like this:
$article->text = preg_replace(
    '/{tip(?:\s+class\s*=\s*"([a-zA-Z\s]+)")?}([^{]*){\/tip}/',
    '<span class="tooltip $1"><i>?</i><em>$2</em></span>',
    $article->text
);
You don't need to check if the regex matches beforehand - that's implied by preg_replace, which performs a regex match and then replaces any text matched by the pattern with the replacement. If there are no matches, no replacement occurs.
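To see both tag versions go through the same call, here is a small self-contained sketch. replaceTips() is a hypothetical wrapper introduced only for this example; the pattern is the one given above with [^{]* for the tooltip text.

```php
<?php
// Hypothetical helper wrapping the single preg_replace call.
function replaceTips(string $text): string
{
    $result = preg_replace(
        '/{tip(?:\s+class\s*=\s*"([a-zA-Z\s]+)")?}([^{]*){\/tip}/',
        '<span class="tooltip $1"><i>?</i><em>$2</em></span>',
        $text
    );
    // preg_replace returns null on a regex error; fall back to the input in that case.
    return $result ?? $text;
}

$plain  = replaceTips('{tip}tooltip text{/tip}');
$styled = replaceTips('{tip class="someClass"}tooltip text{/tip}');

echo $plain, "\n";
echo $styled, "\n";
```

Note that when no class is given, $1 is empty and the attribute comes out as class="tooltip " with a trailing space, which browsers ignore.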
I'm trying to read an HTML file and capture all anchor tags that match a specific URL pattern in order to display those links on another page. The pattern looks like this:
https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web
I'm lousy with RegEx. I've tried a bunch of things and read a bunch of answers here on Stack Overflow, but I'm not hitting on the correct syntax.
Here's what I have now:
preg_match ('/<a href="https:\/\/docs.google.com\/file\/d\/(.*)<\/a>/', $file, $matches)
When I test this on an HTML page with two matching anchor tags, the first result includes the first and second match and everything in between, while the second result includes part of the first match, part of the second match, and everything in between.
While I'd be happy to capture matching anchor tags along with the inner HTML, I'd be even happier if I could generate a multidimensional array with the HREF attribute of each matching anchor tag, along with the matching inner HTML (so I can format the links myself, without having to use even more RegEx to get rid of unwanted attributes). Would I use preg_match_all for that? What would that look like?
Am I even on the right path here, or should I be using DOM and XPath queries to find this stuff?
Thanks.
Oh jeez, I can't believe every answer here uses "/" delimiters. If your pattern has slashes in it, use something else for the sake of readability.
Here's a better answer (you may need to tweak if your anchors may have additional attributes other than href):
$hrefPattern = "(?P<href>https://docs\.google\.com/file/d/[a-z0-9]+/edit\?usp=drive_web)";
$innerPattern = "(?P<inner>.*?)";
$anchorPattern = "<a href=\"$hrefPattern\">$innerPattern</a>";
preg_match_all("#$anchorPattern#i", $file, $matches);
This will give you something like:
[
0 => ['<a href="https://docs.google.com/file/d/foo/edit?usp=drive_web"><span>More foo</span></a>'],
"href" => ["https://docs.google.com/file/d/foo/edit?usp=drive_web"],
"inner" => ["<span>More foo</span>"]
]
And absolutely, you should use the DOM for this.
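A minimal sketch of the DOM route, assuming PHP's DOM extension is available. The sample $file contents and the $links array layout are made up for illustration; only the URL prefix comes from the question.

```php
<?php
// Illustrative input: one matching anchor and one unrelated anchor.
$file = '<p><a href="https://docs.google.com/file/d/abc123/edit?usp=drive_web"><span>More foo</span></a>'
      . '<a href="https://example.com/">Other</a></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc->loadHTML($file);
libxml_clear_errors();

$xpath  = new DOMXPath($doc);
$prefix = 'https://docs.google.com/file/d/';
$links  = [];

// starts-with() is an XPath 1.0 function, so it works with DOMXPath.
foreach ($xpath->query("//a[starts-with(@href, '$prefix')]") as $a) {
    // Rebuild the inner HTML by serialising each child node.
    $inner = '';
    foreach ($a->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
    $links[] = ['href' => $a->getAttribute('href'), 'inner' => $inner];
}

print_r($links);
```

This gives the multidimensional array the question asks for, and extra attributes on the anchors do not matter because the parser has already separated them.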
Replace (.*) with (.*?) - use lazy quantification:
preg_match('/<a href="https:\/\/docs.google.com\/file\/d\/(.*?)<\/a>/', $file, $matches);
You could use the following regular expression:
/<a.*?href="(https:\/\/docs\.google\.com\/file\/d\/.*?)".*?>(.*?)<\/a>/
Which would give you the URL from the href and the innerHTML.
Break down
<a.*?href=" Matches the opening a tag and any characters up until href="
(https:\/\/docs\.google\.com\/file\/d\/.*?)" Matches (and captures) until the end of the href (i.e. until the closing ")
.*?> Matches all characters to the end of the a tag >
(.*?)<\/a> Matches (and captures) the innerHTML until the closing a tag (i.e. </a>).
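As a rough sketch of how this pattern could feed preg_match_all, here it is with PREG_SET_ORDER so that each match arrives as its own row. The sample $file is invented, and the delimiter is switched to # purely so the slashes need no escaping.

```php
<?php
// Illustrative input with two matching anchors, one carrying an extra attribute.
$file = '<a class="x" href="https://docs.google.com/file/d/abc123/edit?usp=drive_web">First</a> text '
      . '<a href="https://docs.google.com/file/d/def456/edit?usp=drive_web"><b>Second</b></a>';

$pattern = '#<a.*?href="(https://docs\.google\.com/file/d/.*?)".*?>(.*?)</a>#';

// PREG_SET_ORDER groups the captures per match instead of per capture group.
preg_match_all($pattern, $file, $matches, PREG_SET_ORDER);

$links = [];
foreach ($matches as $m) {
    $links[] = ['href' => $m[1], 'inner' => $m[2]];
}

print_r($links);
```

One caveat: because <a.*?href=" is only lazy, not anchored to a single tag, a non-matching anchor placed before a matching one on the same line can get swallowed into the match, which is another reason to prefer the DOM.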
Dave,
The DOM would be better. But here is the Regex that works.
$url = 'href="https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web"';
preg_match ('/href="https:\/\/docs.google.com\/file\/d\/(.*?)"/', $url, $matches);
Results:
array (size=2)
0 => string 'href="https://docs.google.com/file/d/aBunchOfLettersAndNumbers/edit?usp=drive_web"' (length=82)
1 => string 'aBunchOfLettersAndNumbers/edit?usp=drive_web' (length=44)
You can add the HTML tags, but most importantly, the preg_match line in your question didn't contain the closing > of the opening tag, which threw it off, and it needed (.*?) instead of (.*). (.*) is greedy: it matches as many characters as possible, so it runs on to the last place where the rest of the pattern can still match. The added ? makes it lazy, so it stops at the first such place.
I have some HTML code containing the following:
<span rel="url">example.com</span>
<span rel="url">example.net.pl [SOMETHING]</span>
<span rel="url">[SOMETHING]imjustanexample.com</span> [..]
The question is whether there is a way to get the "url" string from between the span tags, e.g. it should get the following: example.com, example.net.pl (without the [SOMETHING] string), and imjustanexample.com.
I guess I will have to use regex for this purpose.
Try this regular expression in javascript,
/((http|https):\/\/(\w+:{0,1}\w*@)?(\S+)|)(:[0-9]+)?(\/|\/([\w#!:.?+=&%@!\-\/]))?/
to validate text from span tag
I would go this way (either in regex or just PHP code, like you prefer):
Locate the next <span rel="url">
Take everything from its end until the next (but not including) space or less-than sign < (whichever of those two comes first).
Repeat until nothing is matched any longer.
Done. If regular expressions are too complicated for you, you can also use string functions: http://php.net/strings .
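The steps above can be sketched with plain string functions roughly like this. extractUrls() is a hypothetical helper name, and stripping a leading "[...]" prefix is an extra assumption added to cover the third example in the question; the steps themselves do not mention it.

```php
<?php
// Hypothetical helper implementing the locate / take / repeat steps.
function extractUrls(string $html): array
{
    $marker = '<span rel="url">';
    $urls   = [];
    $offset = 0;

    // Step 1 and 3: locate the next marker, repeat until none is left.
    while (($start = strpos($html, $marker, $offset)) !== false) {
        $start += strlen($marker);

        // Step 2: take everything up to the next whitespace or "<".
        $length    = strcspn($html, " \t\r\n<", $start);
        $candidate = substr($html, $start, $length);

        // Assumption: a leading "[...]" block without spaces is not part of the URL.
        $candidate = preg_replace('/^\[[^\]]*\]/', '', $candidate);

        if ($candidate !== '') {
            $urls[] = $candidate;
        }
        $offset = $start + $length;
    }

    return $urls;
}

$html = '<span rel="url">example.com</span>'
      . '<span rel="url">example.net.pl [SOMETHING]</span>'
      . '<span rel="url">[SOMETHING]imjustanexample.com</span>';

print_r(extractUrls($html));
```

It relies on the opening tag being written exactly as `<span rel="url">`; if the attribute order or quoting varies, a DOM parser is the safer choice.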
This should work:
$str = '<span rel="url">http://google.ca</span>';
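// The trailing delimiter list is a lookahead, so the URL may be followed directly by "</span>".
$match = preg_match('#<span(.*)?>((http|https|ftp)://(\S*?\.\S*?))(?=\s|\;|\)|\]|\[|\{|\}|,|"|\'|:|\<|$|\.\s)#i', $str, $matches);
if ($match) {
    var_dump($matches);
} else {
    echo 'Nope<br />';
}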
Regex from: https://stackoverflow.com/a/206087/1533203
Check out Simple HTML Dom Parser.
With it you can simply access elements on the DOM tree.
Your problem could be solved with:
$html->find("span[rel=url]");
And then you could simply use a loop on all elements and some regex which fits your needs.
So I'm working with some pretty awesome HTML strings stored in our DB and I need to be able to parse out the string between the "forum-style" youtube tags as in the example below. I have a solution, but it feels a bit hackish. I'm thinking there's probably a more elegant way to handle this problem.
<?php
$video_string = '<p><span style="font-size: 12px;"><span style="font-family: verdana,geneva,sans-serif;">[youtube]KbI_7IHAsyw[/youtube]<br /></span></span></p>';
$matches = array();
preg_match('/\][_A-Za-z0-9]+\[/', $video_string, $matches);
$yt_vid_key = substr($matches[0], 1, strlen($matches[0]) - 2 );
I'd change the regex a bit:
'/\[youtube\](.*?)\[\/youtube\]/is'
Adding the 'youtube' part to not replace ALL bb-codes - only the right ones.
I've also added the '?' to make the regex less greedy (in case there are multiple YT videos in one post).
I added the pattern modifiers i and s, to be able to match case-insensitive and multiline strings.
Edit:
You may also rather want to use preg_replace, it'll be a bit less code that way.
Try this:
preg_match('!\[youtube\]([_A-Za-z0-9]+?)\[/youtube\]!',$subject, $matches);
$yt_vid_key = $matches[1];
If you expect multiple occurrences, use preg_match_all instead.
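A short sketch of the preg_match_all variant, using the same pattern. The $subject string and the second video key are invented for the example.

```php
<?php
// Illustrative input with two [youtube] tags in one post.
$subject = '<p>[youtube]KbI_7IHAsyw[/youtube]<br />[youtube]dQw4w9WgXcQ[/youtube]</p>';

preg_match_all('!\[youtube\]([_A-Za-z0-9]+?)\[/youtube\]!', $subject, $matches);

// $matches[1] holds the first capture group of every match, in order.
$yt_vid_keys = $matches[1];

print_r($yt_vid_keys);
```

Keep in mind that the character class here does not include "-", so a video key containing a hyphen would not match; adding \- to the class would cover that case.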
All of the answers provided here are correct if you don't expect nested tags. If you do, you have to come up with a way to match the tags properly, which can't really be done in regex, so you will have to write some code to handle it.
Here is some pseudocode to help you out:
find opening tag to tag match
openTags = 0
closeTags = 0
position = 0
do{
Move through the string: increase position
if open tag matches: openTags++
if close tag matches: closeTags++, positionOfCloseTag = position
}while(openTags > closeTags);
When the loop ends, positionOfCloseTag holds the position of the close tag that matches the first opening tag.
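The counting idea in the pseudocode could look something like this in PHP. findInner() is a hypothetical helper, it uses a single depth counter instead of two separate counts, and the [q] tags in the usage lines are made up for the demonstration.

```php
<?php
// Hypothetical helper: returns the text between the first $open tag and its
// matching $close tag, counting nested pairs, or null if there is no match.
function findInner(string $text, string $open, string $close): ?string
{
    $first = strpos($text, $open);
    if ($first === false) {
        return null;
    }

    $depth = 0;
    $pos   = $first;
    $len   = strlen($text);

    while ($pos < $len) {
        if (substr($text, $pos, strlen($open)) === $open) {
            $depth++;                       // one more tag waiting to be closed
            $pos += strlen($open);
        } elseif (substr($text, $pos, strlen($close)) === $close) {
            $depth--;
            if ($depth === 0) {             // this close tag pairs with the first open tag
                $innerStart = $first + strlen($open);
                return substr($text, $innerStart, $pos - $innerStart);
            }
            $pos += strlen($close);
        } else {
            $pos++;                         // ordinary character, keep scanning
        }
    }

    return null;                            // ran out of text before the tags balanced
}

var_dump(findInner('[q]a[q]b[/q]c[/q]d', '[q]', '[/q]'));
var_dump(findInner('[youtube]KbI_7IHAsyw[/youtube]', '[youtube]', '[/youtube]'));
```

For tags that never nest, such as [youtube], this gives the same result as the regex answers; the counter only matters once the same tag can appear inside itself.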
I built a site a long time ago and now I want to place the data into a database without copying and pasting the 400+ pages that it has grown to so that I can make the site database driven.
My site has meta tags like this (each page different):
<meta name="clan_name" content="Dark Mage" />
So what I'm doing is using cURL to place the entire HTML page in a variable as a string. I can also do it with fopen etc..., but I don't think it matters.
I need to sift through the string to find 'Dark Mage' and store it in a variable (so I can put it into SQL).
Any ideas on the best way to find Dark Mage to store in a variable? I was trying to use substr and then just subtracting the number of characters from the e in clan_name, but that was a bust.
Just parse the page using the PHP DOM functions, specifically loadHTML(). You can then walk the tree or use xpath to find the nodes you are looking for.
<?php
$doc = new DOMDocument;
$doc->loadHTML($html);
$meta = $doc->getElementsByTagName('meta');
foreach ($meta as $data) {
    // Read attributes from the current element, not from the node list.
    $name = $data->getAttribute('name');
    if ($name == 'clan_name') {
        $content = $data->getAttribute('content');
        // TODO handle content for clan_name
    }
}
?>
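For the XPath option mentioned above, a minimal sketch might look like this. The sample $html is a cut-down stand-in for a real page, and $clanName is just an illustrative variable name.

```php
<?php
// Stand-in for the page fetched with cURL.
$html = '<html><head><meta name="clan_name" content="Dark Mage" /></head><body></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real pages often trigger parser warnings
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// string(...) makes evaluate() return the attribute value directly;
// it yields an empty string when no such meta tag exists.
$clanName = $xpath->evaluate('string(//meta[@name="clan_name"]/@content)');

echo $clanName, "\n";
```

This replaces the loop with a single query, which is handy when the same extraction has to run over several hundred pages.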
EDIT If you want to remove certain tags (such as <script>) before you load your HTML string into memory, try using the strip_tags() function. Something like this will keep only the meta tags:
<?php
$html = strip_tags($html, '<meta>');
?>
Use a regular expression like the following, with PHP's preg_match():
/<meta name="clan_name" content="([^"]+)"/
If you're not familiar with regular expressions, read on.
The forward-slashes at the beginning and end delimit the regular expression. The stuff inside the delimiters is pretty straightforward except toward the end.
The square-brackets delimit a character class, and the caret at the beginning of the character-class is a negation-operator; taken together, then, this character class:
[^"]
means "match any character that is not a double-quote".
The + is a quantifier which requires that the preceding item occur at least once, and matches as many of the preceding item as appear adjacent to the first. So this:
[^"]+
means "match one or more characters that are not double-quotes".
Finally, the parentheses cause the regular-expression engine to store anything between them in a subpattern. So this:
([^"]+)
means "match one or more characters that are not double-quotes and store them as a matched subpattern".
In PHP, preg_match() stores matches in an array that you pass by reference. The text that matched the full pattern is stored in the first element of the array, the first sub-pattern in the second element, and so forth if there are additional sub-patterns.
So, assuming your HTML page is in the variable "$page", the following code:
$matches = array();
$found = preg_match('/<meta name="clan_name" content="([^"]+)"/', $page, $matches);
if ($found) {
$clan_name = $matches[1];
}
Should get you what you want.
Use preg_match. A possible regular expression pattern is /clan_name.+content="([^"]+)"/