Using preg_match_all to get items from HTML

Using preg_match_all to get items from HTML - php

I have a number of items in a table, formatted like this
<td class="product highlighted">
Item Name
</td>
and I am using the following PHP code
$regex_pattern = "/<td class=\"product highlighted\">(.*)<\/td>/";
preg_match_all($regex_pattern,$buffer,$matches);
print_r($matches);
I am not getting any output, yet I can see the items in the html.
Is there something wrong with my regexp?

Apart from your using regex to parse HTML, yes, there is something wrong: The dot doesn't match newlines.
So you need to use
$regex_pattern = "/<td class=\"product highlighted\">(.*?)<\/td>/s";
The /s modifier allows the dot to match any character, including newlines. Note the reluctant quantifier .*? to avoid matching more than one tag at once.

In order to match your example, you will need to add the dot all flag, s, so the . will match newlines.
Try the following.
$regex_pattern = "/<td class=\"product highlighted\">(.*?)<\/td>/s";
Also note that I changed the capture to non-greedy, (.*?). It's best to do so when matching open ended text.
It's worth noting regular expressions are not the right tool for HTML parsing, you should look into DOMDocument. However, for such a simple match you can get away with regular expressions provided your HTML is well-formed.

Related

(PHP) How to find words beginning with a pattern and replace all of them?

I have a string. An example might be "Contact /u/someone on reddit, or visit /r/subreddit or /r/subreddit2"
I want to replace any instance of "/r/x" and "/u/x" with "[/r/x](http://reddit.com/r/x)" and "[/u/x](http://reddit.com/u/x)" basically.
So I'm not sure how to 1) find "/r/" and then expand that to the rest of the word (until there's a space), then 2) take that full "/r/x" and replace with my pattern, and most importantly 3) do this for all "/r/" and "/u/" matches in a single go...
The only way I know to do this would be to write a function to walk the string, character by character, until I found "/", then look for "r" and "/" to follow; then keep going until I found a space. That would give me the beginning and ending characters, so I could do a string replacement; then calculate the new end point, and continue walking the string.
This feels... dumb. I have a feeling there's a relatively simple way to do this, and I just don't know how to google to get all the relevant parts.

A simple preg_replace will do what you want.
Try:
$string = preg_replace('#(/(?:u|r)/[a-zA-Z0-9_-]+)#', '[\1](http://reddit.com\1)', $string);
Here is an example: http://ideone.com/dvz2zB
You should see if you can discover what characters are valid in a Reddit name or in a Reddit username and modify the [a-zA-Z0-9_-] charset accordingly.

You are looking for a regular expression.
A basic pattern starts out as a fixed string. /u/ or /r/ which would match those exactly. This can be simplified to match one or another with /(?:u|r)/ which would match the same as those two patterns. Next you would want to match everything from that point up to a space. You would use a negative character group [^ ] which will match any character that is not a space, and apply a modifier, *, to match as many characters as possible that match that group. /(?:u|r)/[^ ]*
You can take that pattern further and add a lookbehind, (?<= ) to ensure your match is preceded by a space so you're not matching a partial which results in (?<= )/(?:u|r)/[^ ]*. You wrap all of that to make a capturing group ((?<= )/(?:u|r)/[^ ]*). This will capture the contents within the parenthesis to allow for a replacement pattern. You can express your chosen replacement using the \1 reference to the first captured group as [\1](http://reddit.com\1).
In php you would pass the matching pattern, replacement pattern, and subject string to the preg_replace function.

In my opinion regex would be an overkill for such a simple operation. If you just want to replace instance of "/r/x" with "[r/x](http://reddit.com/r/x)" and "/u/x" with "[/u/x](http://reddit.com/u/x)" you should use str_replace although with preg_replace it'll lessen the code.
str_replace("/r/x","[/r/x](http://reddit.com/r/x)","whatever_string");
use regex for intricate search string and replace. you can also use http://www.jslab.dk/tools.regex.php regular expression generator if you have something complex to capture in the string.

PHP regex lookbehind with wildcard

I have two strings in PHP:
$string = '<a href="http://localhost/image1.jpeg" /></a>';
and
$string2 = '[caption id="attachment_5" align="alignnone" width="483"]<a href="http://localhost/image1.jpeg" /></a>[/caption]';
I'm trying to match strings of the first type. That is strings that are not surrounded by '[caption ... ]' and '[/caption]'. So far, I would like to use something like this:
$pattern = '/(?<!\[caption.*\])(?!\[\/caption\])(<a.*><img.*><\/a>)/';
but PHP matches out the first string as well with this pattern even though it is NOT preceeded by '[caption' and zero or more characters followed by ']'. What gives? Why is this and what's the correct pattern?
Thanks.

Variable length look-behind is not supported in PHP, so this part of your pattern is not valid:
(?<!\[caption.*\])
It should be warning you about this.
In addition, .* always matches the larges possible amount. Thus your pattern may result in a match that overlaps multiple tags. Instead, use [^>] (match anything that is not a closing bracket), because closing brackets should not occur inside the img tag.
To solve the look-behind problem, why not just check for the closing tag only? This should be sufficient (assuming the caption tags are only used in a way similar to what you have shown).
$pattern = '|(<a[^>]*><img[^>]*></a>)(?!\[/caption\])|';
When matching patterns that contain /, use another character as the pattern delimiter to avoid leaning toothpick syndrome. You can use nearly any non-alphanumeric character around the pattern.
Update: the previous regex is based on the example regex you gave, rather than the example data. If you want to match links that don't contain images, do this:
$pattern = '|(<a[^>]*>[^<]*</a>)(?!\[/caption\])|';
Note that this doesn't allow any tags in the middle of the link. If you allow tags (such as by using .*?), a regex could match something starting within the [caption] and ending elsewhere.

I don't see how your regexp could match either string, since you're looking for <a.*><img.*><\/a>, and both anchors don't contain an <img... tag. Also, the two subexpressions looking for and prohibiting the caption-bits look oddly positioned to me. Finally, you need to ensure your tag-matching bits don't act greedy, i.e. don't use .* but [^>]*.
Do you mean something like this?
$pattern = '/(<a[^>]*>(<img[^>]*>)?<\/a>)(?!\[\/caption\])/'
Test it on regex101.
Edit: Removed useless lookahead as per dan1111's suggestion and updated regex101 link.

Lookbehind doesn't allow non fixed length pattern i.e. (*,+,?), I think this /<a.*><\/a>(?!\[\/caption\])/ is enough for your requirement

How to regex match text with different endings?

This is what I have at the moment.
<h2>Information</h2>\n +<p>(.*)<br />|</p>
^ that is a tab space, didn't know if there was
a better way to represent one or more (it seems to work)
Im trying to match the 'bla bla.' text, but my current regex doesn't quite work, it will match most of the line, but I want it to match the first
<h2>Information</h2>
<p>bla bla.<br /><br />google<br />
or
<h2>Information</h2>
<p>bla bla.</p> other code...
Oh and my php code:
preg_match('#h2>Information</h2>\n +<p>(.*)<br />|</p>#', $result, $postMessage);

Don't use regex to parse HTML. PHP provides DOMDocument that can be used for this purpose.
Having said that you have some errors in your regular expression:
You need parentheses around the alternation.
You need lazy modifiers.
You can't type 'header' to match 'Information'.
With these changes it would look like this:
<h2>.*?</h2>\n\t+<p>.*?(<br />|</p>)
Your regular expression is also very fragile. For example, if the input contains spaces instead of tabs or the line ending is Windows-style, your regular expression will fail. Using a proper HTML parser will give a much more robust solution.

Use \s to match any whitespace character (including spaces, tabs, new-line feeds, etc.), e.g.
preg_match('#<h2>header</h2>\s*<p>(.*)<br />|</p>#', $result, $postMessage);
But, as already mentioned, do not use regular expressions to parse HTML.

the .* match should be non greedy (match the minimum of arbitrary characters instead of the maxium), that is (.*?) i guess in PHP.

Try making your match non-greedy by using (.*?) in place of (.*)

recursive regular expression to process nested strings enclosed by {| and |}

In a project I have a text with patterns like that:
{| text {| text |} text |}
more text
I want to get the first part with brackets. For this I use preg_match recursively. The following code works fine already:
preg_match('/\{((?>[^\{\}]+)|(?R))*\}/x',$text,$matches);
But if I add the symbol "|", I got an empty result and I don't know why:
preg_match('/\{\|((?>[^\{\}]+)|(?R))*\|\}/x',$text,$matches);
I can't use the first solution because in the text something like { text } can also exist. Can somebody tell me what I do wrong here? Thx

Try this:
'/(?s)\{\|(?:(?:(?!\{\||\|\}).)++|(?R))*\|\}/'
In your original regex you use the character class [^{}] to match anything except a delimiter. That's fine when the delimiters are only one character, but yours are two characters. To not-match a multi-character sequence you need something this:
(?:(?!\{\||\|\}).)++
The dot matches any character (including newlines, thank to the (?s)), but only after the lookahead has determined that it's not part of a {| or |} sequence. I also dropped your atomic group ((?>...)) and replaced it with a possessive quantifier (++) to reduce clutter. But you should definitely use one or the other in that part of the regex to prevent catastrophic backtracking.

You've got a few suggestions for working regular expressions, but if you're wondering why your original regexp failed, read on. The problem lies when it comes time to match a closing "|}" tag. The (?>[^{}]+) (or [^{}]++) sub expression will match the "|", causing the |} sub expression to fail. With no backtracking in the sub expression, there's no way to recover from the failed match.

See PHP - help with my REGEX-based recursive function
To adapt it to your use
preg_match_all('/\{\|(?:^(\{\||\|\})|(?R))*\|\}/', $text, $matches);

Getting a random string within a string

I need to find a random string within a string.
My string looks as follows
{theme}pink{/theme} or {theme}red{/theme}
I need to get the text between the tags, the text may differ after each refresh.
My code looks as follows
$str = '{theme}pink{/theme}';
preg_match('/{theme}*{\/theme}/',$str,$matches);
But no luck with this.

* is only the quantifier, you need to specify what the quantifier is for. You've applied it to }, meaning there can be 0 or more '}' characters. You probably want "any character", represented by a dot.
And maybe you want to capture only the part between the {..} tags with (.*)
$str = '{theme}pink{/theme}';
preg_match('/{theme}(.*){\/theme}/',$str,$matches);
var_dump($matches);

'/{theme}(.*?){\/theme}/' or even more restrictive '/{theme}(\w*){\/theme}/' should do the job

preg_match_all('/{theme}(.*?){\/theme}/', $str, $matches);
You should use ungreedy matching here. $matches[1] will contain the contents of all matched tags as an array.

$matches = array();
$str = '{theme}pink{/theme}';
preg_match('/{([^}]+)}([^{]+){\/([^}]+)}/', $str, $matches);
var_dump($matches);
That will dump out all matches of all "tags" you may be looking for. Try it out and look at $matches and you'll see what I mean. I'm assuming you're trying to build your own rudimentary template language so this code snippet may be useful to you. If you are, I may suggest looking at something like Smarty.
In any case, you need parentheses to capture values in regular expressions. There are three captured values above:
([^}]+)
will capture the value of the opening "tag," which is theme. The [^}]+ means "one or more of any character BUT the } character, which makes this non-greedy by default.
([^{]+)
Will capture the value between the tags. In this case we want to match all characters BUT the { character.
([^}]+)
Will capture the value of the closing tag.

preg_match('/{theme}([^{]*){\/theme}/',$str,$matches);
[^{] matches any character except the opening brace to make the regex non-greedy, which is important, if you have more than one tag per string/line

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Using preg_match_all to get items from HTML - php

Related

(PHP) How to find words beginning with a pattern and replace all of them?

PHP regex lookbehind with wildcard

How to regex match text with different endings?

recursive regular expression to process nested strings enclosed by {| and |}

Getting a random string within a string

Categories

Resources