How to use PHP preg_split with an HTML string - php

I am trying to parse a badly formed html table:
A couple of lines of this are:
Food:</b> Yes<b><br>
Pool: </b>Beach<b></b><b><br>
Centre:</b> Yes<b><br>
After spending a lot of time on this with XPath, I think it is probably better to split the above text into lines using preg_split and parse from there.
The pattern I think would work uses:
<\b><\br>*: <\b>
my code is as follows:
$pattern='</b></br>*:</b>';
$pattern=preg_quote($pattern,'#');
$chars = preg_split($pattern, $output);
print_r($chars);
I am getting the following error:
Delimiter must not be alphanumeric or backslash
What am I doing wrong?

Try this:
$pattern='</b></br>*:</b>';
$pattern=preg_quote($pattern,'#');
$chars = preg_split('#'.$pattern.'#', $output);
print_r($chars);
The preg_quote function only escapes the pattern so it is safe to embed; it doesn't add the delimiters for you.
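To make the delimiter point concrete, here is a minimal sketch. The input string and the literal separator `</b><br>` are assumptions chosen for illustration, not the asker's real data. Note that preg_quote also escapes `<`, so a quoted pattern used on its own starts with a backslash, which is consistent with the error message in the question.

```php
<?php
// Hypothetical separator and input, chosen only for illustration.
$literal = '</b><br>';

// preg_quote escapes regex metacharacters (and the '#' delimiter we pass in),
// but it does not wrap the result in delimiters.
$quoted = preg_quote($literal, '#');

// Add the delimiters ourselves before handing the pattern to preg_split.
$parts = preg_split('#' . $quoted . '#', 'Food:</b><br>Pool:</b><br>Centre:');

print_r($parts);
```

Any non-alphanumeric, non-backslash character can serve as the delimiter; `#` is convenient here because the text being matched contains `/`.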
As other people will surely point out, using regular expressions is not a good way to parse HTML :)
Your regular expression is also not going to match what you hope. Here's a version that will probably work for your input:
$in = " Pool: </b>Beach<b></b><b><br>";
$out = explode(':', strip_tags($in));
$key = trim($out[0]);
$value = trim($out[1]);
echo "$key = $value\n";
This removes all the HTML, then splits on the colon, and then removes any surrounding whitespace.
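The same idea can be extended to the whole fragment. The sketch below assumes the input is held in `$output` (as in the question), that every row ends in a `<br>` tag, and that each row contains exactly one key, a colon, and a value; the `$rows` array is a name introduced here for illustration.

```php
<?php
// Sample input copied from the question; $output is assumed to hold the raw HTML.
$output = "Food:</b> Yes<b><br>\nPool: </b>Beach<b></b><b><br>\nCentre:</b> Yes<b><br>\n";

$rows = [];
// Split on <br>, <br/> or <br />, case-insensitively.
foreach (preg_split('#<br\s*/?>#i', $output) as $line) {
    // Drop the stray <b> and </b> tags, then surrounding whitespace.
    $text = trim(strip_tags($line));
    if ($text === '' || strpos($text, ':') === false) {
        continue; // skip empty trailing pieces and rows without a key
    }
    // Limit of 2 keeps any later colons inside the value.
    [$key, $value] = explode(':', $text, 2);
    $rows[trim($key)] = trim($value);
}

print_r($rows);
```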

Your pattern needs to start and end with a delimiter; looks like you're using # if I'm reading this correctly, so you should have $pattern = '#</b></br>.*:</b>#';.
Also, you're mixing things up; * is not a simple wildcard in regex. If you mean "any number of any characters," the pattern you need is .*. I've included this above.

Related

PHP regex with quotes

I want to match all href values in my page content. I wrote regex for that and tested it on regex101
href[ ]*=[ ]*("|')(.+?)\1
This finds all my href values properly. If I use
href[ ]*=[ ]*(?:"|')(.+?)(?:"|')
it's even better, since I do not have to skip over the quote group later.
With " and ' in regex string I cannot run the regex properly with
$matches = array();
$pattern = "/href[ ]*=[ ]*("|')(.+?)\1/"; // syntax error
$numOfMatches = preg_match_all($pattern, $pattern, $matches);
print_r($matches);
If I "escape" double quote and thus repair the syntax error I get no matches.
So - what is the correct way to apply the given regex in PHP?
Thanks for any help
Notes:
addslashes or preg_quote won't help since I need to pass legit string first
escaping all the special chars \ + * ? [ ^ ] $ ( ) { } = ! < > | : - didn't help either
EDIT: OK, I see I really shouldn't be doing this with regex. Could you please suggest some helpful DOM parsers or any other tools I should use with PHP?
For your case, the following should work:
/<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
Given the nature of the WWW there are always going to be cases where the regular expression breaks down. Small changes to the patterns can fix these.
spaces around the = after href:
/<a\s[^>]*href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
matching only links starting with http:
/<a\s[^>]*href=(\"??)(http[^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
single quotes around the link address:
/<a\s[^>]*href=([\"\']??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
Source
I had to use this regex to make it work. Next time I will definitely try with DOM parser :)
$regexForHREF = "/href[ ]*=[ ]*(?:\"|')(.+?)(?:\"|')/";
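For the DOM-parser route mentioned in the edit, a minimal sketch using PHP's bundled DOM extension could look like the following. The sample markup in `$html` is made up for illustration, and the sketch assumes the DOM extension is enabled.

```php
<?php
// Hypothetical page content with both double- and single-quoted href values.
$html = '<p><a href="https://example.com/a">A</a> <a href=\'/b\'>B</a> <a name="x">no href</a></p>';

$doc = new DOMDocument();
// Real-world HTML is often malformed; collect libxml warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$hrefs = [];
foreach ($doc->getElementsByTagName('a') as $anchor) {
    if ($anchor->hasAttribute('href')) {
        $hrefs[] = $anchor->getAttribute('href');
    }
}

print_r($hrefs);
```

The parser handles quoting, spacing around `=`, and attribute order, which are exactly the cases the regex variants above have to special-case.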

Php preg_split for forwardslash?

I have some text that I wish to parse:
$str = "text1<br/>text2<br/>text3
I've tried using
print_r( preg_split("<br/>", $str));
but it is not giving me the desired output
Try the following:
$str = "text1<br/>text2<br/>text3";
print_r(preg_split("/<br\/>/", $str));
I'm assuming the missing closing quote " at the end of $str = "text1<br/>text2<br/>text3 is just a typo.
Take a look at this page on how to specify the string $pattern parameter: http://php.net/manual/en/function.preg-split.php
It's because you're not using the correct regular expression. Is there a reason you can't use explode()? Regex is problematic, overly complicated at times, and much slower. If you know you'll always be splitting at the BR tag, explode is much more efficient.
Parsing HTML with regex is a bad idea, but here you go:
var_dump(preg_split('/(<br\ ?\/?>)+/', $str));
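As a side-by-side sketch of the two suggestions above: explode() is enough when the separator is always the exact string `<br/>`, while preg_split() with delimiters helps when the tag varies. The `$mixed` string is an assumed example of such variation, not taken from the question.

```php
<?php
// Exact separator: explode() needs no regex and no delimiters.
$str = "text1<br/>text2<br/>text3";
$exact = explode('<br/>', $str);

// Varying separator (hypothetical input): <br>, <br />, <BR/> ...
$mixed = "text1<br>text2<br />text3<BR/>text4";
// '#' is used as the delimiter so the '/' in the tag needs no escaping.
$loose = preg_split('#<br\s*/?>#i', $mixed);

print_r($exact);
print_r($loose);
```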

Make me understand preg_replace

I've been looking all over the internet for some useful information and I think I found too much. I'm trying to understand regular expressions but don't get it.
Let's say, for instance, that $data="A bunch of text [link=123] another bunch of text.", and the [link=123] part should get replaced with "< a href=\"123.html\">123< /a>".
I've been trying around a lot with code similar to this:
$find = "/[link=[([0-9])]/";
$replace = "< a href=\"$1\">$1< /a>";
echo preg_replace ($find, $replace, $data);
but the output is always the same as the original $data.
I think I have to see something relevant to my problem to understand the basics.
Remove the extra [] around the (), and add + after the [0-9] to quantify it. Also, escape the [] that make up the tag itself.
$find = "/\[link=(\d+)\]/"; // "\d" is equivalent to "[0-9]"
$replace = "<a href=\"$1.html\">$1</a>"; // $1 is the captured number
echo preg_replace($find,$replace,$data);
The regex would be \[link=([\d]+)\]
A good source for a quick overview of regular expressions is http://www.regular-expressions.info/
If you are really interested in the power of regular expressions, you should buy the book Mastering Regular Expressions.
A good program for testing your regex on a Windows client is RegEx-Trainer.
You are missing the + quantifier, so your pattern only matches when a single digit follows link=.
There is also an extra pair of [..]; as a result, the outer [...] is treated as a character class.
You also forgot to escape the closing ].
Solution:
$find = "/\[link=([0-9]+)\]/";
<?php
$data= "A bunch of text [link=123] another bunch of text.";
$find = '/\[link=([0-9]+?)\]/';
echo preg_replace($find, '<a href="$1.html">$1</a>', $data);
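Putting the answers together, here is a small annotated sketch of how each part of the pattern and the replacement works. The second `[link=…]` tag in `$data` is added here as an assumption, only to show that preg_replace handles every occurrence.

```php
<?php
// Input based on the question, with a second tag added for illustration.
$data = "A bunch of text [link=123] another bunch of text [link=45].";

// /        delimiter
// \[       literal opening bracket (unescaped, [ would start a character class)
// link=    literal text
// (\d+)    one or more digits, captured as group 1
// \]       literal closing bracket
$find = '/\[link=(\d+)\]/';

// $1 in the replacement refers to capture group 1.
$replace = '<a href="$1.html">$1</a>';

$result = preg_replace($find, $replace, $data);
echo $result, "\n";
```

Single-quoted strings are used so that neither the backslashes nor the double quotes need extra escaping at the PHP level.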

Regex pattern matching literal repeated \n

Given a literal string such as:
Hello\n\n\n\n\n\n\n\n\n\n\n\nWorld
I would like to reduce the repeated \n's to a single \n.
I'm using PHP, and been playing around with a bunch of different regex patterns. So here's a simple example of the code:
$testRegex = '/(\\n){2,}/';
$test = 'Hello\n\n\n\n\n\n\n\n\nWorld';
$test2 = preg_replace($testRegex ,'\n',$test);
echo "<hr/>test regex<hr/>".$test2;
I'm new to PHP, not that new to regex, but it seems '\n' conforms to special rules. I'm still trying to nail those down.
Edit: I've placed the literal code I have in my php file here, if I do str_replace() I can get good things to happen, but that's not a complete solution obviously.
To match a literal \n with regex, your string literal needs four backslashes to produce a string with two backslashes, which the regex engine then interprets as an escape for one backslash.
$testRegex = '/(\\\\n){2,}/';
$test = 'Hello\n\n\n\n\n\n\n\n\n\n\n\nWorld';
$test2 = preg_replace($testRegex, '\n', $test);
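The two layers of escaping can be seen side by side in the sketch below, which contrasts a literal backslash followed by n (single-quoted PHP string) with real newline characters (double-quoted PHP string). The variable names are introduced here for illustration.

```php
<?php
// Case 1: the text contains the two characters backslash and n.
$literal = 'Hello\n\n\n\nWorld';              // single quotes: no escape processing
// PHP string '\\\\n' -> regex \\n -> matches one backslash followed by n.
$collapsedLiteral = preg_replace('/(\\\\n){2,}/', '\n', $literal);

// Case 2: the text contains real newline characters.
$real = "Hello\n\n\n\nWorld";                 // double quotes: \n is a newline
// Regex \n matches a newline; the replacement "\n" is a real newline too.
$collapsedReal = preg_replace('/\n{2,}/', "\n", $real);

var_dump($collapsedLiteral === 'Hello\nWorld'); // bool(true)
var_dump($collapsedReal === "Hello\nWorld");    // bool(true)
```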
Perhaps you need to double up the escape in the regular expression?
$pattern = "/\\n+/";
$awesome_string = preg_replace($pattern, "\n", $string);
Edit: Just read your comment on the accepted answer. Doesn't apply, but is still useful.
If you're intending on expanding this logic to include other forms of white-space too:
$output = preg_replace('%(\s)+%', '$1', $input);
Reduces all repeated white-space characters to single instances of the matched white-space character.
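It does conform to special rules: inside a regex, \n matches an actual newline character. You can add the "multiline" modifier, m, although it only changes how ^ and $ match and is not required here. So your pattern would look like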
$pattern = '/(\n)+/m';
which should provide you with the matches. See the doc for all modifiers and their detailed meaning.
Since you're trying to reduce all newlines to one, the pattern above should work with the rest of your code. Good luck!
Try this regular expression:
/[\n]+/

Replacing HTML attributes using a regex in PHP

OK,I know that I should use a DOM parser, but this is to stub out some code that's a proof of concept for a later feature, so I want to quickly get some functionality on a limited set of test code.
I'm trying to strip the width and height attributes from chunks of HTML, in other words, replace
width="number" height="number"
with a blank string.
The function I'm trying to write looks like this at the moment:
function remove_img_dimensions($string,$iphone) {
$pattern = "width=\"[0-9]*\"";
$string = preg_replace($pattern, "", $string);
$pattern = "height=\"[0-9]*\"";
$string = preg_replace($pattern, "", $string);
return $string;
}
But that doesn't work.
How do I make that work?
PHP is unique among the major languages in that, although regexes are specified in the form of string literals like in Python, Java and C#, you also have to use regex delimiters like in Perl, JavaScript and Ruby.
Be aware, too, that you can use single-quotes instead of double-quotes to reduce the need to escape characters like double-quotes and backslashes. It's a good habit to get into, because the escaping rules for double-quoted strings can be surprising.
Finally, you can combine your two replacements into one by means of a simple alternation:
$pattern = '/(width|height)="[0-9]*"/i';
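A minimal sketch of the asker's function using that combined pattern might look like this. Two choices here are assumptions rather than part of the answer: the unused `$iphone` parameter is dropped, and the pattern also consumes the whitespace before each attribute so no double spaces are left behind.

```php
<?php
// Sketch only: strips width="N" and height="N" attributes with one regex.
function remove_img_dimensions(string $html): string
{
    // \s+ requires leading whitespace, so names like data-width are left alone.
    return preg_replace('/\s+(?:width|height)="[0-9]*"/i', '', $html);
}

// Hypothetical test markup.
$before = '<img src="a.png" width="100" HEIGHT="50" data-width="7">';
$after = remove_img_dimensions($before);
echo $after, "\n";
```

As the question itself acknowledges, this is a stopgap for limited test input; single-quoted or unquoted attribute values would need a DOM parser or a broader pattern.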
Your pattern needs the start/end pattern character. Like this:
$pattern = "/height=\"[0-9]*\"/";
$string = preg_replace($pattern, "", $string);
"/" is the usual character, but most characters would work ("|pattern|","#pattern#",whatever).
I think you're missing the delimiters (which can be //, || or various other pairs of characters) that need to surround a regular expression in the string. Try changing your $pattern assignments to this form:
$pattern = "/width=\"[0-9]*\"/";
...if you want to be able to do a case-insensitive comparison, add an 'i' at the end of the string, thus:
$pattern = "/width=\"[0-9]*\"/i";
Hope this helps!
David
