Parsing html string in php using regular expression [duplicate] - php

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
I want to parse an HTML string using PHP (simple number matching).
<i>1002</i><i>999</i><i>344</i><i>663</i>
and I want the result as an array, e.g. [1002, 999, 344, 663, ...]
I tried this:
<?php
$html = "<i>1002</i><i>999</i><i>344</i><i>663</i>";
if (preg_match_all("/<i>[0-9]*<\/i>/", $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        echo strip_tags($match[0]) . "<br/>";
    }
}
?>
and I got exactly the output I wanted:
1002
999
344
663
But when I run the same code with a small change to the regular expression, I get a different result.
Like this:
<?php
$html = "<i>1002</i><i>999</i><i>344</i><i>663</i>";
if (preg_match_all("/<i>.*<\/i>/", $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        echo strip_tags($match[0]) . "<br/>";
    }
}
?>
Output :
1002999344663
(The regular expression matched the entire string.)
Why am I getting this result?
What is the difference between using .* (zero or more of any character) and [0-9]* ?

The . in your regex matches any character (except a newline, by default), so .* can run straight across the </i><i> boundaries. [0-9]* only matches digits, and </i><i> is not made of digits, so it has to stop at each closing tag. The regex /<i>.*<\/i>/ therefore matches:
<i>1002</i><i>999</i><i>344</i><i>663</i>
^ from here ------------------- to here ^
since the whole string starts with <i> and ends with </i>.
This is because * is greedy: it takes the maximum number of characters it can while still letting the rest of the pattern match.
To fix your problem, use .*? instead. The ? makes the quantifier lazy, so it takes the minimum number of characters it can.
The regex /<i>.*?<\/i>/ will work as you want.
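To make the greedy/lazy difference concrete, here is a minimal sketch that runs both patterns on the string from the question. The capture group and the intval() conversion are additions for illustration (they are not in the original code); they produce the integer array the question asks for without needing strip_tags().

```php
<?php
$html = "<i>1002</i><i>999</i><i>344</i><i>663</i>";

// Greedy: .* grabs as much as possible, so there is a single match
// spanning from the first <i> to the last </i>.
preg_match_all('/<i>.*<\/i>/', $html, $greedy);

// Lazy: .*? stops at the first </i> it can, giving one match per tag.
// The parentheses capture only the text between the tags.
preg_match_all('/<i>(.*?)<\/i>/', $html, $lazy);

// $lazy[1] holds the captured strings; convert them to integers.
$numbers = array_map('intval', $lazy[1]);

echo count($greedy[0]), "\n"; // 1
print_r($numbers);            // 1002, 999, 344, 663
```

If the tags can only ever contain digits, the original [0-9]* pattern is the stricter choice, since it refuses to match anything else.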

Related

preg_match_all not ignoring characters after pattern [duplicate]

This question already has answers here:
Regular Expression Word Boundary and Special Characters
(3 answers)
Regular expression to stop at first match
(9 answers)
Closed 2 years ago.
If my string is
<?php $str = 'hello world!'; function func_hello($str) { echo $str; }
I want to find the name of any functions in the string
I'm using the code
preg_match_all('%func_.* ?\(%', $c, $matches);
This is a basic example of what I'm doing. In the real world I'm getting results like this
func_check_error($ajax_action_check, array(
func_post('folder') == '/' || func_post(
func_check_error($fvar_ajax_action_check, array(
Whereas I want the result to be
func_check_error(
func_post(
func_check_error(
I've tried \b to set a boundary but it's not working. i.e.
preg_match_all('%\bfunc_.* ?\(\b%', $c, $matches);
The .* is greedy, so it runs past the first opening parenthesis (the one right after the function name). The match only ends at the last opening parenthesis on the line (the one from array( in your example), because that is the last place where the \( of your pattern can still match.
Try a more restrictive condition on the function name, such as alphanumeric characters only or anything but a parenthesis. For example, replace func_.* with func_[^(]*, which will stop at the first opening parenthesis.
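A minimal sketch of the negated-character-class suggestion, run against two of the lines from the question. The nowdoc sample in $c is only a stand-in for the asker's real input.

```php
<?php
// Sample input taken from the question (nowdoc, so $ is not interpolated).
$c = <<<'CODE'
func_check_error($ajax_action_check, array(
func_post('folder') == '/' || func_post(
CODE;

// [^(]* cannot cross an opening parenthesis, so each match ends
// at the first "(" after the function name.
preg_match_all('/\bfunc_[^(]*\(/', $c, $matches);

print_r($matches[0]);
// func_check_error(
// func_post(
// func_post(
```

Note that func_post( is reported twice because it really occurs twice on that line; wrap the result in array_unique() if only distinct names are wanted.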
A simple regex with a lazy quantifier should also work:
~func_.*?\(~
If this is still not giving the results you expect, it may be due to another issue with your code and how you're using the regex.

How to get URL value from html with PHP [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
I ran into a problem with my homework: how to get URL values from HTML using PHP. I tested my code against a website, but I need to get only the URLs that match a specific pattern.
example : https: //video.xxxxxxx/
my code :
$regexp = "/<a\s[^>]*href=([\"\']??)([^\\1 >]*?)\\1[^>]*>(.*)<\/a>/siU";
if (preg_match_all($regexp, $data, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        echo $match[0];
    }
}
You can try this:
<a.*?href\s*=\s*([\"\'])(.*?)\1.*?>.*?<\/a>
As seen here
I've never used PHP before, so you might have to use \\1 instead of \1
Explanation:
It's tedious to explain every single element of this, so I'll give you a general idea. First you match the a tag, followed by any number of characters, styles, or different attributes, then followed by href=. Here, we start the capturing group 1, which contains your ' or ". Capturing group 2 contains your website's url without the quotations. Then we use \1 to refer to the type of quotation first used.
If you want the text within the a tag, for whatever reason, wrap the final .*? in parentheses, i.e. >(.*?)<\/a>, and refer to it as group 3.
Do note: you'll need to use $match[2] instead of $match[0] to get the URL.
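As a rough sketch of how this looks in PHP: the sample markup in $data and the example.com URLs are made up for illustration, and the pattern is a slightly tightened variant of the one above (it uses [^>]* so a match cannot spill into another tag, and captures the link text as group 3). Writing the pattern in a single-quoted string means \1 can be used as-is.

```php
<?php
// Hypothetical input; the real $data would be the fetched page.
$data = '<p><a href="https://example.com/a">First</a> and '
      . '<a class="x" href=\'https://example.com/b\'>Second</a></p>';

// Group 1: the quote character, group 2: the URL, group 3: the link text.
$regexp = '/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1[^>]*>(.*?)<\/a>/si';

$urls = [];
if (preg_match_all($regexp, $data, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        $urls[] = $match[2]; // the URL without its quotes
    }
}

print_r($urls); // https://example.com/a, https://example.com/b
```

To keep only URLs with a specific prefix, one option is to filter $urls afterwards (for example with strpos()) rather than making the regex more complicated. For arbitrary real-world HTML, a DOM parser is generally more robust than any regex.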

regex to replace content of second <p>-tag [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 5 years ago.
Output (var $DESC)
<p>erster Absatz</p>
<p>zweiter Absatz</p>
Regex (PHP)
preg_replace("<([a-z][a-z0-9]*)\b[^>]*>(.*?)</\1>{2}", '', $DESC)
I would like to delete only the second p but this regex finds both. Thanks for any help.
Normally I would just tell you to use an HTML parser instead of regex, but since your requirement is so specific, this can actually be accomplished with regex quite safely.
(?<=<\/p>)\s+<p>[\w ]+<\/p>
https://regex101.com/r/Yqaajy/6
Explanation:
(?<=<\/p>) - Make sure the rest of the pattern is preceded by a </p> ending tag (positive lookbehind).
\s+ - One or more whitespace characters, including the line break between the two paragraphs. Note that this will not match if the second <p> follows the first </p> with no whitespace in between.
<p>[\w ]+<\/p> - A paragraph block containing one or more word characters (digits, letters, and underscore) and spaces.
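Applied from PHP, the pattern could be used roughly like this. The limit argument of 1 is an addition, assumed here so that only the paragraph directly after the first one is removed if more paragraphs follow.

```php
<?php
$DESC = "<p>erster Absatz</p>\n<p>zweiter Absatz</p>";

// Remove the whitespace plus the <p>...</p> block that follows a </p>.
// The lookbehind keeps the first paragraph's closing tag in place.
$result = preg_replace('/(?<=<\/p>)\s+<p>[\w ]+<\/p>/', '', $DESC, 1);

echo $result; // <p>erster Absatz</p>
```

Because [\w ]+ only allows word characters and spaces, a second paragraph containing punctuation, umlauts (without the u modifier), or nested tags would not be matched by this pattern.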
Try this:
$DESC ='<p>erster Absatz</p>
<p>zweiter Absatz</p>';
$DESC = preg_replace('#\</p\>[^\<]*\<p[^\>]*\>(.*?)\</p\>#i', '</p>', $DESC);
echo $DESC; // <p>erster Absatz</p>

PHP isolate character surrounded by special character [duplicate]

This question already has answers here:
Extract a single (unsigned) integer from a string
(23 answers)
Closed 6 years ago.
I have the following string:
$db_string = '/var/www/html/1_wlan_probes.db';
I want to isolate/strip the number character so that I only have the following left:
$db_string = '1';
So far I haven't found a simple solution, since the number that needs to be found is random and could be any positive number. I have tried strstr, substr and custom functions, but none produce what I am after, or I'm simply overlooking something really simple.
Thanks in advance
You should use the preg_match() function:
$db_string = '/var/www/html/1_wlan_probes.db';
preg_match('/html\/(\d+)/', $db_string, $matches);
print_r($matches[1]); // 1
html\/(\d+) - captures the run of digits that comes right after html/
It does not matter how long the number is: \d+ matches one or more consecutive digits.
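If the directory is not always html/, a variant that does not depend on the path may be useful. This is only a sketch under the assumption (not stated in the question) that the number is always the leading part of the file name and is followed by an underscore.

```php
<?php
$db_string = '/var/www/html/1_wlan_probes.db';

// basename() drops the directory part: "1_wlan_probes.db".
$file = basename($db_string);

// Assumption: the file name starts with the number, followed by "_".
$id = null;
if (preg_match('/^(\d+)_/', $file, $m)) {
    $id = $m[1];
}

var_dump($id); // string(1) "1"
```

Cast with (int) if an integer is needed rather than the string '1'.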

PHP preg_match (.*) not matching past line breaks [duplicate]

This question already has answers here:
How do I match any character across multiple lines in a regular expression?
(26 answers)
Closed 2 years ago.
I have this data in a LONGTEXT column (so the line breaks are retained):
Paragraph one
Paragraph two
Paragraph three
Paragraph four
I'm trying to match paragraph 1 through 3. I'm using this code:
preg_match('/Para(.*)three/', $row['file'], $m);
This returns nothing. If I try to work just within the first line of the paragraph, by matching:
preg_match('/Para(.*)one/', $row['file'], $m);
Then the code works and I get the proper string returned. What am I doing wrong here?
Use the s modifier.
preg_match('/Para(.*)three/s', $row['file'], $m);
See Pattern Modifiers in the PHP manual.
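Note that the multi-line modifier (m) does not help here. It only changes where ^ and $ match; the dot still stops at line breaks.
E.g. this still returns nothing:
preg_match('/Para(.*)three/m', $row['file'], $m);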
Try setting the regex to dot-all (PCRE_DOTALL), so that the dot also matches line breaks (this is the s modifier at the end of the pattern):
preg_match('/Para(.*)three/s', $row['file'], $m);
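A small self-contained check of the modifiers, using a hard-coded $text in place of $row['file'] (an assumption: the column is taken to hold the four lines separated by "\n"). It also shows that the m modifier, which only changes how ^ and $ anchor, does not let the dot cross line breaks.

```php
<?php
$text = "Paragraph one\nParagraph two\nParagraph three\nParagraph four";

// No modifier: the dot stops at the first line break, so no match.
$plain = preg_match('/Para(.*)three/', $text, $m1);

// m (multi-line): only affects ^ and $, so still no match.
$multi = preg_match('/Para(.*)three/m', $text, $m2);

// s (dot-all): the dot also matches "\n", so the pattern spans lines.
$dotall = preg_match('/Para(.*)three/s', $text, $m3);

var_dump($plain, $multi, $dotall); // int(0) int(0) int(1)
echo $m3[0]; // paragraphs one to three, line breaks included
```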
If you don't like the / delimiters at the start and end, use T-Regx:
$m = Pattern::of('Para(.*)three')->match($row['file'])->first();
