This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
I ran into a problem with my homework: how to extract URL values from HTML using PHP. I tested my code against a website, but I only need the URLs that match a specific pattern.
Example: https://video.xxxxxxx/
My code:
$regexp = "/<a\s[^>]*href=([\"\']??)([^\\1 >]*?)\\1[^>]*>(.*)<\/a>/siU";
if(preg_match_all("$regexp", $data, $matches, PREG_SET_ORDER)) {
    foreach($matches as $match) {
        echo $match[0];
    }
}
You can try this:
<a.*?href\s*=\s*([\"\'])(.*?)\1.*?>.*?<\/a>
As seen here
I've never used PHP before, so you might have to use \\1 instead of \1
Explanation:
It's tedious to explain every single element of this, so I'll give you a general idea. First you match the a tag, followed by any number of characters, styles, or different attributes, then followed by href=. Here, we start the capturing group 1, which contains your ' or ". Capturing group 2 contains your website's url without the quotations. Then we use \1 to refer to the type of quotation first used.
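If you want the text within the a tag, for whatever reason, wrap the final .*? in parentheses, as in >(.*?)<\/a>. It then becomes capturing group 3, which you can refer to using \3.
Do note: to get the URL you'll need to use $match[2] instead of $match[0].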
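A minimal sketch of how the pattern above might be used from PHP. It assumes the page HTML is already in a string; the sample markup, the example.com hosts, and the "https://video." prefix filter are hypothetical stand-ins. The pattern is the one from the answer with a third group added around the link text, written in a single-quoted string so that \1 reaches the regex engine unchanged.

```php
<?php
// Hypothetical sample markup; in practice $data would hold the fetched page.
$data = '<p><a class="x" href="https://video.example.com/a">First</a> '
      . "<a href='https://other.example.org/b'>Second</a> "
      . '<a href="https://video.example.com/c" id="z">Third</a></p>';

// Group 1 = quote character, group 2 = URL, group 3 = link text.
$regexp = '/<a.*?href\s*=\s*(["\'])(.*?)\1.*?>(.*?)<\/a>/si';

$urls = [];
if (preg_match_all($regexp, $data, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $match) {
        // Keep only URLs starting with the wanted prefix (assumed filter).
        if (strpos($match[2], 'https://video.') === 0) {
            $urls[] = $match[2];
        }
    }
}

print_r($urls);
```

For anything beyond a homework exercise, a DOM parser is generally the more robust way to collect href values.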
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 2 years ago.
I'm trying to remove a malware script from my database.
It was injected into many rows of my table.
The script starts with a <script> tag and ends with a </script> tag.
I'm using the following code to find and replace it:
$content = $post->post_content;
$new_content = preg_replace('/(<script>.+?)+(<\/script>)/i', '', $content);
I've tested it on regex101.com and it works fine there, but in my code it doesn't work. Does anyone know what's wrong?
Here is my go-to regex for <script>...</script> tags with their contents:
(\<script\>)([\s\S]*?)(<\/script>)
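Your pattern doesn't capture everything that can appear between the tags: by default . does not match newlines, so a script that spans several lines is never matched. (Escaping the angle brackets, as I do here, is harmless but not strictly required.)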
Here is an explanation of the content capturing group:
\s matches any whitespace character
\S matches any non-whitespace character
*? matches between zero and unlimited times, as few times as possible, expanding as needed
As I stated before, you really shouldn't do this. You should use a PHP DOM parser instead.
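As a rough sketch, the pattern above can be dropped into preg_replace like this. The sample content is hypothetical, and the [^>]* part, which tolerates attributes on the opening tag, is an assumption that goes beyond the original pattern.

```php
<?php
// Hypothetical post content with an injected multi-line script.
$content = "<p>Hello</p><script>\nvar x = 1;\nalert(x);\n</script><p>World</p>";

// [\s\S]*? matches any character, newlines included, as few times as possible.
// <script\b[^>]*> also accepts attributes on the opening tag (assumption).
$new_content = preg_replace('/<script\b[^>]*>[\s\S]*?<\/script>/i', '', $content);

echo $new_content; // <p>Hello</p><p>World</p>
```

Adding the s modifier to a pattern that uses . would have a similar effect, since it lets . match newlines too.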
This question already has answers here:
What does the $1$2$4 mean in this preg_replace?
(3 answers)
Closed 4 years ago.
I want to loop through an array converting specific key/value pairs that contain markup to HTML.
So an example value for $comment['comment_text'] would be:
This has *bolded* text
And should become:
This has <strong>bolded</strong> text
Here's what I've tried:
$pattern = "/\*\b.*?\b\*/i";
$newComment = preg_replace($pattern, "<strong>$&</strong>",
$comment['comment_text']);
And what I get:
This has $& text
I realize I'm mashing up JavaScript with PHP, but reading about back references in PHP hasn't made things any clearer.
My strings may have multiple bolded (in markup) instances...
Any help appreciated.
UPDATE:
Apologies - I didn't realize that Stack Overflow was converting asterisks to italics. I converted the example to code.
Also, my confusion came down to the use of $0 vs. $1. Which I still don't fully understand. I thought the numbers referred to the matches in the string...so if you had 5 instances you could refer to them by $0 through $4.
If you use $0 you get:
This has <strong>*bolded*</strong> text
But if you use $1 you get the desired result.
Do this.
$pattern = "/\*\b(.*?)\b\*/";
$newComment = preg_replace($pattern, "<strong>$1</strong>", $comment['comment_text']);
Here $1 refers to the text captured by group 1, the part of the pattern in parentheses. I'm assuming you want the text between the asterisks to be bold.
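To illustrate the $0 versus $1 confusion from the question: the numbers index the capture groups inside a single match, not successive matches in the string, and preg_replace applies the replacement to every match it finds. A small sketch with a made-up input string:

```php
<?php
$text = 'This has *bolded* text and *more* of it';
$pattern = '/\*\b(.*?)\b\*/';

// $0 is the entire match, asterisks included.
$whole = preg_replace($pattern, '<strong>$0</strong>', $text);

// $1 is only what the first pair of parentheses captured.
$group = preg_replace($pattern, '<strong>$1</strong>', $text);

echo $whole, "\n"; // This has <strong>*bolded*</strong> text and <strong>*more*</strong> of it
echo $group, "\n"; // This has <strong>bolded</strong> text and <strong>more</strong> of it
```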
This question already has answers here:
My regex is matching too much. How do I make it stop? [duplicate]
(5 answers)
Closed 5 years ago.
I am using the following regex
'/\#(.*)\((.*)\)/'
I am trying to get ONE and TWO out of an expression such as #ONE(TWO). This works as long as the expression occurs only once before the end of the line (I think).
I am quite green with regex and I really cannot understand what I am doing wrong.
What I need is to be able to get all the ONE/TWO pairs. Can you please help me?
I am working with PHP and the following function
$parsed_string = preg_replace_callback(
    // Placeholder for not previously created article
    // Pattern example: #George Ioannidis(person)
    '/\#(.*)\((.*)\)/',
    function ($matches) {
        return $this->parsePlaceholders( $matches );
    },
    $string
);
The results I am getting from https://regexr.com
The * quantifier is greedy by default. For example, the regexp (.*)a captures bdeabde when applied to the string bdeabdea. Add ? after the quantifier to make it non-greedy. In your case, try /\#(.*?)\((.*?)\)/.
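A short sketch comparing the two patterns on a line that contains two placeholders. The input string is invented for illustration, reusing the #George Ioannidis(person) example from the question.

```php
<?php
// Hypothetical input with two placeholders on the same line.
$string = 'See #George Ioannidis(person) and #Athens(place).';

// Greedy: the first group swallows everything up to the last "(".
preg_match_all('/\#(.*)\((.*)\)/', $string, $greedy, PREG_SET_ORDER);

// Non-greedy: each group stops at the first delimiter that allows a match.
preg_match_all('/\#(.*?)\((.*?)\)/', $string, $lazy, PREG_SET_ORDER);

$pairs = [];
foreach ($lazy as $m) {
    $pairs[] = [$m[1], $m[2]];
}

echo count($greedy), "\n"; // 1
print_r($pairs);           // [George Ioannidis, person], [Athens, place]
```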
This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 3 years ago.
I want to parse a html string using php (Simple number matching).
<i>1002</i><i>999</i><i>344</i><i>663</i>
and I want the result as an array, e.g. [1002,999,344,663,...]
I tried like this :
<?php
$html="<i>1002</i><i>999</i><i>344</i><i>663</i>";
if(preg_match_all("/<i>[0-9]*<\/i>/",$html, $matches,PREG_SET_ORDER))
foreach($matches as $match) {
echo strip_tags($match[0])."<br/>";
}
?>
and I got the exact output which I want.
1002
999
344
663
But when I try the same code by making a small change in regular expression I'm getting different answer.
Like this:
<?php
$html="<i>1002</i><i>999</i><i>344</i><i>663</i>";
if(preg_match_all("/<i>.*<\/i>/",$html, $matches,PREG_SET_ORDER))
foreach($matches as $match) {
echo strip_tags($match[0])."<br/>";
}
?>
Output :
1002999344663
(The regular expression matched the entire string.)
Now I want to know why I get this result.
What is the difference between using .* (zero or more of any character) and [0-9]*?
The .* in your regex matches any character, whereas [0-9]* only matches digits, and </i><i> contains no digits. The regex /<i>.*<\/i>/ matches:
<i>1002</i><i>999</i><i>344</i><i>663</i>
^ from here ------------------- to here ^
since the whole string begins with <i> and ends with </i>.
This is because * is greedy: it takes the maximum number of characters it can match.
To fix your problem, use .*? instead. This makes it take the minimum number of characters it can match.
The regex /<i>.*?<\/i>/ will work as you want.
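The difference can be checked directly, and a capturing group removes the need for strip_tags. This is only a sketch; the intval conversion is an assumption about wanting integers rather than strings.

```php
<?php
$html = "<i>1002</i><i>999</i><i>344</i><i>663</i>";

// Greedy: a single match spanning the whole string.
preg_match_all('/<i>.*<\/i>/', $html, $greedy);

// Lazy: one match per <i>...</i> pair; group 1 holds the contents.
preg_match_all('/<i>(.*?)<\/i>/', $html, $lazy);

// Convert the captured strings to integers (assumed to be desired).
$numbers = array_map('intval', $lazy[1]);

echo count($greedy[0]), "\n"; // 1
print_r($numbers);            // 1002, 999, 344, 663
```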
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Getting DIV content with Regular Expression
Let me first tell you that DOM is not an option on this one.
I simply have the html :
className">Name</div>......</div>....</div>
Now, I have created a regular expression like this:
$match_count = preg_match_all('/className\">(.*)\<\/div\>/', $page, $matches);
This seems fine to me, but for some reason it captures more data than expected: the match ends a few closing divs too late. How can I restrict it so that it only captures the data up to the first closing div?
$match_count = preg_match_all('/className">(.*?)<\/div>/', $page, $matches);
Use the non-greedy quantifier .*?
Use preg_match instead. It will stop searching after the first matched pattern.
This works:
$match_count = preg_match_all('/className\">(.*)\<\/div\>/U', $page, $matches);
The U pattern modifier will make sure it finds the smallest possible match, not the biggest.
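A small sketch comparing the greedy pattern, the .*? version, and the U modifier on a made-up fragment (the markup below is hypothetical, not the asker's page). Note that U inverts the greediness of every quantifier in the pattern, so combining U with .*? would make that quantifier greedy again.

```php
<?php
// Hypothetical fragment with nested closing divs.
$page = '<div class="className">Name</div><div>other</div></div>';

// Greedy: runs to the last </div> in the input.
preg_match_all('/className">(.*)<\/div>/', $page, $greedy);

// Non-greedy quantifier: stops at the first </div>.
preg_match_all('/className">(.*?)<\/div>/', $page, $withLazy);

// U modifier: quantifiers are non-greedy by default, same result here.
preg_match_all('/className">(.*)<\/div>/U', $page, $withU);

print_r($greedy[1]);   // Name</div><div>other</div>
print_r($withLazy[1]); // Name
print_r($withU[1]);    // Name
```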