Matching string without specific pattern between specific places - php

$example_string = '<a class="190"><br>hello.. 8/10<br><a class="154"><br>9/10<br>';
What I need to match is the classes and the "rating" part (8/10).
Something like this, except I don't know how to write (ANYTHING EXCEPT <br> here) in a regexp:
preg_match_all('#class="([0-9]{3})"><br>(ANYTHING EXCEPT <br> here)*?([0-9]/10)#',
    $example_string, $matches);
So a preg_match_all should give these results:
$matches[1][1] = '190';
$matches[1][2] = '8/10';
$matches[2][1] = '154';
$matches[2][2] = '9/10';

To work off of your pattern and answer your question:
class="([0-9]{3})"><br>(?:(?!<br>).)*?([0-9]\/10)
Demo
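As a rough sketch of how that pattern could be used from PHP (the variable names and the PREG_SET_ORDER flag are my own choices, not part of the answer above):

```php
<?php
// Sample input from the question.
$example_string = '<a class="190"><br>hello.. 8/10<br><a class="154"><br>9/10<br>';

// (?:(?!<br>).)*? consumes one character at a time, but only while the
// upcoming characters are not "<br>", so a match cannot run past a <br>.
$pattern = '#class="([0-9]{3})"><br>(?:(?!<br>).)*?([0-9]/10)#';

// PREG_SET_ORDER groups the captures per match instead of per group.
preg_match_all($pattern, $example_string, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the class, $m[2] is the rating
    echo $m[1], ' => ', $m[2], "\n";
}
```

With PREG_SET_ORDER each element of $matches holds one class together with its rating, which is close to the layout the question asks for.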

I don't know PHP, but it should work the same way as it does in Python.
Get the matches between the "classes", then iterate over the returned strings to pull out your data:
import re  # the regex module
example_string = '"<a class="190"><br>hello.. 8/10<br><a class="154"><br>9/10<br>"'
for match in re.findall(r'(?:class[^\d]")([^\/]+)(?!class)', example_string):
    print(list(re.findall(r'(\d+)', match)))
This yields the following lists:
['190', '8']
['154', '9']

A simple DOM parser would be able to give you that information:
$example_string = '<a class="190"><br>hello.. 8/10<br><a class="154"><br>9/10<br>';
$dom = new DOMDocument;
$dom->loadHTML($example_string);
$xpath = new DOMXPath($dom);
// get all text nodes that have an anchor parent with a class attribute
$query = '//text()[parent::a[@class]]';
foreach ($xpath->query($query) as $node) {
    echo $node->textContent, "\n";
    echo "parent node: ", $node->parentNode->getAttribute('class'), "\n";
}
Output
hello.. 8/10
parent node: 190
9/10
parent node: 154

(?<=class=")(\d+)|(\d+\/\d+)
Try this. See demo.
https://regex101.com/r/yR3mM3/58
$re = "/(?<=class=\")(\\d+)|(\\d+\\/\\d+)/";
$str = "<a class=\"190\"><br>hello.. 8/10<br><a class=\"154\"><br>9/10<br>";
preg_match_all($re, $str, $matches);
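Because this pattern is an alternation, each match fills only one of the two groups and the other is left as an empty string. A possible way to collect the values, assuming every class is followed by exactly one rating (the $notEmpty, $classes and $ratings names are just for illustration):

```php
<?php
$re  = '/(?<=class=")(\d+)|(\d+\/\d+)/';
$str = '<a class="190"><br>hello.. 8/10<br><a class="154"><br>9/10<br>';
preg_match_all($re, $str, $matches);

// Drop the empty placeholders left by the branch that did not match.
$notEmpty = function (string $s): bool {
    return $s !== '';
};
$classes = array_values(array_filter($matches[1], $notEmpty));
$ratings = array_values(array_filter($matches[2], $notEmpty));

// Pair them up by position: class => rating.
print_r(array_combine($classes, $ratings));
```

The pairing is purely positional, so it breaks if a class has no rating or has more than one.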

replace all occurrences of a string

I want to add a class to all p tags that contain arabic text in it. For example:
<p>لمبارة وذ</p>
<p>do nothing</p>
<p>خمس دقائق يخ</p>
<p>مراعاة إبقاء 3 لاعبين</p>
should become
<p class="foo">لمبارة وذ</p>
<p>do nothing</p>
<p class="foo">خمس دقائق يخ</p>
<p class="foo">مراعاة إبقاء 3 لاعبين</p>
I am trying to use PHP preg_replace function to match the pattern (arabic) with following expression:
preg_replace("~(\p{Arabic})~u", "<p class=\"foo\">$1", $string, 1);
However it is not working properly. It has two problems:
It only matches the first paragraph.
Adds an empty <p>.
Sandbox Link
It only matches the first paragraph.
This is because you added the last argument, indicating you want only to replace the first occurrence. Leave that argument out.
Adds an empty <p>.
This is in fact the original <p> which you did not match. Just add it to the matching pattern, but keep it outside of the matching group, so it will be left out when you replace with $1.
Here is a corrected version, also on sandbox:
$text = preg_replace("~<p>(\p{Arabic}+)~u", "<p class=\"foo\">$1", $string);
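To see the effect of the limit argument in isolation, here is a small check using the paragraphs from the question (a sketch only; the $first and $all names are mine):

```php
<?php
$string = "<p>لمبارة وذ</p>\n<p>do nothing</p>\n<p>خمس دقائق يخ</p>\n<p>مراعاة إبقاء 3 لاعبين</p>";

$pattern     = '~<p>(\p{Arabic}+)~u';
$replacement = '<p class="foo">$1';

// With a limit of 1 only the first matching paragraph is rewritten.
$first = preg_replace($pattern, $replacement, $string, 1);

// Without a limit every paragraph that starts with Arabic letters is rewritten.
$all = preg_replace($pattern, $replacement, $string);

echo substr_count($first, 'class="foo"'), "\n";
echo substr_count($all, 'class="foo"'), "\n";
```

Note that the pattern only tags paragraphs whose first character after <p> is an Arabic letter.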
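Your first problem is that you weren't telling it to match the <p>, so it didn't.
Your second problem is that spaces aren't Arabic, so a run of Arabic letters stops at the first space. You don't need to match the whole paragraph, though: matching the <p> together with the first Arabic letter (allowing leading whitespace) is enough to tag it:
$text = preg_replace("~<p>(\s*\p{Arabic})~u", "<p class=\"foo\">$1", $string);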
Using DOMDocument and DOMXPath:
$html = <<<'EOD'
<p>لمبارة وذ</p>
<p>خمس دقائق يخ</p>
<p>مراعاة إبقاء 3 لاعبين</p>
EOD;
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML('<div>'.$html.'</div>', LIBXML_HTML_NOIMPLIED);
$xpath = new DOMXPath($dom);
// here you register the php namespace and the preg_match function
// to be able to use it in the XPath query
$xpath->registerNamespace("php", "http://php.net/xpath");
$xpath->registerPhpFunctions('preg_match');
// select only p nodes with at least one arabic letter
$pNodes = $xpath->query("//p[php:functionString('preg_match', '~\p{Arabic}~u', .) > 0]");
foreach ($pNodes as $pNode) {
    $pNode->setAttribute('class', 'foo');
}
$result = '';
foreach ($dom->documentElement->childNodes as $childNode) {
    $result .= $dom->saveHTML($childNode);
}
echo $result;

preg_match all paragraphs in a string

The following string contains multiple <p> tags. I want to match the contents of each of the <p> with a pattern, and if it matches, I want to add a css class to that specific paragraph.
For example, in the following string only the second paragraph's content matches, so I want to add a class to that paragraph only.
$string = '<p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p>';
With the following code, I can match all of the string, but I am unable to figure out how to find the specific paragraph.
$rtl_chars_pattern = '/[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]/u';
$return = preg_match($rtl_chars_pattern, $string);
Match the <p> tag only when it is followed by an RTL character, using a lookahead so that the character itself is not consumed
Use preg_replace
https://regex101.com/r/nE5pT1/1
$str = "<p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p>";
$result = preg_replace("/<p>(?=[\\x{0590}-\\x{05ff}\\x{0600}-\\x{06ff}])/u", "<p class=\"foo\">", $str);
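If the RTL character can appear anywhere inside the paragraph rather than right after the tag, one option is to test each paragraph's content separately with preg_replace_callback. This is only a sketch: the add_rtl_class helper is hypothetical, and it assumes plain <p>...</p> pairs without attributes or nesting.

```php
<?php
// Hypothetical helper: adds $class to every <p> whose content contains
// at least one Hebrew or Arabic character.
function add_rtl_class(string $html, string $class = 'foo'): string
{
    $rtl = '/[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]/u';

    return preg_replace_callback(
        '~<p>(.*?)</p>~su',
        function (array $m) use ($rtl, $class): string {
            // $m[0] is the whole paragraph, $m[1] is its content.
            return preg_match($rtl, $m[1]) === 1
                ? '<p class="' . $class . '">' . $m[1] . '</p>'
                : $m[0];
        },
        $html
    );
}

$string = '<p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p>';
echo add_rtl_class($string), "\n";
```

For real-world HTML the DOM-based approach is more robust, since this pattern does not handle attributes on the <p> tag.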
Use a combination of SimpleXML, XPath and regular expressions (regex on text(), etc. are only supported as of XPath 2.0).
The steps:
Load the DOM first
Get all p tags via an xpath query
If the text / node value matches your regex, apply a css class
This is the actual code:
<?php
$html = "<html><p>para 1</p><p>نص عربي أو فارسي</p><p>para3</p></html>";
$xml = simplexml_load_string($html);
# query the dom for all p tags
$ptags = $xml->xpath("//p");
# your regex
$regex = '~[\x{0590}-\x{05ff}\x{0600}-\x{06ff}]~u';
# alternatively:
# $regex = '~\p{Arabic}~u';
# loop over the tags, if the regex matches, add another attribute
foreach ($ptags as $p) {
    if (preg_match($regex, (string) $p)) {
        $p->addAttribute('class', 'some cool css class');
    }
}
# just to be sure the tags have been altered
echo $xml->asXML();
?>
See a demo on ideone.com. The code has the advantage that you only analyze the content of the p tag, not the DOM structure in general.

What's wrong with my PHP regex?

I'm trying to pull a specific link from a feed where all of the content is on one line and there are multiple links present. The one I want has the content "[link]" in the A tag. Here's my example:
test1 test2 [link] test3test4
... could be more links before and/or after
How do I isolate just the href with the content "[link]"?
This regex goes to the correct end of the block I want, but starts at the first link:
(?<=href\=\").*?(?=\[link\])
Any help would be greatly appreciated! Thanks.
Try this updated regex:
(?<=href\=\")[^<]*?(?=\">\[link\])
See demo.
The problem is that the dot matches too many characters and in order to get the right 'href' you need to just restrict the regex to [^<]*?.
Alternatively :)
This code :
$string = 'test1 test2 [link] test3test4';
$regex = '/href="([^"]*)">\[link\]/i';
$result = preg_match($regex, $string, $matches);
var_dump($matches);
Will return :
array(2) {
[0] =>
string(41) "href="http://www.amazingpage.com/">[link]"
[1] =>
string(27) "http://www.amazingpage.com/"
}
You can avoid using regular expression and use DOM to do this.
$doc = new DOMDocument;
$doc->loadHTML('
test1
test2
[link]
test3
test4
');
foreach ($doc->getElementsByTagName('a') as $link) {
    if ($link->nodeValue == '[link]') {
        echo $link->getAttribute('href');
    }
}
With DOMDocument and XPath:
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//a[. = "[link]"]/@href') as $node) {
    echo $node->nodeValue;
}
or if you are looking for only one result:
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
$xpath = new DOMXPath($dom);
$nodeList = $xpath->query('//a[. = "[link]"][1]/@href');
if ($nodeList->length)
    echo $nodeList->item(0)->nodeValue;
xpath query details:
//a             # 'a' tag everywhere in the DOM tree
[. = "[link]"]  # (condition) which has "[link]" as value
/@href          # "href" attribute
The reason your regex pattern doesn't work:
The regex engine walks from left to right and for each position in the string it tries to succeed. So, even if you use a non-greedy quantifier, you obtain always the leftmost result.
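A small demonstration of that leftmost-match behaviour. The markup and the example.com URLs are made up for illustration, since the links in the question's sample are not shown:

```php
<?php
// Hypothetical feed line with two links; only the second one is "[link]".
$html = '<a href="http://example.com/one">test1</a> '
      . '<a href="http://example.com/two">[link]</a>';

// The lazy quantifier still starts at the FIRST href=" it can match from,
// then expands until "[link]" follows.
preg_match('~(?<=href=").*?(?=\[link\])~', $html, $lazy);
echo $lazy[0], "\n";

// Forbidding quotes inside the value keeps the match within one attribute.
preg_match('~href="([^"]*)">\[link\]~', $html, $strict);
echo $strict[1], "\n";
```

The first match begins at the first link's URL and runs all the way up to "[link]", while the second pattern captures only the URL of the wanted link.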

I have to find title but print value how?

My code is given below:
$text = "<div class='title'>Title</div><div class='content'>This is title</div>";
$words = array('Title');
$words = join("|", $words);
$matches = array();
if (preg_match('/' . $words . '/i', $text, $matches)) {
    echo "Words matched: <br/>";
    print_r($matches);
} else {
    echo "Not match";
}
The problem is that the code above finds the title, but I don't want to print the title; I want to print "This is title", and I don't understand how to get it by finding the title.
The title acts as a keyword that will not change, but the value I want to print is dynamic and changes every time, so I cannot match on the value itself. How can I do it?
Don't use regex for parsing HTML. Use a DOM Parser instead. In this case, you can use an XPath expression to get the element by class name:
$text = "<div class='title'>Title</div>
<div class='content'>This is title</div>";
$dom = new DOMDocument;
$dom->loadHTML($text);
$xpath = new DOMXPath($dom);
$title = $xpath->query('//*[@class="content"]')->item(0)->nodeValue;
echo $title;
Output:
This is title
This should get you started. If the title is in a different position, you can modify the expression accordingly to retrieve it.
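For instance, if the content should be looked up relative to the title element rather than by its own class alone, an XPath with following-sibling could be used. This is a sketch; the content_after_title function name is made up, and it assumes the content div is a later sibling of the title div:

```php
<?php
// Hypothetical helper: returns the text of the first "content" div that
// follows a "title" div, or null when there is none.
function content_after_title(string $html): ?string
{
    libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $xpath = new DOMXPath($dom);

    $nodes = $xpath->query(
        '//div[@class="title"]/following-sibling::div[@class="content"][1]'
    );

    return $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
}

$text = "<div class='title'>Title</div><div class='content'>This is title</div>";
echo content_after_title($text), "\n";
```

Returning null for the no-match case avoids calling ->nodeValue on a missing node.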

php preg_match_all() how to get correct values in match-array

The following situation:
$text = "This is some <span class='classname'>example</span> text i'm writing to
demonstrate the <span class='classname otherclass'>problem</span> of this.<br />";
preg_match_all("|<[^>/]*(classname)(.+)>(.*)</[^>]+>|U", $text, $matches, PREG_PATTERN_ORDER);
I need an array ($matches) where one field holds "<span class='classname'>example</span>" and another holds "example".
But what I get here is one field with "<span class='classname'>example</span>" and one with "classname".
It should also contain the values for the other matches, of course.
How can I get the right values?
You would be better off with a DOM parser, however this question is more to do with how capturing works in Regexes in general.
The reason you are getting classname as a match is because you are capturing it by putting () around it. They are completely unnecessary so you can just remove them. Similarly, you don't need them around .+ since you don't want to capture that.
If you had some group that you had to enclose in () as grouping rather than capturing, start the group with ?: and it won't be captured.
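To illustrate the difference, here is one possible rewrite of the pattern (a sketch, not the only way to write it). The tag-name alternation is wrapped in (?:...) so it groups without capturing, which leaves the element content as the only captured group:

```php
<?php
$text = "This is some <span class='classname'>example</span> text i'm writing to
demonstrate the <span class='classname otherclass'>problem</span> of this.<br />";

// (?:span|div) groups the alternatives but is not captured,
// so group 1 is the content between the tags.
$pattern = '~<(?:span|div)[^>]*\bclassname\b[^>]*>(.*?)</(?:span|div)>~s';

preg_match_all($pattern, $text, $matches, PREG_PATTERN_ORDER);

print_r($matches[0]); // the full elements
print_r($matches[1]); // only their contents
```

Allowing div here is purely to show the non-capturing group; for the sample text a plain span would do.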
The safe/easy way:
$text = 'blah blah blah';
$dom = new DOMDocument();
$dom->loadHTML($text);
$xp = new DOMXPath($dom);
$nodes = $xp->query("//span[@class='classname']");
foreach ($nodes as $node) {
    $innertext = $node->nodeValue;
    // outer HTML of the node; for inner HTML see
    // http://stackoverflow.com/questions/2087103/innerhtml-in-phps-domdocument
    $html = $dom->saveHTML($node);
}
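Comparing the class attribute with = only matches elements whose class is exactly 'classname', so the second span in the question ('classname otherclass') would be skipped. A common XPath 1.0 idiom for matching one class among several is sketched below; the $results array and its keys are names I chose for illustration:

```php
<?php
$text = "This is some <span class='classname'>example</span> text i'm writing to
demonstrate the <span class='classname otherclass'>problem</span> of this.<br />";

libxml_use_internal_errors(true); // keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($text);
$xp = new DOMXPath($dom);

// Pad the class list with spaces so ' classname ' matches whole class names only.
$query = "//span[contains(concat(' ', normalize-space(@class), ' '), ' classname ')]";

$results = [];
foreach ($xp->query($query) as $node) {
    $results[] = [
        'html' => $dom->saveHTML($node), // the element as serialized by DOMDocument
        'text' => $node->nodeValue,      // its text content
    ];
}
print_r($results);
```

The serialized HTML may differ cosmetically from the input, for example in the quoting of attributes.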
