Is it possible to exclude parts of the matched string in preg_match? - php

When writing a script that downloads content from a specific div, I wondered whether part of the pattern can be skipped so that it is not included in the match result.
Example:
<?php
$html = '
<div class="items">
<div class="item-s-1827">
content 1
</div>
<div class="item-s-1827">
content 2
</div>
<div class="item-s-1827">
content 3
</div>
</div>
';
preg_match_all('/<div class=\"item-s-([0-9]*?)\">([^`]*?)<\/div>/', $html, $match);
print_r($match);
/*
Array
(
[0] => Array
(
[0] => <div class="item-s-1827">
content 1
</div>
[1] => <div class="item-s-1827">
content 2
</div>
[2] => <div class="item-s-1827">
content 3
</div>
)
[1] => Array
(
[0] => 1827
[1] => 1827
[2] => 1827
)
[2] => Array
(
[0] =>
content 1
[1] =>
content 2
[2] =>
content 3
) ) */
Is it possible to omit class=\"item-s-([0-9]*?)\" in such a way that it does not appear in the $match variable?

In general, you can assert that certain strings precede or follow your search string with positive lookbehinds and positive lookaheads. A lookbehind pattern must have a fixed length, which conflicts with your requirements. Fortunately there is a powerful alternative: you can use \K (keep text out of the match), see http://php.net/manual/en/regexp.reference.escape.php:
\K can be used to reset the match start since PHP 5.2.4. For example, the pattern foo\Kbar matches "foobar", but reports that it has matched "bar". The use of \K does not interfere with the setting of captured substrings. For example, when the pattern (foo)\Kbar matches "foobar", the first substring is still set to "foo".
So here's the regex (I made some additional changes to it), with \K and a positive lookahead:
preg_match_all('/<div class="item-s-[0-9]+">\s*\K[^<]*?(?=\s*<\/div>)/', $html, $match);
print_r($match);
prints
Array
(
[0] => Array
(
[0] => content 1
[1] => content 2
[2] => content 3
)
)
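If \K feels obscure, an alternative sketch is to match the class prefix without parentheses and read only capture group 1. This assumes the same sample markup as in the question; the $contents variable is introduced here for illustration only.

```php
<?php
// Sample markup copied from the question.
$html = '
<div class="items">
<div class="item-s-1827">
content 1
</div>
<div class="item-s-1827">
content 2
</div>
<div class="item-s-1827">
content 3
</div>
</div>
';

// The class suffix is matched but not wrapped in parentheses, so it never
// becomes a capture group. Only the trimmed div body is captured.
preg_match_all('/<div class="item-s-[0-9]+">\s*([^<]*?)\s*<\/div>/', $html, $match);

// $match[0] still holds the full matches; $match[1] holds only the content.
$contents = $match[1];
print_r($contents);
```

Unlike the \K version, the full match in $match[0] still includes the surrounding tags, so this only helps if you are happy to ignore element 0.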

The preferred way to parse HTML in PHP is to use DOMDocument to load the HTML and then DOMXPath to search the resulting object.
Update
Modified based on comments to question so that <div> class names just have to begin with item-s-.
$html = '<div class="items">
<div class="item-s-1827">
content 1
</div>
<div class="item-s-18364">
content 2
</div>
<div class="item-s-1827">
content 3
</div>
</div>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
// Select every <div> whose class attribute begins with "item-s-".
$divs = $xpath->query("//div[starts-with(@class,'item-s-')]");
$values = [];
foreach ($divs as $div) {
    $values[] = trim($div->nodeValue);
}
print_r($values);
Output:
Array (
[0] => content 1
[1] => content 2
[2] => content 3
)

Related

Foreach does not get xpath results from node

I use XPath with WebDriver to find a div in the page, and I need to get data from each child node of this div, but this is not happening.
HTML:
<div class="elements">
<div class="element"><div class="title">Title A</div></div>
<div class="element"><div class="title">Title B</div></div>
<div class="element"><div class="title">Title C</div></div>
</div>
PHP Code:
$elements = array();
$data = $driver->findElements(WebDriverBy::xpath("//div[@class='elements']//div[@class='element']"));
foreach ($data as $i => $element) {
    $elements[$i]["title"] = $element->findElement(WebDriverBy::xpath("//div[@class='title']"))->getText();
}
Result Array $elements being returned:
Array
(
[0] => Array
(
[title] => Title A
)
[1] => Array
(
[title] => Title A
)
[2] => Array
(
[title] => Title A
)
)
The above script is only returning Title A 3 times.
I need it to work as if the XPath had a numeric index [x]. Example:
(//div[@class='elements']//div[@class='element'])[1]//div[@class='title'] for Title A
(//div[@class='elements']//div[@class='element'])[2]//div[@class='title'] for Title B
(//div[@class='elements']//div[@class='element'])[3]//div[@class='title'] for Title C
I can't use a numeric index because the XPath is too long and would clutter the code a lot.
Shouldn't the XPath inside the foreach be evaluated relative to the current node?
When you use a WebElement to locate another WebElement with XPath, you need to start the path with the current context, ".":
$element->findElement(WebDriverBy::xpath(".//div[@class='title']"))
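The same scoping rule can be reproduced without a browser. The sketch below swaps WebDriver for PHP's built-in DOMXPath (an assumption made only so the example is self-contained) and runs both forms of the expression against each element from the question's markup.

```php
<?php
// Markup copied from the question.
$html = '<div class="elements">
<div class="element"><div class="title">Title A</div></div>
<div class="element"><div class="title">Title B</div></div>
<div class="element"><div class="title">Title C</div></div>
</div>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$absolute = [];
$relative = [];
foreach ($xpath->query("//div[@class='elements']//div[@class='element']") as $element) {
    // "//" starts at the document root, so the context node has no effect
    // and the first title in the whole document is returned every time.
    $absolute[] = $xpath->query("//div[@class='title']", $element)->item(0)->textContent;
    // ".//" starts at the context node, so each element yields its own title.
    $relative[] = $xpath->query(".//div[@class='title']", $element)->item(0)->textContent;
}

print_r($absolute);
print_r($relative);
```

Here $absolute holds Title A three times, which mirrors the behaviour reported in the question, while $relative holds Title A, Title B and Title C.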

Strange behavior of preg_match_all php

I have a very long string of HTML. From this string I want to parse pairs of Russian and English city names. An example of this string is:
$html = '
Абакан
Хакасия республика
Абан
Красноярский край
Абатский
Тюменская область
';
My code is:
$subject = $this->html;
$pattern = '/<a href="([\/a-zA-Z0-9-"]*)">([а-яА-Я]*)/';
preg_match_all($pattern, $subject, $matches);
For testing I use RegExr; you can see it here: http://regexr.com/399co
The test there uses the global modifier, /g.
Because PHP has no /g modifier, I use the preg_match_all function. But the result of preg_match_all is very strange:
Array
(
[0] => Array
(
[0] => <a href="/forecasts5000/russia/republic-khakassia/abakan">Абакан
[1] => <a href="/forecasts5000/russia/krasnoyarsk-territory/aban">Абан
[2] => <a href="/forecasts5000/russia/tyumen-area/abatskij">Аба�
[3] => <a href="/forecasts5000/russia/arkhangelsk-area/abramovskij-ma">Аб�
)
[1] => Array
(
[0] => /forecasts5000/russia/republic-khakassia/abakan
[1] => /forecasts5000/russia/krasnoyarsk-territory/aban
[2] => /forecasts5000/russia/tyumen-area/abatskij
[3] => /forecasts5000/russia/arkhangelsk-area/abramovskij-ma
)
[2] => Array
(
[0] => Абакан
[1] => Абан
[2] => Аба�
[3] => Аб�
)
)
First, it seems to find only the first match, but I need an array with all matches.
Second, the result looks very strange to me. I want to get pairs such as
/forecasts5000/russia/republic-khakassia/abakan and Абакан
What am I doing wrong?
Element 0 of the result is an array of each of the full matches of the regexp. Element 1 is an array of all the matches for capture group 1, element 2 contains capture group 2, and so on.
You can invert this by using the PREG_SET_ORDER flag. Then element 0 will contain all the results from the first match, element 1 will contain all the results from the second match, and so on. Within each of these, [0] will be the full match, and the remaining elements will be the capture groups.
If you use this option, you can then get the information you want with:
foreach ($matches as $match) {
$url = $match[1];
$text = $match[2];
// Do something with $url and $text
}
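A minimal sketch of the PREG_SET_ORDER approach follows. The sample markup is an assumption: the question's HTML lost its tags, so the links are rebuilt from the hrefs shown in its output. The sketch also adds the u modifier, which the answer above does not mention; the truncated names in the question (Аба�) are consistent with a byte-wise match on UTF-8 text, and u makes PCRE work on whole characters instead. It assumes the source file itself is saved as UTF-8.

```php
<?php
// Assumed markup, reconstructed from the hrefs in the question's output.
$html = '
<a href="/forecasts5000/russia/republic-khakassia/abakan">Абакан</a>
<a href="/forecasts5000/russia/krasnoyarsk-territory/aban">Абан</a>
<a href="/forecasts5000/russia/tyumen-area/abatskij">Абатский</a>
';

// The u modifier treats pattern and subject as UTF-8, so the Cyrillic
// character class matches whole characters rather than single bytes.
$pattern = '/<a href="([\/a-zA-Z0-9-]*)">([а-яА-ЯёЁ]*)/u';

// PREG_SET_ORDER groups the results per match instead of per capture group.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$pairs = [];
foreach ($matches as $match) {
    $pairs[$match[1]] = $match[2]; // URL => city name
}
print_r($pairs);
```

Keying $pairs by URL is only one possible layout; a list of [url, name] arrays would work equally well if URLs can repeat.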
You can also use the T-Regx library, which has separate methods for each case :)
pattern('<a href="([/a-zA-Z0-9-"]*)">([а-яА-Я]*)')
    ->match($this->html)
    ->forEach(function (Match $match) {
        // Keep the matched text in its own variable so $match stays usable.
        $text = $match->text();
        $group = $match->group(1);
        echo "Match $text with group $group";
    });
It also has automatic delimiters.

How Can I Display First 2 Paragraphs? And then Remaining Paragraphs? - PHP

I have 4 paragraphs of text in one string. Each paragraph is surrounded with <p></p>.
My first goal is to output the first 2 paragraphs.
My second goal is to output the remaining paragraphs somewhere else on the page. I could sometimes be dealing with strings containing more than 4 paragraphs.
I've searched on the web for anything already out there. There's quite a bit about displaying just the first paragraph, but nothing I could find about displaying paragraphs 1-2 and then the remaining paragraphs. Can anyone help here?
Not sure which to use, if any: substr, strpos, etc.?
EDIT - thanks for your answers, to clarify, the paragraphs don't contain HTML at the moment, but yes I will need the option to have HTML within each paragraph.
Use regular expression:
$str = '<p style="color:red;"><b>asd</b>para<img src="afs"/>graph 1</p >
<p>paragraph 2</p>
<p>paragraph 3</p>
<p>paragraph 4</p>
';
// preg_match_all('/<p.*>([^\<]+)<\/p\s*>/i',$str,$matches);
// For HTML inside the paragraphs, as a comment says (lazy .*? so each match stops at the nearest closing tag):
preg_match_all('/<p[^\>]*>(.*?)<\/p\s*>/i',$str,$matches);
print_r($matches);
prints:
Array
(
[0] => Array
(
[0] => <p style="color:red;"><b>asd</b>para<img src="afs"/>graph 1</p >
[1] => <p>paragraph 2</p>
[2] => <p>paragraph 3</p>
[3] => <p>paragraph 4</p>
)
[1] => Array
(
[0] => <b>asd</b>para<img src="afs"/>graph 1
[1] => paragraph 2
[2] => paragraph 3
[3] => paragraph 4
)
)
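Once the paragraphs are in an array, splitting them into "first two" and "the rest" is a matter of slicing. The following is a sketch under the assumption that each paragraph is a well-formed, non-nested <p>...</p> block; the $first and $rest names are illustrative.

```php
<?php
$str = '<p>paragraph 1</p>
<p>paragraph 2</p>
<p>paragraph 3</p>
<p>paragraph 4</p>
';

// Lazy .*? stops at the nearest closing tag; the s modifier lets a
// paragraph span several lines.
preg_match_all('/<p\b[^>]*>.*?<\/p\s*>/is', $str, $matches);

$first = array_slice($matches[0], 0, 2); // paragraphs 1-2
$rest  = array_slice($matches[0], 2);    // everything after them

echo implode("\n", $first), "\n";
// ... later, somewhere else on the page:
echo implode("\n", $rest), "\n";
```

Because array_slice with only an offset returns everything to the end, the same code handles strings with more than four paragraphs.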
Use DOMDocument
Initialize with:
$dom = new DOMDocument;
$dom->loadHTML($myString);
$p = $dom->getElementsByTagName('p');
If each paragraph may contain other HTML elements, create a function:
function getInner(DOMElement $node) {
    $tmp = "";
    foreach ($node->childNodes as $c) {
        $tmp .= $c->ownerDocument->saveXML($c);
    }
    return $tmp;
}
and then use that function when needing the paragraph like so:
$p1 = getInner($p->item(0));
You can read more about DOMDocument in the PHP manual.

Regex pattern matches fine but output is not complete

I am trying this regex pattern:
$string = '<div class="className">AlwaysTheSame:</div>Subtitle <br /><span class="anotherClass">entry1</span><span class="anotherClass">entry2</span><span class="anotherClass">entry3</span>';
preg_match_all('|<div class="className">AlwaysTheSame:</div>(.*?)<br />(<span class="anotherClass">(.*?)</span>)*|', $string, $matches);
print_r($matches);
exit;
The <span class="anotherClass">entry</span> may be absent or may occur multiple times. The pattern seems to match fine in both cases, but the output is:
Array
(
[0] => Array
(
[0] => <div class="className">AlwaysTheSame:</div>Subtitle <br /><span class="anotherClass">entry1</span><span class="anotherClass">entry2</span><span class="anotherClass">entry3</span>
)
[1] => Array
(
[0] => Subtitle
)
[2] => Array
(
[0] => <span class="anotherClass">entry3</span>
)
[3] => Array
(
[0] => entry3
)
)
$matches[0][0] contains the full string, so the pattern matches everything I need, but in $matches[2] and $matches[3] I only get the last <span...
How can I get all of those <span... elements in the output array, and not just the last one?
You can't directly, at least not in PHP. Repeated capturing groups always contain the last expression they matched. The exception is .NET where regex matches have an additional property that allows you to access every single match of a repeated group. Also, Perl 6 can do something like this - but not PHP.
Solution: Use
~<div class="className">AlwaysTheSame:</div>(.*?)<br />((?:<span class="anotherClass">(.*?)</span>)*)~
Now the second capturing group contains all the <span> tags. With another regex you can then extract all the matches:
~(?<=<span class="anotherClass">).*?(?=</span>)~
I'm using ~ as a regex delimiter, by the way - using | is confusing IMO.
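Put together, the two-step approach might look like the sketch below. It reuses the string from the question; the variable names ($outer, $subtitle, $entries) are illustrative.

```php
<?php
$string = '<div class="className">AlwaysTheSame:</div>Subtitle <br /><span class="anotherClass">entry1</span><span class="anotherClass">entry2</span><span class="anotherClass">entry3</span>';

// Step 1: capture the subtitle and the whole run of spans as one group.
$outer = '~<div class="className">AlwaysTheSame:</div>(.*?)<br />((?:<span class="anotherClass">.*?</span>)*)~';

$subtitle = null;
$entries = [];
if (preg_match($outer, $string, $m)) {
    $subtitle = trim($m[1]);
    // Step 2: pull each entry out of the captured run of spans.
    preg_match_all('~(?<=<span class="anotherClass">).*?(?=</span>)~', $m[2], $inner);
    $entries = $inner[0];
}

print_r($entries);
```

If no spans are present, group 2 is an empty string and $entries stays an empty array, which covers the "may not exist" case from the question.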

Find h3 and h4 tags beneath it

This is my HTML:
<h3>test 1</h3>
<p>blah</p>
<h4>subheading 1</h4>
<p>blah</p>
<h4>subheading 2</h4>
<h3>test 2</h3>
<h4>subheading 3</h4>
<p>blah</p>
<h3>test 3</h3>
I am trying to build an array of the h3 tags, with the h4 tags nested within them. An example of the array would look like:
Array
(
[test 1] => Array
(
[0] => subheading 1
[1] => subheading 2
)
[test 2] => Array
(
[0] => subheading 3
)
[test 3] => Array
(
)
)
Happy to use preg_match or DOMDocument, any ideas?
With DOMDocument:
use the XPath "//h3" to find all <h3> elements. These will be the first-level entries in your array
for each of them:
increment a counter $i as part of the loop (count from 1!)
use the XPath "./following::h4[count(preceding::h3) = $i]" to find any subordinate <h4>
these will be the second-level entries in your array
The XPath expression means "select all <h4> that have the same constant number of preceding <h3>". For the first <h3> that count is 1, naturally, for the second the count is 2, and so on.
Be sure to execute the XPath expression in the context of the respective <h3> nodes.
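The steps above can be sketched with DOMDocument and DOMXPath as follows. The markup is the one from the question; the $result variable and the choice to key the array by the heading text are assumptions for illustration.

```php
<?php
$html = '<h3>test 1</h3>
<p>blah</p>
<h4>subheading 1</h4>
<p>blah</p>
<h4>subheading 2</h4>
<h3>test 2</h3>
<h4>subheading 3</h4>
<p>blah</p>
<h3>test 3</h3>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$result = [];
$i = 0;
foreach ($xpath->query('//h3') as $h3) {
    $i++; // 1-based position of this <h3> among all <h3> elements
    $result[$h3->textContent] = [];
    // Evaluated with $h3 as the context node: keep only the following <h4>
    // elements that have exactly $i <h3> elements before them.
    $query = "./following::h4[count(preceding::h3) = $i]";
    foreach ($xpath->query($query, $h3) as $h4) {
        $result[$h3->textContent][] = $h4->textContent;
    }
}

print_r($result);
```

Keying by heading text means two <h3> elements with identical text would overwrite each other; a numerically indexed array avoids that if duplicates are possible.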
