Find h3 and h4 tags beneath it - php

This is my HTML:
<h3>test 1</h3>
<p>blah</p>
<h4>subheading 1</h4>
<p>blah</p>
<h4>subheading 2</h4>
<h3>test 2</h3>
<h4>subheading 3</h4>
<p>blah</p>
<h3>test 3</h3>
I am trying to build an array of the h3 tags, with the h4 tags nested within them. An example of the array would look like:
Array
(
[test 1] => Array
(
[0] => subheading 1
[1] => subheading 2
)
[test 2] => Array
(
[0] => subheading 3
)
[test 3] => Array
(
)
)
Happy to use preg_match or DOMDocument, any ideas?

With DOMDocument:
Use the XPath "//h3" to find every <h3>. These will be the first-level entries in your array.
For each of them:
increment a counter $i as part of the loop (count from 1!)
use the XPath "./following::h4[count(preceding::h3) = $i]" to find its subordinate <h4> elements
these will be the second-level entries in your array
The XPath expression means "select all <h4> that have the same fixed number of preceding <h3>". For the first <h3> that count is 1, for the second it is 2, and so on.
Be sure to evaluate the XPath expression in the context of the respective <h3> node.
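The steps above can be sketched as follows. This is a minimal sketch, assuming the HTML from the question is held in a string; the variable names $html, $result and $subheadings are illustrative and not part of the original answer.

```php
<?php
// HTML from the question.
$html = '<h3>test 1</h3>
<p>blah</p>
<h4>subheading 1</h4>
<p>blah</p>
<h4>subheading 2</h4>
<h3>test 2</h3>
<h4>subheading 3</h4>
<p>blah</p>
<h3>test 3</h3>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$result = [];
$i = 0;
foreach ($xpath->query('//h3') as $h3) {
    $i++; // 1 for the first <h3>, 2 for the second, ...
    $subheadings = [];
    // Evaluated in the context of the current <h3>: every later <h4>
    // that has exactly $i <h3> elements before it.
    $query = "./following::h4[count(preceding::h3) = {$i}]";
    foreach ($xpath->query($query, $h3) as $h4) {
        $subheadings[] = trim($h4->textContent);
    }
    $result[trim($h3->textContent)] = $subheadings;
}

print_r($result);
```

The third <h3> has no following <h4>, so its entry is an empty array, matching the structure requested in the question.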

Related

Foreach does not get xpath results from node

I use WebDriver with XPath to find a div, and I need to read data from each child node of that div, but this is not working.
HTML:
<div class="elements">
<div class="element"><div class="title">Title A</div></div>
<div class="element"><div class="title">Title B</div></div>
<div class="element"><div class="title">Title C</div></div>
</div>
PHP Code:
$elements = array();
$data = $driver->findElements(WebDriverBy::xpath("//div[@class='elements']//div[@class='element']"));
foreach ($data as $i => $element) {
$elements[$i]["title"] = $element->findElement(WebDriverBy::xpath("//div[@class='title']"))->getText();
}
Result Array $elements being returned:
Array
(
[0] => Array
(
[title] => Title A
)
[1] => Array
(
[title] => Title A
)
[2] => Array
(
[title] => Title A
)
)
The above script is only returning Title A 3 times.
I need it to behave as if the XPath had a positional index [x]. Example:
(//div[@class='elements']//div[@class='element'])[1]//div[@class='title'] for Title A
(//div[@class='elements']//div[@class='element'])[2]//div[@class='title'] for Title B
(//div[@class='elements']//div[@class='element'])[3]//div[@class='title'] for Title C
I can't use an index because the real XPath is very long and would clutter the code.
Shouldn't the XPath inside the foreach be evaluated against the current node?
When you use a WebElement to locate another WebElement with XPath, you need to start the path with . (the current context):
$element->findElement(WebDriverBy::xpath(".//div[@class='title']"))
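The difference between "//" and ".//" can be reproduced without a browser. The sketch below is an illustration only: it swaps WebDriver for PHP's built-in DOMXPath, which follows the same XPath rule, and the variable names $absolute and $relative are made up for the comparison.

```php
<?php
$html = '<div class="elements">
<div class="element"><div class="title">Title A</div></div>
<div class="element"><div class="title">Title B</div></div>
<div class="element"><div class="title">Title C</div></div>
</div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$absolute = [];
$relative = [];
$nodes = $xpath->query("//div[@class='elements']//div[@class='element']");
foreach ($nodes as $element) {
    // "//" starts at the document root, so the context node is ignored
    // and the first title in the whole document is found every time.
    $absolute[] = $xpath->query("//div[@class='title']", $element)->item(0)->textContent;
    // ".//" starts at the context node, so each element yields its own title.
    $relative[] = $xpath->query(".//div[@class='title']", $element)->item(0)->textContent;
}

print_r($absolute); // Title A three times
print_r($relative); // Title A, Title B, Title C
```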

Is it possible to exclude parts of the matched string in preg_match?

While writing a script that is supposed to extract content from a specific div, I was wondering whether it is possible to skip some part of the pattern so that it is not included in the match result.
Example:
<?php
$html = '
<div class="items">
<div class="item-s-1827">
content 1
</div>
<div class="item-s-1827">
content 2
</div>
<div class="item-s-1827">
content 3
</div>
</div>
';
preg_match_all('/<div class=\"item-s-([0-9]*?)\">([^`]*?)<\/div>/', $html, $match);
print_r($match);
/*
Array
(
[0] => Array
(
[0] => <div class="item-s-1827">
content 1
</div>
[1] => <div class="item-s-1827">
content 2
</div>
[2] => <div class="item-s-1827">
content 3
</div>
)
[1] => Array
(
[0] => 1827
[1] => 1827
[2] => 1827
)
[2] => Array
(
[0] =>
content 1
[1] =>
content 2
[2] =>
content 3
) ) */
Is it possible to omit class=\"item-s-([0-9]*?)\" in such a way that it does not appear in the $match variable?
In general, you can assert strings precede or follow your search string with positive lookbehinds / positive lookaheads. In the case of a lookbehind, the pattern must be of a fixed length which stands in conflict with your requirements. But fortunately there's a powerful alternative to that: You can make use of \K (keep text out of regex), see http://php.net/manual/en/regexp.reference.escape.php:
\K can be used to reset the match start since PHP 5.2.4. For example, the pattern foo\Kbar matches "foobar", but reports that it has matched "bar". The use of \K does not interfere with the setting of captured substrings. For example, when the pattern (foo)\Kbar matches "foobar", the first substring is still set to "foo".
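The manual's second example can be checked directly. This is a small sketch; the variable name $m is arbitrary.

```php
<?php
// (foo) is still captured as group 1, but \K drops it from the
// overall match, so element 0 holds only "bar".
preg_match('/(foo)\Kbar/', 'foobar', $m);
print_r($m);
// $m[0] is "bar", $m[1] is "foo"
```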
So here's the regex (I made some additional changes to that), with \K and a positive lookahead:
preg_match_all('/<div class="item-s-[0-9]+">\s*\K[^<]*?(?=\s*<\/div>)/', $html, $match);
print_r($match);
prints
Array
(
[0] => Array
(
[0] => content 1
[1] => content 2
[2] => content 3
)
)
The preferred way to parse HTML in PHP is to use DomDocument to load the HTML and then DomXPath to search the result object.
Update
Modified based on comments to question so that <div> class names just have to begin with item-s-.
$html = '<div class="items">
<div class="item-s-1827">
content 1
</div>
<div class="item-s-18364">
content 2
</div>
<div class="item-s-1827">
content 3
</div>
</div>';
$doc = new DomDocument();
$doc->loadHTML($html);
$xpath = new DomXPath($doc);
$divs = $xpath->query("//div[starts-with(@class,'item-s-')]");
foreach ($divs as $div) {
$values[] = trim($div->nodeValue);
}
print_r($values);
Output:
Array (
[0] => content 1
[1] => content 2
[2] => content 3
)
Demo on 3v4l.org

PHP Query in what tag is the word.

I'm currently using this code to retrieve tags.
$title = $pq->find("title")->text();
$h1 = $pq->find("h1")->text();
$p = $pq->find("p")->text();
Is this the proper way of doing it?
Secondly, I have to find out which word from my array $array_words is in which tag. I have fetched the page with file_get_contents, removed all tags, and put all the words in an array. Take this for example:
Array
(
[0] => hello
[1] => there
[2] => this
[3] => is
[4] => a
[8] => test
[9] => array
)
and this would be the HTML:
<html>
<head>
<title>
hello there
</title>
</head>
<body>
<h1>
this is a
</h1>
<p>
test array
</p>
</body>
</html>
How can I find out which word is found in which tag?
I hope I made somewhat clear what I'm trying to do.
Based on the question, you need to record which words from $array_words appear in which HTML tag.
So you have an array of tags that you want to check, right?
As I see it, the steps are:
Get all the tags that you want to check.
Loop over those tags with a foreach.
Inside the loop, use phpQuery to find the text inside each tag.
phpQuery returns text, so split it into a new array of words, $words_from_text, using explode.
In a nested foreach, use in_array to find which words from $array_words occur in the text.
If a word from $words_from_text is found in $array_words, add it to an array stored under that tag's key.
$array_tags = array('h1', 'div', 'title');
$array_words =
(
[0] => hello
[1] => there
[2] => this
[3] => is
[4] => a
[8] => test
[9] => array
)
The final array with the results should look like this:
$array_tags = array(
'title' => array('word1','word2'),
'h1' => array('word3','word4'),
'div' => array('word5','word6')
);
If this is what you need, you can use this guideline to solve your problem.
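The guideline could be sketched like this. Note the assumptions: the sketch uses PHP's built-in DOMDocument in place of phpQuery so that it runs without extra libraries, it splits on any whitespace with preg_split rather than explode because the sample text contains newlines, and the name $words_by_tag is made up for the result.

```php
<?php
$html = '<html><head><title>
hello there
</title></head><body><h1>
this is a
</h1><p>
test array
</p></body></html>';

$array_words = ['hello', 'there', 'this', 'is', 'a', 'test', 'array'];
$array_tags  = ['title', 'h1', 'p'];

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

$words_by_tag = [];
foreach ($array_tags as $tag) {
    $words_by_tag[$tag] = [];
    foreach ($doc->getElementsByTagName($tag) as $node) {
        // Break the tag's text into individual words.
        $words_from_text = preg_split('/\s+/', $node->textContent, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words_from_text as $word) {
            // Keep the word if it is one we are looking for and not yet recorded.
            if (in_array($word, $array_words, true) && !in_array($word, $words_by_tag[$tag], true)) {
                $words_by_tag[$tag][] = $word;
            }
        }
    }
}

print_r($words_by_tag);
```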

How Can I Display First 2 Paragraphs? And then Remaining Paragraphs? - PHP

I have 4 paragraphs of text in one string. Each paragraph is surrounded with <p></p>.
My first goal is to output the first 2 paragraphs.
My second goal is to output the remaining paragraphs somewhere else on the page. I could sometimes be dealing with strings containing more than 4 paragraphs.
I've searched on the web for anything already out there. There's quite a bit about displaying just the first paragraph, but nothing I could find about displaying paragraphs 1-2 and then the remaining paragraphs. Can anyone help here?
I'm not sure which function to use, if any: substr, strpos, etc.?
EDIT - thanks for your answers, to clarify, the paragraphs don't contain HTML at the moment, but yes I will need the option to have HTML within each paragraph.
Use regular expression:
$str = '<p style="color:red;"><b>asd</b>para<img src="afs"/>graph 1</p >
<p>paragraph 2</p>
<p>paragraph 3</p>
<p>paragraph 4</p>
';
// preg_match_all('/<p.*>([^\<]+)<\/p\s*>/i',$str,$matches);
// for paragraphs with HTML inside, as a comment says:
preg_match_all('/<p[^\>]*>(.*)<\/p\s*>/i',$str,$matches);
print_r($matches);
prints:
Array
(
[0] => Array
(
[0] => <p style="color:red;"><b>asd</b>para<img src="afs"/>graph 1</p >
[1] => <p>paragraph 2</p>
[2] => <p>paragraph 3</p>
[3] => <p>paragraph 4</p>
)
[1] => Array
(
[0] => <b>asd</b>para<img src="afs"/>graph 1
[1] => paragraph 2
[2] => paragraph 3
[3] => paragraph 4
)
)
Use DOMDocument
Initialize with:
$dom = new DOMDocument;
$dom->loadHTML($myString);
$p = $dom->getElementsByTagName('p');
If each paragraph can contain other HTML elements (or not), create a function:
function getInner(DOMElement $node) {
$tmp = "";
foreach($node->childNodes as $c) {
$tmp .= $c->ownerDocument->saveXML($c);
}
return $tmp;
}
and then use that function when needing the paragraph like so:
$p1 = getInner($p->item(0));
You can read more about DOMDocument in the PHP manual.
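Neither answer shows the final split into "first two" and "the rest". One way to do it, as a sketch: collect every paragraph into an array and cut it with array_slice. The sample string and the names $paragraphs, $firstTwo and $remaining are assumptions for illustration.

```php
<?php
$myString = '<p>paragraph 1</p><p>paragraph <b>2</b></p><p>paragraph 3</p><p>paragraph 4</p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($myString);

// Serialize each <p>, including any HTML nested inside it.
$paragraphs = [];
foreach ($dom->getElementsByTagName('p') as $p) {
    $paragraphs[] = $dom->saveHTML($p);
}

$firstTwo  = array_slice($paragraphs, 0, 2); // paragraphs 1-2
$remaining = array_slice($paragraphs, 2);    // everything after, however many there are

echo implode("\n", $firstTwo), "\n";
// ... elsewhere on the page:
echo implode("\n", $remaining), "\n";
```

Because array_slice with only an offset returns everything from that position on, the same code works for strings with more than four paragraphs.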

Regular expression in PHP to return array with all images from html, eg: all src="images/header.jpg" instances

I'd like to be able to return an array with a list of all images (src="" values) from html
[0] = "images/header.jpg"
[1] = "images/person.jpg"
is there a regular expression that can do this?
Many thanks in advance!
Welcome to the world of the millionth "how to extract these values using regex" question ;-) I suggest using the search tool before asking; here is just a handful of topics that provide code to do exactly what you need:
replacing all image src tags in HTML text
getting image src in php
How to extract img src, title and alt from html using php?
Matching SRC attribute of IMG tag using preg_match
php regex : get src value
Dynamically replace the “src” attributes of all <img> tags (redux)
preg_match_all , get all img tag that include a string
/src="([^"]+)"/
The image will be in group 1.
Example:
preg_match_all('/src="([^"]+)"/', '<img src="lol"><img src="wat">', $arr, PREG_PATTERN_ORDER);
Returns:
Array
(
[0] => Array
(
[0] => src="lol"
[1] => src="wat"
)
[1] => Array
(
[0] => lol
[1] => wat
)
)
Here is a more polished version of the regular expression provided by Håvard:
/(?<=src=")[^"]+(?=")/
This expression uses Lookahead & Lookbehind Assertions to get only what you want.
$str = '<img src="/img/001.jpg"><img src="/img/002.jpg">';
preg_match_all('/(?<=src=")[^"]+(?=")/', $str, $srcs, PREG_PATTERN_ORDER);
print_r($srcs);
The output will look like the following:
Array
(
[0] => Array
(
[0] => /img/001.jpg
[1] => /img/002.jpg
)
)
I see that many people struggle with Håvard's post and the <script> issue. Here is the same solution in a stricter form:
<img.*?src="([^"]+)".*?>
Example:
preg_match_all('/<img.*?src="([^"]+)".*?>/', '<img src="lol"><img src="wat">', $arr, PREG_PATTERN_ORDER);
Returns:
Array
(
[0] => Array
(
[0] => <img src="lol">
[1] => <img src="wat">
)
[1] => Array
(
[0] => lol
[1] => wat
)
)
This avoids matching the src attribute of other tags.
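For comparison, the same list can be built without a regular expression. This is a sketch using PHP's built-in DOMDocument, which is not what the answers above use; the sample markup and the name $srcs are assumptions.

```php
<?php
$html = '<img src="images/header.jpg"><script src="app.js"></script><img src="images/person.jpg">';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);

// Only <img> elements are visited, so the <script> src is never picked up.
$srcs = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    $srcs[] = $img->getAttribute('src');
}

print_r($srcs);
```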
