PHP preg_match_all extract id and name, where id in tag is optional - php

I have the following code:
<?php
$html = '<div>
<div class="block">
<div class="id">10</div>
<div class="name">first element</div>
</div>
<div class="block">
<div class="name">second element</div>
</div>
<div class="block">
<div class="id">30</div>
<div class="name">third element</div>
</div>
</div>';
preg_match_all('/<div class="block">[\s]+<div class="id">(.*?)<\/div>[\s]+<div class="name">(.*?)<\/div>[\s]+<\/div>/ms', $html, $matches);
print_r($matches);
I want to get an array of id and name pairs, but the second block has no id, so my pattern skips it. How can I build the array without skipping that block, and get something like [ ... [id => 0 // or null, name => 'second element'] ...]?

Use DOMDocument to solve this task; there are a lot of good reasons not to use regular expressions.
Assuming your HTML code is stored in $html variable, create an instance of DOMDocument, load the HTML code, and initialize DOMXPath:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html, LIBXML_NOBLANKS);
$dom->formatOutput = true;
$xpath = new DOMXPath($dom);
Use DOMXPath to search for all <div> nodes with class "name" and prepare an empty array for the results:
$nodes = $xpath->query('//div[@class="name"]');
$result = array();
For each node found, run an additional query to find the optional node with class "id", then add a record to the results array:
foreach ($nodes as $node) {
$id = $xpath->query('div[@class="id"]', $node->parentNode);
$result[] = array(
'id' => $id->count() ? $id->item(0)->nodeValue : null,
'name' => $node->nodeValue
);
}
print_r($result);
This is the result:
Array
(
[0] => Array
(
[id] => 10
[name] => first element
)
[1] => Array
(
[id] =>
[name] => second element
)
[2] => Array
(
[id] => 30
[name] => third element
)
)
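If the regular expression has to stay, a minimal sketch of one way to keep the blocks without an id is to wrap the id part in an optional non-capturing group. The shortened sample input and the $rows variable are assumptions for illustration, and PREG_UNMATCHED_AS_NULL requires PHP 7.2 or later:

```php
<?php
// Hypothetical sample input mirroring the question's markup.
$html = '<div>
<div class="block">
<div class="id">10</div>
<div class="name">first element</div>
</div>
<div class="block">
<div class="name">second element</div>
</div>
</div>';

// (?: ... )? makes the whole id <div> optional; group 1 stays unmatched when it is absent.
$pattern = '/<div class="block">\s+(?:<div class="id">(.*?)<\/div>\s+)?<div class="name">(.*?)<\/div>\s+<\/div>/s';

// PREG_SET_ORDER groups the captures per block;
// PREG_UNMATCHED_AS_NULL reports a missing id as null instead of an empty string.
preg_match_all($pattern, $html, $matches, PREG_SET_ORDER | PREG_UNMATCHED_AS_NULL);

$rows = [];
foreach ($matches as $m) {
    $rows[] = ['id' => $m[1], 'name' => $m[2]];
}
print_r($rows);
```

This still depends on the exact whitespace and attribute layout of the markup, which is the main reason the DOM approach above tends to be more robust.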

Related

Foreach does not get xpath results from node

I use XPath with WebDriver to find a div in the page, and I need to get data from each child node of this div, but this is not happening.
HTML:
<div class="elements">
<div class="element"><div class="title">Title A</div></div>
<div class="element"><div class="title">Title B</div></div>
<div class="element"><div class="title">Title C</div></div>
</div>
PHP Code:
$elements = array();
$data = $driver->findElements(WebDriverBy::xpath("//div[@class='elements']//div[@class='element']"));
foreach ($data as $i => $element) {
$elements[$i]["title"] = $element->findElement(WebDriverBy::xpath("//div[@class='title']"))->getText();
}
Result Array $elements being returned:
Array
(
[0] => Array
(
[title] => Title A
)
[1] => Array
(
[title] => Title A
)
[2] => Array
(
[title] => Title A
)
)
The above script only returns Title A, three times.
I need it to behave as if the XPath had a positional index [x]. Example:
(//div[@class='elements']//div[@class='element'])[1]//div[@class='title'] for Title A
(//div[@class='elements']//div[@class='element'])[2]//div[@class='title'] for Title B
(//div[@class='elements']//div[@class='element'])[3]//div[@class='title'] for Title C
I can't use an index because the real XPath is too long and would clutter the code a lot.
Shouldn't the XPath inside the foreach be evaluated relative to the current node?
When using a WebElement to locate another WebElement with XPath, you need to start the path with . (the current context node); a path starting with // searches the whole document:
$element->findElement(WebDriverBy::xpath(".//div[@class='title']"))
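The same absolute-versus-relative behaviour can be reproduced without a browser. The following is only a sketch that uses PHP's built-in DOMXPath in place of WebDriver (an assumption made so the example runs on its own); the $absolute and $relative arrays are illustrative names:

```php
<?php
$html = '<div class="elements">
<div class="element"><div class="title">Title A</div></div>
<div class="element"><div class="title">Title B</div></div>
<div class="element"><div class="title">Title C</div></div>
</div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$absolute = [];
$relative = [];
foreach ($xpath->query("//div[@class='elements']//div[@class='element']") as $element) {
    // "//" starts at the document root even though a context node is passed.
    $absolute[] = $xpath->query("//div[@class='title']", $element)->item(0)->nodeValue;
    // ".//" starts at the context node, so each element yields its own title.
    $relative[] = $xpath->query(".//div[@class='title']", $element)->item(0)->nodeValue;
}

print_r($absolute);
print_r($relative);
```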

Is it possible to exclude parts of the matched string in preg_match?

While writing a script that is supposed to download content from a specific div, I was wondering whether it is possible to skip some part of the pattern so that it is not included in the match result.
Example:
<?php
$html = '
<div class="items">
<div class="item-s-1827">
content 1
</div>
<div class="item-s-1827">
content 2
</div>
<div class="item-s-1827">
content 3
</div>
</div>
';
preg_match_all('/<div class=\"item-s-([0-9]*?)\">([^`]*?)<\/div>/', $html, $match);
print_r($match);
/*
Array
(
[0] => Array
(
[0] => <div class="item-s-1827">
content 1
</div>
[1] => <div class="item-s-1827">
content 2
</div>
[2] => <div class="item-s-1827">
content 3
</div>
)
[1] => Array
(
[0] => 1827
[1] => 1827
[2] => 1827
)
[2] => Array
(
[0] =>
content 1
[1] =>
content 2
[2] =>
content 3
) ) */
Is it possible to leave out class=\"item-s-([0-9]*?)\" in such a way that it does not appear in the $match variable?
In general, you can assert that certain strings precede or follow your match with positive lookbehinds and positive lookaheads. A lookbehind, however, must have a fixed length, which conflicts with your requirements. Fortunately there is a powerful alternative: you can use \K (keep text out of the match), see http://php.net/manual/en/regexp.reference.escape.php:
\K can be used to reset the match start since PHP 5.2.4. For example, the pattern foo\Kbar matches "foobar", but reports that it has matched "bar". The use of \K does not interfere with the setting of captured substrings. For example, when the pattern (foo)\Kbar matches "foobar", the first substring is still set to "foo".
So here's the regex (I made some additional changes to it), with \K and a positive lookahead:
preg_match_all('/<div class="item-s-[0-9]+">\s*\K[^<]*?(?=\s*<\/div>)/', $html, $match);
print_r($match);
prints
Array
(
[0] => Array
(
[0] => content 1
[1] => content 2
[2] => content 3
)
)
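The manual's (foo)\Kbar example quoted above can be checked directly. This is just a small sketch; $m is an arbitrary variable name:

```php
<?php
// \K resets the start of the reported match, but capture groups set before it are kept.
preg_match('/(foo)\Kbar/', 'foobar', $m);

print_r($m); // $m[0] holds the reported match, $m[1] the first capture group
```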
The preferred way to parse HTML in PHP is to use DOMDocument to load the HTML and then DOMXPath to search the resulting object.
Update
Modified based on comments to question so that <div> class names just have to begin with item-s-.
$html = '<div class="items">
<div class="item-s-1827">
content 1
</div>
<div class="item-s-18364">
content 2
</div>
<div class="item-s-1827">
content 3
</div>
</div>';
$doc = new DomDocument();
$doc->loadHTML($html);
$xpath = new DomXPath($doc);
$divs = $xpath->query("//div[starts-with(@class,'item-s-')]");
foreach ($divs as $div) {
$values[] = trim($div->nodeValue);
}
print_r($values);
Output:
Array (
[0] => content 1
[1] => content 2
[2] => content 3
)
Demo on 3v4l.org
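One caveat worth sketching: starts-with() only looks at the beginning of the whole class attribute, so a div whose class list puts another class first would be missed. Assuming that case can occur in your markup (the extra "box" class below is hypothetical), a common XPath 1.0 workaround is to prepend a space and search for ' item-s-':

```php
<?php
// Hypothetical markup: the second item carries an additional class before item-s-*.
$html = '<div class="items">
<div class="item-s-1827">content 1</div>
<div class="box item-s-18364">content 2</div>
</div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Matches only when the attribute value itself begins with item-s-.
$strict = $xpath->query("//div[starts-with(@class,'item-s-')]");

// Matches any class token that begins with item-s-, wherever it appears in the list.
$tokenized = $xpath->query("//div[contains(concat(' ', normalize-space(@class)), ' item-s-')]");

$values = [];
foreach ($tokenized as $div) {
    $values[] = trim($div->nodeValue);
}
print_r($values);
```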

Storing XML Document with XPath and PHP, tag info isn't storing in array like needed

So, I want to iterate through the XML by the salesid attribute of each <emp> and then print the <report> tags inside the corresponding <emp> tag. This is the structure:
<emp salesid="1">
<report>07-14-2015_DPLOH_SalesID_1.pdf</report>
<report>07-17-2015_DPLOH_SalesID_1.pdf</report>
<report>07-14-2015_DTE_SalesID_1.pdf</report>
<report>07-14-2015_IDT_SalesID_1.pdf</report>
<report>07-14-2015_Kratos_SalesID_1.pdf</report>
<report>07-14-2015_Spark_SalesID_1.pdf</report>
</emp>
Here is my code:
$xml = new SimpleXMLElement($xmlStr);
foreach($xml->xpath("//emp/report") as $node) {
//For all found nodes retrieve its ID from parent <emp> and store in $arr
$id = $node->xpath("../@salesid");
$id = (int)$id[0];
if(!isset($arr[$id])) {
$arr[$id] = array();
}
//Then we iterate through all nodes and store <report> in $arr
foreach($node as $report) {
$arr[$id][] = (string)$report;
}
}
echo "<pre>";
print_r($arr);
echo "</pre>";
However, this is what I get for output:
Array
(
[1] => Array
(
)
[10] => Array
(
)
... and it continues to iterate through all of the salesid attributes of the <emp> tags, but never fills the array with any information.
If anyone could help tell me what I'm missing, I would GREATLY appreciate it. I feel like I'm losing my mind over what seems like should be rather simple.
Thanks!
You're very close. The code isn't working because of the second (inner) loop. The outer loop already iterates through all of the report elements, so $node is a report element. When you then try to iterate through the children of that report, there's nothing there.
Instead of the second (inner) loop, simply do this:
$arr[$id][] = (string)$node;
When I did, I got the following result:
<pre>
Array
(
[1] => Array
(
[0] => 07-14-2015_DPLOH_SalesID_1.pdf
[1] => 07-17-2015_DPLOH_SalesID_1.pdf
[2] => 07-14-2015_DTE_SalesID_1.pdf
[3] => 07-14-2015_IDT_SalesID_1.pdf
[4] => 07-14-2015_Kratos_SalesID_1.pdf
[5] => 07-14-2015_Spark_SalesID_1.pdf
)
)
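An alternative that avoids the parent lookup is to loop over the <emp> elements and then over their <report> children. This is only a sketch: the <emps> wrapper element and the shortened file names are assumptions, since the question does not show the root of the document:

```php
<?php
// Hypothetical document; the real root element is not shown in the question.
$xmlStr = '<emps>
<emp salesid="1"><report>a.pdf</report><report>b.pdf</report></emp>
<emp salesid="10"><report>c.pdf</report></emp>
</emps>';

$xml = new SimpleXMLElement($xmlStr);
$arr = [];

foreach ($xml->xpath('//emp') as $emp) {
    // Attributes are read with array syntax on the element.
    $id = (int) $emp['salesid'];
    // Child elements are read with property syntax; this iterates every <report> of this <emp>.
    foreach ($emp->report as $report) {
        $arr[$id][] = (string) $report;
    }
}

print_r($arr);
```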
I updated your script to work slightly differently:
$emp = new SimpleXMLElement($xmlStr);
$id = intval($emp['salesid']);
$arr = array(
$id => array(),
);
$lst = $emp->xpath('/emp/report');
// each() was removed in PHP 8.0, so iterate with foreach instead
foreach ($lst as $text)
{
$arr[$id][] = (string) $text;
}
echo "<pre>";
print_r($arr);
echo "</pre>";
Cheers

preg_match returns an empty string even though there is a match

I am trying to extract all meta tags from a web page. I am currently using preg_match_all for that, but unfortunately it returns empty strings for the array entries.
<?php
$meta_tag_pattern = '/<meta(?:"[^"]*"[\'"]*|\'[^\']*\'[\'"]*|[^\'">])+>/';
$meta_url = file_get_contents('test.html');
if(preg_match_all($meta_tag_pattern, $meta_url, $matches) == 1)
echo "there is a match <br>";
print_r($matches);
?>
Returned array:
Array ( [0] => Array ( [0] => [1] => [2] => [3] => ) ) Array ( [0] => Array ( [0] => [1] => [2] => [3] => ) )
An example with DOMDocument:
$url = 'test.html';
$dom = new DOMDocument();
@$dom->loadHTMLFile($url);
$metas = $dom->getElementsByTagName('meta');
foreach ($metas as $meta) {
echo htmlspecialchars($dom->saveHTML($meta));
}
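If the goal is the data inside the tags and not their markup, the attributes can be read directly. A minimal sketch, assuming an inline HTML string in place of test.html and a hypothetical $tags array keyed by the name or property attribute:

```php
<?php
// Hypothetical page content standing in for test.html.
$html = '<html><head>
<meta name="description" content="Demo page">
<meta property="og:type" content="website">
</head><body></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$tags = [];
foreach ($dom->getElementsByTagName('meta') as $meta) {
    // getAttribute() returns an empty string when the attribute is missing.
    $key = $meta->getAttribute('name') ?: $meta->getAttribute('property');
    if ($key !== '') {
        $tags[$key] = $meta->getAttribute('content');
    }
}
print_r($tags);
```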
UPDATED: Example grabbing meta tags from URL:
$meta_tag_pattern = '/<meta\s[^>]+>/';
$meta_url = file_get_contents('http://stackoverflow.com/questions/10551116/html-php-escape-and-symbols-while-echoing');
if(preg_match_all($meta_tag_pattern, $meta_url, $matches))
echo "there is a match <br>";
foreach ( $matches[0] as $value ) {
print htmlentities($value) . '<br>';
}
Outputs:
there is a match
<meta name="twitter:card" content="summary">
<meta name="twitter:domain" content="stackoverflow.com"/>
<meta name="og:type" content="website" />
...
Part of the problem appears to be that the browser interprets the matched meta tags as markup instead of displaying them as text when you print_r the output, so they need to be escaped.
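Two details from this thread can be seen in a small sketch (the $page string is made up for illustration): preg_match_all returns the number of matches, so comparing it with == 1 only succeeds for exactly one tag, and escaping turns the matched tags into visible text:

```php
<?php
// Hypothetical page fragment with two meta tags.
$page = '<head><meta name="a" content="1"><meta name="b" content="2"></head>';

// The return value is the number of full matches, not a boolean.
$count = preg_match_all('/<meta\s[^>]+>/', $page, $matches);

// Escape each match so a browser shows it as text instead of parsing it as a tag.
$escaped = array_map('htmlspecialchars', $matches[0]);

echo $count, "\n";
print_r($escaped);
```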

Can I find selected options in a form using simplexml?

I'm able to find a select's options on a website using the following code:
$dom = new DOMDocument();
$dom->loadHTMLFile('http://webseven.com.au/carl/testpage.htm');
$xml = simplexml_import_dom($dom);
//print_r($xml);
$select = $xml->xpath('//table/tr/td/select');
print_r($select);
I get (as an example)
[0] => SimpleXMLElement Object
(
[@attributes] => Array
(
[name] => product_OnWeb
[tabindex] => 4
)
[option] => Array
(
[0] => Yes
[1] => No
)
)
But I cannot find a way to find which of those is selected. Can this be done with SimpleXML or is there another method?
You need to loop through all the options (using foreach ( $node->option ... )), and check for the selected attribute (using $node['selected']):
$dom = new DOMDocument();
$dom->loadHTMLFile('http://webseven.com.au/carl/testpage.htm');
$xml = simplexml_import_dom($dom);
$selects = $xml->xpath('//table/tr/td/select');
foreach ( $selects as $select_node )
{
echo $select_node['name'] . ': ';
foreach ( $select_node->option as $option_node )
{
if ( isset($option_node['selected']) )
{
echo $option_node['value'] . ' ';
}
}
echo "\n";
}
As an aside, you are likely to be led astray if you use print_r to debug SimpleXML, as it doesn't show you the true state of the object. I've written a simplexml_dump function which might be more useful.
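The same check can also be pushed into XPath with an attribute test. The following sketch uses an inline table in place of the remote page (the markup and the $selected array are assumptions for illustration):

```php
<?php
// Hypothetical form markup standing in for the remote test page.
$html = '<table><tr><td>
<select name="product_OnWeb">
<option value="1">Yes</option>
<option value="0" selected>No</option>
</select>
</td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xml = simplexml_import_dom($dom);

$selected = [];
foreach ($xml->xpath('//select') as $select) {
    // option[@selected] is evaluated relative to the current <select>
    // and keeps only options that carry a selected attribute.
    foreach ($select->xpath('option[@selected]') as $option) {
        $selected[(string) $select['name']] = (string) $option['value'];
    }
}
print_r($selected);
```

Note that this reflects the selected attribute in the served HTML, not any choice a user makes later in the browser.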
