My string:
<div class="sect1" id="s9781473910270.i101">
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
<p>text</p>
</div>
</div>
<div class="sect1" id="s9781473910270.i103">
<p>sometext [ref*summation]</p>
</div>
<div class="figure" id="s9781473910270.i220">
<div class="metadata" id="s9781473910270.i221">
</div>
<p>fig1.2 [label*somefigure]</p>
<p>sometext [ref*somefigure]</p>
</div>
Objective: In the string above, [label*string] and [ref*string] are cross-references. I need to replace each [ref*string] with an <a> element that has class and href attributes. The href is the id of the div in which the matching label* resides, and the class of the <a> is based on the class of that div.
As mentioned above, the class and href of the <a> element come from the class and id of the enclosing div. But if that div has class="metadata", it must be ignored: its class name and id should not be used.
Expected output:
<div class="sect1" id="s9781473910270.i101">
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
<p>text</p>
</div>
</div>
<div class="sect1" id="s9781473910270.i103">
<p>sometext <a class="section-ref" href="s9781473910270.i102">1.2</a></p>
</div>
<div class="figure" id="s9781473910270.i220">
<div class="metadata" id="s9781473910270.i221">
<p>fig1.2 [label*somefigure]</p>
</div>
<p>sometext <a class="fig-ref" href="s9781473910270.i220">fig 1.2</a></p>
</div>
How can I do this in a simpler way, without using a DOM parser?
My idea is to store each label* string and its id in an array, then loop over the ref* strings; when a ref* string matches a label* string, the related id and class replace the ref* string.
So I have tried this regex to get each label* string and its related id and class name.
This approach uses the HTML structure to retrieve the needed elements with DOMXPath. Regular expressions are used only in a second step, to extract information from text nodes or attributes:
$classRel = ['sect2' => 'section-ref',
'figure' => 'fig-ref'];
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html); // or $dom->loadHTMLFile($url);
$xp = new DOMXPath($dom);
// make a custom php function available for the XPath query
// (it isn't really necessary, but it is more rigorous than writing
// "contains(@class, 'myClass')" )
$xp->registerNamespace("php", "http://php.net/xpath");
function hasClass($classNode, $className) {
if (!empty($classNode))
return in_array($className, preg_split('~\s+~', $classNode[0]->value, -1, PREG_SPLIT_NO_EMPTY));
return false;
}
$xp->registerPHPFunctions('hasClass');
// The XPath query will find the first ancestor of a text node with '[label*'
// that is a div tag with an id and a class attribute,
// if the class attribute doesn't contain the "metadata" class.
$labelQuery = <<<'EOD'
//text()[contains(., 'label*')]
/ancestor::div
[@id and @class and not(php:function('hasClass', @class, 'metadata'))][1]
EOD;
$idNodeList = $xp->query($labelQuery);
$links = [];
// For each div node, a new link node is created in the associative array $links.
// The keys are labels.
foreach($idNodeList as $divNode) {
// The pattern extract the first text part in group 1 and the label in group 2
if (preg_match('~(\S+) .*? \[label\* ([^]]+) ]~x', $divNode->textContent, $m)) {
$links[$m[2]] = $dom->createElement('a');
$links[$m[2]]->setAttribute('href', $divNode->getAttribute('id'));
$links[$m[2]]->setAttribute('class', $classRel[$divNode->getAttribute('class')]);
$links[$m[2]]->nodeValue = $m[1];
}
}
if ($links) { // if $links is empty no need to do anything
$refNodeList = $xp->query("//text()[contains(., '[ref*')]");
foreach ($refNodeList as $refNode) {
// split the text with square brackets parts, the reference name is preserved in a capture
$parts = preg_split('~\[ref\*([^]]+)]~', $refNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
// create a fragment to receive text parts and links
$frag = $dom->createDocumentFragment();
foreach ($parts as $k=>$part) {
if ($k%2 && isset($links[$part])) { // delimiters are always odd items
$clone = $links[$part]->cloneNode(true);
$frag->appendChild($clone);
} elseif ($part !== '') {
$frag->appendChild($dom->createTextNode($part));
}
}
$refNode->parentNode->replaceChild($frag, $refNode);
}
}
$result = '';
$childNodes = $dom->getElementsByTagName('body')->item(0)->childNodes;
foreach ($childNodes as $childNode) {
$result .= $dom->saveXML($childNode);
}
echo $result;
This is not a task for regular expressions. Regular expressions are (usually) meant for regular languages, and what you want to do requires working on a context-sensitive language (referencing an identifier that has been declared before).
So you should definitely go with a DOM parser. The algorithm would be very easy, because you can operate on one node and its children.
So the theoretical answer to your question is: you can't. It might still work out in some crude way with the many regex extensions.
Related
I have two strings I'm outputting to a page
# string 1
<p>paragraph1</p>
# string 2
<p>paragraph1</p>
<p>paragraph2</p>
<p>paragraph3</p>
What I'd like to do is turn them into this
# string 1
<p class="first last">paragraph1</p>
# string 2
<p class="first">paragraph1</p>
<p>paragraph2</p>
<p class="last">paragraph3</p>
I'm essentially trying to replicate the css equivalent of first-child and last-child, but I have to physically add them to the tags as I cannot use CSS. The strings are part of a MPDF document and nth-child is not supported on <p> tags.
I can iterate through the strings easy enough to split the <p> tags into an array
$dom = new DOMDocument();
$question_paragraphs = array();
$dom->loadHTML($q['quiz_question']);
foreach($dom->getElementsByTagName('p') as $node)
{
$question_paragraphs[] = $dom->saveHTML($node);
}
But once I have that array I'm struggling to find a nice clean way to append and prepend the first and last class to either end of the array. I end up with lots of ugly loops and array splicing that feels very messy.
I'm wondering if anyone has any slick ways to do this? Thank you :)
Edit Note: The two strings are outputting within a while(array) loop as they're stored in a database.
You can index the node list with the item() method, so you can add the attribute to the first and last elements in the list.
$dom = new DOMDocument();
$question_paragraphs = array();
$dom->loadHTML($q['quiz_question']);
$par = $dom->getElementsByTagName('p');
if ($par->length == 1) {
$par->item(0)->setAttribute("class", "first last");
} elseif ($par->length > 1) {
$par->item(0)->setAttribute("class", "first");
$par->item($par->length - 1)->setAttribute("class", "last");
}
foreach($par as $node)
{
$question_paragraphs[] = $dom->saveHTML($node);
}
I did the following which works with simple text fields:
$field = "How are you doing?";
$arr = explode(' ',trim($field));
$first_word = $arr[0];
$balance = strstr("$field"," ");
It didn't work because the field contains HTML markup (perhaps an image, video, div, paragraph, etc.), so the tags and the text inside them got mixed in with the words.
I could possibly use strip_tags to strip out the html then obtain first word and reformat it, but then I would have to figure out how to add the html back into the data. I'm wondering if there is a php or custom function ready made for this purpose.
You can use DOMDocument to parse the HTML, modify the contents, and save it back as HTML. Also, finding words is not always as simple as splitting on spaces, since not all languages delimit their words with spaces and not all words are necessarily delimited by spaces. For example, mother-in-law could be viewed as one word or as three, depending on how you define a word. Likewise, do you consider pancake one word or two (pan and cake)? One simple solution is to use the word iterator returned by IntlBreakIterator::createWordInstance, which implements the Unicode standard for text segmentation, a.k.a. UAX #29.
Here's an example of how you might go about implementing this:
$html = <<<'HTML'
<div>some sample text here</div>
HTML;
/* Let's extend DOMDocument to include a walk method that can traverse the entire DOM tree */
class MyDOMDocument extends DOMDocument {
public function walk(DOMNode $node, $skipParent = false) {
if (!$skipParent) {
yield $node;
}
if ($node->hasChildNodes()) {
foreach ($node->childNodes as $n) {
yield from $this->walk($n);
}
}
}
}
$dom = new MyDOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
// Let's traverse the DOMTree to find the first text node
foreach ($dom->walk($dom->childNodes->item(0)) as $node) {
if ($node->nodeName === "#text") {
break;
}
}
// Extract the first word from that text node
$iterator = IntlBreakIterator::createWordInstance();
$iterator->setText($node->nodeValue); // set the text in the word iterator
$it = $iterator->getPartsIterator(IntlPartsIterator::KEY_RIGHT);
foreach ($it as $offset => $word) {
break;
}
// You can do whatever you want to $word here
$word .= "s"; // I'm going to append the letter s
// Write the modification back into the text node
$unmodifiedString = substr($node->nodeValue, $offset);
$modifiedString = $word . $unmodifiedString;
// Assigning nodeValue updates the node in place, so it does not need
// to be re-inserted into the DOM tree
$node->nodeValue = $modifiedString;
// Save the HTML
$newHTML = $dom->saveHTML();
echo $newHTML;
Outputs
<div>somes sample text here</div>
I can't find out how to solve this
<div>
<p id="p1"> Price is <span>$ 25</span></p>
<p id='p2'> But this price is $ <span id="s1">50,23</span> </p>
<p id='p3'> This one : $ 14540.12 dollar</p>
</div>
What I'm trying to do is find each element with a price in it and the shortest path to it.
This is what I have so far.
$elements = $dom->getElementsByTagName('*');
foreach($elements as $child)
{
if (preg_match("/.$regex./",$child->nodeValue)){
echo $child->getNodePath(). "<br />";
}
}
This results in
/html
/html/body
/html/body/div
/html/body/div/p[1]
/html/body/div/p[1]/span
/html/body/div/p[2]
/html/body/div/p[2]/span
/html/body/div/p[3]
These are the paths to the elements I want, so that's OK for this test HTML. But in real web pages these paths get very long and are error-prone.
What I'd like to do is find the closest element with an ID attribute and refer to that.
So once I have found an element that matches the $regex, I need to travel up the DOM, find the first element with an ID attribute, and create the new, shorter path from that.
In the HTML example above, there are 3 prices matching the $regex. The prices are in:
//p[@id="p1"]/span
//span[@id="s1"]
//p[@id="p3"]
So that is what I'd like to have returned from my function. This means I also need to get rid of all the other paths, because they don't contain $regex.
Any help on this?
You could use XPath to follow the ancestor path to the first node with an @id attribute and then cut off the part of the path that leads to it. I did not clean up the code, but something like this:
// snip
$xpath = new DomXPath($doc);
foreach($elements as $child)
{
$textValue = '';
foreach ($xpath->query('text()', $child) as $text)
$textValue .= $text->nodeValue;
if (preg_match("/.$regex./", $textValue)) {
$path = $child->getNodePath();
$id = $xpath->query('ancestor-or-self::*[@id][1]', $child)->item(0);
$idpath = '';
if ($id) {
$idpath = $id->getNodePath();
$path = '//'.$id->nodeName.'[@id="'.$id->attributes->getNamedItem('id')->value.'"]'.substr($path, strlen($idpath));
}
echo $path."\n";
}
}
Printing something like
/html
/html/body
/html/body/div
//p[@id="p1"]
//p[@id="p1"]/span
//p[@id="p2"]
//span[@id="s1"]
//p[@id="p3"]
I tried all sorts of things but couldn't find a solution.
I want to retrieve elements from html code using xpath in php.
Ex:
<div class='student'>
<div class='name'>Michael</div>
<div class='age'>26</div>
</div>
<div class='student'>
<div class='name'>Joseph</div>
<div class='age'>27</div>
</div>
I want to retrieve the information and put them in an array as follows:
$student[0][name] = Michael;
$student[0][age] = 26;
$student[1][name] = Joseph;
$student[1][age] = 27;
In other words i want the matching ages to stay with the names.
I tried the following:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpathDom = new DomXPath($dom);
$homepostcontentNodes = $xpathDom->query("//*[contains(@class, 'student')]//*[contains(@class, 'name')]");
However, this is only grabbing me the nodes 'names'
How can i get the matching age nodes?
Of course it is only grabbing the name nodes - you are telling it to!
What you will need to do is in two steps:
Pick out all the student nodes
For each student node, pick out the columns
This is a pretty standard step in linearization of data, and the XPath queries are simple:
Step 1
You pretty much have it:
$studentNodes = $xpathDom->query("//div[contains(@class, 'student')]");
This will return all your student nodes.
Step 2
This is where the magic happens. We have our nodes, and we can loop through them (DOMNodeList is Traversable, so we can foreach-loop through it). What we need to figure out is how to find each node's children...
...Oh wait. DOMNode has a method called getNodePath which returns the full, direct XPath path to the node. This allows us to simply append /div to get all the div children of the node!
Another quick foreach, and we get this code:
$studentNodes = $xpathDom->query("//div[contains(@class, 'student')]");
$result = array();
foreach ($studentNodes as $v) {
// Child nodes: student
$r = array();
$columns = $xpathDom->query($v->getNodePath()."/div");
foreach ($columns as $v2) {
// getAttribute() reads the 'class' attribute of the column node
$r[$v2->getAttribute("class")] = $v2->textContent;
}
$result[] = $r;
}
var_dump($result);
Full fiddle: http://codepad.viper-7.com/t868Wh
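As a variation on the loop above, DOMXPath::query also accepts a context node as its second argument, so the inner query can be written relative to each student node instead of being built from getNodePath. The following is only a sketch under that assumption; the sample markup is copied from the question, and the variable names $students, $studentNode, $row and $column are illustrative.

```php
<?php
// Sample markup from the question.
$html = <<<'HTML'
<div class='student'>
<div class='name'>Michael</div>
<div class='age'>26</div>
</div>
<div class='student'>
<div class='name'>Joseph</div>
<div class='age'>27</div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpathDom = new DOMXPath($dom);

$students = array();
foreach ($xpathDom->query("//div[contains(@class, 'student')]") as $studentNode) {
    $row = array();
    // The second argument makes "div" relative to the current student node,
    // so only its direct div children are returned.
    foreach ($xpathDom->query("div", $studentNode) as $column) {
        $row[$column->getAttribute("class")] = trim($column->textContent);
    }
    $students[] = $row;
}

print_r($students);
```

Keeping the inner query relative means each name stays paired with the age from the same student block.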
I am using Curl, XPath and PHP in order to scrape product names and prices from HTML source code. Here is a sample similar to the source code I am examining:
<div class="Gamesdb">
<p class="media-title">
Bluetooth Headset
</p>
<p class="sub-title"> Console </p>
<p class="rating star-50">
(1)
</p>
<p class="mt5">
<span class="price-preffix">
1 New
from
</span>
<a class="wt-link" href="/Games/Console/4-/105/Bluetooth-Headset/">
<span class="price">
<em>£34</em>
.99
</span>
<span class="free-delivery"> FREE delivery</span>
</a>
</p>
<p class="mt10">
<a class="primary button" href="/Games/Console/4-/105/Bluetooth-Headset/">
Product Details
<span style="color: rgb(255, 255, 255); margin-left: 6px; font-size: 16px;">»</span>
</a>
</p>
</div>
I want to extract the media title i.e:
<p class="media-title">
Bluetooth Headset
</p>
Only when the following price class is also present:
<span class="price">
<em>£34</em>
.99
</span>
Many of the other products listed don't include it.
I need to extract both the product name and price or nothing at all and move on to the next product.
Here is a sample of the code i am currently using which is effective at getting all the results regardless of any other conditions:
$results=file_get_contents('SCRAPEDHTML.txt');
$html = new DOMDocument();
@$html->loadHtml($results);
$xpath = new DOMXPath($html);
$nodelist = $xpath->query('//p[@class="media-title"]|//span[@class="price"]');
foreach ($nodelist as $n){
$results2[]=$n->nodeValue;
}
I believe this is possible using the correct xpath query but have so far been unable to achieve it. Many thanks in advance.
I am assuming there is only one "item" per div.Gamesdb. If not, there may not be enough structure in the source HTML to use XPath alone. You will probably have to index product names and look for prices near matching product names.
You can do this with a single giant XPath, but I recommend you use multiple XPaths. I'll show both ways.
First create your DOMXPath and register a helper to match class names.
// This helper is the equivalent to the XPath:
// contains(concat(' ',normalize-space(@attr),' '), ' $token ')
// It's not necessary, but it's a bit easier to read and more
// bulletproof than @ATTR="TOKEN"
function has_token($attr, $token)
{
    // An element without the attribute yields an empty node list
    if (empty($attr)) {
        return false;
    }
    $regex = '/(?:^|\s)'.preg_quote($token,'/').'(?:\s|$)/Su';
    return (bool) preg_match($regex, $attr[0]->value);
}
$xp = new DOMXPath($d);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions("has_token");
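For comparison, the plain XPath 1.0 expression mentioned in the comment above can be used without registering any PHP function. The sketch below assumes a hypothetical helper named class_token and a made-up two-div sample; neither is part of the original answer.

```php
<?php
// Made-up sample: only the first div carries the exact class token "Gamesdb".
$html = '<div class="Gamesdb featured"><p class="media-title">A</p></div>'
      . '<div class="Gamesdb-old"><p class="media-title">B</p></div>';

$d = new DOMDocument();
$d->loadHTML($html);
$xp = new DOMXPath($d);

// Hypothetical helper: builds the plain XPath 1.0 token test for @class.
// Assumes $token contains no single quote.
function class_token($token)
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' {$token} ')";
}

// Matches the whole token only, so "Gamesdb-old" is not selected.
$nodes = $xp->query('//div[' . class_token('Gamesdb') . ']/p');
foreach ($nodes as $node) {
    echo $node->textContent, "\n";
}
```

The padded-space trick gives the same whole-token matching as the registered function, at the cost of a longer expression.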
You can then use a giant XPath:
$xp_container = '/html/body//div[php:function("has_token", @class, "Gamesdb")]';
$xp_title = 'p[php:function("has_token", @class, "media-title")]';
// The leading "." keeps the price lookup relative to the container div
$xp_price = './/span[php:function("has_token", @class, "price")]';
$xp_both = "{$xp_container}[{$xp_title}][{$xp_price}]";
$xp_titles_prices = "{$xp_both}/{$xp_title} | {$xp_both}/{$xp_price}";
$nodes = $xp->query($xp_titles_prices);
$items = array();
$i = 0; // enumerator
foreach ($nodes as $node) {
$key = ($node->nodeName==='p') ? 'title' : 'price';
$value = '';
switch ($key) {
case 'price':
// remove inner whitespace
$value = preg_replace('/\s+/Su', '', trim($node->textContent));
break;
case 'title':
$value = preg_replace('/\s+/Su', ' ', trim($node->textContent));
break;
}
$items[(int) floor($i/2)][$key] = $value;
$i += 1;
}
However, the overall code is brittle and unclear. The XPath union operator (|) returns nodes in document order, so we can't bisect the list. The PHP code must walk through every item in the node list and use the DOM to determine which field each node corresponds to. Think about the changes you would have to make if you wanted to extend the code to collect a third item (e.g., rating). Now imagine making those changes three months from now, when this code is no longer fresh in your mind.
I recommend you use multiple XPath calls instead and do the "do we have data for both price and title" check in PHP rather than XPath:
$xpitems = '/html/body//div[php:function("has_token", @class, "Gamesdb")]';
// below use $xpitems context:
$xptitle = 'normalize-space(p[php:function("has_token", @class, "media-title")])';
// The leading "." restricts the search to the current item instead of the whole document
$xpprice = 'normalize-space(.//span[php:function("has_token", @class, "price")])';
$nodeitems = $xp->query($xpitems);
$items = array();
foreach ($nodeitems as $nodeitem) {
$item = array(
'title' => $xp->evaluate($xptitle, $nodeitem),
'price' => str_replace(' ', '', $xp->evaluate($xpprice, $nodeitem)),
);
// Only add this item if we have data for *all* fields:
if (count(array_filter($item)) === count($item)) {
$items[] = $item;
}
}
This is much easier to read and understand, and much easier to extend in the future.
You cannot have a single XPath that returns both the name of the product and its price and nothing else. My suggestion would be to first get all the div nodes that contain both pieces of information:
//div[p[@class='media-title'] and .//span[@class='price']]
('all div nodes that have a p child node with class media-title and a span descendant node with class price'); then loop over the returned nodes and extract the product name and price using two other XPath expressions, evaluated relative to each div:
p[@class='media-title']
and
.//span[@class='price']
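The two steps described above might be sketched as follows. This is a minimal illustration, not a tested scraper: the sample markup is a cut-down, made-up version of the question's HTML (the currency symbol is omitted to keep the sample ASCII-only), and the names $products, $containers and $div are assumptions.

```php
<?php
// Cut-down sample: the second product has no price and should be skipped.
$html = <<<'HTML'
<div class="Gamesdb">
<p class="media-title"> Bluetooth Headset </p>
<p class="mt5"><a class="wt-link" href="#"><span class="price"><em>34</em>.99</span></a></p>
</div>
<div class="Gamesdb">
<p class="media-title"> No Price Item </p>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$products = array();
// Step 1: containers that have a title child and a price descendant.
$containers = $xpath->query("//div[p[@class='media-title'] and .//span[@class='price']]");
foreach ($containers as $div) {
    // Step 2: relative expressions evaluated against each container.
    $products[] = array(
        'title' => $xpath->evaluate("normalize-space(p[@class='media-title'])", $div),
        'price' => $xpath->evaluate("normalize-space(.//span[@class='price'])", $div),
    );
}

print_r($products);
```

Because the filtering happens in the first query, every entry in $products has both a title and a price.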