select children of the first element of a certain class using XPath - php

I have this type of code:
<div class="content">
<p></p>
<p></p>
<p></p>
</div>
<div class="content">
<p></p>
<p></p>
<p></p>
</div>
I wish to select all p elements from the first element with the class content.
I managed to select the first element with that class by using:
(//div[@class="content"])[1]
but using (//div[@class="content"])[1]/p it still shows the p elements of both divs.

Here's a working example using PHP's SimpleXML. I've made some small changes to the HTML code you provided so the output would be more meaningful.
Regarding the XPath expression you provided, I initially just removed the parentheses and it all appeared to work as expected.
NOTE: Following @LarsH's comment, I reverted the XPath expression, as the original was correct to begin with. I took the liberty of updating the example based on the one in that comment.
<?php
$html = <<<HTML
<body>
<div class="content">
<p>1</p>
<p>2</p>
<p>3</p>
</div>
<div class="content">
<p>4</p>
<p>5</p>
<p>6</p>
</div>
<div>
<div class="content">
<p>7</p>
<p>8</p>
<p>9</p>
</div>
</div>
</body>
HTML;
$sxe = new SimpleXMLElement($html);
foreach ($sxe->xpath('(//div[@class="content"])[1]/p') as $p) {
echo "$p\n";
}
Output:
1
2
3
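The parentheses matter here, and a short sketch may make the difference visible. It uses PHP's bundled DOMDocument and DOMXPath instead of SimpleXML (an assumption of this sketch, not what the answer uses), and the helper name texts() is hypothetical. Without parentheses, [1] is evaluated per parent, so the nested div is selected as well.

```php
<?php
// Same structure as the example above: two top-level divs and one nested div.
$html = <<<HTML
<body>
  <div class="content"><p>1</p><p>2</p><p>3</p></div>
  <div class="content"><p>4</p><p>5</p><p>6</p></div>
  <div><div class="content"><p>7</p><p>8</p><p>9</p></div></div>
</body>
HTML;

$doc = new DOMDocument();
$doc->loadXML($html);          // the sample is well-formed XML
$xpath = new DOMXPath($doc);

// Hypothetical helper: collect the text of every node matched by $expr.
function texts(DOMXPath $xpath, string $expr): array
{
    $out = [];
    foreach ($xpath->query($expr) as $node) {
        $out[] = $node->textContent;
    }
    return $out;
}

// First matching div in the whole document, then its p children.
$first = texts($xpath, '(//div[@class="content"])[1]/p');

// Every div that is the first matching child of its own parent.
$perParent = texts($xpath, '//div[@class="content"][1]/p');

echo implode(',', $first), "\n";      // 1,2,3
echo implode(',', $perParent), "\n";  // 1,2,3,7,8,9
```

If both expressions return the same nodes on your page, that only means no matching div is nested as the first match inside another parent.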

Related

How to get content text of div by simple html dom - php

I get the HTML below with Simple HTML DOM (file_get_html('http://example.com')):
<div id="ship" class="fe" data-feature-name="box" data-cel-widget="sox">
<div class="a-medium b-di">
<div id="mer-info" class="a-section a-spacing-mini">
Hello World
<span class="">
</span>
</div>
</div>
</div>
How can I get the 'Hello World' text content?
I tried a lot of things, for example the calls below, but they gave me NULL:
$html->find('div[id="mer-info"]',0);
$html->find("div#mer-info");
$html->find("div#mer-info")->plaintext;
$html->find('div[id="mer-info"]')->innertext;
and so on.
But I still get NULL!
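Only your first attempt passes the second argument (0) to find(), and it uses div[id="mer-info"] as the selector, which does not seem to be recognized by find(). Without that index, find() returns an array of elements rather than a single element, so reading ->plaintext or ->innertext from it gives NULL. Try the following: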
require 'simple_html_dom.php';
$html =<<<html
<div id="ship" class="fe" data-feature-name="box" data-cel-widget="sox">
<div class="a-medium b-di">
<div id="mer-info" class="a-section a-spacing-mini">
Hello World
<span class="">
</span>
</div>
</div>
</div>
html;
$dom = str_get_html($html);
$elem = $dom->find('#mer-info', 0);
print $elem->plaintext;
print "\n";
$elem = $dom->find('div#mer-info', 0);
print $elem->plaintext;
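If the library itself is not a requirement, the same text can be read with PHP's bundled DOM extension. This is a swapped-in alternative, not what the answer above uses, and it is only a minimal sketch that assumes the markup shown in the question.

```php
<?php
$html = <<<HTML
<div id="ship" class="fe" data-feature-name="box" data-cel-widget="sox">
<div class="a-medium b-di">
<div id="mer-info" class="a-section a-spacing-mini">
Hello World
<span class="">
</span>
</div>
</div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="mer-info"]');

// query() returns a node list; check its length before touching item(0).
$text = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
echo $text, "\n";
```

The length check plays the same role as the index argument of find(): the code works with one element, not with a list.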

php search and replace <h2> to <h1> in my view source [duplicate]

I have the following html
<!-- START: .paragraph-content -->
<div class="paragraph-content">
<div class="container"><div class="row"><div class="col-sm-10">
<!-- START: .paragraph-columns -->
<div class="paragraph-columns">
<div class="field-wysiwyg">
<div data-quickedit-field-id="paragraph/167/field_mt_body/en/default" class="field field--name-field-mt-body field--type-text-long field--label-hidden field__items">
<div class="field__item">
<h2> </h2>
<h2> </h2>
<h2>INNOVATION.</h2>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
</div>
</div>
</div>
</div>
<!-- END: .paragraph-columns -->
</div></div></div>
</div>
<!-- END: .paragraph-content -->
I want to locate the block of HTML that begins with <div class="paragraph-content">.
Within that block, I want to change the <h2> to <h1>,
so that the end result looks like this:
<!-- START: .paragraph-content -->
<div class="paragraph-content">
<div class="container"><div class="row"><div class="col-sm-10">
<!-- START: .paragraph-columns -->
<div class="paragraph-columns">
<div class="field-wysiwyg">
<div data-quickedit-field-id="paragraph/167/field_mt_body/en/default" class="field field--name-field-mt-body field--type-text-long field--label-hidden field__items">
<div class="field__item">
<h2> </h2>
<h2> </h2>
<h1>INNOVATION.</h1>
<p> </p>
<p> </p>
<p> </p>
<p> </p>
</div>
</div>
</div>
</div>
<!-- END: .paragraph-columns -->
</div></div></div>
</div>
<!-- END: .paragraph-content -->
I have tried it with this regex pattern but nothing works:
'/(?:<h2((?!\s").*?)?>)(.*?)(?:<\/h2>)/si'
If you load the HTML page into a string variable, for example with:
$fileStr = file_get_contents('HTML_FILE.htm');
You can then find the start of the section you are after by using the text "<!-- START: .paragraph-content -->" and the end of the section of the string by using the text "<!-- END: .paragraph-content -->".
Having the start and end of the section, we can extract the portion of $fileStr against which we want to run our regular expression.
The regular expression required to find the string you want to change is:
<h2>.{2,}<\/h2>
The issue is that you have to replace the <h2> and </h2> with <h1> and </h1> whilst retaining everything in between them.
That isn't going to have a simple, neat solution. I would write a loop that looks for <h2>, checks whether there are any alphanumerics between it and the closing </h2>, and, if there are, extracts the contents between the two and replaces the tags appropriately.
Whilst not providing you with code to cut and paste, I hope I've given you something to ponder.
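The steps described above could be sketched roughly as follows. The sample string stands in for the file contents, and the pattern is an assumption of this sketch: it only matches an <h2> without attributes whose content has no nested tags and at least one non-whitespace character.

```php
<?php
// Stand-in for file_get_contents('HTML_FILE.htm').
$fileStr = <<<HTML
<h2>Outside</h2>
<!-- START: .paragraph-content -->
<div class="paragraph-content">
<h2> </h2>
<h2>INNOVATION.</h2>
</div>
<!-- END: .paragraph-content -->
HTML;

$startMarker = '<!-- START: .paragraph-content -->';
$endMarker   = '<!-- END: .paragraph-content -->';

$start = strpos($fileStr, $startMarker);
$end   = strpos($fileStr, $endMarker, $start === false ? 0 : $start);

if ($start !== false && $end !== false) {
    // Cut out the section, including both markers.
    $length  = $end + strlen($endMarker) - $start;
    $section = substr($fileStr, $start, $length);

    // Only headings with visible, tag-free content are rewritten.
    $section = preg_replace('~<h2>([^<]*[^<\s][^<]*)</h2>~', '<h1>$1</h1>', $section);

    // Put the rewritten section back in place.
    $fileStr = substr_replace($fileStr, $section, $start, $length);
}

echo $fileStr;
```

The whitespace-only heading and the heading outside the markers are left as <h2>; headings that contain markup such as <strong> would need a real parser.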
A regex works as a finite state machine, so it has no way to parse recursive structures such as XML tags that may contain other XML tags.
Basically, you can't reliably match the closing tag that belongs to a given opening tag, because that requires recursion, which is not possible in a finite state machine (the Python regex module and some other implementations do support recursion, but those are not true regular expressions).
For your exact problem you need a full top-down recursive parser or a tool that works with XML/HTML specifically.
That said, just replacing every h2 tag with h1 in the whole string is as simple as substituting <(/?)h2> with <$1h1>.
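A tool that works with HTML directly could look like the sketch below, which uses PHP's bundled DOMDocument and DOMXPath. The shortened sample markup and the rule "only headings with non-whitespace text are renamed" are assumptions taken from the desired result in the question.

```php
<?php
$html = <<<HTML
<div class="paragraph-content">
  <div class="field__item">
    <h2></h2>
    <h2>INNOVATION.</h2>
    <p></p>
  </div>
</div>
<div class="other"><h2>Untouched</h2></div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// h2 elements inside the target div whose text is not empty or whitespace.
$query = '//div[@class="paragraph-content"]//h2[normalize-space()]';

// Copy the matches into an array before changing the tree.
foreach (iterator_to_array($xpath->query($query)) as $h2) {
    $h1 = $doc->createElement('h1');
    foreach ($h2->attributes as $attr) {
        $h1->setAttribute($attr->nodeName, $attr->nodeValue);
    }
    while ($h2->firstChild) {
        $h1->appendChild($h2->firstChild);   // moves the child, keeping nested markup
    }
    $h2->parentNode->replaceChild($h1, $h2);
}

// Serialize the children of <body>, since loadHTML() adds html/body wrappers.
$result = '';
foreach ($doc->getElementsByTagName('body')->item(0)->childNodes as $child) {
    $result .= $doc->saveHTML($child);
}
echo $result;
```

Because the element is rebuilt node by node, nested tags inside the heading survive, which is the case a plain regex substitution struggles with.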

How to access nested elements with a repeated class using XPath

I need to access a nested element with a repeated class, like this:
<div class="container">
<div class="first"></div>
<div class="first"></div>
<div class="first">
<div class="second"></div>
<div class="second">
<p>I need that text</p>
</div>
</div>
</div>
So I tried something like this:
$localizacao_x = $xpath_det_page->query('//div[@class="container"]/div[@class="first"][3]/div[@class="second"][2]/p');
$localizacao = $localizacao_x->item(0)->nodeValue;
echo "[Localizacao] : [".$localizacao."]"."<br/>";
But the result is a non-object. Any tips?
Your XPath seems to be correct. I tested
//div[@class="container"]/div[@class="first"][3]/div[@class="second"][2]/p
and the result is
I need that text
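Since the expression itself matches, a non-object error most likely means the query returned no nodes on the real page: item(0) is then null, and reading nodeValue from null fails. A minimal sketch with a guard, assuming the markup from the question and PHP's DOMDocument as the loader:

```php
<?php
$html = <<<HTML
<div class="container">
  <div class="first"></div>
  <div class="first"></div>
  <div class="first">
    <div class="second"></div>
    <div class="second"><p>I need that text</p></div>
  </div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath_det_page = new DOMXPath($doc);
$localizacao_x = $xpath_det_page->query(
    '//div[@class="container"]/div[@class="first"][3]/div[@class="second"][2]/p'
);

// Guard: only dereference item(0) when the query actually matched something.
$localizacao = ($localizacao_x !== false && $localizacao_x->length > 0)
    ? trim($localizacao_x->item(0)->nodeValue)
    : null;

echo "[Localizacao] : [" . ($localizacao ?? 'not found') . "]\n";
```

If this prints "not found" against the live page, it is worth checking whether the class attributes there hold more than one class name, because @class="first" only matches the exact attribute value.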

Xpath Exclude p.class of a div

This is my HTML example:
<div id="Texte">
<div class="pagination">
...
</div>
<p>...</p>
<p>....</p>
<p class="Foot">...</p>
</div>
I want to use XPath to get all the content of my <div id="Texte"> except the <p class="Foot">.
I use this, but it's not right; the class='Foot' element is still in my result:
$crawler->filterXPath("//*[@id='Texte' and not(@class='Foot')]")->html();
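Almost.
// correct: the not(...) test is applied to the children of #Texte
$crawler->filterXPath("//*[@id='Texte']/*[not(@class='Foot')]")->html();
// yours, for comparison: the not(...) test is applied to #Texte itself
$crawler->filterXPath("//*[@id='Texte' and not(@class='Foot')]")->html();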
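To see what the corrected expression selects, here is a small sketch with PHP's bundled DOM classes in place of the Symfony crawler (a swapped-in tool, with placeholder text in the sample markup). It concatenates the HTML of every matched child; as far as I know, the crawler's html() reports only the first matched node, so collecting all of them may need a loop there too.

```php
<?php
$html = <<<HTML
<div id="Texte">
  <div class="pagination">pages</div>
  <p>first</p>
  <p>second</p>
  <p class="Foot">footer</p>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Children of #Texte whose class attribute is not exactly "Foot".
$kept = '';
foreach ($xpath->query("//*[@id='Texte']/*[not(@class='Foot')]") as $child) {
    $kept .= $doc->saveHTML($child);
}

echo $kept, "\n";
```

Text that sits directly inside #Texte, outside any child element, is not matched by /*; selecting node() instead of * would include it.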

How to use a PHP regular expression to get the 2nd and 3rd `<div class="partright">`?

I have some text like the following. How can I use a PHP regular expression to get the 2nd and 3rd <div class="partright">? Thanks.
<div class="wrap">
<div class="content>
<div class="partleft">
text1
</div>
<div class="partright">
text2
</div>
</div>
<div class="content>
<div class="partleft">
text3
</div>
<div class="partright">
text4
</div>
</div>
<div class="content>
<div class="partleft">
text5
</div>
<div class="partright">
text6
</div>
</div>
</div>
I want output
<div class="partright">
text4
</div>
<div class="partright">
text6
</div>
Your question is very incomplete, but I assume you're talking about traversing the elements to modify them in some way.
You should look at the following library called SimpleDOM
And usage would be like:
require_once 'simple_dom.class.php';
$html = "<html_data_here>";
$html = str_get_html($html);
// find() with an index returns a single element (0-based),
// so 1 and 2 are the 2nd and 3rd .partright matches
foreach ([1, 2] as $i)
{
echo $html->find('.partright', $i)->outertext;
}
Note: The above is an example and may not work as expected, for working examples please see the Simple Dom site linked above.
You can't parse [X]HTML with regex; see "RegEx match open tags except XHTML self-contained tags". Use SimpleXML instead.
You could build a regular expression that matches each block (non-greedy, and with the s modifier so that . also matches newlines):
<div class="partright">(.*?)</div>
Put all the matches in an array and take the second and third elements from it.
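The regex approach just described could be sketched like this. It is only reasonable under the assumption that a partright div never contains another div, since the pattern stops at the first </div> it meets; the sample markup is a shortened version of the one in the question.

```php
<?php
$html = <<<HTML
<div class="wrap">
<div class="content">
<div class="partleft">text1</div>
<div class="partright">
text2
</div>
</div>
<div class="content">
<div class="partleft">text3</div>
<div class="partright">
text4
</div>
</div>
<div class="content">
<div class="partleft">text5</div>
<div class="partright">
text6
</div>
</div>
</div>
HTML;

// Non-greedy body; the s modifier lets . match the newlines inside the div.
preg_match_all('~<div class="partright">(.*?)</div>~s', $html, $matches);

// $matches[0] holds the full matches; offset 1, length 2 gives the 2nd and 3rd.
$wanted = array_slice($matches[0], 1, 2);

echo implode("\n", $wanted), "\n";
```

For markup where the divs can nest, an XPath expression such as (//div[@class="partright"])[position() = 2 or position() = 3] with DOMXPath avoids the first-closing-tag problem.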
