I am trying to extract data from a website using PHP with the help of simplehtmldom.
The problem is that the text I need has no parent element of its own:
<div class="div_1">First div</div> The text i need to grab <div class="div_2">Second div</div>
In the example above, I need to extract "The text i need to grab".
If simplehtmldom is not required, try a regular expression:
<div[^<>]*class="div_1"[^<>]*>.*?</div>(.*?)<div[^<>]*class="div_2"[^<>]*>.*?</div>
http://regexr.com/3bd5f
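As a rough sketch of how that pattern could be applied in PHP, the snippet below runs it through preg_match against the sample markup from the question. The `~` delimiter and the `s` modifier are my own choices (so the slashes in `</div>` need no escaping and `.` can span line breaks), and the variable names are only illustrative.

```php
<?php
// Sample markup copied from the question.
$html = '<div class="div_1">First div</div> The text i need to grab <div class="div_2">Second div</div>';

// The pattern from the answer, wrapped in "~" delimiters (an assumption,
// chosen so that "/" does not need escaping). Group 1 captures whatever
// sits between the closing tag of div_1 and the opening tag of div_2.
$pattern = '~<div[^<>]*class="div_1"[^<>]*>.*?</div>(.*?)<div[^<>]*class="div_2"[^<>]*>.*?</div>~s';

$text = '';
if (preg_match($pattern, $html, $matches) === 1) {
    // Strip the surrounding whitespace left over from the markup.
    $text = trim($matches[1]);
}

echo $text, PHP_EOL;
```

This only holds while div_1 and div_2 contain no nested div elements; for anything less regular, a DOM parser is the safer route.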
Related
I have this HTML:
<span id="bla">text</span>more text
I want to get text and more text.
I have this XPath:
//span[@id="bla"]/text()
I can't figure out how to get the closing tag and what comes after it.
The more text is called a "tail" of an element and can be retrieved via following-sibling:
//span[@id="bla"]/following-sibling::text()
<span id="bla">text</span>more text alone is not well-formed and cannot be processed via XPath.
Let's put it in context:
<div><span id="bla">text</span>more text</div>
Then, you can simply take the string value of the parent element, div:
string(/div)
to get
textmore text
as requested.
If there's other surrounding content that you don't want:
<div>DO NOT WANT<span id="bla">text</span>more text<b/>DO NOT WANT</div>
You can follow @alecxe's lead with the following-sibling:: axis and use concat() to combine the parts you want:
concat(//span[@id="bla"], //span[@id="bla"]/following-sibling::text()[1])
to again get
textmore text
as requested.
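For anyone wanting to try the concat() expression from PHP, here is a minimal sketch using the built-in DOMDocument and DOMXPath classes. It assumes the markup is well-formed XML, as in the example above; DOMXPath::evaluate() is used rather than query() because the expression yields a string, not a node list.

```php
<?php
// The example document from the answer, including the unwanted content.
$xml = '<div>DO NOT WANT<span id="bla">text</span>more text<b/>DO NOT WANT</div>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);

// concat() joins the span's string value with the first text node that
// follows it, which stops before the <b/> element.
$result = $xpath->evaluate(
    'concat(//span[@id="bla"], //span[@id="bla"]/following-sibling::text()[1])'
);

echo $result, PHP_EOL;
```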
Preface: I cannot rename the source tags or edit their IDs. Any changes to the tags must happen after they have been fetched.
What I'm doing: using file_get_contents in PHP, I am requesting data from a remote site. This data is just two <p> tags. I need to hide or rename the second of the two <p> tags.
Is this possible with PHP or jQuery?
What I'm working with:
<p>Hello my name is test</p><p>I like studying geology.</p>
If you need to hide the second paragraph, you can do this with jQuery:
$('p:eq(1)').hide();
You could try a PHP string replace, which merges the two paragraphs into one:
$new_string = str_replace('</p><p>','',file_get_contents('somecontent'));
If you need to do it before the HTML is rendered, parse the contents, remove or replace the second p tag, and build the new content from the result.
A DOM parser such as Simple HTML DOM Parser can do this.
Find similar questions below
How to match second <a> tag in this string
How to add attribute to first P tag using PHP regular expression?
Or you can do it after rendering HTML as rNix suggested.
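To illustrate the parse-and-remove approach on the server side, here is a minimal sketch that uses PHP's built-in DOMDocument instead of Simple HTML DOM Parser. The $fetched string is a stand-in for whatever file_get_contents returns, and the variable names are only illustrative.

```php
<?php
// Stand-in for the markup returned by file_get_contents().
$fetched = '<p>Hello my name is test</p><p>I like studying geology.</p>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($fetched);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Pick the second <p> in document order and detach it from the tree.
$second = $xpath->query('(//p)[2]')->item(0);
if ($second !== null) {
    $second->parentNode->removeChild($second);
}

// loadHTML() wraps fragments in <html><body>, so serialize only the
// children of <body> to get the fragment back.
$result = '';
$body = $doc->getElementsByTagName('body')->item(0);
if ($body !== null) {
    foreach ($body->childNodes as $child) {
        $result .= $doc->saveHTML($child);
    }
}

echo $result, PHP_EOL;
```

To rename rather than remove the paragraph, the same $second node could be given a class or id with setAttribute() before serializing.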
I am trying to replace the content inside a particular div tag (id="dd-header") with a comment. I have tried several methods, including regular expressions. This is my latest attempt:
$html = preg_replace('/(<div\sid=\"dd\-header\">)[^<]+(<\/div>)/i', '<!-- Comment -->', $html);
I couldn't get it working. What am I doing wrong here?
NOTE: the div contains multiple nested tags.
Sample Code to Replace
<div id="dd-header">
<a id="logo-small" href="/" title="title"></a>
Link 1 |
Link 2
<!-- Image | -->
| Link 3</div>
$html = preg_replace('/(<div\sid="dd-header">)([^<]|<.+>.*<\/.+>)+(<\/div>)/i', '$1<!-- Comment -->$3', $html);
see http://codepad.org/GpYkteh4
While you can do this in simple cases, as rabudde posted, you can't handle the general case with regular expressions. It is a limitation of the regular expression language, and has been discussed extensively here on SO.
rabudde's code fails when a div contains sub-tags.
The correct way to do it is to parse the tree with an (X)HTML parser, find the div node, remove its children, and replace them with whatever you like.
Just use DOMDocument. It'll parse it into a DOM that is ridiculously easy to traverse, search by ID, and manipulate.
See the documentation, starting with loadHTML: http://docs.php.net/manual/en/domdocument.loadhtml.php
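A minimal sketch of that approach, assuming the sample markup from the question: load the HTML, locate the div by its id, drop every child node, and append a comment node in their place. The comment text and variable names are only illustrative.

```php
<?php
// Sample markup from the question, condensed onto fewer lines.
$html = '<div id="dd-header"><a id="logo-small" href="/" title="title"></a>'
      . ' Link 1 | Link 2 <!-- Image | --> | Link 3</div>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$div = $xpath->query('//div[@id="dd-header"]')->item(0);

$result = '';
if ($div !== null) {
    // Remove every child: elements, text nodes, and comments alike.
    while ($div->firstChild !== null) {
        $div->removeChild($div->firstChild);
    }
    // Put a single comment node where the old content used to be.
    $div->appendChild($doc->createComment(' Comment '));

    // Serialize just the div rather than the whole wrapped document.
    $result = $doc->saveHTML($div);
}

echo $result, PHP_EOL;
```

Because the tree is manipulated node by node, it makes no difference how deeply the old content was nested.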
I have a project where I need to scrape the text out of a specific div tag, but only the text, with no HTML tags.
Here is example of the html:
<div id="divid1" class="divclass1">
<h1>
TEXT INSIDE DIV
</h1>
</div>
I need to scrape the text inside the DIV without the H1 tags. I've tried this numerous ways and just can't get it right.
Any suggestions? Thanks!
Use PHP's DOM parser (DOMDocument); it is well suited to this purpose.
http://www.php.net/manual/en/domdocument.loadhtml.php
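As a rough illustration with the markup from the question, the sketch below loads the HTML, finds the div by id, and reads its textContent property, which returns the text of all descendants with the tags stripped. The variable names are only illustrative.

```php
<?php
// Sample markup from the question.
$html = "<div id=\"divid1\" class=\"divclass1\">\n<h1>\nTEXT INSIDE DIV\n</h1>\n</div>";

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$div = $xpath->query('//div[@id="divid1"]')->item(0);

// textContent concatenates all descendant text nodes; trim() removes
// the line breaks that surrounded the heading in the source.
$text = $div !== null ? trim($div->textContent) : '';

echo $text, PHP_EOL;
```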
I would use PHP Simple HTML DOM Parser.
http://simplehtmldom.sourceforge.net/
You could say:
foreach ($html->find('div#divid1 h1') as $e)
    echo $e->innertext;
This will echo the text within the h1 tag inside #divid1 (but not the tag itself).
The documentation is simple, but helps a bunch: http://simplehtmldom.sourceforge.net/manual.htm
How do I use the DOM parser to extract the content of an HTML element held in a variable?
More exactly:
I have a form where the user enters HTML in a text area. I want to extract the content of the first paragraph.
I know there are many tutorials on this, but I could not find any on extracting from a variable rather than a file (page).
Thanks
If you're taking HTML as user input, I recommend using simplehtmldom. It has a loose parser that tolerates buggy HTML and lets you use CSS selectors to pull elements and their content out of the DOM.
I didn't test this, but it should work:
$html = str_get_html($_POST['input']);
print $html->find('p', 0)->plaintext; // first paragraph (index 0)
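If you would rather not pull in a third-party library, the same thing can be sketched with PHP's built-in DOMDocument, which accepts a string directly through loadHTML(). The $input value below is a hypothetical stand-in for the submitted form field.

```php
<?php
// Hypothetical stand-in for the HTML submitted through the text area,
// e.g. $_POST['input'] in a real form handler.
$input = '<p>First paragraph with <b>bold</b> text.</p><p>Second paragraph.</p>';

$doc = new DOMDocument();
// User-supplied HTML is often malformed; keep the warnings internal.
libxml_use_internal_errors(true);
$doc->loadHTML($input);
libxml_clear_errors();

// getElementsByTagName() returns the matches in document order, so
// item(0) is the first paragraph (or null when there is none).
$first = $doc->getElementsByTagName('p')->item(0);
$text = $first !== null ? $first->textContent : '';

echo $text, PHP_EOL;
```

Remember to escape the extracted text, for example with htmlspecialchars(), before echoing it back into a page.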