Get div by id from string using PHP

I have a whole DOM document as a string inside a variable, and I want to grab only the content of one div by its id. I can't use the PHP DOM parser; is it possible using regex?
<html>
....
<body>
....
<div id="something">
// content
</div>
<div id="other_divs">
// content
</div>
</body>
</html>

preg_match('%<div[^>]+id="something"[^>]*>(.*?)</div>%si', $string, $matches);
This works only if the content itself contains no additional div; the captured content is then in $matches[1].
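A minimal sketch of that call in use, assuming a small sample document in $string (the sample markup and the $content variable are made up for illustration). The third argument is what lets preg_match hand back the captured group:

```php
<?php
// Sample input, assumed for illustration only.
$string = '<html><body>'
    . '<div id="something"> first content </div>'
    . '<div id="other_divs"> second content </div>'
    . '</body></html>';

$content = null;
// s: let . match newlines; i: case-insensitive tag names.
if (preg_match('%<div[^>]+id="something"[^>]*>(.*?)</div>%si', $string, $matches)) {
    // $matches[0] is the whole div, $matches[1] only its inner content.
    $content = trim($matches[1]);
}

echo $content, "\n";
```

Because (.*?) is lazy, the match stops at the first </div> it meets, which is why a nested div inside the target would cut the result short.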

You can try this regex; hope it helps.
preg_match_all('/<div\s*id="[^>]+">(.*?)<\/div>/s', $html, $match);
print_r($match[0]);

Related

How to Remove the nested paragraph tag using Regular Expression in PHP?

I am using Simple HTML DOM to parse HTML. With it I wasn't able to load a <p> tag when <p> tags are nested:
<p>Hello there <p>Some Content </p>outer content <p>Some More content</p></p>
I don't know how to remove the inner <p></p> tags using regex.
My expected output is:
<p>Hello there Some content outer content Some More content</p>
Someone please help me in getting this done
Assuming that the whole problematic <p> tag is on a single line, you can use the following regex:
((?!^)<p>)|(<\/p>(?!$))
((?!^)<p>) matches every <p> tag except the <p> at the beginning of the string.
(<\/p>(?!$)) matches every </p> tag except the </p> at the end of the string.
Replace the matched <p> and </p> tags with an empty string to remove them.
Here is a working demo
EDIT:
Since your input is a html file you can try this updated regex
(<p>)((?!<\/p>).)*?(<p>).*?(<\/p>)
(<p>) searches for <p> tag
((?!<\/p>).)*?(<p>) captures a <p> tag inside the first <p> tag with no </p> tag in between (a nested <p> tag).
.*?(<\/p>) captures the closing tag of the nested <p>.
Remove capture groups 3 and 4 and you have removed the nested tag. You need to run this again and again until there are no more matches.
you can find the updated regex demo here
UPDATE:
Use this regex: (.*<p>)(((?!<\/p>).)*?)(<p>)(.*?)(<\/p>)(.*)
and replace it with \1\2\5\7, which removes only the nested tags.
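A rough sketch of running that replacement repeatedly in PHP until nothing more matches. The helper name flatten_nested_p is hypothetical, and the input is assumed to be a single line, since the pattern is used without the s modifier:

```php
<?php
// Hypothetical helper: apply the regex above until no nested <p> is left.
function flatten_nested_p(string $html): string
{
    // ~ is used as the delimiter so the slashes in </p> need no escaping.
    $pattern = '~(.*<p>)(((?!</p>).)*?)(<p>)(.*?)(</p>)(.*)~';
    do {
        // Keep groups 1, 2, 5 and 7; drop the nested <p> (group 4)
        // and its closing </p> (group 6).
        $html = preg_replace($pattern, '$1$2$5$7', $html, -1, $count);
    } while ($count > 0);

    return $html;
}

$val = '<p>Hello there <p>Some Content </p>outer content <p>Some More content</p></p>';
$flat = flatten_nested_p($val);
echo $flat, "\n";
```

Each pass removes one nested pair, so the loop needs one extra pass to confirm there is nothing left to replace.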
Demo here
Please try this function for removing <p></p> tags
<?php function remove_p($input) {
$input=str_ireplace('<p>','',$input);
$input=str_ireplace('</p>','',$input);
return "<p>".$input."</p>";
}
?>
Please see how to use this function:
<?php $val = "<p>Hello there <p>Some Content </p>outer content <p>Some More content</p></p>";
echo remove_p($val);
?>
Hope, it may be helpful to you.
Nested p tags are not allowed. In place of that you can use:
<p>Hello there <span>Some Content </span>outer content</p>
See the below link for more details
Nesting <p> won't work while nesting <div> will?

How to select content without element using simplehtmldom

I am trying to extract data from a website using PHP with the help of simplehtmldom.
The problem is that there is a text which doesn't have any parent element.
<div class="div_1">First div</div> The text i need to grab <div class="div_2">Second div</div>
In the above demonstration i need to extract The text i need to grab.
If simplehtmldom is not required, try Regular Expressions:
<div[^<>]*class="div_1"[^<>]*>.*?</div>(.*?)<div[^<>]*class="div_2"[^<>]*>.*?</div>
http://regexr.com/3bd5f
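A small sketch of applying that pattern with preg_match in PHP, assuming the markup from the question sits in a variable called $html (the variable names are illustrative). The first capture group holds the text between the two divs:

```php
<?php
$html = '<div class="div_1">First div</div> The text i need to grab <div class="div_2">Second div</div>';

// ~ is the delimiter so </div> needs no escaping; s lets . span newlines.
$pattern = '~<div[^<>]*class="div_1"[^<>]*>.*?</div>(.*?)<div[^<>]*class="div_2"[^<>]*>.*?</div>~s';

$between = null;
if (preg_match($pattern, $html, $m)) {
    // Group 1 is everything between the first div's end and the second div's start.
    $between = trim($m[1]);
}

echo $between, "\n";
```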

Keeping certain html in an xpath query or retrieving original html from xpath query

I am using XPath to extract text from articles. I want to query for the text and, if certain tags (note the plural) exist, keep those tags and their HTML intact. Another solution is to retrieve the original HTML from an XPath query so that I can process it in PHP.
Here's an example of an article:
<html>
<body>
<div id="main">
<div id="content">
<p>Some content</p>
<blockquote>Some blockquote</blockquote>
<embed src="someembed source"></embed>
<br/>
</div>
</div>
</body>
</html>
What I'm looking for is:
Some content (from the p tags)
<blockquote>Some blockquote</blockquote>
<embed src="someembed source"></embed>
<br/>
My xpath isn't designed to handle anything right now but the <p> tags.
$xpath = '//div[@id="main"]//div[@id="content"]//p';
Let me interpret your question this way: you want to detect all occurrences of //div[@id="main"]//div[@id="content"] that also have the specific combination of child tags that you mentioned.
You can select these div occurrences using the following XPath expression:
//div[@id="main"]//div[@id="content" and p and blockquote and embed and br]
If you just want the child nodes you could also write:
//div[@id="main"]//div[@id="content" and p and blockquote and embed and br]/*
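One way this might look in PHP, sketched with the built-in DOM extension (DOMDocument and DOMXPath) and the sample article from the question. Keeping only the text for <p> and the original markup for every other child is an assumption about what the asker wants; $parts is an illustrative name:

```php
<?php
$html = <<<HTML
<html>
<body>
<div id="main">
<div id="content">
<p>Some content</p>
<blockquote>Some blockquote</blockquote>
<embed src="someembed source"></embed>
<br/>
</div>
</div>
</body>
</html>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$query = '//div[@id="main"]//div[@id="content" and p and blockquote and embed and br]/*';

$parts = [];
foreach ($xpath->query($query) as $node) {
    if ($node->nodeName === 'p') {
        // For paragraphs keep only the text.
        $parts[] = $node->textContent;
    } else {
        // For everything else serialize the node back to HTML.
        $parts[] = $dom->saveHTML($node);
    }
}

print_r($parts);
```

Note that saveHTML($node) re-serializes the parsed node, so the markup can differ slightly from the source (for example in how void elements such as <br/> are written).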

Take the content from first <p>...</p>

I have a field in the database containing <p>intro content</p><p>rest of content....</p>
I would like to extract the intro, that is, the content between the first <p>...</p>
After that I want to remove the intro <p>intro content</p> from the content.
Could you help me with that?
If you want to do any kind of parsing on HTML, you really shouldn't use regex. However, your question is fairly simple if all you want is the first <p> element in a string (the whole match, tags included, ends up in $matches[0]):
preg_match('#<p>.*?</p>#', $string, $matches);
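A short sketch covering both halves of the question, assuming the field value is in $content (variable names are illustrative). A capture group returns the intro without its tags, and preg_replace with a limit of 1 removes only the first paragraph:

```php
<?php
$content = '<p>intro content</p><p>rest of content....</p>';

$intro = '';
$rest = $content;
if (preg_match('#<p>(.*?)</p>#s', $content, $matches)) {
    // Group 1 is the text between the first <p> and its </p>.
    $intro = $matches[1];
    // The limit of 1 removes only the first <p>...</p> block.
    $rest = preg_replace('#<p>.*?</p>#s', '', $content, 1);
}

echo $intro, "\n";
echo $rest, "\n";
```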

Get text with PHP Simple HTML DOM Parser

I'm using PHP Simple HTML DOM Parser to get text from a webpage.
The page I need to manipulate is something like:
<html>
<head>
<title>title</title>
<body>
<div id="content">
<h1>HELLO</h1>
Hello, world!
</div>
</body>
</html>
I need to get the h1 element and the text that is not wrapped in any tag.
To get the h1 I use this code:
$html = file_get_html("remote_page.html");
foreach($html->find('#content') as $text){
echo "H1: ".$text->find('h1', 0)->plaintext;
}
But how do I get the other text?
I also tried this inside the foreach, but I get the full text:
$text->plaintext;
It also returned the H1 text...
It looks like $text->find('text',2); gets what you're looking for, however I'm not sure how well that will work when the amount of text nodes is unknown. I'll keep looking.
You can simply strip html tags using strip_tags
<?php
strip_tags($input, '<br>');
?>
Use strip_tags, as @Peachy pointed out. However, passing <br> as the second argument tells strip_tags to keep <br> tags, which is unnecessary. In your case,
<?php
strip_tags($text);
?>
would work as you'd like, given that you are only selecting content in the content id.
Try it
echo "H1: ".$text->find('h1', 0)->innertext;
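For comparison, a sketch that does not use Simple HTML DOM at all but PHP's built-in DOM extension, under the assumption that the wanted text is a direct child text node of the #content div. The variable names are illustrative:

```php
<?php
$page = <<<HTML
<html>
<head>
<title>title</title>
<body>
<div id="content">
<h1>HELLO</h1>
Hello, world!
</div>
</body>
</html>
HTML;

libxml_use_internal_errors(true); // the sample markup is not perfectly formed
$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);

// The heading inside #content.
$h1 = $xpath->query('//div[@id="content"]/h1')->item(0);
$heading = $h1 ? trim($h1->textContent) : '';

// text() selects only text nodes that are direct children of the div,
// so the text inside <h1> is not included.
$loose = '';
foreach ($xpath->query('//div[@id="content"]/text()') as $textNode) {
    $loose .= $textNode->nodeValue;
}
$loose = trim($loose);

echo "H1: ", $heading, "\n";
echo "Text: ", $loose, "\n";
```

Selecting text() nodes avoids having to know how many text nodes there are, which was the concern raised about find('text', 2).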
