How can I scrape invalid HTML using PHP Simple DOM? - php

I'm trying to scrape a webpage using PHP Simple HTML DOM.
$html = '<div class="namepageheader">
<div class="u">Name: Noor Shaad
<div class="u">Age: </div>
</div> '
$name=$html->find('div[class="u"]', 0)->innertext;
$age=$html->find('div[class="u"]', 1)->innertext;
I tried my best to get the text from each class="u" element, but it didn't work because the closing </div> tag is missing on the first <div class="u">. Can anyone help me out with that?

You can find an element close to where the tag should have been closed and then normalize the HTML with a replacement.
For example, you can replace the </a> tag with </a></div>:
$html = str_replace('</a>','</a></div>',$html);
Or, if there are too many closing </a> tags, replace </a><div class="u"> with </a></div><div class="u">:
$html = str_replace('</a><div class="u">','</a></div><div class="u">',$html);
There may be another problem: if there is whitespace between the tags, the replacement will not match. To solve this, first remove the whitespace between the tags and then do the replacement.
$html = '<div class="namepageheader">
<div class="u">Name: Noor Shaad
<div class="u">Age: </div>
</div> ' ;
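// Remove whitespace between tags so the pattern below can match.
$html = preg_replace('~>\s+<~', '><', $html);
// str_replace() returns the new string, so assign the result.
$html = str_replace('</a><div class="u">','</a></div><div class="u">',$html);
// Parse the repaired string before calling find().
$dom = str_get_html($html);
$name = $dom->find('div[class="u"]', 0)->innertext;
$age = $dom->find('div[class="u"]', 1)->innertext;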
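As an alternative to patching the string, a tolerant parser can repair the markup itself. The sketch below is not from the answer above: it assumes PHP's bundled DOM extension (DOMDocument, DOMXPath) rather than Simple HTML DOM, and the helper name ownText is made up for illustration. Because the parser is expected to nest the second div.u inside the unclosed first one, the sketch reads only each element's direct text nodes.

```php
<?php
// Sketch only: uses the bundled DOM extension instead of Simple HTML DOM.
$html = '<div class="namepageheader">
<div class="u">Name: Noor Shaad
<div class="u">Age: </div>
</div> ';

libxml_use_internal_errors(true);   // keep libxml warnings about bad markup quiet
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Hypothetical helper: text that belongs directly to $node, ignoring nested elements.
function ownText(DOMXPath $xpath, DOMNode $node): string
{
    $text = '';
    foreach ($xpath->query('text()', $node) as $textNode) {
        $text .= $textNode->nodeValue;
    }
    return trim($text);
}

$fields = [];
foreach ($xpath->query('//div[@class="u"]') as $div) {
    $fields[] = ownText($xpath, $div);
}

$name = $fields[0] ?? '';
$age  = $fields[1] ?? '';
```

Reading direct text nodes rather than the full text content keeps the nested "Age:" label out of the name field.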

Related

Remove span tag from element html dom parser

I have code like this, and it's fetching data from other website.
require('simple_html_dom.php');
$html = file_get_html("www.example.com");
$info['diesel'] = $html->find(".on .price",0)->innertext;
$info['pb95'] = $html->find(".pb .price",0)->innertext;
$info['lpg'] = $html->find(".lpg .price",0)->innertext;
The html code from other website looks:
<a href="#" class="station-detail-wrapper on text-center active">
<h3 class="fuel-header">ON</h3>
<div class="price">
5,97
<span>zł</span>
</div>
</a>
So if I use echo $info['diesel'], it shows me 5,97 zł. I would like to remove the <span>zł</span> so that only the price is shown.
Maybe you can replace that span tag with an empty string:
echo $info['diesel']=str_replace("<span>zł</span>","",$info['diesel']);
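The replacement above matches only that exact span and leaves the surrounding line breaks in place. A slightly more general sketch, assuming the value looks like the inner HTML of the price div shown in the question (the helper name priceOnly is hypothetical), removes any span element and trims what is left:

```php
<?php
// Hypothetical helper: drop every <span>...</span> and the surrounding whitespace.
function priceOnly(string $innerHtml): string
{
    $withoutSpans = preg_replace('~<span\b[^>]*>.*?</span>~s', '', $innerHtml);
    return trim((string) $withoutSpans);
}

// Sample value mirroring the inner HTML of <div class="price"> in the question.
$inner = "\n5,97\n<span>zł</span>\n";

$price   = priceOnly($inner);                       // price as text
$asFloat = (float) str_replace(',', '.', $price);   // numeric value, if needed
```

The conversion to a float is optional and assumes a comma is always the decimal separator.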

Replacing content between two div tags - PHP

I'm trying to replace content between two div tags using str_replace but I'm unsure how.
The content will be
<div class="profile-details">
<div class="username">Paradigm</div>
<div class="dob">01/01/2015</div>
</div>
What I want to do is replace the content between the <div class="profile-details">content </div> tags. The content is variable depending on the user profile.
Assuming this is an HTML string, this would be my approach. A delimiter other than / is used so that the slash in </div> does not end the pattern early:
echo preg_replace('~<div class="username">.+?</div>~is', '<div class="username">Special Username</div>', $string);
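Where a regex feels brittle, the same replacement can be sketched with PHP's bundled DOM extension. This is an alternative to the regex approach, not a restatement of it; the sample string and the replacement text are illustrative assumptions.

```php
<?php
$string = '<div class="profile-details">
<div class="username">Paradigm</div>
<div class="dob">01/01/2015</div>
</div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($string);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Replace the text of the username div only.
$username = $xpath->query('//div[@class="profile-details"]/div[@class="username"]')->item(0);
if ($username !== null) {
    $username->nodeValue = 'Special Username';
}

// Serialize just the profile block, not the html/body wrapper the parser adds.
$profile = $xpath->query('//div[@class="profile-details"]')->item(0);
$result = $profile !== null ? $doc->saveHTML($profile) : $string;

echo $result;
```

The exact whitespace between the serialized elements may differ from the input, so compare on content rather than on the full string.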

How to get content from Div which have other HTML tags using Regexp

I have a div which contains other HTML tags along with text.
I want to extract only the text from this div, or the text inside all of its HTML tags.
<div class="rpr-help m-chm">
<div class="header">
<h2 class="h6">Repair Help</h2>
</div><!-- /end .header -->
<div class="inner m-bsc">
<ul>
<li>Repair Video</li>
<li>Repair Q&A</li>
</ul>
</div>
<div>
<br>
<span class="h4">Cross Reference Information</span><br>
<p>Part Number 285753A (AP3963893) replaces 1195967, 280152, 285140, 285743, 285753, 3352470, 3363664, 3364002, 3364003, 62672, 62693, 661560, 80008, 8559748, AH1485646, EA1485646, PS1485646.
<br>
</p>
</div>
</div>
Here is my regexp:
preg_match_all("/<div class=\"rpr-help m-chm\">(.*)<\/.*>/s", $urlcontent, $description);
It works fine when I assign this complete div to the $urlcontent variable.
But when I fetch data from a real URL, like $urlcontent = "www.test.com/test.html";
it returns the complete webpage source.
How can I get the inner content of <div class="rpr-help m-chm">?
Is any correction required in my regexp?
Any help would be appreciated. Thanks.
It's not possible to parse HTML/XHTML by regex. Source
You can't parse [X]HTML with regex. Because HTML can't be parsed by
regex. Regex is not a tool that can be used to correctly parse HTML
Depending on the language you use, please consider using a third-party library for HTML parsing.
Use this function:
function GetclassContent($tagStart,$tagEnd,$content)
{
$first_step = explode( $tagStart,$content );
$second_step = explode($tagEnd,$first_step[1] );
return $second_step[0];
}
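Steps to use the above function:
$website = "http://www.test.com/test.html"; // include the scheme, or PHP looks for a local file
$content = file_get_contents($website);
$tagStart = '<div class="rpr-help m-chm">';
$tagEnd = "</div>"; // must match the markup exactly (no space before >)
$RequiredContent = GetclassContent($tagStart,$tagEnd,$content);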
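Note that the helper above returns everything up to the first closing </div>, so nested divs cut the result short. If only the text is wanted, one possible alternative is to let a parser walk the nested markup. The sketch below assumes PHP's bundled DOM extension; the function name textInsideClass and the shortened sample markup are illustrative, and in practice $content would come from file_get_contents(). It also assumes a simple class name with no quotes in it.

```php
<?php
// Hypothetical helper: all non-empty text pieces inside the first element
// whose class attribute contains $class as a whole word.
function textInsideClass(string $content, string $class): array
{
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($content);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $query = sprintf(
        '(//*[contains(concat(" ", normalize-space(@class), " "), " %s ")])[1]//text()',
        $class
    );

    $pieces = [];
    foreach ($xpath->query($query) as $textNode) {
        $piece = trim($textNode->nodeValue);
        if ($piece !== '') {
            $pieces[] = $piece;   // skip whitespace-only nodes
        }
    }
    return $pieces;
}

// Shortened stand-in for the downloaded page.
$content = '<html><body><p>Unrelated page text</p>
<div class="rpr-help m-chm">
<div class="header">
<h2 class="h6">Repair Help</h2>
</div><!-- /end .header -->
<div class="inner m-bsc">
<ul>
<li>Repair Video</li>
<li>Repair Q&amp;A</li>
</ul>
</div>
</div>
</body></html>';

$pieces = textInsideClass($content, 'rpr-help');
echo implode("\n", $pieces), "\n";
```

Returning the pieces as an array leaves it to the caller to decide how to join them; comments such as the end-of-header marker are not text nodes and are skipped.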

PHP preg_replace to insert string inside an existing string

Unfortunately I really cannot get my head around regular expressions so my last resort is to ask the help of you fine people.
I have this existing code:
<li id="id-21" class="listClass" data-author="newbie">
<div class="someDiv">
<span class="spanClass">Some content</span>
</div>
<div class="controls faint">
Link 2
Link 3
</div>
</li>
Due to a number of reasons, I have to use preg_replace to inject an additional piece of code:
Link 1
I think you can guess where that should go, but for the sake of clarity, my desire is for the resulting string to look like:
<li id="id-21" class="listClass" data-author="newbie">
<div class="someDiv">
<span class="spanClass">Some content</span>
</div>
<div class="controls faint">
Link 1
Link 2
Link 3
</div>
</li>
Can anyone help me with the appropriate regular expression to achieve this?
Try this:
$html = '<li id="id-21" class="listClass" data-author="newbie">
<div class="someDiv">
<span class="spanClass">Some content</span>
</div>
<div class="controls faint">
Link 2
Link 3
</div>
</li>';
$eleName = 'a';
$eleAttr = 'href';
$eleAttrValue = 'link2';
$addBefore = 'Link 1';
$result = regexAddBefore($html, $eleName, $eleAttr, $eleAttrValue, $addBefore);
var_dump($result);
function regexAddBefore($subject, $eleName, $eleAttr, $eleAttrValue, $addBefore){
$regex = "/(<\s*".$eleName."[^>]*".$eleAttr."\s*=\s*(\"|\')?\s*".$eleAttrValue."\s*(\"|\')?[^>]*>)/s";
$replace = $addBefore."\r\n$1";
$subject = preg_replace($regex, $replace, $subject);
return $subject;
}
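If a DOM parser turns out to be an option after all, the same insertion can be sketched without a regex. The markup in the question appears to have lost its anchor tags, so the sketch below assumes the controls hold links such as <a href="link2">Link 2</a>, matching the $eleAttrValue used above; the href value link1 for the new link is likewise an assumption.

```php
<?php
// Assumed markup: the links are written out as <a> elements here.
$html = '<li id="id-21" class="listClass" data-author="newbie">
<div class="controls faint">
<a href="link2">Link 2</a>
<a href="link3">Link 3</a>
</div>
</li>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$target = $xpath->query('//a[@href="link2"]')->item(0);

if ($target !== null) {
    // Build <a href="link1">Link 1</a> and place it right before Link 2.
    $link = $doc->createElement('a', 'Link 1');
    $link->setAttribute('href', 'link1');
    $target->parentNode->insertBefore($link, $target);
}

// Serialize only the list item, not the html/body wrapper the parser adds.
$li = $xpath->query('//li[@id="id-21"]')->item(0);
$result = $li !== null ? $doc->saveHTML($li) : $html;
echo $result;
```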
I can suggest two things (although I couldn't understand your problem clearly):
$newStr = preg_replace ('/<[^>]*>/', ' ', $htmlText);
This will replace every HTML tag in the string with a space, leaving only the text. I don't know if it will be useful for you.
Another recommendation would be to use the strip_tags function. Its second parameter is optional and lets you define the tags you want to keep.
$str = '<li id="id-21" class="listClass" data-author="newbie">
<div class="someDiv">
<span class="spanClass">Some content</span>
</div>
<div class="controls faint">
Link 2
Link 3
</div>
</li>';
echo strip_tags ($str,'<a>');
This will give you an output just with the links and whatever text in the html string.
Sorry if this also doesn't help.

How to get everything between two tags if the closing tag appears in the parent?

This is the problem: the script I use stops at the first closing tag.
I'm scraping a website, and this is the part of the site I want to extract.
<div class="i-want-this-div">
<div class="annoying-sub-div">
Bla bla bla
</div>
<div class="annoying-sub-div">
etc...
</div>
<div class="annoying-sub-div">
</div>
<div class="annoying-sub-div">
</div>
<div class="annoying-sub-div">
</div>
</div>
I want to display all those 'annoying'(because they mess up the function of the script by being there) divs on my site, but how do I do this?
This is my current approach: get the position of the opening tag, get the position of the closing tag, and extract that part from the string that holds the whole website source.
$startPos = strpos($siteIAmScreaping, '<div class="i-want-this-div">');
$endPos = strpos($siteIAmScreaping, '</div>', $startPos) + 8;
$annoyingDivs = substr($siteIAmScreaping, $startPos, $endPos-$startPos);
The problem is: I want it to stop on the main divs closing tag and not on the first closing tag it finds.
Use DOMDocument for stuff like this.
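A minimal sketch of that suggestion, assuming the page source is already in $siteIAmScreaping (a short stand-in string is used here). DOMDocument and DOMXPath are part of PHP's bundled DOM extension; the parser tracks nesting, so the selected node ends at its own closing tag rather than at the first </div>.

```php
<?php
// Stand-in for the downloaded page source.
$siteIAmScreaping = '<html><body>
<div class="i-want-this-div">
<div class="annoying-sub-div">
Bla bla bla
</div>
<div class="annoying-sub-div">
etc...
</div>
</div>
<div class="footer">Not wanted</div>
</body></html>';

libxml_use_internal_errors(true);   // real pages are rarely valid HTML
$doc = new DOMDocument();
$doc->loadHTML($siteIAmScreaping);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$wanted = $xpath->query('//div[@class="i-want-this-div"]')->item(0);

// saveHTML($node) serializes the node together with all nested divs.
$annoyingDivs = $wanted !== null ? $doc->saveHTML($wanted) : '';
echo $annoyingDivs;
```

The exact-match test on the class attribute is an assumption that holds for the markup in the question; a div with several classes would need a contains() test instead.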
Use querypath (or phpquery) for simplicity. You can then extract the <div> content by class or id most easily:
print htmlqp($page)->find("div.i-want-this-div")->html();
Are you saying you want to show the actual code? If so, put your code inside pre tags:
<pre></pre>
Everything within will keep its formatting. For the tags themselves to be visible, the markup also needs to be escaped first, for example with htmlspecialchars().
