I'm testing a parser using SIMPLE_HTML_DOM and while parsing
the returned HTML DOM from this URL: HERE
It is not finding the H1 elements...
I tried returning all the div's with success.
I'm using a simple request for diagnosing this problem:
foreach($html->find('H1') as $value) { echo "<br />F: ".htmlspecialchars($value); }
While looking at the source code I realized that:
h1 is upper case -> H1 - but the SIMPLE_HTML... is handling that:
//PaperG - If lowercase is set, do a case insensitive test of the value of the selector.
if ($lowercase) {
$check = $this->match($exp, strtolower($val), strtolower($nodeKeyValue));
} else {
$check = $this->match($exp, $val, $nodeKeyValue);
}
if (is_object($debugObject)) {$debugObject->debugLog(2, "after match: " . ($check ? "true" : "false"));}
Can any body help me understanding what is going on here?
Try This
$oHtml = str_get_html($html);
foreach($oHtml->find('h1') as $element)
{
echo $element->innertext;
}
You will also use regular expression following function return an array of all h1 tag's innertext
function getH1($yourhtml)
{
$h1tags = preg_match_all("/(<h1.*>)(\w.*)(<\/h1>)/isxmU", $yourhtml, $patterns);
$res = array();
array_push($res, $patterns[2]);
array_push($res, count($patterns[2]));
return $res;
}
Found it...
But cant explain it!
I tested with another code including H1 (uppercase) and it worked.
While playing with the SIMPLE_HTML_DOM code i commented the "remove_noise" and now its working
perfectly, I think it's because that this website has invalid HTML and
the noise remover is removing too much and not ending after the end tags scripts:
// $this->remove_noise("'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is");
// $this->remove_noise("'<\s*script\s*>(.*?)<\s*/\s*script\s*>'is");
Thank you all for your help.
Related
I've searched around and around and I'm not sure how this really works.
I have the tags
<taghere>content</taghere>
and i want to pull the "content" so i can put an ifstatement depending on what the "content" is as the "content" is varrying depending on the page
i.e
<taghere>HelloWorld</taghere>
$content = //function that returns the text between <taghere> and </taghere>
if($content == "HelloWorld")
{
//execute function;
}
else if($content =="Bonjour")
{
//execute seperate function
}
i tried using preg but it doesnt seem to work and just returns whatever value is in the lines field instead of actually giving me the information within the tags
If I understand your question correctly, you want the data INSIDE the tag "taghere".
If you are parsing HTML, you should use DOMDocument
Try something similar to this:
<?php
// Assuming your content (the html where those tags are found) is available as $html
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your HTML
libxml_clear_errors();
// Note: Tag names are case sensitive
$text = $dom->getElementsByTagName('taghere');
// Echo the content
echo $text
you can use DomDocument and loadXML to do this
<?php
function doAction($word=""){
$html="<taghere>$word</taghere>";
$doc = new DOMDocument();
$doc->loadXML($html);
//discard white space
$hTwo= $doc->getElementsByTagName('taghere'); // here u use your desired tag
if($hTwo->item(0)->nodeValue== "HelloWorld")
{
echo "1";
}
else if($hTwo->item(0)->nodeValue== "Bonjour")
{
echo "2";
//execute seperate function
}
}
doAction($word="Bonjour");
You cannot do it like that. Technically it is possible but it's more than an overkill. And you mixed up PHP with HTML in a way that doesn't work.
To achieve the thing that you want you have to do something like this:
$content = 'something';
if ($comtent === 'something') {
//do something
}
if ($content === 'something else') {
//do something else
}
echo '<tag>'. $content . '</tag>' ;
Of course you can change $content in the ifs.
Dont forget, you can allways add an ID into a tag so you can reference it with java script.
<tag id='tagid'>blah blah blah </tag>
<script>
document.getElementById(tagid)
</script>
This might be a much simpler way to get what you are thinking about then some of the other responses
I don't know what regex you tried and therefor not what would have been wrong. Might have been the escaping of the <
<?php
if(preg_match('#\<taghere>(.*)\</taghere>#', $document, $a)){
$content = $a[1];
}
?>
I suppose there will be only one
I am trying to parse html page of Google play and getting some information about apps. Simple-html-dom works perfect, but if page contains code without spaces, it completely ingnores attributes. For instance, I have html code:
<div class="doc-banner-icon"><img itemprop="image"src="https://lh5.ggpht.com/iRd4LyD13y5hdAkpGRSb0PWwFrfU8qfswGNY2wWYw9z9hcyYfhU9uVbmhJ1uqU7vbfw=w124"/></div>
As you can see, there is no any spaces between image and src, so simple-html-dom ignores src attribute and returns only <img itemprop="image">. If I add space, it works perfectly. To get this attribute I use the following code:
foreach($html->find('div.doc-banner-icon') as $e){
foreach($e->find('img') as $i){
$bannerIcon = $i->src;
}
}
My question is how to change this beautiful library to get full inner text of this div?
I just create function which adds neccessary spaces to content:
function placeNeccessarySpaces($contents){
$quotes = 0; $flag=false;
$newContents = '';
for($i=0; $i<strlen($contents); $i++){
$newContents.=$contents[$i];
if($contents[$i]=='"') $quotes++;
if($quotes%2==0){
if($contents[$i+1]!== ' ' && $flag==true) {
$newContents.=' ';
$flag=false;
}
}
else $flag=true;
}
return $newContents;
}
And then use it after file_get_contents function. So:
$contents = file_get_contents($url, $use_include_path, $context, $offset);
$contents = placeNeccessarySpaces($contents);
Hope it helps to someone else.
I have this function to get title of a website:
function getTitle($Url){
$str = file_get_contents($Url);
if(strlen($str)>0){
preg_match("/\<title\>(.*)\<\/title\>/",$str,$title);
return $title[1];
}
}
However, this function make my page took too much time to response. Someone tell me to get title by request header of the website only, which won't read the whole file, but I don't know how. Can anyone please tell me which code and function i should use to do this? Thank you very much.
Using regex is not a good idea for HTML, use the DOM Parser instead
$html = new simple_html_dom();
$html->load_file('****'); //put url or filename
$title = $html->find('title');
echo $title->plaintext;
or
// Create DOM from URL or file
$html = file_get_html('*****');
// Find all images
foreach($html->find('title') as $element)
echo $element->src . '<br>';
Good read
RegEx match open tags except XHTML self-contained tags
Use jQuery Instead to get Title of your page
$(document).ready(function() {
alert($("title").text());
});
Demo : http://jsfiddle.net/WQNT8/1/
try this will work surely
include_once 'simple_html_dom.php';
$oHtml = str_get_html($url);
$Title = array_shift($oHtml->find('title'))->innertext;
$Description = array_shift($oHtml->find("meta[name='description']"))->content;
$keywords = array_shift($oHtml->find("meta[name='keywords']"))->content;
echo $title;
echo $Description;
echo $keywords;
I have an XML file which contains text with some very simple layout constructs:
<?xml version='1.0'?>
<page>
<section>
<header>Header</header>
<par>Some paragraph</par>
<par>Another paragraph with <emph>formatting</emph></par>
</section>
</page>
In PHP then I read this file using SimpleXML (Note that I intentionally strip other tags!):
$page = file_get_contents("page.xml");
if ($page) {
$stripped = strip_tags($page, "<?xml><page><section><header><par><emph>");
$xml = new SimpleXMLElement($stripped);
}
Now I would like to iterate over the XML elements and print them in order as HTML for my website. The final result should be the following snippet:
<h1>Header</h1>
<p>Some paragraph
<p>Another paragraph with <i>formatting</i>
I've noodled through SimpleXML and XPath and tried to figure out how I can iterate over the XML tree in order so that I can digest the original XML file into HTML output. I can produce a somewhat desired result but the <emph></emph> is just gone; how do I descent further into the tree? My code so far:
foreach ($xml->section as $s) {
echo "<h1>" . $s->header . "</h1>";
foreach ($s->par as $p) {
echo "<p>" . $p;
// Do some magic here to ensure <emph> tags are recognized and responded to properly.
}
}
Any hints and pointers are appreciated! Thanks :-)
Well, without an answer I just had to noodle myself :-) So here is what I did and it worked out just fine.
Turned out that the SimpleXML thing didn't cut it, so I used the XMLReader:
$xml = new XMLReader();
Then I manually parsed the XML string, jumped from element to element and acted upon each of them:
if ($xml->xml($stripped)) { // $stripped here is a string that's been validated (see below).
while (false !== $xml->read()) {
$t = $xml->nodeType;
if ($t === XMLReader::ELEMENT) {
$n = $xml->name;
switch ($n) {
case "page":
case "section":
// Nothing to echo here.
break;
case "header":
// Handle attributes here
echo "<h1>";
break;
case "par":
echo "<p> ";
break;
case "emph";
echo "<i>"; // This can also open a <span> for more flexibility later.
break;
default:
// Nothing should arrive here.
echo "Gah!"
}
}
else if ($t === XMLReader::END_ELEMENT) {
... // Close the opened tags here.
}
else if ($t === XMLReader::TEXT) {
$s = $xml->readString();
echo $s;
}
else {
// Everything else are comments or white spaces.
}
}
}
You get the drift. I basically had to bounce through the XML structure myself and, dependent on the element type, handle attributes and nodes of elements manually.
In fact, this is a two-step process. What you see here assumes a valid XML document. I also have a validator that runs before the above code, and which makes sure that the correct elements are nested properly and that the given XML is "well formed" as per my own definitions of nesting, attributes, whatnot. The validator operates after the exact same principle.
Hope this helps.
find() here is a function of the simple_html_dom library, that should return dom node elements when given an id/class.
$urlFetched->find("#".$id) always fails to find and return something when the $id is "fk-list-MP3-Players-/-IPods". I am guessing the problem is with the forward slash and simple_html_dom, because there is no problem with the other ids and urls(snipped).
What do I do? my program is almost complete and dependent on simple html dom.
Thanks
The code:
$urlAndIds = array(
array("http://www.flipkart.com/audio" , array('fk-list-Home-Audio', htmlentities("fk-list-MP3-Players-/-IPods"), 'fk-list-Accessories'),array('ALL','AllBrands')) );
foreach($urlAndIds as $uAI) {
$url = file_get_contents($uAI[0]) ;
$urlFetched = str_get_html($url) ;
if ($url == false){
echo 'page '.$uAI[0] . " not found" ."<br>" ."<br>";
} else {
foreach ($uAI[1] as $id) {
$idFound = $urlFetched->find("#".$id) ;
if(!$idFound) {
echo 'In page '.$uAI[0].' -id not found- '.$id ."<br>";
}
}
}
}
The slash is being interpreted as part of the XPath expression, so it's looking for a child element named -IPods. There is no XPath "quote" type function either. I'm not sure whether adding a backslash would work, but it may be easier for you to just use a normal attribute selector with id: [#id='fk-list-MP3-Players-/-IPods']