Yes, I realize that you can use javascript/jQuery to do this, but I want to use PHP (it's more a learning thing). I can't install queryPath or phpQuery since this is on a client's webhost.
Basically I'm modifying
function getElementById($id) {
$xpath = new DOMXPath($this->domDocument);
return $xpath->query("//*[@id='$id']")->item(0);
}
to use, but
Fatal error: Using $this when not in object context in blahblahblah on line #
gets thrown, and $this is undefined.
Basically what I'm trying to do is get the body id value of the same page the PHP is on.
Any ideas?
It looks like he is trying to just make a function to do some of this xpath stuff easier.
Something like
<?php
function getElementById($id,$url){
$html = new DOMDocument();
$html->loadHTMLFile($url); // Pull the contents at a URL or file name
$xpath = new DOMXPath($html); // So we can use XPath...
return $xpath->query("//*[@id='$id']")->item(0); // Return the first element matching our id, or null if there is none.
}
?>
I didn't test this, but it looks about right.
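Since the goal is the id of the page's own body, here is a minimal sketch of that specific case. It parses an HTML string rather than a URL, and the function name getBodyId and the sample markup are my own placeholders, not anything from the question. Because it is a plain function, there is no $this involved at all.

```php
<?php
// Hypothetical helper: return the id attribute of <body>, or null if absent.
function getBodyId(string $html): ?string
{
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // string(...) yields '' when no node matches.
    $id = $xpath->evaluate('string(//body/@id)');

    return $id === '' ? null : $id;
}

// Sample markup, assumed for illustration only.
$bodyId = getBodyId('<html><body id="home"><p>Hello</p></body></html>');
echo $bodyId, "\n";
```

If the function has to live inside a class, $this only works when it is called on an instance of that class; calling it as a free function is what triggers the fatal error.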
I am trying to read the HTML <audio> tag in PHP, but it is created dynamically.
This is the URL. I'm using the following to read it:
$dom = new DOMDocument();
@$dom->loadHTML($html);
foreach (iterator_to_array($dom->getElementsByTagName('audio')) as $node) {
$this->printnode($node);
}
The printnode() function behaves as if no <audio> tag exists, because the tag is created dynamically.
After looking at the structure: yes, the URL for the actual audio is loaded dynamically via JS.
But the audio playlist data is still visible. Use that:
$xpath = new DOMXPath($dom);
$playlist_data = $xpath->evaluate('string(//script[@id="playlist-data"])');
$data = json_decode($playlist_data, true);
echo $data['audio'];
It's inside another script tag, in JSON string format. So basically, access this element and get its value as a string. That gives you the JSON string; load it into json_decode as usual and the parser will return an array, then access the audio URL like any normal array element.
Sidenote: I just used xpath as personal preference, you can use:
$playlist_data = $dom->getElementById('playlist-data')->nodeValue;
if you choose to do so.
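For anyone who wants to try the approach without the original page, here is a small self-contained sketch. The markup and the example.com URL are made up for illustration; only the script id playlist-data and the audio key come from the answer above.

```php
<?php
// Made-up page fragment: the real page's markup is assumed, not known.
$html = '<html><body>'
      . '<script id="playlist-data" type="application/json">'
      . '{"audio": "https://example.com/track.mp3"}'
      . '</script>'
      . '</body></html>';

libxml_use_internal_errors(true); // keep parse warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// Text content of the script element, i.e. the raw JSON string.
$playlistData = $xpath->evaluate('string(//script[@id="playlist-data"])');

// Decode to an associative array; json_decode returns null on invalid JSON.
$data = json_decode($playlistData, true);
$audioUrl = is_array($data) ? ($data['audio'] ?? null) : null;

echo $audioUrl, "\n";
```

Checking is_array() before indexing avoids a warning when the script tag is missing or the JSON is malformed.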
I'm trying to use SimplePie to pull a list of links via RSS feeds and then scrape those links using Simple HTML DOM to pull out images. I'm able to get SimplePie working to pull the links and store them in an array. I can also use the Simple HTML DOM parser to get the image link that I'm looking for. The problem is that when I try to use SimplePie and Simple HTML DOM at the same time, I get a 500 error. Here's the code:
set_time_limit(0);
error_reporting(0);
$rss = new SimplePie();
$rss->set_feed_url('http://contently.com/strategist/feed/');
$rss->init();
foreach($rss->get_items() as $item)
$urls[] = $item->get_permalink();
unset($rss);
/*
$urls = array(
'https://contently.com/strategist/2016/01/22/whats-in-a-spotify-name-and-5-other-stories-you-should-read/',
'https://contently.com/strategist/2016/01/22/how-to-make-content-marketing-work-inside-a-financial-services-company/',
'https://contently.com/strategist/2016/01/22/glenn-greenwald-talks-buzzfeed-freelancing-the-future-journalism/',
...
'https://contently.com/strategist/2016/01/19/update-a-simpler-unified-workflow/');
*/
foreach($urls as $url) {
$html = new simple_html_dom();
$html->load_file($url);
$images = $html->find('img[class=wp-post-image]',0);
echo $images;
$html->clear();
unset($html);
}
I commented out the urls array, but it is identical to the array created by the SimplePie loop (I created it manually from the results). It fails on the find command the first time through the loop. If I comment out the $rss->init() line and use the static url array, the code all runs with no errors, but doesn't give me the result I want - of course. Any help is greatly appreciated!
There's a strange incompatibility between simple_html_dom and SimplePie. When the HTML is loaded, simple_html_dom->root is not populated, which causes an error in every subsequent operation.
Curiously, switching from the object-oriented call to the function call works fine for me:
$html = file_get_html( $url );
instead of:
$html = new simple_html_dom();
$html->load_file($url);
Anyway, simple_html_dom is known for causing problems, above all with memory usage.
Edited:
OK, I have found the bug.
It resides in simple_html_dom->load_file(), which calls the standard function file_get_contents(), then checks the result through error_get_last() and, if an error is found, unsets its own data. But if an error occurred earlier (in my test SimplePie output the warning "./cache is not writeable"), that earlier error is interpreted by simple_html_dom as a file_get_contents() failure.
If you have PHP 7 installed, you can call error_clear_last() after unset($rss), and your code should work. Otherwise, you can use my code above, or pre-load the HTML into a variable and then call simple_html_dom->load() instead of simple_html_dom->load_file().
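To see the underlying behaviour without either library, here is a minimal sketch in plain PHP. The nonexistent path is a placeholder of mine that stands in for SimplePie's cache warning; the point is only that error_get_last() keeps reporting an old error after a later call has succeeded.

```php
<?php
// Step 1: an unrelated warning happens (suppressed with @, but still recorded).
// The path is a deliberately nonexistent placeholder.
@file_get_contents('/nonexistent/path/for-demo.txt');

// Step 2: a later read succeeds...
$data = file_get_contents(__FILE__);

// ...yet error_get_last() still describes the warning from step 1.
$stale = error_get_last();

// Step 3 (PHP 7+): reset the recorded error so later checks start clean.
error_clear_last();
$afterClear = error_get_last();

var_dump($data !== false);    // the second read worked
var_dump($stale !== null);    // but a stale error was still reported
var_dump($afterClear);        // nothing recorded after clearing
```

Any code that treats a non-null error_get_last() as proof that the call it just made failed will be fooled in exactly this way.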
I'm currently trying to parse some data from a forum. Here is the code:
$xml = simplexml_load_file('https://forums.eveonline.com');
$names = $xml->xpath("html/body/div/div/form/div/div/div/div/div[*]/div/div/table//tr/td[@class='topicViews']");
foreach($names as $name)
{
echo $name . "<br/>";
}
Anyway, the problem is that I'm using a Google Chrome XPath extension to help me get the path, and I'm guessing that Chrome changes the HTML enough that the path no longer matches when my website runs this search. Is there some way I can make the host look at the site through Google Chrome so that it gets the right code? What would you suggest?
Thanks!
My suggestion is to always use DOMDocument as opposed to SimpleXML, since it's a much nicer interface to work with and makes tasks a lot more intuitive.
The following example shows you how to load the HTML into the DOMDocument object and query the DOM using XPath. All you really need to do is find all td elements with a class name of topicViews and this will output each of the nodeValue members found in the DOMNodeList returned by this XPath query.
/* Use internal libxml errors -- turn on in production, off for debugging */
libxml_use_internal_errors(true);
/* Create a new DomDocument object */
$dom = new DomDocument;
/* Load the HTML */
$dom->loadHTMLFile("https://forums.eveonline.com");
/* Create a new XPath object */
$xpath = new DomXPath($dom);
/* Query all <td> nodes containing specified class name */
$nodes = $xpath->query("//td[@class='topicViews']");
/* Set HTTP response header to plain text for debugging output */
header("Content-type: text/plain");
/* Traverse the DOMNodeList object to output each DomNode's nodeValue */
foreach ($nodes as $i => $node) {
echo "Node($i): ", $node->nodeValue, "\n";
}
A double slash '//' makes XPath search all descendants at any depth. So if you use the XPath '//table', you get every table in the document.
You can also use this deeper in your xpath structure like 'html/body/div/div/form//table' to get all tables under xpath 'html/body/div/div/form'.
This way you can make your code a bit more resilient against changes in the html source.
I do suggest learning a little about xpath if you want to use it. Copy paste only gets you so far.
A simple explanation about the syntax can be found at w3schools.com/xml/xpath_syntax.asp
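Here is a small sketch of the difference, using DOMDocument on a made-up document (the table ids a, b and c are placeholders of mine). It compares '//table', a path with '//' in the middle, and a path made only of single slashes.

```php
<?php
// Made-up markup: one nested table, one direct child of the form, one outside it.
$html = '<html><body><div><form>'
      . '<div><table id="a"><tr><td>1</td></tr></table></div>'
      . '<table id="b"><tr><td>2</td></tr></table>'
      . '</form></div>'
      . '<table id="c"><tr><td>3</td></tr></table>'
      . '</body></html>';

libxml_use_internal_errors(true); // keep parse warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Every table anywhere in the document.
$all = $xpath->query('//table');

// Every table at any depth below the form.
$underForm = $xpath->query('/html/body/div/form//table');

// Only tables that are direct children of the form.
$direct = $xpath->query('/html/body/div/form/table');

printf("all=%d underForm=%d direct=%d\n", $all->length, $underForm->length, $direct->length);
```

The fully spelled-out path breaks as soon as one wrapper div is added or removed, which is why the '//' forms tend to survive layout changes better.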
I am grabbing the contents from Google with PHP. How can I search $page for elements with the id "lga" and echo out another property? Say #lga is an image; how would I echo out its source?
No, I'm not going to do this with Google; Google is strictly an example and testing page.
<body><img id="lga" src="snail.png" /></body>
I want to find the element with the id "lga" and echo out its source, so for the above code I would want to echo out "snail.png".
This is what I'm using and how I'm storing what I found:
<?php
$url = "https://www.google.com/";
$page = file($url);
foreach($page as $part){
}
?>
You can achieve this using the built-in DOMDocument class. This class allows you to work with HTML in a structured manner rather than parsing plain text yourself, and it's quite versatile:
$dom = new DOMDocument();
$dom->loadHTML($html);
To get the src attribute of the element with the id lga, you could simply use:
$imageSrc = $dom->getElementById('lga')->getAttribute('src');
Note that DOMDocument::loadHTML will generate warnings when it encounters invalid HTML. The method's doc page has a few notes on how to suppress these warnings.
Also, if you have control over the website you are parsing the HTML from, it might be more appropriate to have a dedicated script to serve the information you are after. Unless you need to parse exactly what's on a page as it is served, extracting data from HTML like this could be quite wasteful.
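Putting the pieces together for the snippet in the question, a minimal sketch might look like this. It uses libxml_use_internal_errors() as one way to silence the parser warnings, and it adds a null check that the one-liner above leaves out; the variable names are mine.

```php
<?php
// The markup from the question, wrapped in <html> for completeness.
$html = '<html><body><img id="lga" src="snail.png" /></body></html>';

// Collect parser warnings internally instead of emitting them.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// getElementById returns null when no element has that id.
$image = $dom->getElementById('lga');
$imageSrc = $image !== null ? $image->getAttribute('src') : null;

echo $imageSrc, "\n";
```

For a remote page, $html would come from file_get_contents($url) or cURL rather than a literal string.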
I want to get the src of an image based on its class or id.
For example, an HTML page has many <img src="url"> tags, but only one has a class or id:
<img src="url" class="image" or id="image">
How do I get the src attribute of the image that has a specific class or id?
Please use regex, not DOM.
Let me explain why I don't want to use DOM or other libraries: I'm getting an HTML page from another site that does not allow fopen, file_get_contents or DOM, so only cURL works. I have a reason for not using libraries like simplehtmldom: sometimes it is impossible to fetch the remote HTML page with them, and I have to write some scripts myself.
You say that you don't want to use DOM libraries because you need to use cURL. That's fine -- DOMDocument::loadHTML and simplexml_load_string both take string arguments. So you can get your string from cURL and load it into your DOM library.
For instance:
$html = curl_exec($ch); // assuming CURLOPT_RETURNTRANSFER
$dom = new DOMDocument;
$dom->loadHTML($html); // load the string from cURL into the DOMDocument object
// using an ID
$el = $dom->getElementById('image');
// using a class
$xpath = new DOMXPath($dom);
$els = $xpath->query('//img[@class="image"]');
$el = $els->item(0);
$src = $el->getAttribute('src');
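One caveat with the class lookup: @class="image" only matches when the attribute is exactly "image". If the element might carry several classes, a common XPath 1.0 idiom is to match the class as a whitespace-separated token. The sketch below uses made-up markup (the thumb class and the file names are placeholders) to show the difference.

```php
<?php
// Made-up markup: the target image carries two classes.
$html = '<html><body>'
      . '<img src="a.png">'
      . '<img src="b.png" class="thumb image">'
      . '</body></html>';

libxml_use_internal_errors(true); // keep parse warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Exact comparison: misses class="thumb image".
$exact = $xpath->query('//img[@class="image"]');

// Token comparison: pads the class list with spaces and looks for " image ".
$token = $xpath->query(
    '//img[contains(concat(" ", normalize-space(@class), " "), " image ")]'
);

$src = $token->length > 0 ? $token->item(0)->getAttribute('src') : null;

printf("exact=%d token=%d src=%s\n", $exact->length, $token->length, $src);
```

A plain contains(@class, "image") would also match classes such as "images", which is why the padded form is usually preferred.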
If you absolutely have to use regex, here it is:
<img(?:[^>]+src="(.+?)"[^>]+(?:id|class)="image"|[^>]+(?:id|class)="image"[^>]+src="(.+?)")
That said, the right way to do it is to use jQuery or a similar DOM-parsing technique. Don't use the regex unless you have a very good reason to because it will miss many cases (for example, it won't work if single quotes are used instead of double quotes or if there are spaces before "image").
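For completeness, here is a sketch of how that pattern could be used with preg_match. The helper name imageSrc and the sample tags are my own; the pattern is the one above, wrapped in ~ delimiters. It also shows the single-quote limitation in practice.

```php
<?php
// The pattern from the answer, split across two lines for readability.
$pattern = '~<img(?:[^>]+src="(.+?)"[^>]+(?:id|class)="image"'
         . '|[^>]+(?:id|class)="image"[^>]+src="(.+?)")~';

// Hypothetical helper: return the captured src, or null when nothing matches.
function imageSrc(string $pattern, string $html): ?string
{
    if (preg_match($pattern, $html, $m) !== 1) {
        return null;
    }
    // Group 2 is set only when id/class comes before src; otherwise use group 1.
    return $m[2] ?? $m[1];
}

$srcFirst = imageSrc($pattern, '<img src="a.png" class="image">');
$srcLast  = imageSrc($pattern, '<img id="image" src="b.png">');
$single   = imageSrc($pattern, "<img src='c.png' class='image'>");

var_dump($srcFirst, $srcLast, $single);
```

The third call finds nothing because the pattern only knows about double quotes, which is one of the cases the warning above is about.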