How to Extract HTML using XPath like YQL using php? - php

I am using YQL (https://developer.yahoo.com/yql/), but it is rate limited. The per-application limit (identified by your Access Key) is 100,000 calls per day, and the per-IP limits are 2,000 calls per hour for /v1/public/ and 20,000 calls per hour for /v1/yql/.
I need unlimited queries. How can I extract HTML using XPath, like YQL does, using PHP?
$homepage = file_get_contents('https://google.com');
$dom = new DOMDocument();
$dom->loadHTML($homepage);
$xpath = new DOMXPath($dom);
$result = '';
foreach($xpath->evaluate('div') as $childNode) {
$result .= $dom->saveHtml($childNode);
}
var_dump($result);
I found this example on the web, but it is not working.
Edit
$homepage = file_get_contents('https://google.com');
$dom = new DOMDocument();
$dom->loadHTML($homepage);
$xpath = new DOMXPath($dom);
$result = '';
foreach($xpath->query('//a[@class="touch"]') as $childNode) {
// If the output is <a class="touch" href="url"><span alt="demo1" title="title2">Content</span> some</a>, how do I get the href/url and the alt/title attributes of the child span tag?
$result .= $dom->saveHtml($childNode);
}
var_dump($result);
If possible, how can I also convert the full HTML to JSON/XML, like YQL does, using PHP?

There are several ways to do further processing; one is to run another query. To get the span node, you can use this query:
$span = $xpath->query('./span', $childNode); // all spans
$span->item(0)->attributes->getNamedItem("alt")->nodeValue; // first span
What you are doing is searching under the given node.
P.S. Don't use the attributes property as an array (attributes["attributeName"]), because it doesn't work in some versions of PHP.
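A minimal sketch of the relative-query approach described above, assuming a hypothetical inline HTML string in place of the fetched page (the URL and attribute values are made up for illustration). It reads href from each matching a element and alt/title from its first span child, then prints the result as JSON:

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body><a class="touch" href="https://example.com/page">'
      . '<span alt="demo1" title="title2">Content</span> some</a></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$rows = [];
foreach ($xpath->query('//a[@class="touch"]') as $a) {
    // The second argument makes the query relative to the current <a> node.
    $span = $xpath->query('./span', $a)->item(0);
    $rows[] = [
        'href'  => $a->getAttribute('href'),
        'alt'   => $span ? $span->getAttribute('alt') : null,
        'title' => $span ? $span->getAttribute('title') : null,
    ];
}

echo json_encode($rows, JSON_UNESCAPED_SLASHES), "\n";
```

getAttribute() is used here instead of the attributes property; it returns an empty string when the attribute is missing, so no null check is needed on the attribute itself.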

Related

php read html and handle double id-appearance

For my project I'm reading an external website which has used the same ID twice. I can't change that.
I need the content from the second appearance of that ID but my code just results the first one and does not see the second one.
Also a count to $data results 1 but not 2.
I'm desperate. Does anyone have an idea how to access the second ID 'hours'?
<?PHP
$url = 'myurl';
$contents = file_get_contents($url);
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
$data = $dom->getElementById("hours");
echo $data->nodeValue."\n";
echo count($data);
?>
As @rickdenhaan points out, getElementById always returns a single element: the first one that has that id value. However, you can use DOMXPath to find all nodes that have a given id value and then pick out the one you want (this code finds the second one):
$url = 'myurl';
$contents = file_get_contents($url);
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$count = 0;
foreach ($xpath->query("//*[@id='hours']") as $node) {
if ($count == 1) echo $node->nodeValue;
$count++;
}
As @NigelRen points out in the comments, you can simplify this further by selecting the second match directly in the XPath, i.e.
$node = $xpath->query("(//*[@id='hours'])[2]")[0];
echo $node->nodeValue;
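For a self-contained check of the positional predicate, here is a small sketch that assumes a hypothetical two-element HTML string instead of the external site. It counts all elements with the duplicated id and reads the second one:

```php
<?php
// Hypothetical markup with the same id used twice.
$html = '<html><body><div id="hours">9-5</div><div id="hours">10-4</div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // the duplicate id would otherwise emit a warning
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// All elements carrying the id, regardless of tag name.
$all = $xpath->query("//*[@id='hours']");
// The parentheses apply [2] to the whole result set, not to each parent's children.
$second = $xpath->query("(//*[@id='hours'])[2]")->item(0);

echo $all->length, "\n";
echo $second !== null ? $second->nodeValue : 'not found', "\n";
```

The null check matters on real pages: if the id appears only once, item(0) returns null and reading nodeValue from it would fail.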

How do I extract all URL links from an RSS feed? [duplicate]

I need to extract all the links to news articles from the NY Times RSS feed to a MySQL database periodically. How do I go about doing this? Can I use some regular expression (in PHP) to match the links? Or is there some other alternative way? Thanks in advance.
UPDATE 2 I tested the code below and had to modify the
$links = $dom->getElementsByTagName('a');
and change it to:
$links = $dom->getElementsByTagName('link');
It successfully outputted the links. Good Luck
UPDATE Looks like there is a complete answer here: How do you parse and process HTML/XML in PHP.
I developed a solution so that I could recurse all the links in my website. I've removed the code which verified the domain was the same with each recursion (since the question didn't ask for this), but you can easily add one back in if you need it.
Using PHP's DOMDocument, you can parse an HTML or XML document to read its links. This is more reliable than using a regex. Try something like this:
<?php
//300 seconds = 5 minutes - or however long you need so php won't time out
ini_set('max_execution_time', 300);
// using a global to store the links in case there is recursion, it makes it easy.
// You could of course pass the array by reference for cleaner code.
$alinks = array();
// set the link to whatever you are reading
$link = "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml";
// do the search
linksearch($link, $alinks);
// show results
var_dump($alinks);
function linksearch($url, & $alinks) {
// use $queue if you want this fn to be recursive
$queue = array();
echo "<br>Searching: $url";
$href = array();
//Load the HTML page
$html = file_get_contents($url);
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('link');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
$href[] = $link->getAttribute('href');
}
foreach (array_unique($href) as $link) {
// add to list of links found
$queue[] = $link;
}
// remove duplicates
$queue = array_unique($queue);
// get links that haven't yet been processed
$queue = array_diff($queue, $alinks);
// update array passed by reference with new links found
$alinks = array_merge($alinks, $queue);
if (count($queue) > 0) {
foreach ($queue as $link) {
// recursive search - uncomment out if you use this
// remember to check that the domain is the same as the one starting from
// linksearch($link, $alinks);
}
}
}
DOM+Xpath allows you to fetch nodes using expressions.
RSS Item Links
To fetch the RSS link elements (the link for each item):
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$expression = '//channel/item/link';
foreach ($xpath->evaluate($expression) as $link) {
var_dump($link->textContent);
}
Atom Links
The atom:link elements have different semantics: they are part of the Atom namespace and are used to describe relations. NYT uses the standout relation to mark featured stories. To fetch the Atom links you need to register a prefix for the namespace. Attributes are nodes too, so you can fetch them directly:
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$expression = '//channel/item/a:link[@rel="standout"]/@href';
foreach ($xpath->evaluate($expression) as $link) {
var_dump($link->value);
}
There are other relations, such as prev and next.
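To try both expressions without a network request, the following sketch assumes a hypothetical, heavily trimmed feed embedded as a string (the URLs are placeholders, not real NYT data). It collects the plain RSS item links and the Atom standout links separately:

```php
<?php
// Hypothetical, minimal feed; real feeds carry many more elements.
$xml = <<<'XML'
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Example</title>
    <item>
      <title>First</title>
      <link>https://example.com/first</link>
      <atom:link rel="standout" href="https://example.com/first-featured"/>
    </item>
    <item>
      <title>Second</title>
      <link>https://example.com/second</link>
    </item>
  </channel>
</rss>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);
// The prefix is local to this DOMXPath object; only the namespace URI
// has to match the one declared in the feed.
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');

// Unprefixed "link" matches only elements in no namespace, so atom:link is skipped.
$itemLinks = [];
foreach ($xpath->evaluate('//channel/item/link') as $link) {
    $itemLinks[] = $link->textContent;
}

// Attribute nodes expose their text through the value property.
$standout = [];
foreach ($xpath->evaluate('//channel/item/a:link[@rel="standout"]/@href') as $href) {
    $standout[] = $href->value;
}

print_r($itemLinks);
print_r($standout);
```

Note that the feed declares the prefix atom while the expression uses a; the two do not have to agree because matching is done on the namespace URI.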
HTML Links (a elements)
The description elements contain HTML fragments. To extract the links from them you have to load the HTML into a separate DOM document.
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$expression = '//channel/item/description';
foreach ($xpath->evaluate($expression) as $description) {
$fragment = new DOMDocument();
$fragment->loadHtml($description->textContent);
$fragmentXpath = new DOMXpath($fragment);
foreach ($fragmentXpath->evaluate('//a[@href]/@href') as $link) {
var_dump($link->value);
}
}

DOMNodeList, xPath and PHP

I am parsing an HTML page with DOM and XPath in PHP.
I have to fetch a nested <Table...></table> from the HTML.
I have defined a query using FirePath in the browser which is pointing to
html/body/table[2]/tbody/tr/td[2]/table[2]/tbody/tr/td/table
When I run the code it says DOMNodeList is fetched having length 0. My objective is to spout out the queried <Table> as a string. This is an HTML scraping script in PHP.
Below is the function. How can I extract the required <table>?
$pageUrl = "http://www.boc.cn/sourcedb/whpj/enindex.html";
getExchangeRateTable($pageUrl);
function getExchangeRateTable($url){
$htmlTable = "";
$xPathTable = null;
$xPathQuery1 = "html/body/table[2]/tbody/tr/td[2]/table[2]/tbody/tr/td/table";
if(strlen($url)==0){die('Argument exception: method call [getExchangeRateTable] expects a string of URL!');}
// initialize objects
$page = tidyit($url);
$dom = new DOMDocument();
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);
// $elements is appearing as DOMNodeList
$elements = $xpath->query($xPathQuery1);
// print_r($elements);
foreach($elements as $e){
$e->firstChild->nodeValue;
}
}
Have you tried something like this?
$dom = new domDocument;
$dom->loadHTML($tes);
$dom->preserveWhiteSpace = false;
$tables = $dom->getElementsByTagName("table");
$rows = $tables->item(0)->getElementsByTagName("tr");
print_r($rows);
Remove the tbody elements from your XPath query. In most cases they are inserted by your browser, as is the case with the page you are trying to scrape.
/html/body/table[2]/tr/td[2]/table[2]/tr/td/table
This will most likely work.
However, it's probably safer to use a different XPath. The following XPath selects the first th based on its text content, then selects the parent of its tr, which is either a tbody or a table:
//th[contains(text(),'Currency Name')]/parent::tr/parent::*
The XPath query should start with a leading /, like this:
/html/...
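The tbody point can be checked with a small sketch, assuming a hypothetical two-table HTML string that, like much server-sent markup, contains no tbody tags. DOMDocument::loadHTML() keeps the tree as written, so the path copied from the browser finds nothing while the shortened path and the text-based path both match:

```php
<?php
// Hypothetical markup as a server might send it: no <tbody> in the source.
$html = '<html><body><table><tr><td>outer</td></tr></table>'
      . '<table><tr><th>Currency Name</th></tr><tr><td>USD</td></tr></table></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Path as a browser tool would report it, including the inserted tbody.
$withTbody = $xpath->query('/html/body/table[2]/tbody/tr');
// Same path with the tbody step removed.
$withoutTbody = $xpath->query('/html/body/table[2]/tr');
// Content-based path: find the header cell, then climb to the row's parent.
$byText = $xpath->query("//th[contains(text(),'Currency Name')]/parent::tr/parent::*");

echo $withTbody->length, ' ', $withoutTbody->length, "\n";
// Serialize the matched table back to an HTML string.
echo $dom->saveHTML($byText->item(0)), "\n";
```

Passing the node to saveHTML() is what turns the matched table into a string, which is what the question was ultimately after.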

How to scrape links from a page with DOM & XPath?

I have a page scraped with curl and am looking to grab all of the links with a certain id. As far as I can tell, the best way to do this is with DOM and XPath. The code below grabs a large number of the URLs, but cuts many of them off and grabs text that is not a URL.
$curl_scraped_page is the page scraped with curl.
$dom = new DOMDocument();
@$dom->loadHTML($curl_scraped_page);
$xpath = new DOMXPath($dom);
$hrefs = $xpath->evaluate("/html/body//a");
Am I on the right track? Do I just need to mess with the "/html/body//a" xpath syntax or do I need to add more to capture the id element?
You can also do it this way, and you'll get only the a tags that have both an id and an href:
$doc = new DOMDocument();
$doc->loadHTML($curl_scraped_page);
$xpath = new DOMXPath($doc);
$hrefs = $xpath->query('//a[@href][@id]');
$dom = new DOMDocument();
$dom->loadHTML($curl_scraped_page);
$links = $dom->getElementsByTagName('a');
$processed_links = array();
foreach ($links as $link)
{
if ($link->hasAttribute('id') && $link->hasAttribute('href'))
{
$processed_links[$link->getAttribute('id')] = $link->getAttribute('href');
}
}
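As a rough end-to-end sketch of the XPath variant, the code below assumes a hypothetical HTML string in place of $curl_scraped_page and hypothetical id values (foo, bar, baz). It builds an id-to-href map and also reads the href of one specific id in a single expression:

```php
<?php
// Hypothetical markup standing in for the page scraped with curl.
$html = '<html><body><a id="foo" href="/one">One</a><a href="/two">Two</a>'
      . '<a id="bar">No href</a><a id="baz" href="/three">Three</a></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Two predicates in a row act as a logical AND: both attributes must exist.
$links = [];
foreach ($xpath->query('//a[@href][@id]') as $a) {
    $links[$a->getAttribute('id')] = $a->getAttribute('href');
}

// string() makes evaluate() return a plain string instead of a node list.
$foo = $xpath->evaluate('string(//a[@id="foo"]/@href)');

print_r($links);
echo $foo, "\n";
```

When no element matches, string() yields an empty string, so the result can be tested with a simple comparison.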
This is the solution regarding your question.
http://simplehtmldom.sourceforge.net/
include('simple_html_dom.php');
$html = file_get_html('http://www.google.com/');
foreach($html->find('#www-core-css') as $e) echo $e->outertext . '<br>';
I think the easiest way is to combine the following two classes to pull information from another website:
Pull info from any HTML tag, contents or tag attribute: http://simplehtmldom.sourceforge.net/
Easy to handle curl, supports POST requests: https://github.com/php-curl-class/php-curl-class
Example:
include('path/to/curl.php');
include('path/to/simple_html_dom.php');
$url = 'http://www.example.com';
$curl = new Curl;
$html = str_get_html($curl->get($url)); //full HTML of website
$linksWithSpecificID = $html->find('a[id=foo]'); //returns array of elements
See the Simple HTML DOM Parser manual at the first link above for details on manipulating the HTML data.

Finding number of nodes in PHP, DOM, XPath

I am loading HTML into DOM and then querying it using XPath in PHP. My current problem is how do I find out how many matches have been made, and once that is ascertained, how do I access them?
I currently have this dirty solution:
$i = 0;
foreach($nodes as $node) {
echo $dom->savexml($nodes->item($i));
$i++;
}
Is there a cleaner way to find the number of nodes? I have tried count(), but that does not work.
You haven't posted any code related to $nodes so I assume you are using DOMXPath and query(), or at the very least, you have a DOMNodeList.
DOMXPath::query() returns a DOMNodeList, which has a length member. You can access it via (given your code):
$nodes->length
If you just want to know the count, you can also use DOMXPath::evaluate.
Example from PHP Manual:
$doc = new DOMDocument;
$doc->load('book.xml');
$xpath = new DOMXPath($doc);
$tbody = $doc->getElementsByTagName('tbody')->item(0);
// our query is relative to the tbody node
$query = 'count(row/entry[. = "en"])';
$entries = $xpath->evaluate($query, $tbody);
echo "There are $entries english books\n";
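Both counting approaches can be compared side by side. This sketch assumes a hypothetical inline XML string rather than the manual's book.xml; it reads the length property of the node list, asks XPath for the count directly, and then iterates over the matches without a manual index:

```php
<?php
// Hypothetical document; element and attribute names are made up for illustration.
$xml = '<books><book lang="en"/><book lang="de"/><book lang="en"/></books>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// query() returns a DOMNodeList whose length property holds the match count.
$nodes = $xpath->query('//book[@lang="en"]');
echo $nodes->length, "\n";

// evaluate() with count() returns a number (a float) without building a node list.
$count = $xpath->evaluate('count(//book[@lang="en"])');
echo $count, "\n";

// The loop variable already is the node, so no $i counter or item($i) is needed.
foreach ($nodes as $node) {
    echo $doc->saveXML($node), "\n";
}
```

As far as I know, DOMNodeList also implements Countable from PHP 7.2 on, so count($nodes) works on those versions; on older versions the length property is the way to go.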
