I'm trying to get the weather data from http://www.weather-forecast.com/locations/Berlin/forecasts/latest
but preg_match returns nothing:
<?php
$contents=file_get_contents("http://www.weather-forecast.com/locations/Berlin/forecasts/latest");
preg_match('/3Day Weather Forecast Summary:<\/b><span class="phrase">(.*?)</s', $contents, $matches);
print_r($matches)
?>
Don't use a regex to parse HTML; use an HTML parser like DOMDocument:
$contents = file_get_contents("http://www.weather-forecast.com/locations/Berlin/forecasts/latest");
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($contents);
$x = new DOMXpath($dom);
foreach($x->query('//span[contains(@class,"phrase")]') as $phase)
{
echo $phase->textContent;
}
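One caveat with contains(@class, "phrase") is that it is a substring test, so it also matches class names such as "phrasebook". The following is a minimal sketch of a stricter, token-based match. The sample markup is invented for illustration and is not taken from the weather site.

```php
<?php
// Sample markup (an assumption for illustration, not the real page).
$html = '<p><span class="phrase">Mostly dry.</span>'
      . '<span class="b-forecast phrase">Warm.</span>'
      . '<span class="phrasebook">Unrelated</span></p>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Substring match: also hits class="phrasebook".
$loose = $xpath->query('//span[contains(@class, "phrase")]');

// Token match: pad the class list with spaces and look for " phrase ".
$exact = $xpath->query(
    '//span[contains(concat(" ", normalize-space(@class), " "), " phrase ")]'
);

foreach ($exact as $span) {
    echo $span->textContent, PHP_EOL;
}
```

On a page where no other class contains the word "phrase", the simpler expression from the answer is enough.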
Related
I was trying to scrape data from a non-secure URL that uses 'http' instead of 'https'.
Here is the code:
function display_html_info2() {
$html = file_get_contents('http://adamsonsgroup.com/goldrates/');
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$h3_element = $xpath->query('/html/body/div[1]/div/div[1]/div[3]/div/div[1]/table/tbody/tr[2]/td[1]/h3')->item(0);
return $h3_element->nodeValue;
}
add_shortcode('shortcode_name2', 'display_html_info2');
I have also tried using XPath
//*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tbody/tr[2]/td[1]/h3
In both cases the output is blank; no value is returned.
Please let me know how this will work.
I have included the html_dom_parser.php
I tried the above-mentioned code, but it gives no value as output. Instead, it shows a blank space where I use the shortcode [shortcode_name2] to show the output of the above code.
Additional
I have tried @Pinke Helga's method, but it does not work for me. This is what I did:
declare(strict_types = 1);
function display_html_info2() {
$html = file_get_contents('http://adamsonsgroup.com/goldrates/');
if (!is_string($html)) {
return 'Error: Could not retrieve the HTML content.';
}
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$h3_element = $xpath->query('//*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tr[2]/td[1]/h3')->item(0);
return $h3_element->nodeValue;
}
echo display_html_info2();
add_shortcode('shortcode_name2', 'display_html_info2');
And this is what I got: "Error: Could not retrieve the HTML content."
It looks as if you generated the XPath expression from the browser dev tools. The browser adds some elements to the HTML it parses; there is no <tbody> in the original source.
Use the XPath expression //*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tr[2]/td[1]/h3
Complete code:
<?php declare(strict_types = 1);
function display_html_info2() {
$html = file_get_contents('http://adamsonsgroup.com/goldrates/');
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$h3_element = $xpath->query('//*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tr[2]/td[1]/h3')->item(0);
// var_dump($h3_element);
return $h3_element->nodeValue;
}
echo display_html_info2(); // DEBUG output
Current result:
21.898 OMR
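The <tbody> difference can be reproduced without touching the network. This is a minimal sketch on invented sample markup (the table content and the helper name first_text are assumptions for illustration). It also guards against an empty result, which is what causes the blank output or a null-property error when the expression matches nothing.

```php
<?php
// Sample markup without <tbody>, similar in shape to what a server might send.
$html = '<div id="myCarousel"><table>'
      . '<tr><td><h3>Header</h3></td></tr>'
      . '<tr><td><h3>21.898 OMR</h3></td></tr>'
      . '</table></div>';

// Hypothetical helper: text of the first matching node, or null if none.
function first_text(DOMXPath $xpath, string $expr): ?string
{
    $nodes = $xpath->query($expr);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Expression copied from browser dev tools: includes tbody, finds nothing.
$withTbody = first_text($xpath, '//*[@id="myCarousel"]/table/tbody/tr[2]/td[1]/h3');

// Same path without tbody: matches the second row.
$withoutTbody = first_text($xpath, '//*[@id="myCarousel"]/table/tr[2]/td[1]/h3');

var_dump($withTbody, $withoutTbody);
```

Returning null instead of dereferencing item(0) directly lets a shortcode callback fall back to a message of its own when the page layout changes.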
I'm building an MTDB database with PHP and need to scrape a specific tag from the URL.
Tag to get from url
vars.disqus = '';
vars.lists = [];
vars.titleId = '35079';
vars.trailersPlayer = 'default';
vars.userId = '907791';
vars.title = {"id":35079,"title":"Family Vacation","trailer":35097.flv,"timing":0.50sec}
I need the
"id":35079,"title":"Family Vacation","trailer":35097.flv,"timing":0.50sec
My code:
$html = 'myurl';
libxml_use_internal_errors(TRUE); $dom = new DOMDocument; $dom->loadHTMLFile($html); libxml_clear_errors();
$xp = new DOMXpath($dom); $nodes = $xp->query('//script[#\'id','trailer','title');
echo $nodes->item(0)->nodeValue;
The "tag" is not HTML; it looks like JavaScript code.
To extract this string, simply use a regex:
preg_match('/title\s*=\s*\{([^}]+)}/', $str, $matches);
var_dump($matches[1]);
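For reference, here is the same pattern applied to a string shaped like the snippet in the question. The variable $str stands in for the downloaded page source, which is an assumption since the question does not show how the page is fetched.

```php
<?php
// Stand-in for the page source (copied from the snippet in the question).
$str = <<<'JS'
vars.titleId = '35079';
vars.trailersPlayer = 'default';
vars.title = {"id":35079,"title":"Family Vacation","trailer":35097.flv,"timing":0.50sec}
JS;

// "titleId" does not match because "title" must be followed by optional
// whitespace and "=", so the first hit is the vars.title assignment.
if (preg_match('/title\s*=\s*\{([^}]+)}/', $str, $matches)) {
    $payload = $matches[1];
    echo $payload, PHP_EOL;
} else {
    $payload = null;
    echo 'vars.title not found', PHP_EOL;
}
```

The captured text is not valid JSON as shown (35097.flv and 0.50sec are unquoted), so json_decode would not accept it without further cleanup.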
I have the following source code:
<?php
function getTerms()
{
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML('https://charitablebookings.com/terms'); // loads your HTML
$xpath = new DOMXPath($doc);
// select every div whose class attribute is exactly "terms-conditions"
$nodeList = $xpath->query("//div[@class='terms-conditions']");
$temp_dom = new DOMDocument();
$node = $nodeList->item(0);
$temp_dom = new DOMDocument();
foreach($nodeList as $n) $temp_dom->appendChild($temp_dom->importNode($n,true));
print_r($temp_dom->saveHTML());
}
getTerms();
?>
With it I'm trying to get text from a web page by selecting a specific class. Nothing shows up in my browser when I print_r the temp_dom, and $node is null. What am I doing wrong?
Thanks for your time
The first issue is that DOMDocument's loadHTML method expects HTML content as its first parameter, not a URL.
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$html = file_get_contents('https://charitablebookings.com/terms');
$doc->loadHTML($html);
The second problem is with your XPath expression, $xpath->query("//div[@class='terms-conditions']"): there is no div with the class terms-conditions in the document (it probably gets added by some JavaScript loader).
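The first point can be checked offline. In this sketch the URL string is passed to loadHTML as in the question, and is therefore parsed as literal text, while the second document is built from invented sample markup that stands in for a downloaded page.

```php
<?php
libxml_use_internal_errors(true);   // keep parser warnings out of the output

$query = "//div[@class='terms-conditions']";

// Mistake from the question: the URL itself is parsed as if it were markup.
$wrong = new DOMDocument();
$wrong->loadHTML('https://charitablebookings.com/terms');
$wrongHits = (new DOMXPath($wrong))->query($query)->length;

// What loadHTML expects: a string of markup (sample content, assumed).
$markup = '<html><body><div class="terms-conditions"><p>Sample terms.</p></div></body></html>';
$right = new DOMDocument();
$right->loadHTML($markup);
$nodes = (new DOMXPath($right))->query($query);
$rightHits = $nodes->length;

// saveHTML() accepts a node, so no temporary document is needed.
$inner = $rightHits > 0 ? $right->saveHTML($nodes->item(0)) : '';
libxml_clear_errors();

echo $wrongHits, ' ', $rightHits, PHP_EOL;
echo $inner, PHP_EOL;
```

If the div is injected by JavaScript on the real site, the downloaded markup will not contain it and the second query would also come back empty.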
How do I stop DOMDocument from having a mind of its own?
$dom = new DOMDocument();
$validHtml = '<body>Test</body>';
$dom->loadHTML($validHtml);
After loading, the anchor attribute is encoded. I want it not to do this.
$body = $dom->saveHTML();
var_dump($body);
//<body>Test</body>
I realize this has been covered before, but every where I look, it's more useless Ninja code. Any help appreciated.
Here's how I fixed my own problem. Basically, I decided to strip all the template tags out of the markup and put in placeholders that I can later use to put them back:
$validHtml = '<body>Test</body>';
$matches = array();
preg_match_all('/{{[^}]+}}/',$validHtml, $matches);
$matches = $matches[0];
if (count($matches)>0){
foreach ($matches as $i=>$match){
$validHtml = str_replace($match, "<!--INDEX-$i-->", $validHtml);
}
}
$dom = new DOMDocument();
$dom->loadHTML($validHtml);
... //do processing on the loaded dom
Later on after manipulating the dom, I put back all the matches:
$validHtml = $dom->saveHTML();
if (count($matches)>0){
foreach ($matches as $i=>$match){
$validHtml = str_replace(array("<!--INDEX-$i-->", "<!--INDEX-$i-->"), $match, $validHtml);
}
}
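Here is a self-contained sketch of the same placeholder idea. The sample markup, the {{url}} and {{label}} tags, and the token format TPLTOKEN0X are all assumptions for illustration. Plain alphanumeric tokens are used instead of HTML comments because they pass through attribute values unchanged.

```php
<?php
// Sample markup with template tags in an attribute and in text (assumed).
$html = '<body><a href="{{url}}">{{label}}</a></body>';

// 1. Swap every {{...}} tag for an alphanumeric token.
preg_match_all('/{{[^}]+}}/', $html, $found);
$tokens = [];
foreach ($found[0] as $i => $match) {
    $token = "TPLTOKEN{$i}X";          // trailing X keeps token 1 distinct from token 10
    $tokens[$token] = $match;
    $html = str_replace($match, $token, $html);
}

// 2. Load, manipulate, and serialize the DOM as usual.
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$out = $dom->saveHTML();

// 3. Put the original tags back in one pass.
$out = strtr($out, $tokens);
echo $out;
```

strtr with an array replaces all tokens in a single pass, which avoids a replaced value being rewritten again by a later iteration.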
I have this code, which extracts all links from a website. How do I edit it so that it only extracts links that end in .mp3?
Here is the code:
preg_match_all("/\<a.+?href=(\"|')(?!javascript:|#)(.+?)(\"|')/i", $html, $matches);
Update:
A nice solution would be to use DOM together with XPath, as @zerkms mentioned in the comments:
$doc = new DOMDocument();
$doc->loadHTML($yourHtml);
$xpath = new DOMXPath($doc);
// ends-with() is XPath 2.0; PHP's DOMXPath supports only XPath 1.0,
// so emulate it by comparing the last four characters of the href
$links = $xpath->query('//a[substring(@href, string-length(@href) - 3) = ".mp3"]/@href');
Original Answer:
I would use DOM for this:
$doc = new DOMDocument();
$doc->loadHTML($yourHtml);
$links = array();
foreach($doc->getElementsByTagName('a') as $elem) {
if($elem->hasAttribute('href')
&& preg_match('/.*\.mp3$/i', $elem->getAttribute('href'))) {
$links []= $elem->getAttribute('href');
}
}
var_dump($links);
I would prefer XPath, which is meant to parse XML/xHTML:
$DOM = new DOMDocument();
@$DOM->loadHTML($html); // use the @ to suppress warnings from invalid HTML
$XPath = new DOMXPath($DOM);
$links = array();
$link_nodes = $XPath->query('//a[contains(@href, ".mp3")]');
foreach($link_nodes as $link_node) {
$source = $link_node->getAttribute('href');
// do some extra work to make sure .mp3 is at the end of the string
$links[] = $source;
}
XPath 2.0 has an ends-with() function that can replace contains(). With XPath 1.0, you might want to add an extra conditional to make sure the .mp3 is at the end of the string. It may not be necessary, though.
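The extra conditional could be done on the PHP side, roughly as follows. This is a sketch on invented sample links. It compares only the path part of each URL, so a query string after the file name does not hide the extension, and it ignores case.

```php
<?php
// Sample links (assumed for illustration).
$html = '<a href="/audio/one.mp3">1</a><a href="/mp3-list.html">2</a>'
      . '<a href="/audio/TWO.MP3?dl=1">3</a><a>4</a>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$links = [];
foreach ($xpath->query('//a[@href]') as $a) {
    $href = $a->getAttribute('href');
    // Path without query string or fragment; '' if the URL cannot be parsed.
    $path = (string) parse_url($href, PHP_URL_PATH);
    if (strcasecmp(substr($path, -4), '.mp3') === 0) {
        $links[] = $href;
    }
}
print_r($links);
```

Selecting //a[@href] and filtering in PHP avoids the case sensitivity of contains(), at the cost of iterating over every link on the page.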