I'm trying to understand how to scrape decoded phone numbers from a yellow pages website with PHP and cURL.
Here is an example URL:
https://www.gelbeseiten.de/test
Normally you can technically do it with something like this:
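$ch = curl_init('https://www.gelbeseiten.de/test');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$page = curl_exec($ch);
if (preg_match('#example html code (.*) example html code#', $page, $match)) {
    $result = $match[1];
    echo $result;
}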
But on the page mentioned above you cannot directly find the phone number in the HTML code. There must be a way to get the phone number.
Can you please help me out?
Best regards,
Jennifer
Don't use regex to parse HTML; use an HTML parser like DOMDocument, e.g.:
$html = file_get_contents("https://www.gelbeseiten.de/test");
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//span[contains(@class,"nummer")]') as $item) {
print trim($item->textContent);
}
Output:
(0211) 4 08 05(0211) 4 08 05(0211) 4 08 05(0211) 4 08 05(0231) 9 79 76(0231)...
As suggested in a comment - using an XPath expression yields the phone numbers as desired.
$url='https://www.gelbeseiten.de/test';
$dom=new DOMDocument;
$dom->loadHTMLFile( $url );
$xp=new DOMXpath( $dom );
$query='//li[@class="phone"]';
$col=$xp->query($query);
if( $col ){
foreach( $col as $node )echo $node->nodeValue . "<br />";
}
$dom = $xp = $col = null;
I currently got this far in scraping with htmldom (as far as examples go)
<?php
require 'simple_html_dom.php';
$html = file_get_html('https://nitter.absturztau.be/chillartaholic');
$title = $html->find('title', 0);
$image = $html->find('img', 0);
echo $title->plaintext."<br>\n";
echo $image->src;
?>
However instead of retrieving a title and image,
I'd like to instead get all lines in the target page that begin with:
<a class="tweet-link"
and display the lines scraped - in their entirety - top to bottom below.
The first scraped line would then be:
> <a class="tweet-link"
> href="/ChillArtaholic/status/1413973360841744390#m"></a>
Is this possible with htmldom, or are there limitations on the number of lines that can be scraped?
Strangely enough, the answer from yesterday is gone.
This was the consensus that works
(although their answer had many other approaches) :/
<?php
$url = 'https://nitter.absturztau.be/chillartaholic';
$html = file_get_contents($url);
$dom = new DOMDocument();
@$dom->loadHTML($html); // @ suppresses warnings from invalid HTML
$xpath = new DOMXPath($dom);
// select every <a> whose class attribute is exactly "tweet-link"
$nodes = $xpath->query('//a[@class="tweet-link"]');
foreach ($nodes as $node) {
    echo $node->getAttribute('href'), '<br>';
}
?>
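The question asked for the matching lines "in their entirety", not just the href. A minimal sketch of one way to get that is to serialize each matched node with DOMDocument::saveHTML(). The sample markup below is an assumption modeled on the line quoted in the question; with the live page, $html would come from file_get_contents() as above.

```php
<?php
// Hypothetical sample markup, modeled on the line quoted in the question.
$html = '<div class="timeline-item">'
      . '<a class="tweet-link" href="/ChillArtaholic/status/1413973360841744390#m"></a>'
      . '<a href="/other">not a tweet link</a>'
      . '</div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$lines = [];
foreach ($xpath->query('//a[@class="tweet-link"]') as $node) {
    // saveHTML($node) serializes only this element, tag and attributes included
    $lines[] = $dom->saveHTML($node);
}

// htmlspecialchars() makes the markup visible when the result is shown in a browser
foreach ($lines as $line) {
    echo htmlspecialchars($line), "<br>\n";
}
```

There is no fixed limit on the number of matched nodes; the loop simply visits every node the query returns.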
I was trying to scrape the data from "non-secured" url that is using 'http' instead of 'https'.
Here is the code
function display_html_info2() {
$html = file_get_contents('http://adamsonsgroup.com/goldrates/');
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$h3_element = $xpath->query('/html/body/div[1]/div/div[1]/div[3]/div/div[1]/table/tbody/tr[2]/td[1]/h3')->item(0);
return $h3_element->nodeValue;
}
add_shortcode('shortcode_name2', 'display_html_info2');
I have also tried using XPath
//*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tbody/tr[2]/td[1]/h3
In both cases the output is blank, i.e. no value is returned.
Please let me know how to make this work.
I have included html_dom_parser.php.
I tried the code above, but it produces no value. It shows a blank space where I use the shortcode [shortcode_name2] to display the output.
Additional
I have tried @Pinke Helga's method, but it does not work for me. This is what I did:
declare(strict_types = 1);
function display_html_info2() {
$html = file_get_contents('http://adamsonsgroup.com/goldrates/');
if (!is_string($html)) {
return 'Error: Could not retrieve the HTML content.';
}
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$h3_element = $xpath->query('//*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tr[2]/td[1]/h3')->item(0);
return $h3_element->nodeValue;
}
echo display_html_info2();
add_shortcode('shortcode_name2', 'display_html_info2');
And that's what I got. "Error: Could not retrieve the HTML content."
It looks as if you generated the XPath expression with the browser dev tools. The browser adds elements that are missing from the markup when it builds the DOM; there is no <tbody> in the original source.
Use the XPath expression //*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tr[2]/td[1]/h3
Complete code:
<?php declare(strict_types = 1);
function display_html_info2() {
$html = file_get_contents('http://adamsonsgroup.com/goldrates/');
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$h3_element = $xpath->query('//*[@id="myCarousel"]/div/div[1]/div[3]/div/div[1]/table/tr[2]/td[1]/h3')->item(0);
// var_dump($h3_element);
return $h3_element->nodeValue;
}
echo display_html_info2(); // DEBUG output
Current result:
21.898 OMR
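The tbody point can be checked without touching the live site. The sketch below uses a small made-up table (the real page structure is an assumption) to show that a path containing tbody finds nothing in markup parsed by DOMDocument::loadHTML(), while a path built with the descendant axis (//) matches whether or not a tbody is present.

```php
<?php
// Hypothetical sample markup; the nesting of the real page is not reproduced here.
$html = '<div id="myCarousel"><table>'
      . '<tr><td><h3>Gold rate</h3></td></tr>'
      . '<tr><td><h3>21.898 OMR</h3></td></tr>'
      . '</table></div>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Path as copied from browser dev tools: the source has no tbody, so nothing matches.
$withTbody = $xpath->query('//*[@id="myCarousel"]/table/tbody/tr[2]/td[1]/h3');
$withTbodyCount = $withTbody->length;

// Descendant axis: tr[2] is found directly under table or under a tbody.
$robust = $xpath->query('//*[@id="myCarousel"]//table//tr[2]/td[1]/h3');
$value = $robust->length > 0 ? trim($robust->item(0)->nodeValue) : null;

echo $withTbodyCount, "\n", $value, "\n";
```

Checking ->length before calling item(0) also avoids reading nodeValue from null when the query matches nothing.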
I'm building an MTDB database with PHP and need to scrape a specific tag from a URL.
Tag to get from url
vars.disqus = '';
vars.lists = [];
vars.titleId = '35079';
vars.trailersPlayer = 'default';
vars.userId = '907791';
vars.title = {"id":35079,"title":"Family Vacation","trailer":35097.flv,"timing":0.50sec}
I need the
"id":35079,"title":"Family Vacation","trailer":35097.flv,"timing":0.50sec
My code:
$html = 'myurl';
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->loadHTMLFile($html);
libxml_clear_errors();
$xp = new DOMXpath($dom);
$nodes = $xp->query('//script[@\'id','trailer','title');
echo $nodes->item(0)->nodeValue;
The "tag" is not HTML; it looks like JavaScript code.
To extract the string, simply use a regex:
preg_match('/title\s*=\s*\{([^}]+)}/', $str, $matches);
var_dump($matches[1]);
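As a quick check of that pattern, here is a minimal sketch run against the snippet from the question. $str is assumed to hold the page source; here it is filled with a shortened copy of the sample lines instead of a fetched page.

```php
<?php
// Sample source copied from the question; in practice $str would be the fetched page.
$str = <<<'HTML'
<script>
vars.titleId = '35079';
vars.userId = '907791';
vars.title = {"id":35079,"title":"Family Vacation","trailer":35097.flv,"timing":0.50sec}
</script>
HTML;

$inner = null;
// "titleId =" does not match because "Id" sits between "title" and "=";
// the first match is "title = {", and [^}]+ captures everything up to the closing brace.
if (preg_match('/title\s*=\s*\{([^}]+)}/', $str, $matches)) {
    $inner = $matches[1];
}

echo $inner, "\n";
```

The pattern stops at the first closing brace, so it would need adjusting if the object ever contained nested braces.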
I'm trying to get the weather data from http://www.weather-forecast.com/locations/Berlin/forecasts/latest
but preg_match returns nothing:
<?php
$contents=file_get_contents("http://www.weather-forecast.com/locations/Berlin/forecasts/latest");
preg_match('/3Day Weather Forecast Summary:<\/b><span class="phrase">(.*?)</s', $contents, $matches);
print_r($matches)
?>
Don't use a regex to parse HTML; use an HTML parser like DOMDocument:
$contents = file_get_contents("http://www.weather-forecast.com/locations/Berlin/forecasts/latest");
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($contents);
$x = new DOMXpath($dom);
foreach($x->query('//span[contains(@class,"phrase")]') as $phase)
{
echo $phase->textContent;
}
I have this code which extracts all links from a website. How do I edit it so that it only extracts links that end in .mp3?
Here is the code:
preg_match_all("/\<a.+?href=(\"|')(?!javascript:|#)(.+?)(\"|')/i", $html, $matches);
Update:
A nice solution would be to use DOM together with XPath, as @zerkms mentioned in the comments:
$doc = new DOMDocument();
$doc->loadHTML($yourHtml);
$xpath = new DOMXPath($doc);
// DOMXPath only supports XPath 1.0, which has no ends-with(); emulate it with substring()
$links = $xpath->query('//a[substring(@href, string-length(@href) - 3) = ".mp3"]/@href');
Original Answer:
I would use DOM for this:
$doc = new DOMDocument();
$doc->loadHTML($yourHtml);
$links = array();
foreach($doc->getElementsByTagName('a') as $elem) {
if($elem->hasAttribute('href')
&& preg_match('/.*\.mp3$/i', $elem->getAttribute('href'))) {
$links []= $elem->getAttribute('href');
}
}
var_dump($links);
I would prefer XPath, which is meant to parse XML/xHTML:
$DOM = new DOMDocument();
@$DOM->loadHTML($html); // use the @ to suppress warnings from invalid HTML
$XPath = new DOMXPath($DOM);
$links = array();
$link_nodes = $XPath->query('//a[contains(@href, ".mp3")]');
foreach($link_nodes as $link_node) {
$source = $link_node->getAttribute('href');
// do some extra work to make sure .mp3 is at the end of the string
$links[] = $source;
}
XPath 2.0 has an ends-with() function that could replace contains(), but PHP's DOMXPath only implements XPath 1.0. There you might want to add an extra conditional to make sure the .mp3 is at the end of the string. It may not be necessary though.
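The extra conditional could be done on the PHP side. The following is a minimal sketch under a few assumptions: the sample HTML is made up, the check is case-insensitive, and only the path part of the URL is tested so that a query string after the file name does not hide the extension. It selects every link with an href rather than using contains(), because contains() is case-sensitive and would skip ".MP3".

```php
<?php
// Hypothetical sample HTML; $html would normally hold the fetched page.
$html = '<a href="/audio/one.mp3">1</a>'
      . '<a href="/audio/two.MP3?download=1">2</a>'
      . '<a href="/about/mp3-players.html">3</a>'
      . '<a href="/files/.mp3/readme.txt">4</a>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$links = [];
foreach ($xpath->query('//a[@href]') as $a) {
    $href = $a->getAttribute('href');
    // Look only at the path component, ignoring any query string or fragment.
    $path = (string) parse_url($href, PHP_URL_PATH);
    if (strtolower(substr($path, -4)) === '.mp3') {
        $links[] = $href;
    }
}

print_r($links);
```

Links 3 and 4 contain "mp3" or ".mp3" somewhere in the URL but do not end with it, so they are left out.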