I have to use crawlers for my project.
I've used simple dom class to get all the links from a page.
Now I want to filter only those links which are of the form "/questions/3904482/<title of the question".
Here's my attempt:
include_once('simple_html_dom.php');
$html = new simple_html_dom();
$html->load_file('http://stackoverflow.com/questions?sort=newest');
$pat='#^/question/([0-9]+)/#';
foreach($html->find('a') as $link)
{
echo preg_match($pat, $link->href);
{
echo $link->href."<br>";
}
}
All the links get filtered out.
you say the url is question*s* but your pattern shows no s
Also, it looks like you should be using if not echo
include_once('simple_html_dom.php');
$html = new simple_html_dom();
$html->load_file('http://stackoverflow.com/questions?sort=newest');
$pat='#^/questions/([0-9]+)/#';
foreach($html->find('a') as $link)
{
if ( preg_match($pat, $link->href) )
{
echo $link->href."<br>";
}
}
You can take advantage of DOM and XPath:
<?php
$dom = new DOMDocument;
#$dom->loadHTMLFile('http://stackoverflow.com/questions?sort=newest');
$xpath = new DOMXPath($dom);
$questions = $xpath->query("//a[contains(#href, '/questions/') and not(contains(#href, '/tagged/')) and not(contains(#href, '/ask'))]");
foreach ($questions as $question) {
print "{$question->getAttribute('href')} => {$question->nodeValue}";
}
Related
good day Sir/Maam.
I have a certain html attribute that I want to search from the external website
I want to get the a href value but the problem is the id or class or name is random.
<div class="static">
Dynamic
</div>
This code should display all the hrefs in http://example.com
In this case I use DOMDocument and XPath to select the elements you want to access because it's very flexible and easy to use.
<?php
$html = file_get_contents("http://example.com");
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DomXPath($doc);
$nodeList = $xpath->query("//a/#href");
print_r($nodeList);
// To access the values inside nodes
foreach($nodeList as $node){
echo "<p>" . $node->nodeValue . "</p>";
}
use jquery to get the value as follow:
var link = $(".static>a").attr("href");
You can use PHP DOMDocument:
<?php
$exampleurl = "http://YourDomain.com"; //set your url
$filterClass = "dynamicclass";
$dom = new DOMDocument('1.0');
#$dom->loadHTMLFile($exampleurl);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$href = $element->getAttribute('href'); // all href
$class = $element->getAttribute('class');
if($class==$filterClass){
echo $href;
}
}
?>
I just try to create small simplephpdome
target is
<ul id=filter><li><a href="url1"></li><li><a href="url2"></li></ul>
<ul id=filter><li><a href="url3"></li><li><a href="url4"></li></ul>
How to get just first li result for every ul?
I have try this
$html = file_get_html($url);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$first_list_links = $xpath->evaluate('//ul[#id="filter"]/li/a');
foreach($first_list_links as $links) {
echo $dom->saveHTML($links);
}
but all li still included
You can achieve this using the PHP Simple HTML DOM Parser :
PHP
$html = file_get_html('<ul class="filter"><li><a href="url1"></li><li><a href="url2"></li></ul><ul class="filter"><li><a href="url3"></li><li><a href="url4"></li></ul>');
$urls = [];
foreach($html->find('.filter') as $element) {
$url = $element->firstChild()->find('a', 0)->href;
if (!in_array($url, $urls)) {
echo $url . "<br/>";
$urls[] = $url;
}
}
should output :
url1
url2
I am trying to extract specific type of links from the webpage using php
links are like following..
http://www.example.com/pages/12345667/some-texts-available-here
I want to extract all links like in the above format.
maindomain.com/pages/somenumbers/sometexts
So far I can extract all the links from the webpage, but the above filter is not happening. How can i acheive this ?
Any suggestions ?
<?php
$html = file_get_contents('http://www.example.com');
//Create a new DOM document
$dom = new DOMDocument;
#$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
?>
You can use DOMXPath and register a function with DOMXPath::registerPhpFunctions to use it after in an XPATH query:
function checkURL($url) {
$parts = parse_url($url);
unset($parts['scheme']);
if ( count($parts) == 2 &&
isset($parts['host']) &&
isset($parts['path']) &&
preg_match('~^/pages/[0-9]+/[^/]+$~', $parts['path']) ) {
return true;
}
return false;
}
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($filename);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPhpFunctions('checkURL');
$links = $xp->query("//a[php:functionString('checkURL', #href)]");
foreach ($links as $link) {
echo $link->getAttribute('href'), PHP_EOL;
}
In this way you extract only the links you want.
This is a slight guess, but if I got it wrong you can still see the way to do it.
foreach ($links as $link){
//Extract and show the "href" attribute.
If(preg_match("/(?:http.*)maindomain\.com\/pages\/\d+\/.*/",$link->getAttribute('href')){
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
}
You already use a parser, so you might step forward and use an xpath query on the DOM. XPath queries offer functions like starts-with() as well, so this might work:
$xpath = new DOMXpath($dom);
$links = $xpath->query("//a[starts-with(#href, 'maindomain.com')]");
Loop over them afterwards:
foreach ($links as $link) {
// do sth. with it here
// after all, it is a DOMElement
}
How do I echo and scrape a div class? I tried this but it doesn't work. I am using cURL to establish the connection. How do I echo it? I want it just how it is on the actual page.
$document = new DOMDocument();
$document->loadHTML($html);
$selector = new DOMXPath($document);
$anchors = $selector->query("/html/body//div[#class='resultitem']");
//a URL you want to retrieve
foreach($anchors as $a) {
echo $a;
}
Neighbor,
I just made this snippet below, that uses your logic, and some tweaks to display the specified class from the webpage in the get_contents function.
Maybe you can plug in your values and try it?
(Note: I put the error checking in there to see a few bugs. It can be helpful to use that as you tweak. )
<?php
error_reporting(E_ALL);
ini_set('display_errors', '1');
$url = "http://www.tizag.com/cssT/cssid.php";
$class_to_scrape="display";
$html = file_get_contents($url);
$document = new DOMDocument();
$document->loadHTML($html);
$selector = new DOMXPath($document);
$anchors = $selector->query("/html/body//div[#class='". $class_to_scrape ."']");
echo "ok, no php syntax errors. <br>Lets see what we scraped.<br>";
foreach ($anchors as $node) {
$full_content = innerHTML($node);
echo "<br>".$full_content."<br>" ;
}
/* this function preserves the inner content of the scraped element.
** http://stackoverflow.com/questions/5349310/how-to-scrape-web-page-data-without-losing-tags
** So be sure to go and give that post an uptick too:)
**/
function innerHTML(DOMNode $node)
{
$doc = new DOMDocument();
foreach ($node->childNodes as $child) {
$doc->appendChild($doc->importNode($child, true));
}
return $doc->saveHTML();
}
?>
My PHP code
$dom = new DOMDocument();
#$dom->loadHTML($file);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[#class="text"]');
foreach ($tags as $tag) {
echo $tag->textContent;
}
What I'm trying to do here is to get the content of the div that has class 'text' but the problem when I loop and echo the results I only get the text I can't get the HTML code with images and all the HTML tags like p, br, img... etc i tried to use $tag->nodeValue; but also nothing worked out.
Personally, I like Simple HTML Dom Parser.
include "lib.simple_html_dom.php"
$html = str_get_html($file);
foreach($html->find('div.text') as $e){
echo $e->innertext;
}
Pretty simple, huh? It accommodates selectors like jQuery :)
What you need to do is create a temporary document, add the element to that and then use saveHTML():
foreach ($tags as $tag) {
$doc = new DOMDocument;
$doc->appendChild($doc->importNode($tag, true));
$html = $doc->saveHTML();
}
I found this snippet at http://www.php.net/manual/en/class.domelement.php:
<?php
function getInnerHTML($Node)
{
$Body = $Node->ownerDocument->documentElement->firstChild->firstChild;
$Document = new DOMDocument();
$Document->appendChild($Document->importNode($Body,true));
return $Document->saveHTML();
}
?>
Not sure if it works though.