matching URL to a pattern using php

I have to use crawlers for my project.
I've used the Simple HTML DOM class to get all the links from a page.
Now I want to keep only those links which are of the form "/questions/3904482/<title of the question>".
Here's my attempt:
include_once('simple_html_dom.php');
$html = new simple_html_dom();
$html->load_file('http://stackoverflow.com/questions?sort=newest');
$pat='#^/question/([0-9]+)/#';
foreach($html->find('a') as $link)
{
echo preg_match($pat, $link->href);
{
echo $link->href."<br>";
}
}
All the links get filtered out.

You say the URL is question*s*, but your pattern has no s.
Also, it looks like you should be using if, not echo, around the preg_match call:
include_once('simple_html_dom.php');
$html = new simple_html_dom();
$html->load_file('http://stackoverflow.com/questions?sort=newest');
$pat='#^/questions/([0-9]+)/#';
foreach($html->find('a') as $link)
{
if ( preg_match($pat, $link->href) )
{
echo $link->href."<br>";
}
}
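To check the corrected pattern without fetching the page, here is a minimal sketch that runs it against a few sample hrefs. The helper name isQuestionLink and the sample paths are made up for illustration; only the pattern itself comes from the answer above.

```php
<?php
// Hypothetical helper: true when $href looks like /questions/<digits>/<slug>.
function isQuestionLink(string $href): bool
{
    // '#' is used as the delimiter so the slashes in the path need no escaping.
    return preg_match('#^/questions/([0-9]+)/#', $href) === 1;
}

// Sample hrefs, invented for this sketch.
$hrefs = [
    '/questions/3904482/matching-url-to-a-pattern',
    '/questions/tagged/php',
    '/question/3904482/missing-s',
    '/users/12345/someone',
];

// Keep only the hrefs that match, reindexed from 0.
$matches = array_values(array_filter($hrefs, 'isQuestionLink'));

foreach ($matches as $href) {
    echo $href, PHP_EOL;
}
```

Only the first sample passes: the second has no numeric id, the third is missing the s, and the fourth is a different path entirely.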

You can take advantage of DOM and XPath:
<?php
$dom = new DOMDocument;
@$dom->loadHTMLFile('http://stackoverflow.com/questions?sort=newest');
$xpath = new DOMXPath($dom);
$questions = $xpath->query("//a[contains(@href, '/questions/') and not(contains(@href, '/tagged/')) and not(contains(@href, '/ask'))]");
foreach ($questions as $question) {
print "{$question->getAttribute('href')} => {$question->nodeValue}";
}
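The same idea can be tried offline. This is a sketch only: the markup below is invented to stand in for the real listing page, and starts-with() is swapped in for the first contains() so that the path must begin with /questions/.

```php
<?php
// Invented sample markup standing in for the real listing page.
$html = <<<HTML
<html><body>
<a href="/questions/3904482/matching-url-to-a-pattern">Matching URL</a>
<a href="/questions/tagged/php">php</a>
<a href="/questions/ask">Ask Question</a>
<a href="/users/1/someone">someone</a>
</body></html>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$query = "//a[starts-with(@href, '/questions/')"
       . " and not(contains(@href, '/tagged/'))"
       . " and not(contains(@href, '/ask'))]";

// Map each matching href to its link text.
$found = [];
foreach ($xpath->query($query) as $a) {
    $found[$a->getAttribute('href')] = trim($a->nodeValue);
}
print_r($found);
```

One caveat with this kind of substring filter: not(contains(@href, '/ask')) would also drop a question whose title slug happens to begin with "ask", so a numeric-id check with preg_match is stricter.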

Related

get value of href inside of div from external site using PHP

Good day.
I want to find a certain HTML element on an external website.
I want to get the a href value, but the problem is that its id, class, or name is random.
<div class="static">
Dynamic
</div>

This code should display all the hrefs in http://example.com
In this case I use DOMDocument and XPath to select the elements you want to access because it's very flexible and easy to use.
<?php
$html = file_get_contents("http://example.com");
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DomXPath($doc);
$nodeList = $xpath->query("//a/@href");
print_r($nodeList);
// To access the values inside nodes
foreach($nodeList as $node){
echo "<p>" . $node->nodeValue . "</p>";
}

Use jQuery to get the value as follows:
var link = $(".static>a").attr("href");

You can use PHP DOMDocument:
<?php
$exampleurl = "http://YourDomain.com"; //set your url
$filterClass = "dynamicclass";
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($exampleurl);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$href = $element->getAttribute('href'); // all href
$class = $element->getAttribute('class');
if($class==$filterClass){
echo $href;
}
}
?>
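If the anchor's own attributes are random but the wrapping div's class is fixed, an XPath query can select by the parent instead. A minimal sketch, assuming the anchor sits inside the div with class "static" (the markup, the random class, and the URLs below are invented):

```php
<?php
// Invented markup: the anchor's class is random, the wrapping div's class is fixed.
$html = '<html><body>'
      . '<div class="static"><a class="x9f3k" href="http://example.com/target">Dynamic</a></div>'
      . '<div class="other"><a href="http://example.com/ignored">Other</a></div>'
      . '</body></html>';

libxml_use_internal_errors(true); // keep parser warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
$hrefs = [];
// Select the href attribute of every <a> nested under div.static.
foreach ($xpath->query("//div[@class='static']//a/@href") as $attr) {
    $hrefs[] = $attr->nodeValue; // each result is a DOMAttr
}
print_r($hrefs);
```

Note that @class='static' compares the whole attribute value, so a div with several classes would need contains() or a similar test instead.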

Get first li Simple DOM Parser

I'm trying to write a small scraper with Simple HTML DOM.
The target is
<ul id=filter><li><a href="url1"></li><li><a href="url2"></li></ul>
<ul id=filter><li><a href="url3"></li><li><a href="url4"></li></ul>
How do I get just the first li for every ul?
I have tried this:
$html = file_get_html($url);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$first_list_links = $xpath->evaluate('//ul[@id="filter"]/li/a');
foreach($first_list_links as $links) {
echo $dom->saveHTML($links);
}
but all the li elements are still included.

You can achieve this using the PHP Simple HTML DOM Parser (str_get_html parses a string; file_get_html expects a file or URL):
$html = str_get_html('<ul class="filter"><li><a href="url1"></li><li><a href="url2"></li></ul><ul class="filter"><li><a href="url3"></li><li><a href="url4"></li></ul>');
$urls = [];
foreach($html->find('.filter') as $element) {
$url = $element->firstChild()->find('a', 0)->href;
if (!in_array($url, $urls)) {
echo $url . "<br/>";
$urls[] = $url;
}
}
should output the first link of each list:
url1
url3
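The DOMXPath attempt from the question can also be made to work by adding a position predicate. A minimal sketch, with sample markup adapted from the question (the anchors are closed and given text here so the HTML is well formed):

```php
<?php
// Sample markup adapted from the question, with the <a> tags closed.
$html = '<ul id="filter"><li><a href="url1">one</a></li><li><a href="url2">two</a></li></ul>'
      . '<ul id="filter"><li><a href="url3">three</a></li><li><a href="url4">four</a></li></ul>';

libxml_use_internal_errors(true); // the duplicate id triggers parser warnings
$dom = new DOMDocument;
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
$firstLinks = [];
// li[1] is evaluated per parent, so it picks the first <li> of every matching <ul>.
foreach ($xpath->query('//ul[@id="filter"]/li[1]/a') as $a) {
    $firstLinks[] = $a->getAttribute('href');
}
echo implode(PHP_EOL, $firstLinks), PHP_EOL;
```

The only change to the original query is li[1] in place of li.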

How to extract specific type of links from website using php?

I am trying to extract a specific type of link from a webpage using PHP.
The links look like the following:
http://www.example.com/pages/12345667/some-texts-available-here
I want to extract all links in the above format:
maindomain.com/pages/somenumbers/sometexts
So far I can extract all the links from the webpage, but I can't get the filter to work. How can I achieve this?
Any suggestions?
<?php
$html = file_get_contents('http://www.example.com');
//Create a new DOM document
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
?>

You can use DOMXPath and register a function with DOMXPath::registerPhpFunctions so that you can call it inside an XPath query:
function checkURL($url) {
$parts = parse_url($url);
unset($parts['scheme']);
if ( count($parts) == 2 &&
isset($parts['host']) &&
isset($parts['path']) &&
preg_match('~^/pages/[0-9]+/[^/]+$~', $parts['path']) ) {
return true;
}
return false;
}
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($filename);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPhpFunctions('checkURL');
$links = $xp->query("//a[php:functionString('checkURL', @href)]");
foreach ($links as $link) {
echo $link->getAttribute('href'), PHP_EOL;
}
In this way you extract only the links you want.
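To see which URLs such a check accepts, it can be exercised on its own, without a document. This sketch repeats the same tests under a hypothetical name, isPageLink, so it runs standalone; the guard for a failed parse_url() call and the sample URLs are additions for illustration.

```php
<?php
// Hypothetical standalone copy of the checks used in checkURL() above.
function isPageLink(string $url): bool
{
    $parts = parse_url($url);
    if (!is_array($parts)) {
        return false; // parse_url() returns false on seriously malformed URLs
    }
    unset($parts['scheme']);

    // Only host + path may remain: a query string, fragment, or port is rejected.
    return count($parts) == 2
        && isset($parts['host'], $parts['path'])
        && preg_match('~^/pages/[0-9]+/[^/]+$~', $parts['path']) === 1;
}

// Sample URLs, invented for this sketch.
$samples = [
    'http://www.example.com/pages/12345667/some-texts-available-here',
    'http://www.example.com/pages/12345667/some-text?ref=1',
    'http://www.example.com/about',
    '/pages/123/relative-link',
];

foreach ($samples as $url) {
    echo $url, ' => ', isPageLink($url) ? 'match' : 'no match', PHP_EOL;
}
```

Because the count must be exactly 2, links with a query string and relative links without a host are both rejected. Loosen that condition if the target site uses either.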

This is a slight guess, but if I got it wrong you can still see the way to do it.
foreach ($links as $link){
//Extract and show the "href" attribute.
if (preg_match("/(?:http.*)maindomain\.com\/pages\/\d+\/.*/", $link->getAttribute('href'))) {
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
}

You already use a parser, so you might step forward and use an xpath query on the DOM. XPath queries offer functions like starts-with() as well, so this might work:
$xpath = new DOMXpath($dom);
$links = $xpath->query("//a[starts-with(@href, 'maindomain.com')]");
Loop over them afterwards:
foreach ($links as $link) {
// do sth. with it here
// after all, it is a DOMElement
}

How can I echo a scraped div in PHP?

How do I echo and scrape a div class? I tried this but it doesn't work. I am using cURL to establish the connection. How do I echo it? I want it just how it is on the actual page.
$document = new DOMDocument();
$document->loadHTML($html);
$selector = new DOMXPath($document);
$anchors = $selector->query("/html/body//div[@class='resultitem']");
//a URL you want to retrieve
foreach($anchors as $a) {
echo $a;
}

Neighbor,
I just made this snippet below, that uses your logic, and some tweaks to display the specified class from the webpage in the get_contents function.
Maybe you can plug in your values and try it?
(Note: I put the error checking in there to see a few bugs. It can be helpful to use that as you tweak. )
<?php
error_reporting(E_ALL);
ini_set('display_errors', '1');
$url = "http://www.tizag.com/cssT/cssid.php";
$class_to_scrape="display";
$html = file_get_contents($url);
$document = new DOMDocument();
$document->loadHTML($html);
$selector = new DOMXPath($document);
$anchors = $selector->query("/html/body//div[@class='". $class_to_scrape ."']");
echo "ok, no php syntax errors. <br>Lets see what we scraped.<br>";
foreach ($anchors as $node) {
$full_content = innerHTML($node);
echo "<br>".$full_content."<br>" ;
}
/* this function preserves the inner content of the scraped element.
** http://stackoverflow.com/questions/5349310/how-to-scrape-web-page-data-without-losing-tags
** So be sure to go and give that post an uptick too:)
**/
function innerHTML(DOMNode $node)
{
$doc = new DOMDocument();
foreach ($node->childNodes as $child) {
$doc->appendChild($doc->importNode($child, true));
}
return $doc->saveHTML();
}
?>

how to handle DOM in PHP

My PHP code
$dom = new DOMDocument();
@$dom->loadHTML($file);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[@class="text"]');
foreach ($tags as $tag) {
echo $tag->textContent;
}
What I'm trying to do here is get the content of the div that has class 'text'. The problem is that when I loop and echo the results I only get the text; I can't get the HTML with the images and tags like p, br, img, etc. I also tried $tag->nodeValue, but that didn't work either.

Personally, I like Simple HTML Dom Parser.
include "lib.simple_html_dom.php";
$html = str_get_html($file);
foreach($html->find('div.text') as $e){
echo $e->innertext;
}
Pretty simple, huh? It accommodates selectors like jQuery :)

What you need to do is create a temporary document, add the element to that and then use saveHTML():
foreach ($tags as $tag) {
$doc = new DOMDocument;
$doc->appendChild($doc->importNode($tag, true));
$html = $doc->saveHTML();
}
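On PHP 5.3.6 and later, DOMDocument::saveHTML() also accepts a node argument, so the temporary document can be skipped. A minimal sketch under that assumption, with invented sample markup; $outer and $inner are names chosen here for illustration:

```php
<?php
// Invented sample markup with nested tags inside div.text.
$file = '<html><body><div class="text"><p>Hello <b>world</b></p><br>'
      . '<img src="a.png" alt="a"></div></body></html>';

libxml_use_internal_errors(true); // keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($file);
$xpath = new DOMXPath($dom);

$outer = ''; // the div itself plus its content
$inner = ''; // only the content of the div
foreach ($xpath->query('//div[@class="text"]') as $tag) {
    // Passing a node to saveHTML() serializes just that subtree.
    $outer .= $dom->saveHTML($tag);
    // Serializing each child in turn gives the inner HTML.
    foreach ($tag->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
}
echo $inner, PHP_EOL;
```

Use $outer when the wrapping div should be kept and $inner when only its content is wanted.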

I found this snippet at http://www.php.net/manual/en/class.domelement.php:
<?php
function getInnerHTML($Node)
{
$Body = $Node->ownerDocument->documentElement->firstChild->firstChild;
$Document = new DOMDocument();
$Document->appendChild($Document->importNode($Body,true));
return $Document->saveHTML();
}
?>
Not sure if it works though.
