Web scraping with Xpath, grabbing img - php

I am trying to scrape some images from a page but can't grab them. I think my path is correct, but the XPath query returns 0 results. Any idea what is wrong with it?
function pageContent($url)
{
$html = cache()->rememberForever($url, function () use ($url) {
return file_get_contents($url);
});
$parser = new \DOMDocument();
$parser->loadHTML($html);
return $parser;
}
$url = 'https://sumai.tokyu-land.co.jp/osaka';
@$parser = pageContent($url);
$resimler = [];
$rota = new \DOMXPath($parser);
$images = $rota->query("//section//div[@class='p-articlelist-content-left']//div[@class='p-articlelist-content-img']//img");
foreach ($images as $image) {
$resimler[] = $image->getAttribute("src");
}
var_dump($resimler);

Your query looks for a div[@class='p-articlelist-content-img'], but on the page that class is on a ul.
In addition, you should not hide error messages with the @ operator; use the libxml_use_internal_errors() function instead, as it was intended.
Finally, the // search in XPath is expensive, so avoid it where possible. You can also get the attribute value directly from the query (I don't know if this is any more efficient, though).
function pageContent(String $url) : \DOMDocument
{
$html = cache()->rememberForever($url, function () use ($url) {
return file_get_contents($url);
});
$parser = new \DOMDocument();
libxml_use_internal_errors(true);
$parser->loadHTML($html);
libxml_use_internal_errors(false);
return $parser;
}
$url = "https://sumai.tokyu-land.co.jp/osaka";
$parser = pageContent($url);
$rota = new \DOMXPath($parser);
$images = $rota->query("//ul[@class='p-articlelist-content-img']/li/img/@src");
$resimler = []; // initialise the result array before appending to it
foreach ($images as $image) {
$resimler[] = $image->nodeValue;
}
var_dump($resimler);
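
To try the corrected query without hitting the live site, here is a minimal sketch that runs it against an inline HTML string. The markup is hypothetical and only modelled on the class names from the question; the real page may be structured differently. It also shows the libxml error handling collecting the parser warnings rather than discarding them.

```php
<?php
// Hypothetical markup, modelled on the class names used in the question.
$html = <<<HTML
<html><body><section>
<ul class="p-articlelist-content-img">
<li><img src="/img/a.jpg" alt=""></li>
<li><img src="/img/b.jpg" alt=""></li>
</ul>
</section></body></html>
HTML;

// Collect parser warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$warnings = libxml_get_errors();        // inspect or log these if needed
libxml_clear_errors();
libxml_use_internal_errors($previous);  // restore the earlier setting

$xpath = new DOMXPath($doc);
$sources = [];
// Selecting @src directly yields attribute nodes, so nodeValue is the URL.
foreach ($xpath->query("//ul[@class='p-articlelist-content-img']/li/img/@src") as $attr) {
    $sources[] = $attr->nodeValue;
}
var_dump($sources);
```

Restoring the previous libxml setting, rather than forcing it to false, avoids changing behaviour for other code that may have enabled internal errors already.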

Related

how to get exact img src in xpath

If you inspect the page to get an img src, you'll see something like /images/March/img1.jpeg. As you know, that is not a full address. I want to scrape this page and get the proper src value. How can I do that?
Thanks in advance.
<?php
$content=file_get_content('example.com');
$dom= new DOMDocument();
$dom->loadHTML($content);
$xpath=new DOMXpath();
$img=$xpath->query("(//img)[2]/@src");
foreach($img as $val){
$images=$val->nodeValue;//just returns img/march/img1.jpeg
//instead of www.example.com/img.....
}
?>
You have to build the absolute URL manually, like this:
<?php
$content = file_get_contents('http://www.example.com/');
$dom = new DOMDocument();
$dom->loadHTML($content);
$xpath = new DOMXPath($dom); // DOMXPath needs the document it will query
$img = $xpath->query("(//img)[2]/@src");
// Make Absolute Url
function getAbsUrl($value, $baseurl)
{
$Parsed = parse_url($value);
if (empty($Parsed['host'])) {
// Relative
return rtrim($baseurl, '/') . '/' . ltrim($Parsed['path'], '/');
} else {
return $value;
}
}
foreach ($img as $val) {
$images = getAbsUrl($val->nodeValue, 'http://www.example.com/');
}
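
The helper above always appends the path to the site root. If the page also uses protocol-relative (//cdn...) or path-relative (img/a.jpg) sources, something along these lines may be closer to what a browser does. This is only a sketch: resolveUrl is a hypothetical name, the page URL is assumed to include a scheme and host, and ../ segments are not collapsed.

```php
<?php
// Hypothetical helper: resolves an img src against the URL of the page it came from.
// Assumes $pageUrl has a scheme and host; "../" segments are left as-is.
function resolveUrl(string $src, string $pageUrl): string
{
    if (parse_url($src, PHP_URL_SCHEME) !== null) {
        return $src; // already absolute (http:, https:, data:, ...)
    }
    $base = parse_url($pageUrl);
    $origin = $base['scheme'] . '://' . $base['host'];
    if (isset($base['port'])) {
        $origin .= ':' . $base['port'];
    }
    if (strpos($src, '//') === 0) {
        return $base['scheme'] . ':' . $src; // protocol-relative
    }
    if (strpos($src, '/') === 0) {
        return $origin . $src; // root-relative
    }
    // Path-relative: resolve against the directory of the page.
    $path = $base['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $src;
}

echo resolveUrl('/images/March/img1.jpeg', 'http://www.example.com/gallery/index.html'), "\n";
echo resolveUrl('img/a.jpg', 'http://www.example.com/gallery/index.html'), "\n";
```

If the page declares a <base href>, that value should be used as the page URL instead.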

How can I retrieve infos from PHP DOMElement?

I'm working on a function that reads the whole content of the style.css file and returns only the CSS rules that are needed by the currently viewed page (the result will be cached too, so the function only runs when the page has changed).
My problem is with parsing the DOM (I have never done this before with PHP DOM). I have the following function, but $element->tagname returns NULL. I also want to check the element's "class" attribute, but I'm stuck here.
function get_rules($html) {
$arr = array();
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('*') as $element ){
$arr[sizeof($arr)] = $element->tagname;
}
return array_unique($arr);
}
What can I do? How can I get the tag name and class of every DOM element in the HTML?
tagname is an undefined property, because it is supposed to be tagName (camel case); PHP property names are case-sensitive.
function get_rules($html) {
$arr = array();
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('*') as $element ){
$e = array();
$e['tagName'] = $element->tagName; // tagName not tagname
// get all elements attributes
foreach($element->attributes as $attr) {
$attrs = array();
$attrs['name'] = $attr->nodeName;
$attrs['value'] = $attr->nodeValue;
$e['attributes'][] = $attrs;
}
$arr[] = $e;
}
return $arr;
}
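
Since the goal in the question is to know which tag names and classes occur on the page, a flatter result may be easier to work with. Here is a minimal sketch under that assumption; collectTagsAndClasses is a hypothetical name, and it returns each tag name and class token once, in document order.

```php
<?php
// Hypothetical helper: unique tag names and class tokens found in an HTML string.
function collectTagsAndClasses(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tags = [];
    $classes = [];
    foreach ($dom->getElementsByTagName('*') as $element) {
        $tags[$element->tagName] = true; // array keys de-duplicate for us
        // getAttribute() returns '' when the attribute is missing.
        $tokens = preg_split('/\s+/', $element->getAttribute('class'), -1, PREG_SPLIT_NO_EMPTY);
        foreach ($tokens as $class) {
            $classes[$class] = true;
        }
    }
    return ['tags' => array_keys($tags), 'classes' => array_keys($classes)];
}

$result = collectTagsAndClasses('<div class="box wide"><p class="box">Hi</p><p>Bye</p></div>');
print_r($result);
```

Note that loadHTML() wraps a fragment in html and body elements, so those two tag names show up in the result as well.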

Get Element by ClassName with DOMdocument() Method

Here is what I am trying to achieve: retrieve all products on a page and put them into an array. Here is the code I am using:
$page2 = curl_exec($ch);
$doc = new DOMDocument();
@$doc->loadHTML($page2);
$nodes = $doc->getElementsByTagName('title');
$noders = $doc->getElementsByClassName('productImage');
$title = $nodes->item(0)->nodeValue;
$product = $noders->item(0)->imageObject.src;
It works for $title but not for the product. For info, in the HTML the img tag looks like this:
<img alt="" class="productImage" data-altimages="" src="xxxx">
I have been looking at this (PHP DOMDocument how to get element?) but I still don't understand how to make it work.
PS: I get this error:
Call to undefined method DOMDocument::getElementsByclassName()
I finally used the following solution:
$classname="blockProduct";
$finder = new DomXPath($doc);
$spaner = $finder->query("//*[contains(@class, '$classname')]");
https://stackoverflow.com/a/31616848/3068233
Linking this answer as it helped me the most with this problem.
function getElementsByClass(&$parentNode, $tagName, $className) {
$nodes=array();
$childNodeList = $parentNode->getElementsByTagName($tagName);
for ($i = 0; $i < $childNodeList->length; $i++) {
$temp = $childNodeList->item($i);
if (stripos($temp->getAttribute('class'), $className) !== false) {
$nodes[]=$temp;
}
}
return $nodes;
}
That's the code; here's the usage:
$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadHTML($html);
$content_node=$dom->getElementById("content_node");
$div_a_class_nodes=getElementsByClass($content_node, 'div', 'a');
function getElementsByClassName($dom, $ClassName, $tagName=null) {
if($tagName){
$Elements = $dom->getElementsByTagName($tagName);
}else {
$Elements = $dom->getElementsByTagName("*");
}
$Matched = array();
for($i=0;$i<$Elements->length;$i++) {
if($Elements->item($i)->attributes->getNamedItem('class')){
if($Elements->item($i)->attributes->getNamedItem('class')->nodeValue == $ClassName) {
$Matched[]=$Elements->item($i);
}
}
}
return $Matched;
}
// usage
$dom = new \DOMDocument('1.0');
@$dom->loadHTML($html);
$elementsByClass = getElementsByClassName($dom, $className, 'h1');
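
One caveat with the approaches above: contains(@class, ...) and stripos() also match longer class names that merely contain the search string, while the exact == comparison misses elements that carry several classes. A common XPath idiom that matches whole class tokens is sketched below. findByClass is a hypothetical name, and the class name is assumed not to contain quote characters.

```php
<?php
// Hypothetical helper: matches elements whose class list contains $className as a whole token.
// Assumes $className contains no quote characters.
function findByClass(DOMDocument $doc, string $className): DOMNodeList
{
    $xpath = new DOMXPath($doc);
    // Pad the normalised class list with spaces so " name " only matches complete tokens.
    $query = sprintf(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $className
    );
    return $xpath->query($query);
}

$html = '<div><img class="productImage large" src="a.png">'
      . '<img class="productImageSmall" src="b.png"></div>';
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$sources = [];
foreach (findByClass($doc, 'productImage') as $img) {
    $sources[] = $img->getAttribute('src'); // read the attribute with getAttribute()
}
var_dump($sources);
```

Only the first image is returned here, because productImageSmall is a different class token.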

Get Vine video url and image using PHP simple HTML DOM Parser

I would like to get a Vine image URL and video URL using PHP Simple HTML DOM Parser.
http://simplehtmldom.sourceforge.net/
Here is an example Vine URL:
https://vine.co/v/bjHh0zHdgZT
I need to take this info from that page. For the image URL:
<meta property="twitter:image" content="https://v.cdn.vine.co/v/thumbs/8B474922-0D0E-49AD-B237-6ED46CE85E8A-118-000000FFCD48A9C5_1.0.6.mp4.jpg?versionId=mpa1lJy2aylTIEljLGX63RFgpSR5KYNg">
and for the video URL:
<meta property="twitter:player:stream" content="https://v.cdn.vine.co/v/videos/8B474922-0D0E-49AD-B237-6ED46CE85E8A-118-000000FFCD48A9C5_1.0.6.mp4?versionId=ul2ljhBV28TB1dUvAWKgc6VH0fmv8QCP">
I only want the content of these meta tags. If anyone can help, I'd really appreciate it. Thanks.
Instead of using the lib you pointed out, I'm using native PHP DOM in this example, and it should work.
Here's a small class I created for something like that:
<?php
class DomFinder {
public $xpath = null; // declared explicitly; dynamic properties are deprecated as of PHP 8.2
function __construct($page) {
$html = @file_get_contents($page);
$doc = new DOMDocument();
$this->xpath = null;
if ($html) {
$doc->preserveWhiteSpace = true;
$doc->resolveExternals = true;
@$doc->loadHTML($html);
$this->xpath = new DOMXPath($doc);
$this->xpath->registerNamespace("html", "http://www.w3.org/1999/xhtml");
}
}
function find($criteria = NULL, $getAttr = FALSE) {
if ($criteria && $this->xpath) {
$entries = $this->xpath->query($criteria);
$results = array();
foreach ($entries as $entry) {
if (!$getAttr) {
$results[] = $entry->nodeValue;
} else {
$results[] = $entry->getAttribute($getAttr);
}
}
return $results;
}
return NULL;
}
function count($criteria = NULL) {
$items = 0;
if ($criteria && $this->xpath) {
$entries = $this->xpath->query($criteria);
foreach ($entries as $entry) {
$items++;
}
}
return $items;
}
}
To use it you can try:
$url = "https://vine.co/v/bjHh0zHdgZT";
$dom = new DomFinder($url);
$content_cell = $dom->find("//meta[@property='twitter:player:stream']", 'content');
print $content_cell[0];
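
If a full class is more than you need, the same lookup can be sketched as a single function working on an HTML string. metaContent is a hypothetical name, the sample markup uses placeholder URLs rather than the real Vine page, and the property value is assumed not to contain quote characters.

```php
<?php
// Hypothetical helper: returns the content attribute of the first matching <meta property="...">.
// Assumes $property contains no quote characters.
function metaContent(string $html, string $property): ?string
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query("//meta[@property='" . $property . "']/@content");
    return $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;
}

// Placeholder markup standing in for the downloaded page.
$html = '<html><head>'
      . '<meta property="twitter:image" content="https://example.com/thumb.jpg">'
      . '<meta property="twitter:player:stream" content="https://example.com/video.mp4">'
      . '</head><body></body></html>';

echo metaContent($html, 'twitter:image'), "\n";
echo metaContent($html, 'twitter:player:stream'), "\n";
```

Returning null when nothing matches lets the caller tell a missing tag apart from an empty content value.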

Extracting certain portions of HTML from within PHP

Ok, so I'm writing an application in PHP to check my sites if all the links are valid, so I can update them if I have to.
And I ran into a problem. I've tried to use SimpleXml and DOMDocument objects to extract the tags but when I run the app with a sample site I usually get a ton of errors if I use the SimpleXml object type.
So is there a way to scan the html document for href attributes that's pretty much as simple as using SimpleXml?
<?php
// what I want to do is get a similar effect to the code described below:
foreach($html->html->body->a as $link)
{
// store the $link into a file
foreach($link->attributes() as $attribute=>$value);
{
//procedure to place the href value into a file
}
}
?>
So basically I'm looking for a way to perform the above operation. The thing is, I'm currently confused about how I should treat the string containing the HTML code.
Just to be clear, I'm using the following primitive way of getting the HTML file:
<?php
$target = "http://www.targeturl.com";
$file_handle = fopen($target, "r");
$a = "";
while (!feof($file_handle)) $a .= fgets($file_handle, 4096);
fclose($file_handle);
?>
Any info would be useful, as would alternatives in other languages where the above problem is solved more elegantly (Python, C or C++).
You can use DOMDocument::loadHTML.
Here's a bunch of code we use for an HTML parsing tool we wrote.
$target = "http://www.targeturl.com";
$result = file_get_contents($target);
$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
@$dom->loadHTML($result);
// getTags() returns one HTML string per <a> element; extract the href from each.
$links = array_map('extractLink', getTags( $dom, 'a' ));
// Returns the first href ($argument = 1) or link text ($argument = 2) found in $html.
function extractLink( $html, $argument = 1 ) {
$href_regex_pattern = '/<a[^>]*?href=[\'"](.*?)[\'"][^>]*?>(.*?)<\/a>/si';
preg_match_all($href_regex_pattern,$html,$matches);
if (count($matches)) {
if (is_array($matches[$argument]) && count($matches[$argument])) {
return $matches[$argument][0];
}
return $matches[1];
}
// The else branch is missing from the original post; returning null here is an assumption.
return null;
}
// Returns the serialized HTML of every matching element, or only the one at index $element.
function getTags( $dom, $tagName, $element = false, $children = false ) {
$html = array(); // must be an array: using $html[] on a string is a fatal error
$domxpath = new DOMXPath($dom);
$children = ($children) ? "/".$children : '';
$filtered = $domxpath->query("//$tagName" . $children);
$i = 0;
while( $myItem = $filtered->item($i++) ){
$newDom = new DOMDocument;
$newDom->formatOutput = true;
$node = $newDom->importNode( $myItem, true );
$newDom->appendChild($node);
$html[] = $newDom->saveHTML();
}
if ($element !== false && isset($html[$element])) {
return $html[$element];
}
return $html;
}
You could just use strpos($html, 'href=') and then parse the URL. You could also search for <a or .php
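
For a link checker, it may be simpler to skip the regex and string searching and read the href attributes straight from the parsed document. Here is a minimal sketch; extractHrefs is a hypothetical name, and the HTML is passed in as a string so the download step stays separate.

```php
<?php
// Hypothetical helper: returns the unique href values of all <a> elements in an HTML string.
function extractHrefs(string $html): array
{
    $previous = libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) { // skip named anchors without a target
            $hrefs[] = $a->getAttribute('href');
        }
    }
    return array_values(array_unique($hrefs));
}

$hrefs = extractHrefs(
    '<p><a href="/a.php">A</a> <a name="x">no href</a> '
    . '<a href="http://example.com/">B</a> <a href="/a.php">again</a></p>'
);
print_r($hrefs);
```

Unlike SimpleXML, DOMDocument::loadHTML() tolerates malformed markup, which is usually what causes the flood of errors mentioned in the question.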
