I am using a simple html dom to parsing html file.
I have a dynamic array called links2, it can be empty or maybe have 4 elements inside or more depending on the case
<?php
include('simple_html_dom.php');
$url = 'http://www.example.com/';
$html = file_get_html($url);
$doc = new DOMDocument();
#$doc->loadHTML($html);
//////////////////////////////////////////////////////////////////////////////
foreach ($doc->getElementsByTagName('p') as $link)
{
$intro2 = $link->nodeValue;
$links2[] = array(
'value' => $link->textContent,
);
$su=count($links2);
}
$word = 'document.write(';
Assuming that the two elements contain $word in "array links2", when I try to filter this "array links2" by removing elements contains matches
unset( $links2[array_search($word, $links2 )] );
print_r($links2);
the filter removes only one element and array_diff doesn't solve the problem. Any suggestion?
solved by adding an exception
foreach ($doc->getElementsByTagName('p') as $link)
{
$dont = $link->textContent;
if (strpos($dont, 'document') === false) {
$links2[] = array(
'value' => $link->textContent,
);
}
$su=count($links2);
echo $su;
Related
i made a code like this and how to make it short ? i mean i don't want to use foreach all the time for regex match, thank you.
<?php
preg_match_all('#<article [^>]*>(.*?)<\/article>#sim', $content, $article);
foreach($article[1] as $posts) {
preg_match_all('#<img class="images" [^>]*>#si', $posts, $matches);
$img[] = $matches[0];
}
$result = array_filter($img);
foreach($result as $res) {
preg_match_all('#src="(.*?)" data-highres="(.*?)"#si', $res[0], $out);
$final[] = array(
'src' => $proxy.base64_encode($out[1][0]),
'highres' => $proxy.base64_encode($out[2][0])
);
?>
If you want a robust code (that always works), avoid to parse html using regex, because html is more complicated and unpredictable than you think. Instead use build-in tools available for these particular tasks, i.e DOMxxx classes.
$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_use_internal_errors($state);
$xp = new DOMXPath($dom);
$imgList = $xp->query('//article//img[#src][#data-highres]');
foreach($imgList as $img) {
$final[] = [
'src' => $proxy.base64_encode($img->getAttribute('src')),
'highres' => $proxy.base64_encode($img->getAttribute('data-highres'))
];
}
I'm stuck on particular task. As you can see I'm extracting hrefs and title from webpage and I need to put this information to a file. But how this array can be printed in order like this: href1 : title1 , href2 : title2 and so on.
<?php
$searched = file_get_contents('http://technologijos.lt');
$xml = new DOMDocument();
#$xml->loadHTML($searched);
foreach($xml->getElementsByTagName('a') as $lnk)
{
$links[] = array(
'href' => $lnk->getAttribute('href'),
'title' => $lnk->getAttribute('title')
);
}
echo '<pre>'; print_r($links); echo '</pre>';
?>
Why not create the array directly in a way that is usable afterwards?
<?php
$searched = file_get_contents('http://technologijos.lt');
$xml = new DOMDocument();
#$xml->loadHTML($searched);
$links = [];
foreach($xml->getElementsByTagName('a') as $lnk) {
$links[] = sprintf(
'%s : %s',
$lnk->getAttribute('href'),
$lnk->getAttribute('title');
);
}
var_dump(implode(', ', $links);
Obviously the same can be done by using a second loop to iterate over the links array if it is create as shown in your example.
I know this solution is simple, but it keeps slipping my mind. When I parse the page with this code and the $links array is printed, all of href parts are correct yet the img part only prints the last src element that is found on the page.
$doc = new DOMDocument();
$doc->loadHTML($html);
$links = array();
$images = $doc->getElementsByTagName("img");
$arr = $doc->getElementsByTagName("a");
foreach($arr as $item) {
// get links
$href = $item->getAttribute("href");
// get images.
foreach ($images as $item) {
$img = $item->getAttribute('src');
}
$links[] = array(
'href' => $href,
'img' => $img
);
}
print_r(array_values($links));
The for each statement for images should be building an array where as the final array ($links)is a multi-dimentional array($img being the nested array).
Check if this works for you:
$doc = new DOMDocument();
$doc->loadHTML($html);
$links = array();
$images = $doc->getElementsByTagName("img");
$arr = $doc->getElementsByTagName("a");
foreach($arr as $item) {
// get links
$href = $item->getAttribute("href");
// get images.
foreach ($images as $item) {
$img = $item->getAttribute('src');
// storing the image src
$links[] = array(
'img' => $img
);
}
$links[] = array(
'href' => $href
);
}
print_r(array_values($links));
You use the dublicate variable $item in internal foreach.
Try this without internal foreach
$doc = new DOMDocument();
$doc->loadHTML($html);
$links = array();
$images = $doc->getElementsByTagName("img");
$arr = $doc->getElementsByTagName("a");
foreach($arr as $key=>$item) {
// get links
$href = $item->getAttribute("href");
$img = $images[$key]->getAttribute('src');
$links[] = array(
'href' => $href,
'img' => $img
);
}unset($item);
print_r(array_values($links));
I want to select all URL's from a HTML page into an array like:
This is a webpage with
different kinds of <img src="someimg.png">
The output i would like is:
with => http://somesite.se/link1.php
Now i get:
<img src="someimg.png"> => http://somesite.com/link1.php
with => http://somesite.com/link1.php
I do not want the urls/links that does contain a image between the start and end . Only the ones with text.
My current code is:
<?php
function innerHTML($node) {
$ret = '';
foreach ($node->childNodes as $node) {
$ret .= $node->ownerDocument->saveHTML($node);
}
return $ret;
}
$html = file_get_contents('http://somesite.com/'.$_GET['apt']);
$dom = new DOMDocument;
#$dom->loadHTML($html); // # = Removes errors from the HTML...
$links = $dom->getElementsByTagName('a');
$result = array();
foreach ($links as $link) {
//$node = $link->nodeValue;
$node = innerHTML($link);
$href = $link->getAttribute('href');
if (preg_match('/\.pdf$/i', $href))
$result[$node] = $href;
}
print_r($result);
?>
Add a second preg_match to your conditional:
if(preg_match('/\.pdf$/i',$href) && !preg_match('/<img .*>/i',$node)) $result[$node] = $href;
I would like to extract all img tags that are within an anchor tag using the PHP DOM object.
I am trying it with the code below but its getting all anchor tag and making it's text empty due the inside of an img tag.
function get_links($url) {
// Create a new DOM Document to hold our webpage structure
$xml = new DOMDocument();
// Load the url's contents into the DOM
#$xml->loadHTMLFile($url);
// Empty array to hold all links to return
$links = array();
//Loop through each <a> tag in the dom and add it to the link array
foreach($xml->getElementsByTagName('a') as $link)
{
$hrefval = '';
if(strpos($link->getAttribute('href'),'www') > 0)
{
//$links[] = array('url' => $link->getAttribute('href'), 'text' => $link->nodeValue);
$hrefval = '#URL#'.$link->getAttribute('href').'#TEXT#'.$link->nodeValue;
$links[$hrefval] = $hrefval;
}
else
{
//$links[] = array('url' => GetMainBaseFromURL($url).$link->getAttribute('href'), 'text' => $link->nodeValue);
$hrefval = '#URL#'.GetMainBaseFromURL($url).$link->getAttribute('href').'#TEXT#'.$link->nodeValue;
$links[$hrefval] = $hrefval;
}
}
foreach($xml->getElementsByTagName('img') as $link)
{
$srcval = '';
if(strpos($link->getAttribute('src'),'www') > 0)
{
//$links[] = array('src' => $link->getAttribute('src'), 'nodval' => $link->nodeValue);
$srcval = '#SRC#'.$link->getAttribute('src').'#NODEVAL#'.$link->nodeValue;
$links[$srcval] = $srcval;
}
else
{
//$links[] = array('src' => GetMainBaseFromURL($url).$link->getAttribute('src'), 'nodval' => $link->nodeValue);
$srcval = '#SRC#'.GetMainBaseFromURL($url).$link->getAttribute('src').'#NODEVAL#'.$link->nodeValue;
$links[$srcval] = $srcval;
}
}
//Return the links
//$links = unsetblankvalue($links);
return $links;
}
This returns all anchor tag and all img tag separately.
$xml = new DOMDocument;
libxml_use_internal_errors(true);
$xml->loadHTMLFile($url);
libxml_clear_errors();
libxml_use_internal_errors(false);
$xpath = new DOMXPath($xml);
foreach ($xpath->query('//a[contains(#href, "www")]/img') as $entry) {
var_dump($entry->getAttribute('src'));
}
The usage of strpos() function is not correct in the code.
Instead of using
if(strpos($link->getAttribute('href'),'www') > 0)
Use
if(strpos($link->getAttribute('href'),'www')!==false )