Get img src value inside a document.write PHP - php

I need to get the image src value from the following code using PHP XPath & node.
Sample HTML
<div class=\"thumb-inside\">
<div class=\"thumb\">
<script>document.write(thumbs.replaceThumbUrl('<img src=\".....\" />'));</script>
</div>
</div>
I tried like this:
$node = $xpath->query("div[@class='thumb-inside']/div[@class='thumb']/a/img/attribute::src", $e);
$th = $node->item(0)->nodeValue;

I got it working with the following code, but I don't know whether it is the correct approach.
$string = str_replace("document.write(thumbs.replaceThumbUrl(","",$string);
$string = str_replace("'));","",$string);
$pattern = '#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#';
preg_match_all($pattern, $string, $matches, PREG_PATTERN_ORDER);
$th = $matches[0][0];

You can use DOMDocument in PHP, as shown below, to get the image source.
$html=file_get_contents('file_path');
$doc = new \DomDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('img');
foreach ($tags as $tag) {
echo $tag->getAttribute('src');
}
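One caveat with the sample markup in the question: the img tag sits inside a script element, and HTML parsers generally treat script content as plain text, so a lookup for img elements may find nothing. A minimal sketch under that assumption reads the script text with XPath and then pulls the src out with a small regex. The image URL is a made-up placeholder, and the regex assumes a quoted src attribute.

```php
<?php
// Sample markup modelled on the question; the image URL is a made-up placeholder.
$html = <<<'HTML'
<div class="thumb-inside">
<div class="thumb">
<script>document.write(thumbs.replaceThumbUrl('<img src="http://example.com/thumb.jpg" />'));</script>
</div>
</div>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select the script element itself; its content is text, not child elements.
$scripts = $xpath->query("//div[@class='thumb-inside']/div[@class='thumb']/script");

$th = null;
if ($scripts->length > 0) {
    $js = $scripts->item(0)->textContent;
    // Pull the first quoted src value out of the JavaScript string.
    if (preg_match('/<img[^>]*\bsrc=["\']([^"\']+)["\']/i', $js, $m)) {
        $th = $m[1];
    }
}
echo $th, "\n";
```

If the page builds the final URL in JavaScript (as replaceThumbUrl suggests), the extracted value is only the raw string passed to that function.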

Related

Using PHP preg_replace to append text to pattern found with regex

I want to wrap every img tag in a div tag.
So I have
<img src=%random url image% />
And it should be replaced with
<div class="demo"><img src=%random url image% /></div>
Can I do it with preg_replace?
$string = %page source code%;
$find = array("/<img(.*?)\/>/");
$replace = array('<div class="demo">'.$find[0].'</div>');
$result = preg_replace($find, $replace, $string);
But it does not work :/
A better way to parse HTML is to use PHP's DOMDocument and DOMXPath classes. In your case, you can use XPath to find all the images, then wrap a div around each one, as shown in this example:
$html = '<div><img src="http://x.com" /><span>xyz</span><img src="http://example.com" /></div>';
$doc = new DOMDocument();
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXpath($doc);
$images = $xpath->query('//img');
foreach ($images as $image) {
$div = $doc->createElement('div');
$div->setAttribute('class', 'demo');
$image->parentNode->replaceChild($div, $image);
$div->appendChild($image);
}
echo $doc->saveHTML();
Output:
<div>
<div class="demo"><img src="http://x.com"></div>
<span>xyz</span>
<div class="demo"><img src="http://example.com"></div>
</div>
Demo on 3v4l.org
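As for whether preg_replace can do it at all: the attempt in the question inserts $find[0], which is the pattern string itself, not the matched tag. The matched text is available in the replacement as the $0 backreference. A minimal sketch, assuming simple img tags with no ">" inside attribute values (the URLs are placeholders):

```php
<?php
// Illustrative input; the URLs are placeholders.
$string = '<p><img src="http://x.com/a.png" /> text <img src="http://example.com/b.png"></p>';

// $0 in the replacement stands for the whole matched <img ...> tag.
$result = preg_replace('/<img\b[^>]*>/i', '<div class="demo">$0</div>', $string);

echo $result, "\n";
```

This is fine for quick, predictable input, but the DOM approach is more robust when the markup is not under your control.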

PHP Web Crawler doesn't crawl .php files

This is the simple webcrawler I was trying to build
<?php
$to_crawl = "http://samplewebsite.com/about.php";
function get_links($url)
{
$input = @file_get_contents($url);
$regexp = " <a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a> ";
preg_match_all("/$regexp/siU", $input, $matches);
$l = $matches[2];
foreach ($l as $link) {
echo $link."</br>";
}
}
get_links($to_crawl);
?>
When I run the script with the $to_crawl variable set to a URL ending with a file name, e.g. "facebook.com/about", it works, but for some reason it echoes nothing when the URL ends with a '.php' filename. Can someone please help?
To get all links and their inner texts, you can use DOMDocument like this:
$dom = new DOMDocument;
@$dom->loadHTML($input); // Your input (HTML code)
$xp = new DOMXPath($dom);
$links = $xp->query('//a[@href]'); // XPath to get only <a> tags with a href attribute
$result = array();
foreach ($links as $link) {
$result[] = array($link->getAttribute("href"), $link->nodeValue);
}
print_r($result);
See IDEONE demo
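Because the script in the question suppresses errors with @, a failed download looks the same as a page without links. A small sketch that separates the two steps might make this easier to debug. The helper names fetch_html and extract_links are hypothetical, and the sample HTML is invented for illustration.

```php
<?php
// Hypothetical helper: returns the page body, or null when the download fails.
function fetch_html(string $url): ?string
{
    $input = @file_get_contents($url);
    return $input === false ? null : $input;
}

// Hypothetical helper: collects [href, link text] pairs from an HTML string.
function extract_links(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    $result = [];
    foreach ($xp->query('//a[@href]') as $link) {
        $result[] = [$link->getAttribute('href'), trim($link->textContent)];
    }
    return $result;
}

// Invented sample input; the second anchor has no href and is skipped.
$sample = '<p><a href="/about.php">About</a> <a name="top">no href</a> <a href="contact.php">Contact</a></p>';
$links = extract_links($sample);
print_r($links);
```

In the crawler, calling fetch_html first and printing a message when it returns null shows whether the .php URL failed to download or simply contained no matching links.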

Remove all occurrences of <br> before text

I'm trying to remove all <br> before my text.
So I have this:
<p>
<br/><br/>When the battle is on between contestants in a talent show, it gets really competitive when down to the last four. X-FactorUSAcontestant Marcus Canty knows this all too well as this is the stage he was voted off of the show earlier this year. <br/><br/>
</p>
I want to get rid of the first two <br/> but also I'd want to get rid of them if there were more than 2.
I would prefer to use XPath as I'm already using it; at the moment I have this:
foreach($xpath->query('//br[not(preceding::text())]') as $node) {
$node->parentNode->removeChild($node);
}
For some reason on this particular page it doesn't seem to be working.
UPDATE
Originally the question was why there were <br> tags at the start of the document when my XPath should be getting rid of them (see below). I applied some regex to see if that worked, which revealed the doctype you see now. I thought the doctype was somehow causing my original problem, but it just wasn't being shown until now. This content was imported from Blogger, and I'm currently manipulating it to fit a new blog.
link to example page
!DOCTYPE html PUBLIC “-//W3C//DTD HTML 4.0 Transitional//EN” “http://www.w3.org/TR/REC-html40/loose.dtd”><br><br>
Here's my code:
global $post;
$postTime = $post->post_date;
$postTime = strtotime($postTime);
$startDate = "2014/01/16";
if ($postTime < strtotime($startDate)) {
$html = mb_convert_encoding($content, 'HTML-ENTITIES', "UTF-8");
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//br[not(preceding::text())]') as $node) {
$node->parentNode->removeChild($node);
}
$nodes = $xpath->query('//a[string-length(.) = 0]');
foreach ($nodes as $node) {
$node->parentNode->removeChild($node);
}
$nodes = $xpath->query('//*[not(text() or node() or self::br)]');
foreach ($nodes as $node) {
$node->parentNode->removeChild($node);
}
remove_filter('the_content', 'wpautop');
$content = $doc->saveHTML();
$content = ltrim($content, '<br>');
$content = strip_tags($content, '<br> <a> <iframe>');
$content = preg_replace(array('/(<br\s*\/?>\s*){1,}/'), array('<br/><br/>'), $content);
$content = str_replace(' ', ' ', $content);
$content = "<p>".implode("</p>\n\n<p>", preg_split('/\n(?:\s*\n)+/', $content))."</p>";
return $content;
Help appreciated.
What about ltrim?
$string = ltrim($string, '<br/>');
You could try using a regex
s/!DOCTYPE html PUBLIC “-\/\/W3C\/\/DTD HTML 4.0 Transitional\/\/EN” “http:\/\/www.w3.org\/TR\/REC-html40\/loose.dtd”>((<br[^>]*/>)+)(.*)/\3/
or in PHP:
$pattern = '/^((<br[^>]*\/>)+)(.*)/i';
$replacement = '$3';
$content = preg_replace($pattern, $replacement, $content);
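A note on the two suggestions above: the second argument of ltrim is a list of characters, not a substring, so ltrim($string, '<br/>') strips any leading run of the characters "<", "b", "r", "/" and ">". That can eat the start of the text itself, and it also explains how the "<" of a leading doctype can disappear. A minimal sketch comparing it with an anchored pattern (the sample text is invented):

```php
<?php
// Invented sample: text that happens to start with the letters "br".
$content = "<br/><br/>bread and butter<br/><br/>";

// ltrim() treats '<br/>' as a set of characters, so it also eats the "br" of "bread".
$trimmed = ltrim($content, '<br/>');

// Anchored pattern: one or more <br>, <br/> or <br /> tags, plus whitespace, at the very start only.
$cleaned = preg_replace('~^\s*(?:<br\s*/?>\s*)+~i', '', $content);

echo $trimmed, "\n";
echo $cleaned, "\n";
```

The pattern uses ~ as its delimiter so the slash in <br/> does not need escaping.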

Dom replace entire node

Right now, I have this:
$text = $row->text;
$dom = new DOMDocument();
$dom->loadHTML($text);
$tags = $dom->getElementsByTagName('img');
foreach ($tags as $tag) {
$eg = $tag->getAttribute('data-easygal');
$src = $tag->getAttribute('src');
$values = explode("_",$eg);
$display = $this->prepareAlbum($values[0],$values[1],$src);
}
$row->text = $text;
Is there a way to replace the whole node $tag with the contents of the $display string? I can't work out how to str_replace the node, for example.
I used to use preg_replace, but that doesn't work properly on the client's server even though it works at home (and using preg functions on HTML draws instant anger from the PHP community).
I tried searching the board, but had no luck finding what I need :S
Something like:
foreach($tags as &$tag) {
...
$tag = new DomNode();
}
Try
$tag->parentNode->replaceChild($newNode, $tag);
This should replace the $tag node with $newNode, a DOM node that you create in the usual way.
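Since $display in the question is a string of markup rather than a node, one way to build the new node is a document fragment. The sketch below assumes $display is well-formed XML (appendXML rejects anything else) and uses a fixed string in place of the question's prepareAlbum() call, whose output is not shown.

```php
<?php
// Illustrative input; prepareAlbum() from the question is replaced by a fixed string here.
$text = '<p>Before <img src="a.jpg" data-easygal="1_2"> after</p>';
$display = '<div class="album">album markup</div>'; // must be well-formed XML for appendXML()

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$dom->loadHTML($text, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

// Copy the live node list to an array first: replacing nodes while
// iterating a live DOMNodeList can skip elements.
$tags = iterator_to_array($dom->getElementsByTagName('img'));
foreach ($tags as $tag) {
    $fragment = $dom->createDocumentFragment();
    $fragment->appendXML($display);
    $tag->parentNode->replaceChild($fragment, $tag);
}

$result = trim($dom->saveHTML());
echo $result, "\n";
```

The serialized document can then be assigned back to $row->text in place of the original string.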

getting all values from h1 tags using php

I want to receive an array that contains all the h1 tag values from a text
For example, if this were the given input string:
<h1>hello</h1>
<p>random text</p>
<h1>title number two!</h1>
I need to receive an array containing this:
titles[0] = 'hello',
titles[1] = 'title number two!'
I already figured out how to get the first h1 value of the string but I need all the values of all the h1 tags in the given string.
I'm currently using this to receive the first tag:
function getTextBetweenTags($string, $tagname)
{
$pattern = "/<$tagname ?.*>(.*)<\/$tagname>/";
preg_match($pattern, $string, $matches);
return $matches[1];
}
I pass it the string I want to be parsed and as $tagname I put in "h1".
I didn't write it myself though, I've been trying to edit the code to do what I want it to but nothing really works.
I was hoping someone could help me out.
Thanks in advance.
You could use simplehtmldom:
function getTextBetweenTags($string, $tagname) {
// Create DOM from string
$html = str_get_html($string);
$titles = array();
// Find all tags
foreach($html->find($tagname) as $element) {
$titles[] = $element->plaintext;
}
return $titles;
}
function getTextBetweenTags($string, $tagname){
$d = new DOMDocument();
$d->loadHTML($string);
$return = array();
foreach($d->getElementsByTagName($tagname) as $item){
$return[] = $item->textContent;
}
return $return;
}
Alternative to DOM. Use when memory is an issue.
$html = <<< HTML
<html>
<h1>hello<span>world</span></h1>
<p>random text</p>
<h1>title number two!</h1>
</html>
HTML;
$reader = new XMLReader;
$reader->xml($html);
while($reader->read() !== FALSE) {
if($reader->name === 'h1' && $reader->nodeType === XMLReader::ELEMENT) {
echo $reader->readString();
}
}
function getTextBetweenH1($string)
{
$pattern = "/<h1>(.*?)<\/h1>/";
preg_match_all($pattern, $string, $matches);
return ($matches[1]);
}
