Extract all urls Href php

How do I convert these links to SHA-1 hashes and then get back the HTML with the hashed links applied?
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
if (preg_match("/globo.com/i", $link->getAttribute('href'))) {
$v = $link->getAttribute('href');
$str = str_replace($v,'http://www.globo.com/?id='.sha1($v),$v);
$str2 = str_replace($v,$str,$html);
echo $str2."";
}
}

You can just put the href back into the element:
$dom = new DOMDocument;
$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
$href = $link->getAttribute('href');
if (preg_match("/globo\.com/i", $href)) {
$newHref = 'http://www.globo.com/?id=' . sha1($href);
$link->setAttribute('href', $newHref);
}
}
And then export the finished HTML using saveHTML().
echo $dom->saveHTML();
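As a rough end-to-end sketch of the approach above, the rewrite can be wrapped in a function that takes the HTML and returns the modified document. The function name `hashGloboLinks` and the sample markup are made up for illustration, and `stripos()` is used here in place of the regex as a simpler substring check.

```php
<?php
// Hypothetical helper (not from the original answer): rewrites every
// href containing "globo.com" to a hashed form and returns the HTML.
function hashGloboLinks(string $html): string
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    foreach ($dom->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');
        if (stripos($href, 'globo.com') !== false) {
            // Hash the original href, then write the new value back to the element
            $link->setAttribute('href', 'http://www.globo.com/?id=' . sha1($href));
        }
    }

    // saveHTML() serializes the whole document, including the rewritten links
    return $dom->saveHTML();
}

// Sample input, assumed for illustration only
$html = '<p><a href="http://www.globo.com/news">News</a> <a href="http://example.com/">Other</a></p>';
echo hashGloboLinks($html);
```

Note that `saveHTML()` returns a full document, so the output also contains the doctype and the `html` and `body` wrappers that `loadHTML()` added.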

Related

Get first li Simple DOM Parser

I am trying to write a small script with PHP Simple HTML DOM Parser.
The target markup is:
<ul id=filter><li><a href="url1"></li><li><a href="url2"></li></ul>
<ul id=filter><li><a href="url3"></li><li><a href="url4"></li></ul>
How do I get just the first li of every ul?
I have tried this:
$html = file_get_html($url);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$first_list_links = $xpath->evaluate('//ul[@id="filter"]/li/a');
foreach($first_list_links as $links) {
echo $dom->saveHTML($links);
}
but every li is still included.

You can achieve this using the PHP Simple HTML DOM Parser. Since the markup here is a string rather than a URL, load it with str_get_html():
$html = str_get_html('<ul class="filter"><li><a href="url1"></li><li><a href="url2"></li></ul><ul class="filter"><li><a href="url3"></li><li><a href="url4"></li></ul>');
$urls = [];
foreach($html->find('.filter') as $element) {
$url = $element->firstChild()->find('a', 0)->href;
if (!in_array($url, $urls)) {
echo $url . "<br/>";
$urls[] = $url;
}
}
This should output:
url1
url3
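If you would rather stay with DOMDocument and DOMXPath, as in the question, one possible sketch is to add a positional predicate so that only the first li of each ul is matched. The sample markup below is an assumption: it mirrors the question but closes the a tags and adds link text.

```php
<?php
// Sample markup modelled on the question (anchors closed for clarity)
$html = '<ul id="filter"><li><a href="url1">one</a></li><li><a href="url2">two</a></li></ul>'
      . '<ul id="filter"><li><a href="url3">three</a></li><li><a href="url4">four</a></li></ul>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // the duplicate id would otherwise raise a warning
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// li[1] restricts the match to the first <li> child of each matching <ul>
$firstLinks = [];
foreach ($xpath->query('//ul[@id="filter"]/li[1]/a') as $a) {
    $firstLinks[] = $a->getAttribute('href');
}

print_r($firstLinks);   // url1 and url3
```

The original query `//ul[@id="filter"]/li/a` has no position test, which is why it returns the links from every li.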

php dom not able to find any nodes

I'm trying to get the href of all anchor(a) tags using this code
$obj = json_decode($client->getResponse()->getContent());
$dom = new DOMDocument;
if($dom->loadHTML(htmlentities($obj->data->partial))) {
foreach ($dom->getElementsByTagName('a') as $node) {
echo $dom->saveHtml($node), PHP_EOL;
echo $node->getAttribute('href');
}
}
where the returned JSON is like here but it doesn't echo anything. The HTML does have a tags but the foreach is never run. What am I doing wrong?

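Just remove the htmlentities() call. It escapes < and > as &lt; and &gt;, so the parser sees plain text instead of tags and never finds an a element. Without it, the code works: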
$contents = file_get_contents('http://jsonblob.com/api/jsonBlob/54a7ff55e4b0c95108d9dfec');
$obj = json_decode($contents);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($obj->data->partial);
libxml_clear_errors();
foreach ($dom->getElementsByTagName('a') as $node) {
echo $dom->saveHTML($node) . '<br/>';
echo $node->getAttribute('href') . '<br/>';
}
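To see the effect in isolation, here is a small sketch that counts the anchors found with and without htmlentities(). The `$partial` string is a made-up stand-in for `$obj->data->partial`, and `countAnchors` is a hypothetical helper.

```php
<?php
// Hypothetical sample standing in for $obj->data->partial
$partial = '<div><a href="/page">Page</a></div>';

// Hypothetical helper: parses the markup and counts <a> elements
function countAnchors(string $markup): int
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    $dom->loadHTML($markup);
    libxml_clear_errors();

    return $dom->getElementsByTagName('a')->length;
}

echo countAnchors($partial), PHP_EOL;                 // 1: the markup is parsed as HTML
echo countAnchors(htmlentities($partial)), PHP_EOL;   // 0: the escaped markup is just text
```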

PHP: DOM get url and anchors (but not IMG)

I want to select all URLs from an HTML page into an array like:
This is a webpage with
different kinds of <img src="someimg.png">
The output i would like is:
with => http://somesite.se/link1.php
Now i get:
<img src="someimg.png"> => http://somesite.com/link1.php
with => http://somesite.com/link1.php
I do not want the links that contain an image between the opening and closing a tags, only the ones with text.
My current code is:
<?php
function innerHTML($node) {
$ret = '';
foreach ($node->childNodes as $node) {
$ret .= $node->ownerDocument->saveHTML($node);
}
return $ret;
}
$html = file_get_contents('http://somesite.com/'.$_GET['apt']);
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses the warnings raised by malformed HTML
$links = $dom->getElementsByTagName('a');
$result = array();
foreach ($links as $link) {
//$node = $link->nodeValue;
$node = innerHTML($link);
$href = $link->getAttribute('href');
if (preg_match('/\.pdf$/i', $href))
$result[$node] = $href;
}
print_r($result);
?>

Add a second preg_match to your conditional:
if(preg_match('/\.pdf$/i',$href) && !preg_match('/<img .*>/i',$node)) $result[$node] = $href;
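As an alternative to matching the serialized markup with a regex, you could ask the DOM whether the link contains an img element. This is only a sketch: the sample markup is invented, and it keys the result by the link text rather than by the inner HTML.

```php
<?php
// Invented sample: one image link, one text link, one non-PDF link
$html = '<a href="/docs/a.pdf"><img src="someimg.png"></a>'
      . '<a href="/docs/a.pdf">Download A</a>'
      . '<a href="/docs/b.html">Not a PDF</a>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$result = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    // True when the link has an <img> anywhere among its descendants
    $hasImage = $link->getElementsByTagName('img')->length > 0;

    if (preg_match('/\.pdf$/i', $href) && !$hasImage) {
        $result[trim($link->textContent)] = $href;
    }
}

print_r($result);   // only "Download A" => "/docs/a.pdf" remains
```

This avoids depending on how the img tag happens to be written in the source, for example attribute order or a missing space.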

XPATH/PHP - Smarter way to accomplish this?

I have the following:
$html = '<a href="/path/to/page.html"><img src="path/to/image.jpg" alt="Alt name" />Page name</a>';
I need to extract the href and src attributes and the anchor text.
My solution:
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $node) {
$href = $node->getAttribute('href');
$title = $node->nodeValue;
}
foreach ($dom->getElementsByTagName('img') as $node) {
$img = $node->getAttribute('src');
}
What would be the smarter way?

You can avoid the loops if you use DOMXPath to grab the elements directly:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXpath( $dom);
$a = $xpath->query( '//a')->item( 0); // Get the first <a> node
$img = $xpath->query( './/img', $a)->item( 0); // Get the <img> inside that <a> (the leading . makes the query relative to $a)
Now, you can do:
echo $a->getAttribute('href');
echo $a->nodeValue;
echo $img->getAttribute('src');
This will print:
/path/to/page.html
Page name
path/to/image.jpg

Possible alternative approach, querying the attributes directly:
$dom = new DOMDocument;
$dom->loadHTML($html);
$domXpath = new DOMXPath($dom);
$href = $domXpath->query('//a/@href')->item(0)->nodeValue;
$src = $domXpath->query('//img/@src')->item(0)->nodeValue;
Empty/null checks are up to you.
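One way those checks might look is a small wrapper that returns null when a query matches nothing. The helper name `firstValue` is hypothetical, and the sample markup is assumed from the output shown in the first answer.

```php
<?php
// Sample markup, assumed from the href, text and src printed above
$html = '<a href="/path/to/page.html"><img src="path/to/image.jpg" alt="Alt name" />Page name</a>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Hypothetical helper: string value of the first match, or null if none
function firstValue(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->nodeValue;
}

$href    = firstValue($xpath, '//a/@href');      // "/path/to/page.html"
$src     = firstValue($xpath, '//a/img/@src');   // "path/to/image.jpg"
$text    = firstValue($xpath, '//a');            // "Page name"
$missing = firstValue($xpath, '//video/@src');   // null, no such element

var_dump($href, $src, $text, $missing);
```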

http://ca2.php.net/manual/en/function.preg-match.php - if you want to use regex
or
http://php.net/manual/en/book.simplexml.php
if you need to use xml parsing.
// Simple xml
$xml = simplexml_load_string($html);
$attr = $xml->attributes();
echo 'href: ' . $attr['href'] . PHP_EOL;

Trying to use PHP DOM to replace node text without changing child nodes

I am trying to use the dom object to simplify the implementation of a glossary tooltip. What I need to do is to replace a text element in a paragraph, but NOT in an anchor tag that may be embedded in the paragraph.
$html = '<p>Replace this tag not this <a href="#">tag</a></p>';
$document = new DOMDocument();
$document->loadHTML($html);
$document->preserveWhiteSpace = false;
$document->validateOnParse = true;
$nodes = $document->getElementsByTagName("p");
foreach ($nodes as $node) {
$node->nodeValue = str_replace("tag","element",$node->nodeValue);
}
echo $document->saveHTML();
I get:
'...<p>Replace this element not this element</p>...'
I want:
'...<p>Replace this element not this <a href="#">tag</a></p>...'
How do I implement this such that only the parent node text is changed and the child node (a tag) is not changed?

Try this:
$html = '<p>Replace this tag not this <a href="#">tag</a></p>';
$document = new DOMDocument();
$document->loadHTML($html);
$document->preserveWhiteSpace = false;
$document->validateOnParse = true;
$nodes = $document->getElementsByTagName("p");
foreach ($nodes as $node) {
while( $node->hasChildNodes() ) {
$node = $node->childNodes->item(0);
}
$node->nodeValue = str_replace("tag","element",$node->nodeValue);
}
echo $document->saveHTML();
Hope this helps.
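The loop above only reaches the first text node of each paragraph. If the paragraph may contain text after the anchor as well, another option is to select the paragraph's own text nodes with XPath. This is a sketch under the assumption that the second "tag" in the sample sits inside an a element with a placeholder href.

```php
<?php
// Sample with an assumed anchor around the second "tag"
$html = '<p>Replace this tag not this <a href="#">tag</a></p>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// text() matches only text nodes that are direct children of <p>,
// so the text inside the nested <a> is never touched
foreach ($xpath->query('//p/text()') as $textNode) {
    $textNode->nodeValue = str_replace('tag', 'element', $textNode->nodeValue);
}

echo $dom->saveHTML();
// the paragraph becomes:
// <p>Replace this element not this <a href="#">tag</a></p>
```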
UPDATE
To answer @paul's question in the comments below, you can create a new element for the replacement and insert its markup:
$html = '<p>Replace this tag not this <a href="#">tag</a></p>';
$document = new DOMDocument();
$document->loadHTML($html);
$document->preserveWhiteSpace = false;
$document->validateOnParse = true;
$nodes = $document->getElementsByTagName("p");
//create the element which should replace the text in the original string
$elem = $document->createElement( 'dfn', 'tag' );
$attr = $document->createAttribute('title');
$attr->value = 'element';
$elem->appendChild( $attr );
foreach ($nodes as $node) {
while( $node->hasChildNodes() ) {
$node = $node->childNodes->item(0);
}
//dump the new string here, which replaces the source string
$node->nodeValue = str_replace("tag",$document->saveHTML($elem),$node->nodeValue);
}
echo $document->saveHTML();
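One caveat with assigning markup to nodeValue: the string is stored as text, so the angle brackets come out escaped when the document is serialized. A possible way around that, sketched below, is to split the text node around the term and swap in a real dfn element. The sample anchor is an assumption, only the first occurrence in each text node is handled, and the byte offsets from strpos() are assumed to match character offsets (true for ASCII text).

```php
<?php
// Sample with an assumed anchor around the second "tag"
$html = '<p>Replace this tag not this <a href="#">tag</a></p>';
$term = 'tag';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//p/text()') as $textNode) {
    $pos = strpos($textNode->nodeValue, $term);
    if ($pos === false) {
        continue;   // this text node does not contain the term
    }

    // First split: $termNode starts at the term ("tag not this ")
    $termNode = $textNode->splitText($pos);
    // Second split: $termNode is cut down to exactly the term ("tag")
    $termNode->splitText(strlen($term));

    // Build <dfn title="element">tag</dfn> and put it in place of the term
    $dfn = $dom->createElement('dfn', $term);
    $dfn->setAttribute('title', 'element');
    $termNode->parentNode->replaceChild($dfn, $termNode);
}

echo $dom->saveHTML();
// the paragraph becomes:
// <p>Replace this <dfn title="element">tag</dfn> not this <a href="#">tag</a></p>
```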
