I have the following code:
<?php
include("simple_html_dom.php");

crawl('http://www.google.com/search?hl=en&output=search&q=inurl:https://website.com/folder/');

function crawl($url){
    $html = file_get_html($url);
    $links = $html->find('a');
    foreach($links as $link)
    {
        $new_link = str_replace("url?q=", "/", $link->href);
        $new_link = substr($new_link, 0, strpos($new_link, '&'));
        echo "<a href='".$new_link."'>".$link->plaintext."</a><br />";
    }
}
?>
It returns URLs like this: http//website.com/folder/stuff
without the :, which makes the URL inaccessible.
I think there is nothing wrong with your code. Here is my approach using DOMDocument:
$xml = new DOMDocument();
@$xml->loadHTMLFile("http://www.google.com/search?hl=en&output=search&q=inurl:https://github");

$links = array();
foreach($xml->getElementsByTagName('a') as $link) {
    // skip hrefs that don't contain /url?q
    if (false === strpos($link->getAttribute('href'), '/url?q')) continue;
    $href = str_replace("url?q=", "/", $link->getAttribute('href'));
    $href = substr($href, 0, strpos($href, '&'));
    $links[] = array('url' => str_replace("//", "", $href), 'text' => $link->nodeValue);
}
print_r($links);
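A sketch of a variant that avoids the string surgery entirely: let parse_url() and parse_str() pull out and url-decode the q parameter. This assumes Google's results still use /url?q=... redirect links; the decoding step is also a plausible fix for the question's missing colon, since : often arrives percent-encoded as %3A.

$xml = new DOMDocument();
@$xml->loadHTMLFile("http://www.google.com/search?hl=en&output=search&q=inurl:https://github");

foreach ($xml->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    // only Google's redirect links, e.g. /url?q=http%3A//example.com/...
    if (strpos($href, '/url?') !== 0) continue;

    parse_str(parse_url($href, PHP_URL_QUERY), $query); // url-decodes %3A etc.
    if (isset($query['q'])) {
        echo "<a href='" . htmlspecialchars($query['q']) . "'>" . $link->nodeValue . "</a><br />";
    }
}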
What if you take out the "http://" altogether? Wouldn't that put you on the correct website? I don't know PHP, but I'm going to take a guess based on what I know about HTML and how browsers work.
I have a string containing HTML and some placeholders.
The placeholders always start with {{ and end with }}.
I'm trying to encode the contents of the placeholders and then decode them later.
While they're encoded they ideally need to be valid HTML, since I want to use DOMDocument on the string. The problem I'm having is that it ends up being a mess, because the placeholders are usually something like:
<img src="{{image url="mediadir/someimage.jpg"}}"/>
Sometimes they're something like this though:
<p>Some text</p>
{{widget type="pagelink" pageid="1"}}
<div class="whatever">Content</div>
I was wondering what the best way of doing this is. Thanks!
UPDATE: CONTEXT
The overall problem is that I have a Magento site with a bunch of static links like:
<a href="/some-page/">Link text</a>
And I need to replace them with widgets so that, if the URL changes, the links update. So I replace the above with something like this:
{{widget type="Magento\Cms\Block\Widget\Page\Link" anchor_text="Link Text" template="widget/link/link_block.phtml" page_id="123"}}
I have something which does this using PHP's DOMDocument functionality. It looks up CMS pages by their URL, finds the ID, and replaces the anchor node with the widget text. This works fine if the page doesn't already contain any widgets or URL placeholders.
However, if it does, the placeholders come out broken when processed through DOMDocument's saveHTML() function.
My idea of a solution to this was to encode the widgets and URL placeholders before passing the string to the DOMDocument loadHTML() function, and to decode them after the saveHTML() function when it is a string again.
UPDATE: CODE
This is a cut-down version of what I've got currently. It's messy, but it does work in replacing page links with widgets.
$pageCollection = $this->pageCollectionFactory->create();
$collection = $pageCollection->load();

$findarray = array('http', 'mailto', '.pdf', '{', '}');
$findarray2 = array('mailto', '.pdf', '{', '}');
$specialurl = 'https://www.example.com';

$relative_links = 0;
$missing_pages = 0;
$fixed_links = 0;

try {
    foreach ($collection as $page) {
        $dom = new \DOMDocument();
        $content = $this->cleanMagentoCode( $page->getContent() );

        libxml_use_internal_errors(true); // Suppress warnings created by reading bad HTML
        $dom->loadHTML( $content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD ); // Load HTML without doctype or html containing elements

        $elements = $dom->getElementsByTagName("a");
        for ($i = $elements->length - 1; $i >= 0; $i--) {
            $link = $elements->item($i);
            $found = false;

            // To clean up later
            if ( strpos($link->getAttribute('href'), $specialurl) !== FALSE ) {
                foreach ($findarray2 as $find) {
                    if (stripos($link->getAttribute('href'), $find) !== FALSE) {
                        $found = true;
                        break;
                    }
                }
            } else {
                foreach ($findarray as $find) {
                    if (stripos($link->getAttribute('href'), $find) !== FALSE) {
                        $found = true;
                        break;
                    }
                }
            }

            if ( strpos($link->getAttribute('href'), '#') === 0 ) {
                $found = true;
            }
            if ( $link->getAttribute('href') == '' ) {
                $found = true;
            }

            if ( !$found ) {
                $url = parse_url($link->getAttribute('href'));
                if ( isset( $url['path'] ) ) {
                    $identifier = rtrim( ltrim($url['path'], '/'), '/' );
                    try {
                        $pagelink = $this->pageRepository->getById($identifier);
                        // Fix link ($input comes from the surrounding command, not shown in this cut-down version)
                        if ($this->fixLinksFlag($input)) {
                            if ( stripos( $link->getAttribute('class'), "btn" ) !== FALSE ) {
                                $link_template = "widget/link/link_block.phtml";
                            } else {
                                $link_template = "widget/link/link_inline.phtml";
                            }
                            $widgetcode = '{{widget type="Magento\Cms\Block\Widget\Page\Link" anchor_text="' . $link->nodeValue . '" template="' . $link_template . '" page_id="' . $pagelink->getId() . '"}}';
                            $widget = $dom->createTextNode($widgetcode);
                            $link->parentNode->replaceChild($widget, $link);
                        }
                    } catch (\Exception $e) {
                        // No CMS page with this identifier; leave the link alone
                        $missing_pages++;
                    }
                }
            }
        }

        $page->setContent( $this->dirtyMagentoCode( $dom->saveHTML() ) );
        $page->save();
    }
} catch (\Exception $e) {
    // Error handling omitted in this cut-down version
}
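The snippet above calls cleanMagentoCode() and dirtyMagentoCode() without showing them. A minimal sketch of what they could look like, assuming any reversible encoding that DOMDocument treats as opaque text will do: base64 removes the nested quotes that break attribute parsing, and the -MGTO-/-OTGM- markers are arbitrary placeholders (not a Magento convention), chosen because hyphens never appear in base64 output.

// Hypothetical helpers: encode {{...}} directives so DOMDocument can't
// mangle them, then restore them after saveHTML().
private function cleanMagentoCode($content)
{
    return preg_replace_callback('/\{\{(.*?)\}\}/s', function ($matches) {
        // base64 survives both attribute and text-node contexts
        return '-MGTO-' . base64_encode($matches[1]) . '-OTGM-';
    }, $content);
}

private function dirtyMagentoCode($content)
{
    return preg_replace_callback('/-MGTO-(.*?)-OTGM-/s', function ($matches) {
        return '{{' . base64_decode($matches[1]) . '}}';
    }, $content);
}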
I'm trying to get all the links on a page. So, for example, if a user types https://laravel.com/ into the input field, they will see all the links on that page.
I already got the concept working. Here is part of the code I have:
$website = request('website_url');
$pureURL = 'http://www.'.$website.'/';

$doc = new \DOMDocument;
@$doc->loadHTMLFile($pureURL);

foreach ($doc->getElementsByTagName('a') as $link){
    $linkDetail[] = array('url' => $link->getAttribute('href'));
}

$pageLinks = $linkDetail;
return view('api.index', compact('pageLinks'));
My frontend code:
@foreach($pageLinks as $key => $link)
    {{ $link['url'] }}<br />
@endforeach
This is what I get:
The problem is, I just want to get the links that start with https, and I want to avoid the links that have /doc in them, and so on.
How would I go about doing that? I'm not really good with regex, but I know there is a way to do it with one.
$website = request('website_url');
$pureURL = 'http://www.'.$website.'/';

$doc = new \DOMDocument;
@$doc->loadHTMLFile($pureURL);

foreach ($doc->getElementsByTagName('a') as $link){
    $url = $link->getAttribute('href');
    if (strpos($url, 'https') !== 0) {
        continue;
    }
    $linkDetail[] = array('url' => $url);
}

$pageLinks = $linkDetail;
return view('api.index', compact('pageLinks'));
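A sketch extending the same loop to also skip links that contain /doc, since the question asks for both filters; the substring checks are an assumption about what "avoid" should mean here:

foreach ($doc->getElementsByTagName('a') as $link){
    $url = $link->getAttribute('href');
    if (strpos($url, 'https') !== 0) continue;    // must start with https
    if (strpos($url, '/doc') !== false) continue; // skip anything containing /doc
    $linkDetail[] = array('url' => $url);
}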
How about using parse_url() to check the protocol?
foreach ($doc->getElementsByTagName('a') as $link){
    if (parse_url($link->getAttribute('href'), PHP_URL_SCHEME) === 'https') {
        $linkDetail[] = array('url' => $link->getAttribute('href'));
    }
}
Why don't you check if the string starts with https:// and only then push it into the array? For instance:
foreach ($doc->getElementsByTagName('a') as $link){
    if (strpos($link->getAttribute('href'), 'https://') === 0)
        $linkDetail[] = array('url' => $link->getAttribute('href'));
}
^https.*[^doc|avoid_end]$
Something like that? https://regex101.com/r/I4ebAR/1

foreach ($doc->getElementsByTagName('a') as $link){
    $linkTmp = $link->getAttribute('href');
    if (preg_match('/^https.*[^doc|avoid_end]$/', $linkTmp)) {
        $linkDetail[] = ['url' => $linkTmp];
    }
}
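One caveat: [^doc|avoid_end] is a character class, so it only excludes those individual characters at the end of the string; it is not an alternation. A sketch using a negative lookahead instead, assuming the intent is "starts with https and does not contain /doc anywhere":

foreach ($doc->getElementsByTagName('a') as $link){
    $linkTmp = $link->getAttribute('href');
    if (preg_match('~^https(?!.*/doc)~', $linkTmp)) {
        $linkDetail[] = ['url' => $linkTmp];
    }
}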
I'm trying to extract links from an HTML page using DOM:
$html = file_get_contents('links.html');

$DOM = new DOMDocument();
$DOM->loadHTML($html);

$a = $DOM->getElementsByTagName('a');
foreach($a as $link){
    // echo out the href attribute of the <a> tag
    echo $link->getAttribute('href').'<br/>';
}
Output:
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
I would like to remove all results matching dontwantthisdomain.com, dontwantthisdomain2.com and dontwantthisdomain3.com, so the output looks like this:
http://domain1.com/page-X-on-domain-com.html
http://domain.com/page-XZ-on-domain-com.html
http://domain3.com/page-XYZ-on-domain3-com.html
Some people say I should not use regex for HTML and others say it's OK. Could somebody point out the best way to remove the unwanted URLs from my HTML file? :)
Maybe something like this:
function extract_domains($buffer, $whitelist) {
    preg_match_all("#<a\s+.*?href=\"(.+?)\".*?>(.+?)</a>#i", $buffer, $matches);
    $result = array();
    foreach($matches[1] as $url) {
        $url = urldecode($url);
        $parts = @parse_url((string) $url);
        if ($parts !== false && isset($parts['host']) && in_array($parts['host'], $whitelist)) {
            $result[] = $parts['host'];
        }
    }
    return $result;
}

$domains = extract_domains(file_get_contents("/path/to/html.htm"), array('stackoverflow.com', 'google.com', 'sub.example.com'));
It does a rough match on all the <a> tags with an href=, grabs what's between the quotes, then filters based on your whitelist of domains.
Non-regex solution (without potential errors :-) :
$html='
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
';
$html = explode("\n", $html);
$dontWant = array('dontwantthisdomain.com','dontwantthisdomain2.com','dontwantthisdomain3.com');

foreach ($html as $link) {
    $ok = true;
    foreach($dontWant as $notWanted) {
        if (strpos($link, $notWanted) > 0) {
            $ok = false;
        }
        if (trim($link) == '') $ok = false;
    }
    if ($ok) $final_result[] = $link;
}

echo '<pre>';
print_r($final_result);
echo '</pre>';
outputs
Array
(
[0] => http://domain1.com/page-X-on-domain-com.html
[1] => http://domain.com/page-XZ-on-domain-com.html
[2] => http://domain3.com/page-XYZ-on-domain3-com.html
)
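For completeness, a sketch that filters inside the question's own DOM loop rather than exploding lines, using parse_url() to compare hosts exactly; the blacklist is taken from the question:

$blacklist = array('dontwantthisdomain.com', 'dontwantthisdomain2.com', 'dontwantthisdomain3.com');

$DOM = new DOMDocument();
$DOM->loadHTML(file_get_contents('links.html'));

foreach ($DOM->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    $host = parse_url($href, PHP_URL_HOST);
    // keep only absolute URLs whose host is not blacklisted
    if (!empty($host) && !in_array($host, $blacklist)) {
        echo $href.'<br/>';
    }
}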
Please help me strip the following more efficiently.
a href="/mv/test-1-2-3-4.vFIsdfuIHq4gpAnc.html"
The site I visit has a few of those; I only need everything in between the two periods:
vFIsdfuIHq4gpAnc
I would like to keep my current format and code, which is built around regex. Please help me tune up the following preg_match line:
preg_match_all("(./(.*?).html)", $sp, $content);
Any kind of help I get on this is greatly appreciated and thank you in advance!
Here is my complete code:
$dp = "http://www.cnn.com";
$sp = @file_get_contents($dp);
if ($sp === FALSE) {
    echo("<P>Error: unable to read the URL $dp. Process aborted.</P>");
    exit();
}

preg_match_all("(./(.*?).html)", $sp, $content);

foreach($content[1] as $surl) {
    $nctid = str_replace("mv/", "", $surl);
    $nctid = str_replace("/", "", $nctid);
    echo $nctid,'<br /><br /><br />';
}
The above is what I have been working on.
It's pretty okay, really. It's just that you don't want to match .*?; you want to match multiple characters that aren't a full stop, so you can use [^.]+ instead.
$sp = 'a href="/mv/test-1-2-3-4.vFIsdfuIHq4gpAnc.html"';
preg_match_all('/\.([^.]+)\.html/', $sp, $content);
var_dump($content[1]);
The result that is printed:
array(1) {
[0]=>
string(16) "vFIsdfuIHq4gpAnc"
}
Here's an example of how to loop through all links:
<?php
$url = 'http://www.cnn.com';

$dom = new DomDocument();
@$dom->loadHTMLFile($url);

$links = $dom->getElementsByTagName('a');
foreach( $links as $link ) {
    $href = $link->attributes->getNamedItem('href');
    if( $href !== null ) {
        if( preg_match('~mv/.*?([^.]+)\.html~', $href->nodeValue, $matches) ) {
            echo "Link-id found: " . $matches[1] . "\n";
        }
    }
}
You can use explode():
$string = 'a href="/mv/test-1-2-3-4.vFIsdfuIHq4gpAnc.html"';
if(stripos($string, '/mv/') !== false){
    $dots = explode('.', $string);
    echo $dots[count($dots) - 2];
}
How about using explode?
$exploded = explode('.', $sp);
$content = $exploded[1]; // string: "vFIsdfuIHq4gpAnc"
Even simpler:
$sp="/mv/test-1-2-3-4.vFIsdfuIHq4gpAnc.html";
$regex = '/\.(?P<value>.*)\./';
preg_match_all($regex, $sp, $content);
echo nl2br(print_r($content["value"], 1));
I found this code to check for links on a URL.
<?php
$url = "http://example.com";
$input = @file_get_contents($url);

$dom = new DOMDocument();
$dom->strictErrorChecking = false;
@$dom->loadHTML($input);

$links = $dom->getElementsByTagName('a');
foreach($links as $link) {
    if ($link->hasAttribute('href')) {
        $href = $link->getAttribute('href');
        if (stripos($href, 'shows') !== false) {
            echo "<p>http://example.com" . $href . "</p>\n";
        }
    }
}
?>
Works well: it shows all the links that contain 'shows'.
For example, if the script above finds 3 links, I get:
<p>http://example.com/shows/Link1</p>
<p>http://example.com/shows/Link2</p>
<p>http://example.com/shows/Link3</p>
Now what I'm trying to do is to also check those URLs I just fetched for links that contain 'shows'.
To be honest I'm a PHP noob, so I don't know where to start :(
Regards,
Bart
Something like:
function checklinks($url){
    $input = @file_get_contents($url);

    $dom = new DOMDocument();
    $dom->strictErrorChecking = false;
    @$dom->loadHTML($input);

    $links = $dom->getElementsByTagName('a');
    foreach($links as $link) {
        if ($link->hasAttribute('href')) {
            $href = $link->getAttribute('href');
            if (stripos($href, 'shows') !== false) {
                echo "<p>" . $url . "/" . $href . "</p>\n";
                checklinks($url . "/" . $href);
            }
        }
    }
}

$url = "http://example.com";
checklinks($url);
Make it recursive: call the function again inside the function itself.
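One caution: as written, the recursion has no stop condition besides pages running out of matching links, so circular links will recurse forever. A sketch with a visited list and a depth limit added (the limit of 3 is an arbitrary choice):

function checklinks($url, &$visited = array(), $depth = 0){
    // stop on repeats or once we are too deep
    if ($depth > 3 || in_array($url, $visited)) return;
    $visited[] = $url;

    $input = @file_get_contents($url);

    $dom = new DOMDocument();
    $dom->strictErrorChecking = false;
    @$dom->loadHTML($input);

    foreach($dom->getElementsByTagName('a') as $link) {
        if (!$link->hasAttribute('href')) continue;
        $href = $link->getAttribute('href');
        if (stripos($href, 'shows') !== false) {
            echo "<p>" . $url . "/" . $href . "</p>\n";
            checklinks($url . "/" . $href, $visited, $depth + 1);
        }
    }
}

checklinks("http://example.com");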