Scrape image URL from Wikipedia page - PHP

I created a regex which extracts the image URL from the source code of the page.
<?php
function get_logo($html, $url)
{
    //preg_match_all('', $html, $matches);
    //preg_match_all('~\b((\w+ps?://)?\S+(png|jpg))\b~im', $html, $matches);
    if (preg_match_all('/\bhttps?:\/\/\S+(?:png|jpg)\b/', $html, $matches)) {
        echo "First";
        return $matches[0][0];
    } else {
        if (preg_match_all('~\b((\w+ps?://)?\S+(png|jpg))\b~im', $html, $matches)) {
            echo "Second";
            return url_to_absolute($url, $matches[0][0]);
            //return $matches[0][0];
        } else {
            return null;
        }
    }
}
But for a Wikipedia page the image URL looks like this:
http://en.wikipedia.org/wiki/File:Nelson_Mandela-2008_(edit).jpg
which always fails my regex.
How can I get around this?

Why try to parse HTML with regex when this can easily be done with the DOMDocument class in PHP?
<?php
$doc = new DOMDocument();
@$doc->loadHTMLFile("http://www.wikipedia.org/"); // @ suppresses warnings from sloppy markup
$images = $doc->getElementsByTagName("img");
foreach ($images as $image) {
    echo $image->getAttribute("src");
    echo "<br>";
}
?>
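If you only need image URLs ending in .png or .jpg (as in the question), the src values can be filtered and crudely resolved against the page URL. A minimal sketch; get_first_image() is a hypothetical helper, and the URL handling here is much rougher than a proper resolver such as the url_to_absolute() the question mentions:
<?php
// Sketch: return the first <img> src ending in .png or .jpg, made absolute
// in a very rough way ($url is the page the HTML was fetched from).
function get_first_image($url)
{
    $doc = new DOMDocument();
    @$doc->loadHTMLFile($url); // @ silences warnings from sloppy markup
    foreach ($doc->getElementsByTagName('img') as $image) {
        $src = $image->getAttribute('src');
        if (!preg_match('/\.(?:png|jpe?g)$/i', $src)) {
            continue;
        }
        if (strpos($src, '//') === 0) {           // protocol-relative
            return 'https:' . $src;
        }
        if (!preg_match('#^https?://#i', $src)) { // relative path
            return rtrim(dirname($url), '/') . '/' . ltrim($src, '/');
        }
        return $src;                              // already absolute
    }
    return null;
}

echo get_first_image("http://en.wikipedia.org/wiki/Nelson_Mandela");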


How to correct image links in scraped html using regex

Scraping with SimpleHTMLDom retrieves the HTML as written on the page, not as rendered in the browser, so unless the image links include the full URL to their location on the website, they will be missing the information needed to display properly. Those links vary: some have no leading slash (/) and others use (../). So I have created a script that retrieves the img src values with a regex, loops through each one, checks whether the domain name is included, and injects it if not.
$homepage = "https://example.com/";
$html = '<img class="drt" src="100.png"><img src="../101.png"><img src="/102.png"><img src="103.png">';
$check_img = preg_match_all("/<img .*?(?=src)src=\"([^\"]+)\"/si", $html, $m);
foreach ($m[1] as $img) {
    if (strpos($img, $homepage) == false) {
        if (strpos($img, '../') !== false) {
            $html = str_replace('../', $homepage, $img);
        } elseif ($img[0] == '/') {
            $html = str_replace('/', $homepage, $img);
        } else {
            $html = substr_replace($img, $homepage, 0, 0);
        }
    }
}
echo $html;
But it only injects the last image, and for some reason the <> are missing from the HTML.
Use DOMDocument or another HTML parser (edit: you are already using SimpleHTMLDom, but I'm unfamiliar with it; see here if you want to use it). It's better in the long run, especially if you want to tweak or get other elements.
<?php
$homepage = "https://example.com/";
$html = '<img class="drt" src="100.png"><img src="../101.png"><img src="/102.png"><img src="103.png">';
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('img') as $img) {
    $src = $img->getAttribute('src');
    if (strpos($src, '//') === false) {
        $src = $homepage.basename($src);
        $img->setAttribute('src', $src);
    }
}
// hacky way! remove unwanted doctype etc.
$ret = preg_replace('~<(?:!DOCTYPE|/?(?:html|body|head))[^>]*>\s*~i', '', $dom->saveHTML());
echo trim(str_replace('<meta http-equiv="Content-Type" content="text/html;charset=utf-8">', '', $ret));
// proper way! but you don't have a correct DOM, no <body>
// remove <!DOCTYPE
//$dom->removeChild($dom->doctype);
// remove <html><body></body></html>
//$dom->replaceChild($dom->firstChild->firstChild->firstChild, $dom->firstChild);
//
//echo $dom->saveHTML();
https://3v4l.org/1sf3B
Or, to produce the same result with your current code (but possibly prone to breaking), use basename() to remove the ./ and ../, and possibly ../../:
<?php
$homepage = "https://example.com/";
$html = '<img class="drt" src="100.png"><img src="../101.png"><img src="/102.png"><img src="103.png">';
$check_img = preg_match_all("/<img .*?(?=src)src=\"([^\"]+)\"/si", $html, $m);
foreach ($m[1] as $img) {
    if (strpos($img, '//') === false)
        $html = str_replace($img, $homepage.basename($img), $html);
}
echo $html;
Example: https://3v4l.org/LvL82
Or do the longer checks and replace the old src value in $html with the corrected one:
<?php
$homepage = "https://example.com/";
$html = '<img class="drt" src="100.png"><img src="../101.png"><img src="/102.png"><img src="103.png">';
$check_img = preg_match_all("/<img .*?(?=src)src=\"([^\"]+)\"/si", $html, $m);
foreach ($m[1] as $img) {
    if (strpos($img, '//') === false) {
        $old_img = $img;
        if (strpos($img, '../') !== false) {
            $img = str_replace('../', $homepage, $old_img);
        } elseif ($img[0] == '/') {
            $img = str_replace('/', $homepage, $old_img);
        } else {
            $img = $homepage.$old_img;
        }
        $html = str_replace($old_img, $img, $html);
    }
}
echo $html;
All produce the same result.

How to find URLs under double quotes

Let's say we load the source code of this question and we want to find the URL alongside "childUrl" (or go to this site's source code and search for "childUrl").
<?php
$sites_html = file_get_contents("https://stackoverflow.com/questions/46272862/how-to-find-urls-under-double-quote");
$html = new DOMDocument();
@$html->loadHTML($sites_html);
foreach (/* ??? */ as $row) {
    // now I want to echo the link alongside "childUrl" here
}
?>
Try this
<?php
function extract_links($url) // "extract" is a PHP built-in, so the function needs another name
{
    $sites_html = file_get_contents($url);
    $html = new DOMDocument();
    @$html->loadHTML($sites_html);
    foreach ($html->getElementsByTagName('a') as $row) {
        if ($row->getAttribute('href') == "wanted_url") {
            echo $row->getAttribute('href');
        }
    }
}
?>
You can use a regex. Try this code:
$matches = [[], []];
preg_match_all('/\"wanted_url\": \"([^\"]*?)\"/', $sites_html, $matches);
foreach ($matches[1] as $match) {
    echo $match;
}
This will print all URLs found under the wanted_url key.
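Applied to the actual key from the question, a minimal sketch (assuming the value appears in the page source as "childUrl": "..."):
<?php
// Sketch: pull every value that follows a "childUrl" key in the raw page source.
$sites_html = file_get_contents("https://stackoverflow.com/questions/46272862/how-to-find-urls-under-double-quote");
if (preg_match_all('/"childUrl":\s*"([^"]*)"/', $sites_html, $matches)) {
    foreach ($matches[1] as $url) {
        echo $url, "\n";
    }
}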

Optimize remote page retrieving and parsing

I'm retrieving a remote page with PHP, getting a few links from that page, and then fetching and parsing each of those links.
It takes about 12 seconds, which is way too much, and I need to optimize the code somehow.
My code is something like that:
$result = get_web_page('THE_WEB_PAGE');
preg_match_all('/<a data\-a=".*" href="(.*)">/', $result['content'], $matches);
$index = 0;
foreach ($matches[1] as $lnk) {
    $result = get_web_page($lnk);
    preg_match('/<span id="tests">(.*)<\/span>/', $result['content'], $match);
    $re[$index]['test'] = $match[1];
    preg_match('/<span id="tests2">(.*)<\/span>/', $result['content'], $match);
    $re[$index]['test2'] = $match[1];
    preg_match('/<span id="tests3">(.*)<\/span>/', $result['content'], $match);
    $re[$index]['test3'] = $match[1];
    ++$index;
}
I have some more preg_match calls inside the loop.
How can I optimize my code?
Edit:
I've changed my code to use XPath instead of regex, and it became much slower.
Edit2:
That's my full code:
<?php
$begin = microtime(TRUE);
$result = get_web_page('WEB_PAGE');
$dom = new DOMDocument();
$dom->loadHTML($result['content']);
$xpath = new DOMXPath($dom);
// Get the links
$matches = $xpath->evaluate('//li[@class = "lasts"]/a[@class = "lnk"]/@href | //li[@class=""]/a[@class = "lnk"]/@href');
if ($matches === FALSE) {
    echo 'error';
    exit();
}
foreach ($matches as $match) {
    $links[] = 'WEB_PAGE'.$match->value;
}
$index = 0;
// For each link
foreach ($links as $link) {
    echo (string)($index).' loop '.(string)(microtime(TRUE)-$begin).'<br>';
    $result = get_web_page($link);
    $dom = new DOMDocument();
    $dom->loadHTML($result['content']);
    $xpath = new DOMXPath($dom);
    $match = $xpath->evaluate('concat(//span[@id = "header"]/span[@id = "sub_header"]/text(), //span[@id = "header"]/span[@id = "sub_header"]/following-sibling::text()[1])');
    if ($match === FALSE) {
        exit();
    }
    $data[$index]['name'] = $match;
    $matches = $xpath->evaluate('//li[starts-with(@class, "active")]/a/text()');
    if ($matches === FALSE) {
        exit();
    }
    foreach ($matches as $match) {
        $data[$index]['types'][] = $match->data;
    }
    $matches = $xpath->evaluate('//span[@title = "this is a title" and @class = "info"]/text()');
    if ($matches === FALSE) {
        exit();
    }
    foreach ($matches as $match) {
        $data[$index]['info'][] = $match->data;
    }
    $matches = $xpath->evaluate('//span[@title = "this is another title" and @class = "name"]/text()');
    if ($matches === FALSE) {
        exit();
    }
    foreach ($matches as $match) {
        $data[$index]['names'][] = $match->data;
    }
    ++$index;
}
?>
As others mentioned, use a parser instead (i.e. DOMDocument) and combine it with XPath queries. Consider the following example:
<?php
# set up some dummy data
$data = <<<DATA
<div>
<a class='link'>Some link</a>
<a class='link' id='otherid'>Some link 2</a>
</div>
DATA;

$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);

# all links
$links = $xpath->query("//a[@class = 'link']");
print_r($links);

# special id link
$special = $xpath->query("//a[@id = 'otherid']");

# and so on
$textlinks = $xpath->query("//a[starts-with(text(), 'Some')]");
?>
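DOMXPath::query() returns a DOMNodeList rather than an array, so to actually inspect the matches you iterate over it. A short usage sketch based on the dummy data above:
// Usage sketch: DOMNodeList is traversable, so loop over it to read
// each matched element's text or attributes.
foreach ($links as $link) {
    echo $link->nodeValue, "\n";          // "Some link", "Some link 2"
}
foreach ($special as $link) {
    echo $link->getAttribute('id'), "\n"; // "otherid"
}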
Consider using a DOM framework for PHP. This should be way faster.
Use PHP's DOMDocument with xpath queries:
http://php.net/manual/en/class.domdocument.php
See Jan's answer for more explanation.
The following also works but is less preferable, according to the comments.
For example:
http://simplehtmldom.sourceforge.net/
An example to get all a tags on a page:
<?php
include_once('simple_html_dom.php');
$url = "http://your_url/";
$html = new simple_html_dom();
$html->load_file($url);
foreach ($html->find("a") as $link) {
    // do something with the link
}
?>

Find all external links with Simple HTML Dom Parser and regular expressions?

How can I find all external links on a page using regular expressions and Simple HTML DOM Parser? I have the following code to find all links.
<?php
include_once('simple_html_dom.php');
$url = "http://www.tokyobit.com";
$html = new simple_html_dom();
$html->load_file($url);
foreach ($html->find('a') as $a) {
    echo $a;
}
?>
How can I add a regular expression to find all links starting with http://, https:// or ftp://?
foreach ($html->find('a') as $a) {
    $regex = ''; // regex here
    if (preg_match_all($regex, $a, $matches)) {
        foreach ($matches as $match) {
            echo $match . '<br />';
        }
    }
}
Change the $regex variable to:
$regex = "#(https?|ftp)://.#";
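Plugged into the loop from the question, a minimal usage sketch (testing each anchor's href attribute, which SimpleHTMLDom exposes as $a->href, rather than the element's full HTML):
// Usage sketch: apply the suggested pattern to each link's href.
$regex = "#(https?|ftp)://.#";
foreach ($html->find('a') as $a) {
    if (preg_match($regex, $a->href)) {
        echo $a->href . '<br />';
    }
}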
You can write a custom strpos that accepts an array as the needle.
You'll first need this function:
function strposa($haystack, $needle, $offset = 0) {
    if (!is_array($needle)) $needle = array($needle);
    foreach ($needle as $query) {
        if (strpos($haystack, $query, $offset) !== false) return true; // stop on first true result
    }
    return false;
}
Then in your code:
$needle = array("ftp://", "http://", "https://");
foreach ($html->find('a') as $a) {
    if (strposa($a->href, $needle)) {
        echo $a->href . '<br />';
    }
}
Try this:
foreach ($html->find('a') as $a) {
    if (preg_match('#^(?:https?|ftp)://.+$#', $a->href)) {
        echo $a->href . '<br />';
    }
}
You can do it like this:
include_once('simple_html_dom.php');
$url = "http://www.tokyobit.com";
$html = new simple_html_dom();
$html->load_file($url);
$result = array();
foreach ($html->find('a') as $a) {
    $href = $a->href;
    if (strpos($href, '://', 3) !== false) $result[] = $href;
}
print_r($result);

PHP Regex or DOMDocument for Matching & Removing URLs?

I'm trying to extract links from html page using DOM:
$html = file_get_contents('links.html');
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$a = $DOM->getElementsByTagName('a');
foreach ($a as $link) {
    // echo out the href attribute of the <a> tag.
    echo $link->getAttribute('href').'<br/>';
}
Output:
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
I would like to remove all results matching dontwantthisdomain.com, dontwantthisdomain2.com and dontwantthisdomain3.com so the output will looks like that:
http://domain1.com/page-X-on-domain-com.html
http://domain.com/page-XZ-on-domain-com.html
http://domain3.com/page-XYZ-on-domain3-com.html
Some people say I should not use regex for HTML and others say it's OK. Could somebody point out the best way to remove the unwanted URLs from my HTML file? :)
Maybe something like this:
function extract_domains($buffer, $whitelist) {
    preg_match_all("#<a\s+.*?href=\"(.+?)\".*?>(.+?)</a>#i", $buffer, $matches);
    $result = array();
    foreach ($matches[1] as $url) {
        $url = urldecode($url);
        $parts = @parse_url((string) $url);
        if ($parts !== false && in_array($parts['host'], $whitelist)) {
            $result[] = $parts['host'];
        }
    }
    return $result;
}

$domains = extract_domains(file_get_contents("/path/to/html.htm"), array('stackoverflow.com', 'google.com', 'sub.example.com'));
It does a rough match on all the <a> tags with href=, grabs what's between the quotes, then filters them based on your whitelist of domains.
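Since the question already collects the links with DOMDocument, the same filtering can also be done without regex by checking each href's host against a blacklist. A sketch under that assumption ($blacklist and $kept are illustrative names, not from the question):
<?php
// Sketch: keep only links whose host is NOT one of the unwanted domains.
$blacklist = array('dontwantthisdomain.com', 'dontwantthisdomain2.com', 'dontwantthisdomain3.com');

$DOM = new DOMDocument();
$DOM->loadHTML(file_get_contents('links.html'));

$kept = array();
foreach ($DOM->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    $host = parse_url($href, PHP_URL_HOST);
    if ($host && !in_array($host, $blacklist)) {
        $kept[] = $href;
    }
}
print_r($kept);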
Non-regex solution (without potential errors :-) :
$html = '
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
';

$html = explode("\n", $html);
$dontWant = array('dontwantthisdomain.com', 'dontwantthisdomain2.com', 'dontwantthisdomain3.com');
$final_result = array();

foreach ($html as $link) {
    $ok = true;
    foreach ($dontWant as $notWanted) {
        if (strpos($link, $notWanted) > 0) {
            $ok = false;
        }
        if (trim($link) == '') $ok = false;
    }
    if ($ok) $final_result[] = $link;
}

echo '<pre>';
print_r($final_result);
echo '</pre>';
outputs
Array
(
    [0] => http://domain1.com/page-X-on-domain-com.html
    [1] => http://domain.com/page-XZ-on-domain-com.html
    [2] => http://domain3.com/page-XYZ-on-domain3-com.html
)
