Scrape unique image URLs from HTML - php

Using PHP to curl a web page (some URL entered by user, let's assume it's valid).
Example: http://www.youtube.com/watch?v=Hovbx6rvBaA
I need to parse the HTML and extract all de-duplicated URLs that look like images. Not just the ones in img src="" but any URL ending in jpe?g|bmp|gif|png, etc. on that page. (In other words, I don't want to parse the DOM but want to use RegEx.)
I plan to then curl the URLs for their width and height information and ensure that they are indeed images, so don't worry about security-related stuff.
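For reference, a minimal sketch of that verification step, assuming allow_url_fopen is enabled (otherwise fetch the bytes with curl first); getimagesize() returns false for anything that isn't a readable image, and probeImage() is a name made up for this example:
<?php
// Verify a candidate URL really is an image and read its dimensions.
function probeImage($url) {
    $info = @getimagesize($url); // false if the URL isn't a readable image
    if ($info === false) {
        return null;
    }
    return array('width' => $info[0], 'height' => $info[1]);
}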

What's wrong with using the DOM? It gives you much better control over the context of the information and a much higher likelihood that the things you pull out are actually URLs.
<?php
$resultFromCurl = '
<html>
<body>
<img src="hello.jpg" />
Yep
<table background="yep.jpg">
</table>
<p>
Perhaps you should check out foo.jpg! I promise it
is safe for work.
</p>
</body>
</html>
';

// These are all the attributes I could think of that
// can contain URLs.
$queries = array(
    '//table/@background',
    '//img/@src',
    '//input/@src',
    '//a/@href',
    '//area/@href',
    '//img/@longdesc',
);

$dom = new DOMDocument();
@$dom->loadHTML($resultFromCurl); // suppress warnings from sloppy real-world HTML

$xpath = new DOMXPath($dom);
$urls = array();

// Attribute values that look like images.
foreach ($queries as $query) {
    foreach ($xpath->query($query) as $link) {
        if (preg_match('#\.(gif|jpe?g|png)$#', $link->textContent)) {
            $urls[$link->textContent] = true;
        }
    }
}

// Image URLs mentioned in the page's plain text.
if (preg_match_all('#\b[^\s]+\.(?:gif|jpe?g|png)\b#', $dom->textContent, $matches)) {
    foreach ($matches[0] as $m) {
        $urls[$m] = true;
    }
}

$urls = array_keys($urls);
var_dump($urls);

Collect all image URLs into an array, then use array_unique() to remove duplicates.
$my_image_links = array_unique( $my_image_links );
// No more duplicates
If you really want to do this with a regex, then we can assume each image name is surrounded by quotes (single or double), whitespace (spaces, tabs, line breaks), the beginning of a line, >, <, and whatever else you can think of. So, then we can do:
$pattern = '/(?:^|[\'" >\t])([^\'" \n\r\t]+\.(jpe?g|bmp|gif|png))[\'" <\n\r\t]/im';
preg_match_all($pattern, html_entity_decode($resultFromCurl), $matches);
$imgs = array_unique($matches[1]);
The above will capture the image link in stuff like:
<p>Hai guys look at this ==> http://blah.com/lolcats.JPEG</p>

Related

php regular expression to remove unwanted code

The editor I am using is adding extraneous code that I would like to remove via PHP before writing to the database.
The code looks like this:
<img style="width: 250px;" src="files/school-big.jpg" data-cke-saved-src="files/school-big.jpg" alt="">
<img style="width: 250px;" src="files/firemen.jpg" data-cke-saved-src="files/firemen.jpg" alt="">
What I need to get rid of is the data-cke-saved-src="files/image-name" part. My understanding of regex is somewhere below weak, so how would I build a regex that grabs this attribute without grabbing the end of the line or the rest of the content?
Thank you kindly,
Try this:
$data = preg_replace('#\s(data-cke-saved-src)="[^"]+"#', '', $data);
Or do it in jQuery before going into PHP with this:
$('img').removeAttr('data-cke-saved-src')
Try adding and using this function:
/*
 * I am assuming you get all the data in a single variable.
 */
function remove_data_cke($text) {
    // Get all data-cke-saved-src="..." attributes from the HTML.
    $result = array();
    preg_match_all('|data-cke-saved-src="[^"]*"|U', $text, $result);
    // Replace all occurrences with an empty string.
    foreach ($result[0] as $data_cke) {
        $text = str_replace($data_cke, '', $text);
    }
    return $text;
}
You can use DOM to easily remove the attribute:
$doc = new DOMDocument;
@$doc->loadHTML($html); // load the HTML data, suppressing parse warnings
foreach ($doc->getElementsByTagName('img') as $img) {
    $img->removeAttribute('data-cke-saved-src');
}
$html = $doc->saveHTML(); // serialize the cleaned markup back out

How to use preg_match_all to check if an HTML source contains a given URL?

I want to find all href attributes that include my URL in any HTML source.
I used this code:
preg_match_all("'<a.*?href=\"(http[s]*://[^>\"]*?)\"[^>]*?>(.*?)</a>'si", $target_source, $matches);
For example, I'm trying to find the a href tags that include http://www.emrekadan.com
How can I do it?
I'd simply use PHP's DOM parser for this purpose. This may seem harder than regex, but it's actually a lot easier and is the correct way to parse HTML.
$url = 'WEBSITE_TO_SEARCH_FOR';
$searchstring = 'YOUR_SEARCH_STRING';
$dom = new DOMDocument();
@$dom->loadHTMLFile($url); // suppress warnings from malformed HTML
$result = array();
foreach ($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');
    if (stripos($href, $searchstring) !== FALSE) {
        $result[] = $href;
    }
}
if (!empty($result)) print_r($result);
Explanation:
Loads the given URL using the loadHTMLFile() method
Finds all <a> tags and loops through them
Uses stripos() to case-insensitively check if the href contains the given search term
If it does, it's pushed into the $result array
Note: If an empty string is passed as the filename or an empty file is named, a warning will be generated. I've used the @ error-suppression operator to hide that message, but it's generally regarded as a bad practice. You can add additional checks to make sure the URL exists before trying to load it.
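One possible pre-check, sketched with curl (urlIsReachable() is an illustrative name, not part of any library):
function urlIsReachable($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // send a HEAD request only
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return $status >= 200 && $status < 400;         // treat 2xx/3xx as reachable
}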

strip_tags plus annotate links

Note: The input HTML is trusted; it is not user defined!
I'll highlight what I need with an example.
Given the following HTML:
<p>
Welcome to Google.com!<br>
Please, enjoy your stay!
</p>
I'd like to convert it to:
Welcome to Google.com[1]
Please, enjoy[2] your stay!
[1] http://google.com/
[2] %request-uri%/enjoy.html <- note, request uri is something I define
for relative paths
I'd like to be able to customize it.
Edit: On a further note, I'd better explain myself and my reasons
We have an automated templating system (with stylesheets!) for emails, and as part of the system I'd like to generate multipart emails, i.e., ones which contain both HTML and TEXT.
The system only provides HTML.
I need to convert this HTML to text meaningfully, e.g., I'd like to somehow retain any links and images, perhaps in the format I specified above.
You could use the DOM to do the following:
$doc = new DOMDocument();
$doc->loadHTML('…');
$anchors = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    if ($anchor->hasAttribute('href')) {
        $href = $anchor->getAttribute('href');
        if (!isset($anchors[$href])) {
            $anchors[$href] = count($anchors) + 1;
        }
        $index = $anchors[$href];
        $anchor->parentNode->replaceChild($doc->createElement('a', $anchor->nodeValue." [$index]"), $anchor);
    }
}
$html = strip_tags($doc->saveHTML());
$html = preg_replace('/^[\t ]+|[\t ]+$/m', '', $html);
foreach ($anchors as $href => $index) {
    $html .= "\n[$index] $href";
}

Searching within a webpage

What would be the best way to write code in PHP that searches within a web page for a number of words stored in a file? Is it best to store the source code in a file, or is there another way? Please help.
The best way is to use Google: site:example.com word1 OR word2 OR word3
Do you want to search in ONE PAGE, or one website with MULTIPLE PAGES?
If it's only one page, I think you can store the HTML code in memory without problems.
If you know exactly what you're searching for, strpos for each word will probably be the fastest (stripos for case-insensitive). You can also define your own character class and use preg_match_all or something... just something like this will do:
<?php
$keywords = array("word1", "word2", "word3");
$doc = strip_tags(file_get_contents("http://www.example.com")); // remove tags to get only text
$doc = preg_replace('/\s+/', ' ', $doc); // collapse runs of whitespace
foreach ($keywords as $word) {
    $pos = stripos($doc, $word);
    if ($pos !== false) {
        $start = max(0, $pos - 20); // avoid a negative offset for matches near the start
        echo "match: ...".str_replace($word, "<em>$word</em>", substr($doc, $start, 50))."... \n";
    }
}
?>
Something like the following, for example, will perform MUCH faster, as it's based on hashmap lookups (O(1)) and doesn't need to scan the whole text for every keyword...
<?php
setlocale(LC_ALL, "en_US.utf8");
$keywords = array("word1", "word2", "word3", "word4");
$doc = file_get_contents("http://www.example.com");
$doc = strtolower($doc);
$doc = preg_replace('!/\*.*?\*/!s', '', $doc);          // strip /* ... */ comments
$doc = preg_replace('/<!--.*?-->/s', '', $doc);         // strip HTML comments
$doc = preg_replace('!<script.*?script>!s', '', $doc);  // strip scripts
$doc = preg_replace('!<style.*?style>!s', '', $doc);    // strip styles
$doc = strip_tags($doc);
$doc = iconv('UTF-8', 'ASCII//TRANSLIT', $doc); // transliterate before stripping; check that the encoding really is UTF-8
$doc = preg_replace('/[^0-9a-z\s]/', '', $doc);
//$doc = preg_replace('{(.)\1+}', '$1', $doc); // remove duplicate chars ... possible step to add even more fuzziness
$doc = preg_split("/\s+/", trim($doc));
$index = array_flip($doc); // word => position hashmap; this is what makes lookups O(1)
foreach ($keywords as $word) {
    $word = strtolower($word);
    $word = iconv('UTF-8', 'ASCII//TRANSLIT', $word);
    if (isset($index[$word])) {
        $key = $index[$word];
        echo "match: ";
        for ($i = $key; $i <= $key + 5 && isset($doc[$i]); $i++) {
            echo $doc[$i]." ";
        }
    }
}
?>
This code is untested.
It would, however, be more elegant to extract the text nodes from a DOMDocument.
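For example, the text-node route might look like this (a sketch, not a drop-in replacement for the code above):
$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings from real-world markup
$xpath = new DOMXPath($dom);
// Drop <script> and <style> subtrees so their contents don't count as text.
foreach ($xpath->query('//script | //style') as $node) {
    $node->parentNode->removeChild($node);
}
$text = $dom->textContent; // the remaining plain text, tags excluded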
Simple searching is easy. If you want to search a whole website with multiple pages, the crawling logic is the difficult part.
I once did a backlink checker for a company that worked like a crawler.
My first advice is not to do this recursively (scanning a page, following all links, and following all links in those, until you reach a certain level...).
Rather, do it like this (a sketch follows below):
Do a for loop that runs once per level you want to crawl.
Seed a site array with one entry (the start page).
Pass the array to a function that downloads every link, scans each page there, and stores the links it finds in an array.
When done with all links, return the new link-list array.
In the for loop, update the array with the return value of the function and call the function again.
This way you can avoid following nasty paths and instead crawl the website level by level.
Also store already-visited links in an array so you can skip them, don't follow external links, check for weird URL parameters, etc.
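A minimal sketch of that level-by-level loop (extractLinks() is a hypothetical helper that would return absolute, same-site URLs found on a page):
function crawl($startUrl, $maxLevels) {
    $visited = array();                  // already-seen URLs, used to skip
    $queue = array($startUrl);           // pages for the current level
    for ($level = 0; $level < $maxLevels; $level++) {
        $nextQueue = array();
        foreach ($queue as $url) {
            if (isset($visited[$url])) {
                continue;                // don't fetch the same page twice
            }
            $visited[$url] = true;
            $html = @file_get_contents($url);
            if ($html === false) {
                continue;
            }
            foreach (extractLinks($html, $url) as $link) {
                $nextQueue[] = $link;    // collect links for the next level
            }
        }
        $queue = $nextQueue;             // descend one level
    }
    return array_keys($visited);
}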
For future use you can store the documents in Lucene or Solr; there are classes to turn HTML pages into meaningful Lucene documents and search within them.

Inserting multiple links into text, ignoring matches that happen to be inserted

The site I'm working on has a database table filled with glossary terms. I am building a function that will take some HTML and replace the first instances of the glossary terms with tooltip links.
I am running into a problem though. Since it's not just one replace, the function is replacing text that has been inserted in previous iterations, so the HTML is getting mucked up.
I guess the bottom line is, I need to ignore text if it:
Appears within the < and > of any HTML tag, or
Appears within the text of an <a></a> tag.
Here's what I have so far. I was hoping someone out there would have a clever solution.
function insertGlossaryLinks($html)
{
    // Get glossary terms from database, once per request
    static $terms;
    if (is_null($terms)) {
        $query = Doctrine_Query::create()
            ->select('gt.title, gt.alternate_spellings, gt.description')
            ->from('GlossaryTerm gt');
        $glossaryTerms = $query->rows();
        // Create whole list in $terms, including alternate spellings
        $terms = array();
        foreach ($glossaryTerms as $glossaryTerm) {
            // Initialize with title (h() is presumably the app's HTML-escaping helper)
            $term = array(
                'wordsHtml' => array(
                    h(trim($glossaryTerm['title']))
                ),
                'descriptionHtml' => h($glossaryTerm['description'])
            );
            // Add alternate spellings
            foreach (explode(',', $glossaryTerm['alternate_spellings']) as $alternateSpelling) {
                $alternateSpelling = h(trim($alternateSpelling));
                if (empty($alternateSpelling)) {
                    continue;
                }
                $term['wordsHtml'][] = $alternateSpelling;
            }
            $terms[] = $term;
        }
    }
    // Do replacements on this HTML (closure replaces the removed create_function())
    $newHtml = $html;
    foreach ($terms as $term) {
        $callback = function ($m) {
            return '<span>'.$m[0].'</span>';
        };
        $term['wordsHtmlPreg'] = array_map('preg_quote', $term['wordsHtml']);
        $pattern = '/\b('.implode('|', $term['wordsHtmlPreg']).')\b/i';
        $newHtml = preg_replace_callback($pattern, $callback, $newHtml, 1);
    }
    return $newHtml;
}
Using Regexes to process HTML is always risky business. You will spend a long time fiddling with the greediness and laziness of your Regexes to only capture text that is not in a tag, and not in a tag name itself. My recommendation would be to ditch the method you are currently using and parse your HTML with an HTML parser, like this one: http://simplehtmldom.sourceforge.net/. I have used it before and have recommended it to others. It is a much simpler way of dealing with complex HTML.
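For example, the traversal might look like this with simplehtmldom (a sketch based on its str_get_html()/find() API; include the library's source file first):
require_once 'simple_html_dom.php';
$dom = str_get_html($html);
foreach ($dom->find('a') as $anchor) {
    echo $anchor->href, "\n"; // attributes are exposed as properties
}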
I ended up using preg_replace_callback to replace all existing links with placeholders. Then I inserted the new glossary term links. Then I put back the links that I had replaced.
It's working great!
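A sketch of that placeholder trick (insertGlossaryTerms() below stands in for whatever does the actual term replacement):
function addGlossaryLinksSafely($html) {
    $saved = array();
    // 1. Swap every existing <a>...</a> out for an opaque placeholder.
    $html = preg_replace_callback('#<a\b[^>]*>.*?</a>#si', function ($m) use (&$saved) {
        $token = '@@LINK'.count($saved).'@@';
        $saved[$token] = $m[0];
        return $token;
    }, $html);
    // 2. Now glossary links can be inserted without touching existing ones.
    $html = insertGlossaryTerms($html);
    // 3. Restore the original links.
    return strtr($html, $saved);
}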
