Searching within a webpage - PHP

What would be the best way to write code in PHP that searches a webpage for a number of words stored in a file? Is it best to store the source code in a file, or is there another way? Please help.

The best way is to use Google: site:example.com word1 OR word2 OR word3
Do you want to search in ONE PAGE, or one website with MULTIPLE PAGES?
If it's only one page, I think you can store the HTML code in memory without problems.
If you know exactly what you are searching for, strpos for each word will probably be the fastest (stripos for case-insensitive matching). You can also define your own character class and use preg_match_all or something similar... just something like this will do:
<?php
$keywords = array("word1", "word2", "word3");
$doc = strip_tags(file_get_contents("http://www.example.com")); // remove tags to get only text
$doc = preg_replace('/\s+/', ' ', $doc); // collapse multiple whitespace characters
foreach ($keywords as $word) {
    $pos = stripos($doc, $word);
    if ($pos !== false) {
        // show a short excerpt around the match, highlighting the keyword
        $start = max(0, $pos - 20);
        echo "match: ..." . str_replace($word, "<em>$word</em>", substr($doc, $start, 50)) . "... \n";
    }
}
?>
Something like the following, for example, will perform MUCH faster, as it is based on hashmap lookups with O(1) and doesn't need to scan the whole text for every keyword:
<?php
setlocale(LC_ALL, "en_US.utf8");
$keywords = array("word1", "word2", "word3", "word4");
$doc = file_get_contents("http://www.example.com");
$doc = strtolower($doc);
$doc = preg_replace('!/\*.*?\*/!s', '', $doc);          // strip CSS/JS block comments
$doc = preg_replace('/<!--.*?-->/s', '', $doc);         // strip HTML comments
$doc = preg_replace('!<script.*?script>!s', '', $doc);  // strip script blocks
$doc = preg_replace('!<style.*?style>!s', '', $doc);    // strip style blocks
$doc = strip_tags($doc);                                // remove remaining tags
$doc = preg_replace('/[^0-9a-z\s]/', '', $doc);         // keep only alphanumerics and whitespace
$doc = iconv('UTF-8', 'ASCII//TRANSLIT', $doc);         // check if the encoding really is UTF-8
//$doc = preg_replace('{(.)\1+}', '$1', $doc);          // remove duplicate chars ... possible step to add even more fuzziness
$doc = preg_split("/\s+/", trim($doc));                 // split the text into an array of words
foreach ($keywords as $word) {
    $word = strtolower($word);
    $word = iconv('UTF-8', 'ASCII//TRANSLIT', $word);
    $key = array_search($word, $doc);
    var_dump($key); // debug output
    if ($key !== false) {
        echo "match: ";
        // print the matched word plus the next few words as context
        for ($i = $key; $i < $key + 5 && isset($doc[$i]); $i++) {
            echo $doc[$i] . " ";
        }
    }
}
?>
This code is untested.
It would, however, be more elegant to dump the text nodes from a DOMDocument.
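For illustration only, a minimal, untested sketch of that DOMDocument idea (the URL and keyword list are placeholders):
<?php
$keywords = array("word1", "word2", "word3");

// load the page; @ suppresses warnings about invalid markup
$dom = new DOMDocument();
@$dom->loadHTMLFile("http://www.example.com");

// collect the content of every text node in the body
$xpath = new DOMXPath($dom);
$text = '';
foreach ($xpath->query('//body//text()') as $textNode) {
    $text .= ' ' . $textNode->nodeValue;
}
$text = preg_replace('/\s+/', ' ', $text); // normalize whitespace

// simple case-insensitive lookup for each keyword
foreach ($keywords as $word) {
    if (stripos($text, $word) !== false) {
        echo "match: $word\n";
    }
}
?>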
Simple searching is easy. If you want to search a whole website, the crawling logic is the difficult part.
I once wrote a backlink checker for a company that worked like a crawler.
My first piece of advice is not to use recursion (scanning a page, following all its links, then following all the links on those pages, and so on until you reach a certain level).
Rather, do it like this:
Do a for loop that runs as many times as the number of levels you want to crawl.
Set up a site array with one entry (the start page).
Pass the array to a function that downloads every link, scans each page, and stores the links it finds in an array.
When done with all links, return the new link-list array.
In the for loop, update the array with the return value of the function, and call the function again.
This way you avoid following nasty paths and instead crawl the website level by level. A rough sketch of this approach follows below.
Also store already-visited links in an array so you can skip them, don't follow external links, check for weird URL parameters, etc.
For future use you can store documents in Lucene or Solr; there are classes that turn HTML pages into meaningful Lucene objects so you can search within them.
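For illustration only, a rough, untested sketch of that level-by-level crawl; the helper extractLinks and the start URL are placeholders, and the keyword matching itself is left out:
<?php
// return all absolute links found on one page (placeholder implementation)
function extractLinks($url) {
    $links = array();
    $dom = new DOMDocument();
    if (!@$dom->loadHTMLFile($url)) {
        return $links;
    }
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        if ($href !== '' && strpos($href, 'http') === 0) {
            $links[] = $href;
        }
    }
    return $links;
}

$maxLevels = 2;
$currentLevel = array("http://www.example.com"); // start page
$visited = array();

for ($level = 0; $level < $maxLevels; $level++) {
    $nextLevel = array();
    foreach ($currentLevel as $url) {
        if (isset($visited[$url])) {
            continue; // skip already-visited pages
        }
        $visited[$url] = true;
        // here you would also search the downloaded page for your keywords
        $nextLevel = array_merge($nextLevel, extractLinks($url));
    }
    // crawl the newly collected links on the next pass
    $currentLevel = array_unique($nextLevel);
}
?>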

Related

Ignore html tags in preg_replace

How do I ignore HTML tags in this preg_replace?
I have a foreach loop for a search, so if someone searches for "apple span" the preg_replace also applies a span to the existing span tags and the HTML breaks:
preg_replace("/($keyword)/i","<span class=\"search_hightlight\">$1</span>",$str);
Thanks in advance!
I suggest you base your function on DOMDocument and DOMXPath rather than on regular expressions. Even though regular expressions are quite powerful, you run into problems like the one you describe, which are not (always) easy to solve robustly with them.
The general saying is: Don't parse HTML with regular expressions.
It's a good rule to keep in mind, and although, as with any rule, it does not always apply, it's worth making up one's mind about it.
XPath allows you to find all texts that contain the search terms within the text only, ignoring all XML elements.
Then you only need to wrap those texts into the <span> and you're done.
Edit: Finally some code ;)
First it makes use of XPath to locate elements that contain the search text. My query looks like this; it might be written better, I'm not a super XPath pro:
'//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..'
$search contains the text to search for, not containing any " (quote) character (this would break it, see Cleaning/sanitizing xpath attributes for a workaround if you need quotes).
This query will return all parents that contain text nodes which, put together, form a string that contains your search term.
Since such a list is not easy to process further as-is, I created a TextRange class that represents a list of DOMText nodes. It makes it possible to do string operations on a list of text nodes as if they were one string.
This is the base skeleton of the routine:
$str = '...'; # some XML
$search = 'text that span';
printf("Searching for: (%d) '%s'\n", strlen($search), $search);

$doc = new DOMDocument;
$doc->loadXML($str);

$xp = new DOMXPath($doc);
$anchor = $doc->getElementsByTagName('body')->item(0);
if (!$anchor)
{
    throw new Exception('Anchor element not found.');
}

// search elements that contain the search-text
$r = $xp->query('//*[contains(., "'.$search.'")]/*[FALSE = contains(., "'.$search.'")]/..', $anchor);
if (!$r)
{
    throw new Exception('XPath failed.');
}

// process search results
foreach ($r as $i => $node)
{
    $textNodes = $xp->query('.//child::text()', $node);
    // extract $search textnode ranges, create fitting nodes if necessary
    $range = new TextRange($textNodes);
    $ranges = array();
    while (FALSE !== $start = strpos($range, $search))
    {
        $base = $range->split($start);
        $range = $base->split(strlen($search));
        $ranges[] = $base;
    }
    // wrap each matching textnode into a highlight span
    foreach ($ranges as $range)
    {
        foreach ($range->getNodes() as $node)
        {
            $span = $doc->createElement('span');
            $span->setAttribute('class', 'search_hightlight');
            $node = $node->parentNode->replaceChild($span, $node);
            $span->appendChild($node);
        }
    }
}
For my example XML:
<html>
<body>
This is some <span>text</span> that span across a page to search in.
and more text that span</body>
</html>
It produces the following result:
<html>
<body>
This is some <span><span class="search_hightlight">text</span></span><span class="search_hightlight"> that span</span> across a page to search in.
and more <span class="search_hightlight">text that span</span></body>
</html>
This shows that this approach even allows finding text that is distributed across multiple tags. That's not easily possible with regular expressions at all.
You find the full code here: http://codepad.viper-7.com/U4bxbe (including the TextRange class that I have taken out of the answers example).
It's not working properly on the viper codepad because of an older LIBXML version that site is using. It works fine for my LIBXML version 20707. I created a related question about this issue: XPath query result order.
A note of warning: This example uses binary string search (strpos) and the related offsets for splitting text nodes with the DOMText::splitText function. That can lead to wrong offsets, as the function needs the UTF-8 character offset. The correct method is to use mb_strpos to obtain the UTF-8 based value.
The example works anyway because it's only making use of US-ASCII which has the same offsets as UTF-8 for the example-data.
For a real life situation, the $search string should be UTF-8 encoded and mb_strpos should be used instead of strpos:
while(FALSE !== $start = mb_strpos($range, $search, 0, 'UTF-8'))

Modification to a code to merge two parts of it with similar characteristics

Below is a link crawler that gets the URLs of a page to a given depth. At the end of it I added a regular expression to match all the email addresses on the URL that was just crawled. As you can see in the second part, it calls file_get_contents on the same page it just downloaded, meaning twice the execution time, bandwidth, etc.
The question is: how can I merge those two parts so they use the first downloaded page, to avoid fetching it again? Thank you.
function crawler($url, $depth = 2) {
    $dom = new DOMDocument('1.0');
    if (!$parts || !@$dom->loadHTMLFile($url)) {
        return;
    }
    .
    .
    .
    // this is where the second part starts
    $text = file_get_contents($url);
    $res = preg_match_all("/[a-z0-9]+([_\\.-][a-z0-9]+)*@([a-z0-9]+([\.-][a-z0-9]+)*)+\\.[a-z]{2,}/i", $text, $matches);
}
Replace:
$text = file_get_contents($url);
with:
$text = $dom->saveHTML();
http://www.php.net/manual/en/domdocument.savehtml.php
Alternatively, in the first part of your function, you could save the HTML into a variable using file_get_contents, then pass it to $dom->loadHTML. That way you can then reuse the variable with your regex.
http://www.php.net/manual/en/domdocument.loadhtml.php
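For illustration, a minimal untested sketch of that second option; everything outside the shown lines is assumed to stay as in the question:
function crawler($url, $depth = 2) {
    // download the page once and reuse it for both the DOM and the regex
    $text = file_get_contents($url);
    if ($text === false) {
        return;
    }
    $dom = new DOMDocument('1.0');
    if (!@$dom->loadHTML($text)) {
        return;
    }
    // ... link-crawling logic on $dom ...
    // second part: match email addresses in the already-downloaded HTML
    $res = preg_match_all("/[a-z0-9]+([_\\.-][a-z0-9]+)*@([a-z0-9]+([\.-][a-z0-9]+)*)+\\.[a-z]{2,}/i", $text, $matches);
}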

How can I parse a very simple Table using PHP

Good day dear community!
I need to build a function which parses the content of a very simple table
(with some labels and values); see the URL below. I have used various ways to parse HTML sources, but this one is a bit tricky! See the target I want to parse - it has some invalid markup:
The target: http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=644.0013008534253&SchulAdresseMapDO=194190
Well, I tried it with this one:
<?php
require_once('config.php'); // config.php provides the db connection (assumed here as $conn)
$filename = "url.txt";      // text file containing the URLs, one per line
$each_line = file($filename);
foreach ($each_line as $line_num => $line)
{
    $line = trim($line);
    $content = file_get_contents($line);
    //echo ($content)."<br>";
    $pattern = '/<td>(.*?)<\/td>/si';
    preg_match_all($pattern, $content, $matches);
    foreach ($matches[1] as $match) {
        $match = strip_tags($match);
        $match = trim($match);
        //var_dump($match);
        // mysqli_query() needs the connection as its first argument
        $sql = mysqli_query($conn, "insert into tablename(contents) values ('$match')");
        //echo $match;
    }
}
?>
Well - see the regex in lines 7-11: it does not match!
Conclusion: I have to rework the parser part of this script. I need to parse it in some different way, since the parser code does not match what is intended. The aim is to get back the contents of the table.
Can anybody help me here to get a better regex - or a better way to parse this site?
Any and all help will be greatly appreciated.
regards
zero
You could tear the table apart using:
preg_split('/<td width="73%"> /', $str, -1); (note: I did not bother escaping characters)
You'll want to drop the first entry. Now you can use stripos and substr to cut away everything after the closing tag.
This is a basic setup! You will have to fine-tune it quite a bit, but I hope this gives you an idea of what my approach would be.
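For illustration only, a rough, untested sketch of that approach; the width="73%" selector comes from the answer above, everything else is a placeholder:
$content = file_get_contents($line); // the page downloaded in the loop above
$cells = preg_split('/<td width="73%"> /', $content, -1);
array_shift($cells); // drop everything before the first matching cell
foreach ($cells as $cell) {
    // cut away everything after the cell's closing tag
    $end = stripos($cell, '</td>');
    if ($end !== false) {
        $cell = substr($cell, 0, $end);
    }
    $value = trim(strip_tags($cell));
    echo $value . "\n";
}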
Regex does not always provide a perfect result. Using an HTML parser is a good idea; there are many HTML parsers, as described in Gordon's answer.
I have used Simple HTML DOM Parser in the past and it worked for me.
For Example:
// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');
// Find all <td> in <table> which has class=hello
$es = $html->find('table.hello td');
// Find all td tags with attribute align=center in table tags
$es = $html->find('table td[align=center]');

Inserting multiple links into text, ignoring matches that happen to be inserted

The site I'm working on has a database table filled with glossary terms. I am building a function that will take some HTML and replace the first instances of the glossary terms with tooltip links.
I am running into a problem though. Since it's not just one replace, the function is replacing text that has been inserted in previous iterations, so the HTML is getting mucked up.
I guess the bottom line is, I need to ignore text if it:
Appears within the < and > of any HTML tag, or
Appears within the text of an <a></a> tag.
Here's what I have so far. I was hoping someone out there would have a clever solution.
function insertGlossaryLinks($html)
{
    // Get glossary terms from database, once per request
    static $terms;
    if (is_null($terms)) {
        $query = Doctrine_Query::create()
            ->select('gt.title, gt.alternate_spellings, gt.description')
            ->from('GlossaryTerm gt');
        $glossaryTerms = $query->rows();
        // Create whole list in $terms, including alternate spellings
        $terms = array();
        foreach ($glossaryTerms as $glossaryTerm) {
            // Initialize with title
            $term = array(
                'wordsHtml' => array(
                    h(trim($glossaryTerm['title']))
                ),
                'descriptionHtml' => h($glossaryTerm['description'])
            );
            // Add alternate spellings
            foreach (explode(',', $glossaryTerm['alternate_spellings']) as $alternateSpelling) {
                $alternateSpelling = h(trim($alternateSpelling));
                if (empty($alternateSpelling)) {
                    continue;
                }
                $term['wordsHtml'][] = $alternateSpelling;
            }
            $terms[] = $term;
        }
    }
    // Do replacements on this HTML
    $newHtml = $html;
    foreach ($terms as $term) {
        $callback = create_function('$m', 'return \'<span>\'.$m[0].\'</span>\';');
        $term['wordsHtmlPreg'] = array_map('preg_quote', $term['wordsHtml']);
        $pattern = '/\b('.implode('|', $term['wordsHtmlPreg']).')\b/i';
        $newHtml = preg_replace_callback($pattern, $callback, $newHtml, 1);
    }
    return $newHtml;
}
Using Regexes to process HTML is always risky business. You will spend a long time fiddling with the greediness and laziness of your Regexes to only capture text that is not in a tag, and not in a tag name itself. My recommendation would be to ditch the method you are currently using and parse your HTML with an HTML parser, like this one: http://simplehtmldom.sourceforge.net/. I have used it before and have recommended it to others. It is a much simpler way of dealing with complex HTML.
I ended up using preg_replace_callback to replace all existing links with placeholders. Then I inserted the new glossary term links. Then I put back the links that I had replaced.
It's working great!
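A minimal, untested sketch of that placeholder idea; the wrapper name, the placeholder token format, and the link regex are assumptions, while insertGlossaryLinks refers to the function from the question:
function insertGlossaryLinksSafely($html)
{
    // 1. Replace every existing <a>...</a> element with a numbered placeholder
    $links = array();
    $html = preg_replace_callback('/<a\b[^>]*>.*?<\/a>/is', function ($m) use (&$links) {
        $links[] = $m[0];
        return '@@LINK' . (count($links) - 1) . '@@'; // hypothetical placeholder token
    }, $html);

    // 2. Insert the new glossary term links into the remaining text
    $html = insertGlossaryLinks($html);

    // 3. Put the original links back in place of the placeholders
    foreach ($links as $i => $link) {
        $html = str_replace('@@LINK' . $i . '@@', $link, $html);
    }
    return $html;
}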

Scrape unique image URLs from HTML

Using PHP to curl a web page (some URL entered by user, let's assume it's valid).
Example: http://www.youtube.com/watch?v=Hovbx6rvBaA
I need to parse the HTML and extract all de-duplicated URLs that seem to be images. Not just the ones in img src="" but any URL ending in jpe?g|bmp|gif|png, etc., on that page. (In other words, I don't want to parse the DOM but want to use regex.)
I plan to then curl the URLs for their width and height information and ensure that they are indeed images, so don't worry about security-related stuff.
What's wrong with using the DOM? It gives you much better control over the context of the information and a much higher likelihood that the things you pull out are actually URLs.
<?php
$resultFromCurl = '
<html>
<body>
<img src="hello.jpg" />
Yep
<table background="yep.jpg">
</table>
<p>
Perhaps you should check out foo.jpg! I promise it
is safe for work.
</p>
</body>
</html>
';
// These are all the attributes I could think of that
// can contain URLs.
$queries = array(
    '//table/@background',
    '//img/@src',
    '//input/@src',
    '//a/@href',
    '//area/@href',
    '//img/@longdesc',
);

$dom = new DOMDocument();
@$dom->loadHTML($resultFromCurl); // @ suppresses warnings about invalid markup
$xpath = new DOMXPath($dom);

$urls = array();
foreach ($queries as $query) {
    foreach ($xpath->query($query) as $link) {
        if (preg_match('#\.(gif|jpe?g|png)$#', $link->textContent))
            $urls[$link->textContent] = true;
    }
}

// also pick up bare image URLs that appear in the page text
if (preg_match_all('#\b[^\s]+\.(?:gif|jpe?g|png)\b#', $dom->textContent, $matches)) {
    foreach ($matches[0] as $m) {
        $urls[$m] = true;
    }
}
$urls = array_keys($urls);
var_dump($urls);
Collect all image urls into an array, then use array_unique() to remove duplicates.
$my_image_links = array_unique( $my_image_links );
// No more duplicates
If you really want to do this with a regex, then we can assume each image name will be surrounded by either ', ", spaces, tabs, line breaks, the beginning of a line, >, <, or whatever else you can think of. So then we can do:
$pattern = '/[\'" >\t^]([^\'" \n\r\t]+\.(jpe?g|bmp|gif|png))[\'" <\n\r\t]/i';
preg_match_all($pattern, html_entity_decode($resultFromCurl), $matches);
$imgs = array_unique($matches[1]);
The above will capture the image link in stuff like:
<p>Hai guys look at this ==> http://blah.com/lolcats.JPEG</p>
Live example
