I want help on this script I am making...
I want my website to work like a Wikipedia of its own. For example, I have a PHP website and I publish daily articles on it.
Suppose I publish two articles, one on Jenna Bush and one on Michael Jackson.
I then save each phrase and its link in a text file, XML, or a database.
Example:
jenna bush, http://www.domain.com/jenna.html
michael jackson, http://www.domain.com/michael.html
or in any other format required, like
<xml>
<item>
<text>jenna bush</text>
<link>http://www.domain.com/jenna.html</link>
</item>
... etc
</xml>
Now what I want is for the PHP script to automatically turn every occurrence of "jenna bush" or "michael jackson" into a link to its respective page, all over my website...
Any help is much appreciated...
Assuming that the text containing those words is in the database, a straightforward way to achieve something like that is str_replace http://ie2.php.net/manual/en/function.str-replace.php
Right before the text is submitted to the database you run a function on it that looks for certain phrases and replaces them with other phrases.
Alternatively, and probably a better approach, is the one that MediaWiki (the software that Wikipedia runs on) uses: every time you want to create a link to another article in MediaWiki, you put [[ ]] around its title, for example [[Michael Jackson]].
That way you have more control over what becomes a link.
Example: if you had an article on Prince the musician and one on Prince Charles and you wanted to link to Prince Charles, the first method might find Prince first and link to him. With the MediaWiki method you would write [[Prince Charles]] and it would know exactly what to look for.
To do that I'd recommend preg_match http://www.php.net/manual/en/function.preg-match.php
It may be worth having a look at how MediaWiki does the same thing; you can download it for free and it's written in PHP.
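For anyone who wants to try the [[ ]] idea, here is a minimal sketch. It assumes a hypothetical $articles map from lowercased titles to URLs (in practice that would come from your database), and it uses preg_replace_callback so that every marker is found and replaced in one pass. The function name wiki_autolink is made up for the example.

```php
<?php
// Hypothetical lookup table: lowercased article title => URL.
$articles = array(
    'prince charles'  => 'http://www.domain.com/prince-charles.html',
    'michael jackson' => 'http://www.domain.com/michael.html',
);

// Convert [[Title]] markers into links; unknown titles are left as plain text.
function wiki_autolink($text, array $articles)
{
    return preg_replace_callback('/\[\[([^\[\]]+)\]\]/', function ($m) use ($articles) {
        $title = trim($m[1]);
        $key   = strtolower($title);
        $label = htmlspecialchars($title, ENT_QUOTES);
        if (!isset($articles[$key])) {
            return $label; // no such article: drop the brackets, keep the text
        }
        return '<a href="' . htmlspecialchars($articles[$key], ENT_QUOTES) . '">' . $label . '</a>';
    }, $text);
}

echo wiki_autolink('[[Prince Charles]] met Prince.', $articles);
```

Only the bracketed "Prince Charles" becomes a link here; the bare "Prince" is left alone, which is the control the answer above is talking about.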
I customized it and here is for everyone interested
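function tags_autolink($text)
{
    $text = " $text ";
    // Note: the mysql_* functions were removed in PHP 7; see the mysqli version below.
    $query_tags_autolink = "SELECT tag from tags";
    $rs_tags_autolink = mysql_query($query_tags_autolink) or print "error getting tags";
    while($row_tags_autolink = mysql_fetch_array($rs_tags_autolink))
    {
        $tag_name = trim($row_tags_autolink['tag']);
        $tag_url = "http://www.domain.com/tag/".createLink(trim(htmlentities($tag_name)))."/";
        // Escape the tag so characters such as . or | are matched literally.
        $tag_pattern = preg_quote($tag_name, '|');
        // Wrap each whole-word match that is not inside an HTML tag in a link to its tag page.
        $text = preg_replace("|(?!<[^<>]*?)(?<![?./&])\b($tag_pattern)\b(?!:)(?![^<>]*?>)|imsU", "<a href=\"$tag_url\">$1</a>", $text);
    }
    return trim( $text );
}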
The createLink function simply turns a string like "abcd is kk" into "abcd-is-kk" for the end of the tag page URL ;)
Cheers!
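The createLink helper itself was not posted. A hypothetical version that matches the description (lowercase, words joined with hyphens) could look like this; treat it as a guess at the behaviour, not the author's code.

```php
<?php
// Hypothetical stand-in for the createLink() helper described above:
// lowercases the tag and joins its words with hyphens.
function createLink($tag)
{
    $slug = strtolower(trim($tag));
    // Collapse every run of characters other than a-z and 0-9 into one hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    // Remove hyphens left over at either end.
    return trim($slug, '-');
}

echo createLink('abcd is kk'); // prints abcd-is-kk
```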
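function auto_href($x)
{
    $z = array();
    $x = explode(' ', $x);
    foreach ($x as $y)
    {
        // Wrap every word that starts with http:// in a link to itself.
        if (substr($y, 0, 7) == 'http://')
            $y = '<a href="'.$y.'">'.$y.'</a>';
        $z[] = $y;
    }
    // implode() takes the separator first; the reversed order was removed in PHP 8.
    return implode(' ', $z);
}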
function tags_autolink()
{
    $conn = mysqli_connect("localhost", "root", "", "sample")
        or die ("Could not connect to mysql because ".mysqli_connect_error());
    $text = 'Your paragraph or text here';
    $query_tags_autolink = "SELECT tag from tags";
    $rs_tags_autolink = mysqli_query($conn,$query_tags_autolink) or print "error getting tags";
    while($row_tags_autolink = mysqli_fetch_array($rs_tags_autolink))
    {
        $tag_name = trim($row_tags_autolink['tag']);
        // Build the tag page URL: spaces become hyphens, everything lowercase.
        $trimedurl = str_replace(' ', '-',$tag_name);
        $trimedurl = strtolower($trimedurl);
        $tag_url = "http://yourdomain/tag/$trimedurl";
        // Escape the tag so characters such as . or | are matched literally.
        $tag_pattern = preg_quote($tag_name, '|');
        // Wrap each whole-word match that is not inside an HTML tag in a link to its tag page.
        $text = preg_replace("|(?!<[^<>]*?)(?<![?./&])\b($tag_pattern)\b(?!:)(?![^<>]*?>)|imsU", "<a href=\"$tag_url\">$1</a>", $text);
    }
    return trim($text);
}
echo tags_autolink() ;
Wikipedia's automatic hyperlinking code is in mediawiki:Parser.php, methods handleMagicLinks and makeFreeExternalLink.
The first searches for protocols, the latter removes stuff like trailing punctuation.
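A rough sketch of that same two-step idea in plain PHP follows. It is not MediaWiki's actual code: the function name linkify_urls and the set of punctuation characters it strips are assumptions made for the example.

```php
<?php
// Sketch only: linkify_urls() and its punctuation set are illustrative choices.
function linkify_urls($text)
{
    // Step 1: find anything that starts with a protocol.
    return preg_replace_callback('~\bhttps?://[^\s<>"]+~i', function ($m) {
        $url = $m[0];
        // Step 2: trailing punctuation usually belongs to the sentence, not the URL.
        $trimmed = rtrim($url, '.,;:!?)');
        $tail    = substr($url, strlen($trimmed));
        $safe    = htmlspecialchars($trimmed, ENT_QUOTES);
        return '<a href="' . $safe . '">' . $safe . '</a>' . $tail;
    }, $text);
}

echo linkify_urls('See http://www.domain.com/jenna.html, then reply.');
```

The comma after jenna.html ends up outside the link, so the href still points at the real page.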
I'm trying to replace a title tag from |title|Page title| to <title>Page Title</title>, using this regular expression. But being a complete amateur, it's not gone too well.
'^|title|^[a-zA-Z0-9_]{1,}|$' => '<title>$1</title>'
I would love to know how to fix it, and more importantly, what I did wrong and why it was wrong.
You almost got it:
You should escape the | characters as they have special meaning in a
regex and you are using it as a plain character.
You should add the space character to your search group
$string = '|title|Page title|';
$pattern = '/\|title\|([a-zA-Z0-9_ ]{1,})\|/';
$replacement = '<title>$1</title>';
echo preg_replace($pattern, $replacement, $string); //echoes <title>Page title</title>
See working demo
OP posted some code in comments which is wrong, try this version:
$regular_expressions = array( array( '/\|title\|([a-zA-Z0-9_ ]{1,})\|/' , '<title>$1</title>' ));
foreach($regular_expressions as $regexp){
$data = preg_replace($regexp[0], $regexp[1], $data);
}
Here's a little function I came up with a while back to scrape the titles of pages when users submitted links through my service. It fetches the contents of the provided URL and looks for a title tag; if one is found, it returns whatever is between the opening and closing tags. With a little tweaking I am sure you can add a replace step for whatever you're doing and make it work for your needs. So this is more of a starting point than an answer, but I hope it helps to some extent.
$url = 'http://www.chrishacia.com';
function get_page_title($url){
if( !($data = file_get_contents($url)) ) return false;
if( preg_match("#<title>(.+)<\/title>#iU", $data, $t)) {
return trim($t[1]);
} else {
return false;
}
}
var_dump(get_page_title($url));
<?php
$s = "|title|Page title|";
$s = preg_replace('/^\|title\|([^\|]+)\|/', "<title>$1</title>", $s);
echo $s;
?>
K2 is turning unnecessary text into URLs in item comments.
1. Created an item using the Joomla admin panel and, as a guest, entered a comment with the following text:
"node.js is a power full js engine. Enven.though this is not a valid url it has been rendered as valid.url anything with xxx.xxx are parsed as urls and even like sub domain syntax iam.not.valid i.e mail.yahoo.com how funny this is"
In the above comment, node.js, even.though, valid.url, xxx.xxx, iam.not.valid and mail.yahoo.com are all rendered as valid URLs, but only mail.yahoo.com is actually valid.
K2 does this using the following snippet in $JHOME/components/com_k2/views/item/view.html.php (lines 159-178):
$comments = $model->getItemComments($item->id, $limitstart, $limit, $commentsPublished);
$pattern = "@\b(https?://)?(([0-9a-zA-Z_!~*'().&=+$%-]+:)?[0-9a-zA-Z_!~*'().&=+$%-]+\@)?(([0-9]{1,3}\.){3}[0-9]{1,3}|([0-9a-zA-Z_!~*'()-]+\.)*([0-9a-zA-Z][0-9a-zA-Z-]{0,61})?[0-9a-zA-Z]\.[a-zA-Z]{2,6})(:[0-9]{1,4})?((/[0-9a-zA-Z_!~*'().;?:\@&=+$,%#-]+)*/?)@";
for ($i = 0; $i < sizeof($comments); $i++) {
$comments[$i]->commentText = nl2br($comments[$i]->commentText);
$comments[$i]->commentText = preg_replace($pattern, '<a target="_blank" rel="nofollow" href="\0">\0</a>', $comments[$i]->commentText);
$comments[$i]->userImage = K2HelperUtilities::getAvatar($comments[$i]->userID, $comments[$i]->commentEmail, $params->get('commenterImgWidth'));
if ($comments[$i]->userID>0) {
$comments[$i]->userLink = K2HelperRoute::getUserRoute($comments[$i]->userID);
}
else {
$comments[$i]->userLink = $comments[$i]->commentURL;
}
if($reportSpammerFlag && $comments[$i]->userID>0) {
$comments[$i]->reportUserLink = JRoute::_('index.php?option=com_k2&view=comments&task=reportSpammer&id='.$comments[$i]->userID.'&format=raw');
}
else {
$comments[$i]->reportUserLink = false;
}
}
Can somebody help fixing above regular expression? Thanks
You are going to have this problem any time a user types in a period with no spaces around it. You could add in some logic to test for valid TLDs, but even that would not be perfect because there are plenty of TLDs that would fool the logic, like .it.
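One possible shape for such a TLD test, as a sketch only: match anything that looks like a bare host name, then link it only when its last label is on a short allow-list. The function name linkify_hosts and the three sample TLDs are assumptions, the list is nowhere near complete, and text that already carries an http:// prefix is not handled.

```php
<?php
// Sketch: link scheme-less host names only when the last label is an allowed TLD.
// The allow-list is an illustrative sample, not a complete registry.
function linkify_hosts($text, array $tlds = array('com', 'org', 'net'))
{
    return preg_replace_callback('~\b(?:[a-z0-9-]+\.)+([a-z]{2,})\b~i', function ($m) use ($tlds) {
        if (!in_array(strtolower($m[1]), $tlds, true)) {
            return $m[0]; // not a known TLD: leave the text alone
        }
        return '<a rel="nofollow" href="http://' . $m[0] . '">' . $m[0] . '</a>';
    }, $text);
}

echo linkify_hosts('node.js and even.though stay text, mail.yahoo.com is linked');
```

As the answer says, this only narrows the problem: put "it" on the list and "do.it" becomes a link again.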
If you want to try your hand at fixing the regular expression, the pattern that determines if a string is a URL is here -
$pattern = "@\b(https?://)?(([0-9a-zA-Z_!~*'().&=+$%-]+:)?[0-9a-zA-Z_!~*'().&=+$%-]+\@)?(([0-9]{1,3}\.){3}[0-9]{1,3}|([0-9a-zA-Z_!~*'()-]+\.)*([0-9a-zA-Z][0-9a-zA-Z-]{0,61})?[0-9a-zA-Z]\.[a-zA-Z]{2,6})(:[0-9]{1,4})?((/[0-9a-zA-Z_!~*'().;?:\@&=+$,%#-]+)*/?)@";
Personally, I would just disable links in comments altogether by removing or commenting out this code -
$comments[$i]->commentText = preg_replace($pattern, '<a target="_blank" rel="nofollow" href="\0">\0</a>', $comments[$i]->commentText);
For example, I have this string:
@[1234:peterwateber] <b>hello</b> <div>hi!</div> http://stackoverflow.com
I want to convert it into HTML like this:
@peterwateber <b>hello</b> <div>hi!</div>
http://stackoverflow.com
I'm using QueryPath, and I have this code that extracts "1234" and "peterwateber" from "@[1234:peterwateber]".
The code to do that is:
$hidden_input = "@[1234:peterwateber] <b>hello</b> <div>hi!</div> http://stackoverflow.com";
preg_match('#@\[(\w+)\:(\w+)\]#', $hidden_input, $m); // $m[1] is 1234, $m[2] is peterwateber
What I'm trying to achieve is to have this kind of output:
I'm using Hawkee's plugin for jQuery autocomplete http://www.hawkee.com/snippet/9391/
I'm not entirely sure if there is a specific function just for that, but what you can do is this,
taking the link (a href) as an example:
$raw = "@[1234:peterwateber]";
// Pass 1: the opening marker becomes the start of the anchor tag.
$firstpass = str_replace("@[", "<a href='", $raw);
// Pass 2: the colon closes the href attribute.
// Note: this replaces every colon, so run it on the mention only, not on text containing URLs.
$secondpass = str_replace(":", "'>", $firstpass);
// Pass 3: the closing bracket becomes the closing tag.
$result = str_replace("]", "</a>", $secondpass);
// $result is now <a href='1234'>peterwateber</a>
I know it seems tedious but it should do the trick. If it's not helpful then please don't rate me down... I spent time on this
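An alternative sketch that does the same in one step with preg_replace, reusing the two capture groups from the preg_match in the question, so colons elsewhere in the text (for example in http://) are left untouched. The function name link_mentions and the /user/ URL prefix are made-up placeholders, and the pattern accepts either @ or # in front of the bracket.

```php
<?php
// Sketch: turn "[id:name]" mentions (prefixed with @ or #) into links.
// The /user/ URL prefix is a placeholder, not something from the question.
function link_mentions($text)
{
    return preg_replace(
        '/([@#])\[(\w+):(\w+)\]/',          // $1 = prefix, $2 = id, $3 = name
        '<a href="/user/$2">$1$3</a>',
        $text
    );
}

echo link_mentions('@[1234:peterwateber] <b>hello</b> http://stackoverflow.com');
```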
I have a dynamic sitemap creation script that recursively looks through the filesystem and, when needed, opens the start of a file, then looks for the text between the <title> tags. It worked when the setup was like...
<title>Page Title | Company, Inc.</title>
But now I've added a wrinkle and set up a way to manage titles and meta info via an admin tool. Just in case something isn't entered, I want to fall back to the old default title info, but my preg_match isn't working for this...
<title><?=@$page_meta_array['page_title']!=''?$page_meta_array['page_title']:"Default Title Here";?> | Company, Inc.</title>
The php function that I pass the page to looks like this...
function get_title($filename) {
$retval = "";
$handle = fopen($filename, "r");
$head = fread($handle, 4096);
preg_match(";<title>(.+)</title>;", $head, $matches);
if(sizeof($matches) == 2) {
$retval = trim($matches[1]);
}
fclose($handle);
return $retval;
}
Can someone point me to the correct preg_match? I'd like to have "Default Title Here" returned from the second example above.
Thanks
This might do it:
preg_match(';<title><\?[^>]*"([^"]*)"[^>]*>.*?</title>;', $head, $matches);
Note that the short open tag <? depends on the short_open_tag setting and may be disabled; the <?= echo form is always available as of PHP 5.4.
you first need to parse the PHP code, then check for matches. Why don't you just use the contents of $page_meta_array['page_title']?
What would be the best way to write code in PHP that searches a web page for a number of words stored in a file? Is it best to store the page's source code in a file, or is there another way? Please help.
The best way is to use google: site:example.com word1 OR word2 OR word3
Do you want to search in ONE PAGE, or in one website with MULTIPLE PAGES?
If it's only one page, I think you can store the HTML code in memory without problems.
If you know exactly what you are searching for, strpos for each word will probably be the fastest (stripos for case-insensitive matching). You can also define your own character class and use preg_match_all or something similar. Something like this will do:
<?php
$keywords = array("word1","word2","word3");
$doc = strip_tags(file_get_contents("http://www.example.com")); // remove tags to get only text
$doc = preg_replace('/\s+/', ' ',$doc); // collapse runs of whitespace
foreach($keywords as $word) {
    $pos = stripos($doc,$word);
    if($pos !== false) {
        // show some context; max() keeps the start offset from going negative
        $context = substr($doc, max(0, $pos-20), 50);
        // str_ireplace so the highlight is case-insensitive like the search
        echo "match: ...".str_ireplace($word,"<em>$word</em>",$context)."... \n";
    }
}
?>
Something like the following will be MUCH faster when there are many keywords: it splits the text into words once and builds a hash map of them, so each keyword is an O(1) lookup instead of a scan of the whole text...
<?php
setlocale(LC_ALL, "en_US.utf8");
$keywords = array("word1","word2","word3","word4");
$doc = file_get_contents("http://www.example.com");
$doc = preg_replace('!/\*.*?\*/!s', '', $doc);            // CSS/JS block comments
$doc = preg_replace('/<!--.*?-->/s', '', $doc);           // HTML comments
$doc = preg_replace('!<script.*?</script>!is', '', $doc);
$doc = preg_replace('!<style.*?</style>!is', '', $doc);
$doc = strip_tags($doc);
// transliterate BEFORE stripping non-ASCII characters, otherwise accented letters are simply lost
$doc = iconv('UTF-8', 'ASCII//TRANSLIT', $doc); // check if encoding is really utf8
$doc = strtolower($doc);
$doc = preg_replace('/[^0-9a-z\s]/','',$doc);
//$doc = preg_replace('{(.)\1+}','$1',$doc); remove duplicate chars ... possible step to add even more fuzzyness
$doc = preg_split("/\s+/",trim($doc));
// hash map: word => position of its first occurrence
$index = array();
foreach($doc as $i => $token) {
    if(!isset($index[$token])) $index[$token] = $i;
}
foreach($keywords as $word) {
    $word = strtolower(iconv('UTF-8', 'ASCII//TRANSLIT', $word));
    if(isset($index[$word])) {
        $key = $index[$word];
        echo "match: ";
        // print the match and the five words that follow it
        for($i=$key;$i<=$key+5 && isset($doc[$i]);$i++) {
            echo $doc[$i]." ";
        }
        echo "\n";
    }
}
?>
this code is untested.
It would, however, be more elegant to dump the text nodes from a DOMDocument.
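A minimal sketch of that DOMDocument idea, assuming the dom extension is available. The function name html_to_text is made up; the XPath expression collects every text node that is not inside a script or style element.

```php
<?php
// Sketch: collect visible text from HTML via DOMDocument instead of regexes.
function html_to_text($html)
{
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $parts = array();
    // Every text node that is not inside <script> or <style>.
    foreach ($xpath->query('//text()[not(ancestor::script) and not(ancestor::style)]') as $node) {
        $t = trim($node->nodeValue);
        if ($t !== '') {
            $parts[] = $t;
        }
    }
    return implode(' ', $parts);
}

echo html_to_text('<html><body><p>Hello <b>world</b></p><script>var x = 1;</script></body></html>');
```

The returned string can then go through the same lowercasing and word splitting as in the code above.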
Simple searching is easy. If you want to search a whole website, the crawling logic is the difficult part.
I once built a backlink checker for a company that worked like a crawler.
My first advice is not to use recursion (scanning a page, following all its links, following all the links on those pages, and so on until you reach a certain level).
Rather, do it like this:
Run a for loop once for each level you want to crawl.
Set up a site array with one entry (the start page).
Pass the array to a function that downloads every link in it, scans each page and stores the links found on it in a new array.
When done with all links, return the new link list array.
In the for loop, update the array with the return value of the function, and call the function again.
This way you avoid following nasty paths and instead crawl the website level by level.
Also store already visited links in an array so you can skip them, don't follow external links, check for weird URL parameters, etc.
For future use you can store the documents in Lucene or Solr; there are classes that turn HTML pages into sensible Lucene objects you can search within.
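The level-by-level loop described above could be sketched roughly like this. The names crawl and extract_links are made up, only absolute http(s) links in double-quoted href attributes are picked up, and the download function is passed in as $fetch so the loop can be tried without network access.

```php
<?php
// Sketch of the level-by-level crawl. $fetch is any callable that takes a URL
// and returns its HTML, or false on failure (an assumption of this example).
function extract_links($html, $host)
{
    preg_match_all('~href="(https?://[^"#]+)~i', $html, $m);
    $links = array();
    foreach ($m[1] as $url) {
        if (parse_url($url, PHP_URL_HOST) === $host) { // don't follow external links
            $links[] = $url;
        }
    }
    return $links;
}

function crawl($start, $levels, $fetch)
{
    $host    = parse_url($start, PHP_URL_HOST);
    $visited = array();        // URLs already downloaded
    $queue   = array($start);  // URLs to download on the current level
    for ($level = 0; $level < $levels; $level++) {
        $next = array();
        foreach ($queue as $url) {
            if (isset($visited[$url])) {
                continue;      // skip pages we have already seen
            }
            $visited[$url] = true;
            $html = call_user_func($fetch, $url);
            if ($html === false) {
                continue;
            }
            foreach (extract_links($html, $host) as $link) {
                $next[] = $link;
            }
        }
        $queue = array_unique($next); // the next level's link list
    }
    return array_keys($visited);
}
```

For a real site you would call it as crawl('http://www.example.com/', 2, 'file_get_contents') and do the keyword search on each page inside the loop.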