Wrong keywords when extracting content from a website (OOP, PHP)

I have a problem when extracting keywords from a website (a wiki article): the extracted keywords are not really keywords. They are words taken from the HTML markup, not from the page's content.
I use the following code:
include("Extkeys.php");
[...]
if (empty($keywords)){
$ekeywords = new KeyPer;
$keywords = $ekeywords->Keys($webhtml);
}
And the code of "Extkeys" is:
<?php
class Extkeys {
function Keys($webhtml) {
$webhtml = $this->clean($webhtml);
$blacklist='de,la,los,las,el,ella,nosotros,yo,tu,el,te,mi,del,ellos';
$sticklist='test';
$minlength = 3;
$count = 17;
$webhtml = preg_replace('/[\.;:|\'|\"|\`|\,|\(|\)|\-]/', ' ', $webhtml);
$webhtml = preg_replace('/¡/', '', $webhtml);
$webhtml = preg_replace('/¿/', '', $webhtml);
$keysArray = explode(" ", $webhtml);
$keysArray = array_count_values(array_map('strtolower', $keysArray));
$blackArray = explode(",", $blacklist);
foreach($blackArray as $blackWord){
if(isset($keysArray[trim($blackWord)]))
unset($keysArray[trim($blackWord)]);
}
arsort($keysArray);
$i = 1;
$keywords = "";
foreach($keysArray as $word => $instances){
if($i > $count) break;
if(strlen(trim($word)) >= $minlength && is_string($word)) {
$keywords .= $word . ", ";
$i++;
}
}
$keywords = rtrim($keywords, ", ");
return $keywords=$sticklist.''.$keywords;
}
function clean($webhtml) {
$regex = '/(([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)*#([A-Za-z0-9-]+)(\\.[A-Za-z0-9-]+)*)/iex';
$desc = preg_replace($regex, '', $webhtml);
$webhtml = preg_replace( "''si", '', $webhtml );
$webhtml = preg_replace( '/]*>([^<]+)<\/a>/is', '\2 (\1)', $webhtml );
$webhtml = preg_replace( '//', '', $webhtml );
$webhtml = preg_replace( '/{.+?}/', '', $webhtml );
$webhtml = preg_replace( '/ /', ' ', $webhtml );
$webhtml = preg_replace( '/&/', ' ', $webhtml );
$webhtml = preg_replace( '/"/', ' ', $webhtml );
$webhtml = strip_tags( $webhtml );
$webhtml = htmlspecialchars($webhtml);
$webhtml = str_replace(array("\r\n", "\r", "\n", "\t"), " ", $webhtml);
while (strchr($webhtml," ")) {
$webhtml = str_replace(" ", "",$webhtml);
}
for ($cnt = 1;
$cnt < strlen($webhtml)-1; $cnt++) {
if (($webhtml{$cnt} == '.') || ($webhtml{$cnt} == ',')) {
if ($webhtml{$cnt+1} != ' ') {
$webhtml = substr_replace($webhtml, ' ', $cnt + 1, 0);
}
}
}
return $webhtml;
}
}
?>
This is an example of the extracted keywords:
testfalse, lang, {mw, loader, window, function, true, vector, user, gadget, mediawiki, legacy, options, usebetatoolbar, implement, resourceloader, default
for the article:
http://en.wikipedia.org/wiki/Searchengine
The "Extkeys" code is copied from a tutorial and adapted by me to make it work.
How can I make the code extract the keywords from the page's content rather than from its HTML?
Best regards!

Assuming I understand your question, I think simply doing the following is the solution you're looking for.
This will read the HTML from a URL (e.g. http://www.whatever.com/page.html) and use that to generate the keys, rather than requiring the HTML as a parameter.
function Keys($url) {
$webhtml = file_get_contents($url);

You want to extract the content from the page first and then search for keywords. That means finding the page's actual content and stripping things such as sidebars, footers, etc.
Search for "HTML content extraction"; there are numerous articles about it.
I did this once in Java, where there is a library called boilerpipe. I'm not sure whether there is a PHP port or interface; a quick search didn't reveal anything, but I'm sure there are similar libraries for PHP.
The easiest way to get rid of the HTML, without specifically isolating the page content, is a regex that strips all tags, something like s/<[^>]+>//g. For a search engine that's probably not the best approach, though, since you end up with a lot of junk that can skew your keyword extraction.
EDIT: Here is an article on content extraction with PHP.
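As a rough illustration of "extract the content first, then count words", here is a minimal sketch. It assumes the article text lives in <p> elements, which is a simplification and not a general content-extraction algorithm. The function name extract_keywords() and its parameters are made up for this example; it needs the dom and mbstring extensions.

```php
<?php
// Sketch only: extract_keywords() and its parameters are hypothetical names.
function extract_keywords(string $html, array $stopWords, int $limit = 10, int $minLength = 3): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);                   // tolerate real-world markup
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);  // hint that the input is UTF-8
    libxml_clear_errors();

    // Collect text from <p> elements only, so <script>, <style> and menus are skipped.
    $text = '';
    foreach ($doc->getElementsByTagName('p') as $p) {
        $text .= ' ' . $p->textContent;
    }

    // Split on anything that is not a letter or a digit (Unicode-aware).
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text, 'UTF-8'), -1, PREG_SPLIT_NO_EMPTY);

    $counts = [];
    foreach ($words as $word) {
        if (mb_strlen($word, 'UTF-8') < $minLength || in_array($word, $stopWords, true)) {
            continue;                                   // too short or blacklisted
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }

    arsort($counts);                                    // most frequent first, keys kept
    return array_map('strval', array_slice(array_keys($counts), 0, $limit));
}

// Inline sample page instead of a live URL, so the sketch has no network dependency.
$html = '<html><head><script>var loader = "loader loader loader";</script></head>'
      . '<body><p>Search engines index pages. A search engine ranks pages for each search.</p></body></html>';
echo implode(', ', extract_keywords($html, ['for', 'each'])), "\n";
```

For a live page you would pass the result of file_get_contents($url) as $html. A dedicated content-extraction library would do a better job on pages whose main text is not in plain paragraphs.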

Related

PHP - Replace specific word inside tag span

I would like to replace the word "custom" with
<span class="persProd">custom</span>.
This is my code, but it does not work:
$output = '<span>Special custom products</span>';
$test = '~<span>custom</span>~';
$outputEdit = preg_replace($test, '<span class="persProd">custom</span>', $output);
echo $outputEdit;
How can I do this?
Thanks for any help.
I would do it like this. Watch out for 'custom' appearing in the $subject string more than once: every occurrence will be replaced. I matched it with surrounding spaces, like so: ' custom '
$subject = '<span>Special custom products</span>';
$search = ' custom ';
$replace = '<span class="persProd"> custom </span>';
$outputEdit = str_replace($search, $replace, $subject);
echo $outputEdit;
Output: <span>Special<span class="persProd"> custom </span>products</span>
Here is the str_replace() page in the php manual for more.
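If matching on the surrounding spaces is too fragile (for example when the word is followed by a comma), a word-boundary regex is one alternative. This is only a sketch reusing the string from the question. It would also match "custom" inside an attribute value, so it is not a substitute for a real HTML parser.

```php
<?php
$output = '<span>Special custom products</span>';

// \b is a word boundary: "custom" is wrapped, "customer" would be left alone.
// $0 in the replacement stands for the whole match.
$outputEdit = preg_replace('/\bcustom\b/', '<span class="persProd">$0</span>', $output);

echo $outputEdit, "\n";
```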
Here is my example; it works not only with tags but with any unique delimiter strings too.
<?php
function string_between_two_tags($str, $starting_tag, $ending_tag, $string4replace)
{
$start = strpos($str, $starting_tag)+strlen($starting_tag);
$end = strpos($str, $ending_tag);
return substr($str, 0, $start).$string4replace.substr($str, $end);
}
$output = '<span>Special custom products</span>';
$res = string_between_two_tags($output, '<span>', '</span>', 'custom');
echo $res;
?>

How can I avoid adding href to an overlapping keyword in string?

Using the following code:
$text = "أطلقت غوغل النسخة المخصصة للأجهزة الذكية العاملة بنظام أندرويد من الإصدار “25″ لمتصفحها الشهير كروم.ولم تحدث غوغل تطبيق كروم للأجهزة العاملة بأندرويد منذ شهر تشرين الثاني العام الماضي، وهو المتصفح الذي يستخدمه نسبة 2.02% من أصحاب الأجهزة الذكية حسب دراسة سابقة. ";
$tags = "غوغل, غوغل النسخة, كروم";
$tags = explode(",", $tags);
foreach($tags as $k=>$v) {
$text = preg_replace("/\b{$v}\b/u","$0",$text, 1);
}
echo $text;
Will give the following result:
I love PHP">love PHP</a>, but I am facing a problem
Note that my text is in Arabic.
The way to do it is in a single pass. The idea is to build a pattern with an alternation of the tags. For this to work, you must sort the tags first, because the regex engine stops at the first alternative that succeeds (otherwise 'love' would always match even when followed by 'php', and 'love php' would never be matched).
To limit the replacement to the first occurrence of each tag, remove the tag from the array once it has been found, and check inside the replacement callback whether it is still present in the array:
$text = 'I love PHP, I love love but I am facing a problem';
$tagsCSV = 'love, love php, facing';
$tags = explode(', ', $tagsCSV);
rsort($tags);
$tags = array_map('preg_quote', $tags);
$pattern = '/\b(?:' . implode('|', $tags) . ')\b/iu';
$text = preg_replace_callback($pattern, function ($m) use (&$tags) {
$mLC = mb_strtolower($m[0], 'UTF-8');
if (false === $key = array_search($mLC, $tags))
return $m[0];
unset($tags[$key]);
return '<a href="index.php?s=news&tag=' . rawurlencode($mLC)
. '">' . $m[0] . '</a>';
}, $text);
Note: when you build a URL you must encode special characters; that is why I use preg_replace_callback instead of preg_replace, so that I can call rawurlencode.
If you have to deal with a UTF-8 encoded string, you need to add the u modifier to the pattern and replace strtolower with mb_strtolower.
The preg_split way:
$tags = explode(', ', $tagsCSV);
rsort($tags);
$tags = array_map('preg_quote', $tags);
$pattern = '/\b(' . implode('|', $tags) . ')\b/iu';
$items = preg_split($pattern, $text, -1, PREG_SPLIT_DELIM_CAPTURE);
$itemsLength = count($items);
$i = 1;
while ($i<$itemsLength && count($tags)) {
if (false !== $key = array_search(mb_strtolower($items[$i], 'UTF-8'), $tags)) {
$items[$i] = '<a href="index.php?s=news&tag=' . rawurlencode($tags[$key])
. '">' . $items[$i] . '</a>';
unset($tags[$key]);
}
$i+=2;
}
$result = implode('', $items);
Instead of calling preg_replace multiple times, call it a single time with a regexp that matches any of the tags:
$tags = array_map('trim', explode(",", $tags)); // trim the space left after each comma
$tags_re = '/\b(' . implode('|', $tags) . ')\b/u';
$text = preg_replace($tags_re, '$0', $text, 1);
This turns the list of tags into the regexp /\b(love|love php|facing)\b/u. x|y in a regexp means to match either x or y.

Trying to extract keywords from a website PHP (OOP)

I still have the keyword problem, but this time with code I'm writing myself.
It's poor code, but it's my own:
<?php
$url = 'http://es.wikipedia.org/wiki/Animalia';
Keys($url);
function Keys($url) {
$listanegra = array("a", "ante", "bajo", "con", "contra", "de", "desde", "mediante", "durante", "hasta", "hacia", "para", "por", "que", "qué", "cuán", "cuan", "los", "las", "una", "unos", "unas", "donde", "dónde", "como", "cómo", "cuando", "porque", "por", "para", "según", "sin", "tras", "con", "mas", "más", "pero", "del");
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($url);
$webhtml = $doc->getElementsByTagName('p');
$webhtml = $webhtml ->item(0)->nodeValue;
$webhtml = strip_tags($webhtml);
$webhtml = explode(" ", $webhtml);
foreach($listanegra as $key=> $ln) {
$webhtml = str_replace($ln, " ", $webhtml);
}
$palabras = str_word_count ("$webhtml", 1 );
$frq = array_count_values ($palabras);
$frq = asort($frq);
$ffrq = count($frq);
$i=1;
while ($i < $ffrq) {
print $frqq[$i];
print '<br />';
$i++;
}
}
?>
The code tries to extract keywords from a website. It extracts the first paragraph of a page and removes the words listed in the variable "$listanegra". Next, it counts the repeated words and saves all the words in an array. Then I loop over the array to display the words.
The problem is that the code doesn't work =(.
When I run it, it shows a blank page.
Could you help me finish my code? I was advised to use "tf-idf", but I will use it later.
I do believe this is what you were trying to do:
$url = 'http://es.wikipedia.org/wiki/Animalia';
$words = Keys($url);
/// do your database stuff with $words
function Keys($url)
{
$listanegra = array('a', 'ante', 'bajo', 'con', 'contra', 'de', 'desde', 'mediante', 'durante', 'hasta', 'hacia', 'para', 'por', 'que', 'qué', 'cuán', 'cuan', 'los', 'las', 'una', 'unos', 'unas', 'donde', 'dónde', 'como', 'cómo', 'cuando', 'porque', 'por', 'para', 'según', 'sin', 'tras', 'con', 'mas', 'más', 'pero', 'del');
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($url);
$webhtml = $doc->getElementsByTagName('p');
$webhtml = $webhtml->item(0)->nodeValue;
$webhtml = strip_tags($webhtml);
$webhtml = explode(' ', $webhtml);
$palabras = array();
foreach($webhtml as $word)
{
$word = strtolower(trim($word, ' .,!?()')); // remove trailing special chars and spaces
if (!in_array($word, $listanegra))
{
$palabras[] = $word;
}
}
$frq = array_count_values($palabras);
asort($frq);
return implode(' ', array_keys($frq));
}
Your server should show the errors while you are testing.
Add this at the top of the script:
ini_set('display_errors', 1);
ini_set('log_errors', 1);
ini_set('error_log', dirname(__FILE__) . '/error_log.txt');
error_reporting(E_ALL);
That way you will see the error:
Array to string conversion on line 24 (line 19 if you don't put the 5 new lines)
Here are the errors I found. Four functions are not used the way they should be: str_replace, str_word_count, asort and array_count_values.
Using str_replace is a little tricky. Trying to find and remove "a" removes every "a" in the text, even inside "animal" (str_replace("a", "", "animal") => "niml").
This link should be useful: link
asort() returns true or false, so assigning its result back to $frq replaces the array with a boolean. Just call:
asort($frq);
This sorts the array in place by value, in ascending order, and keeps the keys. $frq holds the result of array_count_values --> $frq = array($word1 => word1_count, ...)
The value here is the number of times the word is used, so when you later have:
print $frq[$i]; // your code has print $frqq[$i];
the result will be empty, since the keys of this array are the words and the values are the number of times each word appears in the text.
You must also be careful with str_word_count: since you are reading Spanish text, and text can contain numbers, you should use this:
str_word_count($string,1,'áéíóúüñ1234567890');
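To make the point about array keys concrete, here is a small sketch of counting and listing words. The word list is a hand-written sample, not data taken from the article, and arsort() is used instead of asort() on the assumption that you want the most frequent words first.

```php
<?php
// Hand-written sample words (an assumption for illustration only).
$palabras = ['animal', 'reino', 'animal', 'célula', 'animal', 'reino'];

// Keys are the words, values are the counts: animal => 3, reino => 2, célula => 1
$frq = array_count_values($palabras);

// Sorts in place by count, highest first, keeping the keys.
// The return value is only true/false, so do not assign it to $frq.
arsort($frq);

// The array is keyed by word, so iterate with foreach rather than a numeric index.
foreach ($frq as $palabra => $veces) {
    echo "$palabra: $veces\n";
}
```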
The code i would suggest :
<?php
header('Content-Type: text/html; charset=UTF-8');
ini_set('display_errors', 1);
ini_set('log_errors', 1);
ini_set('error_log', dirname(__FILE__) . '/error_log.txt');
error_reporting(E_ALL);
$url = 'http://es.wikipedia.org/wiki/Animalia';
Keys($url);
function Keys($url) {
$listanegra = array("a", "ante", "bajo", "con", "contra", "de", "desde", "mediante", "durante", "hasta", "hacia", "para", "por", "que", "qué", "cuán", "cuan", "los", "las", "una", "unos", "unas", "donde", "dónde", "como", "cómo", "cuando", "porque", "por", "para", "según", "sin", "tras", "con", "mas", "más", "pero", "del");
$html=file_get_contents($url);
$doc = new DOMDocument('1.0', 'UTF-8');
$html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8");
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$webhtml = $doc->getElementsByTagName('p');
$webhtml = $webhtml ->item(0)->nodeValue;
$webhtml = strip_tags($webhtml);
print_r ($webhtml);
$webhtml = explode(" ", $webhtml);
// $webhtml = str_replace($listanegra, " ", $webhtml); str_replace() accepts array
foreach($listanegra as $key=> $ln) {
$webhtml = preg_replace('/\b'.$ln.'\b/u', ' ', $webhtml);
}
$palabras = str_word_count(implode(" ",$webhtml), 1, 'áéíóúüñ1234567890');
sort($palabras);
$frq = array_count_values ($palabras);
foreach($frq as $index=>$value) {
print "the word <strong>$index</strong> was used <strong>$value</strong> times";
print '<br />';
}
}
?>
It was really painful trying to figure out the special-character issues.

strip defined character from string

I'm having an odd problem on my website. I have a script that records all searches and inserts the search terms into the database. The problem is that since search engine robots started crawling my website, they make my script produce search keywords like "search keywords////////////////////////////////////////////////".
I want to strip those characters ( ////////// ) before they are stored in MySQL.
This is what I have:
$search=htmlspecialchars($_GET['load']);
$say=mysql_query("SELECT * FROM madvideo WHERE MATCH (baslik) AGAINST ('*$search*' IN BOOLEAN MODE)");
$saydim=mysql_num_rows($say);
$count = $saydim;
$page = !empty($_GET["page"]) ? intval($_GET["page"]) : 1;
$s = ($page-1)*$perpage;
$sayfasayisi=ceil($count/$perpage);
if(ayaral("Arananlar-Kaydet")=="1") {
$ekle=cevir($search);
@mysql_query("insert into tag (baslik,tr,tarih) values ('$search','$ekle',now()) "); }
The variable "$search" holds the search term, and I don't know what syntax to use to strip those nasty ///////// characters.
EDIT: the code that creates the words is this:
$vtitle = str_replace("\r\n\r\n", ' ', $vtitle);
$words = explode(' ', $vtitle);
$k = count($words);
$k3 = ceil($k/3);
$new = array();
for ($i=0; $i<$k; $i+=$k3) {
$new[] = join(' ', array_slice($words,$i, $k3));
}
$tag1 = $new[0];
$tag2 = $new[1];
$tag3 = $new[2];
Use the MySQL TRIM function to remove the characters from the end of the string, as follows:
TRIM(TRAILING '/' FROM <string>)
@mysql_query("insert into tag (baslik,tr,tarih)
values (TRIM(TRAILING '/' FROM '$search'),'$ekle',now())")
To trim from both the beginning and the end, use the following:
TRIM(BOTH '/' FROM <string>)
@mysql_query("insert into tag (baslik,tr,tarih)
values (TRIM(BOTH '/' FROM '$search'),'$ekle',now())")
To remove all occurrences of the character, use the REPLACE function as follows:
REPLACE(<string>, '/', '')
@mysql_query("insert into tag (baslik,tr,tarih)
values (REPLACE('$search', '/',''),'$ekle',now())")
Hope it helps...
If the / characters appear always at the end of the $search, you can use rtrim using the second (optional) parameter:
$search = "search keywords////////////////////////////////////////////////";
$search = rtrim($search, '/ ');
echo $search; // prints 'search keywords'
EDIT:
Real example...
<?php
if(isset($_REQUEST['str'])) {
$search = $_REQUEST['str'];
$ser_chk = strpos($search, "/");
if ($ser_chk !== false) { // strpos() returns false when "/" is not found
$search = str_replace("/", "", $search);
}
}
?>
<h1><?php print $search; ?></h1>
<form action="" method="post">
<input type="text" size="100" value="search keywords////////////////////////////////////////////////" name="str" />
<input type="submit" />
</form>
LINK TO TEST: http://simplestudio.rs/tsto.php
At least do:
$search = mysql_real_escape_string($search);
You can also check for those characters and, if found, replace them with an empty string.
$ser_chk = strpos($search, "/");
if ($ser_chk !== false) {
$search = str_replace("/", "", $search);
}
$regex = "/\//";
$string = "////search";
$string = preg_replace($regex, '', $string);
echo $string;
A regular expression is the flexible solution for repeated characters.
$output = preg_replace('/\/[\/]+/', '/', $input);
It matches any run of two or more "/" and replaces it with a single "/".
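The PHP-side suggestions above can be combined into one small helper that runs before the value reaches the query. This is a sketch, and clean_search_term() is a hypothetical name, not something from the original script. The cleaned value still has to be escaped or bound as a parameter before it goes into SQL.

```php
<?php
// Hypothetical helper name; not part of the original script.
function clean_search_term(string $search): string
{
    // Collapse any run of two or more slashes into a single slash.
    $search = preg_replace('~/{2,}~', '/', $search);

    // Drop slashes and whitespace left at either end.
    return trim($search, "/ \t\n\r");
}

echo clean_search_term('search keywords////////////'), "\n";
```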

Replace new lines from PHP to JavaScript

Situation is simple:
I post a plain HTML form with a textarea. Then, in PHP, I register a JavaScript function depending on the form's content.
This is done using the following code:
$js = sprintf("window.parent.doSomething('%s');", $this->textarea->getValue());
Works like a charm, until I try to process newlines. I need to replace the newlines with char 13 (I believe), but I can't get to a working solution. I tried the following:
$textarea = str_replace("\n", chr(13), $this->textarea->getValue());
And the following:
$js = sprintf("window.parent.doSomething('%s');", "'+String.fromCharCode(13)+'", $this->textarea->getValue());
Does anyone have a clue as to how I can process these newlines correctly?
You were almost there; you just forgot to actually replace the line breaks.
This should do the trick:
$js = sprintf("window.parent.doSomething('%s');"
, preg_replace(
'#\r?\n#'
, "' + String.fromCharCode(13) + '"
, $this->textarea->getValue()
)
);
What you meant to do was:
str_replace("\n", '\n', $this->textarea->getValue());
Replace all new line characters with the literal string '\n'.
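However, you'd do better to encode it as JSON. json_encode() adds the surrounding double quotes itself, so the format string must not wrap %s in quotes:
$js = sprintf(
"window.parent.doSomething(%s);",
json_encode($this->textarea->getValue())
);
That will fix quotes as well.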
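A standalone sketch of the JSON approach follows. The helper name build_js_call() is made up for illustration, and the JSON_HEX_* flags are an extra precaution on the assumption that the script ends up inline in an HTML page. doSomething is the function from the question.

```php
<?php
// Hypothetical helper; doSomething is the JavaScript function from the question.
function build_js_call(string $text): string
{
    // Escape <, >, &, ' and " as \uXXXX so the literal is safe inside inline <script>.
    $flags = JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT;

    // json_encode() emits a complete, double-quoted JavaScript string literal,
    // with newlines already escaped as \n and \r.
    return sprintf('window.parent.doSomething(%s);', json_encode($text, $flags));
}

echo build_js_call("first line\r\nsecond 'quoted' line"), "\n";
```

Because the newlines arrive in JavaScript as real \r and \n characters, no String.fromCharCode() concatenation is needed.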
Your problem has already been solved elsewhere in our codebase...
Taken from our WebApplication.php file:
/**
* Log a message to the javascript console
*
* @param $msg
*/
public function logToConsole($msg)
{
if (defined('CONSOLE_LOGGING_ENABLED') && CONSOLE_LOGGING_ENABLED)
{
static $last = null;
static $first = null;
static $inGroup = false;
static $count = 0;
$decimals = 5;
if ($first == null)
{
$first = microtime(true);
$timeSinceFirst = str_repeat(' ', $decimals) . ' 0';
}
$timeSinceFirst = !isset($timeSinceFirst)
? number_format(microtime(true) - $first, $decimals, '.', ' ')
: $timeSinceFirst;
$timeSinceLast = $last === null
? str_repeat(' ', $decimals) . ' 0'
: number_format(microtime(true) - $last, $decimals, '.', ' ');
$args = func_get_args();
if (count($args) > 1)
{
$msg = call_user_func_array('sprintf', $args);
}
$this->registerStartupScript(
sprintf("console.log('%s');",
sprintf('[%s][%s] ', $timeSinceFirst, $timeSinceLast) .
str_replace("\n", "'+String.fromCharCode(13)+'", addslashes($msg))));
$last = microtime(true);
}
}
The bit you are interested in is:
str_replace("\n", "'+String.fromCharCode(13)+'", addslashes($msg))
Note that in your question's sprintf, you forgot the str_replace...
Use:
str_replace(array("\r\n", "\n", "\r"), chr(13), $this->textarea->getValue());
This should replace all new lines in the string with chr(13).
