Trying to extract keywords from a website PHP (OOP) - php

Haha, I still have the keyword problem, but this is the code I'm writing.
It's poor code, but it's my own creation:
<?php
$url = 'http://es.wikipedia.org/wiki/Animalia';
Keys($url);
function Keys($url) {
$listanegra = array("a", "ante", "bajo", "con", "contra", "de", "desde", "mediante", "durante", "hasta", "hacia", "para", "por", "que", "qué", "cuán", "cuan", "los", "las", "una", "unos", "unas", "donde", "dónde", "como", "cómo", "cuando", "porque", "por", "para", "según", "sin", "tras", "con", "mas", "más", "pero", "del");
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($url);
$webhtml = $doc->getElementsByTagName('p');
$webhtml = $webhtml ->item(0)->nodeValue;
$webhtml = strip_tags($webhtml);
$webhtml = explode(" ", $webhtml);
foreach($listanegra as $key=> $ln) {
$webhtml = str_replace($ln, " ", $webhtml);
}
$palabras = str_word_count ("$webhtml", 1 );
$frq = array_count_values ($palabras);
$frq = asort($frq);
$ffrq = count($frq);
$i=1;
while ($i < $ffrq) {
print $frqq[$i];
print '<br />';
$i++;
}
}
?>
The code tries to extract keywords from a website. It extracts the first paragraph of a page and deletes the words listed in the variable "$listanegra". Next, it counts the repeated words and saves all the words in an array. Then I loop over the array, which should show me the words.
The problem is... the code doesn't work =(.
When I run the code, it shows a blank page.
Could you help me finish my code? I was advised to use "tf-idf", but I will use that later.

I do believe this is what you were trying to do:
$url = 'http://es.wikipedia.org/wiki/Animalia';
$words = Keys($url);
/// do your database stuff with $words
function Keys($url)
{
$listanegra = array('a', 'ante', 'bajo', 'con', 'contra', 'de', 'desde', 'mediante', 'durante', 'hasta', 'hacia', 'para', 'por', 'que', 'qué', 'cuán', 'cuan', 'los', 'las', 'una', 'unos', 'unas', 'donde', 'dónde', 'como', 'cómo', 'cuando', 'porque', 'por', 'para', 'según', 'sin', 'tras', 'con', 'mas', 'más', 'pero', 'del');
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($url);
$webhtml = $doc->getElementsByTagName('p');
$webhtml = $webhtml->item(0)->nodeValue;
$webhtml = strip_tags($webhtml);
$webhtml = explode(' ', $webhtml);
$palabras = array();
foreach($webhtml as $word)
{
$word = strtolower(trim($word, ' .,!?()')); // remove trailing special chars and spaces
if (!in_array($word, $listanegra))
{
$palabras[] = $word;
}
}
$frq = array_count_values($palabras);
asort($frq);
return implode(' ', array_keys($frq));
}

While you are testing, your server should display the errors.
Add this at the start of your script:
ini_set('display_errors', 1);
ini_set('log_errors', 1);
ini_set('error_log', dirname(__FILE__) . '/error_log.txt');
error_reporting(E_ALL);
That way you will see the error:
Array to string conversion on line 24 (line 19 if you don't put the 5 new lines)
Here are some errors I found. Four functions are not used as they should be: str_replace, str_word_count, asort, and array_count_values.
Using str_replace is a little tricky. Searching for "a" and removing it removes every "a" in the text, even inside "animal" (str_replace("a", "", "animal") => "niml").
This link should be useful: link
asort returns true or false, not the sorted array, so $frq = asort($frq) overwrites your array. Doing just:
asort($frq);
sorts the array in place by its values, in ascending order, and keeps the keys. $frq holds the result of array_count_values --> $frq = array($word1 => $word1_count, ...)
The value here is the number of times the word is used, so when later you have:
print $frq[$i]; // you have print $frqq[$i]; in your code
the result will be empty, since the indexes of this array are the words and the values are the number of times each word appears in the text.
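A minimal sketch of the point above, using a made-up word list instead of the Wikipedia paragraph (the sample words are my assumption). It uses arsort rather than asort so the most frequent word comes first:

```php
<?php
// Hypothetical sample data standing in for the words of the first paragraph.
$palabras = ['animal', 'reino', 'animal', 'célula', 'reino', 'animal'];

// Map each word to the number of times it occurs.
$frq = array_count_values($palabras);

// arsort() sorts $frq in place (highest count first) and returns a bool,
// so its return value must not be assigned back to $frq.
$sorted = arsort($frq);

// The keys are the words, so iterate with foreach instead of a numeric index.
foreach ($frq as $word => $count) {
    echo $word . ': ' . $count . "\n";
}
```

Because the array is keyed by word, a numeric loop such as $frq[$i] finds nothing; foreach gives you both the word and its count.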
Also, you must be really careful with str_word_count: since you are reading Spanish text, and the text can contain numbers, you should use this:
str_word_count($string,1,'áéíóúüñ1234567890');
The code I would suggest:
<?php
header('Content-Type: text/html; charset=UTF-8');
ini_set('display_errors', 1);
ini_set('log_errors', 1);
ini_set('error_log', dirname(__FILE__) . '/error_log.txt');
error_reporting(E_ALL);
$url = 'http://es.wikipedia.org/wiki/Animalia';
Keys($url);
function Keys($url) {
$listanegra = array("a", "ante", "bajo", "con", "contra", "de", "desde", "mediante", "durante", "hasta", "hacia", "para", "por", "que", "qué", "cuán", "cuan", "los", "las", "una", "unos", "unas", "donde", "dónde", "como", "cómo", "cuando", "porque", "por", "para", "según", "sin", "tras", "con", "mas", "más", "pero", "del");
$html=file_get_contents($url);
$doc = new DOMDocument('1.0', 'UTF-8');
$html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8");
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$webhtml = $doc->getElementsByTagName('p');
$webhtml = $webhtml ->item(0)->nodeValue;
$webhtml = strip_tags($webhtml);
print_r ($webhtml);
$webhtml = explode(" ", $webhtml);
// $webhtml = str_replace($listanegra, " ", $webhtml); str_replace() accepts array
foreach($listanegra as $key=> $ln) {
$webhtml = preg_replace('/\b'.$ln.'\b/u', ' ', $webhtml);
}
$palabras = str_word_count(implode(" ",$webhtml), 1, 'áéíóúüñ1234567890');
sort($palabras);
$frq = array_count_values ($palabras);
foreach($frq as $index=>$value) {
print "the word <strong>$index</strong> was used <strong>$value</strong> times";
print '<br />';
}
}
?>
It was really painful trying to figure out the special-character issues.
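As an alternative to listing the accented characters by hand, here is a rough sketch that splits on anything that is not a Unicode letter or digit and compares whole words against the blacklist. The function name extractKeywords, its parameters, and the sample sentence are my own assumptions; it needs the mbstring extension, UTF-8 input, and PHP 7.4+ for the arrow function:

```php
<?php
// Hypothetical helper: returns word => count, most frequent first.
function extractKeywords(string $text, array $stopwords, int $limit = 10): array
{
    // Lowercase with multibyte support, then split on every run of
    // characters that is neither a Unicode letter nor a digit.
    $words = preg_split(
        '/[^\p{L}\p{N}]+/u',
        mb_strtolower($text, 'UTF-8'),
        -1,
        PREG_SPLIT_NO_EMPTY
    );

    // Drop blacklisted words by comparing whole words, so removing "a"
    // never touches the "a" inside "animales".
    $words = array_filter($words, fn($w) => !in_array($w, $stopwords, true));

    $frq = array_count_values($words);
    arsort($frq); // sorts in place, highest count first

    return array_slice($frq, 0, $limit, true);
}

$texto = 'Los animales son un reino. El reino Animalia incluye a los animales.';
$keywords = extractKeywords($texto, ['los', 'son', 'un', 'el', 'a']);
print_r($keywords);
```

Comparing whole words avoids both the str_replace substring problem and the need for \b, which does not always behave as expected with accented letters.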

Related

Add space between textContent data scraped from website using PHP DOM

I am trying to add a comma and a space between some data I am scraping from a website. The data is scraped successfully, but the items are all run together, and the comma and space I am trying to add only get added after the last item. Here is the code I currently have:
$html = curl_exec($ch);
$dom = new DOMDocument();
@$dom->loadHTML($html);
$finder = new DomXPath($dom);
$class_ops = 'ipc-inline-list ';
$class_opp = 'ipc-inline ';
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']");
foreach ($node as $index => $t) {
if ($index == 3) {
$la = $t->textContent.", ";
}
}
echo $la;
Current Result
DoyleBrainDavid,
Expected Result
Doyle, Brain, David
I am using this code
$c1 = curl_init('https://stackoverflow.com/');
curl_setopt($c1, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($c1);
if (curl_error($c1))
die(curl_error($c1));
// Get the status code
$status = curl_getinfo($c1, CURLINFO_HTTP_CODE);
curl_close($c1);
preg_match_all('/<span(.*?)<\/span>/s', $html, $matches1);
foreach($matches1[0] as $k=>$v){
$enc = mb_detect_encoding($v);
$v = mb_convert_encoding($v,$enc, "UTF-8");
$match1[$k] = strip_tags ($v);
//$match1[$k] = preg_replace('/^[^A-Za-z0-9]+/', '', $match1[$k]);
}
var_dump($match1);
In your case you can replace like this
preg_match_all('/<div class="ipc-inline-list">(.*?)<\/div>/s', $html, $matches1);
This returns an array of matches.
I hope this is helpful for you.
You want each li, not the ul as one block. Try:
$node = $finder->query("//div[@class='$class_ops']//ul[@class='$class_opp']/li");
Demo: https://3v4l.org/Mvfud
If that doesn't work the actual HTML content should be added to the question.
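To show the idea end to end, here is a small self-contained sketch. The HTML string is a guess at the page structure, not the real markup, and the class values are written without trailing spaces. It collects the text of each li and joins the items with implode:

```php
<?php
// Assumed sample markup; the real page may differ.
$html = '<div class="ipc-inline-list"><ul class="ipc-inline">'
      . '<li>Doyle</li><li>Brain</li><li>David</li></ul></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // keep libxml warnings out of the output
$dom->loadHTML($html);

$finder = new DOMXPath($dom);

// Select each li separately instead of the whole ul.
$names = [];
foreach ($finder->query("//div[@class='ipc-inline-list']//ul[@class='ipc-inline']/li") as $li) {
    $names[] = trim($li->textContent);
}

// implode() puts the separator between items only, never after the last one.
$result = implode(', ', $names);
echo $result;
```

Building an array and imploding it avoids the trailing ", " that string concatenation inside the loop would leave.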

How to search in XML? (PHP)

I am working on a word application. I'm trying to get values from XML. My goal is getting the first and last letter of a word. Could you help me, please?
<?xml version='1.0'?>
<Letters>
<Letter category='A'>
<FirstLetter>
<Property>First letter is A.</Property>
</FirstLetter>
<LastLetter>
<Property>Last letter is A.</Property>
</LastLetter>
</Letter>
<Letter category='B'>
<FirstLetter>
<Property>First letter is B.</Property>
</FirstLetter>
<LastLetter>
<Property>Last letter is B.</Property>
</LastLetter>
</Letter>
<Letter category='E'>
<FirstLetter>
<Property>First letter is E.</Property>
</FirstLetter>
<LastLetter>
<Property>Last letter is E.</Property>
</LastLetter>
</Letter>
</Letters>
PHP code:
<?php
$word = "APPLE";
$alphabet = "ABCÇDEFGĞHIİJKLMNOÖPQRSŞTUÜVWXYZ";
$index = strpos($alphabet, $word);
$string = $xml->xpath("//Letters/Letter[contains(text(), " . $alfabe[$rakam] . ")]/FirstLetter");
echo "<pre>" . print_r($string, true) . "</pre>";
The letter is in an attribute named 'category'.
$word = "APPLE";
// bootstrap DOM
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXpath($document);
// get first and last letter
$firstLetter = substr($word, 0, 1);
$lastLetter = substr($word, -1);
// fetch text from property elements
var_dump(
$xpath->evaluate(
"string(/Letters/Letter[@category = '$firstLetter']/FirstLetter/Property)"
),
$xpath->evaluate(
"string(/Letters/Letter[@category = '$lastLetter']/LastLetter/Property)"
)
);
Or in SimpleXML
$word = "APPLE";
$letters = new SimpleXMLElement($xml);
$firstLetter = substr($word, 0, 1);
$lastLetter = substr($word, -1);
// SimpleXML does not allow for xpath expression with type casts
// So validation and cast has to be done in PHP
var_dump(
(string)($letters->xpath(
"/Letters/Letter[@category = '$firstLetter']/FirstLetter/Property"
)[0] ?? ''),
(string)($letters->xpath(
"/Letters/Letter[@category = '$lastLetter']/LastLetter/Property"
)[0] ?? '')
);
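One caveat, since the alphabet in the question contains letters such as Ç, Ğ and İ: substr works on bytes, so for a word that starts or ends with one of those letters it returns only part of a UTF-8 character. A sketch using mb_substr instead, assuming the mbstring extension, UTF-8 source and data, and a made-up word and XML entry:

```php
<?php
// Hypothetical XML entry for a non-ASCII letter.
$xml = '<Letters><Letter category="Ç">'
     . '<FirstLetter><Property>First letter is Ç.</Property></FirstLetter>'
     . '</Letter></Letters>';

$word = 'ÇİLEK'; // assumed example word

// mb_substr() counts characters, not bytes.
$firstLetter = mb_substr($word, 0, 1, 'UTF-8');
$lastLetter  = mb_substr($word, -1, 1, 'UTF-8');

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// Same lookup as above, now with a complete multibyte letter.
$property = $xpath->evaluate(
    "string(/Letters/Letter[@category = '$firstLetter']/FirstLetter/Property)"
);
var_dump($firstLetter, $lastLetter, $property);
```

For plain ASCII words like "APPLE" both functions give the same result, so this only matters once the input leaves A to Z.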

find a element in html and explode it for stock

I want to retrieve an HTML element in a page.
<h2 id="resultCount" class="resultCount">
<span>
Showing 1 - 12 of 40,923 Results
</span>
</h2>
I need the total number of results for a test in my PHP code.
For now, I take everything between the h2 tags and explode it on spaces.
Then I explode again on the comma and concatenate the pieces, so that I can convert the result count to European format. Once that is done, I test the result count.
define("MAX_RESULT_ALL_PAGES", 1200);
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
$htmlResultCountPage = file_get_html($queryUrl);
$htmlResultCount = $htmlResultCountPage->find("h2[id=resultCount]");
$resultCountArray = explode(" ", $htmlResultCount[0]);
$explodeCount = explode(',', $resultCountArray[5]);
$europeFormatCount = '';
foreach ($explodeCount as $val) {
$europeFormatCount .= $val;
}
if ($europeFormatCount > MAX_RESULT_ALL_PAGES) {
$queryUrl = AMAZON_SEARCH_URL.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
}
At the moment the total number of results is not retrieved correctly, and the condition is never met even when it should be.
Does anyone have a solution to this problem, or another way to do it?
I would simply fetch the page as a string (not html) and use a regular expression to get the total number of results. The code would look something like this:
define('MAX_RESULT_ALL_PAGES', 1200);
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT . $searchMonthUrlParam . $searchYearUrlParam . $searchTypeUrlParam . urlencode($keyword) . '&page=' . $pageNum;
$queryResult = file_get_contents($queryUrl);
if (preg_match('/of\s+([0-9,]+)\s+Results/', $queryResult, $matches)) {
$totalResults = (int) str_replace(',', '', $matches[1]);
} else {
throw new \RuntimeException('Total number of results not found');
}
if ($totalResults > MAX_RESULT_ALL_PAGES) {
$queryUrl = AMAZON_SEARCH_URL . $searchMonthUrlParam . $searchYearUrlParam . $searchTypeUrlParam . urlencode($keyword) . '&page=' . $pageNum;
// ...
}
A regex would do it:
...
preg_match("/of ([0-9,]+) Results/", $htmlResultCount[0], $matches);
$europeFormatCount = intval(str_replace(",", "", $matches[1]));
...
Please try this code.
define("MAX_RESULT_ALL_PAGES", 1200);
// new dom object
$dom = new DOMDocument();
// HTML string
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
$html_string = file_get_contents($queryUrl);
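// suppress warnings from malformed real-world markup
libxml_use_internal_errors(true);
// discard whitespace-only text nodes (set before loading)
$dom->preserveWhiteSpace = false;
// load the HTML
$dom->loadHTML($html_string);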
//Get all h2 tags
$nodes = $dom->getElementsByTagName('h2');
// Store total result count
$totalCount = 0;
// loop over the all h2 tags and print result
foreach ($nodes as $node) {
if ($node->hasAttributes()) {
foreach ($node->attributes as $attribute) {
if ($attribute->name === 'class' && $attribute->value == 'resultCount') {
$inner_html = str_replace(',', '', trim($node->nodeValue));
$inner_html_array = explode(' ', $inner_html);
// Add the count (the sixth space-separated token) to the total
$totalCount += (int) $inner_html_array[5];
}
}
}
}
// If the result count is greater than 1200, do this
if ($totalCount > MAX_RESULT_ALL_PAGES) {
$queryUrl = AMAZON_SEARCH_URL.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
}
Give this a try:
$match = array();
preg_match('/(?<=of\s)(?:\d{1,3}+(?:,\d{3})*)(?=\sResults)/', $htmlResultCount[0], $match);
$europeFormatCount = str_replace(',', '', $match[0]);
The regex reads the number between "of " and " Results"; it matches numbers with a ',' separator.
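To try the regex approach without hitting the live page, here is a small sketch run against the h2 snippet from the question. The helper name parseResultCount is my own; it uses \s+ so that line breaks inside the markup do not matter:

```php
<?php
// Hypothetical helper: returns the total as an int, or null if not found.
function parseResultCount(string $html): ?int
{
    // Capture the digits and commas between "of" and "Results".
    if (preg_match('/of\s+([0-9,]+)\s+Results/', $html, $matches)) {
        // Remove the thousands separator before casting.
        return (int) str_replace(',', '', $matches[1]);
    }
    return null;
}

// Markup copied from the question.
$sample = "<h2 id=\"resultCount\" class=\"resultCount\">\n<span>\n"
        . "Showing 1 - 12 of 40,923 Results\n</span>\n</h2>";

$total = parseResultCount($sample);
var_dump($total);
```

Casting to int matters for the comparison with MAX_RESULT_ALL_PAGES: the string "40,923" would not compare as the number 40923.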

Keywords erroneous, extracting content from a website. OOP

I have a problem extracting keywords from a website (a wiki article): the words that are extracted are not really keywords. They are taken from the HTML markup, not from the content of the page.
I use the following code:
include("Extkeys.php");
[...]
if (empty($keywords)){
$ekeywords = new KeyPer;
$keywords = $ekeywords->Keys($webhtml);
}
And the code of "Extkeys" is:
<?php
class Extkeys {
function Keys($webhtml) {
$webhtml = $this->clean($webhtml);
$blacklist='de,la,los,las,el,ella,nosotros,yo,tu,el,te,mi,del,ellos';
$sticklist='test';
$minlength = 3;
$count = 17;
$webhtml = preg_replace('/[\.;:|\'|\"|\`|\,|\(|\)|\-]/', ' ', $webhtml);
$webhtml = preg_replace('/¡/', '', $webhtml);
$webhtml = preg_replace('/¿/', '', $webhtml);
$keysArray = explode(" ", $webhtml);
$keysArray = array_count_values(array_map('strtolower', $keysArray));
$blackArray = explode(",", $blacklist);
foreach($blackArray as $blackWord){
if(isset($keysArray[trim($blackWord)]))
unset($keysArray[trim($blackWord)]);
}
arsort($keysArray);
$i = 1;
$keywords = "";
foreach($keysArray as $word => $instances){
if($i > $count) break;
if(strlen(trim($word)) >= $minlength && is_string($word)) {
$keywords .= $word . ", ";
$i++;
}
}
$keywords = rtrim($keywords, ", ");
return $keywords=$sticklist.''.$keywords;
}
function clean($webhtml) {
$regex = '/(([_A-Za-z0-9-]+)(\\.[_A-Za-z0-9-]+)*@([A-Za-z0-9-]+)(\\.[A-Za-z0-9-]+)*)/iex';
$desc = preg_replace($regex, '', $webhtml);
$webhtml = preg_replace( "'<script[^>]*>.*?</script>'si", '', $webhtml );
$webhtml = preg_replace( '/<a\s+.*?href="([^"]+)"[^>]*>([^<]+)<\/a>/is', '\2 (\1)', $webhtml );
$webhtml = preg_replace( '/<!--.+?-->/', '', $webhtml );
$webhtml = preg_replace( '/{.+?}/', '', $webhtml );
$webhtml = preg_replace( '/&nbsp;/', ' ', $webhtml );
$webhtml = preg_replace( '/&amp;/', ' ', $webhtml );
$webhtml = preg_replace( '/&quot;/', ' ', $webhtml );
$webhtml = strip_tags( $webhtml );
$webhtml = htmlspecialchars($webhtml);
$webhtml = str_replace(array("\r\n", "\r", "\n", "\t"), " ", $webhtml);
while (strchr($webhtml,"  ")) {
$webhtml = str_replace("  ", " ",$webhtml);
}
for ($cnt = 1;
$cnt < strlen($webhtml)-1; $cnt++) {
if (($webhtml{$cnt} == '.') || ($webhtml{$cnt} == ',')) {
if ($webhtml{$cnt+1} != ' ') {
$webhtml = substr_replace($webhtml, ' ', $cnt + 1, 0);
}
}
}
return $webhtml;
}
}
?>
This is an example of the keywords extracted:
testfalse, lang, {mw, loader, window, function, true, vector, user, gadget, mediawiki, legacy, options, usebetatoolbar, implement, resourceloader, default
Of the article:
http://en.wikipedia.org/wiki/Searchengine
The "Extkeys" code is copied from a tutorial and adapted by me to make it work.
How can I make the code extract the keywords from the content of a website, and not from its HTML?
Best regards!
Assuming I understand your question, I think simply doing the following is the solution you're looking for.
This will read the HTML from a URL (e.g. http://www.whatever.com/page.html) and use that to generate the keys, rather than requiring the HTML as a parameter.
function Keys($url) {
$webhtml = file_get_contents($url);
You want to extract the content from the page first and then search for keywords. That means finding the actual content of the page and stripping things such as sidebars, footers, etc.
Just google for HTML content extraction; there are numerous articles about this.
I did this once in Java, where there is a library called boilerpipe. I'm not sure if there's a PHP port or interface; a quick Google search didn't reveal anything, but I'm sure there are similar libraries for PHP.
The easiest way to just get rid of the HTML, without specifically searching only the page content, would be to use a regex to strip all HTML, something like s/<[^>]+>//g. However, for a search engine that's probably not the best approach, since you end up with a lot of crap that could mess up your key extraction.
EDIT: Here is an article on content extraction with PHP.
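As a very rough starting point (nowhere near what a real content-extraction library does), here is a sketch that drops script and style elements and keeps only the text of p elements. The function name extractParagraphText and the sample page are my own assumptions, and the XML prolog trick is just a common workaround to make DOMDocument read the input as UTF-8:

```php
<?php
// Hypothetical helper: returns the text of all <p> elements, one per line.
function extractParagraphText(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // real-world HTML is rarely valid
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Remove script and style elements so their code never reaches the text.
    foreach (iterator_to_array($xpath->query('//script | //style')) as $node) {
        $node->parentNode->removeChild($node);
    }

    // Keep only paragraph text; sidebars and footers built from div/ul are skipped.
    $parts = [];
    foreach ($xpath->query('//p') as $p) {
        $parts[] = trim($p->textContent);
    }

    return implode("\n", array_filter($parts, 'strlen'));
}

// Assumed sample page.
$page = '<html><head><script>var mw = {loader: true};</script></head>'
      . '<body><p>Search engines index pages.</p><div id="footer">Footer</div>'
      . '<p>They rank results.</p></body></html>';

$text = extractParagraphText($page);
echo $text;
```

The returned text could then be passed to Keys() in place of the raw HTML, which should keep words like "mw" and "loader" out of the keyword list.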

get attribute values with php dom

I am trying to get some attribute values, but with no luck. Below you can see my code and an explanation. How do I get the duration, file, etc. values?
$url="http://www.some-url.ltd";
$dom = new DOMDocument;
@$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$the_div = $xpath->query('//div[@id="the_id"]');
foreach ($the_div as $rval) {
$the_value = trim($rval->getAttribute('title'));
echo $the_value;
}
The output below:
{title:'title',
description:'description',
scale:'fit',keywords:'',
file:'http://xxx.ccc.net/ht/2012/05/10/419EE45F98CD63F88F52CE6260B9E85E_c.mp4',
type:'flv',
duration:'24',
screenshot:'http://xxx.ccc.net/video/2012/05/10/419EE45F98CD63F88F52CE6260B9E85E.jpg?v=1336662169',
suggestion_path:'/videoxml/player_xml/61319',
showSuggestions:true,
autoStart:true,
width:412,
height:340,
autoscreenshot:true,
showEmbedCode:true,
category: 1,
showLogo:true
}
How to get duration, file etc.. values?
What about
$parsed = json_decode($the_value, true);
$duration = $parsed['duration'];
EDIT:
Since json_decode() requires proper JSON formatting (key names and values must be enclosed in double quotes), we should fix original formatting into the correct one. So here is the code:
function my_json_decode($s, $associative = false) {
$s = str_replace(array('"', "'", 'http://'), array('\"', '"', 'http//'), $s);
$s = preg_replace('/(\w+):/i', '"\1":', $s);
$s = str_replace('http//', 'http://', $s);
return json_decode($s, $associative);
}
$parsed = my_json_decode($the_value, true);
Function my_json_decode is taken from this answer, slightly modified.
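If you only need a couple of single-quoted fields, a simpler (and more limited) option might be to pull them out with a regex instead of repairing the whole string into JSON. This is only a sketch: the helper name extractQuotedField is mine, the sample string is shortened from the question, and it deliberately ignores unquoted values such as width:412:

```php
<?php
// Hypothetical helper: returns the single-quoted value for $key, or null.
function extractQuotedField(string $source, string $key): ?string
{
    // Matches key:'value' with optional spaces around the colon.
    $pattern = '/\b' . preg_quote($key, '/') . "\s*:\s*'([^']*)'/";
    if (preg_match($pattern, $source, $m)) {
        return $m[1];
    }
    return null;
}

// Shortened sample of the title attribute from the question.
$the_value = "{title:'title', file:'http://xxx.ccc.net/a.mp4', duration:'24', width:412}";

$duration = extractQuotedField($the_value, 'duration');
$file     = extractQuotedField($the_value, 'file');
var_dump($duration, $file);
```

This avoids the http:// juggling, but it breaks if a value itself contains a single quote, so the my_json_decode route is the better fit when you need every field.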
