How to search in XML? (PHP)

I am working on a word application. I'm trying to get values from XML. My goal is to get the entries for the first and last letter of a word. Could you help me, please?
<?xml version='1.0'?>
<Letters>
<Letter category='A'>
<FirstLetter>
<Property>First letter is A.</Property>
</FirstLetter>
<LastLetter>
<Property>Last letter is A.</Property>
</LastLetter>
</Letter>
<Letter category='B'>
<FirstLetter>
<Property>First letter is B.</Property>
</FirstLetter>
<LastLetter>
<Property>Last letter is B.</Property>
</LastLetter>
</Letter>
<Letter category='E'>
<FirstLetter>
<Property>First letter is E.</Property>
</FirstLetter>
<LastLetter>
<Property>Last letter is E.</Property>
</LastLetter>
</Letter>
</Letters>
PHP code:
<?php
$word = "APPLE";
$alphabet = "ABCÇDEFGĞHIİJKLMNOÖPQRSŞTUÜVWXYZ";
$index = strpos($alphabet, $word);
$string = $xml->xpath("//Letters/Letter[contains(text(), " . $alfabe[$rakam] . ")]/FirstLetter");
echo "<pre>" . print_r($string, true) . "</pre>";

The letter is in an attribute named 'category'.
$word = "APPLE";
// bootstrap DOM
$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXpath($document);
// get first and last letter
$firstLetter = substr($word, 0, 1);
$lastLetter = substr($word, -1);
// fetch text from property elements
var_dump(
$xpath->evaluate(
"string(/Letters/Letter[@category = '$firstLetter']/FirstLetter/Property)"
),
$xpath->evaluate(
"string(/Letters/Letter[@category = '$lastLetter']/LastLetter/Property)"
)
);
Or in SimpleXML
$word = "APPLE";
$letters = new SimpleXMLElement($xml);
$firstLetter = substr($word, 0, 1);
$lastLetter = substr($word, -1);
// SimpleXML does not allow for xpath expression with type casts
// So validation and cast has to be done in PHP
var_dump(
(string)($letters->xpath(
"/Letters/Letter[@category = '$firstLetter']/FirstLetter/Property"
)[0] ?? ''),
(string)($letters->xpath(
"/Letters/Letter[@category = '$lastLetter']/LastLetter/Property"
)[0] ?? '')
);
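Note that the alphabet in the question contains multibyte letters such as Ç and İ. substr() works on bytes, so for a UTF-8 word that starts or ends with such a letter, mb_substr() is the safer choice. The following is a minimal sketch, assuming the mbstring extension is available; the XML is a shortened copy of the one above, and the 'Ç' entry and the sample word are assumptions added for illustration.

```php
<?php
// Hypothetical sample data: a shortened version of the XML from the question,
// plus an assumed entry for the multibyte letter 'Ç'.
$xml = <<<XML
<?xml version='1.0' encoding='UTF-8'?>
<Letters>
  <Letter category='A'>
    <FirstLetter><Property>First letter is A.</Property></FirstLetter>
    <LastLetter><Property>Last letter is A.</Property></LastLetter>
  </Letter>
  <Letter category='Ç'>
    <FirstLetter><Property>First letter is Ç.</Property></FirstLetter>
    <LastLetter><Property>Last letter is Ç.</Property></LastLetter>
  </Letter>
</Letters>
XML;

$word = "ÇITA"; // assumed sample word starting with a multibyte letter

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// mb_substr() counts characters, not bytes, so 'Ç' stays intact
$firstLetter = mb_substr($word, 0, 1, 'UTF-8');
$lastLetter  = mb_substr($word, -1, 1, 'UTF-8');

// string() makes evaluate() return the text directly (empty string if no match)
$first = $xpath->evaluate("string(/Letters/Letter[@category = '$firstLetter']/FirstLetter/Property)");
$last  = $xpath->evaluate("string(/Letters/Letter[@category = '$lastLetter']/LastLetter/Property)");

echo $first, "\n", $last, "\n";
```

Because the letter is interpolated straight into the XPath expression, this only suits input that cannot contain a quote character.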

Related

substr and mb_substr return nothing

I do not know what is wrong in the code below:
<?php
$html = file_get_contents('https://www.ibar.az/en/');
$doc = new domDocument();
$doc->loadHTML($html);
$doc->preserveWhiteSpace = false;
$ExchangePart = $doc->getElementsByTagName('li');
/*for ($i=0; $i<=$ExchangePart->length; $i++) {
echo $i . $ExchangePart->Item($i)->nodeValue . "<br>";
}*/
$C=$ExchangePart->Item(91)->nodeValue;
var_dump ($C);
$fff=mb_substr($C, 6, 2, 'UTF-8');
echo $fff;
?>
I have tried both substr and mb_substr but in both cases echo $fff; returns nothing.
Could anybody suggest what I am doing wrong?
This is the item 91 node:
<ul>
<li>USD</li>
<li>1.5072</li>
<li>1.462</li>
<li>1.5494</li>
<li class="down"> </li>
</ul>
This is node value:
¶
····························USD¶
································1.5072¶
································1.462¶
································1.5494¶
································•¶
····························
( · = space; • = nbsp )
substr( $C, 6, 2 ) is a string of two spaces.
To retrieve all values correctly, loop over the nested li elements:
foreach( $ExchangePart->item(91)->getElementsByTagName('li') as $node )
{
if( trim($node->nodeValue) ) echo $node->nodeValue . '<br>';
}
Otherwise, you can replace all node value spaces:
$C = str_replace( ' ', '', $C );
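Another option is to split the node value on whitespace, which yields every value at once. The following is only a sketch: the HTML fragment is a stand-in for the real page, and the non-breaking space is listed explicitly in the pattern because trim() does not treat it as whitespace.

```php
<?php
// Stand-in for the fetched page; the real markup may differ.
$html = '<ul><li><ul>
    <li>USD</li>
    <li>1.5072</li>
    <li>1.462</li>
    <li>1.5494</li>
    <li class="down">&nbsp;</li>
</ul></li></ul>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // real-world HTML often triggers parser warnings
$doc->loadHTML($html);

// The first <li> in document order is the outer item wrapping the currency row
$C = $doc->getElementsByTagName('li')->item(0)->nodeValue;

// Split on runs of whitespace or non-breaking spaces and drop empty pieces
$values = preg_split('/[\s\x{00A0}]+/u', $C, -1, PREG_SPLIT_NO_EMPTY);

print_r($values);

$currency = $values[0];         // currency code
$rate     = (float) $values[1]; // first rate as a number
```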

find an element in html and explode it for stock

I want to retrieve an HTML element in a page.
<h2 id="resultCount" class="resultCount">
<span>
Showing 1 - 12 of 40,923 Results
</span>
</h2>
I need the total number of results so that I can test it in my PHP code.
For now, I take everything between the h2 tags and explode it on spaces.
Then I explode the count on the comma and concatenate the pieces, so that the number no longer contains a thousands separator. Once that is done, I test the result count.
define("MAX_RESULT_ALL_PAGES", 1200);
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
$htmlResultCountPage = file_get_html($queryUrl);
$htmlResultCount = $htmlResultCountPage->find("h2[id=resultCount]");
$resultCountArray = explode(" ", $htmlResultCount[0]);
$explodeCount = explode(',', $resultCountArray[5]);
$europeFormatCount = '';
foreach ($explodeCount as $val) {
$europeFormatCount .= $val;
}
if ($europeFormatCount > MAX_RESULT_ALL_PAGES) {
$queryUrl = AMAZON_SEARCH_URL.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
}
At the moment the total number of results is not extracted correctly, and the condition is never met even when it should be.
Does anyone have a solution to this problem, or another way to do it?
I would simply fetch the page as a string (not html) and use a regular expression to get the total number of results. The code would look something like this:
define('MAX_RESULT_ALL_PAGES', 1200);
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT . $searchMonthUrlParam . $searchYearUrlParam . $searchTypeUrlParam . urlencode($keyword) . '&page=' . $pageNum;
$queryResult = file_get_contents($queryUrl);
if (preg_match('/of\s+([0-9,]+)\s+Results/', $queryResult, $matches)) {
$totalResults = (int) str_replace(',', '', $matches[1]);
} else {
throw new \RuntimeException('Total number of results not found');
}
if ($totalResults > MAX_RESULT_ALL_PAGES) {
$queryUrl = AMAZON_SEARCH_URL . $searchMonthUrlParam . $searchYearUrlParam . $searchTypeUrlParam . urlencode($keyword) . '&page=' . $pageNum;
// ...
}
A regex would do it:
...
preg_match("/of ([0-9,]+) Results/", $htmlResultCount[0], $matches);
$europeFormatCount = intval(str_replace(",", "", $matches[1]));
...
Please try this code.
define("MAX_RESULT_ALL_PAGES", 1200);
// new dom object
$dom = new DOMDocument();
// HTML string
$queryUrl = AMAZON_TOTAL_BOOKS_COUNT.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
$html_string = file_get_contents($queryUrl);
// discard insignificant white space (set before loading)
$dom->preserveWhiteSpace = false;
// load the HTML
$dom->loadHTML($html_string);
//Get all h2 tags
$nodes = $dom->getElementsByTagName('h2');
// Store total result count
$totalCount = 0;
// loop over the all h2 tags and print result
foreach ($nodes as $node) {
if ($node->hasAttributes()) {
foreach ($node->attributes as $attribute) {
if ($attribute->name === 'class' && $attribute->value == 'resultCount') {
$inner_html = str_replace(',', '', trim($node->nodeValue));
$inner_html_array = explode(' ', $inner_html);
// Add the count (the sixth space-separated token) to the total
$totalCount += (int) $inner_html_array[5];
}
}
}
}
// If the result count is greater than 1200, do this
if ($totalCount > MAX_RESULT_ALL_PAGES) {
$queryUrl = AMAZON_SEARCH_URL.$searchMonthUrlParam.$searchYearUrlParam.$searchTypeUrlParam.urlencode($keyword)."&page=".$pageNum;
}
Give this a try:
$match =array();
preg_match('/(?<=of\s)(?:\d{1,3}+(?:,\d{3})*)(?=\sResults)/', $htmlResultCount[0], $match);
$europeFormatCount = str_replace(',','',$match[0]);
The regex reads the number between "of " and " Results"; it matches numbers with a ',' separator.
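To try the regex approaches above without hitting the live site, the pattern can be exercised against the snippet from the question. This is a minimal sketch; `extractTotalResults` is a hypothetical helper name, not part of any library, and the wording "of N Results" is assumed to stay as shown in the question.

```php
<?php
/**
 * Hypothetical helper: returns the total from "... of 40,923 Results",
 * or null when the text does not contain such a phrase.
 */
function extractTotalResults(string $html): ?int
{
    if (preg_match('/of\s+([0-9,]+)\s+Results/', $html, $matches)) {
        // Strip the thousands separators before casting to int
        return (int) str_replace(',', '', $matches[1]);
    }
    return null;
}

// The markup from the question, used here as test input
$snippet = '<h2 id="resultCount" class="resultCount">
<span>
Showing 1 - 12 of 40,923 Results
</span>
</h2>';

$total = extractTotalResults($snippet);
var_dump($total);
var_dump($total !== null && $total > 1200);
```

Returning null instead of 0 keeps "no count found" distinguishable from "zero results", so the caller can decide how to handle a changed page layout.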

Trying to extract keywords from a website PHP (OOP)

I still have the keywords problem, but this is the code I am writing.
It is poor code, but it is my own:
<?php
$url = 'http://es.wikipedia.org/wiki/Animalia';
Keys($url);
function Keys($url) {
$listanegra = array("a", "ante", "bajo", "con", "contra", "de", "desde", "mediante", "durante", "hasta", "hacia", "para", "por", "que", "qué", "cuán", "cuan", "los", "las", "una", "unos", "unas", "donde", "dónde", "como", "cómo", "cuando", "porque", "por", "para", "según", "sin", "tras", "con", "mas", "más", "pero", "del");
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($url);
$webhtml = $doc->getElementsByTagName('p');
$webhtml = $webhtml ->item(0)->nodeValue;
$webhtml = strip_tags($webhtml);
$webhtml = explode(" ", $webhtml);
foreach($listanegra as $key=> $ln) {
$webhtml = str_replace($ln, " ", $webhtml);
}
$palabras = str_word_count ("$webhtml", 1 );
$frq = array_count_values ($palabras);
$frq = asort($frq);
$ffrq = count($frq);
$i=1;
while ($i < $ffrq) {
print $frqq[$i];
print '<br />';
$i++;
}
}
?>
The code tries to extract keywords from a website. It takes the first paragraph of a page and deletes the words listed in the variable "$listanegra". Next, it counts the repeated words and saves all the words in an array. Then I loop over the array to show the words.
The problem is that the code does not work =(.
When I run it, it shows a blank page.
Could you help me finish my code? I was advised to use "tf-idf", but I will look at that later.
I do believe this is what you were trying to do:
$url = 'http://es.wikipedia.org/wiki/Animalia';
$words = Keys($url);
/// do your database stuff with $words
function Keys($url)
{
$listanegra = array('a', 'ante', 'bajo', 'con', 'contra', 'de', 'desde', 'mediante', 'durante', 'hasta', 'hacia', 'para', 'por', 'que', 'qué', 'cuán', 'cuan', 'los', 'las', 'una', 'unos', 'unas', 'donde', 'dónde', 'como', 'cómo', 'cuando', 'porque', 'por', 'para', 'según', 'sin', 'tras', 'con', 'mas', 'más', 'pero', 'del');
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTMLFile($url);
$webhtml = $doc->getElementsByTagName('p');
$webhtml = $webhtml->item(0)->nodeValue;
$webhtml = strip_tags($webhtml);
$webhtml = explode(' ', $webhtml);
$palabras = array();
foreach($webhtml as $word)
{
$word = strtolower(trim($word, ' .,!?()')); // remove trailing special chars and spaces
if (!in_array($word, $listanegra))
{
$palabras[] = $word;
}
}
$frq = array_count_values($palabras);
asort($frq);
return implode(' ', array_keys($frq));
}
Your server should display errors while you are testing.
Add this at the top of the script:
ini_set('display_errors', 1);
ini_set('log_errors', 1);
ini_set('error_log', dirname(__FILE__) . '/error_log.txt');
error_reporting(E_ALL);
That way you will see the error:
Array to string conversion on line 24 (line 19 if you don't put the 5 new lines)
Here are the errors I found. Four functions are not used as they should be: str_replace, str_word_count, asort, and array_count_values.
Using str_replace is a little tricky. Trying to find and remove "a" removes every "a" in the text, even inside "animal": str_replace("a", "", "animal") gives "niml".
asort() sorts the array in place and returns true or false, so assigning its result overwrites $frq. Just call:
asort($frq);
This sorts the array by value, in ascending order, and keeps the keys. $frq holds the result of array_count_values: $frq = array($word1 => $word1_count, ...)
The value here is the number of times the word is used, so when you later have:
print $frq[$i]; // your code has print $frqq[$i];
the result will be empty, since the keys of this array are the words and the values are the number of times each word appears in the text.
Also, be careful with str_word_count: since you are reading Spanish text, and the text can contain numbers, you should use this:
str_word_count($string,1,'áéíóúüñ1234567890');
The code I would suggest:
<?php
header('Content-Type: text/html; charset=UTF-8');
ini_set('display_errors', 1);
ini_set('log_errors', 1);
ini_set('error_log', dirname(__FILE__) . '/error_log.txt');
error_reporting(E_ALL);
$url = 'http://es.wikipedia.org/wiki/Animalia';
Keys($url);
function Keys($url) {
$listanegra = array("a", "ante", "bajo", "con", "contra", "de", "desde", "mediante", "durante", "hasta", "hacia", "para", "por", "que", "qué", "cuán", "cuan", "los", "las", "una", "unos", "unas", "donde", "dónde", "como", "cómo", "cuando", "porque", "por", "para", "según", "sin", "tras", "con", "mas", "más", "pero", "del");
$html=file_get_contents($url);
$doc = new DOMDocument('1.0', 'UTF-8');
$html = mb_convert_encoding($html, "HTML-ENTITIES", "UTF-8");
libxml_use_internal_errors(true);
$doc->loadHTML($html);
$webhtml = $doc->getElementsByTagName('p');
$webhtml = $webhtml ->item(0)->nodeValue;
$webhtml = strip_tags($webhtml);
print_r ($webhtml);
$webhtml = explode(" ", $webhtml);
// $webhtml = str_replace($listanegra, " ", $webhtml); str_replace() accepts array
foreach($listanegra as $key=> $ln) {
$webhtml = preg_replace('/\b'.$ln.'\b/u', ' ', $webhtml);
}
$palabras = str_word_count(implode(" ",$webhtml), 1, 'áéíóúüñ1234567890');
sort($palabras);
$frq = array_count_values ($palabras);
foreach($frq as $index=>$value) {
print "the word <strong>$index</strong> was used <strong>$value</strong> times";
print '<br />';
}
}
?>
Figuring out the special-character issues was really painful.
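The counting step can also be tested on its own, without fetching a page. The sketch below assumes the mbstring extension is available and uses a shortened stop-word list and a made-up sample sentence; `countKeywords` is a hypothetical helper name. It splits on anything that is not a Unicode letter or digit, so accented words stay whole.

```php
<?php
/**
 * Hypothetical helper: counts the words in $text, skipping $stopWords,
 * and returns word => count sorted by descending frequency.
 */
function countKeywords(string $text, array $stopWords): array
{
    $text = mb_strtolower($text, 'UTF-8');
    // Split on runs of characters that are neither letters nor digits
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    // Drop stop words by whole-word comparison, so "a" does not touch "animal"
    $words = array_values(array_diff($words, $stopWords));
    $frq = array_count_values($words);
    arsort($frq); // sort by count, highest first, keeping the word keys
    return $frq;
}

// Assumed sample data for illustration only
$stopWords = ['a', 'de', 'los', 'las', 'que', 'del', 'con'];
$sample = 'Los animales son seres vivos. Los animales que vuelan, como las aves, son animales.';

foreach (countKeywords($sample, $stopWords) as $word => $count) {
    echo "$word: $count\n";
}
```

Comparing whole tokens against the stop list avoids both the substring problem of str_replace and the need to build a regex per stop word.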

get attribute values with php dom

I am trying to get some attribute values, but with no luck. Below you can see my code and its output. How can I get the duration, file, etc. values?
$url="http://www.some-url.ltd";
$dom = new DOMDocument;
@$dom->loadHTMLFile($url);
$xpath = new DOMXPath($dom);
$the_div = $xpath->query('//div[@id="the_id"]');
foreach ($the_div as $rval) {
$the_value = trim($rval->getAttribute('title'));
echo $the_value;
}
The output below:
{title:'title',
description:'description',
scale:'fit',keywords:'',
file:'http://xxx.ccc.net/ht/2012/05/10/419EE45F98CD63F88F52CE6260B9E85E_c.mp4',
type:'flv',
duration:'24',
screenshot:'http://xxx.ccc.net/video/2012/05/10/419EE45F98CD63F88F52CE6260B9E85E.jpg?v=1336662169',
suggestion_path:'/videoxml/player_xml/61319',
showSuggestions:true,
autoStart:true,
width:412,
height:340,
autoscreenshot:true,
showEmbedCode:true,
category: 1,
showLogo:true
}
How can I get the duration, file, etc. values?
What about
$parsed = json_decode($the_value, true);
$duration = $parsed['duration'];
EDIT:
Since json_decode() requires proper JSON formatting (key names and values must be enclosed in double quotes), we should fix original formatting into the correct one. So here is the code:
function my_json_decode($s, $associative = false) {
$s = str_replace(array('"', "'", 'http://'), array('\"', '"', 'http//'), $s);
$s = preg_replace('/(\w+):/i', '"\1":', $s);
$s = str_replace('http//', 'http://', $s);
return json_decode($s, $associative);
}
$parsed = my_json_decode($the_value, true);
Function my_json_decode is taken from this answer, slightly modified.
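As a quick check, the conversion can be run on a shortened copy of the attribute value from the question. This is a sketch only: the sample string is abridged, and the approach assumes that keys are plain words and that the only colons inside values come from `http://` URLs (an `https://` URL or a time such as `12:30` would break it).

```php
<?php
// Copy of the helper above so that the example is self-contained
function my_json_decode($s, $associative = false) {
    // Escape existing double quotes, turn single quotes into double quotes,
    // and temporarily hide the colon of "http://" from the key regex
    $s = str_replace(array('"', "'", 'http://'), array('\"', '"', 'http//'), $s);
    // Quote the bare keys: title: becomes "title":
    $s = preg_replace('/(\w+):/i', '"\1":', $s);
    // Restore the URLs
    $s = str_replace('http//', 'http://', $s);
    return json_decode($s, $associative);
}

// Abridged sample of the title attribute shown in the question
$the_value = "{title:'title', file:'http://xxx.ccc.net/ht/2012/05/10/419EE45F98CD63F88F52CE6260B9E85E_c.mp4', duration:'24', width:412, autoStart:true}";

$parsed = my_json_decode($the_value, true);

if ($parsed === null) {
    // json_decode() returns null when the string is still not valid JSON
    echo 'Could not parse: ' . json_last_error_msg() . "\n";
} else {
    echo $parsed['duration'], "\n";
    echo $parsed['file'], "\n";
}
```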

highlight the word in the string, if it contains the keyword

How do I write a script that highlights the whole word if it contains the keyword? Example: keyword "fun", string "the bird is funny", result "the bird is funny" with the whole word "funny" in bold. I do the following:
$str = "my bird is funny";
$keyword = "fun";
$str = preg_replace("/($keyword)/i","<b>$1</b>",$str);
but it highlights only the keyword: my bird is <b>fun</b>ny
Try this:
preg_replace("/\w*?$keyword\w*/i", "<b>$0</b>", $str)
\w*? matches any word characters before the keyword (as few as possible) and \w* matches any word characters after the keyword.
And I recommend you to use preg_quote to escape the keyword:
preg_replace("/\w*?".preg_quote($keyword)."\w*/i", "<b>$0</b>", $str)
For Unicode support, use the u flag and \p{L} instead of \w:
preg_replace("/\p{L}*?".preg_quote($keyword)."\p{L}*/ui", "<b>$0</b>", $str)
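Put together as a small function, this could look like the sketch below. `highlightWords` is a hypothetical name chosen for illustration; unlike the one-liners above, it passes the delimiter to preg_quote() so that a "/" in the keyword is escaped too.

```php
<?php
/**
 * Hypothetical helper: wraps every word that contains $keyword in <b> tags.
 * Assumes $keyword is not empty and $str is valid UTF-8.
 */
function highlightWords(string $str, string $keyword): string
{
    // Escape regex metacharacters, including the "/" delimiter
    $quoted = preg_quote($keyword, '/');
    // \p{L} matches any Unicode letter; "u" enables UTF-8, "i" ignores case
    return preg_replace('/\p{L}*?' . $quoted . '\p{L}*/ui', '<b>$0</b>', $str);
}

echo highlightWords('my bird is funny', 'fun'), "\n";
echo highlightWords('Its fun to be funny and unfunny', 'fun'), "\n";
```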
You can do the following:
$str = preg_replace("/\b([a-z]*{$keyword}[a-z]*)\b/i","<b>$1</b>",$str);
Example:
$str = "Its fun to be funny and unfunny";
$keyword = 'fun';
$str = preg_replace("/\b([a-z]*{$keyword}[a-z]*)\b/i","<b>$1</b>",$str);
echo "$str"; // prints 'Its <b>fun</b> to be <b>funny</b> and <b>unfunny</b>'
<?php
$str = "my bird is funny";
$keyword = "fun";
$look = explode(' ',$str);
foreach($look as $find){
if(strpos($find, $keyword) !== false) {
if(!isset($highlight)){
$highlight[] = $find;
} else {
if(!in_array($find,$highlight)){
$highlight[] = $find;
}
}
}
}
if(isset($highlight)){
foreach($highlight as $replace){
$str = str_replace($replace,'<b>'.$replace.'</b>',$str);
}
}
echo $str;
?>
Here I have added a multi-keyword search in a string, for your reference:
$keyword = ".in#.com#dot.com#1#2#3#4#5#6#7#8#9#one#two#three#four#five#Six#seven#eight#nine#ten#dot.in#dot in#";
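// Split first, then quote each term: preg_quote() escapes "#" as of PHP 7.3, and empty terms must be dropped
$keyword = implode('|', array_map('preg_quote', array_filter(explode('#', $keyword), 'strlen')));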
$str = "PHP is dot .com the amazon.in 123455454546 dot in scripting language of choice.";
$str = preg_replace("/($keyword)/i","<b>$0</b>",$str);
echo $str;
Basically, since this is HTML, what you have to do is iterate over text nodes and split those containing the search string into up to three nodes (before match, after match and the highlighted match). If the "after match" node exists, it must be processed too. Here is a PHP7 example using the PHP DOM extension. The following function accepts a preg_quoted UTF-8 search string (or a regex-compatible expression like apple|orange). It will enclose every match in a given tag with a given class.
function highlightTextInHTML($regex_compatible_text, $html, $replacement_tag = 'span', $replacement_class = 'highlight') {
$d = new DOMDocument('1.0','utf-8');
$d->loadHTML('<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/></head>' . $html);
$xpath = new DOMXPath($d);
$process_node = function(&$node) use($regex_compatible_text, $replacement_tag, $replacement_class, &$d, &$process_node) {
$i = preg_match("~(?<before>.*?)(?<search>($regex_compatible_text)+)(?<after>.*)~ui", $node->textContent, $m);
if($i) {
$x = $d->createElement($replacement_tag);
$x->setAttribute('class', $replacement_class);
$x->textContent = $m['search'];
$parent_node = $node->parentNode;
$before = null;
$after = null;
if(!empty($m['after'])) {
$after = $d->createTextNode($m['after']);
$parent_node->replaceChild($after, $node);
$parent_node->insertBefore($x, $after);
} else {
$parent_node->replaceChild($x, $node);
}
if(!empty($m['before'])) {
$before = $d->createTextNode($m['before']);
$parent_node->insertBefore($before, $x);
}
if($after) {
$process_node($after);
}
}
};
$node_list = $xpath->query('//text()');
foreach ($node_list as $node) {
$process_node($node);
}
return preg_replace('~(^.*<body>)|(</body>.*$)~mis', '', $d->saveHTML());
}
Search and highlight the word in your string, text, body and paragraph:
<?php
$body_text = 'This is simple code to highlight the word in a given body or text'; // the text to search in
$search_word = 'this'; // the string to search for
echo do_Highlight($body_text, $search_word); // display the highlighted result

// Highlight the first occurrence of $search_word in $body_text (case-insensitive)
function do_Highlight($body_text, $search_word) {
    $pos = stripos($body_text, $search_word); // position of the first occurrence, or false
    if ($pos === false) {
        return $body_text; // nothing to highlight
    }
    $lword = strlen($search_word); // length of the search string
    $before = substr($body_text, 0, $pos); // text before the match
    $match = substr($body_text, $pos, $lword); // the matched text, in its original case
    $after = substr($body_text, $pos + $lword); // the rest of the text
    return $before . "<span style='color:#FF0000; background-color:white;'>" . $match . "</span>" . $after;
}
?>
