PHP find keywords in XML feed - preg_match_all - php

I'm loading an RSS feed as XML and then attempting to search it for keywords. When a keyword is found, I need the script to email me a notification with the keywords and the content they appeared in.
Here is what I have:
$keywords = "iphone|ipad|ipod";
$rss = simplexml_load_file("http://myfeed.com/news?format=xml");
foreach ($rss->entry as $i)
{
    //need to check the title tag for the keywords
    //email myself a notification with the keywords found
}
UPDATE
OK, I have the following, which partly works:
function getPartsWithRegex($text, $regex){
    preg_match_all($regex, $text, $array);
    return $array[0];
}
$keywords = "LLC|socks";
$rss = simplexml_load_file("http://feeds.myfeed.com/iphoneapps-news?format=xml");
$text = "Downcast - Jamawkinaw Enterprises LLC";
$regex = "/$keywords/i";
foreach ($rss->entry as $i)
{
    $title = (string)$i->title;
    $productNumbers = getPartsWithRegex($text, $regex);
}
if(count($productNumbers) > 0){
    foreach($productNumbers as $productNumber){
        echo $productNumber . '<br />';
    }
}
else {
    echo "Nothing found.";
}
This works when I call $productNumbers = getPartsWithRegex($text, $regex); However, when I pass $title, which holds the value of $i->title, it doesn't work, even though echoing $title to the browser renders: Downcast - Jamawkinaw Enterprises LLC
So to clarify: when I manually enter and pass a string, it works correctly. When I pass the value read from the XML, it doesn't. I've tried converting it to a string, but that doesn't help.
Any ideas why?
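One possible cause, judging only from the code posted above, is that $productNumbers is overwritten on every pass through the loop, so only the matches for the last entry survive until the check that follows it. A minimal sketch of collecting the matches per title inside the loop is below. The inline feed, its titles and the function name findKeywordTitles are made up for illustration, and the email step is left as a comment.

```php
<?php
// Hypothetical inline feed standing in for the real one (assumption: entries live under <entry>).
$xml = <<<XML
<feed>
  <entry><title>Downcast - Jamawkinaw Enterprises LLC</title></entry>
  <entry><title>Weather Pro - Some Studio</title></entry>
  <entry><title>Sock Finder - Warm Socks Inc</title></entry>
</feed>
XML;

// Returns an array of title => matched keywords for every entry whose title matches.
// $keywords is an alternation such as "LLC|socks", so it is deliberately not passed through preg_quote().
function findKeywordTitles(SimpleXMLElement $feed, $keywords) {
    $regex = '/' . $keywords . '/i';
    $hits = array();
    foreach ($feed->entry as $entry) {
        $title = (string) $entry->title;
        // Keep the matches for this entry before moving on to the next one.
        if (preg_match_all($regex, $title, $matches) > 0) {
            $hits[$title] = $matches[0];
        }
    }
    return $hits;
}

$hits = findKeywordTitles(simplexml_load_string($xml), 'LLC|socks');
foreach ($hits as $title => $found) {
    // A notification could be sent here, for example with mail().
    echo $title . ': ' . implode(', ', $found) . "\n";
}
```

Because the results are keyed by title, a single notification listing every matching entry can be built after the loop finishes.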

Related

PHP: Test for string of text inside html tags from file_get_contents string

I need to perform a series of tests on a URL. The first test is a word count. I have that working perfectly, and the code is below:
if (isset($_GET['article_url'])) {
    $title = 'This is an example title';
    // @ suppresses the warning if the URL cannot be fetched
    $str = @file_get_contents($_GET['article_url']);
    $test1 = str_word_count(strip_tags(strtolower($str)));
    if ($test1 === FALSE) { $test1 = 0; }
    if ($test1 > 550) {
        echo '<div><i class="fa fa-check-square-o" style="color:green"></i> This article has '.$test1.' words.';
    } else {
        echo '<div><i class="fa fa-times-circle-o" style="color:red"></i> This article has '.$test1.' words. You are required to have a minimum of 500 words.';
    }
}
Next I need to get all h1 and h2 tags from $str and test whether any of them contain the text in $title, echoing yes if so and no if not. I am not really sure how to go about this.
I am looking for a pure PHP way of doing this, without installing PHP libraries or third-party functions.
Please try the code below.
if (isset($_GET['article_url'])) {
    $title = 'This is an example title';
    // @ suppresses the warning if the URL cannot be fetched
    $str = @file_get_contents($_GET['article_url']);
    $document = new DOMDocument();
    // @ suppresses warnings about malformed HTML
    @$document->loadHTML($str);
    $tags = array('h1', 'h2');
    $texts = array();
    foreach ($tags as $tag)
    {
        // Fetch all elements in the DOM with the given tag name
        $elementList = $document->getElementsByTagName($tag);
        foreach ($elementList as $element)
        {
            // Store the lowercased text of each heading
            $texts[] = strtolower($element->textContent);
        }
    }
    // Check whether the title exactly equals the text of one of the headings
    if (in_array(strtolower($title), $texts)) {
        echo "yes";
    } else {
        echo "no";
    }
}
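Note that in_array only reports a hit when a heading is exactly equal to the title. If "contain" is meant literally, a substring test may be closer to what was asked. Here is a rough sketch using stripos; the function name headingsContainTitle and the sample HTML are made up for illustration, and it works on an HTML string rather than fetching a URL.

```php
<?php
// Returns true if any <h1> or <h2> in $html contains $title (case-insensitive substring match).
function headingsContainTitle($html, $title) {
    // Collect libxml warnings internally instead of printing them for sloppy markup.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach (array('h1', 'h2') as $tag) {
        foreach ($document->getElementsByTagName($tag) as $element) {
            if (stripos($element->textContent, $title) !== false) {
                return true;
            }
        }
    }
    return false;
}

// Hypothetical page content used only to demonstrate the call.
$html = '<html><body><h1>Intro: This is an Example Title!</h1><p>Body text</p></body></html>';
echo headingsContainTitle($html, 'This is an example title') ? 'yes' : 'no';
```

The function returns as soon as one heading matches, so the remaining headings are not inspected.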

Regex not quite right

I have a site crawler that displays a list of URLs, but I cannot for the life of me get the last regex quite right.
All URLs end up listed as:
http://www.website.org/page1.html&--EFTTIUGJ4ITCyh0Frzb_LFXe_eHw
http://website.net/page2/&--EyqBLeFeCkSfmvA7p0cLrsy1Zm1g
http://foobar.website.com/page3.php&--E5WRBxuTOQikDIyBczaVXveOdRFg
The URLs can all be different; the only thing that seems constant is the & symbol.
How would I go about getting rid of the & symbol and everything to the right of it?
Here is what I have tried, which produces the results above:
function getresults($sterm) {
$html = file_get_html($sterm);
$result = "";
// find all h3 tags with class=r
foreach($html->find('h3[class="r"]') as $ef)
{
$result .= $ef->outertext . '<br>';
}
return $result;
}
function geturl($url) {
$var = $url;
$result = "";
preg_match_all ("/a[\s]+[^>]*?href[\s]?=[\s\"\/url?q=\']+".
"(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/",
$var, $matches);
$matches = $matches[1];
foreach($matches as $var)
{
$result .= $var."<br>";
}
echo preg_replace('/sa=U.*?usg=.*?AFQjCN/', "--" , $result);
}
If the URLs are ALWAYS in the same format, use explode:
<?php
$tmp = explode("&", "http://foobar.website.com/page3.php&--E5WRBxuTOQikDIyBczaVXveOdRFg");
?>
$tmp[0] should contain "http://foobar.website.com/page3.php" and
$tmp[1] should contain "--E5WRBxuTOQikDIyBczaVXveOdRFg"
A simple way to remove everything after the & character:
$result = substr($result, 0, strpos($result, '&'));
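One caveat with the substr/strpos one-liner: when the string contains no & at all, strpos returns false and substr then yields an empty string. A small sketch that leaves such URLs untouched follows; the helper name stripAfterAmpersand is made up for illustration, and it expects one URL at a time rather than the <br>-joined $result.

```php
<?php
// Returns $url cut off before the first "&", or $url unchanged when it contains no "&".
function stripAfterAmpersand($url) {
    $position = strpos($url, '&');
    return $position === false ? $url : substr($url, 0, $position);
}

// Sample URLs taken from the question.
$urls = array(
    'http://www.website.org/page1.html&--EFTTIUGJ4ITCyh0Frzb_LFXe_eHw',
    'http://website.net/page2/&--EyqBLeFeCkSfmvA7p0cLrsy1Zm1g',
);
foreach ($urls as $url) {
    echo stripAfterAmpersand($url) . "\n";
}
```

In geturl() this could be applied to each match inside the foreach loop, before the URLs are joined with <br>.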

Function to scrape page keywords, description and title?

I wrote three simple functions to scrape the title, description and keywords of a simple HTML page.
This is the first function, which scrapes the title:
function getPageTitle ($url)
{
$content = $url;
if (eregi("<title>(.*)</title>", $content, $array)) {
$title = $array[1];
return $title;
}
}
It works fine.
These are the two functions to scrape the description and keywords, and they are not working:
function getPageKeywords($url)
{
$content = $url;
if ( preg_match('/<meta[\s]+[^>]*?name[\s]?=[\s\"\']+keywords[\s\"\']+content[\s]?=[\s\"\']+(.*?)[\"\']+.*?>/i', $content, $array)) {
$keywords = $array[1];
return $keywords;
}
}
function getPageDesc($url)
{
$content = $url;
if ( preg_match('/<meta[\s]+[^>]*?name[\s]?=[\s\"\']+description[\s\"\']+content[\s]?=[\s\"\']+(.*?)[\"\']+.*?>/i', $content, $array)) {
$desc = $array[1];
return $desc;
}
}
I know there may be something wrong with the preg_match line, but I really don't know what.
I have tried many things, but nothing works.
Why not use get_meta_tags? See the PHP documentation.
<?php
// Assuming the above tags are at www.example.com
$tags = get_meta_tags('http://www.example.com/');
// Notice how the keys are all lowercase now, and
// how . was replaced by _ in the key.
echo $tags['author']; // name
echo $tags['keywords']; // php documentation
echo $tags['description']; // a php manual
echo $tags['geo_position']; // 49.33;-86.59
?>
NOTE: The parameter can be either a URL or the path to a local file.
It's better to use PHP's native DOMDocument to parse HTML than a regex. Keep in mind that these days a lot of sites no longer include the keywords and description tags, so you can't rely on them always being there. Here is how you can do it with DOMDocument:
<?php
$source = file_get_contents('http://php.net');
$dom = new DOMDocument("1.0","UTF-8");
@$dom->loadHTML($source);
$dom->preserveWhiteSpace = false;
//Get Title
$title = $dom->getElementsByTagName('title')->item(0)->nodeValue;
$description = '';
$keywords = '';
foreach($dom->getElementsByTagName('meta') as $metas) {
if($metas->getAttribute('name') =='description'){ $description = $metas->getAttribute('content'); }
if($metas->getAttribute('name') =='keywords'){ $keywords = $metas->getAttribute('content'); }
}
print_r($title);
print_r($description);
print_r($keywords);
?>

Yahoo Search API Problem

I am having a problem with the Yahoo Search API: sometimes it works and sometimes it doesn't. Why is this happening?
I am using this URL:
http://api.search.yahoo.com/WebSearchService/rss/webSearch.xml?appid=yahoosearchwebrss&query=originurlextension%3Apdf+$search&adult_ok=1&start=$start
The code is given below:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<? $search = $_GET["search"];
$replace = " "; $with = "+";
$search = str_replace($replace, $with, $search);
if ($rs =
$rss->get("http://api.search.yahoo.com/WebSearchService/rss/webSearch.xml?appid=yahoosearchwebrss&query=originurlextension%3Apdf+$search&adult_ok=1&start=$start")
)
{ }
// Go through the list powered by the search engine listed and get
// the data from each <item>
$colorCount="0";
foreach($rs['items'] as $item) { // Get the title of result
$title = $item['title']; // Get the description of the result
$description = $item['description']; // Get the link eg amazon.com
$urllink = $item['guid'];
if($colorCount%2==0) {
$color = ROW1_COLOR;
} else {
$color = ROW2_COLOR;
}
include "resulttemplate.php"; $colorCount++;
echo "\n";
}
?>
Sometimes it gives results and sometimes it doesn't. I usually get this error:
Warning: Invalid argument supplied for foreach() in /home4/thesisth/public_html/pdfsearchmachine/classes/rss.php on line 14
Can anyone help?
The error Warning: Invalid argument supplied for foreach() in /home4/thesisth/public_html/pdfsearchmachine/classes/rss.php on line 14 means the foreach construct did not receive an iterable (usually an array). In your case that would mean $rs['items'] is missing or not an array. Maybe the request failed or the search returned no results?
I would recommend adding some checks on the result of $rss->get("...") first, and also handling the case where the request fails or returns no results:
<?php
$search = isset($_GET["search"]) ? $_GET["search"] : "default search term";
$start = "something here"; // This was left out of your original code
$colorCount = "0";
$replace = " ";
$with = "+";
$search = str_replace($replace, $with, $search);
$rs = $rss->get("http://api.search.yahoo.com/WebSearchService/rss/webSearch.xml?appid=yahoosearchwebrss&query=originurlextension%3Apdf+$search&adult_ok=1&start=$start");
if (isset($rs) && isset($rs['items'])) {
foreach ($rs['items'] as $item) {
$title = $item['title']; // Get the title of the result
$description = $item['description']; // Get the description of the result
$urllink = $item['guid']; // Get the link eg amazon.com
$color = ($colorCount % 2) ? ROW2_COLOR : ROW1_COLOR;
include "resulttemplate.php";
echo "\n";
$colorCount++;
}
}
else {
echo "Could not find any results for your search '$search'";
}
Other changes:
$start was not declared before your $rss->get("...") call, so I added a placeholder for it.
I condensed the $color if/else clause into a ternary operation with fewer comparisons.
I wasn't sure what the purpose of the if ($rs = $rss->get("...")) { } was, so I removed it.
I would also recommend using require instead of include, as it causes a fatal error if resulttemplate.php doesn't exist, which in my opinion is a better way to detect bugs than PHP warnings that let execution continue. However, I don't know your whole situation, so it might not be of great use.
Hope that helps!
Cheers

Retrieving description and keywords meta tags in php

I was wondering: which is the fastest method or code to get meta tags?
I have this code with me, but using get_meta_tags function slows down the process. Any ideas?
$tags = get_meta_tags('http://www.example.com/');
echo $tags['keywords']; // keywords
echo $tags['description']; //description
The reason is that the whole page is parsed before PHP attempts to get the meta tags. It is probably best to use a regex to parse the returned HTML.
function get_meta_data($page) {
    $meta_data = array();
    // Match <meta name="..." content="..."> tags; assumes double quotes and name before content
    preg_match_all(
        '/<meta[^>]+name="([^"]*)"[^>]+content="([^"]*)"[^>]*>/i',
        $page,
        $result,
        PREG_PATTERN_ORDER);
    // $result[1] holds the name attributes, $result[2] the matching content attributes
    foreach ($result[1] as $i => $name) {
        $name = strtolower($name);
        if ($name == "keywords" || $name == "description") {
            $meta_data[$name] = $result[2][$i];
        }
    }
    return $meta_data;
}
hope that helps
