Searching for a link in a website and displaying it PHP


Hello, I'm a newbie in PHP. I am trying to make a search function in PHP that searches only inside the website, without any database.
Basically, if I search for a string such as "Health", it should display the lines
The Joys of Health
Healthy Diets
This snippet is the only thing I could find that, if properly coded, would output the lines I want:
$myPage = array("directory.php","pages.php");
$lines = file($myPage[n]);
echo $lines[n];
I haven't tried it yet, but before I do I want to ask: is there a better way to do this?
If my files have too many lines, won't it stress the server?

The file() function will return an array. You should use file_get_contents() instead, as it returns a string.
Then, use regular expressions to find specific text within a link.
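A minimal sketch of that idea, assuming the links are plain <a href="..."> tags. The function name findLinks() and the inline sample markup are made up for illustration; in real use the string would come from file_get_contents() on the rendered page. The regular expression is deliberately simple and will not cope with every kind of HTML.

```php
<?php
// Hypothetical helper: returns every link whose visible text contains $term.
function findLinks(string $html, string $term): array
{
    // Capture the href value (group 2) and the link text (group 3).
    $pattern = '#<a\b[^>]*\bhref=(["\'])(.*?)\1[^>]*>(.*?)</a>#is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $found = [];
    foreach ($matches as $m) {
        $text = trim(strip_tags($m[3]));
        // stripos() returns false when there is no match; 0 is a valid position.
        if (stripos($text, $term) !== false) {
            $found[] = ['href' => $m[2], 'text' => $text];
        }
    }
    return $found;
}

// Sample markup standing in for file_get_contents('...') on a real page.
$html = '<a href="joys.php">The Joys of Health</a>'
      . '<a href="diets.php">Healthy Diets</a>'
      . '<a href="about.php">About us</a>';

foreach (findLinks($html, 'Health') as $link) {
    echo $link['text'], ' (', $link['href'], ")\n";
}
```

Because the search is case-insensitive, "Health" also matches "Healthy Diets".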

Your goal is fine, but the method you're thinking about is not. The file() function reads a file line by line and puts each line into an array. This assumes the HTML is laid out in a human-readable, line-oriented fashion, which is not always the case. However, if you're the one providing the HTML and you make sure the structure is well defined, fine. Here is the example you gave us, completed (bear in mind it's the 'wrong' way of solving your problem, but if you want to follow that pattern, it works):
function pagesearch($pages, $string) {
    $tags = [];
    if (empty($pages) || empty($string)) {
        return $tags;
    }
    foreach ($pages as $page) {
        // file() returns false if the page cannot be read
        if ($lines = file($page)) {
            foreach ($lines as $line) {
                // compare against false: a match at position 0 is still a match
                if (mb_strpos($line, $string) !== false) {
                    $tags[$page][] = $line;
                }
            }
        }
    }
    return $tags;
}
This returns an array of every line that contains the word you are looking for, grouped by page. As I said, it's not the way you want to solve this, but it's a way.
Hope that helps

You say you do not want to use any database, but the term database is very broad and includes the file system, so in effect you want to search a database without having a database.
That makes no sense. In your case at least one database is the file system. If you can accept that you are searching a database (here, your HTML files) but do not want to use a database to store anything related to the search (e.g. an index or cached results), then what you suggest is basically how it works: a real-time, text-based, line-by-line file search.
Sure, it is very rudimentary, but given your "no database" constraint you have already found the only possible way. And yes, it will stress your server when used, because real-time search is expensive.
Otherwise Lucene/Solr is normally used for the job, but that is a database, and even a server.

Related

Create value with specific parts of a text file

Ok, I am working on a flatfile shoutbox, and I am trying to achieve a way to get the username from the flatfile and making it a variable so I can use it to make a call to the database to check if the user is admin so they can delete/ban users directly from the shoutbox.
This is an example line in the flatfile
<div><i><div class='date'>12/08/2012 18:56 pm </div></i> <div class='groupAdmin'><b>Admin</b></div><b>kira423:</b> hiya :D</div>
So I wanna take the username which is kira423 in this case and create a variable such as $shoutname and make it equal kira423
I have tried a google search and looked around on here, but was unable to find an answer, so I am hoping that I can get some insight on how to do this with a question of my own here.
Thanks,
Kira

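You should use preg_match (or preg_match_all) for tasks like this. The pattern captures the date and the name inside the <b>...:</b> pair of your sample line:
preg_match_all('|<div class=\'date\'>(?P<date>.*?)</div>.*?<b>(?P<user>[^<:]+):</b>|i', $data, $matches);
var_dump($matches);
Iterating through all array elements: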
foreach ($matches['user'] as $key => $user) {
var_dump($user);
}

I think you should just parse each line in the flatfile as HTML (only simple HTML tags are used), as described in PHP Parse HTML code (or type "php parse HTML" into Google). Then you can access the username (kira423) from an array or whatever.
PS HTML is not the best way to store messages for display. Even CSV seems better: it would be "kira423;date;some text", which is easier to read and makes each part easier to access. When displaying, use the standard decorator pattern.

Alternative to php preg_match to pull data from an external website?

I want to extract the content of a specific div in an external webpage. The div looks like this:
<dt>Win rate</dt><dd><div>50%</div></dd>
My target is the "50%". I'm actually using this php code to extract the content:
function getvalue($parameter,$content){
preg_match($parameter, $content, $match);
return $match[1];
};
$parameter = '#<dt>Score</dt><dd><div>(.*)</div></dd>#';
$content = file_get_contents('https://somewebpage.com');
Everything works fine; the problem is that this method takes too much time, especially if I have to use it several times with different $content.
I would like to know if there's a better (faster, simpler, etc.) way to accomplish the same thing. Thx!

You may use DOMDocument::loadHTML and navigate your way to the given node.
$content = file_get_contents('https://somewebpage.com');
$doc = new DOMDocument();
$doc->loadHTML($content);
Now to get to the desired node, you may use method DOMDocument::getElementsByTagName, e.g.
$dds = $doc->getElementsByTagName('dd');
foreach($dds as $dd) {
// process each <dd> element here, extract inner div and its inner html...
}
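To make the loop above concrete, here is a minimal sketch that looks up the <dd> following a given <dt>. The function name findDefinitionValue() and the inline sample markup are assumptions for illustration; it relies only on the standard DOM extension.

```php
<?php
// Hypothetical helper: returns the text of the <dd> that follows the <dt>
// whose text equals $label, or null if there is no such pair.
function findDefinitionValue(string $html, string $label): ?string
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('dt') as $dt) {
        if (trim($dt->textContent) !== $label) {
            continue;
        }
        // Skip whitespace text nodes until the next element.
        $node = $dt->nextSibling;
        while ($node !== null && $node->nodeType !== XML_ELEMENT_NODE) {
            $node = $node->nextSibling;
        }
        if ($node !== null && $node->nodeName === 'dd') {
            return trim($node->textContent);
        }
    }
    return null;
}

// Sample markup standing in for the fetched page.
$content = '<html><body><dl><dt>Win rate</dt><dd><div>50%</div></dd></dl></body></html>';
echo findDefinitionValue($content, 'Win rate'), "\n";
```

textContent flattens the inner <div>, so there is no need to walk further into the <dd>.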
Edit: I see the point @pebbl has made about DOMDocument being slower. Indeed it is; however, parsing HTML with preg_match is asking for trouble. In that case I'd also recommend looking at an event-driven SAX XML parser. It is much more lightweight, faster and less memory intensive, as it does not build a tree. You may take a look at XML_HTMLSax for such a parser.

There are basically three main things you can do to improve the speed of your code:
Off load the external page load to another time (i.e. use cron)
On a Linux-based server I would know what to suggest, but seeing as you use Windows I'm not sure what the equivalent would be. Cron on Linux allows you to fire off scripts on a schedule, in the background, without using a browser. Basically I would recommend that you create a script whose sole purpose is to fetch the website pages at a particular interval (depending on how frequently you need to update your data) and then write those webpages to files on your local system.
$listOfSites = array(
'http://www.something.com/page.htm',
'http://www.something-else.co.uk/index.php',
);
$dirToContainSites = getcwd() . '/sites';
foreach ( $listOfSites as $site ) {
$content = file_get_contents( $site );
/// i've just simply converted the URL into a filename here, there are
/// better ways of handling this, but this at least keeps things simple.
/// the following just converts any non letter or non number into an
/// underscore... so, http___www_something_com_page_htm
$file_name = preg_replace('/[^a-z0-9]/i','_', $site);
file_put_contents( $dirToContainSites . '/' . $file_name, $content );
}
Once you've created this script, you then need to set the server up to execute it as regularly as you need. Then you can modify your front-end script that displays the stats to read from local files, this would give a significant speed increase.
You can find out how to read files from a directory here:
http://uk.php.net/manual/en/function.dir.php
Or the simpler method (but prone to possible problems) is just to re-step your array of sites, convert the URLs to file names using the preg_replace above, and then check for the file's existence in the folder.
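A rough sketch of that simpler method, reusing the URL list and the preg_replace from the fetch script. The function names urlToCacheFile() and loadCachedPages() are made up for illustration; pages that have not been fetched yet are simply skipped.

```php
<?php
// Same example values as in the fetch script above.
$listOfSites = array(
    'http://www.something.com/page.htm',
    'http://www.something-else.co.uk/index.php',
);
$dirToContainSites = getcwd() . '/sites';

// Hypothetical helper: maps a URL to the file name the fetch script would have used.
function urlToCacheFile(string $dir, string $url): string
{
    return $dir . '/' . preg_replace('/[^a-z0-9]/i', '_', $url);
}

// Hypothetical helper: returns url => html for every page already stored locally.
function loadCachedPages(array $urls, string $dir): array
{
    $pages = [];
    foreach ($urls as $url) {
        $file = urlToCacheFile($dir, $url);
        // Skip pages the background script has not written yet.
        if (is_file($file) && is_readable($file)) {
            $pages[$url] = file_get_contents($file);
        }
    }
    return $pages;
}

foreach (loadCachedPages($listOfSites, $dirToContainSites) as $url => $html) {
    echo $url, ': ', strlen($html), " bytes\n";
}
```

The "possible problems" are mostly about the two scripts drifting apart: if the URL list or the naming rule changes in one place only, the front end silently finds nothing.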
Cache the result of calculating your statistics
It's quite likely this being a stats page that you'll want to visit it quite frequently (not as frequent as a public page, but still). If the same page is visited more often than the cron-based script is executed then there is no reason to do all the calculation again. So basically all you have to do to cache your output is do something similar to the following:
$cachedVersion = getcwd() . '/cached/stats.html';
/// check to see if there is a cached version of this page
if ( file_exists($cachedVersion) ) {
/// if so, load it and echo it to the browser
echo file_get_contents($cachedVersion);
}
else {
/// start output buffering so we can catch what we send to the browser
ob_start();
/// DO YOUR STATS CALCULATION HERE AND ECHO IT TO THE BROWSER LIKE NORMAL
/// end output buffering and grab the contents so we now have a string
/// of the page we've just generated
$content = ob_get_contents(); ob_end_clean();
/// write the content to the cached file for next time
file_put_contents($cachedVersion, $content);
echo $content;
}
Once you start caching things you need to be aware of when you should delete or clear your cache - otherwise if you don't your stats output will never change. With regards to this situation, the best time to clear your cache is at the point you go and fetch the external web pages again. So you should add this line to the bottom of your "cron" script.
$cachedVersion = getcwd() . '/cached/stats.html';
unlink( $cachedVersion ); /// will delete the file
There are other speed improvements you could make to the caching system (you could even record the modified times of the external webpages and load only when they have been updated) but I've tried to keep things easy to explain.
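One simple variation, which is not the per-page modified-time tracking mentioned above but a plain age check on the cache file itself, could look like this. The function name isCacheFresh() and the one-hour limit are assumptions for illustration.

```php
<?php
// Hypothetical helper: true if $file exists and was written at most
// $maxAgeSeconds ago. $now can be injected to make the check testable.
function isCacheFresh(string $file, int $maxAgeSeconds, ?int $now = null): bool
{
    if (!is_file($file)) {
        return false;
    }
    $now = $now ?? time();
    return ($now - filemtime($file)) <= $maxAgeSeconds;
}

$cachedVersion = getcwd() . '/cached/stats.html';
if (isCacheFresh($cachedVersion, 3600)) {
    // Cache is younger than an hour: serve it as is.
    echo file_get_contents($cachedVersion);
} else {
    // Regenerate the stats and rewrite the cache file here,
    // exactly as in the output-buffering example above.
}
```

With an age check like this the cache repairs itself even if the background script fails to delete the file.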
Don't use a HTML Parser for this situation
Scanning an HTML file for one particular unique value does not require a full-blown or even a lightweight HTML parser. Using RegExp incorrectly is a trap that lots of beginning programmers fall into, and it is a question that is always asked. This has led to lots of knee-jerk reactions from more experienced coders, who automatically adhere to the following logic:
if ( $askedAboutUsingRegExpForHTML ) {
$automatically->orderTheSillyPersonToUse( $HTMLParser );
} else {
$soundAdvice = $think->about( $theSituation );
print $soundAdvice;
}
HTMLParsers should be used when the target within the markup is not so unique, or your pattern to match relies on such flimsy rules that it'll break the second an extra tag or character occurs. They should be used to make your code more reliable, not if you want to speed things up. Even parsers that do not build a tree of all the elements will still be using some form of string searching or regular expression notation, so unless the library-code you are using has been compiled in an extremely optimised manner, it will not beat well coded strpos/preg_match logic.
Considering I have not seen the HTML you are hoping to parse, I could be way off, but from what I've seen of your snippet it should be quite easy to find the value using a combination of strpos and preg_match. Obviously, if your HTML is more complex and might have multiple random occurrences of <dt>Win rate</dt><dd><div>50%</div></dd>, it will cause problems - but even so - an HTML parser would still have the same problem.
$offset = 0;
/// loop through the occurrences of 'Win rate'
while ( ($p = stripos($html, 'win rate', $offset)) !== FALSE ) {
    /// grab out a snippet of the surrounding HTML to speed up the RegExp
    /// (the third argument of substr() is a length, not an end position)
    $snippet = substr($html, $p, 50);
    /// I've extended your RegExp to try and account for 'white space' that could
    /// occur around the elements. The following won't take into account any random
    /// attributes that may appear, so if you find some pages aren't working - echo
    /// out the $snippet var using something like "echo '<xmp>'.$snippet.'</xmp>';"
    /// and that should show you what is appearing that is breaking the RegExp.
    if ( preg_match('#^win\s+rate\s*</dt>\s*<dd>\s*<div>\s*([0-9]+%)\s*<#i', $snippet, $regs) ) {
        /// once you are here your % value will be in $regs[1];
        break; /// exit the while loop as we have found our 'Win rate'
    }
    /// move the offset past this occurrence, otherwise stripos() would
    /// find the same position again and the loop would never end
    $offset = $p + 1;
}
Gotchas to be aware of
If you are new to PHP, as you state in a comment above, then the above may seem rather complicated - which it is. What you are trying to do is quite complex, especially if you want to do it optimally and fast. However, if you follow through the code I've given and research any bits that you aren't sure of / haven't heard of (php.net is your friend), it should give you a better understanding of a good way to achieve what you are doing.
Guessing ahead however, here are some of the problems you might face with the above:
File Permission errors - in order to read and write files on the local operating system you will need the correct permissions. If you find you cannot write files to a particular directory, it might be that the host you are using won't allow you to do so. If this is the case you can either contact them to ask how to get write permission to a folder, or, if that isn't possible, you can easily change the code above to use a database instead.
I can't see my content - when using output buffering all the echo and print commands do not get sent to the browser, they instead get saved up in memory. PHP should automatically output all the stored content when the script exits, but if you use a command like ob_end_clean() this actually wipes the 'buffer' so all the content is erased. This can lead to confusing situations when you know you are echoing something.. but it just isn't appearing.
(Mini Disclaimer :) I've typed all the above manually so you may find there are PHP errors, if so, and they are baffling, just write them back here and StackOverflow can help you out)

Instead of trying to avoid preg_match, why not just trim your document contents down in size? For example, you could dump everything before <body and everything after </body>. Then preg_match will already be searching less content.
Also, you could try to do each one of these processes as a pseudo separate thread, so that way they aren't happening one at a time.
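The trimming idea could be sketched like this. The function name bodyOnly() is made up for illustration, and if no <body tag is found the input is returned unchanged.

```php
<?php
// Hypothetical helper: keeps only the part of the page from "<body" up to
// and including "</body>", so that later searches scan less text.
function bodyOnly(string $html): string
{
    $start = stripos($html, '<body');
    if ($start === false) {
        return $html; // no body tag: nothing to trim
    }
    $end = stripos($html, '</body>', $start);
    if ($end === false) {
        return substr($html, $start); // unclosed body: keep the rest
    }
    return substr($html, $start, $end + strlen('</body>') - $start);
}

$page = '<html><head><title>x</title></head>'
      . '<body><dt>Win rate</dt><dd><div>50%</div></dd></body></html>';
echo bodyOnly($page), "\n";
```

Whether this saves noticeable time depends on how large the <head> section is compared to the rest of the page.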

php scanning content for specific keywords

As part of a CMS admin, I would like to scan new articles for specific keyphrases/tags that are stored in a mysql db.
I am proficient enough to be able to pull the list of keywords out, loop through them and do stripos, and substr_count to build an array of the found keywords. but the average article is about 700 words and there are 16,000 tags and growing so currently the loop takes about 0.5s which was longer than I had hoped, and will only ever get longer.
Is there a better way of doing this? Even if this type of procedure has a special name, that could help.
I have PHP 5.3 on Fedora, it is also on dedicated servers so I don't have any shared host issues.
EDIT - I am such a scatterbrain, I swore blind that I copy and pasted some code! clearly not
$found = array();
while ($row = $pointer->fetch_assoc())
{
    // compare against false: a keyword at position 0 of the article is still a match
    if (stripos($haystack, $row["Name"]) !== false)
    {
        $found[$row["Name"]] = substr_count($haystack, $row["Name"]);
    }
}
arsort($found);
I think I explained myself badly, because I want to do the procedure on new articles they are currently not in the database, so I was just going to use $_POST in an ajax request, rather than saving the article to the DB first.

http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html is exactly what you are looking for if you don't want to use a search engine script such as sphinx/solr.

It sounds like your code looks something like this:
$foundKeywords = array();
foreach ($keywords as $keyword) {
    // haystack first, needle second; stripos() returns false (not -1) when nothing is found
    if (stripos($articleText, $keyword) !== false) {
        $foundKeywords[] = $keyword;
    }
}
Since the keywords array is so large and will continue to grow, you may consider switching your processing to loop through the words in the text instead of the keywords array. Something like this:
$textWords = explode(" ", $articleText);
$foundKeywords = array();
foreach ($textWords as $word) {
    // in_array() returns a boolean, unlike array_search(), whose key 0 is falsy
    if (in_array($word, $keywords) && !in_array($word, $foundKeywords)) {
        $foundKeywords[] = $word;
    }
}
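Looping over the text still scans the whole keyword array for every word. A sketch that turns the keywords into array keys, so each lookup is a single isset(), might look like this. The function name findKeywords() is made up for illustration, lower-casing uses strtolower() and so only handles ASCII letters, and multi-word key phrases are not covered.

```php
<?php
// Hypothetical helper: returns keyword => number of occurrences,
// most frequent first. Only single-word keywords are matched.
function findKeywords(string $articleText, array $keywords): array
{
    // keyword => true, so that membership is one hash lookup
    $lookup = array_fill_keys(array_map('strtolower', $keywords), true);

    // Split on anything that is not a letter or a digit.
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($articleText), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    $found = [];
    foreach ($words as $word) {
        if (isset($lookup[$word])) {
            $found[$word] = ($found[$word] ?? 0) + 1;
        }
    }
    arsort($found);
    return $found;
}

print_r(findKeywords('Health tips: healthy diets and health.', ['health', 'Diets', 'php']));
```

With 700 words per article the cost no longer depends much on how many tags there are, only on building the lookup once per request.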

Permanently write variables to a php file with php

I need to be able to permanently change variables in a php file using php.
I am creating a multilanguage site using codeigniter and using the language helper which stores the text in php files in variables in this format:
$lang['title'] = "Stuff";
I've been able to access the plain text of the files using fopen() etc., and it seems that I could probably locate the areas I want to edit with regular expressions and rewrite the file once I've made the changes, but that seems a bit hacky.
Is there any easy way to edit these variables permanently using php?
Cheers

If it's just an array you're dealing with, you may want to consider var_export. It will print out or return the expression in a format that's valid PHP code.
So if you had language_foo.php which contained a bunch of $lang['title'] = "Stuff"; lines, you could do something along the lines of:
include('language_foo.php');
$lang['title2'] = 'stuff2';
$data = '$lang = ' . var_export($lang, true) . ';';
file_put_contents('language_foo.php', '<?PHP ' . $data . ' ?>');
Alternatively, if you won't want to hand-edit them in the future, you should consider storing the data in a different way (such as in a database, or serialize()'d, etc etc).

It looks way easier to store data somewhere else (for instance, a database) and write a simple script to generate the *.php files, with this comment on top:
#
# THIS FILE IS AUTOGENERATED - DO NOT EDIT
#

I once faced a similar issue. I fixed it by simply adding a smarty template. The way I did it was as follows:
Read the array from the file
Add to the array
Pass the array to smarty
Loop over the array in smarty and generate the file using a template (this way you have total control, which might be missing in reg-ex)
Replace the file
Let me know if this helps.

Assuming that
You need the dictionary file in a human-readable and human-editable form (no serializing etc.)
The dictionary array is a one-dimensional, associative array:
I would
Include() the dictionary file inside a function
Do all necessary operations on the $lang array (add words, remove words, change words)
Write the $lang array back into the file using a simple loop:
foreach ($lang as $key => $value)
    // escape quotes and backslashes so the written line stays valid PHP
    fwrite($file, "\$lang['$key'] = '" . addcslashes($value, "'\\") . "';\n");
This is an extremely limited approach, of course. I would be really interested to see whether there is a genuine "PHP source code parser, changer and writer" around. This should be possible to do using the tokenizer functions.
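The three steps could be put together roughly as follows. The function names loadLang() and saveLang() and the temporary file are assumptions for illustration, and var_export() is used here only to get each value safely quoted.

```php
<?php
// Hypothetical helper: includes the dictionary file inside a function so that
// its $lang variable stays local, then returns it.
function loadLang(string $file): array
{
    $lang = [];
    include $file;
    return $lang;
}

// Hypothetical helper: writes a one-dimensional $lang array back, one
// human-editable assignment per line.
function saveLang(string $file, array $lang): void
{
    $out = "<?php\n";
    foreach ($lang as $key => $value) {
        // var_export() produces a correctly quoted and escaped PHP string literal
        $out .= '$lang[' . var_export((string) $key, true) . '] = '
              . var_export((string) $value, true) . ";\n";
    }
    file_put_contents($file, $out, LOCK_EX);
}

// Round trip on a temporary file instead of a real language file.
$file = tempnam(sys_get_temp_dir(), 'lang');
saveLang($file, ['title' => "It's stuff"]);
$lang = loadLang($file);
$lang['bye'] = 'Bye';
saveLang($file, $lang);
echo file_get_contents($file);
```

Comments and formatting in a hand-edited file are lost on the first save, which is part of why this approach is so limited.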

If it also is about a truly multilingual site, you might enjoy looking into the gettext extension of PHP. It falls back to a library that has been in use for localizing stuff for many years, and where tools to keep up with the translation files have been around for almost quite as long. This makes supporting all the languages in later revisions of the product more fun, too.
In other news, I would not use an array but rather appropriate definitions, so that you have a file
switch ($lang) {
case 'de':
define('HELLO','Hallo.');
define('BYE','Auf wiedersehen.');
break;
case 'fr':
define('HELLO','Bonjour');
define('BYE','Au revoir.');
break;
case 'en':
default:
define ('HELLO','Hello.');
define ('BYE','Bye.');
}
And I'd also auto-generate that from a database, if maintenance becomes a hassle.

Pear Config will let you read and write PHP files containing settings using its 'PHPArray' container. I have found that the generated PHP is more readable than that from var_export()

Assistance with building an inverted-index

It's part of an information retrieval thing I'm doing for school. The plan is to create a hashmap of words, using the first two letters of each word as the key and all words starting with those two letters, saved as a string, as the value. So,
hashmap["ba"] = "bad barley base"
Once I'm done tokenizing a line I take that hashmap, serialize it, and append it to the text file named after the key.
The idea is that if I take my data and spread it over hundreds of files I'll lessen the time it takes to fulfill a search by lessening the density of each file. The problem I am running into is when I'm making 100+ files in each run it happens to choke on creating a few files for whatever reason and so those entries are empty. Is there any way to make this more efficient? Is it worth continuing this, or should I abandon it?
I'd like to mention I'm using PHP. The two languages I know relatively intimately are PHP and Java. I chose PHP because the front end will be very simple to do and I will be able to add features like autocompletion/suggested search without a problem. I also see no benefit in using Java. Any help is appreciated, thanks.

I would use a single file to get and put the serialized string. I would also use json as the serialization.
Put the data
$string = "bad barley base";
$data = explode(" ",$string);
$hashmap["ba"] = $data;
$jsonContent = json_encode($hashmap);
file_put_contents("a-z.txt",$jsonContent);
Get the data
$jsonContent = file_get_contents("a-z.txt");
$hashmap = json_decode($jsonContent);
foreach($hashmap as $firstTwoCharacters => $value) {
if ($firstTwoCharacters == 'ba') {
$wordCount = count($value);
}
}

You didn't explain the problem you are trying to solve. I'm guessing you are trying to make a full text search engine, but you don't have document ids in your hashmap so I'm not sure how you are using the hashmap to find matching documents.
Assuming you want a full text search engine, I would look into using a trie for the data structure. You should be able to fit everything in it without it growing too large. Nodes that match a word you want to index would contain the ids of the documents containing that word.
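A bare-bones sketch of such a trie, using nested PHP arrays. The class name TrieIndex, its method names and the integer document ids are made up for illustration, and it handles single-byte characters only.

```php
<?php
// Hypothetical trie: each node has child nodes keyed by character and a set
// of ids of the documents that contain the word ending at that node.
final class TrieIndex
{
    private array $root = ['children' => [], 'docs' => []];

    public function add(string $word, int $docId): void
    {
        if ($word === '') {
            return;
        }
        $node = &$this->root;
        foreach (str_split(strtolower($word)) as $char) {
            if (!isset($node['children'][$char])) {
                $node['children'][$char] = ['children' => [], 'docs' => []];
            }
            // descend by reference so the insert below changes the real tree
            $node = &$node['children'][$char];
        }
        $node['docs'][$docId] = true;
    }

    // Returns the ids of all documents that contain exactly $word.
    public function find(string $word): array
    {
        if ($word === '') {
            return [];
        }
        $node = $this->root;
        foreach (str_split(strtolower($word)) as $char) {
            if (!isset($node['children'][$char])) {
                return [];
            }
            $node = $node['children'][$char];
        }
        return array_keys($node['docs']);
    }
}

$index = new TrieIndex();
$index->add('bad', 1);
$index->add('barley', 1);
$index->add('base', 2);
$index->add('bad', 3);
print_r($index->find('bad'));
```

Since all words that share a prefix live under the same node, the same structure can later be walked to offer autocompletion.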
