I’ve been playing around with searching text in big lists and found that using a PHP array seems to be a quick way of doing it.
E.g. if you had loads of place names and associated postcodes you could read them into a PHP array like this:
$place['place name here'] = "postcode";
Then to look up you just take the place you want to look up and plug it in to the array:
$postcode_sought = $place['place I want to look up'];
I thought I could speed this up using C++ but of course C++ does not allow (as far as I know) arrays with a string as the index.
The only way I can think to do it is to create vectors for the place and postcode and loop through the place vector looking for a match but the repeated string comparisons take forever as I'd expected. I also experimented with hashing the text but I still couldn’t get it anywhere near as fast as PHP.
I think PHP is written in C so my question is how does C manage to create this string index name functionality for PHP?
I’m not looking for the actual code or anything, it just seems to me that there must be some fundamental technique that is used for this and I was just wondering if there is anyone out there who could briefly explain it.
Thanks in advance.
I thought I could speed this up using C++ but of course C++ does not allow (as far as I know) arrays with a string as the index.
It does. You can use std::map as an associative array.
You could try using Berkeley DB. Back in the day it was the fastest, but by default it's disk-oriented. I don't know if you can run it in memory, but you can always mount the directory on tmpfs.
PHP probably uses a hash table internally. You can get quite far by writing a binary search: sort the keys and check the key in the middle, then check the middle of the remaining half, and so on until you've found the key. You can also try hashing the keys with MD5() so that you compare fixed-length hashes rather than arbitrary strings, although you would need to measure whether that is actually faster.
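The sort-then-halve lookup described above is a binary search. A minimal sketch in Python (used here purely for illustration; the place names and postcodes are made-up sample data, and `lookup` is a hypothetical helper name):

```python
from bisect import bisect_left

# Hypothetical sample data: (place, postcode) pairs, sorted by place name.
records = sorted([
    ("Leeds", "LS1"),
    ("Bath", "BA1"),
    ("York", "YO1"),
    ("Derby", "DE1"),
])
places = [place for place, _ in records]

def lookup(place):
    """Return the postcode for place, or None if it is not in the list."""
    i = bisect_left(places, place)  # repeatedly halves the search range
    if i < len(places) and places[i] == place:
        return records[i][1]
    return None
```

Each lookup needs roughly log2(n) string comparisons instead of the n comparisons of a linear scan, which is where the speed-up over looping through a vector comes from.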
C and C++ only allow integer types as array indexes, and C-style strings aren't even a distinct type in C/C++; they're actually arrays of chars.
As stated above, use std::map or similar.
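On the original question of what the "fundamental technique" is: the usual answer is a hash table. The string key is hashed to an integer, the integer selects a bucket, and only the handful of keys in that bucket are compared as strings. The sketch below is a toy version in Python with separate chaining; the class name, bucket count and hash function are assumptions made for illustration, not PHP's actual implementation.

```python
class StringTable:
    """Toy hash table with separate chaining; names here are illustrative only."""

    def __init__(self, n_buckets=1024):
        self.buckets = [[] for _ in range(n_buckets)]

    def _index(self, key):
        # Simple multiplicative string hash (an assumption for this sketch,
        # not necessarily the hash function PHP uses).
        h = 5381
        for ch in key:
            h = (h * 33 + ord(ch)) & 0xFFFFFFFF
        return h % len(self.buckets)

    def set(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key, default=None):
        # Only the few entries in one bucket are compared as strings.
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return default
```

In C++ the ready-made equivalents are std::unordered_map (a hash table) and std::map (a balanced tree), so there is rarely a need to write this by hand.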
What I am trying to implement is a rather trivial "take search results (as in title & short description), cluster them into meaningful named groups" program in PHP.
After hours of googling and countless searches on SO (yielding interesting results as always, albeit nothing really useful) I'm still unable to find any PHP library that would help me handle clustering.
Is there such a PHP library out there that I might have missed?
If not, is there any FOSS that handles clustering and has a decent API?
Like this:
Use a list of stopwords, get all words or phrases not in the stopwords, count occurrences of each, sort in descending order.
The stopword list needs to contain all common English terms. It should also include punctuation, and you will need to preg_replace all the punctuation to be a separate word first, e.g. "Something, like this." -> "Something , like this ." Alternatively, you can just remove all punctuation.
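$content = strtolower($content);                                  // normalise case so the a-z pattern keeps every letter
$content = preg_replace('/[^a-z\s]/', '', $content);              // remove punctuation
$words = preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY);  // split the string into words
$stopwords = 'the|and|is|your|me|for|where|etc...';
$stopwords = array_flip(explode('|', $stopwords));                // flip so isset() can test membership
$result = array(); $temp = array();
foreach ($words as $s) {
    if (isset($stopwords[$s]) || strlen($s) < 3) {
        // a stopword or very short word ends the current phrase
        if (count($temp) > 0) {
            $result[] = implode(' ', $temp);
            $temp = array();
        }
    } else {
        $temp[] = $s;
    }
}
if (count($temp) > 0) $result[] = implode(' ', $temp);            // flush the last phrase
$phrases = array_count_values($result);                           // phrase => number of occurrences
arsort($phrases);                                                 // most frequent first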
Now you have an associative array in order of the frequency of terms that occur in your input data.
How you want to do the matches depends upon you, and it depends largely on the length of the strings in the input data.
I would see if any of the top 3 array keys for one result match any of the top 3 for any other result in the data. These are then your groups.
Let me know if you have any trouble with this.
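The top-3 matching step described above could be sketched roughly as follows. This is Python rather than PHP, purely for illustration; the function names, the tiny stopword set and the rule that a group needs at least two results are all assumptions of this sketch.

```python
from collections import Counter

STOPWORDS = {"the", "and", "is", "your", "me", "for", "where"}  # sample list only

def top_phrases(text, n=3):
    """Return the n most frequent stopword-delimited phrases in text."""
    cleaned = "".join(c for c in text.lower() if c.isalpha() or c.isspace())
    phrases, current = [], []
    for w in cleaned.split():
        if w in STOPWORDS or len(w) < 3:
            if current:                      # a stopword ends the current phrase
                phrases.append(" ".join(current))
                current = []
        else:
            current.append(w)
    if current:
        phrases.append(" ".join(current))
    return {p for p, _ in Counter(phrases).most_common(n)}

def group_results(results, n=3):
    """Map each shared top phrase to the indexes of the results containing it."""
    groups = {}
    for i, text in enumerate(results):
        for phrase in top_phrases(text, n):
            groups.setdefault(phrase, []).append(i)
    # A phrase only names a group if at least two results share it.
    return {p: idx for p, idx in groups.items() if len(idx) > 1}
```

For example, given the results "Cheap flights for the summer", "Cheap flights and hotels" and "Garden tools", the first two end up in a group named "cheap flights" and the third stays ungrouped.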
"... cluster them into meaningful groups" is a bit to vague, you'll need to be more specific.
For starters you could look into K-Means clustering.
Have a look at this page and website:
PHP/ir: Information Retrieval and other interesting topics
EDIT: You could try some data mining yourself by cross referencing search results with something like the open directory dmoz RDF data dump and then enumerate the matching categories.
EDIT2: And here is a dmoz/category question that also mentions "Faceted Search"!
Dmoz/Monster algorithme to calculate count of each category and sub category?
If you're doing this for English only, you could use WordNet: http://wordnet.princeton.edu/. It's a lexicon widely used in research which provides, among other things, sets of synonyms for English words. The shortest distance between two words could then serve as a similarity metric to do clustering yourself as zaf proposed.
Apparently there is a PHP interface to WordNet here: http://www.foxsurfer.com/wordnet/. It came up in this question: How to use word Net with php, but I have not tried it. However, interfacing with a command line tool from PHP yourself is feasible as well.
You could also have a look at Programming Collective Intelligence (Chapter 3 : Discovering Groups) by Toby Segaran which goes through just this use case using Python. However, you should be able to implement things in PHP once you understand how it works.
Even though it is not PHP, the Carrot2 project offers several clustering engines and can be integrated with Solr.
This may be way off but check out OpenCalais. They have a web service which allows you to pass a block of text in and it will pass you back a parseable response of things that it found in the text, such as places, people, facts etc. You could use these categories to build your "clouds" and to choose which results to display.
I've used this library a few times in php and it's always been quite easy to work with.
Again, this might not be relevant to what you're trying to do. Maybe you could post an example of what you're trying to accomplish?
If you can pre-define the filters for your faceted search (the named groups) then it will be much easier.
Rather than relying on an algorithm that uses the current searcher's input and their particular results to generate the filter list, you would use an aggregate of the most commonly performed searches by all users and then tag results with them if they match.
You would end up with a table (or something) of URLs in a many-to-many join to a table of tags, so each result url could have several appropriate tags.
When the user searches, you simply match their search against the full index. But for the filters, you take the top results from among the current resultset.
I'll work on query examples if you want.
Which is the faster:
a regexp to search the contents of a large file for specific pattern, or
an array_search to search a large array to match value at any index.
Other things being equal, I would expect the array search to always be faster, since it doesn't have to read a file or parse and execute a regex.
It will depend on the type of data you have, the type of data you are searching for, and the amount. You really need to try it and find which works for you; there isn't a right answer any of us can give you without knowing the context and specific implementations.
Take a look at the benchmark class if you want to see some metrics and figure it out.
I can't find this anywhere. I have some old BASIC programs I am working on (thanks to QB64 coming out, they now work on WinXP through Win7).
In order to serialize (like PHP) I need to know how this process works so that I can make BASIC do it. It does not have to be fancy, but I would like to get an understanding of how it works.
I like the way PHP does it, although since BASIC cannot do "associative" arrays, I would think it is much easier.
So in simple terms, is there a source for serialize/unserialize?
Looks like you'd serialize it with simple string concatenation. Use something like "||" as your separator. Since there are no associative arrays, you don't have to worry about names, just values.
Then you'd use instr() and left$() or mid$() to split them back out.
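To make the idea concrete, here is a small sketch of the concatenate-and-split approach. It is written in Python rather than BASIC, with str.find standing in for INSTR and slicing standing in for MID$; the function names are hypothetical, and it assumes no value ever contains the separator.

```python
SEP = "||"  # separator; values must not contain it

def serialize(values):
    """Join values into one string by plain concatenation."""
    out = ""
    for i, v in enumerate(values):
        if i > 0:
            out += SEP
        out += str(v)
    return out

def unserialize(text):
    """Split the string back out by scanning for the separator."""
    values = []
    start = 0
    while True:
        pos = text.find(SEP, start)      # plays the role of INSTR
        if pos == -1:
            values.append(text[start:])  # last value: everything to the end
            return values
        values.append(text[start:pos])   # like MID$(text, start, length)
        start = pos + len(SEP)
```

Note that everything comes back as a string, so numeric values would need converting after unserializing (VAL in BASIC).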
For multidimensional arrays, it would be considerably more complex and I haven't given it the time to figure out exactly how I'd do it, but I thought about using separators like ||0|0|| for array(0,0) and ||0|1|| for array(0,1) or even ||0|1|1|| for array(0,1,1).
People search in my website and some of these searches are these ones:
tapoktrpasawe
qweasd qwa as
aıe qwo ıak kqw
qwe qwe qwe a
My question: is there any way to detect strings that are similar to the ones above?
I suppose it is impossible to detect 100% of them, but any solution will be welcomed :)
edit: I mean the "gibberish searches". For example, some people search for strings like "asdqweasdqw", "paykaprkg", "iwepr wepr ow" in my search engine, and I want to detect gibberish searches.
It doesn't matter whether the search returns 0 results or anything else; I can't use that logic.
Some new brands or products would be ignored if I only considered "regular words".
Thank you for your help
You could build a model of character-to-character transitions from a bunch of English text. For example, you find out how common it is for there to be an 'h' after a 't' (pretty common). In English, you expect that after a 'q' you'll get a 'u'. A 'q' followed by something other than a 'u' has very low probability, and hence should be pretty alarming. Normalize the counts in your tables so that you have probabilities. Then for a query, walk through the matrix and compute the product of the probabilities of the transitions you take, and normalize by the length of the query. When the number is low, you likely have a gibberish query (or something in a different language).
If you have a bunch of query logs, you might first make a model of general English text, and then heavily weight your own queries in that model training phase.
For background, read about Markov Chains.
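A minimal sketch of the transition model described above, assuming a 27-symbol alphabet (a to z plus space). Two choices here are my own assumptions rather than part of the description: add-one smoothing, so unseen pairs are rare rather than impossible, and averaging log-probabilities, which is the length-normalized product computed in a way that avoids underflow. The function names are hypothetical, and a useful threshold has to be tuned on real queries.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def normalize(text):
    """Lower-case and keep only letters and spaces."""
    return [c for c in text.lower() if c in ALPHABET]

def train(corpus):
    """Return log-probabilities of each character-to-character transition."""
    # Start every count at 1 (add-one smoothing).
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    chars = normalize(corpus)
    for a, b in zip(chars, chars[1:]):
        counts[a][b] += 1
    return {a: {b: math.log(n / sum(row.values())) for b, n in row.items()}
            for a, row in counts.items()}

def score(model, query):
    """Average log-probability per transition; lower means more gibberish-like."""
    chars = normalize(query)
    pairs = list(zip(chars, chars[1:]))
    if not pairs:
        return 0.0  # nothing to judge; callers should handle very short queries
    return sum(model[a][b] for a, b in pairs) / len(pairs)
```

Train on as much ordinary text (and as many genuine queries) as you can get, then flag queries whose score falls below a cut-off chosen from known good and known bad examples.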
Edit: I implemented this here in Python:
https://github.com/rrenaud/Gibberish-Detector
and buggedcom rewrote it in PHP:
https://github.com/buggedcom/Gibberish-Detector-PHP
my name is rob and i like to hack True
is this thing working? True
i hope so True
t2 chhsdfitoixcv False
ytjkacvzw False
yutthasxcvqer False
seems okay True
yay! True
You could do what Stackoverflow does and calculate the entropy of the string.
Of course, this is just one of many heuristics SO uses to determine low-quality answers, and should not be relied upon as 100% accurate.
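For reference, one common definition is the Shannon entropy of the string's character distribution, sketched below. Whether this matches what Stack Overflow actually computes is an assumption on my part, and the function name is hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())
```

Very low values point to repetitive input such as "qwe qwe qwe a", but random keyboard mashing can score as high as real text, so this is best combined with other signals.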
Assuming you mean gibberish searches... It would be more trouble than it's worth. You are providing them with a search functionality, so let them use it however they please. I'm sure there are some algorithms out there that detect strange character groupings, but it would probably be more resource/labour intensive than just simply returning no results.
I had to solve a closely related problem for a source code mining project, and although the package is written in Python and not PHP, it seemed worth mentioning here in case it can still be useful somehow. The package is Nostril (for "Nonsense String Evaluator") and it is aimed at determining whether strings extracted during source-code mining are likely to be class/function/variable/etc. identifiers or random gibberish. It works well on real text too, not just program identifiers. Nostril uses n-grams (similar to the Gibberish Detector in the answer by Rob Neuhaus) in combination with a custom TF-IDF scoring function. It comes pretrained, and is ready to use out of the box.
Example: the following code,
from nostril import nonsense
real_test = ['bunchofwords', 'getint', 'xywinlist', 'ioFlXFndrInfo',
             'DMEcalPreshowerDigis', 'httpredaksikatakamiwordpresscom']
junk_test = ['faiwtlwexu', 'asfgtqwafazfyiur', 'zxcvbnmlkjhgfdsaqwerty']
for s in real_test + junk_test:
    print('{}: {}'.format(s, 'nonsense' if nonsense(s) else 'real'))
will produce the following output:
bunchofwords: real
getint: real
xywinlist: real
ioFlXFndrInfo: real
DMEcalPreshowerDigis: real
httpredaksikatakamiwordpresscom: real
faiwtlwexu: nonsense
asfgtqwafazfyiur: nonsense
zxcvbnmlkjhgfdsaqwerty: nonsense
The project is on GitHub and I welcome contributions.
I'd think you could detect these strings the same way you could detect "regular words." It's just pattern matching, no?
As to why users are searching for these strings, that's the bigger question. You may be able to head off the gibberish searches some other way. For example, if it's comment spam phrases that people (or a script) are looking for, then install a CAPTCHA.
Edit: Another end-run around interpreting the input is to throttle it slightly. Allow a search every 10 seconds or so. (I recall seeing this on forum software, as well as various places on SO.) This will take some of the fun out of searching for sdfpjheroptuhdfj over and over again, and at the same time won't interfere with the users who are searching for, and finding, their stuff.
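A throttle like that can be very small. The sketch below (Python, with hypothetical class and parameter names) keeps the time of each client's last accepted search in memory; a real site would more likely key on session or IP address and store the timestamps somewhere shared.

```python
import time

class SearchThrottle:
    """Allow each client one search per `interval` seconds (illustrative names)."""

    def __init__(self, interval=10.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock          # injectable so the logic can be tested
        self.last_search = {}       # client id -> time of last accepted search

    def allow(self, client_id):
        now = self.clock()
        last = self.last_search.get(client_id)
        if last is not None and now - last < self.interval:
            return False            # too soon; reject without touching the timestamp
        self.last_search[client_id] = now
        return True
```

Call allow() before running each search and show a "please wait" message when it returns False.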
As some people commented, there are no hits in google for tapoktrpasawe or putjbtghguhjjjanika (Well, there are now, of course) so if you have a way to do a quick google search through an API, you could throw out any search terms that got no Google results and weren't the names of one of your products. Why you would want to do this is a whole other question - are you trying to save effort for your search library? Make your hand-review of "popular search terms" more meaningful? Or are you just frustrated at the inexplicable behaviour of some of the people out on the big wide internet? If it's the latter, my advice is just let it go, even if there is a way to prevent it. Some other weirdness will come along.
Short answer - Gibberish Search
A probabilistic language model works.
Logic
A word is made up of a sequence of characters, and some pairs of characters occur together more frequently than others. If we sum the frequencies of every pair of contiguous characters in a word and the sum crosses a threshold (tuned for English words), the word is considered proper English. In brief, this logic is known as a Markov chain.
Link
For the mathematics of gibberish and a better understanding, refer to the video https://www.youtube.com/watch?v=l15C8UJu17s . Thanks!
If the search is performed on products, you could cache their names or codes and check queries against that list before querying the database. Otherwise, if your site is for English-speaking users, you could build a dictionary of strings that aren't used in the English language, like qwkfagsd. As another answer notes, though, this will be more resource-intensive than doing nothing.
I am new to php and am asking for some coding help. I have little experience with php and have gone to the php.net site and read couple books to get some ideas on how to perform this task.
There seem to be many functions, and I am confused about which would be the best fit (e.g. fgetcsv, explode(), regex?) for extracting the data in the file. Then I would need assistance printing/displaying this information in an orderly fashion.
Here is what I need to do:
Import (read in) a txt file that is delimited (see sample).
The attributes are not always ordered and some records will have missing attributes.
Dynamically create a web table (HTML) to present this data.
Sample records:
attribute1=value;attribute2=value;attribute3=value;attribute4=value;
attribute1=value;attribute2=value;attribute4=value;
attribute1=value;attribute2=value;attribute3=value;
How do I go about this? What would be best practice for this? From my research it seems I would create an array? Multidimensional? Thank you for your time and insight, and I hope my question is clear.
Seems like homework, if so best to tag it as such.
You will want to look into file(), foreach() and explode() given that it is delimited by ;
The number of attributes should not matter if some are missing, but it all depends on how you set up the display data. Given that some are missing, though, you will need to know the largest number of attributes in order to set up the table correctly and avoid issues.
Best of luck!
I would first use the file() function, which will give you an array with each line as an element. Then use a couple of explodes and loops to get through it all: first explode on ';', then loop through each of the pieces and explode on '='.
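The two-level split described above, plus the table output, might look roughly like this. The sketch is in Python for illustration: in PHP, file() would supply the lines and explode() would do the splitting. The function names and the alphabetical column order are assumptions of this sketch.

```python
from html import escape

def parse_records(lines):
    """Turn 'k=v;k=v;' lines into a list of dicts, one per record."""
    records = []
    for line in lines:
        record = {}
        for pair in line.strip().split(";"):     # first split on ';'
            if "=" in pair:                      # skips the empty piece after the last ';'
                key, value = pair.split("=", 1)  # then split on '='
                record[key] = value
        if record:
            records.append(record)
    return records

def to_html_table(records):
    """Render records as an HTML table; missing attributes become empty cells."""
    columns = sorted({key for r in records for key in r})
    rows = ["<tr>" + "".join(f"<th>{escape(c)}</th>" for c in columns) + "</tr>"]
    for r in records:
        cells = "".join(f"<td>{escape(r.get(c, ''))}</td>" for c in columns)
        rows.append("<tr>" + cells + "</tr>")
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

Collecting the union of all attribute names before printing is what keeps the columns aligned when a record is missing an attribute or lists its attributes in a different order.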