Most efficient way to compare/match two large arrays? - php

I am writing a very process-intensive function in PHP that needs to be as optimized as it can get for speed, as it can take up to 60 seconds to complete in extreme cases. This is my situation:
I am trying to match an array of people to an XML list of jobs. Each person in the array has keywords that I have already analyzed, delimited by spaces. The jobs come from a large XML file.
It's currently set up like this:
$matches = array();
foreach($people as $person){
    foreach($jobs as $job){
        foreach($person['keywords'] as $keyword){
            $count = substr_count($job->title, $keyword);
            if($count > 0) $matches[$job->title] = $count;
        }
    }
}
I do the keywords loop a few times with different categories. It does what I need it to do, but it feels very sloppy and the process can take a very, very long time depending on the number of people/jobs.
Is there a more efficient, or faster, way of doing this?

Truthfully, your method is a bit sloppy, but I assume that's because you have some specially formatted data that you have to work around? Beyond being sloppy, though, the way you're processing things loses some data, which I don't think was intentional.
I see that you're not just checking "is the keyword in the job title", but "how many times is the keyword in the job title", and then you're storing this. This means that for the job title friendly friend of the friend company, the "keyword" friend shows up 3 times, and thus $matches["friendly friend of the friend company"] = 3. Since you're declaring $matches before you begin your $people foreach loop, though, you keep overwriting this value any time a later keyword or person matches the same title. In other words, if the first person has the keyword "friend" then $matches["friendly friend of the friend company"] is set to 3. Then if the second person has the keyword "friendly", this value is overwritten and $matches["friendly friend of the friend company"] now equals 1.
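The overwrite is easy to reproduce in isolation. The sketch below reuses the job title from the example; the two people and their keyword lists are made up for illustration, and the $history array exists only to make each write visible.

```php
<?php
// Hypothetical sample data, made up for illustration.
$title  = 'friendly friend of the friend company';
$people = array(
    array('keywords' => array('friend')),
    array('keywords' => array('friendly')),
);

$matches = array();
$history = array(); // every value written, in order, to make the overwrite visible
foreach ($people as $person) {
    foreach ($person['keywords'] as $keyword) {
        $count = substr_count($title, $keyword);
        if ($count > 0) {
            $matches[$title] = $count; // plain assignment: replaces the previous value
            $history[] = $count;
        }
    }
}

print_r($history);             // the values written: 3, then 1
echo $matches[$title], "\n";   // only the last one survives: 1
```

Whatever the first person contributed is gone by the time the loop finishes, which is why the stored number says nothing about the people as a group.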
I think what you wanted to do was count how many people have a keyword which is contained in the job title. In this case, rather than counting how many times $keyword appears in $job->title, you should just see if it appears, and respond accordingly.
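$matches = array();
foreach($people as $person){
    foreach($jobs as $job){
        foreach($person['keywords'] as $keyword){
            if(strpos($job->title, $keyword) !== FALSE){ /* "If $keyword exists in $job->title" */
                if(!isset($matches[$job->title])) $matches[$job->title] = 0; /* Start the counter on first use */
                $matches[$job->title]++; /* Increment "number of people who match" */
                break; /* Count each person once per job, however many keywords match */
            }
        }
    }
}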
Another possibility is that you wanted to know how many keywords a given person had which matched a given job title. In this case you'd want a separate array per person. This is done with a slight modification.
$matches = array();
foreach($people as $id => $person){ /* Use the person's array key; an array cannot itself be a key */
    $matches[$id] = array();
    foreach($jobs as $job){
        foreach($person['keywords'] as $keyword){
            if(strpos($job->title, $keyword) !== FALSE){ /* "If $keyword exists in $job->title" */
                if(!isset($matches[$id][$job->title])) $matches[$id][$job->title] = 0;
                $matches[$id][$job->title]++; /* Increment "number of keywords which match" */
            }
        }
    }
}
Or, alternatively, you could return to counting how many times a keyword matches now since per-person this is actually a meaningful value ("how well does the job match")
$matches = array();
foreach($people as $id => $person){ /* Use the person's array key; an array cannot itself be a key */
    $matches[$id] = array();
    foreach($jobs as $job){
        foreach($person['keywords'] as $keyword){
            if($count = substr_count($job->title, $keyword)){ /* if(0) = false */
                if(!isset($matches[$id][$job->title])) $matches[$id][$job->title] = 0;
                $matches[$id][$job->title] += $count; /* Increase "number of keywords which match" by $count */
            }
        }
    }
}
Essentially, before tackling the problem of making your loop more efficient, you need to figure out what your loop is really trying to accomplish. Once you have, your best bet for increasing efficiency is to cut the number of loop iterations to a minimum and use built-in functions wherever possible, since these are implemented in C (compiled rather than interpreted, and therefore quicker).

You could use an index of the words in the job titles to make the lookup more efficient:
$jobsByWords = array();
foreach ($jobs as $job) {
    preg_match_all('/\w+/', strtolower($job->title), $words);
    foreach ($words[0] as $word) {
        if (!isset($jobsByWords[$word])) $jobsByWords[$word] = array();
        $jobsByWords[$word][] = $job; // objects are stored by handle, so no reference is needed
    }
}
Then you just iterate the people and check if the keywords are in the index:
$matches = array();
foreach ($people as $person) {
    foreach ($person['keywords'] as $keyword) {
        $keyword = strtolower($keyword);
        if (isset($jobsByWords[$keyword])) {
            foreach ($jobsByWords[$keyword] as $job) {
                $matches[$job->title] = true;
            }
        }
    }
}
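To try the index approach without the XML file, here is a self-contained sketch. It assumes job titles are plain strings rather than objects with a title property, and the function names (buildTitleIndex, matchPeople) and the sample data are invented for illustration. It also counts matched keywords per person, which is one of the interpretations discussed in the other answer, not necessarily what the asker wants.

```php
<?php
/**
 * Build an inverted index: lowercase word => list of job titles containing it.
 * Titles are plain strings here; the original code reads $job->title instead.
 */
function buildTitleIndex(array $titles)
{
    $index = array();
    foreach ($titles as $title) {
        preg_match_all('/\w+/', strtolower($title), $words);
        // array_unique() keeps a title from being listed twice under one word.
        foreach (array_unique($words[0]) as $word) {
            $index[$word][] = $title;
        }
    }
    return $index;
}

/**
 * For each person, count how many of their keywords appear as a whole word
 * in each job title. Returns array(personKey => array(title => count)).
 */
function matchPeople(array $people, array $index)
{
    $matches = array();
    foreach ($people as $personKey => $person) {
        $matches[$personKey] = array();
        $keywords = array_unique(array_map('strtolower', $person['keywords']));
        foreach ($keywords as $keyword) {
            if (!isset($index[$keyword])) {
                continue; // no title contains this word
            }
            foreach ($index[$keyword] as $title) {
                if (!isset($matches[$personKey][$title])) {
                    $matches[$personKey][$title] = 0;
                }
                $matches[$personKey][$title]++;
            }
        }
    }
    return $matches;
}

// Hypothetical sample data.
$titles = array('Senior PHP Developer', 'PHP Tester', 'Office Manager');
$people = array(
    'alice' => array('keywords' => array('php', 'developer')),
    'bob'   => array('keywords' => array('manager')),
);

$result = matchPeople($people, buildTitleIndex($titles));
print_r($result);
```

One behavioural difference is worth keeping in mind: the index matches whole words only, so the keyword "friend" would no longer match the title word "friendly" the way substr_count() or strpos() does.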

Related

Finding a faster function than "gethostbyname"?

I run a script under XAMPP with a MySQL database.
It checks whether a domain name resolves to an IP address.
The problem is that I have to check over 100000 domain names from the database.
The function "gethostbyname" works great, but my solution is too slow.
while($row = mysqli_fetch_array($db_res)) { // get the domain name entries from the DB
    if (empty($row['status'])) {
        $items[] = $row['domainnames'];
    }
    foreach ($items AS $domain) {
        if ( gethostbyname($domain) != $domain ) {
            // do something...
        }
    }
}
How do I get it faster?
Your foreach() loop inside of your while() loop is simply a bad idea. Think about it.
As you iterate the result set, $items swells and swells -- this means that the foreach() will have to work longer and longer and longer.
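To put a number on that growth, here is a small sketch that swaps gethostbyname() for a stand-in closure that only counts calls, so it runs without network access. The domain names and variable names are made up; only the loop shapes mirror the code under discussion.

```php
<?php
// Stand-in for gethostbyname(): counts calls instead of touching the network.
// Everything here is illustrative; the domain names are made up.
$lookups = 0;
$resolve = function ($domain) use (&$lookups) {
    $lookups++;
    return $domain; // pretend nothing resolves
};

$rows = array('a.example', 'b.example', 'c.example', 'd.example', 'e.example');

// Shape of the original code: the inner loop re-walks everything collected so far.
$items = array();
foreach ($rows as $domain) {
    $items[] = $domain;
    foreach ($items as $item) {
        $resolve($item);
    }
}
$nestedLookups = $lookups; // 1 + 2 + 3 + 4 + 5 = 15

// One lookup per row.
$lookups = 0;
foreach ($rows as $domain) {
    $resolve($domain);
}
$flatLookups = $lookups; // 5

echo "nested: $nestedLookups, flat: $flatLookups\n";
```

For n rows the nested shape performs n(n+1)/2 lookups, so with 100000 rows that is about 5 billion calls instead of 100000.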
Ultimately, if you need to process the gethostbyname() value for the next task in your script, you should be storing that value at the same time that you INSERT the entry into your table the first time -- perhaps the new column can be host.
The smart money is not to call gethostbyname() 100000 times; have the value ready when you SELECT it.
Beyond the above logic, I don't see the need to declare an array with a single element/string, then iterate it.
In fact, your query's WHERE clause should do the filtering: keep only rows with a null/0/blank status value (the ones your loop actually processes) and compare the stored host (new column) value against the domain name, so that PHP doesn't have to bother with any qualifying/disqualifying conditions.
foreach ($db_res as $row) { // yes, you can simply iterate the result object
// do whatever with the associative string elements (e.g. $row['domainnames'])
// ...you know this is a string and not an array, right? -^^^^^^^^^^^^^^^^^^^
}
Thanks for the answers.
With your assistance I was able to reduce the procedure to:
while($row = mysqli_fetch_array($db_res))
{
    $domain = $row['domainnames'];
    if ( gethostbyname($domain) != $domain ) {
        // do something...
    }
    else{
        // do something else...
    }
}
It feels a little bit faster, but not fast enough.
@mickmackusa I now select only the rows with an empty "status" field:
$db_res = mysqli_query ($db_link, "select domainnames FROM domaintable WHERE status = ''");
It looks like each iteration of your while loop reuses the $items from the previous iterations, which wastes time, so please try this version (putting the foreach into the if):
while($row = mysqli_fetch_array($db_res)) { // get the domain name entries from the DB
    if (empty($row['status'])) {
        $items[] = $row['domainnames'];
        foreach ($items AS $domain) {
            if ( gethostbyname($domain) != $domain ) {
                // do something...
            }
        }
    }
}

PHP crawling data from website

I am currently trying to crawl a lot of data from a website, but I am struggling a little with it. The site has an a-z index and a 1-20 index, so there are a bunch of loops and DOM stuff in there. It managed to crawl and save about 10,000 rows on the first run, but now I am at around 15,000 and it is only crawling around 100 per run.
That is probably because it has to skip the rows it has already inserted (I made a check for that). I can't think of a way to easily skip some pages, as the 1-20 index varies a lot (for one letter there are 18 pages, for another only 2).
I was checking whether there was already a record with the given ID and, if not, inserting it. I assumed that would be slow, so now before the script starts I retrieve all rows and then check with an in_array(), assuming that's faster. But it just won't work.
So my crawler is navigating 26 letters, 20 pages for each letter, and then up to 50 links on each page, so if you calculate it, it's a lot.
I thought of running it letter by letter, but that won't really work, as I am still stuck at "a" and can't just hop onto "b" because I would miss records from "a".
I hope I have explained the problem well enough for someone to help me. My code looks roughly like this (I have removed some stuff here and there; I think all the important stuff is in here to give you an idea):
function in_array_r($needle, $haystack, $strict = false) {
foreach ($haystack as $item) {
if (($strict ? $item === $needle : $item == $needle) || (is_array($item) && in_array_r($needle, $item, $strict))) {
return true;
}
}
return false;
}
/* CONNECT TO DB */
mysql_connect()......
$qry = mysql_query("SELECT uid FROM tableName");
$all = array();
while ($row = mysql_fetch_array($qry)) {
$all[] = $row;
} // Retrieving all the current database rows to compare later
foreach (range("a", "z") as $key) {
for ($i = 1; $i < 20; $i++) {
$dom = new DomDocument();
$dom->loadHTMLFile("http://www.crawleddomain.com/".$i."/".$key.".htm");
$finder = new DomXPath($dom);
$classname="table-striped";
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");
foreach ($nodes as $node) {
$rows = $finder->query("//a[contains(@href, '/value')]", $node);
foreach ($rows as $row) {
$url = $row->getAttribute("href");
$dom2 = new DomDocument();
$dom2->loadHTMLFile("http://www.crawleddomain.com".$url);
$finder2 = new DomXPath($dom2);
$classname2="table-striped";
$nodes2 = $finder2->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname2 ')]");
foreach ($nodes2 as $node2) {
$rows2 = $finder2->query("//a[contains(@href, '/loremipsum')]", $node2);
foreach ($rows2 as $row2) {
$dom3 = new DomDocument();
//
// not so important variable declarations..
//
$dom3->loadHTMLFile("http://www.crawleddomain.com".$url);
$finder3 = new DomXPath($dom3);
//2 $finder3->query() right here
$query231 = mysql_query("SELECT id FROM tableName WHERE uid='$uid'");
$result = mysql_fetch_assoc($query231);
//Doing this to get category ID from another table, to insert with this row..
$id = $result['id'];
if (!in_array_r($uid, $all)) { // if not exist
mysql_query("INSERT INTO')"); // insert the whole bunch
}
}
}
}
}
}
}
$uid is not defined, also, this query makes no sense:
mysql_query("INSERT INTO')");
You should turn on error reporting:
ini_set('display_errors',1);
error_reporting(E_ALL);
After your queries you should add an or die(mysql_error());
Also, I might as well say it, since if I don't someone else will: don't use mysql_* functions. They were deprecated in PHP 5.5 and removed in PHP 7.0. Try PDO.
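On the speed side of the question, the recursive in_array_r() walks every stored row for each crawled record, so the check gets slower as the table grows. A common alternative is to key an array by uid once and test with isset(). The sketch below is only an illustration: it assumes uid values are strings or integers, and the sample rows and crawled uids are made up.

```php
<?php
// Hypothetical rows, shaped like what "SELECT uid FROM tableName" returns.
$rows = array(
    array('uid' => 'a-101'),
    array('uid' => 'a-102'),
    array('uid' => 'b-201'),
);

// Build the lookup once: uid => true.
$known = array();
foreach ($rows as $row) {
    $known[$row['uid']] = true;
}

// Candidate uids as the crawler might encounter them (made up).
$crawled = array('a-101', 'c-301', 'b-201', 'c-301');

$toInsert = array();
foreach ($crawled as $uid) {
    if (isset($known[$uid])) {
        continue; // already stored, or already queued in this run
    }
    $known[$uid] = true;  // remember it so repeats within this run are skipped too
    $toInsert[] = $uid;   // the real INSERT would go here
}

print_r($toInsert);
```

This only removes the cost of the existence check; the pages still have to be downloaded and parsed before a uid is known.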

Get Popular words in PHP+MySQL

How do I go about getting the most popular words from multiple content tables in PHP/MySQL?
For example, I have a table forum_post with forum posts; each contains a subject and content.
Besides these I have multiple other tables with different fields which could also contain content to be analysed.
Left to myself, I would probably fetch all the content, strip (possible) HTML, explode the string on spaces, remove quotes, commas etc., and then count the words which are not common, keeping the counts in an array while running through all the words.
My main question is whether someone knows of a method which might be easier or faster.
I couldn't find any helpful answers about this; I might be using the wrong search terms.
Somebody's already done it.
The magic you're looking for is a php function called str_word_count().
In my example code below, if you get a lot of extraneous words from this you'll need to write custom stripping to remove them. Additionally you'll want to strip all of the html tags from the words and other characters as well.
I use something similar to this for keyword generation (obviously that code is proprietary). In short, we take the provided text, check the word frequency, and sort the words in an array by frequency, so the most frequent words come first in the output. We're not counting words that only occur once.
<?php
$text = "your text.";
//Setup the array for storing word counts
$freqData = array();
foreach( str_word_count( $text, 1 ) as $words ){
// For each word found in the frequency table, increment its value by one
array_key_exists( $words, $freqData ) ? $freqData[ $words ]++ : $freqData[ $words ] = 1;
}
$list = '';
arsort($freqData);
foreach ($freqData as $word=>$count){
if ($count > 1){ // skip words that occur only once
$list .= "$word ";
}
}
if (empty($list)){
$list = "Not enough duplicate words for popularity contest.";
}
echo $list;
?>
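The "custom stripping" mentioned above could look something like the following sketch. The function name popularWords and the stop-word list are assumptions for illustration; the built-ins it relies on (strip_tags, str_word_count, array_diff, array_count_values, arsort, array_filter) are standard PHP.

```php
<?php
/**
 * Return words occurring at least $minCount times, most frequent first.
 * $stopWords is a caller-supplied list of words to ignore (illustrative).
 */
function popularWords($html, array $stopWords = array(), $minCount = 2)
{
    $text  = strtolower(strip_tags($html));   // drop markup, fold case
    $words = str_word_count($text, 1);        // 1 = return the words as an array
    $words = array_diff($words, $stopWords);  // remove common words
    $freq  = array_count_values($words);      // word => occurrences
    arsort($freq);                            // highest count first, keys kept
    return array_filter($freq, function ($count) use ($minCount) {
        return $count >= $minCount;
    });
}

$sample = '<p>The cat sat. The <b>cat</b> ran, and the dog sat; the cat slept.</p>';
$popular = popularWords($sample, array('the', 'and'));
print_r($popular);
```

Note that strip_tags() removes tags without inserting spaces, so markup such as adjacent paragraphs with no whitespace between them can glue two words together; replacing tags with a space first would avoid that.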
I see you've accepted an answer, but I want to give you an alternative that might be more flexible in a sense (decide for yourself :-)). I've not tested the code, but I think you get the picture. $dbh is a PDO connection object. It's then up to you what you want to do with the resulting $words array.
<?php
$words = array();
$tableName = 'party'; //The name of the table
countWordsFromTable($dbh, $words, $tableName);
$tableName = 'party2'; //The name of the table
countWordsFromTable($dbh, $words, $tableName);
//Example output array:
/*
$words['word'][0] = 'happy'; //Happy from table party
$words['wordcount'][0] = 5;
$words['word'][1] = 'bulldog'; //Bulldog from table party2
$words['wordcount'][1] = 15;
$words['word'][2] = 'pokerface'; //Pokerface from table party2
$words['wordcount'][2] = 2;
*/
$maxKeys = array_keys($words['wordcount'], max($words['wordcount'])); //Indexes holding the highest wordcount
$popularIndex = $maxKeys[0]; //Get only one value...
$mostPopularWord = $words['word'][$popularIndex];
function countWordsFromTable(PDO $dbh, array &$words, $tableName) {
    //Table and column names cannot be bound as parameters, so they are interpolated here.
    //$tableName must therefore come from trusted code, never from user input.
    $q = $dbh->query("DESCRIBE `$tableName`");
    $tableFields = $q->fetchAll(PDO::FETCH_COLUMN); //First column of DESCRIBE is the field name
    //Go through all fields and store count of words and their content in array $words
    foreach($tableFields as $dbCol) {
        //Get count and the content of words from every column in db
        $wordCountQuery = "SELECT `$dbCol` AS word, LENGTH(`$dbCol`) - LENGTH(REPLACE(`$dbCol`, ' ', ''))+1 AS wordcount FROM `$tableName`";
        $wrds = $dbh->query($wordCountQuery)->fetchAll(PDO::FETCH_ASSOC);
        //Add result to array $words
        foreach($wrds as $w) {
            $words['word'][] = $w['word'];
            $words['wordcount'][] = $w['wordcount'];
        }
    }
}
?>

Show all keys with phpcassa

I'm fairly new to Cassandra, but I have been making good progress so far.
$conn = new ConnectionPool('Cluster');
$User = new ColumnFamily($conn, 'User');
$index_exp = CassandraUtil::create_index_expression('email', 'John@dsaads.com');
$index_clause = CassandraUtil::create_index_clause(array($index_exp));
$rows = $User->get_indexed_slices($index_clause);
foreach($rows as $key => $columns) {
echo $columns['name']."<br />";
}
I'm using this type of query to get specific data from somebody's email address.
However, I now want to do 2 things:
1. Count every user in the database and display the number
2. List every user in the database with $columns['name']." ".$columns['email']
In MySQL I would just remove the WHERE clause from the SELECT query, but I think it's a little more complicated here?
In Cassandra there's no easy way to count all of the rows. You basically have to scan everything. If this is something that you want to do often, you're doing it wrong. Example code:
$rows = $User->get_range("", "", 1000000);
$count = 0;
foreach($rows as $row) {
$count += 1;
}
The answer to the second part is similar:
$rows = $User->get_range("", "", 1000000, null, array("name", "email"));
foreach($rows as $key => $columns) {
echo $columns["name"]." ".$columns["email"];
}
Tyler Hobbs gives a very nice example.
However, if you have many users, you do not want to iterate over them all the time.
It is better to run this iteration once or twice per day and store the result in Cassandra or memcached / Redis.
I would also create a CF with a single row and put all usernames (or user keys) there in that single row. However, some consider this an odd practice and some people will not recommend it. Then you do:
$count = $cf->get_count($rowkey = 0);
Note that get_count() is a slow operation too, so you still need to cache it.
If get_count() returns 100, you will need to upgrade your phpcassa to the latest version.
About the second part: if you have fewer than 4000-5000 users, I would once again do something odd and put them in a single row as supercolumns. Then the read takes just one operation:
$users = $scf->get($rowkey = 0, new ColumnSlice("", "", 5000));
foreach($users as $user){
echo $user["name"]." ".$user["email"];
}

PHP/MySQL: Highlight "SOUNDS LIKE" query results

Quick MYSQL/PHP question. I'm using a "not-so-strict" search query as a fallback if no results are found with a normal search query, to the tune of:
foreach($find_array as $word) {
$clauses[] = "(firstname SOUNDS LIKE '$word%' OR lastname SOUNDS LIKE '$word%')";
}
if (!empty($clauses)) $filter='('.implode(' AND ', $clauses).')';
$query = "SELECT * FROM table WHERE $filter";
Now, I'm using PHP to highlight the results, like:
foreach ($find_array as $term_to_highlight){
foreach ($result as $key => $result_string){
$result[$key]=highlight_stuff($result_string, $term_to_highlight);
}
}
But this method falls on its ass when I don't know what to highlight. Is there any way to find out what the "sound-alike" match is when running that mysql query?
That is to say, if someone searches for "Joan" I want it to highlight "John" instead.
Note that SOUNDS LIKE does not work as you think it does. It is not equivalent to LIKE in MySQL, as it does not support the % wildcard.
This means your query will not find "John David" when searching for "John". This might be acceptable if this is just your fallback, but it is not ideal.
So here is a different suggestion (that might need improvement): first use PHP's soundex() function to find the soundex of the keyword you are looking for.
$soundex = soundex($word);
$soundexPrefix = substr($soundex, 0, 2); // first two characters of soundex
$sql = "SELECT lastname, firstname ".
"FROM table WHERE SOUNDEX(lastname) LIKE '$soundexPrefix%' ".
"OR SOUNDEX(firstname) LIKE '$soundexPrefix%'";
Now you'll have a list of first names and last names that sound vaguely similar (this might be a lot of entries, and you might want to increase the length of the soundex prefix you use for your search). You can then calculate the Levenshtein distance between the soundex of each word and the soundex of your search term, and sort by that.
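The sorting step could be sketched as follows. The function name rankBySound and the candidate names are invented for illustration, and the tie-break on the raw names is an extra assumption that goes beyond the suggestion above; soundex() and levenshtein() are PHP built-ins.

```php
<?php
/**
 * Order candidate names by how close their soundex key is to the search term's.
 * Ties are broken by the edit distance between the lowercased names themselves.
 */
function rankBySound($term, array $candidates)
{
    $termKey = soundex($term);
    usort($candidates, function ($a, $b) use ($term, $termKey) {
        $byKey = levenshtein(soundex($a), $termKey) - levenshtein(soundex($b), $termKey);
        if ($byKey !== 0) {
            return $byKey;
        }
        return levenshtein(strtolower($a), strtolower($term))
             - levenshtein(strtolower($b), strtolower($term));
    });
    return $candidates;
}

// Hypothetical names, as if returned by the prefix query above.
$ranked = rankBySound('Joan', array('James', 'Jane', 'John'));
print_r($ranked); // John, Jane, James
```

"John" and "Jane" share the soundex key J500 with "Joan", so only the tie-break separates them, while "James" (J520) is one edit away on the key and sorts last.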
Second, you should look at parameterized queries in MySQL, to avoid SQL injection bugs.
The SOUNDS LIKE condition just compares the SOUNDEX keys of both words, and you can use the PHP soundex() function to generate the same key.
So, if you found a matching row and needed to find out which word to highlight, you can fetch both the firstname and lastname, and then use PHP to find which one matches and highlight just that word.
I made this code just to try this out. (Had to test my theory xD)
<?php
// A space separated string of keywords, presumably from a search box somewhere.
$search_string = 'John Doe';
// Create a data array to contain the keywords and their matches.
// Keywords are grouped by their soundex keys.
$data = array();
foreach(explode(' ', $search_string) as $_word) {
$data[soundex($_word)]['keywords'][] = $_word;
}
// Execute a query to find all rows matching the soundex keys for the words.
$soundex_list = "'". implode("','", array_keys($data)) ."'";
$sql = "SELECT id, firstname, lastname
FROM sounds_like
WHERE SOUNDEX(firstname) IN({$soundex_list})
OR SOUNDEX(lastname) IN({$soundex_list})";
$sql_result = $dbLink->query($sql);
// Add the matches to their respective soundex key in the data array.
// This checks which word matched, the first or last name, and tags
// that word as the match so it can be highlighted later.
if($sql_result) {
while($_row = $sql_result->fetch_assoc()) {
foreach($data as $_soundex => &$_elem) {
if(soundex($_row['firstname']) == $_soundex) {
$_row['matches'] = 'firstname';
$_elem['matches'][] = $_row;
}
else if(soundex($_row['lastname']) == $_soundex) {
$_row['matches'] = 'lastname';
$_elem['matches'][] = $_row;
}
}
}
}
// Print the results as a simple text list.
header('content-type: text/plain');
echo "-- Possible results --\n";
foreach($data as $_group) {
// Print the keywords for this group's soundex key.
$keyword_list = "'". implode("', '", $_group['keywords']) ."'";
echo "For keywords: {$keyword_list}\n";
// Print all the matches for this group, if any.
if(isset($_group['matches']) && count($_group['matches']) > 0) {
foreach($_group['matches'] as $_match) {
// Highlight the matching word by encapsulating it in dashes.
if($_match['matches'] == 'firstname') {
$_match['firstname'] = "-{$_match['firstname']}-";
}
else {
$_match['lastname'] = "-{$_match['lastname']}-";
}
echo " #{$_match['id']}: {$_match['firstname']} {$_match['lastname']}\n";
}
}
else {
echo " No matches.\n";
}
}
?>
A more generalized function, to pull out the matching soundex word from a string, could look like:
<?php
/**
* Attempts to find the first word in the $haystack that is a soundex
* match for the $needle.
*/
function find_soundex_match($haystack, $needle) {
    $words = explode(' ', $haystack);
    $needle_soundex = soundex($needle);
    foreach($words as $_word) {
        if(soundex($_word) == $needle_soundex) {
            return $_word;
        }
    }
    return false;
}
?>
Which, if I am understanding it correctly, could be used in your previously posted code as:
foreach ($find_array as $term_to_highlight){
foreach ($result as $key => $result_string){
$match_to_highlight = find_soundex_match($result_string, $term_to_highlight);
$result[$key]=highlight_stuff($result_string, $match_to_highlight);
}
}
This wouldn't be as efficient, though, as the more targeted code in the first snippet.
