Sphinx field-start and field-end extended2 search not working - php

I know 'not working' is never a good start when asking for help but I have been at this on and off for months and I've got virtually nowhere.
So far I have at least determined I CAN get the field-start/end operators working, but ONLY when I stick in a trailing space character, like:
@gametitle "^diablo$ "
Strangely that returns JUST the game Diablo, however:
@gametitle "^diablo$"
Returns ALL games with Diablo in the name. So apparently I can rely on this extra space character to apply proper exact matching of the game titles (it seems to work with "^age of empires$ " too).
However, when it comes to my OTHER field, the one I actually want this full-field matching on (@console), I get no such luck. I simply get NO results (if I try "^PlayStation$ "), or else I get all the results with playstation in the console field (i.e. the PS1/2/3 and Portable) when I do "^PlayStation$".
Now the only difference between the @gametitle and @console fields is that the console field contains some NULL entries. I tried to get around this by selecting the string 'NULL' with an IF statement in MySQL (that's my source), but no joy. In addition, both the console and game title fields are VARCHAR(255) in MySQL.
I'm hoping someone will have an a-ha moment over this extra-space behaviour I've described, but I'm not holding my breath! Anyway, enough of my pessimism; looking forward to your thoughts.
I am using the PHP API provided by Sphinx, which I'm extending to make minor changes. I am querying a searchd instance running Sphinx v1.10-beta. Here are the query logs:
[...] 0.024 sec [ext2/1/attr- 7 (0,50)] [application] @gametitle "^age of empires$"
[...] 0.024 sec [ext2/1/attr- 1 (0,50)] [application] @gametitle "^age of empires$ "
There you can really see how the addition of the space knocks the record count down from 7 to 1, when really both queries should return 1...
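For reference, here is roughly how I issue the queries through the stock PHP API (the 'application' index name is from my logs above; host and port are whatever your searchd listens on):

require 'sphinxapi.php'; // the API file shipped with Sphinx

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);
$cl->SetMatchMode(SPH_MATCH_EXTENDED2);

// Works as expected, but only thanks to the trailing space:
$ok = $cl->Query('@gametitle "^diablo$ "', 'application');

// Without the space this returns every game containing "diablo":
$bad = $cl->Query('@gametitle "^diablo$"', 'application');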

I'm almost certain this is a bug in Sphinx.
I've added it to the issue tracker (http://sphinxsearch.com/bugs/view.php?id=909), but so far it hasn't been acknowledged.

Related

Managing old elements by timestamp in Redis sorted set

You have a sorted set A in Redis; every now and then you add new elements to it, and they get sorted by rank. You also have a sorted set B.
Is there a way to check whether there are elements in set A that have been there for more than, say, 20 seconds, and move them to sorted set B?
Because this checking operation is done very frequently, and the set can be very big, iterating through every element in the set is not a good solution; I need the fastest one.
Thanks.
UPDATE:
Here is what I was trying to do:
Basically the idea was: imagine you have some kind of game server that matches opponents when they put in a fight request. The current design is that every request goes into the set, and the rank/score is the player's rank, so any two players that are near each other in the list are perfect matches. Every 5 seconds or so a script gets called that pulls 50 rows from the top of the set and matches them two by two (and removes them). This was working fine, and I think it was a very fast solution. But then came the idea of creating bot (AI) players, so that when a player is waiting too long in the queue, he gets matched with a bot (AI) player. And I cannot figure out a way to see who is waiting too long. Maybe the entire idea was wrong, so any better ideas are welcome :) Thanks a lot.
If the score in your sorted set is a unix timestamp, you can use zrange to grab the oldest N items from set A. You can then do your checks, add qualifying entries to set B, and remove them from set A.
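To make that concrete, a minimal sketch with the phpredis extension (key names are placeholders; ZRANGEBYSCORE is used so the 20-second cutoff is explicit):

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$cutoff = time() - 20; // anything queued more than 20 seconds ago

// Members whose score (a unix timestamp) is older than the cutoff.
$old = $redis->zRangeByScore('setA', 0, $cutoff, ['withscores' => true]);

foreach ($old as $member => $score) {
    $redis->zAdd('setB', $score, $member); // keep the original timestamp
    $redis->zRem('setA', $member);
}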
If your scoring in set A is not based on timestamps, then you will have to iterate over set A entirely, or rethink your design. Redis keys do not carry an inherent timestamp of when they were added (and that holds doubly true for items within a key, such as members of a sorted set), so it has to be something you specifically create and track. Perhaps if you share more about what you are doing and why, we can help in more detail.
Edit:
Based on the additions to your question, I would recommend trying a solution similar to what @akonsu is proposing.
Specifically:
Sorted-Set-A: players by rank just as they are now.
Sorted-Set-B: uses the timestamp the person went into the queue as the score, and stores their user id. In other words, when you zadd to SetA with their rank and ID, you also zadd to SetB with the timestamp and ID.
When you match players, you remove them from both sets. If you pull your set of users to match from SetB using a zrange command to grab the X oldest entries, you will have the times they queued up, in order of entry (like a FIFO). You then run a zrangebyscore query on SetA with their rank +/- whatever rank range you need. If you get a match, you proceed with it by removing both players from both sets and moving on.
If there is no suitable opponent in SetA and their timestamp is old enough, you match them with an AI, then remove them from both sets and move on.
Essentially it is an index queue of users->timestamp. Doing it this way means shorter queue times for all users, as you are now matching them in order of queue length. You still use SetA for matching based on players' rank, but now you are indexing and prioritizing based on time.
The specific implementation may be a bit more involved than this, but as an overall strategy I think it fits what you need.
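A minimal sketch of that flow with phpredis; the key names, the +/-50 rank window, the 20-second bot cutoff, and startFight() are all assumptions:

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Enqueue: rank index in SetA, arrival-time index in SetB.
$redis->zAdd('setA', $rank, $playerId);
$redis->zAdd('setB', time(), $playerId);

// Matcher: walk the queue oldest-first (FIFO by enqueue time).
$waiting = $redis->zRange('setB', 0, 49, true); // member => enqueue time
foreach ($waiting as $playerId => $queuedAt) {
    $rank = $redis->zScore('setA', $playerId);
    if ($rank === false) {
        continue; // already matched and removed by an earlier iteration
    }
    // Opponents within +/- 50 rank points, excluding the player himself.
    $candidates = $redis->zRangeByScore('setA', $rank - 50, $rank + 50);
    $candidates = array_values(array_diff($candidates, [$playerId]));
    $opponent = $candidates[0] ?? null;

    if ($opponent === null && time() - $queuedAt < 20) {
        continue; // no human match yet, but not stale enough for a bot
    }
    foreach (array_filter([$playerId, $opponent]) as $p) {
        $redis->zRem('setA', $p);
        $redis->zRem('setB', $p);
    }
    startFight($playerId, $opponent); // hypothetical; null opponent => bot match
}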

In Sphinx, how does the search result display if the index updates in between two SetLimits calls

I have just started working on Sphinx with PHP. I was just wondering: if I set the limit to 20 records per call,
$cl->SetLimits ( 0, 20);
and the index is recreated, say, every 5 minutes with the --rotate option,
then when my application has to fetch the next 20 search results, I call
$cl->SetLimits ( 20, 20);
Suppose the index is recreated in between the two SetLimits calls, and say a new document is inserted with the highest weight (and I am sorting results by relevance).
Wouldn't the search results shift down by one position, so the earlier 20th record becomes the 21st record, and I get at position 21 the same result I got at position 20, meaning my application displays a duplicate search result? Is this true? Has anybody else run into this problem?
Or how should I overcome this?
Thanks!
Edit (Note: the next SetLimits call is triggered by a user event, say 'See more results')
Yes, that can happen.
But it usually happens so rarely that nobody notices.
About the only way to avoid it would be to store some sort of marker with the query: as well as a page number, you include a last id. Then on the second page and beyond, you use that id to exclude any results created since the search started.
On the first-page query, you look up the biggest id in the index, which needs a second query.
(This at least copes with new additions to the index; changes to existing documents are harder to handle, but can be dealt with in a similar way.)
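A minimal sketch of that idea with the stock PHP API; SetIDRange() is part of the API, while $page, $query and fetchMaxDocumentId() (e.g. a SELECT MAX(id) against the source table) are placeholders, and a session is assumed to be started:

require 'sphinxapi.php'; // the API file shipped with Sphinx

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);

if ($page === 0) {
    // First page: remember the newest document id at search time.
    $_SESSION['search_max_id'] = fetchMaxDocumentId(); // hypothetical helper
}
// Every page excludes documents indexed after the search began.
$cl->SetIDRange(0, $_SESSION['search_max_id']);
$cl->SetLimits($page * 20, 20);
$result = $cl->Query($query, 'myindex');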
SetLimits sets the offset into the result set server-side: http://php.net/manual/en/sphinxclient.setlimits.php.
So to answer your question: no, searchd runs the query with max_matches and saves a result set, and from there you work with that result set rather than the live indexed data.
One question though: why are you re-indexing every 5 minutes? It would be better to re-index only when your data actually changes.

PHP array phone number fields stored as int

I had a MySQL table with some user data which I needed to correct and migrate to a new MySQL table. I exported the table using the "Export to PHP Array" plugin for phpMyAdmin by Geoffray Warnants, and it returned a (PHP) array.
One of the fields contains a telephone number. Some of the entries were exported as strings; however, others have the telephone number represented as a number. When I try to read such a number, it returns something like:
4.36991052022E+12
when it should be:
4369910520219
I suppose the value is too big for an integer, so it ended up as a float; that must be the problem (hence the E+12).
I have close to 300 entries, and there is no way I am going to manually add quotes to the start and end of each number, especially since I also have a fax field.
Most recently, I tried (with the help of the Sublime Text 2 demo) to cast the numbers by writing (string) in front of them, but it doesn't work.
I'm kind of helpless now and ask for your help. What can I do?
Please take a look at this question, which should answer yours:
Convert a big integer to a full string in PHP
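That said, if the values still fit within a double's roughly 15 significant digits (a 13-digit phone number does), you can recover the digits without GMP; a minimal sketch:

$phone = 4369910520219.0; // the float the export produced
echo sprintf("%.0f\n", $phone);              // prints 4369910520219
echo number_format($phone, 0, '', ''), "\n"; // prints 4369910520219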
Since I didn't have the time to get through the "complicated" process of installing the GMP library, I decided to do it old-skool and just put double quotes ("") around every phone number value, no matter whether it was a string or a "big integer", and remove the (single) quotes from the final string.
Thanks to Sublime Text 2!
So I had:
array(..., 'phone'=>' 43 664 1000383', ...);
and
array(..., 'phone'=>4369910520219, ...);
Search for and find all occurrences of 'phone'=> and add " after them.
Then search for (in my case) ,'fax'=> and add " before it.
Then, for every string:
$user["phone"] = preg_replace("/'/", "", $user["phone"]); // strip the leftover single quotes
Thanks though for the library. I might actually use it someday. ;)
Greetings,
Joseph

Twitter-style trends with PHP/MySQL

I am coding a social network and I need a way to list the most-used trends. All statuses are stored in a content field, so what I need to do is match hashtag mentions such as: #trend1 #trend2 #anothertrend
and sort by them. Is there a way I can do this with MySQL, or would I have to do it in PHP?
Thanks in advance
The maths behind trends are somewhat complex; machine learning may be a bit over the top, but you probably need to work through some examples.
If you go with @deadtrunk's sample code, you would miss trends that have fired up in the last half hour; if you go with @eggyal's example, you miss trends that have been going strong all day but calmed down in the last half hour.
The classic solution to this problem is to use a derivative (http://en.wikipedia.org/wiki/Derivative): rank a tag by how fast its mention count is changing, not just by the raw count. It's worth building a sample database and experimenting with this, and making your solution flexible enough to change the formula over time.
Whilst you want to build something simple, your users will be used to trends, and will assume yours is broken if it doesn't work the way they expect.
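For illustration, the simplest discrete "derivative" is just the change in mention counts between two consecutive windows; a minimal sketch (how you fill the two count arrays is up to your schema):

// $current and $previous map tag => mention count for two
// consecutive half-hour windows.
function trendScores(array $current, array $previous): array
{
    $scores = [];
    foreach ($current as $tag => $count) {
        // Rate of change: mentions gained since the previous window.
        $scores[$tag] = $count - ($previous[$tag] ?? 0);
    }
    arsort($scores); // fastest climbers first
    return $scores;
}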
You should probably extract the hashtags in PHP code and store them in your database separately from the content of the post. This way you'll be able to query them directly, rather than parsing the content every time you sort.
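For example, a minimal extraction sketch (the exact regex and the lower-casing are assumptions about how you want to normalise tags):

$content = "Loving the new build! #trend1 #trend2 #anothertrend";
preg_match_all('/#(\w+)/u', $content, $matches);
$hashtags = array_map('strtolower', $matches[1]);
// $hashtags => ['trend1', 'trend2', 'anothertrend']
// Store each tag in its own table, keyed to the status id.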
I think it is better to store tags in a dedicated table and then run queries against it.
So if you have the following table layout:
trend | date
you'll be able to get trend counts using the following query:
SELECT COUNT(*), trend FROM `trends` WHERE `date` = '2012-05-10' GROUP BY trend
which returns results like:
18 test2
7 test3
Create a table that associates hashtags with statuses.
Select all status updates from some recent period - say, the last half hour - joined with the hashtag association table and group by hashtag.
The count in each group is an indication of "trend".
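A hedged sketch of that query; every table and column name here is an assumption to adapt to your schema:

SELECT h.tag, COUNT(*) AS mentions
FROM status_hashtags sh
JOIN statuses s ON s.id = sh.status_id
JOIN hashtags h ON h.id = sh.hashtag_id
WHERE s.created_at >= NOW() - INTERVAL 30 MINUTE
GROUP BY h.tag
ORDER BY mentions DESC
LIMIT 10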

Tricky file parsing. Inconsistent delimiters

I need to parse a file with the following format.
0000000 ...ISBN.. ..Author.. ..Title.. ..Edit.. ..Year.. ..Pub.. ..Comments.. NrtlExt Nrtl Next Navg NQoH UrtlExt Urtl Uext Uavg UQoH ABS NEB MBS FOL
ABE0001 0-679-73378-7 ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM 0.00 13.90 0.00 10.43 0 21.00 10.50 6.44 3.22 2 2.00 0.50 2.00 2.00 ABS
The ID and ISBN are not a problem; the title is. There is no set length for these fields, and there is no solid delimiter: the space works for most of the file, but not for these fields.
Another issue is that there is not always an entry in the comments field, and when there is, it contains spaces within its content.
So I can get the first two fields and the last fourteen. I need some help figuring out how to parse the middle six fields.
This file was generated by an older program that I cannot change. I am using php to parse this file.
I would also ask myself 'How good does this have to be?' and 'How many records are there?'
If, for example, you are parsing this list to put up a catalog of books to sell on a website, you probably want to be as good as you can, but expect that you will miss some titles, and build in a feedback mechanism so your users can help you fix the issues (and make it easy for you to fix them in your new format).
On the other hand, if you absolutely have to get it right because you will lose lots of money for each wrong parse, and there are only a few thousand books, you should plan on getting close and then doing a human review of the entire file.
(In my first job, we spent six weeks on a data conversion project to convert 150 records - not a good use of time.)
Find the title and publisher of the book by ISBN (in some online database) and parse only the rest :)
BTW, are you sure that what looks like a space actually is a space? There are other "invisible" characters (like the non-breaking space). I know, not a good idea, but apparently the author of that format was pretty creative...
You need to analyze your data by hand and find out what the year, edition and publisher look like. For example, if you find that the year is always two digits and the publisher always comes from some limited list, that is something you can start with.
While I don't see any way other than guessing a bit, I'd go about it something like this:
I'd shave off what I know I can parse reliably, leaving: ABE WOMAN IN THE DUNES (INT'L ED) 1st 64 RANDOM
From there I'd try to locate the edition, store and remove it, and split the string in two at that position, leaving ABE WOMAN IN THE DUNES (INT'L ED) and 64 RANDOM. Another option is to split on the year, but of course titles such as 1984 might present a problem. (Guessing the edition this way assumes it's always of the form 1st, 7th, 51st etc.)
Finally, I'd assume I could fairly reliably take the year 64 at the start of the second string and thereby narrow down the Publisher(/Comments) part further.
The rest is pure guesswork, unless you've got a list of authors or publishers somewhere to match against, as I'd assume there are not only comments with spaces but also publishers with spaces in their names. But at least you should be down to two strings: Author/Title in one and Publisher(/Comments) in the other.
All in all it should limit the manual part a bit.
Once done I'd also save it in a better format somewhere so I don't have to go about parsing it again ;)
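A minimal sketch of that "strip the reliable ends first" idea in PHP; the field counts come from the sample line above, the filename is a placeholder, and splitting the middle blob is exactly the guesswork described:

// Anchor on the 7-char ID, the ISBN, and the 15 trailing tokens
// (14 numbers plus the 3-letter flag); whatever is left in the
// middle is the Author/Title/Edit/Year/Pub/Comments blob.
$pattern = '/^(\S{7})\s+([\d-]{13})\s+(.*?)\s+((?:[\d.]+\s+){14}[A-Z]{3})$/';

foreach (file('inventory.txt') as $line) {
    if (!preg_match($pattern, rtrim($line), $m)) {
        continue; // flag for manual review
    }
    [, $id, $isbn, $middle, $tail] = $m;
    $figures = preg_split('/\s+/', $tail); // the 14 numbers + flag
    // Now guess inside $middle, e.g. split on the edition token:
    $parts = preg_split('/\s+(\d+(?:st|nd|rd|th))\s+/', $middle, 2,
                        PREG_SPLIT_DELIM_CAPTURE);
    // $parts => [author+title, edition, year+publisher(+comments)]
}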
I don't know if the PCRE engine allows multiple groups from within a selection, so:
([A-Z0-9]{7})\ (\d-\d{3}-\d{5}-\d)\
(.+)\ (\d+(?:st|nd|rd|th))\ (\d{2})\
([^\d.]+)\ (\d+\.\d{2})\ (\d+\.\d{2})\
(\d+\.\d{2})\ (\d+\.\d{2})\ (\d)\
(\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\
(\d+\.\d{2})\ (\d)\ (\d+\.\d{2})\
(\d+\.\d{2})\ (\d+\.\d{2})\ (\d+\.\d{2})\
(\w{3})
It does look quite ugly and doesn't fix your author/title problem, but it matches quite well for the rest.
Concerning that problem, I don't see any solution other than keeping a lookup table of authors, or using another service to look up title and author via the ISBN.
That is, assuming that, unlike in your example above, authors are not just represented by their first name.
Also double-check all the exceptions the above regex might hit, as titles may contain tokens like 1st as well.
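If you do try a pattern in that shape, PCRE's /x (extended) modifier makes the multi-line layout legal, since escaped spaces (\ ) still match literal spaces there; a hedged sketch of wiring it up, with the fourteen numeric fields collapsed into a single capture for brevity and the filename a placeholder:

$pattern = '/^([A-Z0-9]{7})\ (\d-\d{3}-\d{5}-\d)
    \ (.+)\ (\d+(?:st|nd|rd|th))\ (\d{2})
    \ ([^\d.]+)\ ((?:\d+(?:\.\d{2})?\ ){14})(\w{3})$/x';

foreach (file('inventory.txt') as $line) {
    if (preg_match($pattern, rtrim($line), $m)) {
        // $m[1]=id, $m[2]=ISBN, $m[3]=author+title, $m[4]=edition,
        // $m[5]=year, $m[6]=publisher(/comments), $m[7]=figures, $m[8]=flag
    }
}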
