I'm using generic Sphinx with Python (though I tested this against PHP as well and got the same problem). I have a table where I have several fields I want to be able to search in sphinx against but it seems like only some of the fields get indexed.
Here's my source (dbconfig just has the connection information):
source bill_src : dbconfig
{
sql_query = \
SELECT id,title,official_title,summary,state,chamber,UNIX_TIMESTAMP(last_action) AS bill_date FROM bill
sql_attr_timestamp = bill_date
sql_query_info = SELECT * FROM bill WHERE id=$id
}
Here's the index
index bills
{
source = bill_src
path = /var/data/bills
docinfo = extern
charset_type = sbcs
}
I'm trying to use extended match mode. It seems that title and summary are fine, but the official_title, state and chamber fields are ignored in the index. So, for example, if I do:
@official_title Affordable Care Act
I get:
query error: no field 'official_title' found in schema
but the same query with @summary produces results. Any idea what I'm missing?
EDIT
Here's the table I'm trying to index:
+--------------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| bt50_id | int(11) | YES | MUL | NULL | |
| type | varchar(10) | YES | | NULL | |
| title | varchar(255) | YES | | NULL | |
| official_title | text | YES | | NULL | |
| summary | text | YES | | NULL | |
| congresscritter_id | int(11) | NO | MUL | NULL | |
| last_action | datetime | YES | | NULL | |
| sunlight_id | varchar(45) | YES | | NULL | |
| number | int(11) | YES | | NULL | |
| state | char(2) | YES | | NULL | |
| chamber | varchar(45) | YES | | NULL | |
| session | varchar(45) | YES | | NULL | |
| featured | tinyint(1) | YES | | 0 | |
| source_url | varchar(255) | YES | | | |
+--------------------+--------------+------+-----+---------+----------------+
I seem to have fixed the problem, though I'll admit this is all dumb luck so it might not be a root cause:
First, I thought maybe it didn't like the order of the fields in the query. I had the only attribute field last, so I moved it to just after the ID:
SELECT id, UNIX_TIMESTAMP(last_action) AS bill_date, \
title,official_title,summary,state,chamber FROM bill
This did not fix the problem.
Second, I noticed that in all the examples the date fields are converted with UNIX_TIMESTAMP and then aliased back to the same name, so I changed UNIX_TIMESTAMP(last_action) AS bill_date to UNIX_TIMESTAMP(last_action) AS last_action ... that tripped me up at first, though, because on its own it still wasn't working.
Finally I dropped the date altogether and added each field back one at a time (re-indexing and testing each time). Each addition worked, and when I added the date field back at the end I was able to sort by it and search all the fields. So the final query is:
SELECT \
id,title,official_title,summary,state,chamber, \
UNIX_TIMESTAMP(last_action) AS last_action FROM bill
It seems that attribute fields must come after the full-text fields, and aliases must use the same name as the actual column. I find it strange that the date field seemed fine while other fields randomly disappeared.
I hope this helps someone else, though I feel it might be some kind of isolated bug that doesn't affect many people. (This is on OS X and Sphinx was compiled by hand.)
I'm a little rusty on Sphinx, but I believe your source { } block needs a sql_field_string definition.
source bill_src : dbconfig
{
sql_query = \
SELECT \
id,title,official_title,summary,state,chamber, \
UNIX_TIMESTAMP(last_action) AS bill_date \
FROM bill
sql_attr_timestamp = bill_date
sql_field_string = official_title
sql_query_info = SELECT * FROM bill WHERE id=$id
}
According to http://sphinxsearch.com/docs/1.10/conf-sql-field-string.html, the sql_field_string declaration both indexes the field and stores the string for later retrieval. That's different from sql_attr_string, which is stored but not indexed.
I have SQL that, in my head, should run in under 1 second:
SELECT mem.`epid`,
mem.`model_id`,
em.`UKM_Make`,
em.`UKM_Model`,
em.`UKM_CCM`,
em.`UKM_Submodel`,
em.`Year`,
em.`UKM_StreetName`,
f.`fit_part_number`
FROM `table_one` AS mem
INNER JOIN `table_two` em ON mem.`epid` = em.`ePID`
INNER JOIN `table_three` f ON `mem`.`model_id` = f.`fit_model_id`
LIMIT 1;
When I run this SQL in the terminal it takes 16 seconds. However, if I remove the line:
INNER JOIN `table_three` f ON `mem`.`model_id` = f.`fit_model_id`
then it executes in 0.03 seconds. Unfortunately for me, I'm not too sure how to debug MySQL performance issues, and this causes my PHP script to run out of memory trying to execute the query.
Here are my table structures:
table_one
+----------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+---------+------+-----+---------+-------+
| epid | int(11) | YES | | NULL | |
| model_id | int(11) | YES | | NULL | |
+----------+---------+------+-----+---------+-------+
table_two
+----------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------------+--------------+------+-----+---------+-------+
| id | int(11) | NO | PRI | NULL | |
| ePID | int(11) | NO | | NULL | |
| UKM_Make | varchar(100) | NO | | NULL | |
| UKM_Model | varchar(100) | NO | | NULL | |
| UKM_CCM | int(11) | NO | | NULL | |
| UKM_Submodel | varchar(100) | NO | | NULL | |
| Year | int(11) | NO | | NULL | |
| UKM_StreetName | varchar(100) | NO | | NULL | |
| Vehicle Type | varchar(100) | NO | | NULL | |
+----------------+--------------+------+-----+---------+-------+
table_three
+-----------------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+-------------+------+-----+---------+----------------+
| fit_fitment_id | int(11) | NO | PRI | NULL | auto_increment |
| fit_part_number | varchar(50) | NO | | NULL | |
| fit_model_id | int(11) | YES | | NULL | |
| fit_year_start | varchar(4) | YES | | NULL | |
| fit_year_end | varchar(4) | YES | | NULL | |
+-----------------+-------------+------+-----+---------+----------------+
The above is the output from DESCRIBE $table_name.
Is there anything that I'm obviously missing and if not, how can I try to find out why including table_three causes such a slow response time?
EDIT ONE:
After the indexing suggestion (I used CREATE INDEX fit_model ON table_three (fit_model_id)), the LIMIT 1 query runs in 0.00 seconds (in the MySQL CLI). Without the LIMIT it is still running since applying the suggestion ... so not quite there yet. Following Anton's suggestion I used EXPLAIN and got this output:
+------+-------------+-------+------+---------------+-----------+---------+----------------------+-------+-------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+---------------+-----------+---------+----------------------+-------+-------------------------------------------------+
| 1 | SIMPLE | mem | ALL | NULL | NULL | NULL | NULL | 5587 | Using where |
| 1 | SIMPLE | f | ref | fit_model | fit_model | 5 | mastern.mem.model_id | 14 | |
| 1 | SIMPLE | em | ALL | NULL | NULL | NULL | NULL | 36773 | Using where; Using join buffer (flat, BNL join) |
+------+-------------+-------+------+---------------+-----------+---------+----------------------+-------+-------------------------------------------------+
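In that output the em table is still being read with type ALL and empty possible_keys across ~36k rows, so (just a guess on my part, not something from the comments) an index on the join column of table_two might help the second join in the same way fit_model helped the first:
-- guess: index the other join column as well
CREATE INDEX idx_epid ON `table_two` (`ePID`);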
EDIT TWO
I've added a Foreign Key based on suggestions using the below query:
ALTER TABLE `table_one`
ADD CONSTRAINT `model_id_fk_tbl_three`
FOREIGN KEY (`model_id`)
REFERENCES `table_three` (`fit_model_id`)
MySQL is still executing the command; there are a lot of rows, so I'm half-expecting this behaviour. With PHP I can break up the query and build my array that way, so I guess that possibly solves the issue, though is there anything more I can do to reduce execution time?
Based on everyone's comments etc. I managed to perform a few things that made my query run a hell of a lot quicker and not crash my script.
1) Indexes
I created an index on my table_three for the field fit_model_id:
CREATE INDEX fit_model ON `table_three` (`fit_model_id`);
This made my LIMIT 1 query go from 16 seconds to 0.03 seconds of execution time (in the MySQL CLI).
However, 100 rows or so would still take a lot longer than I thought.
2) Foreign Keys
I created a foreign key that linked table_one.model_id = table_three.fit_model_id using the below query:
ALTER TABLE `table_one`
ADD CONSTRAINT `model_id_fk_tbl_three`
FOREIGN KEY (`model_id`)
REFERENCES `table_three` (`fit_model_id`)
This definitely helped, but still felt like more could be done.
3) OPTIMIZE TABLE
I then used OPTIMIZE TABLE on these tables:
table_one
table_three
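For completeness, the statements were along the lines of:
OPTIMIZE TABLE `table_one`;
OPTIMIZE TABLE `table_three`;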
This then made my script work and my query as fast as ever. However, I still had a large data set to get through, so I let the query run in the MySQL CLI while increasing the LIMIT by 1000 on each run to help the process along; it got all the way to 30K rows before it started crashing.
CLI took 31 minutes and 8 seconds to complete. So I did this:
31 x 60 = 1860
1860 + 8 = 1868
1868 / 448476 = 0.0042
So each row took 0.0042 seconds to complete - which is fast enough in my eyes.
Thanks to everyone for commenting and helping me debug and fix the issue :)
Based on the comments, the correct answer is as follows:
If a SELECT statement takes a long time to execute, add EXPLAIN in front of the SELECT.
Check whether possible_keys is empty for specific tables in the EXPLAIN output.
Add indexes / FOREIGN KEYs for the tables found in step 2. For very large tables it's also worth adjusting the MAX_EXECUTION_TIME variable (it can be set for a single query; see the sketch after this list).
After massive INSERT/UPDATE/DELETE operations, OPTIMIZE TABLE can improve performance as well.
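A sketch of setting MAX_EXECUTION_TIME for a single query (MySQL 5.7+; the 60000 ms value is only an example):
-- per-query optimizer hint: abort this SELECT if it runs longer than 60 seconds
SELECT /*+ MAX_EXECUTION_TIME(60000) */ mem.`epid`, f.`fit_part_number`
FROM `table_one` AS mem
INNER JOIN `table_three` f ON mem.`model_id` = f.`fit_model_id`;
-- or for every SELECT in the current session:
SET SESSION max_execution_time = 60000;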
I develop custom migration code using CiviCRM's PHP API calls like:
<?php
$result = civicrm_api3('Contact', 'create', array(
'sequential' => 1,
'contact_type' => "Household",
'nick_name' => "boo",
'first_name' => "moo",
));
There's a need to keep original IDs, but specifying 'id' or 'contact_id' above does not work. It either does not create the contact or updates an existing one.
The ID column is auto-incremented, for sure, but MySQL does allow inserting arbitrary unique values into such a column.
How would you proceed? Hack CiviCRM to somehow pass the id through to MySQL in the INSERT statement? Dump the SQL after the import and manipulate the IDs in place in the .sql text file (hard to maintain integrity)? Any suggestions?
I have at least ~300,000 entries to deal with, so a fully automated and robust solution is a must. Is there any SQL magic that could do this?
For those who are not familiar with CiviCRM, the table structure is the following:
mysql> desc civicrm_contact;
+--------------------------------+------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------------------+------------------+------+-----+-------------------+-----------------------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| contact_type | varchar(64) | YES | MUL | NULL | |
| contact_sub_type | varchar(255) | YES | MUL | NULL | |
| do_not_email | tinyint(4) | YES | | 0 | |
| do_not_phone | tinyint(4) | YES | | 0 | |
| do_not_mail | tinyint(4) | YES | | 0 | |
| do_not_sms | tinyint(4) | YES | | 0 | |
| do_not_trade | tinyint(4) | YES | | 0 | |
| is_opt_out | tinyint(4) | NO | | 0 | |
| legal_identifier | varchar(32) | YES | | NULL | |
| external_identifier | varchar(64) | YES | UNI | NULL | |
and we are talking about the first field (id).
You should use the external_identifier field, which is intended for exactly what you want.
This field is not used by CiviCRM itself, so there is no risk of messing with core functionality. It exists to link with an external system (a legacy one, for example).
CiviCRM considers the external_identifier to be unique, so it will throw an error (when using the API, I think) or update the existing contact (when using the CiviCRM contact import screen) if you try to insert a contact with the same external_identifier.
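If you store the original ID in external_identifier at import time, mapping back to the legacy records later is a plain lookup. A minimal sketch (the legacy value '12345' is just an example):
-- find the CiviCRM contact that was created for legacy record 12345
SELECT id, external_identifier
FROM civicrm_contact
WHERE external_identifier = '12345';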
I am working on a project and I ended up with the table below:
+------------+--------------+------+-----+---------------------+----------------+
| Field      | Type         | Null | Key | Default             | Extra          |
+------------+--------------+------+-----+---------------------+----------------+
| id         | int(11)      | NO   | MUL | NULL                | auto_increment |
| user_id    | int(11)      | NO   |     | NULL                |                |
| info       | varchar(255) | NO   |     | NULL                |                |
| country    | tinyint(3)   | NO   |     | NULL                |                |
| date_added | timestamp    | NO   |     | 0000-00-00 00:00:00 |                |
+------------+--------------+------+-----+---------------------+----------------+
Because I wanted to avoid storing country names as varchar everywhere, I thought I should use numeric IDs instead. My question is: would it be better to store the country IDs and names in a table, or to keep that mapping in a PHP file? The countries won't change or anything; it will be a list of around 100 of them.
Thanks!
Use a separate country table.
countries table
---------------
id
name
Then you can reference the country ID in your table. That way you make sure only countries from your list are added, you don't need to store strings everywhere, and you can easily change country names or add new ones.
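A minimal sketch of that setup (assuming both tables are InnoDB; `user_info` stands in for your existing table, whose real name I don't know, and the id type matches your tinyint(3) country column so the foreign key can be added):
CREATE TABLE countries (
    id   TINYINT      NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL
) ENGINE=InnoDB;

ALTER TABLE user_info
    ADD CONSTRAINT fk_user_info_country
    FOREIGN KEY (country) REFERENCES countries (id);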
I'm creating a portfolio website that has galleries that contain images. I want the user of this portfolio to be able to order the images within a gallery. The problem itself is fairly simple I'm just struggling with deciding on a solution to implement.
There are 2 solutions I've thought of so far:
Simply adding an order column (or priority?) and then querying with an ORDER BY clause on that column. The disadvantage of this being that to change the order of a single image I'd have to update every single image in the gallery.
The second method would be to add 2 nullable columns next and previous that simply store the ID of the next and previous image. This would then mean there would be less data to update when the order was changed; however, it would be much more complex to set up and I'm not entirely sure how I'd actually implement it.
Extra options would be great.
Are those options viable?
Are there better options?
How could / should they be implemented?
The current structure of the two tables in question is the following:
mysql> desc Gallery;
+--------------+------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+------------------+------+-----+-------------------+-----------------------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| title | varchar(255) | NO | | NULL | |
| subtitle | varchar(255) | NO | | NULL | |
| description | varchar(5000) | NO | | NULL | |
| date | datetime | NO | | NULL | |
| isActive | tinyint(1) | NO | | NULL | |
| lastModified | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+--------------+------------------+------+-----+-------------------+-----------------------------+
mysql> desc Image;
+--------------+------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+------------------+------+-----+-------------------+-----------------------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| galleryId | int(10) unsigned | NO | MUL | NULL | |
| description | varchar(250) | YES | | NULL | |
| path | varchar(250) | NO | | NULL | |
| lastModified | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+--------------+------------------+------+-----+-------------------+-----------------------------+
Currently there is no implementation of ordering in any form.
While option 1 is a bit ugly, you can do:
UPDATE `table` SET `order` = `order` + 1 WHERE `order` >= 'orderValueOfItemYouCareAbout';
This will update all the rest of the images and you won't have to do a ton of legwork. (Note that order is a reserved word in MySQL, hence the backticks.)
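Spelled out against the Image table from the question (a sketch; the order column and the position/id values are only illustrative): shift everything at the target position and below, then drop the moved image into the gap:
-- make room at position 2 in gallery 20
UPDATE `Image` SET `order` = `order` + 1
WHERE `galleryId` = 20 AND `order` >= 2;

-- place image 278 in the gap
UPDATE `Image` SET `order` = 2 WHERE `id` = 278;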
As bart2puck has said and I stated in the question, option 1 is a little bit ugly; it is however the option I have chosen to go with to simplify the solution all round.
I have added a column (displayOrder int UNSIGNED) to the Image table after path. When I want to re-order a row in the table I simply swap rows around. So, if I have 3 rows:
mysql> SELECT id, galleryId, description, displayOrder FROM Image ORDER BY displayOrder;
+-----+-----------+----------------------------------+--------------+
| id | galleryId | description | displayOrder |
+-----+-----------+----------------------------------+--------------+
| 271 | 20 | NULL | 1 |
| 270 | 20 | Tracks leading into the ocean... | 2 |
| 278 | 20 | NULL | 3 |
+-----+-----------+----------------------------------+--------------+
3 rows in set (0.00 sec)
If I want to re-order row 278 to appear second rather than third, I'll simply swap it with the second by doing the following:
UPDATE Image SET displayOrder =
CASE displayOrder
WHEN 2 THEN 3
WHEN 3 THEN 2
END
WHERE galleryId = 20
AND displayOrder BETWEEN 2 AND 3;
Resulting in:
mysql> SELECT id, galleryId, description, displayOrder FROM Image ORDER BY displayOrder;
+-----+-----------+----------------------------------+--------------+
| id | galleryId | description | displayOrder |
+-----+-----------+----------------------------------+--------------+
| 271 | 20 | NULL | 1 |
| 278 | 20 | NULL | 2 |
| 270 | 20 | Tracks leading into the ocean... | 3 |
+-----+-----------+----------------------------------+--------------+
3 rows in set (0.00 sec)
One possible issue that some people may find is that you can only alter the position by one place with this method, i.e. to move image 278 to appear first I'd have to make it second, then first, otherwise the current first image would appear third.
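If you do need to move an image by more than one place, a single UPDATE can still handle it (a sketch, not part of the original approach): reposition the moved image and shift the rows in between in one pass. For example, moving the image at position 3 straight to position 1 in gallery 20:
UPDATE Image
SET displayOrder = CASE
        WHEN displayOrder = 3 THEN 1        -- the image being moved
        ELSE displayOrder + 1               -- positions 1 and 2 shift down to 2 and 3
    END
WHERE galleryId = 20
  AND displayOrder BETWEEN 1 AND 3;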
I'm facing the following situation.
We've got a CMS with an entity that has translations. These translations are stored in a separate table with a one-to-many relationship, for example newsarticles and newsarticle_translations. The set of available languages is determined dynamically by the same CMS.
When entering a new newsarticle the editor is required to enter at least one translation; which of the available languages he chooses is up to him.
In the newsarticle overview in our CMS we would like to show a column with the (translated) article title, but since none of the languages is mandatory (one of them is, but I don't know which one) I don't really know how to construct my MySQL query to select a title for each newsarticle, regardless of the language that was entered.
And to make it all a little harder, our manager asked for the possibility to also sort on title, so fetching the translations in a separate query is ruled out as far as I know.
Does anyone have an idea how to solve this in the most efficient way?
Here are my table schemas if it helps:
> desc news;
+-----------------+----------------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+----------------+------+-----+-------------------+----------------+
| id | int(10) | NO | PRI | NULL | auto_increment |
| category_id | int(1) | YES | | NULL | |
| created | timestamp | NO | | CURRENT_TIMESTAMP | |
| user_id | int(10) | YES | | NULL | |
+-----------------+----------------+------+-----+-------------------+----------------+
> desc news_translations;
+-----------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| enabled | tinyint(1) | NO | | 0 | |
| news_id | int(1) unsigned | NO | | NULL | |
| title | varchar(255) | NO | | | |
| summary | text | YES | | NULL | |
| body | text | NO | | NULL | |
| language | varchar(2) | NO | | NULL | |
+-----------------+------------------+------+-----+---------+----------------+
PS: I've thought about subqueries and COALESCE() solutions, but those seem like rather dirty tricks; I'm wondering if there's something better that I'm not thinking of?
This is not a fast approach, but I think it gives you what you want.
Let me know how it works, and we can work on speed next :)
select nt.title
from news n
join news_translations nt on(n.id = nt.news_id)
where nt.title is not null
and nt.language = (
select max(x.language)
from news_translations x
where x.title is not null
and x.news_id = nt.news_id)
order
by nt.title;
Assuming I've read your problem aright, you want to get a list of titles for articles, preferring the "required" language? A query for that might go along the lines of ...
SELECT * FROM (
SELECT nt.`title`, nt.news_id
FROM news n
INNER JOIN news_translations nt ON (n.id = nt.news_id)
WHERE title != ''
ORDER BY
CASE
WHEN nt.language = 'en' THEN 3
WHEN nt.language = 'jp' THEN 2
WHEN nt.language = 'de' THEN 1
ELSE 0 END DESC
) AS t1
GROUP BY `news_id`
This example prefers a title in English (en) if available, Japanese (jp) as a second preference, and German (de) as a third, but will display the first 'other' entry if none of the requested languages are available.
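One caveat with the GROUP BY trick above: MySQL is free to return any row per group (and MySQL 5.7+ rejects the query under ONLY_FULL_GROUP_BY), so the ORDER BY in the subquery isn't guaranteed to decide which title survives. A sketch of an alternative that makes the preference explicit with a correlated subquery (same example language order, en/jp/de):
SELECT n.id,
       (SELECT nt.title
        FROM news_translations nt
        WHERE nt.news_id = n.id
          AND nt.title != ''
        ORDER BY FIELD(nt.language, 'en', 'jp', 'de') = 0,  -- unlisted languages last
                 FIELD(nt.language, 'en', 'jp', 'de')
        LIMIT 1) AS title
FROM news n
ORDER BY title;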