I have a database table with these fields:
pagerank
sites_in_google_index
conversion_rate
adsense_revenue
yahoo_backlinks
google_backlinks
and about 20 more parameters - all are integers collected every day
Now I need to let users define their own keys and visualize this data on graphs. For example, a user could define something like this:
NEW_KEY_1 = pagerank*google_backlinks-(conversion_rate+sites_in_google_index)
and
NEW_KEY_2 = adsense_revenue*NEW_KEY_1
He should then be able to plot these two keys on a graph for a selected time period, e.g. 01-12-2009 to 02-03-2011.
What is the best strategy to store and evaluate such data/keys?
The easiest way would be to let users define an SQL expression and use it directly in the query to your database. However, never ever do this (!), since it is one of the biggest security holes you can dig!
The general idea should be to provide some kind of "custom query language" to your users. You then need a parser to validate its expressions and create an abstract syntax tree out of it. This syntax tree can then be transformed to a corresponding SQL query part.
In your specific case you will basically be parsing arithmetic expressions: validate that the user-provided field contains only such expressions and that they are valid, then serialize the corresponding tree back into SQL arithmetic expressions. Other benefits are that you can ensure only valid field names are used, limit the complexity of expressions, and add custom operators at a later stage.
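As an illustration only, here is a minimal PHP sketch of that whitelist/validation step; the function name buildSqlExpression and the column list are hypothetical, and it only validates tokens and parentheses, not the full expression grammar:
// Validate a user-defined arithmetic expression against a whitelist of
// column names, then rebuild it as a SQL fragment. Hypothetical sketch.
function buildSqlExpression(string $userExpr, array $allowedColumns): string
{
    // Tokenize into identifiers, numbers, arithmetic operators and parentheses.
    $pattern = '/\s*([A-Za-z_][A-Za-z0-9_]*|\d+(?:\.\d+)?|[+\-*\/()])\s*/';
    if (preg_match_all($pattern, $userExpr, $m) === false) {
        throw new InvalidArgumentException('Could not parse expression');
    }
    // If anything in the input was not tokenized, it was an illegal character.
    if (implode('', array_map('trim', $m[0])) !== preg_replace('/\s+/', '', $userExpr)) {
        throw new InvalidArgumentException('Expression contains illegal characters');
    }
    $sqlParts = [];
    $depth = 0;
    foreach ($m[1] as $token) {
        if (ctype_alpha($token[0]) || $token[0] === '_') {
            if (!in_array($token, $allowedColumns, true)) {
                throw new InvalidArgumentException("Unknown field: $token");
            }
            $sqlParts[] = "`$token`";   // whitelisted column name
        } elseif ($token === '(') {
            $depth++;
            $sqlParts[] = $token;
        } elseif ($token === ')') {
            if (--$depth < 0) {
                throw new InvalidArgumentException('Unbalanced parentheses');
            }
            $sqlParts[] = $token;
        } else {
            $sqlParts[] = $token;       // number or arithmetic operator
        }
    }
    if ($depth !== 0) {
        throw new InvalidArgumentException('Unbalanced parentheses');
    }
    return implode(' ', $sqlParts);
}

// Usage: the returned fragment is safe to embed as a SELECT expression.
$allowed = ['pagerank', 'google_backlinks', 'conversion_rate', 'sites_in_google_index'];
$newKey1 = buildSqlExpression('pagerank*google_backlinks-(conversion_rate+sites_in_google_index)', $allowed);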
Related
To secure an HTML GET form submission so that users cannot use MySQL wildcards to retrieve the whole table's values, I was looking for a list of PHP/MySQL wildcard characters.
For example, this GET method URL takes lowrange & highrange values as 1 and 100 respectively, and generates the appropriate results within that range: example.com/form.php?lowrange=1&highrange=100
But my table values may range from -10000 to +10000, and a smart alec may try to get the whole list by changing the URL to example.com/form.php?lowrange=%&highrange=% (or other special characters like *, ?, etc.)
The basic purpose is to not allow anything that can lead to whole db values getting exposed in one shot.
So far, I've found the following characters to be avoided, as checked in the preg_match:
if(preg_match('/^[~`!##$%\^&\*\(\)\?\[\]\_]+$/',$url)) {
echo "OK";
}
else {
echo "NOT OK";
}
Any other characters to be included in the list to completely block the possibility of wildcard based querying?
There are string fields and number fields. String fields use LIKE matching (WHERE field1 LIKE '%GET-FORM-VALUE%'), and number fields use equality and BETWEEN matching (WHERE field2 = $GET-FORM-VALUE, or WHERE field3 BETWEEN $GET-FORM-LOVALUE AND $GET-FORM-HIVALUE) in SQL.
Thank you.
No doubt Prepared Statements are the best implementation, and they MUST be the norm.
But sometimes one gets into a "tricky scenario" where it may not be possible to implement them. For example, while working on a client project as an external vendor, I was required to do a similar implementation, but without access to the code that made the connection (e.g. execute_query could not be used, as the connection to the db was set up separately in another config file). So I was forced to implement "sanitization" of the incoming form values.
Given that, the only way was to check what data types and values were expected, and which wildcard characters could be used to exploit the submission.
If that is the case for you, then the alternative solution for your situation (string LIKE matching, and numbers EQUAL TO or BETWEEN two given numbers) is as follows:
As soon as the form is submitted, the first thing to do at the backend is:
Put a check on string fields so they contain only letters; an anchored pattern automatically blocks the percentage sign and underscore wildcards.
if (preg_match('/^[A-Za-z]+$/', $strfield))
{
// all good...proceed to execute the query
}
else
{
// error message
}
Similarly, put a check for numbers/floats on number fields, e.g. if (preg_match('/^-?[0-9]+(\.[0-9]+)?$/', $nofield))
Only if the above checks are satisfied should you proceed to connect to the database and run the query. Add more checks per field to prevent other wildcards, as needed.
Another option I implemented (it may not necessarily fit, but I mention it as food for thought): in addition to the above checks, first generate a count of the records that match the query. If the count is abnormally high, either throw an error asking the user to narrow the range and resubmit, or display a limited number of records per page, making it cumbersome for them to keep clicking.
Again to reiterate, go for Prepared Statements if you can.
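For reference, a minimal PDO sketch of the prepared-statement route (table and column names are placeholders; the LIKE escaping relies on MySQL's default backslash escape character):
// Hypothetical PDO sketch of the LIKE and BETWEEN queries with bound parameters.
$pdo = new PDO('mysql:host=localhost;dbname=mydb;charset=utf8mb4', 'user', 'pass',
    [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);

// Numeric BETWEEN: cast the inputs so wildcards are impossible.
$low  = (int) $_GET['lowrange'];
$high = (int) $_GET['highrange'];
$stmt = $pdo->prepare('SELECT * FROM records WHERE field3 BETWEEN :low AND :high');
$stmt->execute([':low' => $low, ':high' => $high]);

// String LIKE: escape %, _ and \ so they match literally inside the pattern.
$term = addcslashes($_GET['field1'], '%_\\');
$stmt = $pdo->prepare('SELECT * FROM records WHERE field1 LIKE :term');
$stmt->execute([':term' => '%' . $term . '%']);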
In my case, I have a table which stores a collection of records with similar information, but each with a unique type column, used in various parts of my application. I know, I know, this is "micro-optimisation", but it is an integral part of my application (it will store many records) and I would like it to be optimised. I am also simply curious: is it faster to use a text type and select it like
SELECT ... WHERE type = 'some_type'
or use a PHP defined constant like
const SOMETYPE = 1;
run_query('SELECT ... WHERE type = '.SOMETYPE);
?
String comparison will always be slower than integer comparison. Typically, the strings are stored in a separate table, perhaps called standard_types or whatever makes sense for the "constants" being stored. That table then has a unique id field that can be referenced by other tables.
This way, if you need the strings for reporting, the reporting queries can join to the "types" table for the display strings. Ideally, in my opinion, the id values should be standard numbers that can be expressed as enum values or constants in client code; this minimizes the dependence on the "types" table for non-reporting queries.
Some might argue against keeping the list of standard id values and their meanings coordinated across the database and one or more application codebases; but the alternative is coordinating standard strings across all of that (and different domains can handle those string values quite differently).
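A rough sketch of that layout (table and column names are illustrative, not from the question):
-- Lookup table holding the display strings for each "constant"
CREATE TABLE standard_types (
    id   TINYINT UNSIGNED NOT NULL PRIMARY KEY,
    name VARCHAR(32) NOT NULL UNIQUE
);

CREATE TABLE records (
    id      INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    type_id TINYINT UNSIGNED NOT NULL,
    FOREIGN KEY (type_id) REFERENCES standard_types (id)
);

-- Non-reporting query: compare against the integer id (matches a client-side
-- constant such as SOMETYPE = 1), no join needed.
SELECT * FROM records WHERE type_id = 1;

-- Reporting query: join to the lookup table only when the string is needed.
SELECT r.id, t.name
FROM records r
JOIN standard_types t ON t.id = r.type_id;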
The dominant time spent in any query is in fetching rows to work with. Functions, string vs int, etc, make only minor differences in performance. Focus on what is clean for your code, and what minimizes the number of rows touched.
Once you have done that, minimize the number of round trips to the server. Even so, I have created many web pages that do 20-50 queries (each well optimized); the page performance is adequate.
You could also consider the ENUM data type.
sex ENUM('unk', 'male', 'female') NOT NULL
gives you WHERE sex = 'male', implemented as a 1-byte integer under the covers.
I wonder if it is possible to query a specific part of a comma separated string, something like the following:
$results = mysql_query("SELECT * FROM table1 WHERE $pid=table1.recordA[2] ",$con);
$pid is a number
and recordA contains data like
34,9008,606,,416,2
where I want to check the third part (606)
Thank you in advance
Having comma-separated lists or any delimited data within a MySQL field is frowned upon and is to all intents and purposes bad practice.
Rather than querying an element of a delimited list within a MySQL field, consider breaking the field out into its own table and then creating an adjacency list to create a 1:many relationship between table1 and its associated values.
If you are committed to this route, the simplest method would be to use PHP to manage it, as MySQL has very few tools (beyond regex / text searches) to drill down to the data you want to extract. $parts = explode(',', $row['recordA']); would create an array of the values from the returned field, allowing you to run as many conditional checks against it as needed.
However, consider adding this to your 'need to re-write / re-think' list. A relational table structure would allow you to query the database for $pid's value directly, as it would be contained within its own field and linked back to table1.
If the delimited list is of indeterminate length, or the relationships between the values are hierarchical, you'd be better off searching Stack Overflow for information on Directed Acyclic Graphs in MySQL to find a better solution to the problem.
Without knowing the nature or the intended purpose for this script I can't answer in any more detail. I hope this has helped a little.
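If it helps, a rough sketch of the relational alternative described above (table and column names are illustrative, and it assumes table1 has an id primary key):
-- One row per position in the old comma-separated list
CREATE TABLE table1_values (
    table1_id INT UNSIGNED NOT NULL,
    position  TINYINT UNSIGNED NOT NULL,  -- 1, 2, 3, ... within the old list
    value     INT,                        -- NULL for the empty slots
    PRIMARY KEY (table1_id, position),
    FOREIGN KEY (table1_id) REFERENCES table1 (id)
);

-- "Check the third part" then becomes a direct, indexable lookup:
SELECT t.*
FROM table1 t
JOIN table1_values v ON v.table1_id = t.id
WHERE v.position = 3
  AND v.value = 606;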
How about this:
SELECT * FROM table1 WHERE FIND_IN_SET({$pid}, recordA) = 3
Make sure to index recordA. I love normalization as much as the next guy, but sometimes breaking it up is just more trouble than it's worth ;)
Background
I am creating a MySQL database to store items such as courses where there may be many attributes to a single course. For example:
A single course may have any or all of the following attributes:
Title (varchar)
Secondary Title (varchar)
Description (text)
Date
Time
Specific Location (varchar; eg. White Hall Room 7)
General Location (varchar; eg. Las Vegas, NV)
Location Coords (floats; eg. lat, long)
etc.
The database is set up as follows:
A table storing specific course info:
courses table:
Course_ID (a Primary Key unique ID for each course)
Creator_ID (a unique ID for the creator)
Creation_Date (datetime of course creation)
Modified_Date (where this is the most recent timestamp the course was modified)
The table storing each course's multiple attributes is set up as follows:
course_attributes table:
Attribute_ID (a unique ID for each attribute)
Course_ID (reference to the specific course attribute is for)
Attribute (varchar defining the attribute; eg. 'title')
Value (text containing value of specified attribute; eg. 'Title Of My Course')
Desire
I would like to search this database using Sphinx search. With this search, different fields carry different weights; for example, 'title' would be more important than 'description'.
Specific search fields that I wish to have are:
Title
Date
Location (string)
Location (geo - lat/long)
The Question
Should I define a View in Mysql to organize the attributes according to 'title', 'description', etc., or is there a way to define my sphinx.conf file to understand specific attributes?
I am open to all suggestions to solving this problem, whether it be rearrangement of the database/tables or the way in which I search.
Let me know if you need any additional details to help me find a solution.
Thanks in advance for the help
Update
OK, so after reading some of the answers, I feel that I should provide some additional information.
Latitude / Longitude
The latitude/longitude attributes are created by me internally after receiving the general location string. I can generate the values in any way I wish, meaning that I can store them together in a single lat/long attribute as 'float lat, float long' values or any other desired format. This is done only after they have been generated from the initial location string and verified. This is to guard against malformed data as @X-Zero and @Cody have suggested.
Keep in mind that the latitude and longitude was merely illustrating the need to have that field be searchable as opposed to anything more than that. It is simply another attribute; one of many.
Weighting Search Results
I know how to add weights to results in a Sphinx search query:
$cl->setFieldWeights( array('title'=>1000, 'description'=>500) );
This causes the title column to have a higher weight than the description column if the structure was as @X-Zero suggested. My question was more directed at how one would apply the above logic with the current table definition.
Database Structure, Views, and Efficiency
Using my introductory knowledge of Views, I was thinking that I could possibly create something that displays a row for each course where each attribute is its own column. I don't know how to accomplish this or if it's even possible.
I am not the most confident with database structures, but the reason I set my tables up as described was because there are many cases where not all of the fields will be completed for every course and I was attempting to be efficient [yes, it seems as though I've failed].
I was thinking that using my current structure, each attribute would contain a value and would therefore cause no wasted space in the table. Alternatively, if I had a table with tons of potential attributes, I would think there would be wasted space. If I am incorrect, I am happy to learn why my understanding is wrong.
Let me preface this by saying that I've never even heard of Sphinx, nor (obviously) used it. However, from a database perspective...
Doing multi-domain columns like this is a terrible (I will hunt you down and kill you) idea. For one thing, it's impossible to index or sort meaningfully, period. You also have to pray that you don't get a latitude attribute with textual data (and because this can only be enforced programmatically, I'm going to guarantee this will happen) - doing so will cause all distance-based formulas to crash. And speaking of location, what happens if somebody stores a latitude without a longitude (note that this is possible regardless of whether you are storing a single GeoLocation attribute or the pair)?
Your best bet is to do the following:
Figure out which attributes will always be required. These belong in the course table (...mostly).
For each related set of optional attributes, create a table (a sketch follows this list). For example, location (although this should probably be required...), which would contain Latitude/Longitude, City, State, Address, Room, etc. Allow the columns to be nullable (in sets - add constraints so users can't add just longitude and not latitude).
For every set of common queries, add a view. Even (perhaps especially) if you persist in using your current design, use a view. This promotes separation between the logical and physical implementations of the database. (This assumes searching by SQL.) You will then be able to search by specifying view_column is null or view_column = input_parameter, or whichever.
For weighted searching (assuming dynamic weighting) your query will need to use left joins (inside the view as well - please document this), and use prepared-statement host-parameters (just save yourself the trouble of trying to escape things yourself). Check each set of parameters (both lat and long, for example), and assign the input weighting to a new column (per attribute), which can be summed up into a 'total' column (which must be over some threshold).
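As a rough illustration of the first two points (table and column names are mine, not from the question):
-- Required attributes live on the course row itself
CREATE TABLE courses (
    course_id     INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    creator_id    INT UNSIGNED NOT NULL,
    title         VARCHAR(255) NOT NULL,
    description   TEXT,
    creation_date DATETIME NOT NULL,
    modified_date DATETIME NOT NULL
);

-- Optional location data in its own table; the row is optional, but once a
-- location exists the lat/long pair must be complete.
CREATE TABLE course_locations (
    course_id         INT UNSIGNED NOT NULL PRIMARY KEY,
    general_location  VARCHAR(255) NOT NULL,
    specific_location VARCHAR(255),
    latitude          FLOAT NOT NULL,
    longitude         FLOAT NOT NULL,
    FOREIGN KEY (course_id) REFERENCES courses (course_id)
);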
EDIT:
Using views:
For your structure, what you would normally do is left join to the attributes table multiple times (one for each attribute needed), keying off of the attribute (which should really be an int FK to a table; you don't want both 'title' and 'Title' in there) and joining on course_id - the value would be included as part of the select. Using this technique, it would be simple to then get the list of columns, which you can then apparently weight in Sphinx.
The problem with this is if you need to do any data conversion - you are betting that you'll be able to find all conversions if the type ever changes. When using strongly typed columns, this ranges from trivial (the likelihood is that you end up with a uniquely named column) to unnecessary (views usually take their datatype definitions from the fields in the query); with your architecture, you'll likely end up looking through too many false positives.
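For illustration, a sketch of that left-join pivot using the course_attributes layout from the question (the attribute strings 'title', 'description' and 'general_location' are assumed values):
CREATE VIEW course_search AS
SELECT
    c.Course_ID,
    t.Value AS title,
    d.Value AS description,
    l.Value AS general_location
FROM courses c
LEFT JOIN course_attributes t ON t.Course_ID = c.Course_ID AND t.Attribute = 'title'
LEFT JOIN course_attributes d ON d.Course_ID = c.Course_ID AND d.Attribute = 'description'
LEFT JOIN course_attributes l ON l.Course_ID = c.Course_ID AND l.Attribute = 'general_location';
-- add one LEFT JOIN per attribute you want exposed as a column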
Database efficiency:
You're right, unfilled columns are wasted space. Usually, when something is optional(ish), that means you may need an additional table. Which is why I suggested splitting off location into its own table: this prevents events which don't need a location (... what?) from 'wasting' the space, but then forces any event that defines a location to specify all required information. There's an additional benefit to splitting it off this way: if multiple events all use the same location (... not at the same time, we hope), a cross-reference table will save you a lot of space - way more than your attributes table ever could (you're still having to store the complete location per event, after all). If you still have a lot of 'optional' attributes, I hear that NoSQL is made for these kinds of things (but I haven't really looked into it). Other than that, the cost of an additional table is trivial; the cost of the data inside may not be, but the space required is weighed against the perceived value of the data stored. Remember that disk space is relatively cheap - it's developer/maintainer time that is expensive.
Side note for addresses:
You are probably going to want to create an address table. This would be completely divorced from the event information, and would include (among other things) the precomputed latitude/longitude (in the recommended datatype - I don't know what it is, but it's for sure not a comma-separated string). You would then have an event_address table that would be the cross-reference between the events and where they take place - if there is additional information (such as room), that should be kept in a location table that is referenced (instead of referencing address directly). Once a lat/long value is computed, you should never need to change it.
Thoughts on later updates for lat/long:
While specifying the lat/long values yourself is better, you're going to want to make them a required part of the address table (or part of/in addition to a purely lat/long only table). Frankly, multi-value columns (delimited lists) of any sort are just begging for trouble - you keep having to parse them every time you search on them (among other related issues). And the moment you make them separate rows, one of the pair will eventually get dropped - Murphy himself will personally intervene, if necessary. Additionally, updating them at different times from the addresses will result in an address having a lat/long pair that does not match; your best bet is to compute this at insertion time (there are a number of webservices to find this information for you).
Multi-domain tables:
With a multi-domain table, you're basically betting that the domain key (attribute) will never become out-of-sync with the value (err, value). I don't care how good you are, somewhere, somehow, it's going to happen: at my company, we had one of these in our legacy application (it stored FK links and which files the FKs refer to, along with an attribute). At one point an application was installed in production which promptly began storing the correct file links, but the FK links to a different file, for a given class of attribute. Thankfully, there were audit records in another file which allowed this to be reversed (... as near as they were able to tell).
In summary:
Revisit your required/optional data. Don't be afraid to create additional tables, each for a single entity, with every column for a single domain; you will also need relationship tables. You may also wish to place your audit data (last_updated_time) in a set of separate tables (single-domain tables will help immensely in this regard).
In the Sphinx config you define your index and the SQL queries that populate it. You can define basic attributes; see Sphinx Attributes.
Sphinx also supports geo searches on lat/long, but they need to be expressed in radians - definitely not text columns like you have. I agree with X-Zero that storing lat/lng values as strings is a bad idea.
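As a rough sketch, a Sphinx source/index definition along those lines might look like the following; it assumes a view such as course_search exposing one row per course with numeric latitude/longitude columns, and the paths/credentials are placeholders:
source courses_src
{
    type     = mysql
    sql_host = localhost
    sql_user = user
    sql_pass = pass
    sql_db   = mydb

    # The first column must be the unique document id; the remaining text
    # columns become full-text fields, and declared attributes become attrs.
    sql_query = \
        SELECT Course_ID, title, description, general_location, \
               RADIANS(latitude) AS lat, RADIANS(longitude) AS lon \
        FROM course_search

    sql_attr_float = lat   # radians, as GEODIST() expects
    sql_attr_float = lon
}

index courses
{
    source = courses_src
    path   = /var/lib/sphinx/courses
}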
I have 3 groups of fields (each group consists of 2 fields) that I have to check against some condition. I don't check each field, but some combination, for example:
group priceEurBus, priceLocalBus
group priceEurAvio, priceLocalAvio
group priceEurSelf, priceLocalSelf
My example (formatted for legibility) — how can this be improved?
$rest .="
WHERE
(
((priceEurBus+(priceLocalBus / ".$ObrKursQuery.")) <= 400)
OR
((priceEurAvio+(priceLocalAvio / ".$ObrKursQuery.")) <= 400)
OR
((priceEurSelf+(priceLocalSelf / ".$ObrKursQuery.")) <= 400)
)
";
$ObrKursQuery is the value I use to convert local currency to Euro.
Performance improvement: your query is OR-based, meaning that it will stop evaluating the conditions as soon as it finds one that is true. Try to order your conditions so that the first check is the one most likely to be under 400.
Security improvement: use prepared statements and filter your variables before using them. In the case of $ObrKursQuery, if it comes from user input or an untrusted source, it is an unquoted numeric value and you are exposed to a wide variety of SQL injection problems (including arithmetic SQL injection: if that value is 0, you'll get a divide-by-zero error that can be used as a blind SQL injection condition).
Readability improvement: be consistent in the way you write your code and, if possible, follow an accepted de facto standard, like starting variable names lower-case: $ObrKursQuery -> $obrKursQuery. Also, for the sake of self-documenting code, choose variable names that say what they are: $ObrKursQuery -> $conversionRatio.
Maintainability/Scalability improvement: use a constant instead of the hard-coded 400. When you change that value in the future, you will want to change it in just one place and not all over your code.
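Putting those suggestions together, a rough PDO sketch (the table name offers and the constant name are illustrative; $pdo is assumed to be an existing PDO connection):
// Rough sketch only: names are illustrative, not from the original code.
const MAX_PRICE_EUR = 400;            // one place to change the threshold later

$conversionRatio = (float) $obrKurs;  // filter/validate before use
if ($conversionRatio <= 0) {
    throw new InvalidArgumentException('Conversion ratio must be positive');
}

// Distinct placeholder names are used because native MySQL prepares do not
// allow reusing the same named placeholder.
$sql = 'SELECT * FROM offers
        WHERE (priceEurBus  + priceLocalBus  / :ratio1) <= :max1
           OR (priceEurAvio + priceLocalAvio / :ratio2) <= :max2
           OR (priceEurSelf + priceLocalSelf / :ratio3) <= :max3';
$stmt = $pdo->prepare($sql);
$stmt->execute([
    ':ratio1' => $conversionRatio, ':max1' => MAX_PRICE_EUR,
    ':ratio2' => $conversionRatio, ':max2' => MAX_PRICE_EUR,
    ':ratio3' => $conversionRatio, ':max3' => MAX_PRICE_EUR,
]);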
Never use concatenation to generate your SQL; use prepared SQL statements with parameters.
The only way to simplify this statement without greater knowledge of the problem domain is to reduce the number of columns. It looks as if you've got three prices per product entry. You could create a table of product prices instead of columns of product prices; this would turn the filter into a single comparison and give you the flexibility to add more price types in the future.
So you'll need to create a one->many relationship between product and prices.
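A rough sketch of that one-to-many layout (table and column names are illustrative):
CREATE TABLE product_prices (
    product_id  INT UNSIGNED NOT NULL,
    price_type  ENUM('bus', 'avio', 'self') NOT NULL,
    price_eur   DECIMAL(10,2) NOT NULL,
    price_local DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (product_id, price_type),
    FOREIGN KEY (product_id) REFERENCES products (id)
);

-- The three OR branches collapse into a single comparison:
SELECT DISTINCT p.product_id
FROM product_prices p
WHERE (p.price_eur + p.price_local / :ratio) <= :maxPrice;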