Best solution for custom live search task

I'm going to add a simple live search to a website (suggestions shown while the user types in an input box).
Main task:
39k plain text lines to search through (~500 characters per line, 4 MB total size)
1k online users may be typing in the input box simultaneously
In some cases 2k-3k results can match a user's request
I'm worried about the following questions:
Database vs. text file?
Are there any general rules or best practices for this task aimed at decreasing DB/server memory load (caching, indexing, etc.)?
Are Sphinx/Solr appropriate for such a task?
Any links/advice will be extremely helpful.
Thanks
P.S. Maybe this is the best solution? PHP to search within txt file and echo the whole line

Put your data in a database (SQLite should do just fine, but you can also use a more heavy-duty RDBMS like MySQL or Postgres), and put an index on the column or columns that will be searched.
Only do the absolute minimum, which means that you should not use a framework, an ORM, etc. They will just slow down your code.
Create a PHP file, grab the search text and do a SELECT query using a native PHP driver, such as SQLite, MySQLi, PDO or similar.
Also, think about how the search box will work. You can prevent many requests if you e.g. put a minimum character limit (it does not make sense to search only for one or two characters), put a short delay between sending requests (so that you do not send requests that are never used), and so on.
Whether or not to use a dedicated search server such as Solr depends on your circumstances. If you have a lot of data and a lot of requests, then maybe you should look into it. But if the problem can be solved with a simple solution, you should probably try that before making things more complicated.
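The steps above (indexed column, plain PDO query, minimum character limit) can be sketched roughly as follows. This is only a minimal sketch under assumptions: the table name `lines`, the column `body`, the 3-character minimum and the prefix-only matching are all made up for illustration, and it needs the pdo_sqlite extension. Whether SQLite really uses the index for a given LIKE depends on its LIKE optimization rules, so check with EXPLAIN QUERY PLAN on your own data.

```php
<?php
// Minimal sketch: live-search suggestions from an indexed SQLite table.
// Assumptions (not from the question): table `lines`, column `body`,
// prefix matching only, minimum term length of 3 characters.

function createSchema(PDO $db): void
{
    $db->exec('CREATE TABLE IF NOT EXISTS lines (id INTEGER PRIMARY KEY, body TEXT NOT NULL)');
    // A NOCASE index is the kind SQLite can consider for its default,
    // case-insensitive LIKE; it is not guaranteed to be used.
    $db->exec('CREATE INDEX IF NOT EXISTS idx_lines_body ON lines (body COLLATE NOCASE)');
}

function suggest(PDO $db, string $term, int $minLength = 3, int $limit = 10): array
{
    $term = trim($term);
    if (strlen($term) < $minLength) {
        return []; // too short: do not query the database at all
    }

    // Escape LIKE wildcards typed by the user; "!" is the escape character.
    $escaped = str_replace(['!', '%', '_'], ['!!', '!%', '!_'], $term);

    $stmt = $db->prepare(
        "SELECT body FROM lines WHERE body LIKE :prefix ESCAPE '!' ORDER BY body LIMIT :limit"
    );
    $stmt->bindValue(':prefix', $escaped . '%', PDO::PARAM_STR);
    $stmt->bindValue(':limit', $limit, PDO::PARAM_INT);
    $stmt->execute();

    // Return only the matching lines, as a flat array of strings.
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}
```

The LIMIT matters here: when 2k-3k lines match, a suggestion box only needs the first handful, so there is no reason to send the rest.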

I have implemented 'live search' many times, always using AJAX to query the database (MySQL), and haven't observed any speed or load issues yet.
I have seen implementations using Solr, but I cannot say whether they were quicker or consumed fewer resources.
It completely depends on the hardware the server will run on, IMO. I once saw a server with a very slow filesystem, where implementing live search by reading and parsing txt files (or using Solr) could be slower than querying the database. On the other hand, you may be hosting on poor shared webhosting with a slow DB connection (that gets even slower with more concurrent connections), in which case the database won't be the best solution.
My suggestion: use MySQL with AJAX (look at this jQuery plugin or this article), set proper indexes on the searched columns, and if this turns out to be slow you can still move to a txt file.

In the past, I have used Zend Search Lucene with great success.
It is a general purpose text search engine written entirely in PHP 5. It manages the indexing of your sources and is quite fast (in my experience). It supports many query types, search fields, search ranking.

Related

PHP website without mysql

I am currently working on an existing website that lists products; there are currently a little over 500 products.
The website has a text file for every product, and I want to add a search option. I'm thinking of reading all the text files once a day and creating an XML document with the values that can then be searched.
The client indicated that they wanted to add products and is used to add them using the text files. There might be over 5000 products in the future so I think it's best to do this with mysql. This means importing the current products and create a crud page for products.
Does anyone have experience with a PHP website that does not use MySQL? Is it possible to keep adding text files and just index them once a day even if it would mean having over 5000 products?

5000 seems like an amount that's still manageable to index with a daily cron job. As long as you don't plan on searching the files themselves in real time, it should work. It's not ideal, but it would work.
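A daily indexing job along those lines might look roughly like this. It is only a sketch under assumptions: one `.txt` file per product in a single directory, the file name used as the product key, and a JSON index file instead of the XML document mentioned in the question (JSON is swapped in here purely to keep the example short).

```php
<?php
// Minimal sketch: build a search index from one-text-file-per-product storage.
// Assumptions: all files are <dir>/<product>.txt; the index is a JSON file.

function buildIndex(string $productDir): array
{
    $index = [];
    foreach (glob(rtrim($productDir, '/') . '/*.txt') ?: [] as $path) {
        $content = file_get_contents($path);
        if ($content === false) {
            continue; // unreadable file: skip it rather than abort the run
        }
        // Key by file name without extension; lowercase for case-insensitive search.
        $index[basename($path, '.txt')] = strtolower($content);
    }
    return $index;
}

function writeIndex(array $index, string $indexFile): void
{
    // Write to a temporary file first, then rename, so a search running
    // at the same time never reads a half-written index.
    $tmp = $indexFile . '.tmp';
    file_put_contents($tmp, json_encode($index, JSON_INVALID_UTF8_SUBSTITUTE), LOCK_EX);
    rename($tmp, $indexFile);
}

function searchIndex(string $indexFile, string $term): array
{
    $index = json_decode((string) file_get_contents($indexFile), true) ?: [];
    $needle = strtolower(trim($term));
    $hits = [];
    foreach ($index as $product => $text) {
        if ($needle !== '' && strpos($text, $needle) !== false) {
            $hits[] = $product; // plain substring match, no ranking
        }
    }
    return $hits;
}
```

The cron job would call `writeIndex(buildIndex($dir), $indexFile)` once a day, and the search page would only ever read the index file.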

Yes, it is very much possible, but it is NOT advisable to use files for this type of transaction.
It is also better to use XML instead of plain TXT files for the job. Depending on what kind of data is associated with them, 5000 products might create problems in the future.
PS
Why not MySQL?

MySQL was made because file-based databases are slow and inaccurate.
Just use MySQL. If you want to keep your old txt-based database, just build a simple script that imports each file one by one and creates the corresponding records in your SQL database.
Good luck.
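Such an import script could be sketched like this. Note the assumptions: the `products` table and its `name`/`description` columns are hypothetical, and the example runs against SQLite through PDO so that it works without a database server. For MySQL the connection string changes and the `name` column would need to be a VARCHAR rather than TEXT to serve as the primary key.

```php
<?php
// Minimal sketch: import one-text-file-per-product storage into a SQL table.
// Assumptions: hypothetical table `products` (name, description), SQLite via PDO.

function importProducts(PDO $db, string $productDir): int
{
    $db->exec(
        'CREATE TABLE IF NOT EXISTS products (name TEXT PRIMARY KEY, description TEXT NOT NULL)'
    );

    // REPLACE makes the import repeatable: re-running it updates existing rows.
    $stmt = $db->prepare('REPLACE INTO products (name, description) VALUES (:name, :description)');

    $count = 0;
    $db->beginTransaction(); // one transaction is much faster than one per row
    foreach (glob(rtrim($productDir, '/') . '/*.txt') ?: [] as $path) {
        $content = file_get_contents($path);
        if ($content === false) {
            continue; // unreadable file: skip it
        }
        $stmt->execute([
            ':name' => basename($path, '.txt'),
            ':description' => $content,
        ]);
        $count++;
    }
    $db->commit();

    return $count; // number of files imported in this run
}
```

Because the import is repeatable, the client can keep uploading text files and the script can simply be run again afterwards.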

It's possible; however, if this is anything more than a simple online catalog, then managing transaction integrity is horrendously difficult, and that you're even asking the question implies that you are not in a good position to implement the kind of controls required. And as you've already discovered, it doesn't make for easy searching. (BTW: MySQL's fulltext indexing is a very blunt instrument. It's not a huge amount of effort to implement an effective search engine yourself, or there are excellent ones available off-the-shelf, e.g. mnoGoSearch.)
(As a coincidental point, why XML? It makes managing the data much more complicated than it needs to be.)
"and create a crud page for products"
Why? If the client wants to maintain the data via file uploads and you already need to port the data, then just use the same interface. Where the data is stored is not relevant just now.
If there are issues with hosting + MySQL, then using SQLite gives most of the benefits (although it won't scale as well).

Collect, manage data and make it available through an api

Here is my problem:
I have many known locations (I have no influence over these) with a lot of data. Each location provides a lot of new data at its own interval. Some give me differential updates, some just the whole dataset, some via XML, for some I have to build a web scraper, some need authentication, etc.
The collected data should be stored in a database. I have to program an API that sends the requested data back as XML.
Many roads lead to Rome, but which should I choose?
Which software would you suggest me to use?
I am familiar with C++,C#,Java,PHP,MySQL,JS but new stuff is still ok.
My idea is to use cron jobs + php (or shell script) + curl to fetch the data.
Then I need a module to parse and insert the data into a database (mysql).
A PHP script could answer the data requests from clients.
I think the input data volume is about 1-5GB/day.
The one correct answer doesn't exist, but can you give me some advice?
It would be great if you can show me smarter ways to do this.
Thank you very much :-)

LAMP: Stick to PHP and MySQL (with occasional forays into Perl/Python): the availability of PHP libraries, storage solutions, scalability and API solutions, and the size of its community, more than make up for what any other environment offers.
API: Ensure that the designed API queries (and storage/database) can meet all end-product needs before you get to writing any importers. Date ranges, tagging, special cases.
PERFORMANCE: If you need lightning fast queries for insanely large data sets, sphinx-search can help. It's got more than just text search (tags, binary, etc) but make sure you spec the server requirements with more RAM.
IMPORTER: Make it modular: as in, for each different data source, write a pluggable importer that can be enabled/disabled by admin, and of course, individually tested. Pick a language and library based on what's best and easiest fit for the job: bash script is okay.
In terms of parsing libraries for PHP, there are many. One of the recently popular ones is simplehtmldom, and I found it to work quite well.
TRANSFORMER: Make data transformation routines modular as well so it can be written as a need arises. Don't make the importer alter original data, just make it the quickest way into an indexed database. Transformation routines (or later plugins) should be combined with API query for whatever end result.
TIMING: There is nothing wrong with cron executions, as long as they don't run away or cause your input sources to start throttling or blocking you, so you need to stay aware of that.
VERSIONING: Design the database, imports, etc. so that errant data can be rolled back easily by an admin.
Vendor Solution: Check out scraperwiki - they've made a business out of scraping tools and data storage.
Hope this helps. Out of curiosity, any project details to volunteer? A colleague of mine is interested in exchanging notes.
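To illustrate the IMPORTER and TRANSFORMER points above, here is one way a pluggable importer layer could be sketched in PHP. Everything named here (`Importer`, `CallableImporter`, `runImporters`, the `$enabled` map and the `$store` callback) is hypothetical and only meant to show the shape: each source is a separate unit, the admin can switch it on or off, a failing source does not stop the others, and records are stored raw without transformation.

```php
<?php
// Minimal sketch of pluggable importers. All names are illustrative assumptions.

interface Importer
{
    public function name(): string;

    /** @return iterable Raw records, exactly as received from the source. */
    public function fetch(): iterable;
}

// Wraps any fetch function (curl call, scraper, XML parser...) as an importer.
final class CallableImporter implements Importer
{
    private $name;
    private $fetcher;

    public function __construct(string $name, callable $fetcher)
    {
        $this->name = $name;
        $this->fetcher = $fetcher;
    }

    public function name(): string
    {
        return $this->name;
    }

    public function fetch(): iterable
    {
        return ($this->fetcher)();
    }
}

/**
 * Runs every enabled importer and hands each raw record to $store.
 * Returns a per-source report: record count, or an error message.
 */
function runImporters(array $importers, array $enabled, callable $store): array
{
    $report = [];
    foreach ($importers as $importer) {
        $name = $importer->name();
        if (empty($enabled[$name])) {
            continue; // switched off by the admin
        }
        $count = 0;
        try {
            foreach ($importer->fetch() as $record) {
                $store($name, $record); // store unaltered; transform later
                $count++;
            }
            $report[$name] = $count;
        } catch (Throwable $e) {
            // One broken source must not stop the others.
            $report[$name] = 'failed: ' . $e->getMessage();
        }
    }
    return $report;
}
```

A cron job would build the list of importers, load the enabled/disabled map from the admin settings, and log the returned report.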

Good alternatives/practices to "LIKE" with PostgreSQL and PHP?

I'm working with a Postgres database that I have no control over the administration of. I'm building a calendar that deals with seeing if resources (physical items) were online or offline on a specific day. Unfortunately, if they're offline I can only confirm this by finding the resource name in a text field.
I've been using
select * from log WHERE log_text LIKE 'Resource Kit 06%'
The problem is that building a calendar this way means running LIKE 180+ times (at least 6 resources per day), which is as slow as can be. Does anybody know of a way to speed this up (keep in mind I can't modify the database)? Also, if there's nothing I can do on the database end, is there anything I can do on the PHP end?

I think some form of cache will be required for this. As you cannot change anything in the database, your only option is to pull the data out of it and store it in a more accessible and faster form. How well this works depends heavily on how often data is inserted into the table. If there are more inserts than selects, it probably will not help much. Otherwise, there is a chance of improved performance.
Maybe you can consider using the Lucene search engine, which is capable of fulltext indexing. There is an implementation from Zend, and Apache also offers an HTTP service. I haven't had the opportunity to test it, however.
If you don't need something that robust, you can write your own caching mechanism in PHP. It will not be as fast as Postgres, but probably faster than unindexed LIKE queries. If your queries need to be more sophisticated (conditions, grouping, ordering...), you can use an SQLite database, which is file-based and doesn't need an extra service running on the server.
Another way could be to use triggers in the database, which on insert could store the required information in another table in a better-indexed form. But without rights to administer the database, that is probably a dead end.
Please be more specific with your question, if you want more specific information.
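One simple form of the PHP-side approach is to fetch the log rows for the whole calendar period in a single query and do the prefix matching in PHP, instead of running one LIKE query per resource per day. The sketch below only shows the matching step and rests on assumptions: each row is an array with a `log_text` key (from the question) and a `log_date` key (a hypothetical column name, since the question does not show the table layout).

```php
<?php
// Minimal sketch: replace 180+ LIKE queries with one fetch plus matching in PHP.
// Assumption: rows come from a single query over the calendar period, e.g.
//   SELECT log_date, log_text FROM log WHERE log_date BETWEEN :from AND :to
// where `log_date` is a hypothetical column name.

/**
 * @param iterable $rows          Rows with 'log_date' and 'log_text' keys.
 * @param string[] $resourceNames Prefixes such as 'Resource Kit 06'.
 * @return array                  [resourceName => [date => true, ...], ...]
 */
function offlineDays(iterable $rows, array $resourceNames): array
{
    $offline = [];
    foreach ($rows as $row) {
        foreach ($resourceNames as $name) {
            // Same test as: log_text LIKE 'name%' (case-sensitive, as in Postgres).
            if (strncmp($row['log_text'], $name, strlen($name)) === 0) {
                $offline[$name][$row['log_date']] = true;
            }
        }
    }
    return $offline;
}
```

Building one calendar cell then becomes an `isset($offline[$name][$date])` lookup, and the resulting array can be cached between requests if the log does not change often.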

Is PHP serialization a good choice for storing data of a small website modified by a single person

I'm planning a PHP website architecture. It will be a small website with few visitors and small set of data. The data is modified exclusively by a single user (administrator).
To make things easier, I don't want to bother with a real database or XML data. I think about storing all data through PHP serialization into several files. So for example if there are several categories, I will store an array containing Category class instances for each category.
Are there any pitfalls using PHP serialization in those circumstances?

Use databases -- it is not that difficult, and any extra time spent learning to use a database will be time well spent.
The pitfalls I see are as Yehonatan mentioned:
1. Maintenance and adding functionality.
2. No easy way to query or look at data.
3. Very insecure -- take a look at "hackthissite.org". A lot of the beginning examples have to do with hacking where someone put the data hard coded in files.
4. Serialization will work for one array, meaning one table. If you have to do anything like have parent categories that have to match up to other data, it's not going to work so well.

The pitfalls come with maintenance and adding functionality.
It is a very good way to learn, but you will appreciate databases more after the lessons.

I tried to implement PHP serialization to store website data. For those who want to do the same thing, here's some feedback from the project, started a few months ago and heavily modified since:
Pros:
It was very easy to load and save data. I don't have to write SQL queries, optimize them, etc. The code is shorter (with parametrized SQL queries, it may grow a lot).
The deployment does not require additional effort. We don't care about what is supported on the web server: if there is just PHP with no additional extensions, database servers, etc., the website will still work. Sqlite is a good thing, but it is not possible to install it on some servers, and it also requires a PHP extension.
We don't have to care about updating a database server, nor about the database server to use (thus avoiding the scenario where the customer wants to migrate from Microsoft SQL Server to Oracle, etc.).
We can add more properties to the objects without having to break everything (just like we can add other columns to the database).
Cons:
Like Kerry said in his answer, there is "no easy way to query or look at data". It means that any business intelligence/statistics cases are impossible or require a huge amount of work. Even some basic scenarios become extremely complicated. Let's say we store products and we want to know how many products there are. Instead of just writing select count(1) from Products, in my case it requires creating a PHP file just for that, loading all the data and then counting the number of items, sometimes by adding stuff manually.
Some changes required implementing a data migration, which was painful and required more work than just executing an SQL query.
To conclude, I would recommend using PHP serialization for storing data of a small website modified by a single person only if all the following conditions are true:
The deployment context is unknown and there are chances to have a server which supports only basic PHP with no extensions,
Nobody cares about business intelligence or similar usages of the information,
There will be no changes to the requirements with large impact on the data structure.

I would say use a small database like sqlite if you don't want to go through setting up a full db server. However I will also say that serializing an array and storing that in a text file is pretty dang fast. I've had to serialize an array with a few thousand records (a dump from a database) and used that as a temp database when our DB server was being rebuilt for a few days.
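For anyone going the serialization route anyway, a save/load pair could be sketched like this. It assumes the data is kept as plain arrays rather than class instances: `allowed_classes` is set to `false` so that a tampered file cannot instantiate arbitrary objects. If you store objects (such as the Category instances from the question), you would list those class names there instead. The function names and the `products` key are made up for the example.

```php
<?php
// Minimal sketch: store a small site's data as a serialized PHP array.
// Assumptions: plain arrays only, a single writer (the administrator).

function saveData(string $file, array $data): void
{
    // Write to a temporary file and rename it, so that a page loading the
    // data at the same moment never reads a half-written file.
    $tmp = $file . '.tmp';
    if (file_put_contents($tmp, serialize($data), LOCK_EX) === false) {
        throw new RuntimeException("Cannot write $tmp");
    }
    rename($tmp, $file);
}

function loadData(string $file): array
{
    if (!is_file($file)) {
        return []; // nothing saved yet
    }
    // allowed_classes => false: never create objects from the file contents.
    $data = unserialize((string) file_get_contents($file), ['allowed_classes' => false]);
    return is_array($data) ? $data : [];
}

// Even a simple count means loading the whole file first.
function countProducts(string $file): int
{
    return count(loadData($file)['products'] ?? []);
}
```

The `countProducts` helper shows the "no easy way to query" drawback in practice: every question about the data means loading the whole file and looping over it in PHP.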

MySQL vs File Databases

So I'm going to be working on a home made blog system in PHP and I was wondering which way of storing data is the fastest. I could go in the MySQL direction, or I could go with my own little way of doing it which is storing all of the information (encoded in JSON) in files.
Which way would be the fastest, MySQL or JSON files?

For a small, single user 'database', a file system would likely be quicker - as the size and complexity grows, a database server like MySQL or SQL Server is hard to beat.

I would definitely choose a DB option (as you need to be able to search and index stuff). But that does not mean you need a fully realized separate DB service.
MySQL is definitely the more scalable solution.
But the downside is you need to set up and maintain a separate service.
On the other hand, there are DBs that are file-based and still give you access with standard SQL (SQLite, sqlite.org, springs to mind). You get the advantages of SQL but you do not need to maintain a separate service. The disadvantage is that they are not as scalable.

I would choose a MySQL database - simply because it's easier to manage.
JSON is not really a format for storage; it's for sending data to JavaScript. If you want to store data in files, look into XML or serialized PHP (which I suspect is what you are after, rather than JSON).

Forgive me if this doesn't answer your question very directly, but since it is a home-cooked blog system, is it really worth spending time right now thinking about which storage backend is faster?
You're not going to be looking at 10,000 concurrent users from day 1, and it doesn't sound like it will need to scale to any meaningful degree in the foreseeable future.
Why not just stick with MySQL as a sensible choice rather than a fast one? If you really want some sense that you designed for speed maybe bolt sqlite on instead.

Since you are thinking you may not have the need for a complex relational structure, this might be a fun opportunity to try something more down the middle.
Check out CouchDB, it is a document-based, schema free database (yet still indexable). The database is made of documents that contain named fields (think key-value pairs).
Have fun....

Though I don't know for certain, it seems to me that a MySQL database would be a lot faster, especially as the amount of data gets larger and larger.
Also, using MySQL with PHP is super easy, especially if you use an abstraction class like ezSQL. ezSQL makes working with a database really simple and I think you'd be creating more unnecessary work for yourself by going the home-brewed JSON direction.

I've done both. I like files for very simple problems and databases for complicated problems.
For file solutions, note these problems as the number of files increases:
1) Much more disk space is used than you might expect, because even tiny files use up a whole block. Blocks are fairly large on filesystems which support large drives.
2) Most filesystems get very slow when the number of files in a directory gets very large. My solution to this (assuming the names of the files are reasonably spread out across the alphabet) is to create a directory named after the first two letters of the filename. Thus, the file "animal.txt" would be found at an/animal.txt. This works surprisingly well. If your filenames are not reasonably well-distributed across the alphabet, use some sort of hashing function to create the directories. Sounds a little crazy, but this can work very, very well, and I've used it for very fast solutions with tens of thousands of files.
But the file solutions really only fit sometimes. Unless you have a great reason to go with files, use a database.
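The two-letter directory scheme from point 2 can be sketched in a few lines of PHP. The function names are made up, md5 is just one possible choice for the "some sort of hashing function" variant, and file names are assumed to be already sanitized (no slashes or leading dots).

```php
<?php
// Minimal sketch: spread files over subdirectories named after the first
// two characters of the file name (or of a hash of the file name).
// Assumption: $filename is already sanitized.

function shardedPath(string $baseDir, string $filename, bool $useHash = false): string
{
    // With $useHash the shard comes from md5(), which spreads names evenly
    // even when most of them start with the same letters.
    $key = $useHash ? md5($filename) : strtolower($filename);
    $shard = substr($key, 0, 2);

    return rtrim($baseDir, '/') . '/' . $shard . '/' . $filename;
}

function storeFile(string $baseDir, string $filename, string $contents, bool $useHash = false): string
{
    $path = shardedPath($baseDir, $filename, $useHash);
    $dir = dirname($path);

    // Create the shard directory on first use.
    if (!is_dir($dir) && !mkdir($dir, 0775, true) && !is_dir($dir)) {
        throw new RuntimeException("Cannot create directory $dir");
    }
    file_put_contents($path, $contents, LOCK_EX);

    return $path;
}
```

With these definitions, `shardedPath('data', 'animal.txt')` gives `data/an/animal.txt`, matching the example above. Reading a file back uses the same function to find it, so no directory listing is ever needed.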

This is really cool. It's a PHP class that controls a flat-file database with queries: http://www.fsql.org/index.php
For blogs, I recommend caching the pages because blogs usually only have static content. This way, the queries only get run once while caching. You can update the cached pages when a new blog post is added.
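A very small version of that page cache could look like this. It is a sketch under assumptions: one cache file per page, a `$render` callback that runs the queries and returns the finished HTML, and the function names `cachedPage` and `invalidatePage` are made up for the example.

```php
<?php
// Minimal sketch: cache rendered blog pages as files.
// Assumptions: one cache file per page; $render builds the HTML (runs the queries).

function cachedPage(string $cacheFile, callable $render): string
{
    if (is_file($cacheFile)) {
        $html = file_get_contents($cacheFile);
        if ($html !== false) {
            return $html; // cache hit: no queries are run
        }
    }

    // Cache miss: render once and keep the result for later requests.
    $html = $render();
    file_put_contents($cacheFile, $html, LOCK_EX);

    return $html;
}

// Call this when a post is added or edited so the page is rebuilt on next view.
function invalidatePage(string $cacheFile): void
{
    if (is_file($cacheFile)) {
        unlink($cacheFile);
    }
}
```

With this in place, it matters much less whether the posts themselves live in MySQL or in files, because most page views never touch the storage backend.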
