I have a website built on Phalcon and I'm trying to add a search engine to it. The content, however, is not in a DB; it lives in flat files located in app/views/.
I've never implemented a search engine, but from what I gather, Lucene, Solr, or Sphinx is what I need.
Do these tools offer the option to crawl my website à la HTTrack, creating the index and the necessary absolute URI hyperlinks?
How do I go about specifying which portions of the HTML files I want parsed? How do I make these tools ignore certain areas (e.g. markup, JS)?
Lucene is first and foremost an index. It's not even a database; it's just the index portion of a database, if you will. It's highly configurable in what it indexes and how, in what data should be retained in its original format, and in what can be discarded once it has been indexed.

You create a schema first, just as you would create a database schema. In Lucene's case, however, that schema defines which tokenisers and filters are used to build the index for your fields. You then feed your documents into it to populate the index. That part is up to you: there are several APIs that let you feed data in, but a web crawler is not one of them, so it won't go out and find your data automatically.

You can then query the index in various ways to retrieve documents you fed in earlier. That's it in a nutshell.
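Since your site is PHP, here is a minimal sketch of that schema/feed/query cycle using ZendSearch\Lucene, a PHP port of Lucene (the index path, field names, and view file below are illustrative assumptions, not your actual setup):

```php
use ZendSearch\Lucene\Lucene;
use ZendSearch\Lucene\Document;
use ZendSearch\Lucene\Document\Field;

// Create the index once; reopen it later with Lucene::open().
$index = Lucene::create(__DIR__ . '/data/search-index');

// Feed one of your flat view files in. You decide per field what is
// stored verbatim and what is index-only.
$html = file_get_contents('app/views/widgets.phtml'); // hypothetical view file
$doc  = new Document();
$doc->addField(Field::Text('title', 'Widget manufacturers'));
$doc->addField(Field::UnStored('contents', strip_tags($html))); // indexed, not stored
$index->addDocument($doc);
$index->commit();

// Query the index.
foreach ($index->find('manufacturer of widgets') as $hit) {
    echo $hit->score, ' ', $hit->title, PHP_EOL;
}
```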
Lucene is pretty much exclusively the index engine: it's about tokenising and transforming text and other data into an index that can be queried quickly. It's the part that lets you query for "manufacturer of widgets" and get back a document containing the text "widget manufacturers", if you have tweaked your indexing and querying accordingly. Solr is an appliance wrapped around Lucene that adds an HTTP-based API and some other niceties. Both are still somewhat low-level tools you can use to build a search engine; neither is an out-of-the-box "search engine" like Google by any means.
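If you pick Solr, that HTTP API means querying from PHP is just an HTTP call plus JSON decoding. A minimal sketch, assuming a local Solr on its default port with a hypothetical core named mycore and title/body fields:

```php
// Query the Solr core over its HTTP API and decode the JSON response.
$params = http_build_query([
    'q'  => 'body:"manufacturer of widgets"',
    'wt' => 'json',
]);
$response = file_get_contents('http://localhost:8983/solr/mycore/select?' . $params);
$results  = json_decode($response, true);

foreach ($results['response']['docs'] as $doc) {
    echo $doc['title'], PHP_EOL;
}
```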
I am building a social networking website. Should I use JSON files instead of a database for storing users' posts and comments? Is that good and secure?
JSON is a format, not a storage engine, so you could even think of storing JSON in a DB.
So you will need to answer two questions:
Is a file-based store better than a real DB (like MySQL and similar databases)?
Is the JSON format a good fit for my use case?
File format
Because you will need quick access and sometimes complex queries (search engine, statistics, ...), a DB system will be much more efficient than a file-oriented solution. Creating / opening / writing / closing files takes a lot of time, and it would be a pain to build search queries on top of them. In PHP you would have to open all your files and load them into memory before doing the real work.
Using a database, you could directly ask the system whether someone with "%john%" in their name has posted something with "%futurama%" in the title.
So... one point for DB
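To make that concrete, here is a hedged sketch of that exact lookup with PDO (the posts and users tables and their columns are hypothetical):

```php
// One indexed JOIN answers what would otherwise require opening and
// scanning every flat file in PHP.
$stmt = $pdo->prepare(
    'SELECT p.title, p.created_at
       FROM posts p
       JOIN users u ON u.id = p.author_id
      WHERE u.name  LIKE :name
        AND p.title LIKE :title'
);
$stmt->execute([':name' => '%john%', ':title' => '%futurama%']);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
```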
JSON format stored in DB
Once more, you'd better use the real column capabilities of the DB system. By that I mean using an author_id column, for example; it greatly impacts performance. Otherwise you will have to write complex queries that work around the downsides of storing JSON inside your DB engine.
One more point for not using JSON
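A small illustration of that trade-off, assuming MySQL 5.7+ and hypothetical post_json / posts tables: the JSON path lookup works, but it cannot use an ordinary index the way the dedicated column can:

```php
// Reaching into a JSON column: functional, but the WHERE predicate
// cannot use a plain index on the JSON blob.
$byJson = $pdo->query(
    "SELECT doc->>'$.title' AS title
       FROM post_json
      WHERE doc->>'$.author_id' = '42'"
)->fetchAll();

// The same lookup against a dedicated, indexable author_id column:
$byColumn = $pdo->prepare('SELECT title FROM posts WHERE author_id = ?');
$byColumn->execute([42]);
$rows = $byColumn->fetchAll();
```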
But when should you use JSON?
JSON is great when dealing with APIs. If you need to serve data to an application (e.g. an Angular 2 front end that queries your API), JSON is a natural format for exchanging data, but not for storing it; JSON is better suited to transport than to storage.
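In other words: keep the rows in the DB and encode to JSON only at the edge. A minimal sketch of such an endpoint (the posts table and columns are illustrative):

```php
// GET /post.php?id=42 — the rows live in the DB; JSON is only the
// wire format for the response.
header('Content-Type: application/json');

$stmt = $pdo->prepare('SELECT id, author_id, title, body FROM posts WHERE id = ?');
$stmt->execute([(int) ($_GET['id'] ?? 0)]);

echo json_encode($stmt->fetch(PDO::FETCH_ASSOC));
```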
I'm programming a LAMP-stack online application that uses a very complex search form. I want to give users the ability to name and save their current search for faster reuse in the future (they will be checking these results daily). What is the best methodology for this? I've been coming across stored procedures, but that doesn't seem like what I'm looking for.
My current idea:
Store the PHP-generated query in a dedicated table for saved queries (all form input is sanitized / validated). Is this a security risk? I know all form-generated SQL carries risk, of course. When the user wants to run it again, the PHP code will simply use the saved query instead of the form-generated one. If I change the query-generation code in the future, this should prevent conflicts, but of course the saved query won't take advantage of any new design features.
I don't imagine this is "best practice" (or that there is one, in this case). Personally I think I'd rather store their search terms in a format devoid of context (say in a JSON-encoded object if there are multiple search terms or conditions), and then when they recall the search rebuild the queries from the JSON object.
(Storing the actual query seems to run the risk of old queries becoming obsolete if/when the underlying database structure changes. Storing only what they're searching for, and rebuilding it, lets you accommodate that.)
My $0.02.
To answer your question: yes. Regardless of how you store it, you would save the values the user entered via your form; then, when they rerun the stored "search", you would walk that structure and rebuild your query.
The table might have search_id, user_id, search_name, parameters, and whatever else. The user pulls up a list of their saved searches, chooses one, and executes it; you pull the parameters, rebuild the query, run it, and display the results, just as you did when they ran the original search through the normal form.
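A hedged sketch of that flow, assuming the table above and a hypothetical widgets search (your real form fields and query builder will differ):

```php
// Hypothetical saved_searches table:
//   search_id INT PK, user_id INT, search_name VARCHAR(100), parameters TEXT
// Save only the raw form values, never the generated SQL.
$save = $pdo->prepare(
    'INSERT INTO saved_searches (user_id, search_name, parameters)
     VALUES (?, ?, ?)'
);
$save->execute([$userId, $searchName, json_encode($formValues)]);

// On recall, decode the stored values and rebuild the query with the
// *current* query builder, so schema changes never break old searches.
$params = json_decode($savedRow['parameters'], true);

$sql  = 'SELECT * FROM widgets WHERE 1 = 1';
$args = [];
if (!empty($params['name'])) {
    $sql   .= ' AND name LIKE ?';
    $args[] = '%' . $params['name'] . '%';
}
if (!empty($params['max_price'])) {
    $sql   .= ' AND price <= ?';
    $args[] = $params['max_price'];
}
$stmt = $pdo->prepare($sql);
$stmt->execute($args);
```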
I was asked to build an expense report framework which allows users to store their expenses, one at a time, via a web form. The number of entries will never be more than 100-200 per day.
In addition to the date and time, to be provided by the user, there must be a pre-defined set of tags (e.g.: transportation, lodging, food) to choose from for each new row of data, as well as fields for currency, amount and comments.
Afterwards, it must be possible (or rather, easy) to fetch the entries in the DB between two dates and load the data into a pandas data frame (or R data table) for subsequent statistical analysis and plotting.
I first thought about using PHP to insert the data into a MySQL table where the tags would be boolean columns (True/False), along the lines of the sketch below. The very simple web form would load by default with all tags set to False, and it would be up to the user to switch the right ones to True before submission.
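For reference, a sketch of that layout (table and column names are illustrative only):

```php
// One possible MySQL layout: one boolean column per pre-defined tag,
// defaulting to FALSE.
$pdo->exec('
    CREATE TABLE expenses (
        id             INT AUTO_INCREMENT PRIMARY KEY,
        entered_at     DATETIME NOT NULL,
        currency       CHAR(3) NOT NULL,
        amount         DECIMAL(10,2) NOT NULL,
        comments       TEXT,
        transportation BOOLEAN NOT NULL DEFAULT FALSE,
        lodging        BOOLEAN NOT NULL DEFAULT FALSE,
        food           BOOLEAN NOT NULL DEFAULT FALSE
    )
');

// Fetching a date range for analysis (e.g. to hand to pandas or R as
// CSV) is then a single query:
$stmt = $pdo->prepare(
    'SELECT * FROM expenses WHERE entered_at BETWEEN ? AND ? ORDER BY entered_at'
);
$stmt->execute(['2016-01-01 00:00:00', '2016-01-31 23:59:59']);
```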
This said, I am now wondering about the other approaches I can or should explore. I've been reading about OpenTSDB and InfluxDB, which are designed to handle massive amounts of data, but I am also interested in hearing from coders up to date with the latest technologies about other possible options.
In short, I want a sensible approach that is neither dated nor a (complex) cannon to kill a fly.
You could try Axibase Time-Series Database Community Edition. It's free.
Supports tags on entities, metrics, and series
Provides open-source API clients for R, Python, and PHP
Time-range series queries are a core use case
Check out the app examples you can easily build in PHP, Go, and Node.js. The application code is open source under the Apache 2 license and is hosted on GitHub.
Disclosure: I work for Axibase.
Does anyone have ideas on the best way to automatically use someone's online search database, given a static search (see the example below)? It might also make this question more useful to add a solution for a non-static search.
So, for example, I have a website and I want to create a link to the PDF file of the latest report by a certain person on this site: http://aris.empr.gov.bc.ca. The search criteria do not change; all that changes is new results as the database is updated, so the search result is always http://aris.empr.gov.bc.ca/search.asp?mode=find. Notice that not all entries have a report yet.
So far my idea is to use a PHP script to search through the source code of the completed search-result page, find the first instance of a .pdf string, and then extract the whole link (the page is ordered by date, so the first PDF found would be the latest report that has one available).
The problems with this solution:
1) It is very specific to my problem and only works for a static search result, so it doesn't make for a good Q&A.
2) I am not sure whether the completed-search link re-runs the search every time you follow it, or whether it leads to an old result that could become out of date.
3) My solution is not sexy and is held together with duct tape, if you know what I mean.
Thanks,
-Adrian
In practical terms, you want to scrape the page(s).
You have two options in PHP:
1. Use cURL to fetch the page and PHP's built-in DOM parser to extract the content from it (see the sketch below).
2. Use the PHP Simple HTML DOM library (http://simplehtmldom.sourceforge.net). It has ready-made functions, and you won't need cURL much here.
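A minimal sketch of option 1 against your example page (the XPath and error handling are kept deliberately simple):

```php
// Fetch the completed search-result page.
$ch = curl_init('http://aris.empr.gov.bc.ca/search.asp?mode=find');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
$html = curl_exec($ch);
curl_close($ch);

// Parse it; real-world markup is messy, so silence libxml warnings.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// The page is ordered by date, so the first .pdf href is the latest report.
$link = $xpath->query("//a[contains(@href, '.pdf')]")->item(0);
if ($link !== null) {
    echo $link->getAttribute('href'), PHP_EOL;
}
```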
I hope this gives you an idea.
Try some code, show it here, and we will guide you further...
I've got a site that is based around a contact form. This form is generated according to the variables passed in the URL, and the information passed is placed in headings in the body and also in the title. Additionally, images are customized, so basically the whole content changes according to these variables.
So I've run a sitemap generator, and it has actually generated lots of URLs like www.site.tld/me.php?a=hi&b=pie, www.site.tld/me.php?a=hi&b=chocolate, www.site.tld/me.php?c=hi&hello... you get the point.
So, my question is: is it smart to use this to my advantage, include these in the sitemap and customize them for SEO, or should I just ignore it and omit it from the sitemap?
In general, having dynamic URLs is okay, but you don't necessarily want them indexed for SEO purposes. It's usually better to have a well-organized URL structure, as it's seen as more appealing (i.e. site.com/article/sports/baseball123 is better than site.com?id=123433). So depending on your content (whether it's static or dynamic), you may want to move to that type of URL structure and have your pages indexed. On the other hand, if you need to keep dynamic URLs (for some reason), and depending on the nature of the content, it may be best to leave them out of the equation from an SEO perspective. It ultimately comes down to what you're serving from these pages.
You can get search-engine-friendly URLs by using Apache's rewrite module (search for "apache mod_rewrite" on Google). There are many ways to do this, and some work much better than others. Google will index your site based on the content of the page rather than the URL or any meta information. Using mod_rewrite makes things easier for your viewers, but as far as search engines are concerned it doesn't really matter much. A minimal example follows. Hope this helps.
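For illustration, a minimal .htaccess sketch (assumes Apache with mod_rewrite enabled; the article/section/id pattern is hypothetical, mapped onto your me.php):

```apache
RewriteEngine On
# site.tld/article/sports/123  ->  me.php?section=sports&id=123
RewriteRule ^article/([a-z]+)/([0-9]+)/?$ me.php?section=$1&id=$2 [L,QSA]
```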