I am writing an application where it is necessary to fetch data from a third-party website. Unfortunately, a specific piece of info I need (a hotel name) can only be obtained by cURLing the webpage and then parsing it (I'm using XPath) looking for an <h1> DOM element.
Since I'm going to run this script many times during the day, and I'll probably have to fetch the same hotel names again and again, I thought a caching mechanism would be good: check whether the hotel has been parsed in the past, and then decide whether to make the webpage request or not.
However, I have two concerns. Is this cache better implemented in a DB (since there will be an ID-to-hotel-name mapping) or in a file? And is this "optimization" worth the trouble at all - will I gain a significant speed-up?
Go with a DB, because it will give you more flexibility and functionality for data manipulation (filtering, sorting, etc.) by default.
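For what it's worth, a minimal sketch of that flow, assuming a PDO connection and a hotels(hotel_id, name) cache table (the table, column and function names are just illustrative):

    // Look the hotel up in the cache table first; only cURL + parse on a miss.
    function getHotelName(PDO $db, $hotelId, $url)
    {
        $stmt = $db->prepare('SELECT name FROM hotels WHERE hotel_id = ?');
        $stmt->execute(array($hotelId));
        $name = $stmt->fetchColumn();
        if ($name !== false) {
            return $name;                       // cache hit, no HTTP request needed
        }

        // Cache miss: fetch the page and pull the first <h1> via XPath.
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        $html = curl_exec($ch);
        curl_close($ch);

        $dom = new DOMDocument();
        @$dom->loadHTML($html);                 // suppress warnings from sloppy markup
        $xpath = new DOMXPath($dom);
        $name = trim($xpath->evaluate('string(//h1[1])'));

        $db->prepare('INSERT INTO hotels (hotel_id, name) VALUES (?, ?)')
           ->execute(array($hotelId, $name));
        return $name;
    }

Whether it's worth the trouble depends on the hit rate: every cache hit saves a full HTTP round trip plus HTML parsing, which usually dwarfs the cost of a primary-key lookup.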
Instead of eval(), I am investigating the pros and cons of creating .php files on the fly from PHP code.
Mainly because the generated code should be available to other visitors and for a long period of time, not only for the current session. The generated .php files are created by functions dedicated to that and only that, under highly controlled conditions (no user input will ever reach those code files).
So, performance-wise, how much load is put on the webserver by creating .php files for later execution via include() elsewhere, compared to updating a database record and querying the database on every visit?
The generated files should be updated (overwritten) quite frequently, but not very frequently compared to how often they will be executed.
What are the other pros/cons? Does the possibility of one user overwriting the code files at the same time as others are executing them call for complicated concurrency handling, e.g. a mutex? Is it next to impossible to overwrite the files if visitors are constantly "viewing" (executing) them?
PS. I am not interested in alternative methods/solutions for reaching "the same" goal, like:
Cached and/or saved output buffers are out of the question as an alternative, mainly because the output from the generated PHP code is highly dynamic and context-sensitive
Storing the code as variables in a database and creating dynamic PHP code that does what is requested based on the stored data, mainly because I don't want to use a database as the backend for this feature. I never need to search the data, or query it for aggregation, ranking or any other data collection or manipulation
Memcached, APC etcetera. It's not a caching feature I want
A stand-alone (non-PHP) server with a custom compiled binary running in memory. Not what I am looking for here, although this alternative has crossed my mind.
EDIT:
Got many questions about what "type" of code is generated. Without getting into details I can say: it's very context-sensitive code. The code is not based on direct user input, but on input in terms of choices, positions and flags - like "closed" objects in relation to other objects. Most code parts are related to each other in many different, but very controlled, ways (similar to linked lists, genetic cells in AI code, etcetera), so querying a database is out of the question. One code file will include one or more others, and so on.
I do the same thing in an application. It generates static PHP code from data in a MySQL database. I store the code in memcached and use eval() to execute it. I only regenerate the PHP when something changes in the MySQL database. It saves an awful lot of MySQL reads.
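On the overwrite-while-executing concern from the question: you generally don't need a mutex for the readers if you write the new version to a temporary file and rename() it over the old one, since rename() within the same filesystem is atomic and an include() already in progress keeps reading the old file. A minimal sketch (the paths and the opcache call are assumptions about the setup):

    // Regenerate a code file without readers ever seeing a half-written version.
    function writeGeneratedCode($targetPath, $phpSource)
    {
        $tmpPath = $targetPath . '.' . uniqid('tmp', true);
        file_put_contents($tmpPath, $phpSource);    // write the new version aside
        rename($tmpPath, $targetPath);              // atomic swap on the same filesystem
        if (function_exists('opcache_invalidate')) {
            opcache_invalidate($targetPath, true);  // make the next include() pick it up
        }
    }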
I've got a heavy-read website backed by a MySQL database. I also have some small "auxiliary" information (it fits in an array of 30-40 elements as of now), hierarchically organized, that gets updated slowly, maybe 4-5 times per year. It's not strictly a configuration file, since this information is about the subject of the website rather than its functioning, but it behaves like one. Until now I just used a static PHP file containing an array of info, but now I need a way to update it via a backend CMS from my admin panel.
I thought of a simple CMS that lets the admin create/edit/delete entries (a rare, periodic job) and then writes a static JSON file to be used by the page-building scripts instead of pulling this information from the db.
The question is: given the heavy-read nature of the website, is it better to read a rarely updated JSON file on the server when building pages or just retrieve raw info from the database for every request?
I just used a static PHP
This sounds like a contradiction to me. Either static, or PHP.
given the heavy-read nature of the website, is it better to read a rarely updated JSON file on the server when building pages or just retrieve raw info from the database for every request?
Cache was invented for a reason :) The same applies to your case - it all depends on how often the data changes vs how often it is read. If the data changes once a day and remains static for 100k reads during the day, then not caching it, or not serving it from a flat file, would simply be stupid. If the data changes once a day and you have 20 reads per day on average, then perhaps returning the data from code on each request would be less stupid, but on the other hand, 19 of those requests could be served from cache anyway, so... If you can, serve from a flat file.
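A minimal sketch of the flat-file variant, assuming the admin panel calls publishSiteInfo() after each change and the page scripts call loadSiteInfo() (the function names, columns and file path are illustrative):

    // Admin side: after a create/edit/delete, dump the rows once into a static file.
    function publishSiteInfo(PDO $db, $jsonPath)
    {
        $rows = $db->query('SELECT id, parent_id, label, value FROM site_info')
                   ->fetchAll(PDO::FETCH_ASSOC);
        file_put_contents($jsonPath, json_encode($rows), LOCK_EX);
    }

    // Page-building side: every request just reads the file, no DB round trip.
    function loadSiteInfo($jsonPath)
    {
        $data = json_decode(file_get_contents($jsonPath), true);
        return is_array($data) ? $data : array();
    }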
Caching is your best option; Redis or Memcached are common, excellent choices. Between flat file and database it's hard to say without knowing the SQL schema you're using (how many columns, the data type definitions, how many foreign keys and indexes, etc.).
SQL is about relational data; if you have non-relational data, you don't really have a reason to use SQL. Many people now switch to NoSQL databases for this kind of data, since modifying SQL schemas after the fact is a huge pain.
Here is my problem:
I have many known locations (I have no influence over them) with a lot of data. Each location offers me new data at its own intervals. Some give me differential updates, some just the whole dataset, some via XML, for some I have to build a web scraper, some need authentication, etc.
The collected data should be stored in a database. I have to program an API that sends the requested data back as XML.
Many roads lead to Rome, but which should I choose?
Which software would you suggest I use?
I am familiar with C++, C#, Java, PHP, MySQL and JS, but new stuff is still ok.
My idea is to use cron jobs + PHP (or shell scripts) + cURL to fetch the data.
Then I need a module to parse the data and insert it into a database (MySQL).
Data requests from clients could be answered by a PHP script.
I think the input data volume is about 1-5GB/day.
There is no single correct answer, but can you give me some advice?
It would be great if you could show me smarter ways to do this.
Thank you very much :-)
LAMP: Stick to PHP and MySQL (and make occasional forays into Perl/Python): the availability of PHP libraries, storage solutions, scalability and API options, and the size of its community more than make up for what other environments offer.
API: Ensure that the designed API queries (and storage/database) can meet all end-product needs before you get to writing any importers. Date ranges, tagging, special cases.
PERFORMANCE: If you need lightning-fast queries for insanely large data sets, sphinx-search can help. It does more than just text search (tags, binary, etc.), but make sure you spec the server requirements with more RAM.
IMPORTER: Make it modular: for each different data source, write a pluggable importer that can be enabled/disabled by an admin and, of course, individually tested (see the sketch below). Pick a language and library based on what's the best and easiest fit for the job: a bash script is okay.
In terms of parsing libraries for PHP, there are many. One of the more popular recent ones is simplehtmldom, and I found it to work quite well.
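On the IMPORTER point, one way to keep the importers pluggable is a small interface that each source-specific class implements; this is only a sketch, and the names are made up rather than taken from any library:

    // Each data source (XML feed, scraper, authenticated API, ...) gets its own class.
    interface Importer
    {
        public function name();                   // used by the admin enable/disable flag
        public function fetch();                  // returns the raw payload (cURL, download, ...)
        public function store($raw, PDO $db);     // parse + insert as-is, no transformation
    }

    // The cron entry point just walks the importers the admin has enabled.
    function runImporters(array $importers, array $enabled, PDO $db)
    {
        foreach ($importers as $importer) {
            if (!in_array($importer->name(), $enabled, true)) {
                continue;                         // disabled by the admin
            }
            $importer->store($importer->fetch(), $db);
        }
    }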
TRANSFORMER: Make the data transformation routines modular as well, so they can be written as the need arises. Don't make the importer alter the original data; just make it the quickest way into an indexed database. Transformation routines (or later plugins) should be combined with the API query to produce whatever end result is needed.
TIMING: There is nothing wrong with cron executions, as long as they don't become runaway or cause your input sources to start throttling or blocking you, so build in that awareness.
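A common guard against runaway cron executions is a lock file, so that a new run exits immediately if the previous one is still working; a minimal sketch (the lock path is arbitrary):

    // Put this at the top of the cron script; only one instance runs at a time.
    $lock = fopen('/tmp/importer.lock', 'c');
    if (!flock($lock, LOCK_EX | LOCK_NB)) {
        exit(0);                                  // the previous run is still working
    }
    // ... do the imports ...
    flock($lock, LOCK_UN);
    fclose($lock);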
VERSIONING: Design the database, imports, etc. so that errant data can be rolled back easily by an admin.
Vendor Solution: Check out scraperwiki - they've made a business out of scraping tools and data storage.
Hope this helps. Out of curiosity, any project details to volunteer? A colleague of mine is interested in exchanging notes.
Based on this tutorial I have built a page which functions correctly. I added a couple of dropdown boxes to the page and, based on this snippet, have been able to filter the results accordingly. So, in practice, everything is working as it should. However, my question is about the efficiency of the procedure. Right now the process looks something like this:
1.) User visits the page
2.) Body onload() is called
3.) JavaScript calls a PHP script, which queries the database (based on criteria passed along via the URL) and exports that query to an XML file.
4.) The XML file is then parsed via JavaScript on the user's local machine.
For any one search there could be several thousand results (and thus, several thousand markers to place on the map). As you might have guessed, it takes a long time to place all of the markers. I have some ideas to speed it up, but wanted to touch base with experienced users to verify that my logic is sound. I'm open to any suggestions!
Idea #1: Is there a way (and would it speed things up?) to run the query once, generate an XML file via PHP that contains all possible results, store that XML file locally, and then do the filtering via JavaScript?
Idea #2: Create a cron job on the server to export the XML file to a known location. Instead of using GDownloadUrl("phpfile.php", ...) I would use GDownloadUrl("xmlfile.xml", ...), thus eliminating the need to run a new query every time the user changes the value of a dropdown box.
Idea #3: Instead of passing criteria back to the PHP file (via the URL), should I just be filtering the results via JavaScript before placing the markers on the map?
I have seen a lot of webpages that place tons and tons of markers on a google map and it doesn't take nearly as long as my application. What's the standard practice in a situation like this?
Thanks!
Edit: There may be a flaw in my logic: if I were to export all results to an XML file, how (other than with JavaScript) could I then filter those results?
Your logic is sound; however, I probably wouldn't do the filtering in JavaScript. If the user's computer is not very fast, performance will suffer. It is better to perform the filtering server-side based on a cached resource (XML in your case).
The database is probably the biggest bottleneck in this operation, so caching the result would most likely speed your application up significantly. You might also check that you have set up your keys (indexes) correctly to make your query as fast as possible.
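As a sketch of Idea #2 in server-side form: regenerate the markers XML only when the cached copy is older than some threshold, instead of querying on every request (the table, columns, path and 10-minute TTL below are assumptions):

    // Serve a cached markers.xml if it is fresh enough, otherwise rebuild it from MySQL.
    function markersXml(PDO $db, $cachePath, $ttl = 600)
    {
        if (is_file($cachePath) && time() - filemtime($cachePath) < $ttl) {
            return file_get_contents($cachePath);          // cache hit
        }

        $xml = new SimpleXMLElement('<markers/>');
        foreach ($db->query('SELECT name, lat, lng FROM locations') as $row) {
            $marker = $xml->addChild('marker');
            $marker->addAttribute('name', $row['name']);
            $marker->addAttribute('lat', $row['lat']);
            $marker->addAttribute('lng', $row['lng']);
        }

        $out = $xml->asXML();
        file_put_contents($cachePath, $out, LOCK_EX);
        return $out;
    }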
To store multi-language content (and there is a lot of content), should it be kept in the database or in files? And what is the basic way to approach this? We have page content, reference tables, page title bars, metadata, etc. Should every table have additional columns for each language? So if there are 50 languages (the number will keep growing, as this is a worldwide social site, so the eventual goal is to support as many languages as possible), then 50 extra columns per table? Or is there a better way?
There is a mixture of dynamic system and user content + static content.
Scalability and performance are important. Being developed in PHP and MySQL.
Users will be able to change the language on any page from the footer. The language can be either session-based or preference-based. I'm not sure which is the better route.
If you have a variable, essentially unknown-today number of languages, then this definitely should NOT be multiple columns in a record. Basically, the key on this table should be something like message id plus language id, or maybe screen id plus message id plus language id. Then you have a separate record for each language for each message.
If you try to cram all the languages into one record, your maintenance will become a nightmare. Every time you add another language to the app, you will have to go through every program to add "else if language=='Tagalog' then text=column62" or whatever. Make the language part of the key and then you're just reading "where messageId='Foobar' and language=current_language", passing the current language around. When you add a new language, nothing should have to change except adding it to the list of valid language codes somewhere.
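A minimal sketch of that keying scheme, with one row per (message, language) pair; the table and column names are illustrative:

    // One row per (message, language) pair, keyed on both, e.g.:
    //   CREATE TABLE translations (
    //       message_id  VARCHAR(64)  NOT NULL,
    //       language_id CHAR(5)      NOT NULL,   -- 'en', 'pt-br', ...
    //       text        TEXT         NOT NULL,
    //       PRIMARY KEY (message_id, language_id)
    //   );
    function t(PDO $db, $messageId, $languageId)
    {
        $stmt = $db->prepare(
            'SELECT text FROM translations WHERE message_id = ? AND language_id = ?'
        );
        $stmt->execute(array($messageId, $languageId));
        $text = $stmt->fetchColumn();
        return $text !== false ? $text : $messageId;  // fall back to the key itself
    }

Adding a language then means inserting rows, not altering the schema or touching any code.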
So really the question is:
blah blah blah. Should I keep my data in flat files or a database?
The short answer is: whichever you find easier to work with. Depending on how you structure it, the file-based approach can be faster than the database approach. OTOH, get it wrong and the performance impact will be huge. The database approach enforces a more consistent structure from the start, so if you make it up as you go along, the database approach will probably pay off in the long run.
eventual goal is to have as many languages as possible) then 50 extra columns per table?
No.
If you need to change your database schema (or the file structure) every time you add a new language (or new content) then your schema is wrong. If you don't understand how to model data properly then I'd strongly recommend the database approach for the reasons given.
You should also learn how to normalize your data - even if you ultimately choose a non-relational database to keep it in.
You may find this useful:
PHP UTF-8 cheatsheet
The article describes how to design the database for a multi-lingual website and which PHP functions to use.
Definitely start with a well-defined model, so your design doesn't care whether the data comes from a file, a DB, or even memcached or something like that. It's probably best to make a single call per page to get an object that contains all the fields for that page, rather than multiple calls. Then you can just reference that single returned object to get each localised field. Behind the scenes you can then code and test the repository access. Personally I'd probably go with the DB approach over a file: you don't have to worry about concurrent file access, and it's probably easier to deploy changes - again, you don't have to worry about files being locked by reads when you're deploying new files, just a DB update.
See this link about PHP IoC; that might help, as it would allow you to abstract away from your code what type of repository is used to hold the data. That way, if you go with one approach and later want to change it, you won't have to do so much rework.
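A sketch of that kind of abstraction: the page code depends only on a repository interface, so a file-backed or DB-backed implementation can be swapped in later (the interface, class and table names are made up for illustration):

    // Page code depends only on this interface, not on PDO or the filesystem.
    interface PageContentRepository
    {
        // Returns every localised field for one page in a single call.
        public function pageContent($pageId, $languageId);
    }

    class DbPageContentRepository implements PageContentRepository
    {
        private $db;

        public function __construct(PDO $db)
        {
            $this->db = $db;
        }

        public function pageContent($pageId, $languageId)
        {
            $stmt = $this->db->prepare(
                'SELECT field, text FROM page_content WHERE page_id = ? AND language_id = ?'
            );
            $stmt->execute(array($pageId, $languageId));
            return $stmt->fetchAll(PDO::FETCH_KEY_PAIR);  // array('title' => '...', 'body' => '...')
        }
    }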
There's no reason you need to stick with one data source for all "content". There is dynamic content that will be regularly added to or updated, and then there is relatively static content that only rarely gets modified. Then there is peripheral content, like system messages and menu text, vs. primary content—what users are actually here to see. You will rarely need to search or index your peripheral content, whereas you probably do want to be able to run queries on your primary content.
Dynamic content and primary content should be placed in the database in most cases. Static peripheral content can be placed in the database or not. There's no point in putting it in the database if the site is being maintained by a professional web developer who will likely find it more convenient to just edit a .pot or .po file directly using command-line tools.
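If you do go the .po route for the peripheral strings, PHP's gettext extension handles the runtime lookup; a minimal sketch, assuming the gettext extension is installed and the compiled .mo files live under ./locale (the domain name and paths are illustrative):

    // Expects ./locale/de_DE/LC_MESSAGES/messages.mo, compiled from a .po file with msgfmt.
    $locale = 'de_DE.UTF-8';
    putenv('LC_ALL=' . $locale);
    setlocale(LC_ALL, $locale);
    bindtextdomain('messages', __DIR__ . '/locale');
    textdomain('messages');

    echo gettext('Log in');   // prints the German translation if one exists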
Search SO for the tags i18n and l10n for more info on implementing internationalization/localization. As for how to design a database schema, that is a subject deserving of its own question. I would search for questions on normalization as suggested by symcbean as well as look up some tutorials on database design.