Import path of folders and subfolders into a faster database engine - php

I created a proxy/crawler a while ago and it ended up logging a lot of files. I thought this would be a simple and acceptable solution to begin with, but I ran into more and more problems once it got close to 1,000,000 files. Searching the database can take up to 15 seconds, and the server has crashed twice in the last week. To test it, I restarted apache2, searched for "test", and repeatedly ran "free -m" in a terminal. RAM usage shot up immediately, and it's probably the RAM that causes the crashes. I'm not sure what makes a search engine fast, but I would really like to know.
All files are stored under:
database/*/*/*.txt
And use this code to go through them all:
$files = array();
$dir = '/var/www/html/database';

// Walk every database/*/*/*.txt file and collect the paths whose
// file name (minus the .txt extension) contains the search term.
foreach (glob($dir . '/*/*/*.txt', GLOB_NOCHECK) as $path) {
    $title = basename($path, ".txt");
    if (strripos($title, $search) !== false) {
        array_push($files, $path);
    }
}
The code is much longer, but I just wanted to show the basics of how it works.
Each file contains about 6 lines of useful info.
So I started looking for a solution and thought: what if I hand the search off to something that can search faster than PHP, like Java or C? Ah, that would be a mess. So I thought about MySQL. But how would I transfer all the files from the folders and subfolders into MySQL? The server is running Debian, with 4 GB of RAM and an i3 processor.
I haven't taken any action yet, because MySQL was confusing and I haven't found any other solution. What should I do?
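For what it's worth, the transfer itself could be scripted in PHP. Below is only a rough sketch of that idea, not the asker's or an accepted solution: the database name, credentials, table layout and the FULLTEXT index are all assumptions made for illustration (FULLTEXT on InnoDB needs MySQL 5.6+).

<?php
// Sketch: walk database/*/*/*.txt and load each file into MySQL.
// Everything schema-related here is invented for the example.
$pdo = new PDO('mysql:host=localhost;dbname=crawler;charset=utf8mb4', 'user', 'pass');

// The FULLTEXT index is what makes later searches fast.
$pdo->exec("CREATE TABLE IF NOT EXISTS files (
    id INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    path VARCHAR(500),
    content TEXT,
    FULLTEXT (title, content)
) ENGINE=InnoDB");

$stmt = $pdo->prepare('INSERT INTO files (title, path, content) VALUES (?, ?, ?)');

$pdo->beginTransaction();
$count = 0;
foreach (glob('/var/www/html/database/*/*/*.txt') as $path) {
    $stmt->execute([basename($path, '.txt'), $path, file_get_contents($path)]);
    // Commit in batches so a million inserts don't sit in one huge transaction.
    if (++$count % 1000 === 0) {
        $pdo->commit();
        $pdo->beginTransaction();
    }
}
$pdo->commit();

A search then becomes a single indexed query, e.g. SELECT path FROM files WHERE MATCH(title, content) AGAINST (?), instead of looping over a million basename() calls.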

This question was asking for too much; it's not just a click-and-go fix. I thought more people had problems like this, but then I realized that everyone is using premade search engines.
I ended up downloading the whole database to my Windows computer and coding a program in C# that automatically goes through all the files, reads their content, and POSTs it to an Elasticsearch database which I installed on the Debian server. I should probably have created a file-to-file converter instead of turning each file into its own POST request.
The drawback of doing this is that the speeds are not too high; it took 2 hours to transfer 700,000 files to the database.
The program will not be released publicly because of specific strings I used in the files. So this was way harder than I expected.
C# application result:
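For reference, the same kind of transfer can also be done from PHP using Elasticsearch's _bulk endpoint, which usually beats one POST per file. This is only a sketch under assumptions: a recent Elasticsearch listening on localhost:9200, an index named "files", and a batch size of 500, none of which come from the answer above.

<?php
// Sketch: push files into Elasticsearch in batches via the _bulk API
// (newline-delimited JSON). Index name and batch size are arbitrary.
function bulkPost($ndjson) {
    $ch = curl_init('http://localhost:9200/files/_bulk');
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $ndjson,
        CURLOPT_HTTPHEADER     => ['Content-Type: application/x-ndjson'],
        CURLOPT_RETURNTRANSFER => true,
    ]);
    $response = curl_exec($ch);
    curl_close($ch);
    return $response;
}

$batch = '';
$count = 0;
foreach (glob('/var/www/html/database/*/*/*.txt') as $path) {
    $meta = json_encode(['index' => new stdClass()]);   // {"index":{}}
    $doc  = json_encode([
        'title'   => basename($path, '.txt'),
        'content' => file_get_contents($path),
    ]);
    $batch .= $meta . "\n" . $doc . "\n";
    if (++$count % 500 === 0) {                          // flush every 500 docs
        bulkPost($batch);
        $batch = '';
    }
}
if ($batch !== '') {
    bulkPost($batch);
}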

Related

PHP reads text files very slowly

I have a large number of files in a directory, and I'm using PHP to read each one into a string. For example, a file's path looks like this: filerootdir/dir1/dir2/dir3/dir4/dir5/dir6/file.txt.
I have a million such txt files. Depending on different parameters, PHP reads a txt file and displays it as part of the webpage. I'm testing the PHP program on Windows 7 Pro right now. When a file's absolute path is short, e.g. filerootdir/dir1/file.txt, it loads pretty fast. But when the absolute path is long, it is VERY slow. I'm wondering if there is a better solution to this problem.
I'm testing my program under WAMP on Windows, but it will eventually be moved to LAMP. Will the file-loading code run faster on Linux servers? Could this be a problem with the Windows operating system?
The code I'm using looks like the following:
if (file_exists($filePath . ".html")) {
    $code = file_get_contents($filePath . ".html");
}
Thanks very much!
You might consider storing the data in a database - with this number of records, especially if they are small files, a database will probably be more efficient. Before you do, read up on indexes - they can grab the right record out of billions in a tiny fraction of a second.
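As a rough illustration of that suggestion (the table and column names below are invented, not taken from the question): keep one row per former text file, index the lookup key, and fetch by key instead of walking deep directories.

<?php
// Illustrative sketch only: one row per former text file, with the lookup
// key as the primary key so retrieval is an index lookup, not a file walk.
$pdo = new PDO('mysql:host=localhost;dbname=site;charset=utf8mb4', 'user', 'pass');

$pdo->exec("CREATE TABLE IF NOT EXISTS pages (
    page_key VARCHAR(255) PRIMARY KEY,   -- e.g. 'dir1/dir2/dir3/file'
    html MEDIUMTEXT
)");

// The key would come from the request (e.g. a routing parameter).
$pageKey = 'dir1/dir2/dir3/file';

// Lookup stays fast regardless of how many rows the table holds.
$stmt = $pdo->prepare('SELECT html FROM pages WHERE page_key = ?');
$stmt->execute([$pageKey]);
$code = $stmt->fetchColumn();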

Opencart export tool choking around 12,000 products

I am using this extension to perform export/import for the product database on our website. The site is pretty snappy - it loads quickly on a killer server, and everything functions flawlessly except for the product import/export.
Everything worked fine up until the point where we had about 12,000 products in the catalog. Product import still appears to work fine; the problem is that exporting products is choking. Here's what happens: I click export, it hangs for about 10-12 minutes (during which time the site goes down, unless I kill the process via CLI), then it lands on a "page not found" error at the same link the admin export function was accessing.
Technical data & stuff I have tried or considered...
The import/export code may be downloaded here. Opencart is based on an MVC framework, so the controller and model are obviously the important files to look at.
I have upgraded the original plugin to use the absolute latest version of PHPExcel and the Pear library, with the OLE and Spreadsheet extensions - both used by the import/export module.
php.ini settings are maxed out: PHP is allowed up to 8 GB of RAM, and post_max_size, the max upload size, and all other relevant settings are about as high as they can go. The server is running dual quad-core Xeons with a number of SAS hard drives, with average processor usage around 3%. So it's not the server, and unless I'm missing something, it's not the PHP settings that are the root of the problem here.
There are no errors in the error log that would point to any specific problem in the code - just the fact that it was working before and now locks up while exporting products once more than 12k products are in the DB.
I have tried repairing the product tables, optimizing the database, and re-installing the base Opencart framework.
I realize this is a pretty general question, but I'm at my wit's end. I am not going to code a custom import/export module from scratch to nail this problem down; I'm simply hoping that someone might be able to shed some light (the extension author has not been able to answer this issue). I've picked this thing apart from top to bottom and can't find any reason why it wouldn't be working the way it should.
10-12 minutes is a relatively short time for larger imports. I've seen them last over 45 minutes on dedicated servers with plenty of RAM, with the same problem as yours where the site became totally unresponsive during the upload/import. The problem is the inefficiency of using Excel and decoding all of the values from the saved Excel sheet. I did actually custom-code an efficient version for a client back on 1.4.X, but it was by no means pretty and took quite a lot of debugging. The actual export was massively inefficient too, simply joining all the tables together and taking up vast amounts of memory (over 1.8 GB if I remember correctly). This too was massively reduced by selecting smaller duplicated rows, parsing them separately, then inserting them back into arrays for the export data. It was quite incredible just how much faster this was.
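To make the chunked approach concrete, here is a rough, hypothetical sketch of exporting products in batches straight to CSV instead of building one giant joined result set in memory. The table and column names are simplified stand-ins, not the real Opencart schema, and the batch size is arbitrary.

<?php
// Hypothetical sketch of a chunked CSV export: fetch a limited batch of rows
// at a time and stream them to disk, so memory use stays flat no matter how
// many products exist.
$pdo = new PDO('mysql:host=localhost;dbname=opencart;charset=utf8', 'user', 'pass');
$out = fopen('/tmp/products.csv', 'w');
fputcsv($out, ['product_id', 'model', 'price', 'quantity']);   // header row

$batchSize = 1000;
$offset = 0;
do {
    $rows = $pdo->query(sprintf(
        'SELECT product_id, model, price, quantity
         FROM product ORDER BY product_id LIMIT %d OFFSET %d',
        $batchSize, $offset
    ))->fetchAll(PDO::FETCH_NUM);

    foreach ($rows as $row) {
        fputcsv($out, $row);   // stream each row out instead of accumulating
    }
    $offset += $batchSize;
} while (count($rows) === $batchSize);

fclose($out);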
The solution was simpler than I would have ever imagined.
Export tool that actually works for large product databases (and of course, exports to CSV).
Here is the link.

Aptana Studio 3 with PHP - constant indexing

I'm using Aptana Studio 3 with several big PHP projects (10,000+ files), and it suffers from very slow indexing of PHP files, which takes 10-20 minutes to complete and starts every time Aptana starts up, and also at seemingly random moments, for example when synchronizing with SVN.
In the progress view I get multiple 'Indexing new PHP Modules' items.
The whole time it is doing this, Aptana is unusably slow. I don't get why this indexing starts over and over again on files that aren't new at all!
I have already turned off automatic refreshes and automatic builds. If I exclude 'PHP' from the 'Project Natures' in the properties of the projects, the indexing stops, but then I don't have code completion in PHP files.
I have cleaned all projects, created a new workspace, etc., and nothing helps. This happens on multiple PCs (Windows), so I guess more people see this behaviour.
Any possible solutions?
UPDATE
I added my workspace folder to the excluded folders of my virus scanner (Microsoft Security Essentials). At first this seemed to work, but then the indexing started again.
Seems like you did the right steps to try and resolve it, and it also seems we should have a ticket for that, so I created one at https://jira.appcelerator.org/browse/APSTUD-4500 (please add yourself as a 'watcher').
One more thing to try is to break a big project down into a few smaller ones (whenever possible, of course). The indexer creates a binary index file for each project, and its size is proportional to the number of classes, functions, variables and constants in your project. If, for some reason (e.g. a bug), this file gets corrupted, a re-index will happen, so having multiple smaller projects may help with that. Again... just an idea.

Generate an image / thumbnail of a webpage on X-less / GUI-less Linux

There exist numerous solutions for generating a thumbnail or image preview of a webpage. Some of these solutions are web-based, like websnapshots; some are Windows libraries, such as PHP's imagegrabscreen (which only works on Windows); and there is KDE's wkhtml. Many more exist.
However, I'm looking for a GUI-less solution - something I can build an API around and call from PHP or Python.
I'm comfortable with Python, PHP, C, and shell. This is a personal project, so I'm not interested in commercial applications; I'm aware of their existence.
Any ideas?
You can run a web browser or web control within Xvfb, and use something like import to capture it.
I'll never get back the time I wasted on wkhtml and Xvfb, along with the joy of embedding a monolithic binary from google onto my system. You can save yourself a lot of time and headache by abandoning wkhtml2whatever completely and installing phantom.js. Once I did that, I had five lines of shell code and beautiful images in no time.
I had a single problem - using ww instead of www in a URL caused the process to fail without meaningful error messages. Eventually I spotted the DNS lookup problem, and my faith was restored.
But seriously, every other avenue of thumbnailing seemed to be out of date and/or buggy.
phantom.js = it changed my life.
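As a hedged sketch of driving PhantomJS from PHP (not the five lines of shell the answer refers to - the URL, output path, viewport size and timeout below are all arbitrary choices):

<?php
// Sketch: have PHP generate a tiny PhantomJS script, run it headlessly
// (no X server needed), and pick up the rendered PNG. Paths, viewport
// size and the 2-second settle delay are arbitrary.
$url = 'http://example.com/';
$out = '/tmp/thumb.png';

$js = <<<JS
var page = require('webpage').create();
page.viewportSize = { width: 1024, height: 768 };
page.open('$url', function (status) {
    if (status === 'success') {
        // give late-loading assets a moment before rendering
        window.setTimeout(function () {
            page.render('$out');
            phantom.exit(0);
        }, 2000);
    } else {
        phantom.exit(1);
    }
});
JS;

$script = sys_get_temp_dir() . '/snapshot_' . uniqid() . '.js';
file_put_contents($script, $js);
shell_exec('phantomjs ' . escapeshellarg($script));
unlink($script);

From there, ImageMagick's convert (or PHP's GD) can scale the PNG down to thumbnail size.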

Website doesn't work during uploading of crucial files

I have a problem with the maintenance of my PHP-based website. My website is built on the Zend Framework. When I upload a new copy or version - especially while crucial files like models and controllers are being uploaded and overwritten - the site, understandably, won't run.
Is there a way to upload a website without having to go through this issue?
My updates are really quite regular - say, once or twice a week in this case.
You can make use of the fact that renaming directories is quick and easy even through FTP. What I usually do is:
Have two directories, website_live and website_upload
website_live contains the live website (obviously)
Upload contents to website_upload
Rename website_live to website_old (or whatever)
Rename website_upload to website_live
done! Downtime less than two seconds if you rename quickly.
It gets a bit more complex if you have uploaded content in the old version (e.g. from a CMS) that you need to transfer to the new one. It's cumbersome to do manually every time, but basically, it's just simple rename operations too (renaming works effortlessly in FTP as well).
This is a task that can be automated quite nicely using a simple deployment script. If you're on Linux, setting up a shell script for this is easy. On Windows, a very nice tool I've worked with to do automated FTP synchronizing, renaming and error handling - even with non-technical people starting the process - is ScriptFTP. It comes with a good scripting language, and good documentation. It's not free, though.
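If the site lives on a server you can run scripts on, even the switch itself can be done from PHP; a minimal sketch, assuming the two directories from the list above sit under /var/www and are writable by the script:

<?php
// Minimal sketch of the rename switch described above. Paths are assumed;
// adjust to wherever website_live and website_upload actually live.
$root = '/var/www';

// Clear out the previous backup so the rename below can't collide.
if (is_dir("$root/website_old")) {
    shell_exec('rm -rf ' . escapeshellarg("$root/website_old"));
}

// The actual switch: two renames, so downtime is a few milliseconds.
rename("$root/website_live", "$root/website_old");
rename("$root/website_upload", "$root/website_live");

echo "Switched. Previous version kept in website_old for rollback.\n";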
If you're looking to get into hard-core automated PHP deployment, I've been doing some research in that field recently. Maybe the answers to my recent bounty question can give you inspiration.
