Summary
I am working on a PHP news crawler project and want to pull RSS news feeds from nearly a hundred news websites using wget (version 1.12), capturing whole RSS feed files into one directory (without hierarchy) on the local server, with these requirements:
Some of these websites do not have an RSS feed, so I will eventually have to capture and parse their HTML, but for now I can concentrate on XML feeds.
All feed files from all websites go into one directory.
No extra content should be downloaded; any extras (like images, if any) should stay hosted on the remote server.
Performance is important.
Feed files need to be renamed before saving, according to my convention source.category.type.xml (each remote XML URL has its own source, category and type, but not in my naming convention).
Some of these feeds do not include a news timestamp (e.g. no <pubDate>), so I need an approach to handling news time that tolerates slight differences but is robust and always functional.
To automate this, I need to run the wget command from a cron job on a regular basis.
url-list.txt includes:
http://source1/path/to/rss1
http://source2/different/path/to/rss2
http://source3/path/to/rss3
.
.
.
http://source100/different/path/to/rss100
I want this:
localfeed/source1.category.type.xml
localfeed/source2.category.type.xml
localfeed/source3.category.type.xml
.
.
.
localfeed/source100.category.type.xml
Category and type can have multiple predefined values like sport, ...
What do I have?
As the very first step, I should run wget against a list of remote URLs. According to the wget documentation:
url-list.txt should consist of a series of URLs, one per line
When running wget without -N, -nc, -r, or -p, downloading the same file in the same directory will result in the original copy of FILE being preserved and the second copy being named FILE.1.
Use of -O, as in wget -O FILE, is not intended to mean simply "use the name FILE instead of the one in the URL"; it concatenates all downloads into that single file.
Use -N for time-stamping.
-w SECONDS waits SECONDS seconds before the next retrieval.
-nd forces wget not to create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the filenames will get extensions `.n')
-nH disables generation of host-prefixed directories (the behavior which -r by default does).
-P PREFIX sets directory prefix to PREFIX. The "directory prefix" is the directory where all other files and subdirectories will be saved to, i.e. the top of the retrieval tree.
-k converts links for offline browsing
$ wget -nH -N -i url-list.txt
Issues (with wget, the cron job and PHP):
How to handle news time? Is it better to save a timestamp in the file names, like source.category.type.timestamp.xml, or to fetch the change time using PHP's stat function, like this:
$stat = stat('source.category.type.xml');
$time = $stat['mtime']; // last modification time
or is there another approach that is robust and always works?
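For the feeds without <pubDate>, one option I am considering is to prefer the feed's own timestamp and fall back to the local file's mtime (which wget -N keeps in line with the server's Last-Modified header when one is sent); the path below is just an example:
<?php
// Use <pubDate> of the newest item if the feed provides it, otherwise the file's mtime.
$file = 'localfeed/source1.category.type.xml'; // example local feed file
$xml  = simplexml_load_file($file);
$newsTime = isset($xml->channel->item[0]->pubDate)
    ? strtotime((string) $xml->channel->item[0]->pubDate)
    : filemtime($file);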
How to handle file names? I want to save the files locally under a distinct convention (source.category.type.xml), so I think wget options like --trust-server-names or --content-disposition won't help. I think I have to use a while loop like this:
while read url; do
wget -nH -N -O nameConvention "$url"
done < url-list.txt
I suggest staying away from wget for your task, as it makes your life really complicated for no reason. PHP is perfectly capable of fetching the downloads.
I would add all URLs into a database (it might be just a text file, like in your case). Then I would use a cronjob to trigger the script.
On each run I would check a fixed number of sites and put their RSS feeds into the folder. E.g. with file_get_contents and file_put_contents you are good to go. This allows you full control over what to fetch and how to save it.
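A minimal sketch of that fetch step, assuming you extend url-list.txt so that each line maps a feed URL to its target name (the names and paths below are illustrative):
<?php
// url-list.txt (assumed format): <url> <whitespace> <target file name>
// e.g. http://source1/path/to/rss1  source1.category.type.xml
foreach (file('url-list.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $parts = preg_split('/\s+/', trim($line), 2);
    if (count($parts) !== 2) {
        continue; // skip malformed lines
    }
    list($url, $name) = $parts;
    $feed = file_get_contents($url);
    if ($feed !== false) {
        file_put_contents('localfeed/' . $name, $feed); // save under your naming convention
    }
}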
Then I would use another script that goes over the files and does the parsing. Separating the scripts from the beginning will help you scale later on.
For a simple setup, just sorting the files by mtime should do the trick. For a big scale-out, I would use a job queue.
The overhead in PHP is minimal while the additional complexity by using wget is a big burden.
Related
I've done a bulk download from archive.org using wget, set up to fetch all the files for each IDENTIFIER into its respective folder.
wget -r -H -nc -np -nH -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'
Which results in folders organised thus from a root, for example:
./IDENTIFIER1/file.blah
./IDENTIFIER1/something.la
./IDENTIFIER1/thumbnails/IDENTIFIER_thumb001.gif
./IDENTIFIER1/thumbnails/IDENTIFIER_thumb002.gif
./IDENTIFIER1/IDENTIFIER_files.xml
./IDENTIFIER2/etc.etc
./IDENTIFIER2/blah.blah
./IDENTIFIER2/thumbnails/IDENTIFIER_thumb001.gif
etc
The IDENTIFIER is the name of a collection of files on archive.org; hence, in each folder there is also a file called IDENTIFIER_files.xml which contains checksums for all the files in that folder, wrapped in various XML tags.
Since this is a bulk download and there are hundreds of files, the idea is to write some sort of script (preferably bash? Edit: maybe PHP?) that can select each .xml file, scrape it for the hashes, and test them against the files to reveal any corrupted, failed or modified downloads.
For example:
From archive.org/details/NuclearExplosion, XML is:
https://archive.org/download/NuclearExplosion/NuclearExplosion_files.xml
If you check that link you can see there's both the option for MD5 or SHA1 hashes in the XML, as well as the relative file paths in the file tag (which will be the same as locally).
So. How do we:
For each folder of IDENTIFIER, select and scrape the XML for each filename and the checksum of choice;
Actually test the checksum for each file;
Log outputs of failed checksums to a file that lists only the failed IDENTIFIER (say a file called ./RetryIDs.txt for example), so a download reattempt can be tried using that list...
wget -r -H -nc -np -nH -e robots=off -l1 -i ./RetryIDs.txt -B 'http://archive.org/download/'
Any leads on how to piece this together would be extremely helpful.
As an added incentive, if there is a solution it would probably be a good idea to let archive.org know, so they can put it on their blog. I'm sure I'm not the only one who will find this very useful!
Thanks all in advance.
Edit: Okay, so a bash script looks tricky. Could it be done with PHP?
If you really want to go the bash route, here's something to get you started. You can use the xml2 suite of tools to convert XML into something more amenable to traditional shell scripting, and then do something like this:
#!/bin/sh
xml2 < "$1" | awk -F= '
$1 == "/files/file/@name" {name=$2}
$1 == "/files/file/sha1" {
sha1=$2
print name, sha1
}
'
This will produce on standard output a list of filenames and their corresponding SHA1 checksum. That should get you substantially closer to a solution.
Actually using that output to validate the files is left as an exercise to the reader.
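Since the edit asks about PHP: here is a hedged sketch of the same idea using SimpleXML and sha1_file(), run from the download root, which writes the failing IDENTIFIERs to ./RetryIDs.txt. It assumes the manifest layout matches the /files/file/@name and /files/file/sha1 paths used above.
<?php
// Verify each IDENTIFIER folder against its IDENTIFIER_files.xml manifest.
$failed = [];
foreach (glob('*/*_files.xml') as $manifest) {
    $dir = dirname($manifest);
    $xml = simplexml_load_file($manifest);
    foreach ($xml->file as $entry) {
        $path = $dir . '/' . (string) $entry['name'];
        $sha1 = (string) $entry->sha1;
        if ($sha1 === '' || !is_file($path)) {
            continue; // entry without a hash, or a file that was never downloaded
        }
        if (sha1_file($path) !== $sha1) {
            $failed[$dir] = true; // remember the IDENTIFIER, not every single file
        }
    }
}
file_put_contents('RetryIDs.txt', implode("\n", array_keys($failed)) . "\n");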
I have a little problem: I have a large 41 GB file on a server and I need to extract it.
How would I go about it? The file is in tar.gz format, extraction takes about 24 hours on a GoDaddy server, and then it stops for some reason.
I need to exclude a folder named data; it contains the bulk of the data (40.9 GB), the rest is just PHP.
home/xxx/public_html/xxx.com.au/data << this is the folder I don't need
I have been searching Google and other sites for days, but nothing works.
shell_exec('tar xvf xxx_backup_20140921.tar.gz'); is the command I use. I have even used the k flag to skip existing files, and it doesn't work.
I have used the --exclude option, but nothing.
Try this:
shell_exec("tar xzvf xxx_backup_20140921.tar.gz --exclude='home/xxx/public_html/xxx.com.au/data'");
This should prevent the path listed (relative to the root of the archive) from being extracted.
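If the extraction has to be started from PHP as in the question, a hedged variant that captures tar's output, so a failure is at least visible, might look like this (keep in mind that PHP's execution time limits can still kill a job this long; running the same command from cron or a shell avoids that):
<?php
// Same tar call, with stdout/stderr captured to a log for inspection.
$cmd = "tar xzvf xxx_backup_20140921.tar.gz"
     . " --exclude='home/xxx/public_html/xxx.com.au/data' 2>&1";
file_put_contents('extract.log', shell_exec($cmd));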
I have created hundreds of folders and text files using PHP, and I then add them to a zip archive.
This all works fine, but if I create another zip archive using the same folders and files, the new archive will have a different hash from the first one. The same happens if I use WinRAR instead of PHP to create the archive.
The hashes only seem to differ when I zip the files I have created through PHP, yet the archives open fine.
Very strange. Can anyone shed any light on this?
Thanks
Zip is not deterministic. To solve this problem (it's a real problem when you have CI and need to update an AWS Lambda, for example, and don't want to update it on every run but only when something has really changed) I used this article: https://medium.com/@pat_wilson/building-deterministic-zip-files-with-built-in-commands-741275116a19
Like this:
find . -exec touch -t "$(git ls-files -z . | \
xargs -0 -n1 -I{} -- git log -1 --date=format:"%Y%m%d%H%M" --format="%ad" '{}' | \
sort -r | head -n 1)" '{}' +
zip -rq -D -X -9 -A --compression-method deflate dest.zip sources...
There is certainly some difference in the files. If the lengths are not exactly the same, the hash will be different. You can use a comparing hex editor, like Hex Workshop for example, to see what exactly the differences are.
Possibilities that come to my mind:
As @orn mentioned, there may be a timestamp in the zip format you are using (not sure).
The order that the files are added to the archive may be different (depending on how you're selecting them / building the source array).
You can consider using deterministic_zip, which solves this issue. From its documentation:
There are three tricks to building a deterministic zip:
Files must be added to the zip in the same order. Directory iteration order may vary across machines, resulting in different zips. deterministic_zip sorts all files before adding them to the zip archive.
Files in the zip must have consistent timestamps. If I share a directory to another machine, the timestamps of individual files may differ, despite having identical content. To achieve timestamp consistency, deterministic_zip sets the timestamp of all added files to 2019-01-01 00:00:00.
Files in the zip must have consistent permissions. File permissions look like -rw-r--r-- for a file that is readable by all users, and only writable by the user who owns the file. Similarly executable files might have permissions that look like: -rwxr-xr-x or -rwx------. deterministic_zip sets the permission of all files added to the archive to either -r--r--r--, or -r-xr-xr-x. The latter is only used when the user running deterministic_zip has execute access on the file.
Note: deterministic_zip does not modify nor update timestamps of any files it adds to archives. The techniques used above apply only to the copies of files within archives deterministic_zip creates.
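For reference, a rough PHP sketch of the same three tricks with ZipArchive (sorted order, a fixed timestamp, fixed permissions). It assumes PHP 8.0+ for ZipArchive::setMtimeName(), and the 'sources' directory name is illustrative; treat it as a sketch rather than a drop-in replacement for deterministic_zip.
<?php
// Collect the files, sort them, then add them with a fixed mtime and fixed permissions.
$files = [];
$it = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator('sources', FilesystemIterator::SKIP_DOTS)
);
foreach ($it as $f) {
    $files[] = $f->getPathname();
}
sort($files); // trick 1: stable order

$zip = new ZipArchive();
$zip->open('dest.zip', ZipArchive::CREATE | ZipArchive::OVERWRITE);
foreach ($files as $path) {
    $zip->addFile($path, $path);
    $zip->setMtimeName($path, strtotime('2019-01-01 00:00:00'));  // trick 2: fixed timestamp
    $zip->setExternalAttributesName($path, ZipArchive::OPSYS_UNIX,
        (is_executable($path) ? 0555 : 0444) << 16);              // trick 3: fixed permissions
}
$zip->close();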
I am working with Magento, and there is a function that merges CSS and Javascript into one big file.
Regardless of the pros and cons of that, there is the following problem:
The final file gets cached at multiple levels that include but are not limited to:
Amazon CloudFront
Proxy servers
Clients browser cache
Magento uses an MD5 sum of the concatenated css filenames to generate a new filename for the merged css file. So that every page that has a distinct set of css files gets a proper merged css file.
To work around the caching issue, I also included the file modification timestamps in that hash, so that a new hash is generated every time a CSS file is modified.
So we get the full advantage of caching without revalidation, but if something changes it is visible instantly, because the resource link has changed.
So far so good:
The only problem is that the filenames used to generate the hash are only the ones that would normally be directly referenced in the HTML head block, and don't include CSS imports inside those files.
So changes in files that are imported inside CSS files don't result in a new hash.
No, I really don't want to recursively parse all the imports and scan them, or something like that.
I rather thought about a directory based solution. Is there anything to efficiently monitor the "last change inside a directory" on a file system basis?
We are using ext4.
Or maybe there is another way, perhaps with the find command, that does the job based on inode indexes?
Something like that?
I have seen a lot of programs that instantly "see" changes without scanning whole filesystems. I believe there are also some sort of "file manipulation watch" daemons available under Linux.
The problem is that the css directory is pretty huge.
Can anyone point me in the right direction?
I suggest you use a PHP-independent daemon to modify the change date of your main CSS file when one of the dependent files is modified. You can use dnotify for it, something like:
dnotify -a -r -b -s /path/to/imported/css/files/ -e touch /path/to/main/css/file;
It will execute touch on the main CSS file each time one of the files in the other folder is modified (-a -r -b -s = any access / recursive directory lookup / run in background / no output). Or you can run any other action and test for it from PHP.
If you use the command
ls -ltr `find . -type f `
it will give you a long listing of all files, with the newest at the bottom.
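A PHP take on the same idea, hedged: stat every file below the CSS directory and keep the newest mtime to mix into the merged-file hash. It is simple rather than efficient, since it still walks the whole tree; the path is illustrative.
<?php
// Newest modification time anywhere below the css directory.
$latest = 0;
$it = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator('/path/to/css', FilesystemIterator::SKIP_DOTS)
);
foreach ($it as $file) {
    $latest = max($latest, $file->getMTime());
}
echo $latest; // feed this into the hash alongside the filenames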
Try having a look at the inotify packages, which allow you to be notified each time a modification occurs in a directory.
InotifyTools
php-inotify
I've never used it, but apparently there is inotify support for PHP.
(inotify would be the most efficient way to get notifications under Linux)
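A hedged sketch with the PECL inotify extension (the php-inotify package linked above). Note that an inotify watch is per directory, not recursive, so a deep CSS tree needs one watch per subdirectory; the path is illustrative.
<?php
// Block until something in the watched directory changes, then react.
$fd    = inotify_init();
$watch = inotify_add_watch($fd, '/path/to/css', IN_MODIFY | IN_CREATE | IN_DELETE);
while (true) {
    $events = inotify_read($fd); // blocks until at least one event is available
    foreach ($events as $event) {
        // e.g. touch the merged CSS file or regenerate the hash here
        error_log('changed: ' . $event['name']);
    }
}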
For a particular project I have, no server side code is allowed. How can I create the web site in php (with includes, conditionals, etc) and then have that converted into a static html site that I can give to the client?
Update: Thanks to everyone who suggested wget. That's what I used. I should have specified that I was on a PC, so I grabbed the windows version from here: http://gnuwin32.sourceforge.net/packages/wget.htm.
If you have a Linux system available to you, use wget:
wget -k -K -E -r -l 10 -p -N -F -nH http://website.com/
Options
-k : convert links to relative
-K : keep an original versions of files without the conversions made by wget
-E : rename html files to .html (if they don’t already have an htm(l) extension)
-r : recursive… of course we want to make a recursive copy
-l 10 : the maximum level of recursion. If you have a really big website you may need a higher number, but 10 levels should be enough.
-p : download all necessary files for each page (css, js, images)
-N : Turn on time-stamping.
-F : When input is read from a file, force it to be treated as an HTML file.
-nH : By default, wget puts files in a directory named after the site's hostname. This disables creating those hostname directories and puts everything in the current directory.
Source: Jean-Pascal Houde's weblog
Build your site, then use a mirroring tool like wget or lwp-mirror to grab a static copy.
I have done this in the past by adding:
ob_start();
at the top of the pages, and then in the footer:
$page_html = ob_get_contents();
ob_end_clean();
file_put_contents($path_where_to_save_files . $_SERVER['PHP_SELF'], $page_html);
You might want to convert .php extensions to .html before baking the HTML into the files.
If you need to generate multiple pages with variables, one quite easy option is to append an md5 sum of all GET variables to the filename; you just need to change the links in the HTML too. So you can convert:
somepage.php?var1=hello&var2=hullo
to
somepage_e7537aacdbba8ad3ff309b3de1da69e1.html
Ugly, but it works.
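In PHP, the hash part could be built roughly like this (http_build_query() reproduces the var1=hello&var2=hullo string from $_GET):
<?php
// somepage.php?var1=hello&var2=hullo  ->  somepage_<md5>.html
$static = basename($_SERVER['SCRIPT_NAME'], '.php')
    . '_' . md5(http_build_query($_GET)) . '.html';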
Sometimes you can use PHP to generate javascript to emulate some features, but that cannot be automated very easily.
Create the site as normal, then use spidering software to generate a HTML copy.
HTTrack is software I have used before.
One way to do this is to create the site in PHP as normal, and have a script actually grab the webpages (through HTTP - you can use wget or write another php script that just uses file() with URLs) and save them to the public website locations when you are "done". Then you can just run the script again when you decide to change the pages again. This method is quite useful when you have a slowly changing database and lots of traffic, as you can eliminate all SQL queries on the live site.
If you use modx it has a built in function to export static files.
If you have a number of pages, with all sorts of request variables and whatnot, probably one of the spidering tools the other commenters have mentioned (wget, lwp-mirror, etc) would be the easiest and most robust solution.
However, if the number of pages you need to get is low, or at least manageable, you've got a few options which don't require any third-party tools (not that you should discount them just because they are third party).
You can use PHP on the command line to get it to output directly into a file.
php myFile.php > myFile.html
Using this method could get painful (though you could put it all into a shell script), and it doesn't allow you to pass variables in the same way (eg: php myFile.php?abc=1 won't work).
You could use another PHP file as a "build" script which contains a list of all the URLs you want and then grabs them via file_get_contents() or file() and writes them to a local file. Using this method, you can also get it to check if the file has changed (md5_file() should work for that), so you'll know what to give your client, should they only want updates.
Further to #2, before you write the output to a file, scan it for local URLs and then add those to your list of files to download. While you're there, change those URLs to link to what you'll eventually name your output, so you have a functioning web site at the end. A note of caution here: if this is sounding good, you could probably use one of the tools which already exist and do this for you.
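A hedged sketch of option 2, including the md5_file() change check; the URLs and target paths are made up:
<?php
// Fetch each page over HTTP and only rewrite the static copy when its content changed.
$pages = [
    'http://localhost/index.php' => 'build/index.html',
    'http://localhost/about.php' => 'build/about.html',
];
foreach ($pages as $url => $target) {
    $html = file_get_contents($url);
    if ($html === false) {
        continue; // fetch failed; leave the old copy alone
    }
    if (!is_file($target) || md5($html) !== md5_file($target)) {
        file_put_contents($target, $html); // only changed pages need to go to the client
    }
}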
As an alternative to wget you could use (Win|Web)HTTrack to grab the static pages. HTTrack even corrects links to files and documents to match the static output.
You can use python or visual basic (or your choice) to create your static files all at once then upload them.
For a project with 11 million business listings in excel files I used VBA to extract the spreadsheet data into 11 mil small .php files, then zipped, ftp'd, unzipped.
https://contactlookup.us
Voila - a super fast business directory
I started with Jekyll, but after about half a million entries the generator got bogged down. For 11 million it looked like it would finish the build in about 2 months!
I do this on my own web site for certain pages that are guaranteed not to change; I simply run a shell script that boils down to (warning: bash pseudocode):
find site_folder -name \*.static.php -print -exec Staticize {} \;
with Staticize being:
#!/bin/sh
# Replaces .static.php with .html
TARGET_NAME="$(dirname "$1")/$(basename "$1" .static.php).html"
php "$1" > "$TARGET_NAME"
wget is probably the most complete method. If you don't have access to it, and you have a template-based layout, you may want to look into using Savant 3. I highly recommend Savant 3 over other template systems like Smarty.
Savant is very lightweight and uses PHP as the template language, not some proprietary sub-language. The command you would want to look up is fetch(), which will "compile" your template and place it in a variable that you can output.
http://www.phpsavant.com/
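A rough sketch of that fetch() approach, assuming Savant 3's usual setup; the include path and template name are illustrative:
<?php
// Render a Savant 3 template to a string, then write it out as static HTML.
require_once 'Savant3.php';
$tpl  = new Savant3();
$html = $tpl->fetch('home.tpl.php'); // "compile" the template into a variable
file_put_contents('home.html', $html);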