I have around 100k URLs in a database and I want to check whether all of them are valid. I tried with PHP and curl, but it's very slow and hits the script timeout. Is there a better way to do this using some other shell script?
So far I have tried this:
// By default get_headers uses a GET request to fetch the headers. If you
// want to send a HEAD request instead, you can do so using a stream context:
stream_context_set_default(
    array(
        'http' => array(
            'method' => 'HEAD'
        )
    )
);
$headers = get_headers('http://example.com');
It's running in a for loop.
There is a lot of latency in servers replying, so this problem lends itself to parallelising. Try splitting the list into a number of sublists and running scripts in parallel, each one processing a different list.
Try looking at the split command to generate the lists.
So, you will get something like this:
#!/bin/bash
split -l 1000 urllist.txt tmpurl   # split the big file into 1000-line subfiles called tmpurl*
for p in tmpurl*                   # for all tmpurl* files
do
    # Start a process to check the URLs in that list
    echo start checking file $p in background &
done
wait                               # till all are finished
Where I have put "start checking file $p in background", you would need to supply a simple PHP or shell script that takes a filename as a parameter (or reads from its stdin) and checks all the URLs in that file in a for loop, using whatever method you are already using; a minimal sketch of such a checker follows.
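For illustration, here is a minimal sketch of such a checker in PHP, reusing the HEAD-request idea from the question. The script name check_urls.php, the 10-second timeout and the output format are my own assumptions:
<?php
// check_urls.php -- usage: php check_urls.php tmpurlaa
// Reads one URL per line from the given file and prints the first response line.
stream_context_set_default(
    array(
        'http' => array(
            'method'  => 'HEAD', // HEAD avoids downloading the body
            'timeout' => 10      // assumed timeout so dead hosts don't stall the run
        )
    )
);

$urls = file($argv[1], FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($urls as $url) {
    $headers = @get_headers($url);
    echo $url, "\t", ($headers ? $headers[0] : 'FAILED'), "\n";
}
With that in place, the echo placeholder in the loop would become something like php check_urls.php $p > $p.result &.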
Extra Information:
Just for fun, I made a list of 1,000 URLs and curled the headers from each of them with curl -I -s. In the sequential case, it took 4 minutes 19 seconds. When I used the above script to split the 1,000 URLs into sub-lists of 100 per file and started 10 processes, the entire test took 22 seconds - so a 12x speedup. Splitting the list into sublists of 50 URLs resulted in 20 processes that all completed in 14 seconds. So, as I said, the problem is readily parallelisable.
You can use the mechanize Python module to visit websites and read the responses.
My bash solution:
#!/bin/bash
###############################################
# mailto: ggerman#gmail.com
# checkurls
# https://github.com/ggerman/checkurls
# require curl
###############################################
url() {
  cat urls.csv |
    replace |
    show
}

replace() {
  tr ',' ' '
}

show() {
  awk '{print $1}'
}

url | \
while read -r CMD; do
  echo "$CMD"
  curl -Is "$CMD" | head -n 1
done
I am using something called DAP (https://github.com/rapid7/dap), which helps with handling large files and outputs an ever-growing stream of data.
For example:
curl -s https://scans.io/data/rapid7/sonar.http/20141209-http.gz | zcat | head -n 10 | dap json + select vhost + lines
This command works correctly and outputs 10 lines of IP addresses.
My question is how I can read this data from PHP - in effect, where a data feed is continuous/live (it will end at some point), how can I process each line I'm given?
I've tried piping to it, but the output never gets passed back to me. I don't want to use exec because the data is constantly growing. I think it could be a stream, but I'm not sure that is the case.
For anyone else that finds themselves in the same situation - here is the answer that works for me (it can also be run directly from the command line):
curl -s 'https://scans.io/data/rapid7/sonar.http/20141209-http.gz' | zcat | head -n 1000 | dap json + select vhost + lines | while read line ; do php /your_script/path/file.php "$line" ; done
Then pull out $argv[1] and the data is all yours.
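If you would rather stay inside PHP instead of spawning one PHP process per line, the stream can also be consumed with popen(), which hands you the pipeline's output line by line without ever holding the whole (growing) result in memory. This is only a sketch, using the pipeline from the question:
<?php
// Build the same pipeline as in the question; dap keeps emitting lines
// as the download progresses, so we can process them as they arrive.
$cmd = "curl -s 'https://scans.io/data/rapid7/sonar.http/20141209-http.gz'"
     . " | zcat | dap json + select vhost + lines";

$pipe = popen($cmd, 'r');
if ($pipe === false) {
    die("could not start pipeline\n");
}

while (($line = fgets($pipe)) !== false) {
    $line = trim($line);
    // handle each record here instead of launching a new PHP process for it
    echo $line, "\n";
}

pclose($pipe);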
I have some PHP code that executes after meeting some criteria in if/then statements, which looks something like this:
if (in_array($ext, $video) && ($ext !== "mp4")) {
    exec("ffmpeg -i ".$fileName.".".$ext." -s 640x360 ".$fileName.".mp4");
    /*
    if (successful) {
        unlink($fileName.$ext);
        $status = "Video entry approved. File converted.";
    }
    */
}
As you can see, the issue I'm having is figuring out what should go in place of if(successful). The point of this section of the code is to check the file's extension against an array of known video extensions, excluding files that are already in mp4 format. If it passes this check, ffmpeg should run and convert the file to mp4.
So a few questions here. Firstly, how can I get a status back telling me whether the conversion is running, succeeded, or failed? Secondly, how can this be run asynchronously? That is, if I wanted to convert multiple files, would I be able to do so? And would I be able to limit ffmpeg so it does not take up all of my server's processing power and inadvertently bring the site to a grinding halt?
Or is there a better way to go about converting files than this? I'm pretty sure my method must be crude.
EDIT: In addition to this, how does one run ffmpeg in the background, so that the page can be closed, and/or another instance from the same page can be started up by the user for multiple simultaneous conversions? Is it possible to include a real-time progress status of each conversion?
I like your use of exec. Did you happen to notice the php.net entry for that command? Its signature looks like this:
string exec ( string $command [, array &$output [, int &$return_var ]] )
You're using the $command parameter, but not $output or $return_var. So you basically have two choices here:
1 - Add the $output parameter to your exec command and print_r() it. Once you know what is returned, you can use that information in your logic. (The $output parameter is filled with any output that would have been printed to the console had you run the command there.) Note that you may have to add the option -loglevel verbose to see useful output, and that ffmpeg writes its log to stderr, so append 2>&1 to the command to capture it in $output.
exec("ffmpeg -i -loglevel 'verbose' ".$fileName.".".$ext." -s 640x360 ".$fileName.".mp4", $my_output_var);
print_r($my_output_var);<br>
2 - Add $output and $return_var, then check to make sure $return_var is equal to 0 (which means the command executed successfully). A list of return vars can be found here:
exec("ffmpeg -i ".$fileName.".".$ext." -s 640x360 ".$fileName.".mp4", $my_output_var, $var);
if ($var == 0) {
    // do something
}
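The question also asked about running the conversion in the background so the page can be closed. That is not covered by the exec() return value above, but a common pattern is to detach ffmpeg and write its output to a log file you can poll later. This is only a hedged sketch; the log file name and the ps check are my own assumptions, not part of the answer above:
<?php
// Hypothetical sketch: start ffmpeg detached so the PHP request can finish.
// escapeshellarg() guards against odd characters in the file names.
$in  = escapeshellarg($fileName . "." . $ext);
$out = escapeshellarg($fileName . ".mp4");
$log = escapeshellarg($fileName . ".log");   // assumed log file for progress

// nohup + & detaches the process; exec() returns immediately with the PID.
$pid = exec("nohup ffmpeg -i $in -s 640x360 $out > $log 2>&1 & echo $!");

// Later requests can tail the log for progress, or check whether the PID is alive:
// exec("ps -p " . (int)$pid, $ps_out, $status);  // $status === 0 while it is running
Throttling (the "grinding halt" concern) is usually handled by prefixing the command with nice, or by queueing conversions and only running a fixed number at a time.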
I've read here and cannot really understand how to speed up my simple exec() which basically looks like this:
zcat access_log.201312011745.gz | grep 'id=6' | grep 'id2=10' | head -n10
I've added ini_set('memory_limit', 256); to the top of the PHP file, but the script still takes about 1 minute to run (contrasted with near-instant completion in Penguinet). What can I do to improve it?
I would try some of the following:
Change your exec to just run something simple, like
echo Hello
and see if it still takes so long - if it does, the problem is in the process creation and exec()ing area.
If that runs quickly, try changing the exec to something like:
zcat access_log.201312011745.gz > /dev/null
to see if it is the "zcat" slowing you down
Think about replacing the greps with a "sed" that quits (using "q") as soon as it finds what you are looking for, rather than continuing all the way to the end of the file - since it seems (from your "head") that you are only interested in the first few occurrences of your strings, not all of them. For example, you seem to be looking for lines that contain "id=6" and also "id2=10", so if you use "sed" as below, it may be faster, because "sed" will print the line and stop immediately the moment it finds one with "id=6" followed by "id2=10".
zcat access_log.201312011745.gz | sed -n '/id=6.*id2=10/{p;q}'
The "-n" says "don't print, in general" and then it looks for "id=2" followed by any characters then "id2=10". If it finds that, it prints the line and the "q" makes it quit immediately without looking through to end of file. Note that I am assuming "id=2" comes before "id2=10" on the line. If that is not true, the "sed" will need additional work.
I wrote this in Linux BASH shell, but if there's a better solution in PHP that would be fine.
I need to produce a random selection from an array of 12 elements. This is what I've been doing so far:
# Display/return my_array that's been randomly selected:
# Random 0 to 11:
r=$(( $RANDOM % 12 ))
echo ${my_array[$r]}
Each time the call is made, it randomly selects an element. However, too often it "randomly" selects the same element as last time, sometimes several times in a row. How can this be accomplished in BASH shell or PHP so that the random selection is never a repeat of the last one selected? Thanks!
r=$last
while [ "$last" = "$r" ]
do
r=$(($RANDOM % 12))
done
export last=$r
If you are calling the script again and again, then (supposing the script name is test.sh) you need to call it like . test.sh instead of ./test.sh, which makes the script run in the current shell; otherwise even the export will not help. Alternatively, keeping the last value in a temp file is another robust approach; a PHP sketch of that idea follows.
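Since the question allows PHP as well, here is a hedged sketch of that temp-file idea in PHP; the state file path and the array contents are placeholders:
<?php
// Pick a random element that differs from the previous pick, remembering
// the previous index in a small state file between runs.
$my_array  = array('a','b','c','d','e','f','g','h','i','j','k','l'); // placeholder
$stateFile = '/tmp/last_index';                                      // assumed path

$last = is_file($stateFile) ? (int)file_get_contents($stateFile) : -1;

do {
    $r = mt_rand(0, count($my_array) - 1);
} while ($r === $last);

file_put_contents($stateFile, $r);
echo $my_array[$r], "\n";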
You can create a permutation and then pop values from it
perm=`echo $perm | sed 's/[0-9]\+//'`  #remove the first number from $perm
if [ -z "$perm" ];then                 #if perm == ""
    perm=`shuf -e {0..11}`             #create a new permutation
    #now perm="11 7 0 4 8 9 5 10 6 2 1 3" for example
fi
echo $perm | cut -d' ' -f1             #show the first number from $perm
Note that this script is stateful. It needs to store the generated permutation between executions. It does this by keeping it in a shell variable, $perm. Because a shell script cannot modify the calling shell's environment, you need to execute it inside your current shell:
source ./next-index.sh
having saved the script to a file called next-index.sh.
You could alternatively save $perm to a file between executions; a PHP sketch of that follows.
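The same idea works in PHP, since the question accepts either language. This sketch keeps the shuffled permutation in a file (the path is an assumption) and pops one index per call, reshuffling when it runs out:
<?php
// Pop indices from a stored permutation; regenerate it when exhausted.
// Each index 0..11 appears exactly once per cycle, so there are no immediate
// repeats (except possibly at the boundary between two cycles).
$permFile = '/tmp/perm_state';   // assumed location of the saved permutation

$perm = is_file($permFile)
    ? array_filter(explode(' ', trim(file_get_contents($permFile))), 'strlen')
    : array();

if (empty($perm)) {
    $perm = range(0, 11);
    shuffle($perm);              // new random permutation of 0..11
}

$r = array_shift($perm);         // take the first index
file_put_contents($permFile, implode(' ', $perm));
echo $r, "\n";                   // index into your 12-element array with this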
I have a small issue here: I need to be able to read a file of unknown size. It could be a few hundred lines or many more; the log files change all the time and depend on when I check. I would like a method, in PHP or in Linux, that can read a range of lines from a file. I don't want to have to read the entire file into PHP memory and then remove the lines, because the file may be larger than PHP's allowed memory.
I also want it to use default PHP modules or default Linux tools - I don't want to need to install anything, because it needs to be portable.
Edit:
For the Linux-based options, I would like to be able to supply more than one range; I may need to get a few different ranges of lines. I know how to do this in PHP but not in Linux, and I'd like to avoid reading past lines I have already read.
With awk:
awk 'NR>=10 && NR<=15' FILE
With awk (two ranges):
awk 'NR>=10 && NR<=15 || NR>=26 && NR<=28' FILE
With ed:
echo 2,5p | ed -s FILE
With ed and two ranges:
echo -e "2,5p\n7,8p" | ed -s FILE
Last but not least, a sed solution with two ranges (fastest solution, tested with time):
sed -n '2,5p;7,8p' FILE
What about something like
head -100 FILE | tail -15
which gives you lines 86-100 of FILE.
$ cat input.txt
q
w
e
r
t
y
u
i
$ sed -n 2,5p input.txt
w
e
r
t
$lines_read = 0;
$interesting_lines = array();
while ($lines_read < $last_line_desired &&
       ($line = fgets($filehandle, $buffersize)) !== false) {
    $lines_read++;
    if ($lines_read >= $first_desired_line) {
        $interesting_lines[] = $line;
    }
}
Opening the file handle, choosing a buffer size large enough to cover any expected line length, etc. is up to you.
If you're reading files that are regularly appended to, you should look into the ftell and fseek functions to note where you are in the data and skip past all the old content before reading more; a sketch of that follows.
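A hedged sketch of that ftell/fseek idea, assuming the byte offset is kept in a small state file between runs (both file names here are placeholders):
<?php
// Remember how far we read last time and only parse lines appended since then.
$logFile    = '/var/log/app.log';      // placeholder log path
$offsetFile = '/tmp/app.log.offset';   // assumed place to store the last position

$offset = is_file($offsetFile) ? (int)file_get_contents($offsetFile) : 0;

$fh = fopen($logFile, 'r');
fseek($fh, $offset);                   // skip everything already processed

while (($line = fgets($fh)) !== false) {
    echo $line;                        // handle the new line here
}

file_put_contents($offsetFile, ftell($fh));  // note where we stopped
fclose($fh);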