Where to put a Crawler script in Laravel project?

I have created a really simple PHP crawler that I want to use in a Laravel project, but I don't know where to put it. I want to start the script once and keep it running while the application is up.
I know that it should not go in a controller or in the cron schedule, so where should I set it up?
$homepage = 'https://example.com';
$already_crawled = [];
$crawling = [];
function follow_links($url){
    global $already_crawled;
    global $crawling;
    $doc = new DOMDocument();
    $doc->loadHTML(file_get_contents($url));
    $linklist = $doc->getElementsByTagName('a');
    foreach ($linklist as $link) {
        $l = $link->getAttribute("href");
        $full_link = 'https://example.com'.$l;
        if (!in_array($full_link, $already_crawled)) {
            $already_crawled[] = $full_link;
            $crawling[] = $full_link;
            echo $full_link.PHP_EOL;
            // Insert data in the DB
        }
    }
    array_shift($crawling);
    foreach ($crawling as $link) {
        follow_links($link);
    }
}
follow_links($homepage);

I would recommend a combination of a Service class, Command, and possibly Jobs — and then running them from worker processes.
Your Service would be a class which contains all of the logic for crawling a page. The crawler service is then used either by an artisan command, a queued job, or a combination of both.
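As a rough illustration of such a service, here is a minimal, framework-free sketch. The class name CrawlerService and its method are hypothetical (they are not part of Laravel), and it only handles absolute and root-relative links on the same site. In a Laravel app a class like this might live under app/Services and be called from a command or a job.

```php
<?php

// Hypothetical service class: all link-extraction logic lives here,
// so a command or a queued job only has to call extractLinks().
final class CrawlerService
{
    /** @var string */
    private $baseUrl;

    public function __construct(string $baseUrl)
    {
        $this->baseUrl = rtrim($baseUrl, '/');
    }

    /**
     * Return the unique same-site links found in one page of HTML.
     *
     * @return string[]
     */
    public function extractLinks(string $html): array
    {
        $doc = new DOMDocument();
        // Real-world HTML is rarely valid; collect parser warnings instead of printing them.
        $previous = libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        $links = [];
        foreach ($doc->getElementsByTagName('a') as $anchor) {
            $href = trim($anchor->getAttribute('href'));
            // Skip empty hrefs, in-page anchors and protocol-relative URLs.
            if ($href === '' || $href[0] === '#' || strpos($href, '//') === 0) {
                continue;
            }
            if ($href[0] === '/') {
                $href = $this->baseUrl . $href; // root-relative link
            }
            // Keep only links on the configured site; other relative forms are ignored in this sketch.
            if ($href === $this->baseUrl || strpos($href, $this->baseUrl . '/') === 0) {
                $links[$href] = true;
            }
        }
        return array_keys($links);
    }
}
```

Keeping the fetching (file_get_contents or an HTTP client) outside this class makes the parsing easy to test without network access.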
You are right that you don't want to run the crawler directly from the built-in Laravel scheduler (because it might run for a long time and prevent other scheduled tasks from running). However, one option is to use your Laravel schedule to run a task which checks for urls that need to be re-crawled and dispatches queued jobs to your worker processes, which are very easy to implement in Laravel.
Each new discovered url can be thought of as a separate task and queued individually for crawling, rather than running the process "continually" while the application is online.
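Treating each URL as its own task replaces the recursion in the question with a work queue. The sketch below shows the idea in a single process: SplQueue stands in for a real Laravel queue, and $fetchLinks is a hypothetical callable that returns the links found on a page. With real queued jobs, each dequeue step would be one job, and the "seen" set would have to live in shared storage such as the database.

```php
<?php

/**
 * Breadth-first crawl that handles one URL per step instead of recursing.
 *
 * @param callable $fetchLinks hypothetical: fn(string $url): string[]
 * @return string[] URLs in the order they were processed
 */
function crawl(string $start, callable $fetchLinks, int $maxPages = 100): array
{
    $queue = new SplQueue();
    $queue->enqueue($start);
    $seen = [$start => true]; // array keys give O(1) duplicate checks
    $visited = [];

    while (!$queue->isEmpty() && count($visited) < $maxPages) {
        $url = $queue->dequeue();
        $visited[] = $url;
        foreach ($fetchLinks($url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue->enqueue($link);
            }
        }
    }
    return $visited;
}
```

The $maxPages limit is an added safety net for the sketch; it keeps a crawl of a very large or cyclic site from running forever.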

Related

rabbitmq and php - Process multiple queues with one worker (broker)

I have 1000 queues with specific names, and I want to process all of them with one broker. Is that possible?
The queue names are stored in a MySQL database, so I would fetch them and run the broker for each one. It should of course run asynchronously and be able to pass a queued item to an idle broker. Is this possible, or do I have to create 1000 files, one per queue name, as brokers?
Update:
This is a picture of my queues. The queues should be processed in parallel, not serially. The users are the producers, and the worker is the consumer that runs the send_message() method.

I can show you how to do it with the enqueue library. I must warn you: there is no way to consume messages asynchronously in one process. You can, however, run a few processes that each serve a set of queues. The queues could be divided into groups by importance.
Install the AMQP transport and consumption library:
composer require enqueue/amqp-ext enqueue/enqueue
Create a consumption script. I assume that you have already fetched an array of queue names from the DB and stored it in the $queueNames variable. The example binds the same processor to all queues, but you can of course set different ones.
<?php
use Enqueue\AmqpExt\AmqpConnectionFactory;
use Enqueue\Consumption\QueueConsumer;
use Enqueue\Psr\PsrMessage;
use Enqueue\Psr\PsrProcessor;
// here's the list of queue names which you fetched from DB
$queueNames = ['foo_queue', 'bar_queue', 'baz_queue'];
$factory = new AmqpConnectionFactory('amqp://');
$context = $factory->createContext();
// create queues at RabbitMQ side, you can remove it if you do not need it
foreach ($queueNames as $queueName) {
    $queue = $context->createQueue($queueName);
    $queue->addFlag(AMQP_DURABLE);
    $context->declareQueue($queue);
}
$consumer = new QueueConsumer($context);
foreach ($queueNames as $queueName) {
    $consumer->bind($queueName, function(PsrMessage $psrMessage) use ($queueName) {
        echo 'Consume the message from queue: '.$queueName;
        // your processing logic.
        return PsrProcessor::ACK;
    });
}
$consumer->consume();
More in doc
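To run several such consumer processes, the queue names have to be split into one group per process. The helper below is a hypothetical sketch that does this round-robin; grouping by importance, as suggested above, would need a priority per queue, which is not shown here. Each resulting group would then be used as $queueNames in one instance of the consumption script.

```php
<?php

/**
 * Split queue names into at most $workers groups, round-robin.
 *
 * @param string[] $queueNames
 * @return string[][] one list of queue names per worker process
 */
function partitionQueues(array $queueNames, int $workers): array
{
    if ($workers < 1) {
        throw new InvalidArgumentException('At least one worker is required');
    }
    // Never create more groups than there are queues (but always at least one).
    $groupCount = min($workers, max(1, count($queueNames)));
    $groups = array_fill(0, $groupCount, []);
    foreach (array_values($queueNames) as $i => $name) {
        $groups[$i % $groupCount][] = $name;
    }
    return $groups;
}
```

With 1000 queues and, say, 10 worker processes, each process would bind 100 queues to its consumer.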

Release Queued Laravel Job without increasing Attempts count

Sometimes I need to release a Laravel job and have it rejoin the queue. However when doing this, the attempts count is increased. It becomes 2 and, if your queue worker is limited to 1 try, it will never be run.
How can I release without increasing attempts?
To release I am using:
$this->release(30);
Prior to this line I have tried the following code:
$payload = json_decode($this->payload, true);
if (isset($payload['attempts'])) {
    $payload['attempts'] = 0;
}
$this->payload = json_encode($payload);
This does not work. The payload property is not available. It seems to be present in the Job class.
The code the Laravel framework uses to reset the count is in the RetryCommand class. It is as follows:
protected function resetAttempts($payload)
{
    $payload = json_decode($payload, true);
    if (isset($payload['attempts'])) {
        $payload['attempts'] = 0;
    }
    return json_encode($payload);
}
But I cannot work out how to access the $payload from my job class?
Is there a better way to release a job without increasing the attempt count?
I am using Laravel 5.4 and Redis queue driver.

So I just ended up deleting and requeuing a new job. Maybe not clean but it does work.
$this->delete();
$job = (new ProcessPage($this->pdf))->onQueue('converting');
dispatch($job);

PHP PECL Threads results order

I have a multithreaded script. The main idea is to run nmap commands in the console and deliver the results in order,
for example:
Results after shell_exec
Command 4
Command 1
Command 2
Command 3
How can I get the results in an orderly manner?
Command 1
Command 2
Command 3
Command 4
public function __construct($arg) {
    $this->arg = $arg;
}
public function run() {
    $salida = shell_exec($comando);
}

If you're launching them in separate threads, the jobs are unlikely to finish in the same order that they were started. You'll need to track them and wait until all are done. You didn't include much of your code, but here's a generic example:
// create jobs
$jobs[0] = new nmapJob($args0);
$jobs[1] = new nmapJob($args1);
...
// start jobs
foreach ($jobs as $job)
{
    $job->start();
}
// wait for jobs to finish
foreach ($jobs as $job)
{
    $job->join();
}
// display results in the order the jobs were created
// (assumes run() stores its output in $this->salida)
foreach ($jobs as $job)
{
    echo($job->salida);
}
But... I suggest using a different technique. Having a shell command dangle like that isn't the best of practices, especially if it can take a while to run (as nmap jobs often do). It's more complicated, but look into running the scans asynchronously. Spawn them as a separate process and have the results dumped into a directory. A different PHP script can be used to process the results in that directory once the scans are done.
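A minimal sketch of the separate-process idea, using only proc_open from core PHP. The function name runInOrder is hypothetical, and unlike the suggestion above it keeps the output in memory instead of dumping it into a directory. All commands are started first so they run in parallel, and the output is then read back in the order the commands were given.

```php
<?php

/**
 * Start all commands at once, then return their stdout in start order.
 *
 * @param string[] $commands shell command lines (escape any user input yourself)
 * @return string[] output per command, with the same keys as $commands
 */
function runInOrder(array $commands): array
{
    $procs = [];
    foreach ($commands as $key => $command) {
        $pipes = [];
        // Only stdout is piped; stdin and stderr are inherited from the parent.
        $proc = proc_open($command, [1 => ['pipe', 'w']], $pipes);
        if (!is_resource($proc)) {
            throw new RuntimeException("Could not start: $command");
        }
        $procs[$key] = [$proc, $pipes[1]];
    }

    $results = [];
    foreach ($procs as $key => [$proc, $stdout]) {
        // Blocks until this process closes stdout; the others keep running meanwhile.
        $results[$key] = stream_get_contents($stdout);
        fclose($stdout);
        proc_close($proc);
    }
    return $results;
}
```

The total wait is roughly that of the slowest command, and the result array always follows the input order, no matter which command finishes first.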

run big loop with parallel threads in PHP CLI

I have a computation-expensive backend process in Symfony2 / PHP that I would like to run multi-threaded.
Since I iterate over thousands of objects, I think I shouldn't start one thread per object. I would like to have a $cores variable that defines how many threads I want in parallel, then iterate through the loop and keep that many threads running. So every time a thread finishes, a new one with the next object should be started, until all objects are done.
Looking at the pthreads documentation and doing some Google searches, I can't find a useable example for this situation. All examples I found have a fixed number of threads they run once, none of them iterate over thousands of objects.
Can someone point me into the right direction to get started? I understand the basics of setting up a thread and joining it, etc. but not how to do it in a loop with a wait condition.

The answer to the question is: use the Pool and Worker abstractions.
The basic idea is that you ::submit Threaded objects to the Pool, which it stacks onto the next available Worker, distributing your Threaded objects (round robin) across all Workers.
Here is some super simple code for PHP 7 (pthreads v3):
<?php
$jobs = [];
while (count($jobs) < 2000) {
    $jobs[] = mt_rand(0, 1999);
}
$pool = new Pool(8);
foreach ($jobs as $job) {
    $pool->submit(new class($job) extends Threaded {
        public function __construct(int $job) {
            $this->job = $job;
        }
        public function run() {
            var_dump($this->job);
        }
    });
}
$pool->shutdown();
?>
The jobs are pointless, obviously. In the real world, I guess your $jobs array keeps growing, so you can just swap foreach for some do {} while, and keep calling ::submit for new jobs.
In the real world, you will want to collect garbage in the same loop (just call Pool::collect with no parameters for default behaviour).
Noteworthy, none of this would be possible if it really were the case that PHP wasn't intended to work in multi-threaded environments ... it definitely is.
That is the answer to the question, but it doesn't make it the best solution to your problem.
You have mentioned in comments that you assume 8 threads executing Symfony code will take up less memory than 8 processes. This is not the case, PHP is shared nothing, all the time. You can expect 8 Symfony threads to take up as much memory as 8 Symfony processes, in fact, a little bit more. The benefit of using threads over processes is that they can communicate, synchronize and (appear to) share with each other.
Just because you can, doesn't mean you should. The best solution for the task at hand is probably to use some ready made package or software intended to do what is required.
Studying this stuff well enough to implement a robust solution is something that will take a long time, and you wouldn't want to deploy that first solution ...
If you decide to ignore my advice, and give it a go, you can find many examples in the github repository for pthreads.

Joe has a good approach, but I found a different solution elsewhere that I am now using. Basically, I have two commands, one control and one worker command. The control command starts background processes and checks their results:
protected function process($worker, $entity, $timeout=60) {
    $min = $this->em->createQuery('SELECT MIN(e.id) FROM BM2SiteBundle:'.$entity.' e')->getSingleScalarResult();
    $max = $this->em->createQuery('SELECT MAX(e.id) FROM BM2SiteBundle:'.$entity.' e')->getSingleScalarResult();
    $batch_size = ceil((($max-$min)+1)/$this->parallel);
    $pool = array();
    for ($i=$min; $i<=$max; $i+=$batch_size) {
        $builder = new ProcessBuilder();
        $builder->setPrefix($this->getApplication()->getKernel()->getRootDir().'/console');
        $builder->setArguments(array(
            '--env='.$this->getApplication()->getKernel()->getEnvironment(),
            'maf:worker:'.$worker,
            $i, $i+$batch_size-1
        ));
        $builder->setTimeout($timeout);
        $process = $builder->getProcess();
        $process->start();
        $pool[] = $process;
    }
    $this->output->writeln($worker.": started ".count($pool)." jobs");
    $running = 99;
    while ($running > 0) {
        $running = 0;
        foreach ($pool as $p) {
            if ($p->isRunning()) {
                $running++;
            }
        }
        usleep(250);
    }
    foreach ($pool as $p) {
        if (!$p->isSuccessful()) {
            $this->output->writeln('fail: '.$p->getExitCode().' / '.$p->getCommandLine());
            $this->output->writeln($p->getOutput());
        }
    }
}
where $this->parallel is a variable I set to 6 on my 8 core machine, it signifies the number of processes to start. Note that this method requires that I iterate over a specific entity (it splits by that), which is always true in my use cases.
It's not perfect, but it starts completely new processes instead of threads, which I consider the better solution.
The worker command takes min and max ID numbers and does the actual work for the set between those two.
This approach works as long as the data set is reasonably well distributed. If you have no data in the 1-1000 range but every ID between 1000 and 2000 is used, the first three processes would have nothing to do.
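The batch arithmetic can be looked at in isolation. The function below is a hypothetical extraction of the range calculation from process(), with one small change: the last range is clamped to $max instead of running past it.

```php
<?php

/**
 * Split the inclusive ID range [$min, $max] into at most $parallel batches.
 *
 * @return int[][] list of [firstId, lastId] pairs
 */
function idBatches(int $min, int $max, int $parallel): array
{
    if ($parallel < 1 || $max < $min) {
        return [];
    }
    // Same formula as in process(): round up so no ID is left over.
    $batchSize = (int) ceil(($max - $min + 1) / $parallel);
    $batches = [];
    for ($start = $min; $start <= $max; $start += $batchSize) {
        $batches[] = [$start, min($start + $batchSize - 1, $max)];
    }
    return $batches;
}

print_r(idBatches(1, 10, 3)); // three ranges: 1-4, 5-8, 9-10
```

The ranges are split by ID value, not by row count, which is why gaps in the ID sequence leave some workers with little or nothing to do.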

Create cronjob with Zend Framework

I am trying to write a cronjob controller so that I can call one URL and have every module's cronjob.php executed. How do I do that?
Would curl be an option, so that I can also count the errors and successes?
[Update]
I guess I did not explain it well enough.
What I want is one file that I can call, for example from http://server/cronjob, and that then executes every /application/modules/*/controller/CronjobController.php. Alternatively, I am open to another way of doing it, as long as the cronjobs are not all in one place but live where their module is located. The advantage would be that if a module does not exist, nothing tries to run its cronjob.
So my question is: how would you execute every module's CronjobController, or would you do it in a completely different way that still stays modular?
I also want to be able to report how many cronjobs ran successfully and how many didn't.

After some research and a lot of procrastination, I came to the simple conclusion that a ZF-ized cron script should contain all the functionality of your Zend Framework app, without all the view stuff. I accomplished this by creating a new cronjobfoo.php file in my application directory. Then I took the bare minimum from:
- my front controller (index.php)
- my bootstrap.php
I took out all the view stuff and focused on keeping the environment setup, DB setup, autoloader, and registry setup. I had to take a little time to correct the document root variable and remove some of the OO functionality copied from my bootstrap.
After that I just coded away. In my case it was compiling and emailing out nightly reports. It was great to use Zend_Mail. When I was confident that my script was working the way I wanted, I just added it to my crontab.
good luck!

For Zend Framework I am currently using the code outlined below. The script only includes the portal file index.php, where all the paths, the environment and other Zendy code are set up. By defining a constant in the cron script, we cancel the final step, where the application is run.
This means the application is only set up, not even bootstrapped. At this point we bootstrap just the resources we need, and that is that:
//public/index.php
if(!defined('DONT_RUN_APP') || DONT_RUN_APP == false) {
    $application->bootstrap()->run();
}
// application/../cron/cronjob.php
define("DONT_RUN_APP",true);
require(realpath('/srv/www/project/public/index.php'));
$application->bootstrap('config');
$application->bootstrap('db');
//cron code follows

I would caution against making your cronjobs accessible to the public, because they could be triggered outside their normal times and, depending on what they do, cause problems (I know that is not what you intend, but by putting them into an actual controller they become reachable from the browser). For example, I have one cron that sends e-mails. I would be spammed constantly if someone found the cron URL and just began hitting it.
What I did was make a cron folder and in there create a heartbeat.php which bootstraps Zend Framework (minus MVC) for me. It checks a database which has a list of all the installed cron jobs and, if it is time for them to run, generates an instance of the cron job's class and runs it.
The cron jobs are just child classes of an abstract cron class that has methods like install(), run(), deactivate(), etc.
To fire off my jobs I just have a simple crontab entry that runs heartbeat.php every 5 minutes. So far it has worked wonderfully on two different sites.
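A rough sketch of how such a heartbeat could look. Everything here is an assumption made for illustration: the intervalSeconds() and isDue() methods are invented, plain arrays stand in for the database table, and install() and deactivate() are left out. It also counts successes and failures, which the question asked for.

```php
<?php

// Hypothetical base class; only run() is named in the description above.
abstract class CronJob
{
    /** Minimum number of seconds between two runs. */
    abstract public function intervalSeconds(): int;

    /** Do the work; return true on success, false on failure. */
    abstract public function run(): bool;

    public function isDue(int $lastRun, int $now): bool
    {
        return $now - $lastRun >= $this->intervalSeconds();
    }
}

/**
 * Run every due job once and count the outcomes.
 *
 * @param CronJob[] $jobs     keyed by job name
 * @param int[]     $lastRuns last run timestamp per job name (stand-in for the DB table)
 * @return int[] counts under the keys 'ok', 'failed' and 'skipped'
 */
function heartbeat(array $jobs, array $lastRuns, int $now): array
{
    $report = ['ok' => 0, 'failed' => 0, 'skipped' => 0];
    foreach ($jobs as $name => $job) {
        if (!$job->isDue($lastRuns[$name] ?? 0, $now)) {
            $report['skipped']++;
            continue;
        }
        try {
            $succeeded = $job->run();
        } catch (Throwable $e) {
            $succeeded = false; // a crashing job must not stop the others
        }
        $report[$succeeded ? 'ok' : 'failed']++;
    }
    return $report;
}
```

A real heartbeat.php would also write the new last-run timestamps back to the database after each job.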

Someone mentioned this blog entry a couple of days ago on fw-general (a mailing list which I recommend reading if you use the Zend Framework).
There is also a proposal for Zend_Controller_Request_Cli, which should address this sooner or later.

I have access to a dedicated server and I initially had a different bootstrap for the cron jobs. I eventually hated the idea, just wishing I could do this within the existing MVC setup and not have to bother about moving things around.
I created a file cron.sh, saved it within my site root (not public), and put in it the series of commands I would like to run. As I wanted to run many commands at once, I wrote the PHP within my controllers as usual and added curl calls to those URLs in cron.sh, for example curl http://www.mysite.com/cron_controller/action. Then in the cron interface I ran bash /path/to/cron.sh.
As pointed out by others, your crons can be fired by anyone who guesses the URL, so there's always that caveat. There are many different ways to solve that.

Take a look at zf-cli:
scripts at master from padraic/ZFPlanet - GitHub
This handles all cron jobs well.

Why not just create a crontab.php, including, or requiring the index.php bootstrap file?
Considering that the bootstrap is executing Zend_Loader::registerAutoload(), you can start working directly with the modules, for instance, myModules_MyClass::doSomething();
That way you are skipping the controllers. The controller's job is to control access via HTTP. In this case, you don't need the controller approach because you are accessing the code locally.

Do you have filesystem access to the modules' directories? You could iterate over the directories and determine where a CronjobController.php is available. Then you could either use Zend_Http_Client to access the controller via HTTP or use an approach like Zend_Test_PHPUnit: simulate the actual dispatch process locally.
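The directory scan could be sketched like this. The function name is hypothetical, and the subdirectory defaults to "controllers" (the usual Zend Framework layout), so pass 'controller' if your modules use the singular name from the question.

```php
<?php

/**
 * Find every module that ships a CronjobController.php.
 *
 * @return string[] controller file path per module name, sorted by module
 */
function findCronControllers(string $modulesDir, string $subDir = 'controllers'): array
{
    $pattern = rtrim($modulesDir, '/') . '/*/' . $subDir . '/CronjobController.php';
    $found = [];
    // glob() returns false on error, so fall back to an empty list.
    foreach (glob($pattern) ?: [] as $path) {
        // Path layout: <modulesDir>/<module>/<subDir>/CronjobController.php
        $found[basename(dirname($path, 2))] = $path;
    }
    ksort($found);
    return $found;
}
```

Modules without the file simply do not appear in the result, so nothing tries to run a cronjob for a module that is missing.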

You could set up a database table to hold references to the cronjob scripts (in your modules), then use an exec command with a return value for pass/fail.

I extended gregor's answer with this post. This is what came out:
//public/index.php
// Run application, only if not started from command line (cli)
if (php_sapi_name() != 'cli' || !empty($_SERVER['REMOTE_ADDR'])) {
    $application->run();
}
Thanks, gregor!

My solution:
curl /cron
Global cron method will include_once all controllers
Check whether each of the controllors has ->cron method
If they have, run those.
Public cron url (for curl) is not a problem, there are many ways to avoid abuse. As said, checking remote IP is the easiest.
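The remote IP check could look like the sketch below. The function name and the default allowlist are assumptions; the server array is passed in as a parameter (normally $_SERVER) so the check can be tested without a web request.

```php
<?php

/**
 * Decide whether the caller may trigger cron actions.
 *
 * @param array    $server     normally $_SERVER
 * @param string[] $allowedIps addresses that may call the cron URL
 */
function isAllowedCronCaller(array $server, array $allowedIps = ['127.0.0.1', '::1']): bool
{
    // CLI runs have no REMOTE_ADDR; treat them as trusted local calls.
    if (empty($server['REMOTE_ADDR'])) {
        return true;
    }
    return in_array($server['REMOTE_ADDR'], $allowedIps, true);
}
```

Note that behind a reverse proxy or load balancer REMOTE_ADDR is the proxy's address, so the allowlist has to be chosen with the deployment in mind.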

This is my way to run Cron Jobs with Zend Framework
In Bootstrap I will keep environment setup as it is minus MVC:
public static function setupEnvironment()
{
    ...
    self::setupFrontController();
    self::setupDatabase();
    self::setupRoutes();
    ...
    if (PHP_SAPI !== 'cli') {
        self::setupView();
        self::setupDbCaches();
    }
    ...
}
Also in Bootstrap, I will modify setupRoutes and add a custom route:
public function setupRoutes()
{
    ...
    if (PHP_SAPI == 'cli') {
        self::$frontController->setRouter(new App_Router_Cli());
        self::$frontController->setRequest(new Zend_Controller_Request_Http());
    }
}
App_Router_Cli is a new router type which determines the controller, action, and optional parameters based on this type of request: script.php controller=mail action=send. I found this new router here: Setting up Cron with Zend Framework 1.11:
class App_Router_Cli extends Zend_Controller_Router_Abstract
{
    public function route (Zend_Controller_Request_Abstract $dispatcher)
    {
        $getopt = new Zend_Console_Getopt (array());
        $arguments = $getopt->getRemainingArgs();
        $controller = "";
        $action = "";
        $params = array();
        if ($arguments) {
            foreach($arguments as $index => $command) {
                $details = explode("=", $command);
                if($details[0] == "controller") {
                    $controller = $details[1];
                } else if($details[0] == "action") {
                    $action = $details[1];
                } else {
                    $params[$details[0]] = $details[1];
                }
            }
            if($action == "" || $controller == "") {
                die("Missing Controller and Action Arguments == You should have:
php script.php controller=[controllername] action=[action]");
            }
            $dispatcher->setControllerName($controller);
            $dispatcher->setActionName($action);
            $dispatcher->setParams($params);
            return $dispatcher;
        }
        // Reached only when no arguments were passed at all
        echo "No command given.\n";
        exit(1);
    }
    public function assemble ($userParams, $name = null, $reset = false, $encode = true)
    {
        // Concatenate the details into the message; the second Exception argument must be an int code
        throw new Exception("Assemble isnt implemented " . print_r($userParams, true));
    }
}
In CronController I do a simple check:
public function sendEmailCliAction()
{
if (PHP_SAPI != 'cli' || !empty($_SERVER['REMOTE_ADDR'])) {
echo "Program cannot be run manually\n";
exit(1);
}
// Each email sent has its status set to 0;
Crontab runs a command of this kind:
* * * * * php /var/www/projectname/public/index.php controller=name action=send-email-cli >> /var/www/projectname/application/data/logs/cron.log
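The key=value parsing that the router does can be written as a standalone function, which makes it easy to test outside Zend Framework. This is a hypothetical sketch, not part of the router above: it splits on the first "=" only, so values may contain "=", and it rejects arguments that have no "=" at all.

```php
<?php

/**
 * Parse arguments such as: controller=name action=send-email-cli id=5
 *
 * @param string[] $arguments
 * @return array with the keys 'controller', 'action' and 'params'
 */
function parseCliRoute(array $arguments): array
{
    $route = ['controller' => '', 'action' => '', 'params' => []];
    foreach ($arguments as $argument) {
        // Split on the first "=" only, so values may contain "=" themselves.
        $parts = explode('=', $argument, 2);
        if (count($parts) !== 2 || $parts[0] === '') {
            throw new InvalidArgumentException("Expected key=value, got: $argument");
        }
        [$key, $value] = $parts;
        if ($key === 'controller' || $key === 'action') {
            $route[$key] = $value;
        } else {
            $route['params'][$key] = $value;
        }
    }
    if ($route['controller'] === '' || $route['action'] === '') {
        throw new InvalidArgumentException('Both controller= and action= are required');
    }
    return $route;
}
```

In a plain CLI script the input would typically be array_slice($argv, 1).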

It doesn't make sense to run the bootstrap from the same directory or from a cron job folder. I use a simpler way to implement cron jobs:
Create a cron job folder, named "cron", "cronjob", or whatever you like.
Different jobs may need to run at different intervals, such as hourly or daily; you can set those up on the server.
Create a file in the cron job folder, for example "init.php". Now let's say you want to send a newsletter to users once per day. You don't need any Zend code in init.php.
Instead, make a curl call in init.php to the URL of your controller action, because the whole purpose is to have that action called every day. For example, the URL could look like this:
https://www.example.com/cron/newsletters
So pass this URL to the curl function and call that function in init.php.
In the URL above, "cron" is the controller and "newsletters" is the action where you do your work, so there is no need to run the bootstrap file separately.
