Read from live data feed - PHP

I am using something called DAP (https://github.com/rapid7/dap), which helps with handling large files and outputs an ever-growing list of data.
For example:
curl -s https://scans.io/data/rapid7/sonar.http/20141209-http.gz | zcat | head -n 10 | dap json + select vhost + lines
This command works correctly and outputs 10 lines of IP addresses.
My question is: how can I read this data from PHP? In effect, where a data feed is continuous/live (it will end at some point), how can I process each line as it arrives?
I've tried piping to it, but the output never gets passed to my script. I don't want to use exec() because the data is constantly growing. I think it could be a stream, but I'm not sure that is the case.

For anyone else that finds themselves in the same situation - here is the answer that works for me (can be run directly from the command line also):
curl -s 'https://scans.io/data/rapid7/sonar.http/20141209-http.gz' | zcat | head -n 1000 | dap json + select vhost + lines | while read line ; do php /your_script/path/file.php $line ; done
Then pull out $argv[1] and the data is all yours.
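Alternatively, you can read the pipeline directly from PHP and process each line as it arrives. A minimal sketch using popen() - process_line() is a placeholder for your own handling, not part of the original answer:
<?php
// Sketch: read a continuously produced pipeline line by line from PHP.
// The pipeline is the same one shown above.
$cmd = "curl -s 'https://scans.io/data/rapid7/sonar.http/20141209-http.gz' "
     . "| zcat | head -n 1000 | dap json + select vhost + lines";

$handle = popen($cmd, 'r');
if ($handle === false) {
    die("Unable to start pipeline\n");
}

while (($line = fgets($handle)) !== false) {
    process_line(trim($line));   // placeholder for your per-line processing
}

pclose($handle);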

Related

Shell script to grep in a range of files

I'm setting up a script which takes some user data with the read command; using this data I need to search a range of files and then do some filtering.
Here's how it is:
echo "Enter fromtime"
read fromtime
echo "Enter totime"
read totime
echo "Enter the ID"
read id
Initially I SSH into a server, where I have a directory, Records, at /home/report/records. In it I have:
REC_201901020345.gz (yyyymmddhhmm)
REC_201901120405.gz
REC_201903142543.gz
and so on.
These files contain data along with the $id.
The user's $fromtime and $totime input will be in the format yyyymmddhh. I need to look only at the files in that range, then grep them for the $id and display the results. For example:
If $fromtime is 2019010103 and $totime is 2019031425, I need to look at only that specific range of files, i.e. REC_201901020345.gz, REC_201901120405.gz and REC_201903142543.gz, and grep them for the id entered by the user.
I have tried this using an if condition but it doesn't seem to work. I am new to writing scripts like these, so there may be mistakes in how I have described everything. Sorry about that.
source config.sh
echo "Enter fromtime"
read fromtime
echo "Enter totime"
read totime
echo "Enter the ID"
read id
ssh $user@$ip
cd /home/report/records
# <-- need to know what to add here, as described above, to narrow down to the
# <-- range $fromtime-$totime. Then the command to find the id will be
zfgrep $id *.gz
The result should be only the data with the given ID from the specified range of .gz files.
Try the command below.
echo -e "$(ls -1 REC_????????????.gz 2>/dev/null)\nREC_${fromtime}##\nREC_${totime}##" | sort | sed -n "/##/,/##/p" | sed '1d;$d' | xargs zfgrep -a "$id"
Explanation:
'fromtime' and 'totime', each with a ## marker appended, are added to the output of ls.
The input is sorted, so the desired file names end up enclosed between the two marker lines.
The first sed prints only the lines between the markers; the second sed drops the marker lines themselves.
The last part is the command executed for each remaining file name.
You can drop the trailing commands one pipe at a time, starting from the end, to see how the output is built up.
To get the list of files within the given range (fromtime, totime), the following shell script may be used:
declare -i ta
for file in REC*.gz
do
    # extract the yyyymmddhh part of the file name (drop the trailing minutes)
    ta=$(echo "${file}" | grep -oP 'REC_\K(.*)(?=[[:digit:]]{2}\.gz)')
    if [ "${ta}" ] ; then
        if [ ${ta} -le ${totime} -a ${ta} -ge ${fromtime} ] ; then
            echo "${file}"
        fi
    fi
done

Multiple url exists check

I have around 100k URLs in a database and I want to check that all of them are valid. I tried PHP with curl, but it is very slow and the script times out. Is there a better way to do this, perhaps with a shell script?
So far I have tried this:
// By default get_headers uses a GET request to fetch the headers. If you
// want to send a HEAD request instead, you can do so using a stream context:
stream_context_set_default(
    array(
        'http' => array(
            'method' => 'HEAD'
        )
    )
);
$headers = get_headers('http://example.com');
It runs inside a for loop.
There is a lot of latency in servers replying, so this problem lends itself to parallelising. Try splitting the list into a number of sublists and running scripts in parallel, each one processing a different list.
Try looking at the split command to generate the lists.
So, you will get something like this:
#!/bin/bash
split -l 1000 urllist.txt tmpurl   # split big file into 1000-line subfiles called tmpurl*
for p in tmpurl*                   # for all tmpurl* files
do
    # Start a process to check the URLs in that list
    echo start checking file $p in background &
done
wait                               # till all are finished
Where I have put "start checking file $p in background" you would need to supply a simple PHP or shell script that takes a filename as parameter (or reads from its stdin) and does the checking in a for loop of all the URLs in the file however you are already doing it.
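For example, a minimal sketch of such a checker in PHP - the script name check_urls.php and the HEAD-request approach are illustrative assumptions, not part of the original answer:
<?php
// check_urls.php - sketch of a per-file URL checker, invoked as:
//   php check_urls.php tmpurlaa
// Sends a HEAD request for each URL in the given file and prints the status line.
if ($argc < 2) {
    die("Usage: php check_urls.php <url-list-file>\n");
}

stream_context_set_default(array('http' => array('method' => 'HEAD')));

$urls = file($argv[1], FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($urls as $url) {
    // @ suppresses warnings for unreachable hosts; false means the request failed.
    $headers = @get_headers($url);
    echo $url, ' => ', ($headers !== false ? $headers[0] : 'FAILED'), "\n";
}
The background line in the loop above would then become something like: php check_urls.php $p &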
Extra Information:
Just for fun, I made a list of 1,000 URLs and curled the headers from each of them with curl -I -s. In the sequential case, it took 4 minutes 19 seconds. When I used the above script to split the 1,000 URLs into sub-lists of 100 per file and started 10 processes, the entire test took 22 seconds - a 12x speedup. Splitting the list into sublists of 50 URLs resulted in 20 processes that all completed in 14 seconds. So, as I said, the problem is readily parallelisable.
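If you would rather stay inside PHP, the curl_multi_* functions give similar parallelism. This is only a sketch, not part of the original answers; the batch, timeout and HEAD (CURLOPT_NOBODY) settings are illustrative choices:
<?php
// Sketch: check a batch of URLs in parallel with curl_multi.
$urls = array('http://example.com', 'http://example.org');

$mh = curl_multi_init();
$handles = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // send a HEAD request
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers until they are finished.
do {
    $status = curl_multi_exec($mh, $active);
    if ($active) {
        curl_multi_select($mh);
    }
} while ($active && $status == CURLM_OK);

foreach ($handles as $url => $ch) {
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);   // 0 means the request failed
    echo $url, ' => ', $code, "\n";
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
For 100k URLs you would feed the handles in batches of a few hundred rather than adding them all at once.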
You can use the Python mechanize module to visit websites and inspect the responses.
My bash solution:
#!/bin/bash
###############################################
# mailto: ggerman#gmail.com
# checkurls
# https://github.com/ggerman/checkurls
# require curl
###############################################

url() {
  cat urls.csv |
    replace |
    show
}

replace() {
  tr ',' ' '
}

show() {
  awk '{print $1}'
}

url | \
while read CMD; do
  echo $CMD
  curl -Is $CMD | head -n 1
done

PHP exec very slow processing simple 3-pipe grep

I've read here and cannot really understand how to speed up my simple exec(), which basically runs this:
zcat access_log.201312011745.gz | grep 'id=6' | grep 'id2=10' | head -n10
I've added ini_set('memory_limit', 256); to the top of the PHP file, but the script still takes about a minute to run (contrasted with near-instant completion in Penguinet). What can I do to improve it?
I would try some of the following:
Change your exec to just run something simple, like
echo Hello
and see if it still takes so long - if it does, the problem is in the process creation and exec()ing area.
If that runs quickly, try changing the exec to something like:
zcat access_log.201312011745.gz > /dev/null
to see if it is the "zcat" slowing you down
Think about replacing the greps with a "sed" that quits (using "q") as soon as it has found what you are looking for, rather than continuing all the way to the end of the file - since it seems (from your "head") you are only interested in the first few occurrences of your strings, not all of them. For example, you seem to be looking for lines that contain "id=6" and also "id2=10", so if you use "sed" like below, it may be faster because "sed" prints the line and stops the moment it finds one with "id=6" followed by "id2=10":
zcat access_log.201312011745.gz | sed -n '/id=6.*id2=10/{p;q}'
The "-n" says "don't print by default", and the address looks for "id=6" followed by any characters and then "id2=10". When it finds such a line, the "p" prints it and the "q" makes sed quit immediately without reading through to the end of the file. Note that I am assuming "id=6" comes before "id2=10" on the line. If that is not true, the "sed" will need additional work.

Return ranges of lines from text file PHP or LINUX

I have a small issue here. I need to be able to read a file of unknown size; it could be a few hundred lines or many more, since the log files change all the time depending on when I check. I would like a method, in PHP or in Linux, that lets me read a range of lines from a file. I don't want to read the entire file into PHP memory and then remove the lines, because the file may be larger than PHP's allowed memory.
I also want it to use only default PHP modules or default Linux tools; I don't want to have to install anything, because it needs to be portable.
Edit:
For the Linux-based options I would like to be able to supply more than one range, as I may need to get a few different ranges of lines. I know how to do it in PHP but not in Linux, and I'd like to avoid re-reading lines I have already read.
With awk:
awk 'NR>=10 && NR<=15' FILE
With awk (two ranges):
awk 'NR>=10 && NR<=15 || NR>=26 && NR<=28' FILE
With ed:
echo 2,5p | ed -s FILE
With ed and two ranges:
echo -e "2,5p\n7,8p" | ed -s FILE
Last but not least, a sed solution with two ranges (fastest solution, tested with time):
sed -n '2,5p;7,8p' FILE
What about something like
head -100 | tail -15
which gives you lines 86-100.
$ cat input.txt
q
w
e
r
t
y
u
i
$ sed -n 2,5p input.txt
w
e
r
t
$lines_read = 0;
$interesting_lines = array();

// Read until the last desired line (or EOF), keeping only the wanted range.
while ($lines_read < $last_line_desired
        && ($line = fgets($filehandle, $buffersize)) !== false) {
    $lines_read++;
    if ($lines_read >= $first_desired_line) {
        $interesting_lines[] = $line;
    }
}
Opening the file handle, choosing a buffer size large enough to cover any expected line length, etc. is up to you.
If you're reading files that are regularly appended to, you should look into the ftell and fseek functions to note where you are in your data and skip past all the old stuff before reading more.
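A minimal sketch of that ftell()/fseek() approach - the log path and the offset file name are illustrative assumptions:
<?php
// Sketch: resume reading a log that is regularly appended to.
$logfile    = '/var/log/myapp.log';
$offsetfile = '/tmp/myapp.offset';

$offset = is_file($offsetfile) ? (int) file_get_contents($offsetfile) : 0;

$fh = fopen($logfile, 'r');
fseek($fh, $offset);                         // skip everything handled on earlier runs

while (($line = fgets($fh)) !== false) {
    // process only the newly appended lines here
}

file_put_contents($offsetfile, ftell($fh));  // remember where we stopped
fclose($fh);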

php exec suggestions/alternatives

Can anyone give me some pointers with regard to PHP command execution and best practice?
I'm currently trying to parse some NetBackup data, but I am running into issues related to the massive amount of data the system call returns. In order to cut down the amount of data I'm retrieving, I'm doing something like this:
$awk_command = "awk -F, '{print $1\",\"$2\",\"$3\",\"$4\",\"$5\",\"$6\",\"$7\",\"$9\",\"$11\",\"$26\",\"$32\",\"$33\",\"$34\",\"$35\",\"$36\",\"$37\",\"$38\",\"$39\",\"$40}'";
exec("sudo /usr/openv/netbackup/bin/admincmd/bpdbjobs -report -M $master_name -all_columns | $awk_command", $get_backups, $null);
foreach ($get_backups as $backup_detail)
{
    process_the_data();
    write_data_to_db();
}
I'm using awk to limit the amount of data received. Without it I end up receiving nearly 150 MB of data; with it, I get a much more manageable ~800 KB.
You don't need to tell me that the awk stuff is nasty - I know that already... But in the interest of bettering myself (and my code), can anyone suggest an alternative?
I was thinking of something like proc_open but really not sure if that is going to provide any benefits.
Use exec to write the data to a file instead of reading it whole into your script.
exec("sudo /usr/openv/netbackup/bin/admincmd/bpdbjobs -report -M $master_name -all_columns | $awk_command > /tmp/output.data");
Then use any memory-efficient method to read the file in parts.
Have a look here:
Least memory intensive way to read a file in PHP
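For example, a minimal sketch of reading the dumped file line by line - it assumes the /tmp/output.data file written by the exec() call above, containing the comma-separated fields produced by awk:
<?php
// Sketch: read the dumped report line by line instead of loading it all at once.
$fh = fopen('/tmp/output.data', 'r');
if ($fh === false) {
    die("Cannot open /tmp/output.data\n");
}

while (($line = fgets($fh)) !== false) {
    $backup_detail = str_getcsv($line);   // the awk output is comma-separated
    // process_the_data() and write_data_to_db() as in the original loop
}

fclose($fh);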
