How to search for strings of webpage without saving content? - php

I know of one method where you can do this:
$url = "http://www.google.com/search?q=test";
$str = file_get_contents($url);
preg_match("/title\/tt\d{7}?/", $str, $matches);
print $matches[0];
But this reads the whole file and then scans for the match. Is there any way I can reduce the time taken by the above matching process?

If you know where in the page to look (e.g. only the first 3000 characters or so), you can use the maxlen parameter of file_get_contents to limit how much is read:
file_get_contents($url, false, null, 0, 3000);
UPDATE
If you don't know where to look in the webpage and you want to minimize http request length, I worked up a nice solution for you :))
$url = "www.google.com";
$step = 3000;
$found = false;
$addr = gethostbyname($url);
$client = stream_socket_client("tcp://$addr:80", $errno, $errorMessage);
if ($client === false) {
    throw new UnexpectedValueException("Failed to connect: $errorMessage");
}
fwrite($client, "GET /search?q=test HTTP/1.0\r\nHost: $url\r\nAccept: */*\r\n\r\n");
$str = "";
while (!feof($client)) {
    $str .= stream_get_contents($client, $step, -1);
    if (preg_match("/tt\d{7}?/", $str, $matches)) {
        $found = true;
        break;
    }
}
fclose($client);
if ($found) {
    echo $matches[0];
} else {
    echo "not found";
}
EXPLANATION:
Set the $step variable to the number of bytes to read on each iteration, and change "search?q=test" to your desired query (IMDB titles, judging by your regex? :) ). It will do the job wonderfully.
You can also echo $str after the while loop to see exactly how much was read before the requested string was found.
I believe this was what you were looking for.
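The read-a-chunk, then-search loop above can be tried without a network connection. The following is a minimal sketch, assuming any readable stream is passed in; findInStream() and its parameters are illustrative names, and fread() is used here in place of stream_get_contents(). On a real HTTP socket the buffer would also contain the response headers.

```php
<?php
// Sketch: scan a stream in chunks and stop at the first regex match.
// findInStream() is a hypothetical helper, not part of the answer above.
function findInStream($stream, string $pattern, int $step = 3000): ?string
{
    $buffer = '';
    while (!feof($stream)) {
        $chunk = fread($stream, $step);
        if ($chunk === false || $chunk === '') {
            break; // read error or nothing more to read
        }
        // Keep everything read so far, so that a match straddling
        // two chunks is still found.
        $buffer .= $chunk;
        if (preg_match($pattern, $buffer, $matches)) {
            return $matches[0];
        }
    }
    return null; // end of stream without a match
}

// Usage with an in-memory stream standing in for the socket:
$stream = fopen('php://memory', 'w+');
fwrite($stream, str_repeat('x', 5000) . '/title/tt0111161/' . str_repeat('y', 5000));
rewind($stream);
echo findInStream($stream, '/tt\d{7}/', 3000), PHP_EOL;
fclose($stream);
```

Because the whole buffer is searched again after every chunk, this trades some CPU time for simplicity; for very large pages one could search only the tail of the buffer instead.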

Related

PHP Strip domain name from url

I know there is a LOT of info on the web about this subject, but I can't seem to figure it out the way I want.
I'm trying to build a function which strips the domain name from a url:
http://blabla.com blabla
www.blabla.net blabla
http://www.blabla.eu blabla
Only the plain name of the domain is needed.
With parse_url I get the domain filtered, but that is not enough.
I have 3 functions that strip the domain, but I still get some wrong outputs:
function prepare_array($domains)
{
$prep_domains = explode("\n", str_replace("\r", "", $domains));
$domain_array = array_map('trim', $prep_domains);
return $domain_array;
}
function test($domain)
{
$domain = explode(".", $domain);
return $domain[1];
}
function strip($url)
{
$url = trim($url);
$url = preg_replace("/^(http:\/\/)*(www.)*/is", "", $url);
$url = preg_replace("/\/.*$/is" , "" ,$url);
return $url;
}
Every possible domain, URL and extension is allowed. After the function is finished, it must return an array of only the domain names themselves.
UPDATE:
Thanks for all the suggestions!
I figured it out with the help from you all.
function test($url)
{
    // Check if the url begins with http://, www. or both
    // If so, strip it
    if (preg_match("/^(http:\/\/|www\.)/i", $url))
    {
        $domain = preg_replace("/^(http:\/\/)*(www\.)*/is", "", $url);
    }
    else
    {
        $domain = $url;
    }
    // Now all that's left is the domain and the extension
    // Only return the needed first part without the extension
    $domain = explode(".", $domain);
    return $domain[0];
}
How about
$wsArray = explode(".",$domain); //Break it up into an array.
$extension = array_pop($wsArray); //Get the Extension (last entry)
$domain = array_pop($wsArray); // Get the domain
http://php.net/manual/en/function.array-pop.php
Ah, your problem lies in the fact that TLDs can be either in one or two parts e.g .com vs .co.uk.
What I would do is maintain a list of TLDs. With the result after parse_url, go over the list and look for a match. Strip out the TLD, explode on '.' and the last part will be in the format you want it.
This does not seem as efficient as it could be but, with TLDs being added all the time, I cannot see any other deterministic way.
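The list-based approach described above could look roughly like this. It is only a sketch: domainName() is an illustrative name, and the four suffixes in the usage example are a tiny sample, not a real TLD list.

```php
<?php
// Sketch of the suffix-list idea: find the longest known suffix,
// strip it, and return the label directly to its left.
// domainName() and the sample list below are hypothetical.
function domainName(string $url, array $suffixes): ?string
{
    // parse_url() only reports a host when a scheme is present
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        $host = parse_url('http://' . $url, PHP_URL_HOST);
    }
    if (!is_string($host) || $host === '') {
        return null;
    }
    $host = strtolower($host);

    // Try longer suffixes first so that "co.uk" wins over "uk"
    usort($suffixes, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    foreach ($suffixes as $suffix) {
        $tail = '.' . $suffix;
        if (strlen($host) > strlen($tail) && substr($host, -strlen($tail)) === $tail) {
            $rest = substr($host, 0, -strlen($tail));
            $parts = explode('.', $rest);
            return end($parts); // e.g. "blabla" from "www.blabla"
        }
    }
    return null; // suffix is not in the list
}

$suffixes = ['com', 'net', 'eu', 'co.uk']; // illustrative sample only
foreach (['http://blabla.com', 'www.blabla.net', 'http://www.blabla.eu'] as $url) {
    echo domainName($url, $suffixes), PHP_EOL;
}
```

Returning null for an unknown suffix makes gaps in the list visible instead of silently producing a wrong name.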
OK... this is messy, and you should spend some time optimizing and caching previously derived domains. You also need a friendly name server, and the last catch is that the domain must have an "A" record in its DNS.
This attempts to assemble the domain name in reverse order until it resolves to a DNS "A" record.
At any rate, this was bugging me, so I hope this answer helps:
<?php
$wsHostNames = array(
"test.com",
"http://www.bbc.com/news/uk-34276525",
"google.uk.co"
);
foreach ($wsHostNames as $hostName) {
echo "checking $hostName" . PHP_EOL;
$wsWork = $hostName;
//attempt to strip out full paths to just host
$wsWork = parse_url($hostName, PHP_URL_HOST);
if ($wsWork != "") {
echo "Was able to cleanup $wsWork" . PHP_EOL;
$hostName = $wsWork;
} else {
//Probably had no path info or malformed URL
//Try to check it anyway
echo "No path to strip from $hostName" . PHP_EOL;
}
$wsArray = explode(".", $hostName); //Break it up into an array.
$wsHostName = "";
//Build the domain one segment at a time
//Code should be modified not to check for the first segment (.com)
while (!empty($wsArray)) {
$newSegment = array_pop($wsArray);
$wsHostName = $newSegment . $wsHostName;
echo "Checking $wsHostName" . PHP_EOL;
if (checkdnsrr($wsHostName, "A")) {
echo "host found $wsHostName" . PHP_EOL;
echo "Domain is $newSegment" . PHP_EOL;
continue 2;
} else {
//This segment didn't resolve - keep building
echo "No Valid A Record for $wsHostName" . PHP_EOL;
$wsHostName = "." . $wsHostName;
}
}
//if you get to here in the loop it could not resolve the host name
}
?>
Try preg_replace, something like:
$domain = preg_replace($regex, '$1', $url);
where $regex is a pattern whose first capture group is the domain name.

PHP string comparison. Regex

We are trying to display whether a file contains a specific string or not:
Here we read the file:
$myFile = "filename.txt";
$fh = fopen($myFile,'r');
$theData = fread($fh, filesize("filename.txt"));
fclose($fh);
filename.txt contains "Offline"
Here we are trying to compare the strings:
if(strcmp($theData,"Online")==0){
echo "Online"; }
elseif(strcmp($theData,"Offline")==0) {
echo "Offline"; }
else {
echo "This IF is not working."; }
We have tried using a regular if without strcmp, but it did not work either. I'm thinking that an if cannot compare the result of fread to a regular string. Perhaps we will need to try another method.
Any ideas?
Use preg_match()
$string = "your-string";
$pattern = "/\boffline\b/i";
// The \b in the pattern indicates a word boundary, so only the distinct
// word "offline" is matched; if you want to match even partial word "offline"
// within some word, change the pattern to this /offline/i
if(preg_match($pattern, $string)) {
echo "A match was found.";
}
You can use strpos() as well (it is faster in this case)
$string = 'your-stringoffline';
$find = 'offline';
$pos = strpos($string, $find);
if($pos !== false){
echo "The string '$find' was found in the string '$string' at position $pos";
}else{
echo "The string '$find' was not found in the string '$string'";
}
Regex is slower than strpos for a plain substring search, especially in long strings. Use strpos:
$strFile = file_get_contents("filename.txt"); // load file
if(strpos($strFile, 'Online')!==false){ // check if "Online" exists
echo "We are Online";
}
elseif(strpos($strFile, 'Offline')!==false){ // check if "Offline" exists
echo "We are Offline";
}
else{ // other cases
echo "Status is unknown";
}
Here is another way to do it (depending on what is inside the file); although it is not the best, it may be useful in some circumstances:
if (exec("grep Offline filename.txt") === 'Offline')
echo 'Offline';
else
echo 'Online';
Byee
Have you checked the value contained in $theData?
Try something like this:
if(strcmp($theData,"Online") === 0)
    echo $theData." is equal to string Online (case-sensitive)";
else if(strcmp($theData,"Offline") === 0)
    echo $theData." is equal to string Offline (case-sensitive)";
else
    echo $theData." This IF is not working.";
Here is the doc for more info: http://php.net//manual/en/function.strcmp.php
Or using hex494D49's method (not tested):
function isStringAreTheSame($initialString, $stringToCompare) {
    // preg_quote() escapes any regex metacharacters in the search string
    $pattern = "/\b" . preg_quote($initialString, "/") . "\b/";
    return preg_match($pattern, $stringToCompare);
}
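One thing worth checking in $theData is trailing whitespace: if the file ends with a newline, an exact strcmp against "Offline" fails. That is an assumption about the file, since its exact bytes are not shown in the question. A minimal sketch, with readStatus() as a hypothetical helper name:

```php
<?php
// Sketch: normalize the file contents before an exact comparison.
// readStatus() is an illustrative name, not from the thread.
function readStatus(string $contents): string
{
    $value = trim($contents); // drops spaces, tabs, "\n" and "\r" at both ends
    if (strcmp($value, 'Online') === 0) {
        return 'Online';
    }
    if (strcmp($value, 'Offline') === 0) {
        return 'Offline';
    }
    return 'Unknown';
}

var_dump(strcmp("Offline\n", 'Offline') === 0); // bool(false): the newline breaks the match
echo readStatus("Offline\n"), PHP_EOL;          // Offline
```

In the question's code this would mean comparing trim($theData) rather than $theData.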

Split string in to multiple parts in PHP

I'm writing an IRC bot in PHP and trying to split the below notice down in to multiple parts.
:irc.server.com NOTICE PHPServ :*** CONNECT: Client connecting on port 6667 (class users): Guest!Guest@127.0.0.1 (127.0.0.1) [Guest]<br />
So far I am using:
while (1) {
    while ($data = fgets($socket)) {
        echo nl2br($data);
        flush();
        $ex = explode(' ', $data);
        if ($ex[0] == "PING") {
            fputs($socket, "PONG " . $ex[1] . "\n");
        }
        if ($ex[1] == "NOTICE") {
            if ($ex[6] == "connecting") {
                $userstring = $ex[12];
                $usernick = strstr($userstring, '!', true);
                $userip = strstr($userstring, '@');
            }
        }
    }
}
?>
So $usernick is working OK, but $userip includes the @ as well as the IP address. Why does this include the @ when the nickname doesn't include the !?
Also, how can I get $userident, which is between the ! and the @?
Try:
$userParts = explode('@', $userstring);
$userip = end($userParts);
The reason it includes the '@' is (source: http://php.net/strstr):
Returns part of haystack string starting from and including the first
occurrence of needle to the end of haystack.
To solve this you could use substr like this:
$userstring = $ex[12];
$exPos = strpos($userstring, '!');
$atPos = strpos($userstring, '@');
$usernick = strstr($userstring, '!', true);
$userip = substr($userstring, $atPos + 1);
$userident = substr($userstring, ($exPos + 1), ($atPos - $exPos) - 1);
I left the first strstr because it's easier to read/understand than a substring call.
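As an alternative to the position arithmetic, all three parts can be captured with one regular expression. This is a sketch that assumes the token has the usual IRC nick!ident@host form; parseHostmask() is an illustrative name.

```php
<?php
// Sketch: split "nick!ident@host" with a single regular expression.
// parseHostmask() is a hypothetical helper, not from the answers above.
function parseHostmask(string $mask): ?array
{
    if (!preg_match('/^([^!]+)!([^@]+)@(.+)$/', $mask, $m)) {
        return null; // not in nick!ident@host form
    }
    return ['nick' => $m[1], 'ident' => $m[2], 'host' => $m[3]];
}

print_r(parseHostmask('Guest!Guest@127.0.0.1'));
```

Returning null for anything that does not match avoids the silent false/empty values that strstr and substr give on unexpected input.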

Get youtube id for all url types

The following code works with all YouTube domains except for youtu.be. An example would be: http://www.youtube.com/watch?v=ZedLgAF9aEg would turn into: ZedLgAF9aEg
My question is how would I be able to make it work with http://youtu.be/ZedLgAF9aEg.
I'm not so great with regex so your help is much appreciated. My code is:
$text = preg_replace("#[&\?].+$#", "", preg_replace("#http://(?:www\.)?youtu\.?be(?:\.com)?/(embed/|watch\?v=|\?v=|v/|e/|.+/|watch.*v=|)#i", "", $text)); }
$text = (htmlentities($text, ENT_QUOTES, 'UTF-8'));
Thanks again!
//$url = 'http://www.youtube.com/watch?v=ZedLgAF9aEg';
$url = 'http://youtu.be/ZedLgAF9aEg';
if (FALSE === strpos($url, 'youtu.be/')) {
parse_str(parse_url($url, PHP_URL_QUERY), $id);
$id = $id['v'];
} else {
$id = basename($url);
}
echo $id; // ZedLgAF9aEg
Will work for both versions of URLs. Do not use regex for this as PHP has built in functions for parsing URLs as I have demonstrated which are faster and more robust against breaking.
Your regex appears to solve the problem as it stands now? I didn't try it in php, but it appears to work fine in my editor.
The first part of the regex, http://(?:www\.)?youtu\.?be(?:\.com)?/, matches http://youtu.be/, and the second part, (embed/|watch\?v=|\?v=|v/|e/|.+/|watch.*v=|), ends with |), which means it can match nothing (making it optional). In other words, it would trim away http://youtu.be/, leaving only the id.
A more intuitive way of writing it would be to make the whole group optional, I suppose, but as far as I can tell your regex already solves your problem:
#http://(?:www\.)?youtu\.?be(?:\.com)?/(embed/|watch\?v=|\?v=|v/|e/|.+/|watch.*v=)?#i
Note: Your regex would work with the www.youtu.be.com domain as well. It would be stripped away, but something to watch out for if you use this for validating input.
Update:
If you want to only match urls inside [youtube][/youtube] tags you could use look arounds.
Something along the lines of:
(?<=\[youtube\])(?:http://(?:www\.)?youtu\.?be(?:\.com)?/(?:embed/|watch\?v=|\?v=|v/|e/|[^\[]+/|watch.*v=)?)(?=.+\[/youtube\])
You could further refine it by making the .+ in the look ahead only match valid URL characters etc.
Try this, hope it'll help you
function YouTubeUrl($url)
{
    if ($url != '') {
        $newUrl = '';
        $videoLink1 = $url;
        $findKeyWord = 'youtu.be';
        // Already a full watch URL: just make sure it has a scheme
        if (IsContain($videoLink1, 'watch?v=')) {
            $newUrl = tMakeUrl($videoLink1);
        } else if (IsContain($videoLink1, $findKeyWord)) {
            // Short youtu.be link: the video ID is the last path segment
            $videoLinkArray = explode('/', $videoLink1);
            $Protocol = '';
            if (IsContain($videoLink1, '://')) {
                $protocolArray = explode('://', $videoLink1);
                $Protocol = $protocolArray[0];
            }
            $file = $videoLinkArray[count($videoLinkArray) - 1];
            $newUrl = 'www.youtube.com/watch?v=' . $file;
            if ($Protocol != '')
                $newUrl = $Protocol . '://' . $newUrl;
            else
                $newUrl = tMakeUrl($newUrl);
        } else
            $newUrl = tMakeUrl($videoLink1);
        return $newUrl;
    }
    return '';
}
function IsContain($string,$findKeyWord)
{
if(strpos($string,$findKeyWord)!==false)
return true;
else
return false;
}
function tMakeUrl($url)
{
$tSeven=substr($url,0,7);
$tEight=substr($url,0,8);
if($tSeven!="http://" && $tEight!="https://")
{
$url="http://".$url;
}
return $url;
}
You can use the function below for any YouTube URL.
I hope this will help you.
function checkYoutubeId($id)
{
    // urlencode() keeps characters such as & and ? in the video URL from breaking the query string
    $youtube = "http://www.youtube.com/oembed?url=" . urlencode($id) . "&format=json";
    $curl = curl_init($youtube);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    $return = curl_exec($curl);
    curl_close($curl);
    return json_decode($return, true);
}
This function returns the YouTube video details if the given URL matches a YouTube video.
A small improvement to @rvalvik's answer would be to cover mobile links (I noticed this while working with a customer who used an iPad to navigate, copy and paste links). In this case there is an m (mobile) instead of www. The regex then becomes:
#(https?://)?(?:www\.)?(?:m\.)?(?:youtu\.be/|youtube\.com(?:/embed/|/v/|/watch?.*?v=))([\w\-]{10,12}).*#x
Hope it helps.
A slight improvement of another answer:
if (strpos($url, 'feature=youtu.be') !== FALSE || strpos($url, 'youtu.be') === FALSE )
{
parse_str(parse_url($url, PHP_URL_QUERY), $id);
$id = $id['v'];
}
else
{
$id = basename($url);
}
This takes into account youtu.be appearing in the URL without being the host (it does happen!), as it could be the referring feature parameter.
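Deciding on the parsed host, rather than searching the whole URL for "youtu.be", sidesteps the feature=youtu.be case entirely. The following is a sketch under that assumption; youtubeId() is an illustrative name, and only the watch?v= and youtu.be forms discussed in this thread are handled.

```php
<?php
// Sketch: choose how to extract the ID based on the host.
// youtubeId() is a hypothetical helper; embed/ and v/ paths are not handled.
function youtubeId(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return null;
    }
    $host = strtolower($parts['host']);
    if ($host === 'youtu.be') {
        // Short link: the ID is the path without its leading slash
        $id = trim($parts['path'] ?? '', '/');
        return $id !== '' ? $id : null;
    }
    if (!preg_match('/(^|\.)youtube\.com$/', $host)) {
        return null; // not a YouTube host
    }
    // Long link: read the v parameter from the query string
    parse_str($parts['query'] ?? '', $query);
    return isset($query['v']) && is_string($query['v']) ? $query['v'] : null;
}

echo youtubeId('http://www.youtube.com/watch?v=ZedLgAF9aEg&feature=youtu.be'), PHP_EOL;
echo youtubeId('http://youtu.be/ZedLgAF9aEg'), PHP_EOL;
```

Because the short-link branch reads only the path, a trailing query string such as ?t=10 does not end up in the ID.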
Other answers miss the point that some YouTube links are part of a playlist and also have a list parameter, which is required for the embed code. To extract the embed code from a link, one could try this JS code:
let urlEmbed = "https://www.youtube.com/watch?v=iGGolqb6gDE&list=PL2q4fbVm1Ik6DCzm9XZJbNwyHtHGclcEh&index=32"
let embedId = urlEmbed.split('v=')[1];
let parameterStringList = embedId.split('&');
if (parameterStringList.length > 1) {
embedId = parameterStringList[0];
let listString = parameterStringList.filter((parameterString) =>
parameterString.includes('list')
);
if (listString.length > 0) {
listString = listString[0].split('=')[1];
embedId = `${parameterStringList[0]}?${listString}`;
}
}
console.log(embedId)
Try it out here: https://jsfiddle.net/AMITKESARI2000/o62dwj7q/
try this :
$string = explode("=","http://www.youtube.com/watch?v=ZedLgAF9aEg");
echo $string[1];
would turn into: ZedLgAF9aEg

multi-thread, multi-curl crawler in PHP

Hi everyone once again!
We need some help developing and implementing multi-curl functionality in our crawler. We have a huge array of "links to be scanned" and we loop through them with a foreach.
Let's use some pseudo code to understand the logic:
1) While ($links_to_be_scanned > 0).
2) Foreach ($links_to_be_scanned as $link_to_be_scanned).
3) Scan_the_link() and run some other functions.
4) Extract the new links from the xdom.
5) Push the new links into $links_to_be_scanned.
6) Push the current link into $links_already_scanned.
7) Remove the current link from $links_to_be_scanned.
Now, we need to define a maximum number of parallel connections and be able to run this process for each link in parallel.
I understand that we're gonna have to create a $links_being_scanned or some kind of queue.
I'm really not sure how to approach this problem to be honest, if anyone could provide some snippet or idea to solve it, it would be greatly appreciated.
Thanks in advance!
Chris;
Extended:
I just realized that the tricky part is not the multi-curl itself, but the number of operations performed on each link after the request.
Even after the multi-curl, I would eventually have to find a way to run all these operations in parallel. The whole algorithm described below would have to run in parallel.
So now rethinking, we would have to do something like this:
While (There's links to be scanned)
Foreach ($links_to_be_scanned as $link)
If (There's less than 10 scanners running)
Launch_a_new_scanner($link)
Remove the link from $links_to_be_scanned array
Push the link into $links_on_queue array
Endif;
And each scanner does (This should be run in parallel):
Create an object with the given link
Send a curl request to the given link
Create a dom and an Xdom with the response body
Perform other operations over the response body
Remove the link from the $links_on_queue array
Push the link into the $links_already_scanned array
I assume we could approach this by creating a new PHP file with the scanner algorithm and using pcntl_fork() for each parallel process?
Even using multi-curl, I would eventually have to wait, looping over a regular foreach structure, for the other processes.
I assume I would have to approach this using fsockopen or pcntl_fork.
Suggestions, comments, partial solutions, and even a "good luck" will be more than appreciated!
Thanks a lot!
DISCLAIMER: This answer links an open-source project with which I'm involved. There. You've been warned.
The Artax HTTP client is a socket-based HTTP library that (among other things) offers custom control over the number of concurrent open socket connections to individual hosts while making multiple asynchronous HTTP requests.
Limiting the number of concurrent connections is easily accomplished. Consider:
<?php
use Artax\Client, Artax\Response;
require dirname(__DIR__) . '/autoload.php';
$client = new Client;
// Defaults to max of 8 concurrent connections per host
$client->setOption('maxConnectionsPerHost', 2);
$requests = array(
'so-home' => 'http://stackoverflow.com',
'so-php' => 'http://stackoverflow.com/questions/tagged/php',
'so-python' => 'http://stackoverflow.com/questions/tagged/python',
'so-http' => 'http://stackoverflow.com/questions/tagged/http',
'so-html' => 'http://stackoverflow.com/questions/tagged/html',
'so-css' => 'http://stackoverflow.com/questions/tagged/css',
'so-js' => 'http://stackoverflow.com/questions/tagged/javascript'
);
$onResponse = function($requestKey, Response $r) {
echo $requestKey, ' :: ', $r->getStatus();
};
$onError = function($requestKey, Exception $e) {
echo $requestKey, ' :: ', $e->getMessage();
};
$client->requestMulti($requests, $onResponse, $onError);
IMPORTANT: In the above example the Client::requestMulti method makes all the specified requests asynchronously. Because the per-host concurrency limit is set to 2, the client will open new connections for the first two requests and subsequently reuse those same sockets for the other requests, queuing requests until one of the two sockets becomes available.
You could try something like this; I haven't checked it, but you should get the idea:
$request_pool = array();
function CreateHandle($url) {
$handle = curl_init($url);
// set curl options here
return $handle;
}
function Process($data) {
global $request_pool;
// do something with data
array_push($request_pool , CreateHandle($some_new_url));
}
function RunMulti() {
global $request_pool;
$multi_handle = curl_multi_init();
$active_request_pool = array();
$running = 0;
$active_request_count = 0;
$active_request_max = 10; // adjust as necessary
do {
$waiting_request_count = count($request_pool);
while(($active_request_count < $active_request_max) && ($waiting_request_count > 0)) {
$request = array_shift($request_pool);
curl_multi_add_handle($multi_handle , $request);
$active_request_pool[(int)$request] = $request;
$waiting_request_count--;
$active_request_count++;
}
curl_multi_exec($multi_handle , $running);
curl_multi_select($multi_handle);
while($info = curl_multi_info_read($multi_handle)) {
$curl_handle = $info['handle'];
call_user_func('Process' , curl_multi_getcontent($curl_handle));
curl_multi_remove_handle($multi_handle , $curl_handle);
curl_close($curl_handle);
$active_request_count--;
}
// re-count the pool here: Process() may have queued new handles
} while($active_request_count > 0 || count($request_pool) > 0);
curl_multi_close($multi_handle);
}
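The bookkeeping from the question (links to be scanned, links already scanned, each link queued only once) can be tested separately from the HTTP layer. The sketch below is sequential and does no networking; crawl() and $fetchLinks are hypothetical names, with $fetchLinks standing in for "request the page and extract its links". A parallel version would hand a batch of links to something like RunMulti() instead of calling $fetchLinks one link at a time.

```php
<?php
// Sketch of the queue bookkeeping only; no HTTP requests are made.
// crawl() and $fetchLinks are illustrative, not from the answers above.
function crawl(array $seeds, callable $fetchLinks, int $maxPages = 100): array
{
    $toScan = array_values(array_unique($seeds));
    $seen = array_fill_keys($toScan, true); // every URL ever queued
    $scanned = [];

    while ($toScan && count($scanned) < $maxPages) {
        $link = array_shift($toScan);
        foreach ($fetchLinks($link) as $newLink) {
            if (!isset($seen[$newLink])) { // queue each URL only once
                $seen[$newLink] = true;
                $toScan[] = $newLink;
            }
        }
        $scanned[] = $link;
    }
    return $scanned;
}

// Usage with a fake link graph instead of real pages:
$graph = ['a' => ['b', 'c'], 'b' => ['a', 'c'], 'c' => ['d'], 'd' => []];
$order = crawl(['a'], function ($url) use ($graph) {
    return $graph[$url] ?? [];
});
echo implode(',', $order), PHP_EOL;
```

Keeping $seen as an array keyed by URL makes the "already queued?" check a hash lookup rather than an in_array() scan, which matters once the link list gets large.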
You should look for a more robust solution to your problem. RabbitMQ is a very good one that I have used. There is also Gearman, but the choice is yours.
I prefer RabbitMQ.
I will share the code I used to collect email addresses from a certain website.
You can modify it to fit your needs.
There were some problems with relative URLs there.
And I do not use cURL here.
<?php
error_reporting(E_ALL);
$home = 'http://kharkov-reklama.com.ua/jborudovanie/';
$writer = new RWriter('C:\parser_13-09-2012_05.txt');
set_time_limit(0);
ini_set('memory_limit', '512M');
function scan_page($home, $full_url, &$writer) {
static $done = array();
$done[] = $full_url;
// Scan only internal links. Do not scan all the internet!))
if (strpos($full_url, $home) === false) {
return false;
}
$html = @file_get_contents($full_url);
if (empty($html) || (strpos($html, '<body') === false && strpos($html, '<BODY') === false)) {
return false;
}
echo $full_url . '<br />';
preg_match_all('/([A-Za-z0-9_\-]+\.)*[A-Za-z0-9_\-]+@([A-Za-z0-9][A-Za-z0-9\-]*[A-Za-z0-9]\.)+[A-Za-z]{2,4}/', $html, $emails);
if (!empty($emails) && is_array($emails)) {
foreach ($emails as $email_group) {
if (is_array($email_group)) {
foreach ($email_group as $email) {
if (filter_var($email, FILTER_VALIDATE_EMAIL)) {
$writer->write($email);
}
}
}
}
}
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER);
if (is_array($matches)) {
foreach($matches as $match) {
if (!empty($match[2]) && is_scalar($match[2])) {
$url = $match[2];
if (!filter_var($url, FILTER_VALIDATE_URL)) {
$url = $home . $url;
}
if (!in_array($url, $done)) {
scan_page($home, $url, $writer);
}
}
}
}
}
class RWriter {
private $_fh = null;
private $_written = array();
public function __construct($fname) {
$this->_fh = fopen($fname, 'w+');
}
public function write($line) {
if (in_array($line, $this->_written)) {
return;
}
$this->_written[] = $line;
echo $line . '<br />';
fwrite($this->_fh, "{$line}\r\n");
}
public function __destruct() {
fclose($this->_fh);
}
}
scan_page($home, 'http://kharkov-reklama.com.ua/jborudovanie/', $writer);
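The relative URL problem mentioned above comes from prepending $home to every href that is not already absolute. A somewhat more careful resolution could look like the sketch below. resolveUrl() is an illustrative name, and the sketch is deliberately limited: it does not normalize "../" segments and does not treat query-only or fragment-only references specially.

```php
<?php
// Sketch: turn an href into an absolute URL, relative to the page it came from.
// resolveUrl() is a hypothetical helper, not part of the crawler above.
function resolveUrl(string $pageUrl, string $href): ?string
{
    $base = parse_url($pageUrl);
    if ($base === false || !isset($base['scheme'], $base['host'])) {
        return null; // the page URL itself must be absolute
    }
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $href)) {
        return $href; // already has a scheme (http:, https:, mailto:, ...)
    }
    if (strpos($href, '//') === 0) {
        return $base['scheme'] . ':' . $href; // protocol-relative
    }
    $origin = $base['scheme'] . '://' . $base['host']
        . (isset($base['port']) ? ':' . $base['port'] : '');
    if (strpos($href, '/') === 0) {
        return $origin . $href; // root-relative
    }
    // Document-relative: replace everything after the last "/" of the page path
    $path = $base['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

echo resolveUrl('http://example.com/a/b.html', 'c.html'), PHP_EOL;
echo resolveUrl('http://example.com/a/b.html', '/c.html'), PHP_EOL;
```

In scan_page() this would take the place of the `$url = $home . $url;` line, with $full_url passed as the page URL.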
