I have a problem that I need help fixing. I am trying to create a script that crawls websites for mailing addresses, mostly German ones, but I am unsure how to go about it. I have already written one that extracts email addresses from those websites, but postal addresses are puzzling because there isn't a fixed format. Here are a few German addresses as examples of the data I would like to extract:
Ilona Mustermann
Hauptstr. 76
27852 Musterheim
Andreas Mustermann
Schwarzwaldhochstraße 1
27812 Musterhausen
D. Mustermann
Kaiser-Wilhelm-Str.3
27852 Mustach
Those are just a few examples of what I am looking to extract from the websites. Is this possible to do with PHP?
Edit:
This is what I have so far
function extract_address($str) {
$str = strip_tags($str);
$Name = null;
$zcC = null;
$Street = null;
foreach(preg_split('/([^A-Za-z0-9üß\-\#\.\(\) .])+/', $str) as $token) {
if(preg_match('/([A-Za-z\.])+ ([A-Za-z\.])+/', $token)){
$Name = $token;
}
if(preg_match('/ /', $token)){
$Street = $token;
}
if(preg_match('/[0-9]{5} [A-Za-zü]+/', $token)){
$zcC = $token;
}
if(isset($Name) && isset($zcC) && isset($Street)){
echo($Name."<br />".$Street."<br />".$zcC."<br /><br />");
$Name = null;
$Street = null;
$zcC = null;
}
}
}
It manages to retrieve $Name (e.g. Ilona Mustermann) and the zip code/city (27852 Musterheim), but I am unsure of a regex that will always retrieve the street.
This is what I have come up with so far. It works about 60% of the time for streets, while zip/city and name work 100% of the time. Occasionally it fails to extract the street. Any idea why?
function extract_address($str) {
$str = strip_tags($str);
$Name = null;
$zcC = null;
$Street = null;
foreach(preg_split('/([^A-Za-z0-9üß\-\#\.\(\)\& .])+/', $str) as $token) {
if(preg_match('/([A-Za-z\&.])+ ([A-Za-z.])+/', $token) && !preg_match('/([A-Za-zß])+ ([0-9])+/', $token)){
//echo("N:$token<br />");
$Name = $token;
}
if(preg_match('/(\.)+/', $token) || preg_match('/(ß)+/', $token) || preg_match('/([A-Za-zß\.])+ ([0-9])+/', $token)){
$Street = $token;
}
if(preg_match('/([0-9]){5} [A-Za-züß]+/', $token)){
$zcC = $token;
}
/*echo("<br />
N:$Name
<br />
S:$Street
<br />
Z:$zcC
<br />
");*/
if(isset($Name) && isset($zcC) && isset($Street)){
echo($Name."<br />".$Street."<br />".$zcC."<br /><br />");
$Name = null;
$Street = null;
$zcC = null;
}
}
}
Of course it is possible. You need to use the preg_match() function; it is all about writing a good regex pattern.
For example, to get the postcode and city:
<?php
$str = "YOUR ADDRESSES STRING HERE";
preg_match('/([0-9]+) ([A-Za-z]+)/', $str, $matches);
print_r($matches);
?>
This regex matches the addresses you've given; you will also need to add any other native characters you expect:
[A-Za-züß.]+ [A-Za-z.üß]+\s[A-Za-z. 0-9ß-]+\s[0-9]+ [A-Za-züß.]+
It's impossible to get a reliable answer with regex with such a complicated string. That's the only correct answer to this question.
Vlad Bondarenko is right.
In CS speak: Postal addresses do not form a regular language.
Extracting information is an active research topic. Regular expressions are not completely bogus, but will have a higher failure rate than approaches that use dictionaries ("gazetteers") or more advanced machine learning algorithms.
A nice Stack Overflow Q&A on this is "How to parse freeform street/postal address out of text, and into components".
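To make the regex limitations above concrete, here is a minimal sketch of one heuristic: anchor on the "five-digit postcode + city" line, which is the most regular part, and treat the two lines before it as street and name. The function name extract_addresses_sketch is hypothetical, and the code assumes UTF-8 input with one address component per line, which real web pages often will not provide.

```php
<?php
// Hypothetical helper, not from the original posts.
// Assumes UTF-8 text with one address component per line.
function extract_addresses_sketch(string $text): array
{
    // Split on any line break, trim each line, and drop empty lines.
    $rawLines = preg_split('/\R/u', $text) ?: [];
    $lines = array_values(array_filter(array_map('trim', $rawLines), 'strlen'));

    $found = [];
    foreach ($lines as $i => $line) {
        // Anchor: exactly five digits, whitespace, then a city name.
        if ($i < 2 || !preg_match('/^(\d{5})\s+(\p{L}[\p{L} .\-]*)$/u', $line, $m)) {
            continue;
        }
        $street = $lines[$i - 1];
        // Only accept the previous line as a street if it ends in a house number.
        if (!preg_match('/\d+\s*[a-zA-Z]?$/u', $street)) {
            continue;
        }
        $found[] = [
            'name'   => $lines[$i - 2],
            'street' => $street,
            'zip'    => $m[1],
            'city'   => $m[2],
        ];
    }
    return $found;
}
```

This will still miss addresses laid out on a single line or separated by markup, which is why the answers above point to gazetteers and machine learning for anything that has to be reliable.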
Related
I want to echo/print only a certain piece of the input. For example, I have this YouTube URL, http://www.youtube.com/watch?v=p963CeTtJVM. How would I echo only the last piece, "p963CeTtJVM", from the input? As far as I know, it is always 11 characters long.
Code:
if (empty($_POST["website"]))
{$website = "";}
else
{
$website = test_input($_POST["website"]);
// check if URL address syntax is valid (this regular expression also allows dashes in the URL)
if (!preg_match("/\b(?:(?:https?|ftp):\/\/|www\.)[-a-z0-9+&@#\/%?=~_|!:,.;]*[-a-z0-9+&@#\/%=~_|]/i",$website))
{
$websiteErr = "Invalid URL";
}
}
list ($void, $query_string) = explode('?', $url, 2); // or list(, $qs); split() was removed in PHP 7
parse_str($query_string, $data);
var_dump($data);
For this specific string, substr($str, -11) will take the last 11 characters, but that breaks as soon as the URL carries other parameters. Check out parse_str; it will probably save you a headache in the long run.
I hope it can help you.
<?php
$url = 'http://www.youtube.com/watch?v=p963CeTtJVM';
$urlParts = explode('v=', $url);
if (count($urlParts) == 2 && isset($urlParts[1])) {
echo "youtube code : {$urlParts[1]}";
} else {
echo "Invalid Youtube url.";
}
You can use substr method to return part of a string.
You can use the explode function to separate the video ID from the rest of the link like this:
$array = explode("=", $website);
echo $array[1];
This parses the URL into its component parts, then parses the query string into an associative array.
$url = parse_url($url);
parse_str($url['query'], $params);
$v = $params['v'];
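As a minimal sketch, the same parse_url/parse_str approach can be wrapped so that a missing query string or missing v parameter yields null instead of a notice. The function name youtube_id_from_query is hypothetical.

```php
<?php
// Hypothetical wrapper around parse_url() and parse_str().
function youtube_id_from_query(string $url): ?string
{
    // parse_url() returns null when there is no query string
    // and false when the URL is seriously malformed.
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null;
    }

    // Turn "v=...&feature=..." into an associative array.
    parse_str($query, $params);
    $v = $params['v'] ?? null;

    return is_string($v) && $v !== '' ? $v : null;
}

var_dump(youtube_id_from_query('http://www.youtube.com/watch?v=p963CeTtJVM'));
```

Unlike substr() or explode('='), this keeps working when further parameters follow the video ID.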
The following code works with all YouTube domains except for youtu.be. An example would be: http://www.youtube.com/watch?v=ZedLgAF9aEg would turn into: ZedLgAF9aEg
My question is how would I be able to make it work with http://youtu.be/ZedLgAF9aEg.
I'm not so great with regex so your help is much appreciated. My code is:
$text = preg_replace("#[&\?].+$#", "", preg_replace("#http://(?:www\.)?youtu\.?be(?:\.com)?/(embed/|watch\?v=|\?v=|v/|e/|.+/|watch.*v=|)#i", "", $text)); }
$text = (htmlentities($text, ENT_QUOTES, 'UTF-8'));
Thanks again!
//$url = 'http://www.youtube.com/watch?v=ZedLgAF9aEg';
$url = 'http://youtu.be/ZedLgAF9aEg';
if (FALSE === strpos($url, 'youtu.be/')) {
parse_str(parse_url($url, PHP_URL_QUERY), $id);
$id = $id['v'];
} else {
$id = basename($url);
}
echo $id; // ZedLgAF9aEg
This works for both versions of the URL. Do not use regex for this: PHP has built-in functions for parsing URLs, as I have demonstrated, which are faster and more robust.
Your regex appears to solve the problem as it stands now? I didn't try it in php, but it appears to work fine in my editor.
The first part of the regex, http://(?:www\.)?youtu\.?be(?:\.com)?/, matches http://youtu.be/, and the second part, (embed/|watch\?v=|\?v=|v/|e/|.+/|watch.*v=|), ends with |), which means it can match nothing (making it optional). In other words, it would trim away http://youtu.be/, leaving only the ID.
A more intuitive way of writing it would be to make the whole grouping optional, I suppose, but as far as I can tell your regex is already solving your problem:
#http://(?:www\.)?youtu\.?be(?:\.com)?/(embed/|watch\?v=|\?v=|v/|e/|.+/|watch.*v=)?#i
Note: Your regex would work with the www.youtu.be.com domain as well. It would be stripped away, but something to watch out for if you use this for validating input.
Update:
If you only want to match URLs inside [youtube][/youtube] tags, you could use lookarounds.
Something along the lines of:
(?<=\[youtube\])(?:http://(?:www\.)?youtu\.?be(?:\.com)?/(?:embed/|watch\?v=|\?v=|v/|e/|[^\[]+/|watch.*v=)?)(?=.+\[/youtube\])
You could further refine it by making the .+ in the look ahead only match valid URL characters etc.
Try this, hope it'll help you
function YouTubeUrl($url)
{
if($url!='')
{
$newUrl='';
$videoLink1=$url;
$findKeyWord='youtu.be';
$toBeReplaced='www.youtube.com';
if(IsContain($videoLink1,'watch?v='))
{
$newUrl=tMakeUrl($videoLink1);
}
else if(IsContain($videoLink1, $findKeyWord))
{
$videoLinkArray=explode('/',$videoLink1);
$Protocol='';
if(IsContain($videoLink1,'://'))
{
$protocolArray=explode('://',$videoLink1);
$Protocol=$protocolArray[0];
}
$file=$videoLinkArray[count($videoLinkArray)-1];
$newUrl='www.youtube.com/watch?v='.$file;
if($Protocol!='')
$newUrl=$Protocol.'://'.$newUrl;
else
$newUrl=tMakeUrl($newUrl);
}
else
$newUrl=tMakeUrl($videoLink1);
return $newUrl;
}
return '';
}
function IsContain($string,$findKeyWord)
{
if(strpos($string,$findKeyWord)!==false)
return true;
else
return false;
}
function tMakeUrl($url)
{
$tSeven=substr($url,0,7);
$tEight=substr($url,0,8);
if($tSeven!="http://" && $tEight!="https://")
{
$url="http://".$url;
}
return $url;
}
You can use the function below for any YouTube URL.
I hope this will help you
function checkYoutubeId($id)
{
$youtube = "http://www.youtube.com/oembed?url=". urlencode($id) ."&format=json";
$curl = curl_init($youtube);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
$return = curl_exec($curl);
curl_close($curl);
return json_decode($return, true);
}
This function returns the YouTube video's details if the ID matches a YouTube video.
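One detail worth showing is how the oEmbed request URL is assembled: the video URL has to be percent-encoded, otherwise its own ?, & and = characters are read as part of the oEmbed query. A minimal sketch follows; the function name youtube_oembed_url is hypothetical, and it only builds the URL without sending any request.

```php
<?php
// Hypothetical helper: builds the oEmbed request URL, does not fetch it.
function youtube_oembed_url(string $videoUrl): string
{
    // http_build_query() percent-encodes each value, so a video URL that
    // contains "?" or "&" stays inside the single "url" parameter.
    $query = http_build_query(
        ['url' => $videoUrl, 'format' => 'json'],
        '',
        '&'
    );

    return 'https://www.youtube.com/oembed?' . $query;
}

echo youtube_oembed_url('https://www.youtube.com/watch?v=ZedLgAF9aEg');
```

The result can then be passed to curl_init() exactly as in the function above.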
A small improvement to @rvalvik's answer would be to cover mobile links (I noticed this while working with a customer who used an iPad to navigate, copy and paste links). In this case, the host starts with m (mobile) instead of www. The regex then becomes:
#(https?://)?(?:www\.)?(?:m\.)?(?:youtu\.be/|youtube\.com(?:/embed/|/v/|/watch\?.*?v=))([\w\-]{10,12}).*#x
Hope it helps.
A slight improvement of another answer:
if (strpos($url, 'feature=youtu.be') !== FALSE || strpos($url, 'youtu.be') === FALSE )
{
parse_str(parse_url($url, PHP_URL_QUERY), $id);
$id = $id['v'];
}
else
{
$id = basename($url);
}
This takes into account youtu.be still being in the URL, but not the URL itself (it does happen!) as it could be the referring feature link.
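An alternative sketch that sidesteps the feature=youtu.be problem is to branch on the parsed host rather than searching the whole URL string. The function name youtube_id is hypothetical, and the sketch only covers youtu.be short links and watch?v= links, not /embed/ or /v/ paths.

```php
<?php
// Hypothetical helper: decides by host, so "youtu.be" appearing in the
// query string (e.g. feature=youtu.be) cannot confuse it.
function youtube_id(string $url): ?string
{
    $parts = parse_url($url);
    if ($parts === false || empty($parts['host'])) {
        return null;
    }

    if (strtolower($parts['host']) === 'youtu.be') {
        // Short link: the ID is the path without its slashes.
        $id = trim($parts['path'] ?? '', '/');
    } else {
        // Regular link: the ID is the "v" query parameter.
        parse_str($parts['query'] ?? '', $params);
        $id = $params['v'] ?? '';
    }

    return is_string($id) && $id !== '' ? $id : null;
}

var_dump(youtube_id('http://youtu.be/ZedLgAF9aEg'));
```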
Other answers miss the point that some YouTube links are part of a playlist and also have a list parameter, which is required for the embed code. To extract the embed code from such a link, one could try this JS code:
let urlEmbed = "https://www.youtube.com/watch?v=iGGolqb6gDE&list=PL2q4fbVm1Ik6DCzm9XZJbNwyHtHGclcEh&index=32"
let embedId = urlEmbed.split('v=')[1];
let parameterStringList = embedId.split('&');
if (parameterStringList.length > 1) {
embedId = parameterStringList[0];
let listString = parameterStringList.filter((parameterString) =>
parameterString.includes('list')
);
if (listString.length > 0) {
listString = listString[0].split('=')[1];
embedId = `${parameterStringList[0]}?${listString}`;
}
}
console.log(embedId)
Try it out here: https://jsfiddle.net/AMITKESARI2000/o62dwj7q/
try this :
$string = explode("=","http://www.youtube.com/watch?v=ZedLgAF9aEg");
echo $string[1];
would turn into: ZedLgAF9aEg
I have a .txt file where I would like to find an EXACT match of a single email entered in a form.
The directives I currently use (see below) work for a standard form. But when I use them in conjunction with an AJAX call and jQuery, they confirm that the address exists as soon as they find the first partial occurrence.
For example:
If that person enters "bobby@", it says not found, good.
If someone enters their full email address and it exists in the file, it says "found", very good.
Now, if someone enters just "bobby", it says "found", not good.
I tried the following three examples, all with the same results.
if ( !preg_match("/\b{$email}\b/i", $emails )) {
echo "Sorry, not found";
}
and...
if ( !preg_match( "/(?:^|\W){$email}(?:\W|$)/", $emails )) {
echo "Sorry, not found";
}
and...
if ( !preg_match('/^'.$email.'$/', $emails )) {
echo "Sorry, not found";
}
my AJAX
$.ajax({
type: "POST",
url: "email_if_exist.php",
data: "email="+ usr,
success: function(msg){
my text file
Bobby Brown bobby@somewhere.com
Guy Slim guy@somewhere.com
Slim Jim slim@somewhere.com
I thought of using a jQuery function to only accept a full email address, but with no success partly because I didn't know where to put it in the script.
I've spent a lot of time in searching for a solution to this and I am now asking for some help.
Cheers.
Because your text file contains "bobby", any regex like the ones you are suggesting will always find "bobby". I would suggest checking for the presence of the @ symbol BEFORE you run the regex, as any valid email will always have an @ in it. Try something like this:
if (strpos($email,'@') !== false) {
if ( !preg_match("/\b{$email}\b/i", $emails )) {
echo "Sorry, not found";
}
}
EDIT: Looking at this 4 years later... I would make the regex match to the end of the line, using the m modifier to specify multiline so the $ matches newline or EOF. The PHP line would be:
if ( !preg_match("/\b{$email}$/im", $emails )) {
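Two things the patterns above leave open are that the dots in the address act as regex wildcards and that a case-insensitive search for "bobby" also hits the first name "Bobby". A minimal sketch of one way to handle both, assuming the file contents are already in a string: validate the input first, escape it with preg_quote(), and require whitespace (or the string boundary) on both sides. The function name email_listed is hypothetical.

```php
<?php
// Hypothetical helper; $emails is assumed to hold the text file contents.
function email_listed(string $email, string $emails): bool
{
    // Reject partial input such as "bobby" or "bobby@" up front.
    if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        return false;
    }

    // preg_quote() escapes regex metacharacters such as "." in the address.
    // (?<!\S) and (?!\S) demand whitespace or the string edge on each side.
    $pattern = '/(?<!\S)' . preg_quote($email, '/') . '(?!\S)/i';

    return preg_match($pattern, $emails) === 1;
}

$emails = "Bobby Brown bobby@somewhere.com\nGuy Slim guy@somewhere.com\n";
var_dump(email_listed('bobby@somewhere.com', $emails));
var_dump(email_listed('bobby', $emails));
```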
If you're just checking to see if the user exists, this should work:
$users = trim(preg_replace('/\s+/', ' ', $users));
$userArray = explode(' ', $users);
$exists = in_array($email, $userArray);
Here $users holds the contents of the example file and $email is the queried e-mail.
This replaces every run of whitespace (newlines included) with a single space and then splits on spaces into an array; if the e-mail exists in the array, the user exists.
Hope I helped!
'/^'.$email.'$/' is quite close. Since you want the check to succeed only if the full email address is in the file, the pattern should include the "limits" of the email: whitespace before it and end of line after it:
'/ '.$email.'$/m'
(Yes, I've just swapped ^, the start of line, for a space, and added the m modifier so that $ matches at the end of every line.)
If every line of your text file ends with the email,
you can test for a match on "email + end of line"
like this:
if( preg_match("/.+{$email}(?:\r\n|\n|\r)/", $textFileEmails) )
{
/// code
}
The code would validate first using php core functions whether the email is correct or not and then check for the occurrence.
$email = 'bobby@somewhere.com';
$found = false;
//PHP has a built-in function to validate an email
if(filter_var($email, FILTER_VALIDATE_EMAIL)){
//Grab lines from the file
$lines = file('myfile.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
foreach ($lines as $line) {
//Grab words from the line
$words = explode(" ", $line);
//If email found within the words set the flag as true.
if(in_array($email, $words)) {
$found = true;
//If the aim is only to find the email, we can break out here.
break;
}
}
}
if(false === $found) {
echo 'Not found!';
} else {
echo 'Found you!';
}
If your file is formatted as in your example (first_name last_name email@address.tld),
it's really easy to break it up on load for searching.
I don't know why you would use preg_match for this, but if you were advised to use preg, use it to verify the email address. You're better off using PHP's equivalent of indexOf (strpos) to search the file, but the method below works for your fixed file format.
Object-oriented file reader and searcher
class Search{
private $users = array();
public function __construct($password_file){
$file = file_get_contents($password_file);
$lines = explode("\n", $file);
foreach($lines as $line){
// Split each line into first name, last name and email
$user = explode(" ", trim($line));
// Skip blank or incomplete lines
if(count($user) < 3){
continue;
}
$this->users[] = array("first_name" => $user[0], "last_name" => $user[1], "email" => $user[2]);
}
}
public function searchByEmail($email){
foreach($this->users as $key => $user){
if($user['email'] == $email){
// return user array
return $user;
// or you could return user id
//return $key;
}
}
return false;
}
}
Then to use
$search = new Search($passwdFile);
$user = $search->searchByEmail($_POST['email']);
echo ($user)? "found":"Sorry, not found";
Using preg_match to validate email then check
If you want to use preg and your own file search system.
function validateEmail($email) {
$v = "/^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+$/";
return (bool)preg_match($v, $email);
}
then use like
if(validateEmail($_POST['email'])){
echo (strpos($emails, $_POST['email']) !== false)? "found":"Sorry, not found";
}
I am using this absolutely amazing piece of code: https://github.com/plancake/official-library-php-email-parser/blob/master/PlancakeEmailParser.php
But the one thing it is missing is the ability to get the From email address.
I have simply added:
public function getFromEmail()
{
if (!isset($this->rawFields['from']))
{
return false;
}
return $this->rawFields['from'];
}
But how would I get only the email address part? At the moment it returns: John Smith<john@gmail.com>
I would also need this to work if the From address were only john@gmail.com.
Thanks to the answers this was the finished code:
public function getFromEmail()
{
$email = trim($this->rawFields['from']);
if(substr($email, -1) == '>'){
$fromarr = explode("<",$email);
$mailarr1 = explode(">",$fromarr[1]);
$email = $mailarr1[0];
}
return $email;
}
This is a very simple regular expression:
$output = array();
preg_match("/.*<(.*?)>.*?/", $this->rawFields['from'], $output);
$email_address = $output[1];
Be careful though: if someone's name contains < or >, it might cause a security vulnerability. The greedy leading .* together with the lazy (.*?) ensures that the last set of < > is used.
HTH
PS: Use http://gskinner.com/RegExr/ to test Regular Expressions!
$mailid='John Smith<john@gmail.com>';
$mailarr=explode("<",$mailid);
$mailarr1=explode(">",$mailarr[1]);
$just_emailid=$mailarr1[0];
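Both approaches can be folded into one small function that also handles a bare address with no angle brackets. This is a minimal sketch with a hypothetical function name, from_address; it assumes a single address in the header and does not attempt full RFC 5322 parsing.

```php
<?php
// Hypothetical helper: returns the address inside the trailing <...>,
// or the trimmed input unchanged when there are no angle brackets.
function from_address(string $from): string
{
    $from = trim($from);

    // Capture whatever sits between the final "<" and the closing ">".
    if (preg_match('/<([^<>]*)>$/', $from, $m)) {
        return trim($m[1]);
    }

    return $from;
}

echo from_address('John Smith<john@gmail.com>'), "\n";
echo from_address('john@gmail.com'), "\n";
```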
I have a "coming soon" form on a website where the user fills in an email address, which is then emailed to me. However, a spammer has hit the site and is spamming the form with goatse and so on. An IP ban isn't helping, so I need to stop the form from sending if the input contains goatse or similar. Here's the mailer.
<?php
$SPOSTI =$_POST[sposti];
if ($SPOSTI=="")
{
return false;
}
if ($SPOSTI=="goatse.fr")
{
return false;
}
if ($SPOSTI=="http://www.goatse.info/hello.jpg")
{
return false;
}
else
{
$to = "xxx@gmail.com";
$subject = "xxx";
$message = "$_POST[sposti] haluaa tiedon kun kotisivut.name avautuu.
$_POST[ip]";
$from = "$_POST[sposti]";
$headers = "From:" . $from;
mail($to,$subject,$message,$headers);
}
?>
Is there some way to block it from executing the code if the email contains a certain word (goatse in this case)?
You need to use exit or die instead of return false, which only works inside functions/methods:
if ( $SPOSTI =="" || strpos($SPOSTI, 'goatse') !== FALSE)
{
exit();
}
strpos() will let you find a substring, but I really recommend a captcha security system as the attacker could simply switch to another annoying word.
Goatse isn't your problem here; the security is.
You can use stristr (http://php.net/manual/de/function.stristr.php) to achieve this. I would recommend using a captcha, since it is more effective. A popular solution is reCaptcha: https://developers.google.com/recaptcha/docs/php. Another, weaker possibility is to add a security question to your form, for instance "What is five plus five in numbers?".
Try the following:
function is_spam($array, $block_pattern){
$block = false;
foreach($array as $k => $v){
if(preg_match('/.*' . $block_pattern . '.*/', $k) ||
preg_match('/.*' . $block_pattern . '.*/', $v)){
$block = true;
break;
}
}
return $block;
}
Usage: is_spam($_POST, 'goatse');
Returns: true if 'goatse' is found in $_POST
The function will search all keys and values of $array for the $block_pattern string and will return true if the pattern is found.
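A word filter is easier to reason about when it is combined with validation of the submitted address, since the mailer above places that value straight into the From header. The following is a minimal sketch under that assumption; the function name sender_is_acceptable and the blocked-word list are hypothetical, and the words are assumed to be non-empty strings.

```php
<?php
// Hypothetical helper: accepts only syntactically valid addresses that
// contain none of the blocked words (case-insensitive).
function sender_is_acceptable(string $email, array $blockedWords): bool
{
    // A valid address cannot contain line breaks, so this also stops
    // extra headers being smuggled into the From line.
    if (filter_var($email, FILTER_VALIDATE_EMAIL) === false) {
        return false;
    }

    foreach ($blockedWords as $word) {
        if (stripos($email, $word) !== false) {
            return false;
        }
    }

    return true;
}

var_dump(sender_is_acceptable('someone@example.com', ['goatse']));
```

As the answers above note, this only raises the bar slightly; a captcha remains the more robust defence.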