scrape email addresses

scrape email addresses - php

fff.html is an email containing email addresses. Some are wrapped in href mailto links and some are not. I want to scrape them all and output them in the following format:
Lorem@ipsum.com,dolor@sit.com,amet@consectetur.com
I have a simple scraper that gets the ones that are href-linked, but something is weird:
<?php
$url = "fff.html";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content,'<a href="mailto:');
$end = strpos($content,'"',$start) + 8;
$mail = substr($content,$start,$end-$start);
print "$mail<br />";
?>
I should get extra points for the original use of lorem ipsum

The problem is what happens when the HTML page contains more than one email address: strpos() only finds the first occurrence, so your script extracts a single address. Here is a script that parses every mailto address. You may need to tweak it for your use. It outputs the results in the CSV form you requested.
<?php
$url = "fff.html";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
// Restrict the search to the document body.
$start = strpos($content, '<body>');
$end = strpos($content, '</body>');
$data = substr($content, $start, $end-$start);
// Capture the address inside every mailto: link.
$pattern = '#<a[^>]+href="mailto:([^"]+)"[^>]*?>#is';
preg_match_all($pattern, $data, $matches);
// $matches[1] holds the captured addresses (an empty array if none matched).
$emails = $matches[1];
// Join with a bare comma to match the requested format.
echo implode(',', $emails);
?>
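The script above only sees addresses inside mailto links, while the question also asks for the ones that are not linked. A minimal sketch of one way to catch both, assuming a deliberately simple address pattern (not a full RFC validator) and a hypothetical helper name extract_emails():

```php
<?php
// extract_emails() is a hypothetical helper, not part of the original answer.
function extract_emails(string $html): array
{
    // Decode entities first so "&#64;" becomes "@".
    $text = html_entity_decode($html, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Simple pattern: local part, "@", domain, dot, top-level domain.
    preg_match_all('/[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i', $text, $matches);
    // An address found in both the href and the link text is reported once.
    return array_values(array_unique($matches[0]));
}

$html = '<p><a href="mailto:Lorem@ipsum.com">Lorem@ipsum.com</a> or dolor@sit.com</p>';
echo implode(',', extract_emails($html)), PHP_EOL;
```

Because the pattern runs over the whole document, it picks up the address in the href, in the link text, and in plain text alike; array_unique() removes the repeats.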

Related

preg_match_all How to get all links?

I'm trying to use preg_match_all to get all image links that begin with http://i.ebayimg.com/ and end with .jpg from a page that I'm scraping, but I can't get it to work correctly. I tried this, but it is not what I need:
preg_match_all('/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/', $contentas, $img_link);
I have the same problem with normal links. I don't know how to write a preg_match_all pattern for this:
<a class="link--muted" href="http://suchen.mobile.de/fahrzeuge/details.html?id=218930381&daysAfterCreation=7&isSearchRequest=true&withImage=true&scopeId=C&categories=Limousine&damageUnrepaired=NO_DAMAGE_UNREPAIRED&zipcode=&fuels=DIESEL&ambitCountry=DE&maxPrice=11000&minFirstRegistrationDate=2006-01-01&makeModelVariant1.makeId=3500&makeModelVariant1.modelId=20&pageNumber=1" data-touch="hover" data-touch-wrapper=".cBox-body--resultitem">
Thank you very much!!!
UPDATE
I'm scraping this page:
http://suchen.mobile.de/fahrzeuge/search.html?isSearchRequest=true&scopeId=C&makeModelVariant1.makeId=1900&makeModelVariant1.modelId=10&makeModelVariant1.modelDescription=&makeModelVariantExclusions%5B0%5D.makeId=&categories=Limousine&minSeats=&maxSeats=&doorCount=&minFirstRegistrationDate=2006-01-01&maxFirstRegistrationDate=&minMileage=&maxMileage=&minPrice=&maxPrice=11000&minPowerAsArray=&maxPowerAsArray=&maxPowerAsArray=PS&minPowerAsArray=PS&fuels=DIESEL&minCubicCapacity=&maxCubicCapacity=&ambitCountry=DE&zipcode=&q=&climatisation=&airbag=&daysAfterCreation=7&withImage=true&adLimitation=&export=&vatable=&maxConsumptionCombined=&emissionClass=&emissionsSticker=&damageUnrepaired=NO_DAMAGE_UNREPAIRED&numberOfPreviousOwners=&minHu=&usedCarSeals=
I want to get the car links, the image links, and all the other information. Extracting the information works fine, but I have a problem scraping the images and links. Here is my script:
<?php
$id= $_GET['id'];
$user= $_GET['user'];
$login=$_COOKIE['login'];
$query = mysql_query("SELECT pavadinimas,nuoroda,kuras,data,data_new from mobile where vartotojas='$user' and id='$id'");
$rezultatas=mysql_fetch_row($query);
$url = "$rezultatas[1]";
$info = file_get_contents($url);
function scrape_between($data, $start, $end){
$data = stristr($data, $start);
$data = substr($data, strlen($start));
$stop = stripos($data, $end);
$data = substr($data, 0, $stop);
return $data;
}
// cut out the content area
$turinys = scrape_between($info, '<div class="g-col-9">', '<footer class="footer">');
// filtering: remove the paid top listings
$contentas = preg_replace('/<div class="cBox-body cBox-body--topResultitem".*?>(.*?)<\/div>/', '' ,$turinys);
// filtering done
preg_match_all('/<span class="h3".*?>(.*?)<\/span>/',$contentas,$pavadinimas);
preg_match_all('/<span class="u-block u-pad-top-9 rbt-onlineSince".*?>(.*?)<\/span>/',$contentas,$data);
preg_match_all('/<span class="u-block u-pad-top-9".*?>(.*?)<\/span>/',$contentas,$miestas);
preg_match_all('/<span class="h3 u-block".*?>(.*?)<\/span>/', $contentas, $kaina);
preg_match_all('/<a[A-z0-9-_:="\.\/ ]+href="(http:\/\/suchen.mobile.de\/fahrzeuge\/[^"]*)"[A-z0-9-_:="\.\/ ]\s*>\s*<div/s', $contentas, $matches);
print_r($pavadinimas);
print_r($data);
print_r($miestas);
print_r($kaina);
print_r($result);
print_r($matches);
?>

1. To capture the src attribute of every img tag whose URL starts with http://i.ebayimg.com/ and ends with .jpg:
regex: /src="(https?:\/\/i\.ebayimg\.com\/[^"]+?\.jpg)"/i
Here is an example:
$re = '/src="(https?:\/\/i\.ebayimg\.com\/[^"]+?\.jpg)"/i';
$str = "codeOfHTMLPage";
preg_match_all($re, $str, $matches);
// $matches[1] holds the captured URLs
If you want to be sure that the URL belongs to an img tag, use this regex instead (keep in mind that performance will decrease if the page is very long):
$re = '/<img[^>]*?src="(https?:\/\/i\.ebayimg\.com\/[^"]+?\.jpg)"/i';
2. To capture the href attribute of every a tag whose URL starts with http://suchen.mobile.de/fahrzeuge/ (these links carry a query string, so the pattern does not require a .jpg ending):
regex: /href="(https?:\/\/suchen\.mobile\.de\/fahrzeuge\/[^"]+)"/i
Here is an example:
$re = '/href="(https?:\/\/suchen\.mobile\.de\/fahrzeuge\/[^"]+)"/i';
$str = "codeOfHTMLPage";
preg_match_all($re, $str, $matches);
If you want to be sure that the URL belongs to an a tag, use this regex instead (keep in mind that performance will decrease if the page is very long):
$re = '/<a[^>]*?href="(https?:\/\/suchen\.mobile\.de\/fahrzeuge\/[^"]+)"/i';

It is handier with DOMDocument:
libxml_use_internal_errors(true); // keep warnings about malformed HTML quiet
$dom = new DOMDocument;
$dom->loadHTMLFile($yourURL);
$imgNodes = $dom->getElementsByTagName('img');
$result = [];
foreach ($imgNodes as $imgNode) {
    $src = $imgNode->getAttribute('src');
    $urlElts = parse_url($src);
    // pathinfo() avoids passing the result of explode() by reference to array_pop()
    $ext = strtolower(pathinfo($urlElts['path'] ?? '', PATHINFO_EXTENSION));
    if ($ext == 'jpg' && ($urlElts['host'] ?? '') == 'i.ebayimg.com')
        $result[] = $src;
}
print_r($result);
To get "normal" links, use the same way (DOMDocument + parse_url).
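As a rough illustration of that last remark, here is a minimal sketch that collects the href of every a tag pointing at a given host and path prefix. The function name collect_links() and its parameters are hypothetical, and the sample markup is a shortened version of the link in the question:

```php
<?php
// collect_links() is a hypothetical helper name used only for this sketch.
function collect_links(string $html, string $host, string $pathPrefix): array
{
    libxml_use_internal_errors(true); // tolerate malformed markup
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href  = $a->getAttribute('href'); // entities such as &amp; are already decoded
        $parts = parse_url($href);
        if (!is_array($parts)) {
            continue; // parse_url() returns false on seriously malformed URLs
        }
        if (($parts['host'] ?? '') === $host
            && strpos($parts['path'] ?? '', $pathPrefix) === 0) {
            $links[] = $href;
        }
    }
    return $links;
}

$html = '<a class="link--muted" href="http://suchen.mobile.de/fahrzeuge/details.html?id=1&amp;pageNumber=1">car</a>'
      . '<a href="http://example.com/other">other</a>';
print_r(collect_links($html, 'suchen.mobile.de', '/fahrzeuge/'));
```

Comparing the parsed host and path keeps the filter independent of attribute order and quoting, which is where the regex versions tend to break.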

How to post content between two tags PHP

I am currently using two PHP scripts: one to post content to a file and one to get the content between two tags in that file.
Post Content Script:
$content = $_POST["maintenancetext"];
$strNewContents = "$content";
$fileRefID = fopen("../../../maintenance.php", "w");
fwrite($fileRefID, $strNewContents);
fclose($fileRefID);
Get Content from between two tags script:
$start = '<p>';
$end = '</p>';
$string = file_get_contents("../../../maintenance.php");
$output = strstr( substr( $string, strpos( $string, $start) + strlen($start)), $end, true);
echo htmlentities($output, ENT_QUOTES);
I am currently posting from a textarea to the file, but I need the posted content to replace only the text between the two tags, not the whole file.
How can I achieve this? Thanks.

Assuming the post content script is working, here is how to get the content between the two tags:
$string = file_get_contents("../../../maintenance.php");
$matches = array();
$pattern = "'<p>(.*?)</p>'si";
preg_match($pattern, $string, $matches);
$output = $matches[1];
echo $output;
To replace content:
$newstring = "i am new string";
// Replace the old text inside the full file contents, not inside $output itself
$newoutput = str_replace($output, $newstring, $string);
echo $newoutput;
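Note that str_replace() swaps every occurrence of the old text, wherever it appears in the file. A minimal sketch of an alternative that touches only the first paragraph, assuming a hypothetical helper named replace_between_p() and made-up sample content:

```php
<?php
// replace_between_p() is a hypothetical helper name used only for this sketch.
function replace_between_p(string $source, string $newText): string
{
    return preg_replace_callback(
        '~(<p>)(.*?)(</p>)~si',
        function (array $m) use ($newText): string {
            // Keep the tags, swap the inner text, and escape it for HTML.
            return $m[1] . htmlspecialchars($newText, ENT_QUOTES) . $m[3];
        },
        $source,
        1 // limit: only the first <p>...</p> pair is replaced
    );
}

$page = "<h1>Maintenance</h1>\n<p>Back at 10:00</p>\n<p>Thanks</p>";
echo replace_between_p($page, 'Back at 12:00 & not before'), PHP_EOL;
```

Using a callback rather than a replacement string means characters such as $1 or \1 in the posted text are never interpreted as backreferences. The result could then be written back with file_put_contents().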

Append title tag in whole content where it contain <a> tag

I have dynamic content that contains a few anchor tags whose href starts with "https://" or "http://". I need to add a title attribute to each anchor tag whose href starts with "https". The code I am using is not able to add a dynamic title. For example:
$str = 'some text
some other text ';
The output string would be:
$str = 'some text
some text '
The code which I am using is:
$xml = '<foo>'. $content.'</foo>';
$doc = new DOMDocument;
$doc->loadXml($xml);
foreach ($doc->getElementsByTagName('a') as $anchor) {
if ($anchor->hasAttribute('title')) {
$anchor->removeAttribute('title');
}
}
$newcontent = $doc->saveHTML();
$pattern = '/(href=("|\')https)(:\\/\\/.*?("|\'))/';
$subject = $newcontent;
preg_match_all ($pattern, $subject, $matches);
for($i=0; $i< count($matches); $i++){
$complete_url = $matches[0][$i];
$get_prog_name = explode("//",$complete_url);
if (strpos($get_prog_name[1],'www.') !== false) {
$get_prog_name = explode("www.",$complete_url);
}
$prog_acro = explode(".", $get_prog_name[1]);
if($prog_acro[0] == "ccc") {
$progName = "Dynamic Title 1";
}
if($prog_acro[0] == "cec") {
$progName = "Dynamic Title 2";
}
$replacement[] = 'class="tooltip" title=Requires CEB '.$progName.'membership login $1$3';
} // end foreach loop
$newstr = preg_replace($pattern, $replacement[0], $subject, -1 );
I want to set the title dynamically here. The problem is that when I put preg_replace inside the loop, it prints the whole content once for every anchor tag it contains.
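One way to avoid running preg_replace in a loop is to set the attributes directly on the DOM nodes. The following is only a sketch under several assumptions: add_titles() is a hypothetical name, the ccc/cec mapping is taken from the code above, anchors with any other host are left untouched, and the sample markup is made up:

```php
<?php
// add_titles() is a hypothetical name; the ccc/cec titles come from the question.
function add_titles(string $content): string
{
    $titles = ['ccc' => 'Dynamic Title 1', 'cec' => 'Dynamic Title 2'];

    libxml_use_internal_errors(true); // tolerate imperfect markup
    $doc = new DOMDocument;
    // Wrap the fragment so its nodes can be found again after parsing.
    $doc->loadHTML('<div>' . $content . '</div>');
    libxml_clear_errors();

    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        if (stripos($href, 'https://') !== 0) {
            continue; // only https links get a title
        }
        $host  = (string) parse_url($href, PHP_URL_HOST);
        // First host label, ignoring a leading "www."
        $label = explode('.', preg_replace('/^www\./i', '', $host))[0];
        if (!isset($titles[$label])) {
            continue; // assumption: unknown hosts stay untouched
        }
        $anchor->setAttribute('class', 'tooltip');
        $anchor->setAttribute('title', "Requires CEB {$titles[$label]} membership login");
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($doc->getElementsByTagName('div')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$str = 'some <a href="https://www.ccc.example.com/x">text</a> and <a href="http://cec.example.com/">other</a>';
echo add_titles($str), PHP_EOL;
```

Each anchor is visited exactly once, so the content is produced a single time regardless of how many links it contains. setAttribute() also overwrites any existing title, which makes the separate removeAttribute() pass unnecessary.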

Error when get data when using regex in php

I have a sample code:
<?php
$adr = 'http://www.proxynova.com/proxy-server-list/country-gb/';
$c = file_get_contents($adr);
if ($c){
$regexp = '#<td>(.*?):(\d{1,4})</td>#';
$matches = array();
preg_match_all($regexp,$c,$matches);
print_r($matches);
if (count($matches) > 0){
foreach($matches[0] as $k => $m){
$port = intval($matches[2][$k]);
$ip = trim($matches[1][$k]);
}
}
}
I am using $regexp = '#<td>(.*?):(\d{1,4})</td>#'; to get the data, including IP and port, but the result is empty. How can I fix it?

You can only see it properly in the browser, but in the source it's actually scrambled; you need something like this to decode it:
function decode($str)
{
return long2ip(strtr($str, array(
'fgh' => 2,
'iop' => 1,
'ray' => 0,
)));
}
Then use it together with a DOMDocument solution like this:
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML(file_get_contents('http://www.proxynova.com/proxy-server-list/country-gb/'));
$xp = new DOMXPath($doc);
foreach ($xp->query('//table[@id="tbl_proxy_list"]//tr') as $row) {
$ip = $xp->query('./td/span[@class="row_proxy_ip"]/script', $row);
$port = $xp->query('./td/span[@class="row_proxy_port"]/a', $row);
if ($ip->length && $port->length) {
if (preg_match('/decode\("([^"]+)"\)/', $ip->item(0)->textContent, $matches)) {
echo decode($matches[1]) . ':' . $port->item(0)->textContent, PHP_EOL;
}
}
}
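The decoding step can be checked in isolation. A minimal sketch, assuming a 64-bit PHP build and a made-up scrambled string (not taken from the real page) in which the letter groups stand for the digits 2, 1 and 0 and the remaining digits pass through unchanged:

```php
<?php
// Same idea as decode() above: turn letter groups back into digits,
// then convert the resulting integer to dotted-quad notation.
function decode($str)
{
    $digits = strtr($str, ['fgh' => '2', 'iop' => '1', 'ray' => '0']);
    // Explicit cast: long2ip() expects an integer.
    return long2ip((int) $digits);
}

// Hypothetical sample: "3fgh3fghfgh35777" unscrambles to the integer 3232235777.
echo decode('3fgh3fghfgh35777'), PHP_EOL; // 192.168.1.1
```

The integer 3232235777 equals 192*2^24 + 168*2^16 + 1*2^8 + 1, which is why long2ip() renders it as 192.168.1.1.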

The HTML source code contains the IP addresses and ports in two separate columns, which is why your regex doesn't work.

PHP: Converting text links to anchor tags

I am pulling in RSS feeds and using DOMXPath to convert all existing anchor tags to custom tags that look like this for various reasons:
[webserviceLink]{$url}[/webserviceLink][webserviceLinkName]{$text}[/webserviceLinkName]
This works great, but I'd also like to convert all non-HTML text links to this same format, and I am having some issues.
Here's my code for converting the text links:
$pattern = '(?xi)(?<![">])\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
$desc = preg_replace_callback("#$pattern#i", function($matches)
{
$input = $matches[0];
$url = preg_match('!^https?://!i', $input) ? $input : "http://$input";
if (strlen($input) > 20 && !strpos($input, " "))
$input = substr($input, 0, 18)."... ";
return "[webserviceLink]{$url}[/webserviceLink][webserviceLinkName]{$input}[/webserviceLinkName]";
}, $desc);
I'm not sure how to write the negative lookbehind in this regex so that it checks the link I am converting is not inside an existing HTML tag, such as an img, or inside my custom link tags above.

I was able to use xpath to get this working.
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($desc, 'HTML-ENTITIES', 'UTF-8'));
$xp = new DOMXPath($dom);
foreach ($xp->query('//text()[not(ancestor::a)]') as $node)
{
$pattern = '((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
$replaced = preg_replace_callback("#$pattern#i", function($matches)
{
$input = $matches[0];
$url = preg_match('!^https?://!i', $input) ? $input : "http://$input";
if (strlen($input) > 20 && !strpos($input, " "))
$input = substr($input, 0, 18)."... ";
return "{$input}";
}, $node->wholeText);
$newNode = $dom->createDocumentFragment();
$newNode->appendXML($replaced);
$node->parentNode->replaceChild($newNode, $node);
}
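The same idea can be sketched without appendXML, which fails on URLs containing a bare "&". This is only an illustration under assumptions: link_text_urls() is a hypothetical name, the URL pattern is a much simpler stand-in for the one above, and the custom tag format is the one from the question:

```php
<?php
// link_text_urls() is a hypothetical name; the URL pattern is intentionally simple.
function link_text_urls(string $html): string
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML('<div>' . $html . '</div>'); // wrapper to recover the fragment later
    libxml_clear_errors();

    $xp = new DOMXPath($dom);
    // Copy the result into an array before replacing any nodes.
    $nodes = iterator_to_array($xp->query('//text()[not(ancestor::a)]'));
    foreach ($nodes as $node) {
        $parts = preg_split('~(https?://[^\s<>"]+)~i', $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) === 1) {
            continue; // no URL in this text node
        }
        $frag = $dom->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            // Odd indexes hold the captured URLs, even indexes the surrounding text.
            $text = $i % 2
                ? "[webserviceLink]{$part}[/webserviceLink][webserviceLinkName]{$part}[/webserviceLinkName]"
                : $part;
            $frag->appendChild($dom->createTextNode($text));
        }
        $node->parentNode->replaceChild($frag, $node);
    }

    // Serialize only the children of the wrapper <div>.
    $out = '';
    foreach ($dom->getElementsByTagName('div')->item(0)->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$desc = 'See http://example.com/a and <a href="http://x.org">http://x.org</a>';
echo link_text_urls($desc), PHP_EOL;
```

Because the XPath query skips text with an a ancestor, the URL inside the existing anchor is left alone, and URLs in attributes such as img src are never visited because they are not text nodes.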
