I am trying to scrape an ebay page such as this one: http://www.ebay.co.uk/sch/Cars-/9801/i.html?_nkw=vw+golf
Everything works great except that one of my regular expressions isn't matching the content, so the matches aren't being pushed to $linksArray. I have output the contents to make sure that what I am trying to match is in fact there, and it is. I then call print_r($linksArray), where all the matches should be, but they are not there: it is an empty multidimensional array. You can see my live example here: http://www.mycommunity.co.za/marcksack/index.php
Here is my PHP code:
<?php
echo '<form method="POST">
<input type="text" id="url" name="url" size="120" value="' . (isset($_REQUEST["url"]) && !empty($_REQUEST["url"]) ? $_REQUEST["url"] : "") . '"/>
<input type="submit" value="Submit" />
</form>';
flush();
if (isset($_REQUEST["url"]) && !empty($_REQUEST["url"])) {
$url = $_REQUEST["url"];
$phones = array();
for ($page = 1; $page <= 1; $page++) {
// get page contents
$contents = file_get_contents($url . "&_pgn=" . $page);
echo(htmlentities($contents));
// find all links patterns
// HERE IS THE PROBLEM
$pattern = '/class="lvtitle"><a href="(.*)" class="vip"/';
$linksArray = array();
preg_match_all($pattern, $contents, $linksArray);
print_r($linksArray);
$links = $linksArray[0];
foreach($links as $link) {
$pureLink = str_replace("class=\"lvtitle\"><a href=\"", "", $link);
$pureLink = str_replace("\" class=\"vip\"", "", $pureLink);
// getting sub page contents
$subContents = file_get_contents($pureLink);
// find all links patterns
$subContents = str_replace(" ", "", $subContents);
$phonePattern = '/07[0-9]{9}/';
$phonesArray = array();
preg_match_all($phonePattern, $subContents, $phonesArray);
foreach($phonesArray[0] as $element) {
// check if phone not added previously to the phones array
if (!in_array($element, $phones)) {
// add it to the phones array
array_push($phones, $element);
echo $element . "<br />";
flush();
}
}
}
}
// print results
foreach($phones as $phone){
echo $phone."<br/>";
}
}
?>
So obviously my question is: what am I doing wrong? Why are the matches not being pushed to my $linksArray variable? I really appreciate your help!
This regex works:
"/ class=\"lvtitle\"><a href=\"([^\"]*)\" class=\"vip\"/"
A few issues with yours:
You were trying to capture the URL using (.*), which is greedy and will match as much of the line as it can.
The pattern was not matching at all because eBay puts two spaces between the class and href attributes.
Also, as has already been mentioned, you should use the API or DOMDocument for this. But in case you are curious, this is why it wasn't working. I hope that helps!
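For reference, here is a minimal sketch of the DOMDocument route mentioned above. The sample markup is made up for illustration (only the lvtitle and vip class names come from the question), and extractListingLinks is a hypothetical helper name, so treat this as a starting point rather than a drop-in fix.

```php
<?php
// Hypothetical helper: returns the href of every <a class="vip"> that sits
// directly inside an element whose class attribute is exactly "lvtitle".
function extractListingLinks(string $html): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect parser warnings quietly.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//*[@class="lvtitle"]/a[@class="vip"]') as $a) {
        $links[] = $a->getAttribute('href');
    }
    return $links;
}

// Assumed sample markup, including the double space that tripped up the regex.
$html = '<h3 class="lvtitle"><a href="http://example.com/itm/1"  class="vip">VW Golf</a></h3>';
print_r(extractListingLinks($html));
```

Because the parser handles attribute spacing and ordering, extra whitespace between attributes no longer matters.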
Related
I need to process a text before showing it on a page of a website. I need to turn every URL of that same website (not URLs of other websites) into a link by wrapping it in an <a> tag. The problem is the href attribute, where I need to put the correct URL. I am trying to go through the whole text and, when I find a URL, check whether it contains the substring "http://". If it does not, I must add it in the href attribute. I have made some attempts, but none of them work yet. Any idea how I can do this?
My function is below:
$string = "This is a url from my website: http://www.mysite.com.br and I have a article interesting there, the link is http://www.mysite.com.br/articles/what-is-psychology/205967. I need that the secure url link works too https://www.mysite.com.br/articles/what-is-psychology/205967. the following urls must be valid too: www.mysite.com.br and mysite.com.br";
function urlMySite($string){
$verifyUrl = '';
$urls = array("mysite.com.br");
$text = explode(" ", $string);
$alltext = "";
for($i = 0; $i < count($texto); $i++){
foreach ($urls as $value){
$pos = strpos($text[$i], $value);
if (!($pos === false)){
$verifyUrl = " <a href='".$text[$i]."' target='_blank'>".$text[$i]."</a> ";
if (strpos($verifyUrl, 'http://') !== true) {
$verifyUrl = " <a href='http://".$text[$i]."' target='_blank'>".$text[$i]."</a> ";
}
$alltext .= $verifyUrl;
} else {
$alltext .= " ".$text[$i]." ";
}
}
}
return $alltext;
}
You should use preg_match_all to find all occurrences of the URL and replace each of the matches with a clickable link.
You could use this function:
function augmentText($text){
$pattern = "~(https?|file|ftp)://[a-z0-9./&?:=%_-]*~i"; // hyphen placed last so it is a literal, not a %-to-_ range
preg_match_all($pattern, $text, $matches);
if( count($matches[0]) > 0 ){
foreach($matches[0] as $match){
$text = str_replace($match, "<a href='" . $match . "' target='_blank'>" . $match . "</a>", $text);
}
}
return $text;
}
Change the regular expression pattern to match only the URLs you want to make clickable.
Good luck
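As one possible way to narrow the pattern to a single site and to handle URLs written without a scheme, here is a rough sketch using preg_replace_callback. The function name linkifyOwnSite and the choice to strip trailing sentence punctuation are assumptions for illustration, not part of the answer above.

```php
<?php
// Hypothetical helper: wraps every URL on $domain in an <a> tag and adds
// "http://" to the href when the text was written without a scheme.
function linkifyOwnSite(string $text, string $domain): string
{
    // The lookbehind stops matches that start in the middle of another host name.
    $pattern = '~(?<![\w@./-])(?:https?://)?(?:www\.)?'
             . preg_quote($domain, '~')
             . '\b(?:/[^\s<]*)?~i';

    return preg_replace_callback($pattern, function (array $m): string {
        $url  = rtrim($m[0], '.,;:!?');          // keep sentence punctuation outside the link
        $rest = substr($m[0], strlen($url));
        $href = preg_match('~^https?://~i', $url) ? $url : 'http://' . $url;
        return "<a href='" . htmlspecialchars($href, ENT_QUOTES) . "' target='_blank'>"
             . htmlspecialchars($url, ENT_QUOTES) . "</a>" . $rest;
    }, $text);
}

echo linkifyOwnSite('See www.mysite.com.br and http://other.com', 'mysite.com.br');
```

Doing the replacement inside the callback means each match is rewritten exactly once, which avoids the double replacement that a str_replace loop can cause when one URL is a prefix of another.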
This is my entire code
// include the scrapper
include('simple_html_dom.php');
// connect the page for scrapping
$html = file_get_html('http://www.niagarafallsreview.ca/news/local');
// make empty arrays
$headlines = array();
$links = array();
// look for 'h' headings on page
foreach($html->find('h1') as $header) {
$headlines[] = $header->plaintext;
}
// look for 'a' links that start with 'http://www.niagarafallsreview.ca/2016/04/'
foreach($html->find('a[href^="http://www.niagarafallsreview.ca/2016/04/"]') as $link) {
$links[] = $link->href;
}
// trim the headlines because one on top and bottom were not needed
$output = array_slice($headlines, 1, -1);
// for each header output a nice list of the headers
foreach ($output as $headers){
echo "< a href='#'>$headers</a>" . "<br />";
}
// make sure the links are unique and no doubles are found
$result = array_unique($links);
// for each link output it in a nice list
foreach ($result as $linkk){
echo "<a href='$linkk'>$linkk</a>" . "<br />";
}
This code produces the headings in a nice list, and also produces a nice list of the links.
My problem is that I need to combine them: I would like $headers to be the text of the link and $linkk to be its href,
like this:
<a href='$linkk'>$headers</a>
I don't know how to do this, as I have two foreach statements. I tried to combine them but was unsuccessful.
Any help will be greatly appreciated.
Thanks.
Try this:
// include the scrapper
include('simple_html_dom.php');
// connect the page for scrapping
$html = file_get_html('http://www.niagarafallsreview.ca/news/local');
// make empty arrays
$headlines = array();
$links = array();
// look for 'h' headings on page
foreach($html->find('h1') as $header) {
$headlines[] = $header->plaintext;
}
// look for 'a' links that start with 'http://www.niagarafallsreview.ca/2016/04/'
foreach($html->find('a[href^="http://www.niagarafallsreview.ca/2016/04/"]') as $link) {
$links[] = $link->href;
}
// trim the headlines because one on top and bottom were not needed
$output = array_slice($headlines, 1, -1);
// make sure the links are unique; array_values() re-indexes because array_unique() keeps the original keys
$result = array_values(array_unique($links));
// for each link output it in a nice list
foreach ($result as $i=>$linkk) {
$headline = isset($output[$i]) ? $output[$i] : '(empty)';
echo "<a href='$linkk'>$headline</a>" . "<br />";
}
Here is the foreach you are looking for:
foreach($output as $i=>$headers) {
$linkk = $result[$i];
echo "<a href='$linkk'>$headers</a>" . "<br />";
}
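This assumes the arrays have the same length and the same order. It also assumes $result is indexed 0, 1, 2, ...; array_unique() preserves the original keys, so re-index it with array_values() first.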
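To make the key issue concrete, here is a small self-contained sketch with made-up sample data in place of the scraped arrays (the headlines and example.com URLs are assumptions). It re-indexes the unique links and pairs only as many items as both lists contain.

```php
<?php
// Assumed sample data standing in for the scraped arrays.
$headlines = ['Council approves budget', 'Bridge reopens'];
$links = [
    'http://example.com/2016/04/budget',
    'http://example.com/2016/04/budget',   // duplicate
    'http://example.com/2016/04/bridge',
];

// array_unique() keeps the original keys (here 0 and 2), so re-index first.
$unique = array_values(array_unique($links));

// Pair only as many items as both lists have, to avoid undefined offsets.
$count = min(count($headlines), count($unique));
$pairs = [];
for ($i = 0; $i < $count; $i++) {
    $pairs[] = "<a href='{$unique[$i]}'>{$headlines[$i]}</a>";
}
echo implode("<br />\n", $pairs);
```

Pairing by position only works if the page lists headlines and links in the same order, so it is worth checking the output against the live page.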
I've been searching around for this but all I could find was broken scripts and plus, I might have a method that is quite simple.
I'm trying to use a for () loop for this one.
This is what I've got:
<?php
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
$makerepstring = "Here is a link: http://youtube.com and another: http://google.com";
if(preg_match_all($reg_exUrl, $makerepstring, $url)) {
// make the url into link
for($i=0; $i < count(array_keys($url[0])); $i++){
$makerepstring = preg_replace($reg_exUrl, ''.$url[0][$i].' ', $makerepstring);
}
}
echo $makerepstring;
?>
However, this fails brutally for some reason I can't comprehend.
The output from echo $makerepstring; is as follows (from the page source):
http://google.com " target="_blank" rel="nofollow">http://google.com </a> http://google.com " target="_blank" rel="nofollow">http://google.com </a>
I'd really like to do it with a for()... Could somebody try and figure out how to get this to work with me?
Thanks in advance!
/J
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
$makerepstring = "http://youtube.com http://google.com";
$url = array();
$instances = preg_match_all($reg_exUrl, $makerepstring, $url);
if ($instances > 0) {
// make the url into link
for($i=0; $i < count(array_keys($url[0])); $i++){
$makerepstring = preg_replace($reg_exUrl, ''.$url[0][$i].' ', $makerepstring);
/*echo $url[0][$i]."<br />";
echo $i."<br />";
print_r($url);
echo "<br />";*/
}
}
echo $makerepstring;
This does not work either, although I'm not quite sure how you meant I should do this.
EDIT:
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
$makeurl = "http://google.com http://youtube.com";
if(preg_match($reg_exUrl, $makeurl, $url)) {
echo preg_replace($reg_exUrl, ''.$url[0].' ', $makeurl);
} else {
echo $makeurl;
}
Would give:
http://google.com http://google.com
That's not how preg_match_all works. http://php.net/manual/en/function.preg-match-all.php shows that the matches go into an array you pass along, and the function itself returns the number of matches. So first call
...
$matches = array();
$instances = preg_match_all(..., $matches);
if ($instances > 0) {
// and then your code
}
...
And then iterate over the $matches array, which now has content.
You are performing the match twice:
first in the preg_match_all function,
then again in preg_replace, which should not happen here.
Use string concatenation instead:
$makerepstring = "Here is a link: http://youtube.com and another: http://google.com";
$new_str = '';
if(preg_match_all($reg_exUrl, $makerepstring, $url)) {
var_dump($url[0]);
// make the url into link
for($i=0; $i < count(array_keys($url[0])); $i++){
$new_str .= ''.$url[0][$i].' ';
}
}
echo $new_str;
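If the goal is to keep the surrounding text and turn each URL into its own link, a single preg_replace call with the $0 backreference may be enough. Calling preg_replace with the pattern inside a loop replaces every URL on every pass, which is why the last URL ends up everywhere. The sketch below reuses the pattern from the question; the target and rel attributes are taken from the output shown in the question.

```php
<?php
$reg_exUrl = '/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/';
$makerepstring = "Here is a link: http://youtube.com and another: http://google.com";

// $0 stands for the whole match, so each URL is wrapped in its own link
// and no loop over the matches is needed.
$linked = preg_replace(
    $reg_exUrl,
    '<a href="$0" target="_blank" rel="nofollow">$0</a>',
    $makerepstring
);

echo $linked;
```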
I'm fetching data from a website, and the script below works fine when I parse single words like "math", "chemistry", "science", etc. However, if I try to parse a keyword that contains a space, like "business math", the browser just loads forever and nothing seems to work. Please guide me.
<?php
include("simple_html_dom.php");
$keywords = "business math,chemistry,science";
$keywords = explode(',', $keywords);
foreach($keywords as $keyword) {
echo '<br><b><font color="red">Keyword: </font><font color="blue">'.$keyword.'</font></b><br>';
$html = file_get_html('http://www.tutorvista.com/search/'.$keyword);
$i = 1;
foreach($html->find('div[style=padding:20px; border-top:thin solid #DDDDDD; border-bottom:none;]') as $element) {
foreach($element->find('div[class=entry-abstract]') as $div) {
$title[$i] = $div->plaintext.'<br><br>';
}
$i++;
}
print_r($title);
}
?>
The problem is in the line:
$html = file_get_html('http://www.tutorvista.com/search/'.$keyword);
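That function internally uses file_get_contents(), which does not cope with spaces in a URL, so the keyword has to be percent-encoded first. Encode only the keyword, not the whole URL; otherwise the "://" and "/" separators get encoded as well and the URL no longer works.
Try this out:
$html = file_get_html('http://www.tutorvista.com/search/' . rawurlencode($keyword));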
Ref:
http://sourceforge.net/p/simplehtmldom/code/208/tree/trunk/simple_html_dom.php#l76
http://php.net/manual/en/function.file-get-contents.php
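A quick way to see what each encoding call produces is to print the results side by side. This sketch only builds strings and makes no HTTP request; the variable names are for illustration.

```php
<?php
$keyword = 'business math';
$base = 'http://www.tutorvista.com/search/';

// Encoding only the keyword keeps the URL structure intact.
$good = $base . rawurlencode($keyword);
echo $good, "\n"; // http://www.tutorvista.com/search/business%20math

// Encoding the whole URL also encodes the scheme and path separators.
$bad = urlencode($base . $keyword);
echo $bad, "\n";  // http%3A%2F%2Fwww.tutorvista.com%2Fsearch%2Fbusiness+math
```

rawurlencode() writes a space as %20, which is the usual form inside a URL path, while urlencode() writes it as +, which is meant for query strings.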
I'm working on a commenting web application and I want to parse user mentions (#user) as links. Here is what I have so far:
$text = "#user is not #user1 but #user3 is #user4";
$pattern = "/\#(\w+)/";
preg_match_all($pattern,$text,$matches);
if($matches){
$sql = "SELECT *
FROM users
WHERE username IN ('" .implode("','",$matches[1]). "')
ORDER BY LENGTH(username) DESC";
$users = $this->getQuery($sql);
foreach($users as $i=>$u){
$text = str_replace("#{$u['username']}",
"<a href='#' class='ct-userLink' rel='{$u['user_id']}'>#{$u['username']}</a> ", $text);
}
echo $text;
}
The problem is that user links are being overlapped:
<a rel="11327" class="ct-userLink" href="#">
<a rel="21327" class="ct-userLink" href="#">#user</a>1
</a>
How can I avoid links overlapping?
Answer Update
Thanks to the accepted answer, this is what my new foreach loop looks like:
foreach($users as $i=>$u){
$text = preg_replace("/#".$u['username']."\b/",
"<a href='#' title='{$u['user_id']}'>#{$u['username']}</a> ", $text);
}
The problem seems to be that some usernames can contain other usernames. So user1 is properly replaced with <a>user1</a>; then user matches inside it, and the result is <a><a>user</a>1</a>. My suggestion is to change your string replace to a regex that requires a word boundary, \b, after the username.
The Twitter widget has JavaScript code to do this. I ported it to PHP in my WordPress plugin. Here's the relevant part:
function format_tweet($tweet) {
// add #reply links
$tweet_text = preg_replace("/\B[#@]([a-zA-Z0-9_]{1,20})/",
"#<a class='atreply' href='http://twitter.com/$1'>$1</a>",
$tweet);
// make other links clickable
$matches = array();
$link_info = preg_match_all("/\b(((https*\:\/\/)|www\.)[^\"\']+?)(([!?,.\)]+)?(\s|$))/",
$tweet_text, $matches, PREG_SET_ORDER);
if ($link_info) {
foreach ($matches as $match) {
$http = preg_match("/w/", $match[2]) ? 'http://' : '';
$tweet_text = str_replace($match[0],
"<a href='" . $http . $match[1] . "'>" . $match[1] . "</a>" . $match[4],
$tweet_text);
}
}
return $tweet_text;
}
Instead of parsing for '#user', parse for '#user ' (with a space at the end), or for ' #user ' to also avoid wrongly parsing email addresses (e.g. mailaddress#user.com). Maybe ' #user: ' should be allowed as well. This will only work if usernames contain no whitespace.
You can use a custom string-replace function that stops after the first replacement, something like:
function str_replace_once($needle , $replace , $haystack){
$pos = strpos($haystack, $needle);
if ($pos === false) {
// Nothing found
return $haystack;
}
return substr_replace($haystack, $replace, $pos, strlen($needle));
}
And use it like:
foreach($users as $i=>$u){
$text = str_replace_once("#{$u['username']}",
"<a href='#' class='ct-userLink' rel='{$u['user_id']}'>#{$u['username']}</a> ", $text);
}
You shouldn’t replace one certain user mention at a time but all at once. You could use preg_split to do that:
// split text at mention while retaining user name
$parts = preg_split("/#(\w+)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE);
$n = count($parts);
// $n is always an odd number; 1 means no match found
if ($n > 1) {
// collect user names
$users = array();
for ($i=1; $i<$n; $i+=2) {
$users[$parts[$i]] = '';
}
// get corresponding user information
$sql = "SELECT *
FROM users
WHERE username IN ('" .implode("','", array_keys($users)). "')";
$users = array();
foreach ($this->getQuery($sql) as $user) {
$users[$user['username']] = $user;
}
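// replace mentions
for ($i=1; $i<$n; $i+=2) {
if (!isset($users[$parts[$i]])) {
$parts[$i] = '#' . $parts[$i]; // unknown user: restore the '#' that preg_split dropped
continue;
}
$u = $users[$parts[$i]];
$parts[$i] = "<a href='#' class='ct-userLink' rel='{$u['user_id']}'>#{$u['username']}</a>";
}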
// put everything back together
$text = implode('', $parts);
}
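The same all-at-once idea can also be sketched with preg_replace_callback, which visits each mention exactly once and therefore cannot nest links. The $users array below is a hypothetical lookup table standing in for the database result; the IDs are borrowed from the question's output.

```php
<?php
// Hypothetical lookup table standing in for the database query result.
$users = [
    'user'  => ['user_id' => 21327, 'username' => 'user'],
    'user1' => ['user_id' => 11327, 'username' => 'user1'],
];

$text = "#user is not #user1 but #user3 is unknown";

$text = preg_replace_callback('/#(\w+)/', function (array $m) use ($users): string {
    if (!isset($users[$m[1]])) {
        return $m[0];            // unknown name: leave the mention as plain text
    }
    $u = $users[$m[1]];
    return "<a href='#' class='ct-userLink' rel='{$u['user_id']}'>#{$u['username']}</a>";
}, $text);

echo $text;
```

Since \w+ consumes the whole name, #user1 is looked up as user1 and never as user followed by a stray 1.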
I like dnl's solution of parsing ' #user', but maybe it is not suitable for you.
Anyway, did you try using the strip_tags function to remove the anchor tags? That way you have the string without the links, and you can parse it and build the links again.