I have got a PHP code to get all images in a URL. It will display the URL of the image. But there are some problems with the URL of the image obtained.
1: If the images are hosted in the root directory, it will not show the domain. For example, if the given URL is www.google.com, the URL of the image obtained is
"/logos/doodles/2014/womens-day-2014-6253511574552576.3-hp.png"
and what I need is
"www.google.com/logos/doodles/2014/womens-day-2014-6253511574552576.3-hp.png"
2: The URL obtained is always between " ". How can I remove it??
Here the PHP code
<?php
$url_image = $_GET['url'];
$homepage = file_get_contents($url_image);
preg_match_all("{<img\\s*(.*?)src=('.*?'|\".*?\"|[^\\s]+)(.*?)\\s*/?>}ims", $homepage, $matches, PREG_SET_ORDER);
foreach ($matches as $val) {
echo $val[2];
echo "<br>";
}
?>
try this:
$url_image = $_GET['url'];
$homepage = file_get_contents($url_image);
preg_match_all("{<img\\s*(.*?)src=('.*?'|\".*?\"|[^\\s]+)(.*?)\\s*/?>}ims", $homepage, $matches, PREG_SET_ORDER);
foreach ($matches as $val) {
$pos = strpos($val[2],"/");
$link = substr($val[2],1,-1);
if($pos == 1)
echo "http://domain.com" . $link;
else
echo $link;
echo "<br>";
}
You can remove the "" using substr($url,1,-1);
Related
I want to remove URLs of certain sites within a string
I used this:
<?php
$URLContent = '<p>Google</p><p>AnotherSite</p>';
$LinksToRemove = array('google.com', 'yahoo.com', 'msn.com');
$LinksToCheck = in_array('google.com' , $LinksToRemove);
if (strpos($URLContent, $LinksToCheck) !== 0) {
$URLContent = preg_replace('#<a.*?>([^>]*)</a>#i', '$1', $URLContent);
}
echo $URLContent;
?>
In this example, I want to remove URLs of google.com, yahoo.com and msn.com websites only if any of them found in string $URLContent, but keep any other links.
The result of the previous code is:
<p>Google</p><p>AnotherSite</p>
but I want it to be:
<p>Google</p><p>AnotherSite</p>
One solution would be to explode your $URLContent and compare for each value in $LinksToCheck.
It could be like this :
<?php
$URLContent = '<p>Google</p><p>AnotherSite</p>';
$urlList = explode('</p>', $URLContent);
$LinksToRemove = array('google.com', 'yahoo.com', 'msn.com');
$urlFormat = [];
foreach ($urlList as $url) {
foreach ($LinksToRemove as $link) {
if (str_contains($url, $link)) {
$url = '<p>' . ucfirst(str_replace('.com', '', $link)) . '</p>';
break;
}
}
$urlFormat[] = $url;
}
$result = implode('', $urlFormat);
Recently i've been having issues with the code of not parsing only 1 result from the google search url. Instead it parses 10 results which, I am not sure if it is changable.
<?php
include('simple_html_dom.php');
$html = file_get_html('https://www.google.com/search?q=raspberry&oq=raspberry&aqs=chrome.0.69i59j0l5.1144j0j7&sourceid=chrome&ie=UTF-8');
$linkObjs = $html->find('div[class=jfp3ef] a');
foreach ($linkObjs as $linkObj) {
$title = trim($linkObj->plaintext);
$link = trim($linkObj->href);
//if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
$link = $matches[1];
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
echo '<p>Title:' . $title . '<br />';
echo 'Link: ' . $link . '</p>';
}
?>
All, I need is 1 result being scraped off a google search. Thats all. The code i've included is the web scraper written in PHP.
I need to verify a text to show it in the page of a website. I need to transform all urls links of the the same website(not others urls of other websites) in links. I need to involve all them with the tag <a>. The problem is is the property href, that I need to put the correct url inside it. I am trying to verify all the the text and if I find a url, I need to verify if it contains the substring "http://". If not, I must put it in the href property. I did some attempt, but all their aren't working yet :( . Any idea how can I do this?
My function is below:
$string = "This is a url from my website: http://www.mysite.com.br and I have a article interesting there, the link is http://www.mysite.com.br/articles/what-is-psychology/205967. I need that the secure url link works too https://www.mysite.com.br/articles/what-is-psychology/205967. the following urls must be valid too: www.mysite.com.br and mysite.com.br";
function urlMySite($string){
$verifyUrl = '';
$urls = array("mysite.com.br");
$text = explode(" ", $string);
$alltext = "";
for($i = 0; $i < count($texto); $i++){
foreach ($urls as $value){
$pos = strpos($text[$i], $value);
if (!($pos === false)){
$verifyUrl = " <a href='".$text[$i]."' target='_blank'>".$text[$i]."</a> ";
if (strpos($verifyUrl, 'http://') !== true) {
$verifyUrl = " <a href='http://".$text[$i]."' target='_blank'>".$text[$i]."</a> ";
}
$alltext .= $verifyUrl;
} else {
$alltext .= " ".$text[$i]." ";
}
}
}
return $alltext;
}
You should use PREG_MATCH_ALL to find all occurances of the URL and replace each of the Matches with a clickable Link.
You could use this function:
function augmentText($text){
$pattern = "~(https?|file|ftp)://[a-z0-9./&?:=%-_]*~i";
preg_match_all($pattern, $text, $matches);
if( count($matches[0]) > 0 ){
foreach($matches[0] as $match){
$text = str_replace($match, "<a href='" . $match . "' target='_blank'>" . $match . "</a>", $text);
}
}
return $text;
}
Change the reguylar expression pattern to match only the URL's you want to make clickable.
Good luck
I created regex which gives image url from the source code of the page.
<?php
function get_logo($html, $url)
{
//preg_match_all('', $html, $matches);
//preg_match_all('~\b((\w+ps?://)?\S+(png|jpg))\b~im', $html, $matches);
if (preg_match_all('/\bhttps?:\/\/\S+(?:png|jpg)\b/', $html, $matches)) {
echo "First";
return $matches[0][0];
} else {
if (preg_match_all('~\b((\w+ps?://)?\S+(png|jpg))\b~im', $html, $matches)) {
echo "Second";
return url_to_absolute($url, $matches[0][0]);
//return $matches[0][0];
} else
return null;
}
}
But for wikipedia page image url is like this
http://en.wikipedia.org/wiki/File:Nelson_Mandela-2008_(edit).jpg
which always fails in my regex.
How can I get rid of this?
Why try to parse HTML with regex when this can easily be done with the DOMDocument class in PHP.
<?php
$doc = new DOMDocument();
#$doc->loadHTMLfile( "http://www.wikipedia.org/" );
$images = $doc->getElementsByTagName("img");
foreach( $images as $image ) {
echo $image->getAttribute("src");
echo "<br>";
}
?>
Inside my webpage "http://www.my_web_page" I have something like that: img border="0" class="borderx" id="pic_15132"
I wanna know which parameter did I have to use inside preg_match_all("") to retrieve only the number.
Here is what I tried but my result is: _15132
$webpage = file_get_contents("http://www.my_web_page");
preg_match_all("#_([^>]*)[0-9]{6}#", $webpage , $matches)
foreach($matches[0] as $value)
{
echo $value . "<br>";
}
Thanks,
$webpage = file_get_contents("http://www.my_web_page");
preg_match_all("#_[^>]*([0-9]{6})#", $webpage , $matches)
foreach($matches[1] as $value)
{
echo $value . "<br>";
}