Here is what I want to do.
Let's say I am looking for the link "example.com" in a file at "http://example.com/test.html".
I want a PHP script that looks for an <a> tag pointing to that link on the page mentioned. However, I also need it to work if the <a> tag has a class or id attribute.
See the question below:
How can I check if a URL exists via PHP?
Or try this:
$file = 'http://www.domain.com/somefile.jpg';
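$file_headers = @get_headers($file);
// get_headers() returns false on failure; match the status code rather than the exact status line
if($file_headers === false || strpos($file_headers[0], ' 404') !== false) {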
$exists = false;
}
else {
$exists = true;
}
From here: http://www.php.net/manual/en/function.file-exists.php#75064
...and right below the above post, there's a curl solution:
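function url_exists($url) {
if (!$ch = curl_init($url)) return false;
// curl_init() alone sends no request, so issue a HEAD request and check the status code
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_exec($ch);
$code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
return $code >= 200 && $code < 400;
}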
Update:
You can use the SimpleHtmlDom class to find an id or class in a tag.
See the URLs below:
http://simplehtmldom.sourceforge.net/
http://simplehtmldom.sourceforge.net/manual_api.htm
http://sourceforge.net/projects/simplehtmldom/files/
http://davidwalsh.name/php-notifications
Here is what I have found in case anyone else needs it also!
$url = "http://example.com/test.html";
$searchFor = "example.com";
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
// Start from "not found" so the flag is defined even when the page has no links
$isMatch = false;
if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
foreach($matches as $match) {
echo $match[2];
if ($match[2] == $searchFor)
{
// Stop at the first hit so that later links cannot reset the flag
$isMatch = true;
break;
}
// $match[0] = A tag
// $match[2] = link address
// $match[3] = link text
}
}
if ($isMatch)
{
echo "<p><font color=red size=5 align=center>The page specified does contain your link. You have been credited the award amount!</font></p>";
} else {
echo "<p><font color=red size=5 align=center>The specified page does not have your referral link.</font></p>";
}
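A regular expression over HTML can be fragile when attributes such as class or id appear in an unexpected order. As a rough sketch of an alternative, the snippet below parses the markup with PHP's built-in DOMDocument and compares the host of each link. The function name pageLinksTo and the sample HTML are made up for illustration; in practice the HTML would come from file_get_contents($url).

```php
<?php
// Hypothetical helper: returns true if any <a href> in $html points at $host.
function pageLinksTo(string $html, string $host): bool
{
    if (trim($html) === '') {
        return false; // nothing to parse
    }

    // Collect parser errors internally instead of emitting warnings
    // for the malformed markup that real pages often contain.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $a) {
        // Other attributes (class, id, rel) do not matter here.
        $linkHost = parse_url($a->getAttribute('href'), PHP_URL_HOST);
        if (!is_string($linkHost)) {
            continue; // relative link, missing href, or unparsable URL
        }
        // Treat "www.example.com" and "example.com" as the same host.
        $linkHost = preg_replace('/^www\./i', '', $linkHost);
        if (strcasecmp($linkHost, $host) === 0) {
            return true;
        }
    }
    return false;
}

$sample = '<p><a class="ref" id="x1" href="http://www.example.com/page">Example</a></p>';
var_dump(pageLinksTo($sample, 'example.com'));
```

Comparing hosts rather than whole strings is a design choice in this sketch: it accepts any path under the domain, which may or may not be what a referral check needs.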
Related
I would like to scrape the Google search results up to page 2, but my website either shows a blank page or times out.
for($j=0; $j<$acount; $j++){
sleep(60);
for($sp = 0; $sp <= 10; $sp+=10){
$url = 'http://www.google.'.$lang.'/search?q='.$in.'&start='.$sp;
if($sp == 10){
$datenbank = "proxy_work.php";
$datei = fopen($datenbank,"a+");
fwrite($datei, $data);
fwrite ($datei,"\r\n");
fclose($datei);
} else {
$datenbank = "proxy_work.php";
$datei = fopen($datenbank,"w+");
fwrite($datei, $data);
fwrite ($datei,"\r\n");
fclose($datei);
}
}
$html = file_get_html("proxy_work.php");
foreach($html->find('a') as $e){
// $title = $h3->innertext;
$link = $e->href;
if(in_array($endomain, $approveurl)){
}
// if it is not a direct link but url reference found inside it, then extract
if (!preg_match('/^https?/', $link) && preg_match('/q=(.+)&sa=/U', $link, $matches) && preg_match('/^https?/', $matches[1])) {
$link = $matches[1];
} else if (!preg_match('/^https?/', $link)) { // skip if it is not a valid link
continue;
}
}
}
Google search result pages (SERPs) are not like a common website with static HTML. Google protects its data against web scraping. Treat its data like a business directory and consider the following tips for scraping business directories:
IP-proxying.
Imitating human behaviour by using some browser automation tools (Selenium, iMacros and others).
Read more here.
I'm trying to get the title of a website that is entered by the user.
Text input: website link, entered by user is sent to the server via AJAX.
The user can input anything: an actual existing link, just a single word, or something weird like 'po392#*#8'.
Here is a part of my PHP script:
// Make sure the url is on another host
if(substr($url, 0, 7) !== "http://" AND substr($url, 0, 8) !== "https://") {
$url = "http://".$url;
}
// Extra confirmation for security
if (filter_var($url, FILTER_VALIDATE_URL, FILTER_FLAG_HOST_REQUIRED)) {
$urlIsValid = "1";
} else {
$urlIsValid = "0";
}
// Make sure there is a dot in the url
if (strpos($url, '.') !== false) {
$urlIsValid = "1";
} else {
$urlIsValid = "0";
}
// Retrieve title if no title is entered
if($title == "" AND $urlIsValid == "1") {
function get_http_response_code($theURL) {
$headers = get_headers($theURL);
if($headers) {
return substr($headers[0], 9, 3);
} else {
return 'error';
}
}
if(get_http_response_code($url) != "200") {
$urlIsValid = "0";
} else {
$file = file_get_contents($url);
$res = preg_match("/<title>(.*)<\/title>/siU", $file, $title_matches);
if($res === 1) {
$title = preg_replace('/\s+/', ' ', $title_matches[1]);
$title = trim($title);
$title = addslashes($title);
}
// If title is still empty, make title the url
if($title == "") {
$title = $url;
}
}
}
However, there are still errors occurring in this script.
It works perfectly if an existing URL such as 'https://www.youtube.com/watch?v=eB1HfI-nIRg' is entered, and when a non-existing page such as 'https://www.youtube.com/watch?v=NON-EXISTING' is entered, but it doesn't work when the user enters something like 'twitter.com' (without http) or something like 'yikes'.
I tried literally everything: cURL, DOMDocument...
The problem is that when an invalid link is entered, the AJAX call never completes (it keeps loading), while it should set $urlIsValid = "0" whenever an error occurs.
I hope someone can help; it's appreciated.
Nathan
You have a relatively simple problem but your solution is too complex and also buggy.
These are the problems that I've identified with your code:
// Make sure the url is on another host
if(substr($url, 0, 7) !== "http://" AND substr($url, 0, 8) !== "https://") {
$url = "http://".$url;
}
This does not make sure that the URL is on another host (it could still be localhost); it only prepends a scheme. You should remove this code.
// Make sure there is a dot in the url
if (strpos($url, '.') !== false) {
$urlIsValid = "1";
} else {
$urlIsValid = "0";
}
This code overwrites the result of the check above it, where you validate that the string is indeed a valid URL, so remove it.
The definition of the additional function get_http_response_code is pointless. You could use only file_get_contents to get the HTML of the remote page and check it against false to detect the error.
Also, from your code I conclude that, if the (external to context) variable $title is empty then you won't execute any external fetch so why not check it first?
To sum it up, your code should look something like this:
if('' === $title && filter_var($url, FILTER_VALIDATE_URL))
{
//getContentsFromUrl() uses @ to suppress warnings, as we won't need them
//this could be done with error_reporting(0) or similar side-effect method
$html = getContentsFromUrl($url);
if(false !== $html && preg_match("/<title>(.*)<\/title>/siU", $html, $title_matches))
{
$title = preg_replace('/\s+/', ' ', $title_matches[1]);
$title = trim($title);
$title = addslashes($title);
}
// If title is still empty, make title the url
if($title == "") {
$title = $url;
}
}
function getContentsFromUrl($url)
{
//if not full/complete url
if(!preg_match('#^https?://#ims', $url))
{
$completeUrl = 'http://' . $url;
$result = @file_get_contents($completeUrl);
if(false !== $result)
{
return $result;
}
//we try with https://
$url = 'https://' . $url;
}
return @file_get_contents($url);
}
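One thing to watch: filter_var() with FILTER_VALIDATE_URL rejects input that has no scheme, so 'twitter.com' would fail the check before getContentsFromUrl() gets a chance to prepend one. A possible way around this, sketched below under that assumption, is to build the list of URLs worth trying first and validate each of them. The name candidateUrls is hypothetical and not part of the code above.

```php
<?php
// Hypothetical helper: list the URLs worth trying for a piece of user input.
// Returns an empty array when nothing usable can be built from the input.
function candidateUrls(string $input): array
{
    $input = trim($input);
    if ($input === '') {
        return [];
    }

    // Input that already has a scheme is tried as-is;
    // otherwise try http:// first and https:// second.
    $candidates = preg_match('#^https?://#i', $input)
        ? [$input]
        : ['http://' . $input, 'https://' . $input];

    // Keep only the candidates PHP considers syntactically valid URLs.
    return array_values(array_filter($candidates, function ($url) {
        return filter_var($url, FILTER_VALIDATE_URL) !== false;
    }));
}

print_r(candidateUrls('twitter.com'));
```

Each candidate could then be passed to @file_get_contents() in turn, stopping at the first one that does not return false. Syntactic validity says nothing about whether the host exists, so input such as 'yikes' still has to be caught by the failed fetch.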
How do I add a statement so that, when I search and nothing is found at the URL, it shows nothing.html?
$url1 = "http://www.pengadaan.net/tend_src_cont2.php?src_nm=";
$url2 = $_GET['src_nm']."&src_prop=";
$url3 = $_GET['src_prop'];
$url = $url1.$url2.$url3;
$html = file_get_html($url);
if (method_exists($html,"find")) {
echo "<ul>";
foreach($html->find('div[class=pengadaan-item] h1[] a[]') as $element ) {
echo ("<li>".$element."</li>");
}
echo "</ul>";
echo $url;
}
else {
}
There are two ways to move to another page in PHP. You can call header("Location: http://www.yourwebsite.com/nothing.php"); or, if output has already been sent, you can have PHP echo JavaScript to do the redirect:
if (method_exists($html,"find")) { // If the 'find' method exists
...
} else { // Otherwise it does not exist
header("Location: http://www.pengadaan.net/nothing.php"); // redirect here
exit; // stop the script so nothing else runs after the redirect
}
Or, if you have already sent your headers, you can get around it using JavaScript:
...
} else {
echo '<script>window.location.replace("http://www.pengadaan.net/nothing.php")</script>';
}
I am trying to make a script to check if a webpage has a back link to my page. I have found this script but the problem is that it returns the error message "No back link found" even if there is a backlink. Could someone tell me what is wrong with this script?
Here is the script I am using:
require('simple_html_dom.php');
function CheckReciprocal( $targetUrl, $checkLinkUrl, $checkNofollow = true )
{
$html = file_get_html($targetUrl);
if (empty($html))
{
//# Could not load file
return false;
}
$link = $html->find('a[href^='.$checkLinkUrl.']',0);
if (empty($link))
{
//# Link not found
return false;
}
if ( $checkNofollow && $link->hasAttribute('rel') )
{
$attr = $link->getAttribute('rel');
return (preg_match("/\bnofollow\b/is", $attr) ? false : true);
}
return true;
}
$targetUrl = 'http://example.com/test.html';
$checkLinkUrl = 'http://mysite.com';
if ( CheckReciprocal($test, $checkLinkUrl) )
{
echo 'Link found';
}
else { echo 'Link not found or marked as nofollow'; }
Thank you!
I don't know how simple_html_dom.php's $html->find() works because I have never used it, but it seems that your problem is there. I would trust good old DOMDocument plus a regex.
I just wrote and tested the function below. For $url, pass the plain domain plus whatever path you want; don't worry about http(s), www and so on:
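function checkBackLink($link, $url, $checkNoFollow = true){
$dom = new DOMDocument();
// @ hides the warnings DOMDocument emits for malformed real-world HTML
if(!@$dom->loadHTMLFile($link)) return false;
// Escape the domain so that characters such as "." are matched literally
$pattern = '#^(https?://)?(www\.)?' . preg_quote($url, '#') . '#i';
foreach($dom->getElementsByTagName('a') as $item){
if($checkNoFollow){
if(preg_match('/nofollow/is', $item->getAttribute('rel'))) continue;
}
if($item->hasAttribute('href') === false) continue;
if(preg_match($pattern, $item->getAttribute('href'))) return true;
}
// No matching link was found
return false;
}
if(checkBackLink('the link', 'example.com')){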
echo "link found";
} else {
echo "Link not found or marked as nofollow";
}
If you don't like it and still want to use simple_html_dom, make sure you understand how find() works, because if it only matches exact values that could be troublesome.
I'm looking for a way to extract both (partial) YouTube URLs and single ids from a user input string.
The article "How do I find all YouTube video ids in a string using a regex?" got me going quite well, but I'm still struggling a bit.
Is there a way to find playlist and/or video ids in strings ranging from:
E4uySuFiCis
PLBE0103048563C552
Through:
?v=4OfUVmfNk4E&list=PLBE0103048563C552&index=5
http://www.youtube.com/watch?v=4OfUVmfNk4E&list=PLBE0103048563C552&index=5
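Use:
$urlInfo = parse_url($url); // to get url components (scheme:host:query)
$urlVars = array();
// parse_url() only sets 'query' when the URL has a query string
$queryString = isset($urlInfo['query']) ? $urlInfo['query'] : '';
parse_str($queryString, $urlVars); // to get the query vars
Check out the YouTube API for more details on the format.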
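To make the parse_str() idea concrete for the inputs listed in the question, here is a minimal sketch. The function name extractYouTubeIds is hypothetical, and it assumes that video ids are 11 characters long and that playlist ids start with "PL"; both are inferred from the examples above rather than guaranteed by YouTube.

```php
<?php
// Hypothetical helper: pull a video id and/or playlist id out of user input.
// Assumptions (from the examples only): video ids have 11 characters,
// playlist ids start with "PL".
function extractYouTubeIds(string $input): array
{
    $ids = ['video' => null, 'playlist' => null];
    $input = trim($input);

    // Case 1: the input is a bare id.
    if (preg_match('/^[A-Za-z0-9_-]{11}$/', $input)) {
        $ids['video'] = $input;
        return $ids;
    }
    if (preg_match('/^PL[A-Za-z0-9_-]+$/', $input)) {
        $ids['playlist'] = $input;
        return $ids;
    }

    // Case 2: a full or partial URL. Take everything after the first "?"
    // (or the whole string if there is none) and parse it as a query string.
    $pos = strpos($input, '?');
    $query = $pos === false ? $input : substr($input, $pos + 1);
    parse_str($query, $vars);

    if (isset($vars['v']) && is_string($vars['v'])
        && preg_match('/^[A-Za-z0-9_-]{11}$/', $vars['v'])) {
        $ids['video'] = $vars['v'];
    }
    if (isset($vars['list']) && is_string($vars['list'])
        && preg_match('/^PL[A-Za-z0-9_-]+$/', $vars['list'])) {
        $ids['playlist'] = $vars['list'];
    }
    return $ids;
}

print_r(extractYouTubeIds('?v=4OfUVmfNk4E&list=PLBE0103048563C552&index=5'));
```

An 11-character string that happens to start with "PL" is treated as a video id here; that ambiguity cannot be resolved from the string alone.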
I wrote a script to do this once where the YouTube URL is posted via "POST" under the key "l" (lowercase "L").
Unfortunately I never got round to incorporating it into my project so it's not been extensively tested to see how it does. If it fails it calls invalidURL with the URL as a parameter, if it succeeds it calls validURL with the ID from the URL.
This script may not be exactly what you're after because it ONLY retrieves the ID of the video currently playing - but you should be able to modify it easily.
if (isset($_POST['l'])) {
$ytIDLen = 11;
$link = $_POST['l'];
if (preg_match('|^http(s)?://[a-z0-9-]+(\.[a-z0-9-]+)*(:[0-9]+)?(/.*)?$|i', $link)) {
$urlParts = parse_url($link);
//$scheme
//$host
//$path
//$query["v"]
if (isset($urlParts["scheme"])) {
if ( ($urlParts["scheme"] == "http" ) || ($urlParts["scheme"] == "https") ) {
//$scheme = "http";
} else invalidURL($link);
} //else $scheme = "http";
if (isset($urlParts["host"])) {
if ( ($urlParts["host"] == "www.youtube.com") || ($urlParts["host"] == "www.youtube.co.uk") || ($urlParts["host"] == "youtube.com") || ($urlParts["host"] == "youtube.co.uk")) {
//$host = "www.youtube.com";
if (isset($urlParts["path"])) {
if ($urlParts["path"] == "/watch") {
//$path = "/watch";
if (isset($urlParts["query"])) {
$query = array();
parse_str($urlParts["query"],$query);
if (isset($query["v"])) {
// Keep only characters that can appear in a video ID (letters, digits, "-" and "_")
$query["v"] = preg_replace("/[^a-zA-Z0-9_-]/", "", $query["v"]);
if (strlen($query["v"]) == $ytIDLen) {
validURL($query["v"]);
} else invalidURL($link);
} else invalidURL($link);
} else invalidURL($link);
} else invalidURL($link);
} else invalidURL($link);
} else invalidURL($link);
} else invalidURL($link);
} else invalidURL($link);
}