I am trying to grab content from another one of my sites, which is working fine, except that all the links are incorrect.
include_once('../simple_html_dom.php');
$page = file_get_html('http://www.website.com');
$ret = $page->find('div[id=header]');
echo $ret[0];
Is there any way to have all the links show the full URL instead of the relative one, using preg_replace?
$ret[0] = preg_replace('#(http://([\w-.]+)+(:\d+)?(/([\w/_.]*(\?\S+)?)?)?)#',
'http://fullwebsitellink.com$1', $ret[0]);
I guess it would be something like the above, but I don't understand it.
Thanks
Your question doesn't really explain what is "incorrect" about the links, but I'm guessing you have something like this:
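<div id="header"><a href="/">Home</a> | <a href="/sitemap">Sitemap</a></div>
and you want to embed it in another site, where those links need to be fully-qualified with a domain name, like this:
<div id="header"><a href="http://example.com/">Home</a> | <a href="http://example.com/sitemap">Sitemap</a></div>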
Assuming this is the case, the replacement you want is so simple you don't even need a regex: find all href attributes beginning "/", and add the domain part (I'll use "http://example.com") to their beginning to make them absolute:
$scraped_html = str_replace('href="/', 'href="http://example.com/', $scraped_html);
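If the scraped markup also uses single-quoted attributes or src attributes, a slightly more general version might look like the sketch below. The function name absolutize_links is made up for illustration, and it only handles root-relative paths (those starting with a single "/"), leaving protocol-relative "//" links alone.

```php
<?php
// Hypothetical helper: prefix root-relative href/src values with a base URL.
function absolutize_links(string $html, string $base): string
{
    $base = rtrim($base, '/');

    return preg_replace_callback(
        // href= or src=, either quote style, a "/" that is not followed by a second "/"
        '~\b(href|src)=(["\'])/(?!/)~i',
        function (array $m) use ($base): string {
            return $m[1] . '=' . $m[2] . $base . '/';
        },
        $html
    );
}

$scraped_html = '<div id="header"><a href="/">Home</a> | <a href=\'/sitemap\'>Sitemap</a></div>';
echo absolutize_links($scraped_html, 'http://example.com');
```

Links written relative to the current directory (for example href="page.html") are not touched by this; those would need the path of the scraped page as well.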
Related
I have seen on most online newspaper websites that when I click on a headline link, e.g. two thieves caught red handed, it normally opens a URL like this: www.example.co.uk/news/two-thieves-caught-red-handed.
How do I deal with this url in php code, so that I can only pick the last part in the url. e.g. two-thieves-caught-red-handed. After that I want to work with this string.
I know how to deal with GET parameters like "www.example.co.uk/news/headline=two thieves caught red handed".
But I do not want to do it that way. Could you show me another way.
You can use the combination of explode and end functions for that
for example:
<?php
$url = "www.example.co.uk/news/two-thieves-caught-red-handed";
$url = explode('/', $url);
$end = end($url);
echo "$end";
?>
The code will output:
two-thieves-caught-red-handed
You have several options in php to get the current url. For a detailed overview look here.
One would be to use $_SERVER['REQUEST_URI'] and then use a string manipulation function to extract the parts you need.
Maybe this thread will help you too.
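As a rough sketch of that idea: parse_url() can strip the query string before the path is split, so a trailing ?ref=... does not end up in the result. The function name last_path_segment and the fallback URI are only there for illustration.

```php
<?php
// Hypothetical helper: return the last path segment of a URL or request URI.
function last_path_segment(string $uri): string
{
    $path = parse_url($uri, PHP_URL_PATH);
    if (!is_string($path)) {
        return '';
    }
    // Trim slashes so "/news/foo/" and "/news/foo" behave the same
    $segments = explode('/', trim($path, '/'));

    return rawurldecode(end($segments));
}

// REQUEST_URI is not set on the command line, so fall back to a sample value
$uri = $_SERVER['REQUEST_URI'] ?? '/news/two-thieves-caught-red-handed?ref=home';
echo last_path_segment($uri);
```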
I want to apply the page HTML title in the URL
for example in here (stackoverflow) the url is something like that:
http://stackoverflow.com/questions/10000000/get-the-title-of-a-page-url
you can see the "get-the-title-of-a-page-url" part which is the page title
What I mean is: when the user goes to spowpost.php?post=1
the actual URL that shows up when the page loads will be
spowpost.php?post=1&title=..the_title..
how can i do that?
EDIT: I was thinking about .htaccess, but I don't know it very well, so a tutorial for this particular case would help.
You can use .htaccess for this. For example:
RewriteEngine On
RewriteRule ^questions/(\d+)/([a-z-]+) index.php?id=$1&title=$2 [L]
Your PHP page (index.php) receives the id and title as parameters in $_GET[]:
echo $_GET['title'];
// get-the-title-of-a-page-url
You can then use that (or the id, which is easier) to retrieve the correct item from your data source:
// Assuming you didn't store the - in the database, replace them with spaces
$real_title = str_replace("-", " ", $_GET['title']);
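// mysql_real_escape_string() was removed in PHP 7; with mysqli, pass the connection too (called $link here)
$real_title = mysqli_real_escape_string($link, $real_title);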
// Query it with something like
SELECT * FROM tbl WHERE LOWER(title) = '$real_title';
Assuming you do have an id parameter in the URL of some sort, it's easier to query based on that value. The title portion can be used really only to make a readable URL, without needing to act on it in PHP.
In reverse, to convert the title to the format-like-this-to-use-in-urls, do:
$url_title = strtolower(str_replace(' ', '-', $original_title));
The above assumes your titles don't include any characters that are illegal in a URL...
$article_link = "http://example.com/spowpost.php?post=$postid&title=$url_title";
Or to feed to .htaccess:
$article_link = "http://example.com/spowpost$postid/$url_title";
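If your titles can contain punctuation or other characters that don't belong in a URL, one possible approach is to collapse everything that isn't a letter or digit into a single dash. This is only a sketch: slugify is a made-up name, and it simply drops non-ASCII characters rather than transliterating them.

```php
<?php
// Hypothetical helper: turn a title into a lowercase, dash-separated slug.
function slugify(string $title): string
{
    $slug = strtolower($title);
    // Replace every run of characters other than a-z and 0-9 with one dash
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);

    // Remove dashes left over at either end
    return trim($slug, '-');
}

echo slugify('Get the title of a page (URL)?'); // get-the-title-of-a-page-url
```

Note that the RewriteRule shown earlier only accepts [a-z-]+ for the title part, so if your slugs can contain digits the pattern would need to be widened to something like [a-z0-9-]+.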
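From what I understand, your title will be passed to the page as part of the URL. To show it in the title bar, put this in the <head> section (values in $_GET are already URL-decoded, and htmlspecialchars keeps the visitor from injecting markup):
<?php $title = htmlspecialchars($_GET["title"]); echo "<title>$title</title>"; ?>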
You might need to change parts of this, for instance dashes to spaces or something. If that is the case, use PHP's str_replace function: http://php.net/str_replace
I'm not sure what problem you are facing, but going by what you say in your post, the answer would be:
1. You take the ID from the URL.
2. You search your database for the original title.
3. And then you display it in the <title> tag in the <head> of your HTML.
Clarify if you have problem with any of the previous points.
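Those three steps could be sketched roughly as follows. The $posts array is just a stand-in for a real database lookup, and page_title_for is a hypothetical name, so treat this as an outline rather than a finished implementation.

```php
<?php
// Stand-in for a database table: post id => original title (made-up data).
$posts = [1 => 'Get the title of a page URL'];

// Steps 1 and 2: validate the id from the URL and look the title up.
function page_title_for(array $posts, $rawId): string
{
    $id = filter_var($rawId, FILTER_VALIDATE_INT);
    if ($id === false || !isset($posts[$id])) {
        return 'Post not found';
    }

    return $posts[$id];
}

// Step 3: print it inside the <title> tag, escaped for HTML.
$title = page_title_for($posts, $_GET['post'] ?? '1');
echo '<title>' . htmlspecialchars($title) . '</title>';
```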
I have tried searching through related answers but can't quite find something that is suitable for my specific needs. I have quite a few affiliate links within 1,000s of articles on one of my wordpress sites - which all start with the same url format and sub-domain structure:
http://affiliateprogram.affiliates.com/
However, after the initial url format, the query string appended changes for each individual url in order to send visitors to specific pages on the destination site.
I am looking for something that will scan a string of html code (the article body) for all href links that include the specific domain above and then replace THE WHOLE LINK (whatever the query string appended) with another standard link of my choice.
href="http://affiliateprogram.affiliates.com/?random=query_string&page=destination"
gets replaced with
href="http://www.mylink.com"
I would ideally like to do this via php as I have a basic grasp, but if you have any other suggestions I would appreciate all input.
Thanks in advance.
<?php
$html = 'href="http://affiliateprogram.affiliates.com/?random=query_string&page=destination"';
echo preg_replace('#http://affiliateprogram.affiliates.com/([^"]+)#is', 'http://www.mylink.com', $html);
?>
http://ideone.com/qaEEM
Use a regular expression such as:
href="(https?:\/\/affiliateprogram.affiliates.com\/[^"]*)"
$data =<<<EOT
bar
foo
<a name="zz" href="http://affiliateprogram.affiliates.com/?query=random&page=destination&string">baz</a>
EOT;
echo (
preg_replace (
'#href="(https?://affiliateprogram.affiliates.com/[^"]*)"#i',
'href="http://www.mylink.com"',
$data
)
);
output
bar
foo
<a name="zz" href="http://www.mylink.com">baz</a>
$a = '<a class="***" href="http://affiliateprogram.affiliates.com/?random=query_string&page=destination" attr="***">';
$b = preg_replace("/<a([^>]*)href=\"http:\/\/affiliateprogram\.affiliates\.com\/[^\"]*\"([^>]*)>/", "<a\\1href=\"http://www.mylink.com/\"\\2>", $a);
var_dump($b); // <a class="***" href="http://www.mylink.com/" attr="***">
That's quite simple, as you only need a single placeholder for the querystring. .*? would normally do, but you can make it more specific by matching anything that's not a double quote:
$html =
preg_replace('~ href="http://affiliateprogram\.affiliates\.com/[^"]*"~i',
' href="http://www.mylink.com"', $html);
People will probably come around and recommend a long-winded DOMDocument approach, but that's likely overkill for such a task.
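A middle ground between a hard-coded pattern and a full DOM parser, if you'd rather not encode the scheme and host in the regex itself, is to capture every double-quoted href and compare its parsed host. The sketch below assumes that approach. The function name replace_links_to_host is invented for illustration, and it only looks at double-quoted href attributes.

```php
<?php
// Hypothetical helper: replace every href pointing at $host with $newUrl.
function replace_links_to_host(string $html, string $host, string $newUrl): string
{
    return preg_replace_callback(
        '~\bhref="([^"]*)"~i',
        function (array $m) use ($host, $newUrl): string {
            // Attribute values may contain &amp; so decode before parsing
            $linkHost = parse_url(html_entity_decode($m[1]), PHP_URL_HOST);
            if (is_string($linkHost) && strcasecmp($linkHost, $host) === 0) {
                return 'href="' . htmlspecialchars($newUrl) . '"';
            }

            return $m[0]; // leave other links untouched
        },
        $html
    );
}

$html = '<a name="zz" href="http://affiliateprogram.affiliates.com/?query=random&amp;page=destination">baz</a>';
echo replace_links_to_host($html, 'affiliateprogram.affiliates.com', 'http://www.mylink.com');
```

Because the host is compared as a whole string, this matches both http and https links and won't accidentally match a look-alike domain.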
I'm trying to find a way to extract a site title from a URL entered into a field in PHP. For example, if the user were to enter the URL http://www.nytimes.com/2009/11/05/sports/baseball/05series.html, I would want "New York Times" or "NY Times" or something along those lines.
I know it's fairly easy to extract the title of the WINDOW... for example, the URL I linked would have the title "Yankees 7, Phillies 3 - Back on Top....", but this is exactly what I don't want.
For clarification, this is for adding sources to a quote. I want to be able to add a source to quotes without a huge page URL and not just a link that says "Source".
Can anyone help me with this? Thanks in advance.
$source = parse_url('http://www.nytimes.com/....', PHP_URL_HOST); // www.nytimes.com
There is no such thing as a "site title". What you can get is:
the domain name (and then the owner name)
the page's title
I see you have the meta tag "cre" with the value "The New York Times" but you won't find it everywhere
One thing you can do: extract the domain name from the URL, then fetch the title of that site's home page.
"http://www.nytimes.com/" will give you "The New York Times - Breaking News, World News & Multimedia"
Build a list of URL prefixes to site names, and check for each prefix in turn from longest to shortest.
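A minimal sketch of that prefix list, assuming you maintain the mapping yourself. The entries in $siteNames and the function name site_name_for are made up for illustration. Prefixes are written as host plus path without "www." and end in a slash, so that "nytimes.com/" cannot match some other host that merely starts with the same letters.

```php
<?php
// Hypothetical helper: look up a site name by the longest matching URL prefix.
function site_name_for(string $url, array $siteNames): ?string
{
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    $path = (string) parse_url($url, PHP_URL_PATH);
    // Compare against host + path, ignoring the scheme and a leading "www."
    $key = preg_replace('/^www\./', '', $host) . ($path !== '' ? $path : '/');

    // Longest prefixes first, so the most specific entry wins
    uksort($siteNames, function (string $a, string $b): int {
        return strlen($b) <=> strlen($a);
    });

    foreach ($siteNames as $prefix => $name) {
        if (strncmp($key, $prefix, strlen($prefix)) === 0) {
            return $name;
        }
    }

    return null; // unknown site
}

// Made-up mapping for illustration
$siteNames = [
    'nytimes.com/'      => 'NY Times',
    'example.com/'      => 'Example',
    'example.com/blog/' => 'Example Blog',
];

echo site_name_for('http://www.nytimes.com/2009/11/05/sports/baseball/05series.html', $siteNames);
```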
You'd surely need a lookup table mapping domains (nytimes.com) to your titles "NY Times" in which case it would be easy to do.
If you want to have a method that will work on any link from any domain, then it is a bit harder as PHP in itself is not going to be able to work out what is a uniform title as it will vary from site to site.
You can explode the URL easily enough, but how then would you be able to dissect nytimes into "NY" and "TIMES".
You may be able to find a web service that allows you to feed in a domain and get back a site title, but I do not know of one.
You are best off simply quoting the domain, trimmed like "NYTIMES.COM" as the source, or "NYTIMES".
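For the "just quote the domain" option, something along these lines might do. The function name source_label and the "Source" fallback are only suggestions.

```php
<?php
// Hypothetical helper: turn a URL into a short source label such as "NYTIMES.COM".
function source_label(string $url): string
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return 'Source'; // fall back to a generic label
    }

    // Drop a leading "www." and upper-case the rest
    return strtoupper(preg_replace('/^www\./i', '', $host));
}

echo source_label('http://www.nytimes.com/2009/11/05/sports/baseball/05series.html'); // NYTIMES.COM
```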
You would want to use file_get_contents() then run a match to check the text between any <title></title> tags - that then would be your title that you display.
Using parse_url wouldn't return the actual page title.
Something like:
<?php
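$x = file_get_contents("http://google.com");
// "i" ignores the case of the tag, "s" lets the title span several lines
if ($x !== false && preg_match("/<title[^>]*>(.+?)<\/title>/is", $x, $match)) {
    echo trim($match[1]);
}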
?>
Use the Simple HTML DOM Parser. Here is an example:
require "simple_html_dom.php";
$url = "http://www.google.com";
$html = file_get_html( $url );
list( $title ) = $html->find( 'title' );
echo strip_tags( $title ); // Output: "Google"
I have a file that contains a bunch of links:
<a href="http://site1.com">site 1</a>
<a href="http://site2.com">site 2</a>
<a href="http://site3.com">site 3</a>
I want to get the URL to a link with specific text. For example, search for "site 2" and get back "http://site2.com"
I tried this:
preg_match("/.*?[Hh][Rr][Ee][Ff]=\"(.*?)\">site 2<\/[Aa]>.*/", $contents, $match)
(I know the HREF= will be the last part of the anchor)
But it returns
http://site1.com">site 1</a><a href="http://site2.com
Is there a way to do a search backwards, or something? I know I can do preg_match_all and loop over everything, but I'm trying to avoid that.
Try this:
preg_match("(<a.*?href=[\"']([^\"']+)[\"'][^>]?>site 2</a>)i",$contents,$match);
$result = $match[1];
Hope this helps!
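The reason the original pattern returned too much is that .*? is allowed to run across quotes and tag boundaries when the engine backtracks. A variant that only uses negated character classes ([^>] and [^"]) cannot leave the tag it started in. The sketch below assumes double-quoted href attributes, and href_for_link_text is a made-up name.

```php
<?php
// Hypothetical helper: return the href of the first <a> whose text is exactly $text.
function href_for_link_text(string $contents, string $text): ?string
{
    // [^>]* stays inside one tag and [^"]* stays inside one attribute value
    $pattern = '~<a\s[^>]*href="([^"]*)"[^>]*>\s*' . preg_quote($text, '~') . '\s*</a>~i';

    if (preg_match($pattern, $contents, $match)) {
        return $match[1];
    }

    return null; // no link with that text
}

$contents = '<a href="http://site1.com">site 1</a><a href="http://site2.com">site 2</a><a href="http://site3.com">site 3</a>';
echo href_for_link_text($contents, 'site 2'); // http://site2.com
```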
Or you can try using phpQuery.