Extract entire url content using Regex - php

Okay, I am using (PHP) file_get_contents to read some websites. These sites have only one link to Facebook. After I get the entire page, I would like to find the complete URL for Facebook.
So in some part there will be:
<a href="http://facebook.com/username" >
I want to get http://facebook.com/username, that is, everything from the first (") to the last ("). The username is variable (it could be username.somethingelse), and there could be other attributes before or after "href".
Just in case I am not being very clear:
<a href="http://facebook.com/username" > //I want http://facebook.com/username
<a href="http://www.facebook.com/username" > //I want http://www.facebook.com/username
<a class="value" href="http://facebook.com/username. some" attr="value" > //I want http://facebook.com/username. some
Any of the examples above could also use single quotes:
<a href='http://facebook.com/username' > //I want http://facebook.com/username
Thanks to all

Don't use regex on HTML. It's a shotgun that'll blow off your leg at some point. Use DOM instead:
$dom = new DOMDocument;
$dom->loadHTML(...);
$xp = new DOMXPath($dom);
$a_tags = $xp->query("//a");
foreach($a_tags as $a) {
echo $a->getAttribute('href');
}

I would suggest using DOMDocument for this very purpose rather than using regex. Here is a quick code sample for your case:
$dom = new DOMDocument();
$dom->loadHTML($content);
// To hold all your links...
$links = array();
$hrefTags = $dom->getElementsByTagName("a");
foreach ($hrefTags as $hrefTag)
$links[] = $hrefTag->getAttribute("href");
print_r($links); // dump all links
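Building on the DOM approach above, here is a minimal sketch that keeps only the Facebook link instead of dumping every href. The helper name findFacebookLinks and the sample markup are made up for illustration, and it assumes the link's host is facebook.com or a subdomain of it:

```php
<?php
// Hypothetical helper: returns every href whose host is facebook.com
// (or a subdomain such as www.facebook.com).
function findFacebookLinks(string $html): array
{
    $dom = new DOMDocument();
    // Real-world pages are rarely valid HTML; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($dom->getElementsByTagName('a') as $a) {
        $href = $a->getAttribute('href');
        $host = parse_url($href, PHP_URL_HOST);
        if (is_string($host) && preg_match('/(^|\.)facebook\.com$/i', $host)) {
            $links[] = $href;
        }
    }
    return $links;
}

// Sample markup modelled on the question (single quotes, extra attributes).
$content = "<p><a class='value' href='http://www.facebook.com/username.some' attr='value'>fb</a>"
         . '<a href="http://example.com/">other</a></p>';
print_r(findFacebookLinks($content));
```

Because the attribute value is read through the parser, it does not matter whether the page uses single or double quotes, or where href sits among the other attributes.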

Related

How do I use str_replace with DomDocument

I am using DomDocument to pull content from a specific div on a page.
I would then like to replace all instances of links with a path equal to http://example.com/test/ with http://example.com/test.php.
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$doc = new DomDocument('1.0', 'UTF-8');
$doc->loadHtml(file_get_contents($url));
$div = $doc->getElementById('upcoming_league_dates');
foreach ($div->getElementsByTagName('a') as $item) {
$item->setAttribute('href', 'http://example.com/test.php');
}
echo $doc->saveHTML($div);
As you can see in the example above, str_replace causes problems after I target the upcoming_league_dates div with getElementById. I understand this but unfortunately I don't know where to go from here!
I've tried several different ways including executing the str_replace above the getElementById function (I figured I could replace the strings first and then target the specific div), with no luck.
What am I missing here?
EDIT: UPDATED CODE TO SHOW WORKING SOLUTION
You can't just use str_replace on that node. You need to access it properly first. Through the DOMElement class you can use the ->setAttribute() method to make the replacement.
Example:
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTMLFile($url);
$xpath = new DOMXpath($dom); // use xpath
$needle = 'http://example.com/test/';
$replacement = 'http://example.com/test.php';
// target the link
$links = $xpath->query("//div[@id='upcoming_league_dates']/a[contains(@href, '$needle')]");
foreach($links as $anchor) {
// replacement of those href values
$anchor->setAttribute('href', $replacement);
}
echo $dom->saveHTML();
Update: After your revision, your code is now working anyway. This is just to answer your logic replacement (ala str_replace search/replace) on your previous question.
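If you want something closer to a real str_replace, where only the matching part of each href changes and the rest of the URL is kept, a sketch along these lines may help. The sample markup is an assumption standing in for the fetched page, and the query uses //a so that links nested deeper inside the div are also matched:

```php
<?php
// Assumed sample markup; the real page is fetched from a URL in the question.
$html = '<div id="upcoming_league_dates">'
      . '<a href="http://example.com/test/?id=1">Fall league</a>'
      . '<a href="http://example.com/other/">Other</a>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$needle = 'http://example.com/test/';
$replacement = 'http://example.com/test.php';

// Descendant links of the div whose href contains the needle.
$query = "//div[@id='upcoming_league_dates']//a[contains(@href, '$needle')]";
foreach ($xpath->query($query) as $anchor) {
    // Search/replace inside the attribute value, keeping the rest of the URL.
    $href = str_replace($needle, $replacement, $anchor->getAttribute('href'));
    $anchor->setAttribute('href', $href);
}

// Print only the div, as in the question.
$div = $xpath->query("//div[@id='upcoming_league_dates']")->item(0);
echo $dom->saveHTML($div);
```

Interpolating $needle into the XPath string is fine for a fixed value like this one, but it would break if the needle contained a single quote.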

manipulate PHP domdocument string

I want to remove the element tag in my domdocument html.
I have something like
this is the <a href='#'>test link</a> here and <a href='#'>there</a>.
I want to change my html to
this is the test link here and there.
My code
$dom = new DomDocument();
$dom->loadHTML($html);
$atags=$dom->getElementsByTagName('a');
foreach($atags as $atag){
$value = $atag->nodeValue;
//I can get the test link and there value but I don't know how to remove the a tag.
}
Thanks for the help!
You are looking for a method called DOMNode::replaceChild().
To use it, create a DOMText from $value with DOMDocument::createTextNode(). Also note that getElementsByTagName() returns a live, self-updating list: once you replace the first element and move on to the second, there is no second any longer, because only one a element is left.
Instead, loop with a while on the first item:
$atags = $dom->getElementsByTagName('a');
while ($atag = $atags->item(0))
{
$node = $dom->createTextNode($atag->nodeValue);
$atag->parentNode->replaceChild($node, $atag);
}
Something along those lines should do it.
You could just use strip_tags - it should do what you've asked.
<?php
$string = "this is the <a href='#'>test link</a> here and <a href='#'>there</a>.";
echo strip_tags($string);
// output: this is the test link here and there.
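If you need to stay in the DOM (for example because other tags must be kept), another way around the self-updating list is to copy it into a plain array before changing anything. This is only a sketch; the wrapping <p> element is an assumption added so the result is easy to read back:

```php
<?php
$html = "<p>this is the <a href='#'>test link</a> here and <a href='#'>there</a>.</p>";

$dom = new DOMDocument();
$dom->loadHTML($html);

// Copy the live node list into a plain array first, so replacing nodes
// does not shift the items still waiting to be visited.
$anchors = iterator_to_array($dom->getElementsByTagName('a'));
foreach ($anchors as $anchor) {
    $text = $dom->createTextNode($anchor->textContent);
    $anchor->parentNode->replaceChild($text, $anchor);
}

$p = $dom->getElementsByTagName('p')->item(0);
echo $p->textContent;
```

The array is a snapshot, so a normal foreach visits every original a element exactly once.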

Regex to find target="_blank" links and add text before closing </a> tag

I need to be able to parse some text and find all the instances where an <a> tag has target="_blank", and for each match add (for example) "This link opens in a new window" before the closing </a> tag.
For example:
Before:
<a href="..." target="_blank">Go here now</a>
After:
<a href="..." target="_blank">Go here now<span>(This link opens in a new window)</span></a>
This is for a PHP site, so I assume preg_replace() will be the method. I just don't have the skills to write the regex properly.
Thanks in advance for any help anyone can offer.
You should never use a regex to parse HTML, except maybe in extremely well-defined and controlled circumstances.
Instead, try a built-in parser:
$dom = new DOMDocument();
$dom->loadHTML($your_html_source);
$xpath = new DOMXPath($dom);
$links = $xpath->query("//a[@target='_blank']");
foreach($links as $link) {
$link->appendChild($dom->createTextNode(" (This link opens in a new window)"));
}
$output = $dom->saveHTML();
Alternatively, if this is being output to the browser, you can just use CSS:
a[target='_blank']:after {
content: ' (This link opens in a new window)';
}
This will work for anchor tag replacement....
$string = str_replace('<a ','<a target="_blank" ',$string);
Well, @Kolink is right, but here's my RegExp version.
$string = '<p>mess</p><a href="..." target="_blank">Google</a><p>mess</p>';
echo preg_replace("/(<a.*?target=\"_blank\".*?>)(.*?)(<\/a>)/mi","$1$2(This link opens in a new window)$3",$string);
This does the job:
$newText = '<span>(This link opens in a new window)</span>';
$pattern = '~<a\s[^>]*?\btarget\s*=(?:\s*([\'"])_blank\1|_blank\b)[^>]*>[^<]*(?:<(?!/a>)[^<]*)*\K~i';
echo preg_replace($pattern, $newText, $html);
However, this direct string approach may also replace matches inside commented-out HTML, inside strings or comments in CSS or JavaScript code, and even inside JavaScript regex literals, which is at best unneeded and at worst unwanted. That's why you should use a DOM approach if you want to avoid these pitfalls. All you have to do is append a new node to each link that has the desired attribute:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xp = new DOMXPath($dom);
$nodeList = $xp->query('//a[@target="_blank"]');
foreach($nodeList as $node) {
$newNode = $dom->createElement('span', '(This link opens in a new window)');
$node->appendChild($newNode);
}
$html = $dom->saveHTML();
Finally, a last alternative is to leave the HTML unchanged and do it with CSS:
a[target="_blank"]::after {
content: " (This link opens in a new window)";
font-style: italic;
color: red;
}
You won't be able to write a regex that will evaluate an infinitely long string. I suggest:
$h = explode('>', $html);
This will give you the chance to traverse it like any other array and then do:
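foreach ($h as $i => $piece) {
// Skip pieces that do not contain an opening <a ...> tag with target="_blank".
if (!preg_match('/<a\s/i', $piece)) {
continue;
} elseif (!preg_match('/target="_blank"/', $piece)) {
continue;
} elseif (isset($h[$i + 1])) {
// The next piece holds the link text followed by "</a"; put the note before it.
$h[$i + 1] = str_replace('</a', '(open in new window)</a', $h[$i + 1]);
}
}
$html = implode('>', $h);
This is how I would approach such a problem. Of course, I just threw this out off the top of my head and it is not guaranteed to work as is, but with a few tweaks to fit your exact logic you will have what you need.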

Getting links from a context using preg_match_all

I'm trying to grab all the links and their content from a text, but my problem is that the links might also have other attributes like class or id. What would be the pattern for this?
What I have tried so far is:
/<a href="(.*)">(.*)<\/a\>/
Thank You,
Radu
As the comment to your question states, avoid using regex for HTML. The correct way to do it is using DOMDocument
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$links = $xpath->query('//*/a');
foreach ($links as $link) {
/* do something with this */
$href = $link->getAttribute('href');
$text = $link->nodeValue;
}
Edit:
An even better answer on the subject
This should do it:
/<a .*?href="(.*?)"[^>]*>([^<]*)<\/a>/i
Read this and see if you still want to use it.
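For comparison, here is a small self-contained sketch of the DOM route that returns each link's href together with its text, whatever other attributes the tag carries. The helper name extractLinks and the sample markup are assumptions made for the example:

```php
<?php
// Hypothetical helper: returns [href, text] pairs for every <a> that has an href.
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    // Keep parser warnings about sloppy markup out of the output.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $result = [];
    foreach ($xpath->query('//a[@href]') as $link) {
        $result[] = [$link->getAttribute('href'), trim($link->textContent)];
    }
    return $result;
}

// Links with class, id and title attributes, and one with nested markup.
$html = '<p><a class="nav" id="home" href="/index.php">Home</a> '
      . '<a href="/about.php" title="About us"><b>About</b></a></p>';
print_r(extractLinks($html));
```

Unlike the ([^<]*) group in the regex above, textContent also copes with links that contain nested tags such as <b>.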

Grabbing links using xpath in php

I am trying to grab links from the Google search page. I am using the XPath below to grab the links:
//div[@id='ires']/ol[@id='rso']/li/h3/a/@href
xPather evaluates it and gives the result, but when I use it in my PHP code it doesn't show any results. Can someone please tell me what I am doing wrong? There is nothing wrong with the cURL request.
below is my code
$dom = new DOMDocument();
@$dom->loadHTML($result);
$xpath=new DOMXPath($dom);
$elements = $xpath->evaluate("//div[@id='ires']/ol[@id='rso']/li/h3/a");
foreach ($elements as $element)
{
$link = $element->getElementsByTagName("href")->item(0)->nodeValue;
echo $link."<br>";
}
Sample Html provided by Robert Pitt
<li class="g w0">
<h3 class="r">
<em>LINK</em>
</h3>
<button class="ws" title=""></button>
<div class="s">
META
</div>
</li>
You can make life simpler by using the original XPath expression that you quoted:
//div[@id='ires']/ol[@id='rso']/li/h3/a/@href
Then, loop over the matching attributes like:
$hrefs = $xpath->evaluate(...);
foreach ($hrefs as $href) {
echo $href->value . "<br>";
}
Be sure to check whether any attributes were matched (var_dump($hrefs->length) would suffice).
There's no element called href; that's an attribute:
$link = $element->getElementsByTagName("href")->item(0)->nodeValue;
You can just use
$link = $element->getAttribute('href');
did you try
$element->getElementsByTagName("a")
instead of
$element->getElementsByTagName("href")
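To tie the answers together, here is a runnable sketch of the attribute query with a check for the "nothing matched" case. The markup is an assumption shaped to fit the XPath expression from the question; what Google actually serves will differ and changes over time:

```php
<?php
// Assumed markup shaped like the XPath in the question; real pages will differ.
$result = '<div id="ires"><ol id="rso">'
        . '<li class="g"><h3 class="r"><a href="http://example.com/one"><em>LINK</em></a></h3></li>'
        . '<li class="g"><h3 class="r"><a href="http://example.com/two">Second</a></h3></li>'
        . '</ol></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // quieter alternative to the @ operator
$dom->loadHTML($result);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$hrefs = $xpath->query("//div[@id='ires']/ol[@id='rso']/li/h3/a/@href");

if ($hrefs->length === 0) {
    echo "No links matched; the markup probably differs from the expression.\n";
}

$links = [];
foreach ($hrefs as $href) {
    $links[] = $href->value; // each match is a DOMAttr
}
echo implode("<br>", $links);
```

If the length check reports zero on the real page, dump the fetched HTML and compare its structure with the expression before suspecting the PHP code.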
