I'm trying to grab all the links and their content from a text, but my problem is that the links might also have other attributes like class or id. What would be the pattern for this?
What I have tried so far is:
/<a href="(.*)">(.*)<\/a\>/
Thank You,
Radu
As the comment to your question states, avoid using regex for HTML. The correct way to do it is using DOMDocument
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$links = $xpath->query('//*/a');
foreach ($links as $link) {
/* do something with this */
$href = $link->getAttribute('href');
$text = $link->nodeValue;
}
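A self-contained sketch of the same idea, assuming the markup is already in a string. The sample HTML and the $links array are made up for illustration; the point is that attribute order and extra attributes such as class or id stop mattering once a parser does the work.

```php
<?php
// Hypothetical sample input: anchors with extra attributes in varying order.
$html = '<p><a class="nav" href="/home" id="h">Home</a> and '
      . '<a href="/about">About <b>us</b></a></p>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    // textContent flattens nested markup such as <b> into plain text
    $links[] = [$a->getAttribute('href'), $a->textContent];
}
print_r($links);
```

Each entry pairs the href with the link text, which is what the two capture groups in the regex were meant to return.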
Edit:
An even better answer on the subject
This should do it:
/<a .*?href="(.*?)"[^>]*>([^<]*)<\/a>/i
Read this and see if you still want to use it.
Related
I'm using preg_match_all to get the URLs from inside the anchor tags on a page. It works, but before adding them to the array I would like to wrap each one in single quotes (like this: 'url'):
preg_match_all('!<a href="(.*?)">!', $anchors, $urls);
Is there a way to do that? If yes, can you point me towards the right direction and the proper way that this could be done?
Thank you! :D
Instead of using a regex to parse html you could use DOMDocument and getElementsByTagName
$dom = new DOMDocument;
$dom->loadHTMLFile("yourfile.html");
$anchors= $dom->getElementsByTagName("a");
$hrefs = [];
foreach ($anchors as $anchor) {
if ($anchor->hasAttribute("href")) {
$hrefs[] = "'{$anchor->getAttribute('href')}'";
}
}
This is my Regex to fetch all tags with class:
preg_match_all('/<\s*\w*\s*class\s*=\s*"?\s*([\w\s%#\/\.;:_-]*)\s*"?.*?>/', $file, $matches);
It matches all tags with class like <a class="abc">
The problem is that if a tag contains an extra attribute before class, this regex fails to match it.
E.g.: <a id="fig_3_1" class="figure-contents">
I want <a class="figure-contents">, ignoring fig_3_1
Any idea to exclude it?
<\s*\w*.*?\s*class\s*=\s*"?\s*([\w\s%#\/\.;:_-]*)\s*"?.*?>
This will probably work, but you would be better off using simple_html_dom.
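If adding simple_html_dom is not an option, the same lookup can be sketched with PHP's built-in DOM extension instead. This is only an illustration: the sample markup and the $classes array are assumptions, and the XPath expression //*[@class] selects every element that carries a class attribute, wherever that attribute sits in the tag.

```php
<?php
// Hypothetical sample input; the first tag has an id before its class.
$html = '<div><a id="fig_3_1" class="figure-contents">x</a>'
      . '<p class="note">y</p><span>z</span></div>';

libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$classes = [];
foreach ($xpath->query('//*[@class]') as $element) {
    // Record the tag name together with its class value
    $classes[] = $element->nodeName . ': ' . $element->getAttribute('class');
}
print_r($classes);
```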
Take a look at this amazing SO post and reconsider.
You will most likely be better off using an HTML parser instead. You can do so using the DOM model.
A simple example of how it can be used is below.
$dom = new DOMDocument;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
$image->setAttribute('src', 'http://example.com/' .$image->getAttribute('src'));
}
$html = $dom->saveHTML();
I am using DomDocument to pull content from a specific div on a page.
I would then like to replace all instances of links with a path equal to http://example.com/test/ with http://example.com/test.php.
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$doc = new DomDocument('1.0', 'UTF-8');
$doc->loadHtml(file_get_contents($url));
$div = $doc->getElementById('upcoming_league_dates');
foreach ($div->getElementsByTagName('a') as $item) {
$item->setAttribute('href', 'http://example.com/test.php');
}
echo $doc->saveHTML($div);
As you can see in the example above, str_replace causes problems after I target the upcoming_league_dates div with getElementById. I understand this but unfortunately I don't know where to go from here!
I've tried several different ways including executing the str_replace above the getElementById function (I figured I could replace the strings first and then target the specific div), with no luck.
What am I missing here?
EDIT: UPDATED CODE TO SHOW WORKING SOLUTION
You can't just use str_replace on that node. You need to access it properly first. Through the DOMElement class you can use the ->setAttribute() method to make the replacement.
Example:
$url = "http://pugetsoundbasketball.com/stackoverflow_sample.php";
$dom = new DOMDocument('1.0', 'UTF-8');
$dom->loadHTMLFile($url);
$xpath = new DOMXpath($dom); // use xpath
$needle = 'http://example.com/test/';
$replacement = 'http://example.com/test.php';
// target the link
$links = $xpath->query("//div[@id='upcoming_league_dates']/a[contains(@href, '$needle')]");
foreach($links as $anchor) {
// replacement of those href values
$anchor->setAttribute('href', $replacement);
}
echo $dom->saveHTML();
Update: After your revision, your code now works anyway. This answer only addresses the search-and-replace logic (a la str_replace) from your earlier version of the question.
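For a true search-and-replace on the attribute value (keeping whatever follows the matched prefix), one possible sketch is to run str_replace on the current href and write the result back with setAttribute(). The sample markup and the $hrefs array are assumptions for illustration, and the query uses //a so that links nested deeper inside the div are also matched.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<div id="upcoming_league_dates">'
      . '<a href="http://example.com/test/?id=5">Game</a> '
      . '<a href="http://example.com/other/">Other</a>'
      . '</div>'
      . '<a href="http://example.com/test/">Outside the div</a>';

$needle = 'http://example.com/test/';
$replacement = 'http://example.com/test.php';

libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$query = "//div[@id='upcoming_league_dates']//a[contains(@href, '$needle')]";
foreach ($xpath->query($query) as $anchor) {
    // Replace only the matched part of the href, keeping the rest
    $old = $anchor->getAttribute('href');
    $anchor->setAttribute('href', str_replace($needle, $replacement, $old));
}

// Collect every href in document order to inspect the result
$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}
print_r($hrefs);
```

Only the first link inside the div is rewritten; the link outside the div keeps its original href.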
Okay, I am using (PHP) file_get_contents to read some websites. These sites have only one link to Facebook. After I get the entire page, I would like to find the complete Facebook URL.
So in some part there will be:
<a href="http://facebook.com/username" >
I wanna get http://facebook.com/username, I mean from the first (") to the last ("). Username is variable... could be username.somethingelse and I could have some attributes before or after "href".
Just in case I am not being very clear:
<a href="http://facebook.com/username" > //I want http://facebook.com/username
<a href="http://www.facebook.com/username" > //I want http://www.facebook.com/username
<a class="value" href="http://facebook.com/username. some" attr="value" > //I want http://facebook.com/username. some
Or any of the examples above could use single quotes:
<a href='http://facebook.com/username' > //I want http://facebook.com/username
Thanks to all
Don't use regex on HTML. It's a shotgun that'll blow off your leg at some point. Use DOM instead:
$dom = new DOMDocument;
$dom->loadHTML(...);
$xp = new DOMXPath($dom);
$a_tags = $xp->query("//a");
foreach($a_tags as $a) {
echo $a->getAttribute('href');
}
I would suggest using DOMDocument for this very purpose rather than using regex. Here is a quick code sample for your case:
$dom = new DOMDocument();
$dom->loadHTML($content);
// To hold all your links...
$links = array();
$hrefTags = $dom->getElementsByTagName("a");
foreach ($hrefTags as $hrefTag)
$links[] = $hrefTag->getAttribute("href");
print_r($links); // dump all links
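Both snippets above list every href on the page. To keep only the Facebook link, one option is to let XPath do the filtering. This is a sketch under assumptions: the sample markup and the $facebookUrl variable are illustrative, and matching on the substring 'facebook.com/' is a simplification that would also accept look-alike hosts.

```php
<?php
// Hypothetical sample input mixing quote styles and attribute order.
$html = '<a href="/contact">Contact</a>'
      . '<a class="value" href="http://facebook.com/username. some" attr="value">FB</a>'
      . "<a href='http://www.example.com/x'>Other</a>";

libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);
// Select only anchors whose href mentions facebook.com/
$nodes = $xp->query("//a[contains(@href, 'facebook.com/')]");

// The question says there is at most one such link, so take the first match
$facebookUrl = $nodes->length > 0 ? $nodes->item(0)->getAttribute('href') : null;
echo $facebookUrl, "\n";
```

Because the parser normalizes quoting, the same query works whether the attribute was written with single or double quotes.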
I want to get all links on a page with the class "page1" in PHP.
The equivalent code in jQuery:
$("a#page1").each(function()
{
});
Can I do that in PHP?
$pattern = '`.*?((http|ftp)://[\w#$&+,\/:;=?@%.-]+)[^\w#$&+,\/:;=?@%.-]*?`i';
preg_match_all($pattern,$page_g,$matches);
This code gets every href in $page_g, but it does not work for class="page1".
I only want the hrefs in $page_g that have class="page1".
Can you help me optimize the regular expression, or suggest another way?
For example:
$page_g="the <strong>office</strong> us s01 05 xvid mu asd";
i want return only /?s=cache:16001429:office+s01e02
tnx
You lack the expertise to use a regular expression for that, hence using DOMDocument is the advisable solution here. If you want a simpler API, use the jQuery lookalikes phpQuery or QueryPath:
$link = qp($html)->find("a#page1")->attr("href");
print $link;
Edit Edited since you clarified the question.
To get all <a> links with the class .page1:
// Load the HTML from a file
$your_HTML_string = file_get_contents("html_filename.html");
$doc = new DOMDocument();
$doc->loadHTML($your_HTML_string);
// Then select all <a> tags in the document
$a_links = $doc->getElementsByTagName("a");
foreach ($a_links as $link) {
// If they have more than one class,
// you'll need to use (strpos($link->getAttribute("class"), "page1") !== false)
// instead of == "page1"
if ($link->getAttribute("class") == "page1") {
// do something
}
}
Use DOMDocument to parse the HTML page. Here's a tutorial:
Tutorial
DOM is the better choice here: a regex is difficult to maintain if the underlying HTML changes, whereas DOM can cope with invalid HTML and gives you access to other HTML parsing tools.
So, assuming that you have a file that contains HTML and you are searching for classes, this could be the way to go:
$doc = new DOMDocument;
$doc->loadHTMLFile(PATH_TO_YOUR_FILE);
//we will use Xpath to find all a containing your class, as a tag can have more than one class and it's just easier to do it with Xpath.
$xpath = new DOMXpath($doc);
$list = $xpath->query("//a[contains(@class, 'page1')]");
foreach ($list as $a_tag) {
$href = $a_tag->getAttribute('href');
//do something
}
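One caveat with contains(@class, 'page1') is that it is a substring test, so it would also match a class such as page10. A common XPath 1.0 workaround, sketched here with made-up sample markup and an illustrative $hrefs array, pads the class list with spaces and looks for the padded token.

```php
<?php
// Hypothetical sample input: exact class, multiple classes, a near miss, no class.
$html = '<a class="page1" href="/a">A</a>'
      . '<a class="nav page1 big" href="/b">B</a>'
      . '<a class="page10" href="/c">C</a>'
      . '<a href="/d">D</a>';

libxml_use_internal_errors(true);   // keep parser warnings quiet
$doc = new DOMDocument;
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Surround the normalized class list with spaces, then match " page1 " as a whole token
$query = "//a[contains(concat(' ', normalize-space(@class), ' '), ' page1 ')]";

$hrefs = [];
foreach ($xpath->query($query) as $a_tag) {
    $hrefs[] = $a_tag->getAttribute('href');
}
print_r($hrefs);
```

The link with class page10 and the link without a class are both skipped.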