PHP String Manipulation: Extract hrefs - php

I have a string of HTML that I would like to check to see if there are any links inside of it and, if so, extract them and put them in an array. I can do this in jQuery with the simplicity of its selectors but I cannot find the right methods to use in PHP.
For example, the string may look like this:
<h1>Doctors</h1>
<a title="C - G" href="link1.html">C - G</a>
<a title="G - K" href="link2.html">G - K</a>
<a title="K - M" href="link3.html">K - M</a>
How (in PHP) can I turn it into an array that looks something like:
[1]=>"link1.html"
[2]=>"link2.html"
[3]=>"link3.html"
Thanks,
Ian

You can use PHP's DOMDocument class to parse XML and/or HTML. Something like the following should do the trick to get the href attributes from a string of HTML.
$html = '<h1>Doctors</h1>
<a title="C - G" href="link1.html">C - G</a>
<a title="G - K" href="link2.html">G - K</a>
<a title="K - M" href="link3.html">K - M</a>';
$hrefs = array();
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags = $dom->getElementsByTagName('a');
foreach ($tags as $tag) {
    $hrefs[] = $tag->getAttribute('href');
}
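A slightly more defensive variant of the snippet above, as a sketch only. loadHTML() emits warnings for markup it considers invalid, so this version has libxml collect those errors internally. The function name extract_hrefs is hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper: returns the href of every <a> element in an HTML fragment.
function extract_hrefs(string $html): array
{
    $hrefs = [];
    if (trim($html) === '') {
        return $hrefs; // loadHTML() rejects an empty string
    }

    $previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $tag) {
        if ($tag->hasAttribute('href')) {
            $hrefs[] = $tag->getAttribute('href');
        }
    }
    return $hrefs;
}

$html = '<h1>Doctors</h1>
<a title="C - G" href="link1.html">C - G</a>
<a title="G - K" href="link2.html">G - K</a>
<a title="K - M" href="link3.html">K - M</a>';

print_r(extract_hrefs($html));
```

Note that the resulting array is zero-indexed rather than starting at 1 as in the question.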

Your question is difficult to understand, but I believe you want a PHP DOM parser. You can find Simple HTML DOM Parser here: http://simplehtmldom.sourceforge.net/ and a small usage example is:
// $html is a Simple HTML DOM object, e.g. the result of str_get_html()
$array = array();
foreach($html->find('a') as $a)
{
    $array[] = $a->href;
}
If you can use jQuery then you should be able to use this with no problem, as its selector syntax is the same as jQuery's, which in turn derives from CSS.

One line solution
$href = (string)( new SimpleXMLElement($your_html_tag))['href'];
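This one-liner assumes that $your_html_tag holds exactly one well-formed tag, since SimpleXMLElement parses XML rather than HTML. A sketch of how that plays out on one of the links from the question (the variable names are illustrative):

```php
<?php
$tag = '<a title="C - G" href="link1.html">C - G</a>';

// Works because a single, properly closed <a> element is also valid XML.
$xml = new SimpleXMLElement($tag);
$href = (string) $xml['href'];

echo $href, "\n";
```

Passing the whole string from the question (an h1 followed by several a tags) would throw an exception instead, because XML requires a single root element.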

If the format is always the same, you can probably sort it out with a combination of explode and strip_tags, something like:
$html = '<span class="field-content">whatever</span>';
$parts = explode('"', strip_tags($html)); // end() needs a variable, not a function result
$href = end($parts);

Related

PHP: Simple HTML DOM Parser - how to get the element which has certain tag name?

In PHP I'm using the Simple HTML DOM Parser class.
I have an HTML file which has multiple different tags.
In this HTML there is an element like this:
<a name="10418"><b> Hospitalist (Family Practitioner)</b></a>
So I would like to find the 'a' element that has name="10418".
I've tried this with no luck, because I only want to get that string.
$html_object = str_get_html($url);
$html_object=$html_object->find('a');
foreach ($html_object as $o) {
$a= $o->find("b");
echo $a[0];
}
Try:
$anchor = $html_object->find('a[name=10418]', 0);
echo $anchor->plaintext;
Try another library called Tag Parse. It's simple and efficient.
$dom = new TagParse\TagDomRoot($html);
$a = $dom->find('a[name="10418"]');
I think it's faster and uses less memory than simple_html_dom.
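If adding a library is not an option, the same attribute lookup can be sketched with PHP's built-in DOMDocument and DOMXPath. The sample markup is the element from the question; the variable names are illustrative.

```php
<?php
$html = '<a name="10418"><b> Hospitalist (Family Practitioner)</b></a>';

$previous = libxml_use_internal_errors(true); // collect parser warnings internally
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
// XPath counterpart of the CSS-style selector a[name=10418]
$nodes = $xpath->query('//a[@name="10418"]');

// textContent includes the text of nested elements such as <b>
$text = $nodes->length > 0 ? trim($nodes->item(0)->textContent) : null;
echo $text, "\n";
```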

How do I cut this string in PHP?

I'm currently working on a script to archive an imageboard.
I'm kinda stuck on making links reference correctly, so I could use some help.
I receive this string:
<a href="10028949#p10028949">>>10028949</a><br><br>who that guy???
In said string, I need to alter this part:
<a href="10028949#p10028949"
to become this:
<a href="#p10028949"
using PHP.
This part may appear more than once in the string, or might not appear at all.
I'd really appreciate it if you had a code snippet I could use for this purpose.
Thanks in advance!
Kenny
Disclaimer: as it'll be said in the comments, using a DOM parser is better to parse HTML.
That being said:
"/(<a[^>]*?href=")\d+(#[^"]+")/"
replaced by $1$2
So...
$myString = preg_replace("/(<a[^>]*?href=\")\d+(#[^\"]+\")/", "$1$2", $myString);
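A quick way to check the replacement, as a sketch. The sample input below is an assumption about what the imageboard markup looks like; only the href value comes from the question.

```php
<?php
// Assumed sample input: one quote link followed by the post text.
$myString = '<a href="10028949#p10028949">&gt;&gt;10028949</a><br><br>who that guy???';

// $1 keeps everything up to href=", $2 keeps the #fragment and the closing quote;
// the digits between them are dropped.
$pattern = '/(<a[^>]*?href=")\d+(#[^"]+")/';
$result = preg_replace($pattern, '$1$2', $myString);

echo $result, "\n";
```

preg_replace() rewrites every occurrence it finds and returns the string unchanged when there is none, which covers the "more than once, or not at all" case.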
try this
>>10028949<br><br>who that guy???
Although you have the question already answered I invite you to see what would (approximately xD) be the correct approach, parsing it with DOM:
$string = '<a href="10028949#p10028949">>>10028949</a><br><br>who that guy???';
$dom = new DOMDocument();
$dom->loadHTML($string);
$links = $dom->getElementsByTagName('a'); // This stores all the links in an array (actually a nodeList Object)
foreach($links as $link){
    $href = $link->getAttribute('href'); //getting the href
    $cut = strpos($href, '#');
    $new_href = substr($href, $cut); //cutting the string by the #
    $link->setAttribute('href', $new_href); //setting the good href
}
$body = $dom->getElementsByTagName('body')->item(0); //selecting everything
$output = $dom->saveHTML($body); //passing it into a string
echo $output;
The advantages of doing it this way are:
More organized / Cleaner
Easier to read by others
You could for example, have mixed links, and you only want to modify some of them. Using Dom you can actually select certain classes only
You can change other attributes as well, or the selected tag's siblings, parents, children, etc...
Of course you could achieve the last 2 points with regex as well but it would be a complete mess...
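The point about modifying only certain links can be sketched with DOMXPath. The class name quotelink and the sample markup are hypothetical, chosen only to illustrate the selection.

```php
<?php
// Hypothetical input: only the link with class "quotelink" should be rewritten.
$string = '<a class="quotelink" href="10028949#p10028949">reply</a> '
        . '<a href="other.html#top">elsewhere</a>';

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($string);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
// Matches <a> elements whose class list contains the token "quotelink"
$query = '//a[contains(concat(" ", normalize-space(@class), " "), " quotelink ")]';

foreach ($xpath->query($query) as $link) {
    $href = $link->getAttribute('href');
    $cut = strpos($href, '#');
    if ($cut !== false) {
        // Keep only the part from "#" onwards
        $link->setAttribute('href', substr($href, $cut));
    }
}

// Collect the hrefs afterwards to show which links changed
$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $link) {
    $hrefs[] = $link->getAttribute('href');
}
print_r($hrefs);
```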

Replace an element with Dom Document PHP

I load an HTML page with PHP DOMDocument:
$doc = new DOMDocument();
@$doc->loadHTMLFile($url);
I search my page for all "a" elements, and if they meet my condition I need to replace, for example, <a href="...">My link is beautiful</a> with just My link is beautiful
Here is my loop:
$liens = $div->getElementsByTagName('a');
foreach($liens as $lien){
    if($lien->hasAttribute('href')){
        if (preg_match("/metz2/i", $lien->getAttribute('href'))) {
            //HERE I NEED TO REPLACE </a>
        }
        $cpt++;
    }
}
Do you have any ideas? Suggestions? Thanks :)
Every time I need to manage the DOM with PHP, I use a library called PHP Simple HTML DOM Parser.
It's very easy to use; something like this might work for you:
// Create DOM from URL or file
$html = file_get_html('http://www.page.com/');
// Find all links
foreach($html->find('a') as $element) {
    //Do your custom logic here if you need it, for example this extracts the inner contents of the a-tag, and puts it freely.
    $inner = $element->innertext;
    $element->outertext = $inner; // outertext is assigned as a property
}
//To echo modified html again:
echo $html;
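Could be done with preg_replace as well:
$sText = '<a href="#">Stackoverflow</a>'; // example input; the href is a placeholder
$sText = preg_replace( '/<a[^>]*>(.*?)<\/a>/', '$1', $sText ); // non-greedy, so each link is handled separately
echo $sText;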
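Since the question asks about DOMDocument specifically, here is a minimal sketch of the same replacement without an extra library. The sample markup and URLs are made up; the metz2 condition is the one from the question.

```php
<?php
// Made-up sample page: the first link matches the condition, the second does not.
$html = '<div><a href="http://example.com/metz2/page">My link is beautiful</a> and '
      . '<a href="http://example.com/other">this one stays</a></div>';

$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Copy the live node list into an array first: replacing nodes while
// iterating a live DOMNodeList can skip elements.
$liens = iterator_to_array($doc->getElementsByTagName('a'));

foreach ($liens as $lien) {
    if ($lien->hasAttribute('href') && preg_match('/metz2/i', $lien->getAttribute('href'))) {
        // Swap the whole <a> element for a text node holding its text content
        $text = $doc->createTextNode($lien->textContent);
        $lien->parentNode->replaceChild($text, $lien);
    }
}

$div = $doc->getElementsByTagName('div')->item(0);
$output = $doc->saveHTML($div);
echo $output, "\n";
```

Using textContent drops any markup nested inside the link; if that markup matters, the child nodes would have to be moved out one by one instead.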

How to extract only certain tags from HTML document using PHP?

I'm using a crawler to retrieve the HTML content of certain pages on the web. I currently have the entire HTML stored in a single PHP variable:
$string = "<PRE>".htmlspecialchars($crawler->results)."</PRE>\n";
What I want to do is select all "p" tags (for example) and store their contents in an array. What is the proper way to do that?
I've tried the following, by using xpath, but it doesn't show anything (most probably because the document itself isn't an XML, I just copy-pasted the example given in its documentation).
$xml = new SimpleXMLElement ($string);
$result=$xml->xpath('/p');
while(list( , $node)=each($result)){
echo '/p: ' , $node, "\n";
}
Hopefully someone with (a lot) more experience in PHP will be able to help me out :D
Try using DOMDocument along with DOMDocument::getElementsByTagName. The workflow should be quite simple. Something like:
$doc = new DOMDocument();
$doc->loadHTML($crawler->results); // pass the raw HTML; escaping it with htmlspecialchars() would hide the tags from the parser
$pNodes = $doc->getElementsByTagName('p');
This will return a DOMNodeList.
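To get from the DOMNodeList to the array the question asks for, something along these lines should work. This is a sketch: $rawHtml stands in for $crawler->results and is assumed to hold raw, unescaped HTML, because htmlspecialchars() is only needed when displaying the markup, not when parsing it.

```php
<?php
// Stand-in for $crawler->results (assumed to be raw, unescaped HTML).
$rawHtml = '<html><body><p>foo</p><div><p>foo 1</p></div><p>foo 2</p></body></html>';

$previous = libxml_use_internal_errors(true); // real-world pages often trigger parser warnings
$doc = new DOMDocument();
$doc->loadHTML($rawHtml);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$paragraphs = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    $paragraphs[] = $p->textContent; // text only; $doc->saveHTML($p) would keep the markup
}

print_r($paragraphs);
```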
I vote for using a regexp. For the p tag:
preg_match_all('/<p>(.*?)<\/p>/', '<p>foo</p><p>foo 1</p><p>foo 2</p>', $arr, PREG_PATTERN_ORDER);
// $arr[1] holds the text captured between each <p> and </p>
foreach($arr[1] as $value)
{
    echo $value."<br>";
}
Check out Simple HTML DOM. It will grab external pages and parse them fairly accurately.
http://simplehtmldom.sourceforge.net/
It can be used like this:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach($html->find('img') as $element)
    echo $element->src . '<br>';

Getting content of href value

I need to catch the content of href using regex. For example, when I apply the rule to
href="www.google.com", I'd like to get www.google.com. Also, I would like to ignore all hrefs which have only # in their value.
Now, I was playing around for some time, and I came up with this:
href=(?:\"|\')((?:[^#]|.#.|.#|#.)+)(?:\"|\')
When I try it out in http://www.rubular.com/ it works like a charm, but I need to use it with preg_replace_callback in PHP, and there I don't get the expected result (for testing it in PHP, I was using this site: http://www.pagecolumn.com/tool/pregtest.htm).
What's my mistake here?
Since parsing HTML using regular expressions is a Bad Thing™, I suggest a less crude method:
$dom = new DomDocument;
$dom->loadHTML($pageContent);
$elements = $dom->getElementsByTagName('a');
for ($n = 0; $n < $elements->length; $n++) {
    $item = $elements->item($n);
    $href = $item->getAttribute('href');
    // here's your href attribute
}
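The requirement to ignore hrefs that consist only of # then becomes a plain string comparison. A sketch, with made-up sample markup in $pageContent:

```php
<?php
// Made-up sample markup covering the three cases of interest.
$pageContent = '<a href="www.google.com">Google</a> '
             . '<a href="#">Top</a> '
             . '<a href="page.html#section">Section</a>';

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($pageContent);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$hrefs = [];
foreach ($dom->getElementsByTagName('a') as $item) {
    $href = $item->getAttribute('href');
    // Skip links without an href and links whose href is only "#"
    if ($href === '' || $href === '#') {
        continue;
    }
    $hrefs[] = $href;
}

print_r($hrefs);
```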
How about:
href\s*=\s*"([^#"]+#?[^"]*)"
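One difference between Rubular and PHP is that PHP's preg_* functions need the pattern wrapped in delimiters, and quotes inside the pattern have to survive PHP string quoting. Whether that was the problem in the question is a guess. A sketch of the pattern above applied in PHP, using made-up sample markup:

```php
<?php
// Made-up sample markup covering the three cases of interest.
$html = '<a href="www.google.com">Google</a> '
      . '<a href="#">Top</a> '
      . '<a href="page.html#section">Section</a>';

// "~" is the delimiter here, so neither "/" nor the double quotes need
// escaping inside a single-quoted PHP string.
$pattern = '~href\s*=\s*"([^#"]+#?[^"]*)"~';

preg_match_all($pattern, $html, $matches);
$hrefs = $matches[1]; // contents of the first capture group for every match

print_r($hrefs);
```

Because the value must start with a character other than #, this pattern skips href="#" but also any link that begins with a fragment, such as href="#top".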
First and foremost: DON'T USE REGEX TO PARSE HTML
I would go with something like:
href=("|')?([^\s"']+)("|')?
