I need to catch the content of href using regex. For example, when I apply the rule to
href="www.google.com", I'd like to get www.google.com. Also, I would like to ignore all hrefs which have only # in their value.
Now, I was playing around for some time, and I came up with this:
href=(?:\"|\')((?:[^#]|.#.|.#|#.)+)(?:\"|\')
When I try it out in http://www.rubular.com/ it works like a charm, but I need to use it with preg_replace_callback in PHP, and there I don't get the expected result (for testing it in PHP, I was using this site: http://www.pagecolumn.com/tool/pregtest.htm).
What's my mistake here?
Since parsing HTML using regular expressions is a Bad Thing™, I suggest a less crude method:
$dom = new DomDocument;
$dom->loadHTML($pageContent);
$elements = $dom->getElementsByTagName('a');
for ($n = 0; $n < $elements->length; $n++) {
    $item = $elements->item($n);
    $href = $item->getAttribute('href');
    // here's your href attribute
}
How about:
href\s*=\s*"([^#"]+#?[^"]*)"
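Since the question mentions preg_replace_callback, here is a minimal sketch of how that pattern might be wired into it; the input string and the idea of collecting the matches into an array are just illustrative assumptions:
$html    = '<a href="www.google.com">Google</a> <a href="#">skip me</a>'; // hypothetical input
$pattern = '/href\s*=\s*"([^#"]+#?[^"]*)"/';
$hrefs = array();
$html = preg_replace_callback($pattern, function ($m) use (&$hrefs) {
    $hrefs[] = $m[1];   // collect the captured href value
    return $m[0];       // leave the markup itself unchanged
}, $html);
print_r($hrefs);        // Array ( [0] => www.google.com ) -- the "#"-only href is skipped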
First and foremost: DON'T USE REGEX TO PARSE HTML
I would go with something like:
href=("|')?([^\s"'])+("|')?
I'm trying to load an HTML page by using a URL. This is what I'm doing now to find the count of images on a page:
$html = "http://stackoverflow.com/";
$doc = new DOMDocument();
@$doc->loadHTML($html);
$tags = $doc->getElementsByTagName('*');
$count = 0;
foreach ($tags as $tag) {
    if (strcmp($tag->tagName, "img") == 0) {
        $count++;
    }
}
echo $count;
I know this isn't an efficient way to do this; I just set it up as an example. Each time, the count is 0, but there are images on the page, which leads me to believe the page isn't loading correctly. What am I doing wrong? Thanks.
Tag names in HTML are canonically upper-case; however, you can avoid the issue by using strcasecmp instead of strcmp.
Or avoid both problems by doing it properly:
$count = $doc->getElementsByTagName('img')->length;
From the docs
DOMDocument::loadHTML — Load HTML from a string
Its signature is quite clear about this, too:
public bool DOMDocument::loadHTML ( string $source [, int $options = 0 ] )
You could try using DOMDocument::loadHTMLFile, or simply get the markup of the given url using file_get_contents or a cURL request (whichever works best for you).
And please don't use the error-suppression operator @ of death: if something emits a notice/warning/error, there's a problem. Don't ignore it, fix it!
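A rough sketch of that approach, fetching the markup first and surfacing parser warnings instead of suppressing them (the URL is just the one from the question):
$url  = 'http://stackoverflow.com/';
$html = file_get_contents($url);            // or DOMDocument::loadHTMLFile / a cURL request
if ($html === false) {
    die('Could not fetch the page');
}
libxml_use_internal_errors(true);           // collect libxml warnings instead of hiding them with @
$doc = new DOMDocument();
$doc->loadHTML($html);
foreach (libxml_get_errors() as $error) {
    // log or inspect $error->message here rather than ignoring it
}
libxml_clear_errors();
echo $doc->getElementsByTagName('img')->length; // number of <img> elements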
In PHP I'm using the Simple HTML DOM Parser class.
I have an HTML file which contains many different tags.
In this HTML there is an element like this:
<a name="10418"><b> Hospitalist (Family Practitioner)</b></a>
So I would like to find the 'a' element which has name="10418".
I've tried this with no luck, since I only want to get that inner string:
$html_object = str_get_html($url);
$html_object = $html_object->find('a');
foreach ($html_object as $o) {
    $a = $o->find("b");
    echo $a[0];
}
Try:
$anchor = $html_object->find('a[name=10418]', 0);
echo $anchor->plaintext;
Try another library called Tag Parse. It's simple and efficient.
$dom = new TagParse\TagDomRoot($html);
$a = $dom->find('a[name="10418"]');
I think it's fast and uses less memory than simple_html_dom.
I would like to use Simple HTML DOM to remove all images in an article so I can easily create a small snippet of text for a news ticker but I haven't figured out how to remove elements with it.
Basically I would do
Get content as HTML string
Remove all image tags from content
Limit content to x words
Output.
Any help?
There is no dedicated method for removing elements. You just find all the img elements and then do
$e->outertext = '';
When you only delete the outer text, you delete the HTML content itself, but if you perform another find() on the same elements they will still appear in the results. The reason is that the Simple HTML DOM object still keeps its internal structure for the element, just without its actual content. What you need to do in order to really delete the element is to reload the HTML string into the same variable. That way the object is recreated without the deleted content, and the Simple HTML DOM tree is rebuilt without it.
Here is an example function:
public function removeNode($selector)
{
    foreach ($this->find($selector) as $node)
    {
        $node->outertext = '';
    }
    $this->load($this->save());
}
Put this function inside the simple_html_dom class and you're good.
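Hypothetical usage once the method is in place, in line with the original goal of stripping images before building a snippet ($articleHtml is a placeholder for your article markup):
$html = str_get_html($articleHtml); // parse the article markup
$html->removeNode('img');           // blanks every <img> and rebuilds the internal tree
echo $html;                         // markup with the images gone, ready to be trimmed to x words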
I think you're having difficulties because you forgot to save (dump the internal DOM tree back into a string).
Try this:
$html = file_get_html("http://example.com");
foreach ($html->find('img') as $item) {
    $item->outertext = '';
}
$html->save();
echo $html;
I could not figure out where to put the function so I just put the following directly in my code:
$html->load($html->save());
It basically locks the changes made in the foreach loop back into the HTML, as described above.
The proposed solutions are quite expensive and practically unusable in a big loop or other kind of repetition.
I prefer to use "soft deletes":
foreach ($html->find('somecondition') as $item) {
    if (somecheck) $item->setAttribute('softDelete', true); // <= set a marker to check in further code
    $item->outertext = '';
    foreach ($foo as $bar) {
        if (!$bar->getAttribute('softDelete')) {
            // do something
        }
    }
}
This is working for me:
foreach ($html->find('element') as $element) {
    $element = NULL;
}
Adding new answer since removeNode is definitely a better way of removing it:
$html->removeNode('img');
This method probably was not available when the accepted answer was marked. You do not need to loop through the HTML to find each one; this will remove them all.
Use outerhtml instead of outertext
<div id='your_div'>the contents of your div</div>
$your_div->outertext = '';
echo $your_div; // echoes <div id='your_div'></div>
$your_div->outerhtml = '';
echo $your_div; // echoes nothing
Try this:
$dom = new Dom();
$dom->loadStr($text);
foreach ($dom->find('element') as $element) {
    $element->delete();
}
This works now:
$element->remove();
You can see the documentation for the method here.
Below I remove the header and all script nodes of the incoming URL by using the find() function in two different ways. Remove the 2nd parameter to return an array of all matching nodes, then just loop through the nodes.
$clean_html = file_get_html($url);
// Find and remove the 1st instance of the node.
$node = $clean_html->find('header', 0);
$node->remove();
// Find and remove all instances of the node.
$nodes = $clean_html->find('script');
foreach ($nodes as $node) {
    $node->remove();
}
I'm currently working on a script to archive an imageboard.
I'm kinda stuck on making links reference correctly, so I could use some help.
I receive this string:
<a href="10028949#p10028949">>>10028949</a><br><br>who that guy???
In said string, I need to alter this part:
<a href="10028949#p10028949"
to become this:
<a href="#p10028949"
using PHP.
This part may appear more than once in the string, or might not appear at all.
I'd really appreciate it if you had a code snippet I could use for this purpose.
Thanks in advance!
Kenny
Disclaimer: as will surely be said in the comments, using a DOM parser is better for parsing HTML.
That being said:
"/(<a[^>]*?href=")\d+(#[^"]+")/"
replaced by $1$2
So...
$myString = preg_replace("/(<a[^>]*?href=\")\d+(#[^\"]+\")/", "$1$2", $myString);
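Applied to the snippet from the question (the surrounding anchor markup is assumed from the before/after fragments above), that gives:
$myString = '<a href="10028949#p10028949">>>10028949</a><br><br>who that guy???';
$myString = preg_replace('/(<a[^>]*?href=")\d+(#[^"]+")/', '$1$2', $myString);
echo $myString; // <a href="#p10028949">>>10028949</a><br><br>who that guy???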
try this
>>10028949<br><br>who that guy???
Although your question is already answered, I invite you to see what would (approximately xD) be the correct approach: parsing it with DOM:
$string = '<a href="10028949#p10028949">>>10028949</a><br><br>who that guy???'; // the received string, with the anchor from the question
$dom = new DOMDocument();
$dom->loadHTML($string);
$links = $dom->getElementsByTagName('a'); // this stores all the links (actually a DOMNodeList object)
foreach ($links as $link) {
    $href = $link->getAttribute('href');    // getting the href
    $cut = strpos($href, '#');
    $new_href = substr($href, $cut);        // cutting the string at the #
    $link->setAttribute('href', $new_href); // setting the good href
}
$body = $dom->getElementsByTagName('body')->item(0); // selecting everything inside the body
$output = $dom->saveHTML($body); // passing it into a string
echo $output;
The advantages of doing it this way are:
More organized / Cleaner
Easier to read by others
You could, for example, have mixed links and only want to modify some of them. Using DOM you can select only certain classes.
You can change other attributes as well, or the selected tag's siblings, parents, children, etc...
Of course you could achieve the last 2 points with regex as well but it would be a complete mess...
I have a string of HTML that I would like to check to see if there are any links inside of it and, if so, extract them and put them in an array. I can do this in jQuery with the simplicity of its selectors but I cannot find the right methods to use in PHP.
For example, the string may look like this:
<h1>Doctors</h1>
<a title="C - G" href="linkl.html">C - G</a>
<a title="G - K" href="link2.html">G - K</a>
<a title="K - M" href="link3.html">K - M</a>
How (in PHP) can I turn it into an array that looks something like:
[1]=>"link1.html"
[2]=>"link2.html"
[3]=>"link3.html"
Thanks,
Ian
You can use PHP's DOMDocument class to parse XML and/or HTML. Something like the following should do the trick to get the href attributes from a string of HTML.
$html = '<h1>Doctors</h1>
<a title="C - G" href="linkl.html">C - G</a>
<a title="G - K" href="link2.html">G - K</a>
<a title="K - M" href="link3.html">K - M</a>';
$hrefs = array();
$dom = new DOMDocument();
$dom->loadHTML($html);
$tags = $dom->getElementsByTagName('a');
foreach ($tags as $tag) {
    $hrefs[] = $tag->getAttribute('href');
}
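Dumping the array then gives roughly the result the question asks for (note PHP indexes from 0, not 1):
print_r($hrefs);
// Array ( [0] => link1.html [1] => link2.html [2] => link3.html )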
Your question is difficult to understand, but I believe you want a PHP DOM parser. You can find Simple HTML DOM here: http://simplehtmldom.sourceforge.net/ and a small usage example is:
$array = array();
foreach ($html->find('a') as $a)
{
    $array[] = $a->href;
}
If you can use jQuery then you should be able to use this with no problem, as its selector system is the same as jQuery's as well as CSS's (jQuery's selectors derive from CSS).
One-line solution:
$href = (string)( new SimpleXMLElement($your_html_tag))['href'];
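For instance, with one of the anchor tags from the earlier question (the variable name is the placeholder used above):
$your_html_tag = '<a title="C - G" href="link1.html">C - G</a>';
$href = (string)(new SimpleXMLElement($your_html_tag))['href'];
echo $href; // link1.html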
If the format is always the same, you can probably sort it out with a combination of explode and strip_tags, something like:
$html = '<span class="field-content">whatever</span>';
$href = end(explode('"', strip_tags($html)));