Say I have the following string:
<a name="anchor" title="anchor title">
Currently I can extract name and title with strpos and substr, but I want to do it right. How can I do this with regex? And what if I wanted to extract from many of these tags within a block of text?
I've tried this regex:
/name="([A-Z,a-z])\w+/g
But it gets the name=" part as well, I just want the value.
The regex (\S+)=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']? can be used to extract all attributes
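To return only the values and not the name=" prefix, wrap the value in a capture group and read the groups from the match array. The following is a minimal sketch, not a general solution: the sample text is made up, and the pattern assumes that name comes before title and that both are double-quoted.

```php
<?php
// Hypothetical input: several anchor tags inside a block of text.
$text = 'Intro <a name="anchor" title="anchor title"> more '
      . '<a name="second" title="second title"> end';

// Group 1 captures the name value, group 2 the title value.
// Assumption: name precedes title and both use double quotes.
$pattern = '/<a\s+name="([^"]*)"\s+title="([^"]*)"/';

// PREG_SET_ORDER gives one array per matched tag.
preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

$pairs = [];
foreach ($matches as $m) {
    $pairs[] = ['name' => $m[1], 'title' => $m[2]];
}

print_r($pairs);
```

Index 0 of each match holds the whole matched text, which is why the values are read from indexes 1 and 2.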
DOMDocument example:
<?php
$titles = array();
$doc = new DOMDocument();
$doc->loadHTML('<html><body>Test<br><a name="anchor" title="anchor title"></body></html>');
$links = $doc->getElementsByTagName('a');
if ($links->length!=0) {
foreach ($links as $a) {
$titles[] = $a->getAttribute('title');
}
}
?>
You commented: "I'm actually parsing the data before the page is rendered so DOM is not possible, right?"
We're working with the scraped HTML, so we construct a DOM with these functions and parse like XML.
Good examples in the comments here: http://php.net/manual/en/domdocument.getelementsbytagname.php
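Since the DOM is built from a string, nothing has to be rendered first. As a rough sketch of that idea, the snippet below parses a made-up HTML string and collects every attribute of every <a> tag; the sample markup and the $found variable are illustrative assumptions.

```php
<?php
// Hypothetical scraped HTML held in a plain string.
$html = '<p>Test<br><a name="anchor" title="anchor title">x</a> '
      . '<a name="other" title="other title">y</a></p>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$found = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $attrs = [];
    // $a->attributes is a DOMNamedNodeMap of the element's attributes.
    foreach ($a->attributes as $attr) {
        $attrs[$attr->nodeName] = $attr->nodeValue;
    }
    $found[] = $attrs;
}

print_r($found);
```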
Related
This is my Regex to fetch all tags with class:
preg_match_all('/<\s*\w*\s*class\s*=\s*"?\s*([\w\s%#\/\.;:_-]*)\s*"?.*?>/',$file,$matches);
It matches all tags with class like <a class="abc">
The problem is that if a tag has an extra attribute before class, this regex fails to match it.
E.g.: <a id="fig_3_1" class="figure-contents">
I want to match <a class="figure-contents"> and ignore fig_3_1.
Any idea how to skip it?
<\s*\w*.*?\s*class\s*=\s*"?\s*([\w\s%#\/\.;:_-]*)\s*"?.*?>
This will probably work,
but you would be better off using simple_html_dom.
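If a parser is an option, the order of the attributes stops mattering. This sketch uses PHP's built-in DOMXPath rather than simple_html_dom (a substitution on my part) to select every element that carries a class attribute; the sample markup is invented.

```php
<?php
// Hypothetical markup: class appears after another attribute in the first tag.
$html = '<div><a id="fig_3_1" class="figure-contents">A</a>'
      . '<a class="abc">B</a><span id="no-class">C</span></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

$classes = [];
// //*[@class] selects any element that has a class attribute,
// regardless of where it appears among the other attributes.
foreach ($xpath->query('//*[@class]') as $el) {
    $classes[] = $el->nodeName . ': ' . $el->getAttribute('class');
}

print_r($classes);
```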
Take a look at this amazing SO post and reconsider.
You will most likely be better off using an HTML parser instead. You can do so using the DOM model.
A simple sample of how it can be used is below.
$dom = new DOMDocument;
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
$image->setAttribute('src', 'http://example.com/' .$image->getAttribute('src'));
}
$html = $dom->saveHTML();
I need to remove some tags (e.g. <div></div>) in HTML document and keep inner tags and text.
I managed to do that with Simple HTML Dom Parser. But it can't process big files due to huge memory requirements.
I would prefer to use native PHP tools like DOMDocument cause I read that it's more optimized and quicker in processing HTML documents.
But I struggle at the first stage - how to remove some tags while keeping inner text and tags.
Source HTML sample is:
<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>
I try this code:
$htmltext='<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
$oldnodes = $doc->getElementsByTagName('div');
foreach ($oldnodes as $node) {
$fragment = $doc->createDocumentFragment();
while($node->childNodes->length > 0) {
$fragment->appendChild($node->childNodes->item(0));
}
$node->parentNode->replaceChild($fragment, $node);
}
echo $doc->saveHTML();
It produces the output:
<html><body>00000aaaaa<div>bbbbbbccc<a>link</a>cccdddddd</div>eeeee<div>1111</div></body></html>
I need the following:
<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>
Could someone please help me with proper code for the task?
You can use strip_tags function in PHP.
$htmltext = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
echo strip_tags($htmltext, '<html><body><a>');
This removes all tags except html, body and a.
And output is:
<html><body>00000aaaaabbbbbbccc<a>link</a>cccddddddeeeee1111</body></html>
EDIT:
If it is input from user, it's better for security reason to use whitelist tags and not blacklist.
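One caveat worth checking before relying on the whitelist for user input: strip_tags filters by tag name only. The sketch below, with a made-up input string, shows that attributes on an allowed tag are left in place, so an allowed <a> can still carry an onclick handler.

```php
<?php
// Hypothetical user input with an event handler on an allowed tag.
$input = '<a href="#" onclick="evil()">x</a><script>bad()</script>';

// Only <a> is whitelisted; the script tags are removed.
$out = strip_tags($input, '<a>');

// The onclick attribute on the allowed <a> tag is still present.
$stillHasHandler = strpos($out, 'onclick') !== false;

echo $out, "\n";
var_dump($stillHasHandler);
```

For untrusted input, a dedicated sanitizer or a DOM pass that removes unwanted attributes is probably the safer route.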
If your code only contains simple HTML tags without any attributes you can keep it simple like:
$value = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
$pattern = '/<[\/]*(div|h1)>/';
$removedTags = preg_replace($pattern, '', $value);
Since you wrote in your comment that there are more than just div tags you want to remove, I added a h1 tag to the pattern in case you also want to remove h1 tags.
This code snippet is only for simple code, but fits to your HTML input and output example.
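If the tags do carry attributes, the pattern can be loosened a little. This is only a sketch under the assumption that no attribute value contains a ">" character; the sample string and the adjusted pattern are mine, not part of the original snippet.

```php
<?php
// Hypothetical input where the tags to remove have attributes.
$value = '<div class="box" id="a">text<h1 style="x">Head</h1>'
       . '<a href="#">link</a></div>';

// \b stops "div" from matching longer tag names;
// [^>]* skips over any attributes up to the closing ">".
$pattern = '/<\/?(?:div|h1)\b[^>]*>/i';

$removedTags = preg_replace($pattern, '', $value);

echo $removedTags, "\n";
```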
Try this..
Just replace the for loop with the below code.
foreach ($oldnodes as $node) {
$children = $node->childNodes;
$string = "";
foreach($children as $child) {
$childString = $doc->saveXML($child);
$string = $string."".$childString;
}
$fragment = $doc->createDocumentFragment();
$fragment->appendXML($string);
$node->parentNode->insertBefore($fragment,$node);
$node->parentNode->removeChild($node);
}
I found a way to make it work.
The code in the question fails because the node list is live: manipulating its nodes changes the list while it is being iterated. As a result, foreach visits only 2 of the 4 items in the list, and the other 2 are skipped.
So I process only the first element of the list, then rebuild the list, and repeat until no items are left.
The code is:
$htmltext='<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';
echo "<!--
".$htmltext."
-->
";
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
$oldnodes = $doc->getElementsByTagName('div');
while ($oldnodes->length>0){
$node=$oldnodes->item(0);
$fragment = $doc->createDocumentFragment();
while($node->childNodes->length > 0) {
$fragment->appendChild($node->childNodes->item(0));
}
$node->parentNode->replaceChild($fragment, $node);
$oldnodes = $doc->getElementsByTagName('div');
}
echo $doc->saveHTML();
I hope that will be helpful for someone who finds same difficulties.
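An alternative sketch of the same fix: copy the live node list into a plain array first, so that modifying the document cannot disturb the iteration. The variable names $divs and $body are my own; the input is the sample from the question.

```php
<?php
$htmltext = '<html><body><div>00000</div>aaaaa<div>bbbbbb<div>ccc<a>link</a>ccc</div>dddddd</div>eeeee<div>1111</div></body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($htmltext);
libxml_clear_errors();

// Snapshot the live DOMNodeList into a static array.
$divs = iterator_to_array($doc->getElementsByTagName('div'));

foreach ($divs as $div) {
    // Move each child out of the div, placing it just before the div.
    while ($div->firstChild !== null) {
        $div->parentNode->insertBefore($div->firstChild, $div);
    }
    // The div is now empty and can be removed.
    $div->parentNode->removeChild($div);
}

$body = $doc->getElementsByTagName('body')->item(0);
echo $doc->saveHTML($body);
```

Nested divs are handled because the inner div is moved along with the other children and is still in the snapshot when its turn comes.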
I'm trying to grab all the links and their content from a text, but my problem is that the links might also have other attributes like class or id. What would be the pattern for this?
What i tried so far is:
/<a href="(.*)">(.*)<\/a\>/
Thank You,
Radu
As the comment to your question states, avoid using regex for HTML. The correct way to do it is using DOMDocument
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$links = $xpath->query('//*/a');
foreach ($links as $link) {
/* do something with this */
$href = $link->getAttribute('href');
$text = $link->nodeValue;
}
Edit:
An even better answer on the subject
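To illustrate why extra attributes such as class or id are not a problem for the DOM approach, here is a small self-contained sketch. The sample HTML and the $links array are assumptions made for the example.

```php
<?php
// Hypothetical text: links with extra attributes and nested markup.
$html = '<p>See <a class="ext" href="http://example.com/a" id="l1">first</a> '
      . 'and <a href="/b">second <b>link</b></a>.</p>';

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$links = [];
foreach ($dom->getElementsByTagName('a') as $a) {
    // getAttribute reads href wherever it sits in the tag;
    // textContent flattens nested tags into plain text.
    $links[] = [$a->getAttribute('href'), $a->textContent];
}

print_r($links);
```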
This should do it:
/<a .*?href="(.*?)"[^>]*>([^<]*)<\/a>/i
Read this and see if you still want to use it.
I want to get all links on a page with the class "page1" in PHP.
The equivalent code in jQuery:
$("a#page1").each(function()
{
});
Can I do that in PHP?
$pattern = '`.*?((http|ftp)://[\w#$&+,\/:;=?#%.-]+)[^\w#$&+,\/:;=?#%.-]*?`i';
preg_match_all($pattern,$page_g,$matches);
This code gets every href in $page_g, but it does not filter by class="page1".
I want only the hrefs in $page_g that have class="page1".
Can you help me improve the regular expression, or suggest another way?
for example
$page_g="the <strong>office</strong> us s01 05 xvid mu asd";
i want return only /?s=cache:16001429:office+s01e02
tnx
You lack the expertise to use a regular expression for that, which is why DOMDocument is the advisable solution here. If you want a simpler API, use the jQuery lookalikes phpQuery or QueryPath:
$link = qp($html)->find("a#page1")->attr("href");
print $link;
Edit Edited since you clarified the question.
To get all <a> links with the class .page1:
// Load the HTML from a file
$your_HTML_string = file_get_contents("html_filename.html");
$doc = new DOMDocument();
$doc->loadHTML($your_HTML_string);
// Then select all <a> tags and filter them by class
$a_links = $doc->getElementsByTagName("a");
foreach ($a_links as $link) {
// If they have more than one class,
// you'll need to use (strpos($link->getAttribute("class"), "page1") !== false)
// instead of == "page1"
if ($link->getAttribute("class") == "page1") {
// do something
}
}
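A substring check also matches class names such as page10. As a hedged sketch, the helper below (hasClass is a hypothetical name, not a built-in) splits the class attribute on whitespace and compares whole tokens; the sample markup is invented.

```php
<?php
// Hypothetical helper: true if $class is one of the element's class tokens.
function hasClass(DOMElement $el, string $class): bool
{
    $tokens = preg_split('/\s+/', trim($el->getAttribute('class')), -1, PREG_SPLIT_NO_EMPTY);
    return in_array($class, $tokens, true);
}

$html = '<div><a class="page1" href="/one">1</a>'
      . '<a class="big page1" href="/two">2</a>'
      . '<a class="page10" href="/three">3</a></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$hrefs = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    if (hasClass($link, 'page1')) {
        $hrefs[] = $link->getAttribute('href');
    }
}

print_r($hrefs);
```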
Use DomDocument to parse HTML page, here's a tutorial:
Tutorial
DOM is preferred here: a regex is difficult to maintain if the underlying HTML changes, whereas DOM can deal with invalid HTML and gives you access to other HTML parsing tools.
So, assuming you have a file that contains HTML and you are searching for classes, this could be the way to go:
$doc = new DOMDocument;
$doc->loadHTMLFile(PATH_TO_YOUR_FILE);
// We use XPath to find all <a> tags containing your class, since a tag can have more than one class and XPath makes that easier.
$xpath = new DOMXPath($doc);
$list = $xpath->query("//a[contains(@class, 'page1')]");
foreach ($list as $a_tag) {
$href = $a_tag->getAttribute('href');
//do something
}
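Note that contains(@class, 'page1') is a substring test, so it would also select class="page10". A common XPath 1.0 idiom pads the attribute with spaces and looks for the padded token. The sketch below applies it to a made-up HTML string instead of a file.

```php
<?php
// Hypothetical markup with one look-alike class name (page10).
$html = '<div><a class="page1" href="/one">1</a>'
      . '<a class="big page1 x" href="/two">2</a>'
      . '<a class="page10" href="/three">3</a></div>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Pad the normalized class list with spaces, then search for " page1 ".
$query = "//a[contains(concat(' ', normalize-space(@class), ' '), ' page1 ')]";

$hrefs = [];
foreach ($xpath->query($query) as $a_tag) {
    $hrefs[] = $a_tag->getAttribute('href');
}

print_r($hrefs);
```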
How can I replace <p><span class="headline"> with <p class="headline"><span>
in the easiest way with PHP?
$data = file_get_contents("http://www.ihr-apotheker.de/cs1.html");
$clean1 = strstr($data, '<p>');
$str = preg_replace('#(<a.*>).*?(</a>)#', '$1$2', $clean1);
$ausgabe = strip_tags($str, '<p>');
echo $ausgabe;
Before I alter the HTML from the site, I want to move the class declaration from the span to the <p> tag.
Don't parse HTML with regex!
This class should provide what you need:
http://simplehtmldom.sourceforge.net/
The reason not to parse HTML with regex is if you can't guarantee the format. If you already know the format of the string, you don't have to worry about having a complete parser.
In your case, if you know that's the format, you can use str_replace
str_replace('<p><span class="headline">', '<p class="headline"><span>', $data);
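As a quick check of that call, here is a sketch on a made-up $data string. The optional fourth argument of str_replace reports how many replacements were made, which helps confirm that the format really was what you expected.

```php
<?php
// Hypothetical page fragment in the known, fixed format.
$data = '<p><span class="headline">Title</span></p><p><span>Body</span></p>';

// Argument order: search, replacement, subject, then the by-reference count.
$result = str_replace(
    '<p><span class="headline">',
    '<p class="headline"><span>',
    $data,
    $count
);

echo $result, "\n";
echo "Replacements made: $count\n";
```

A count of zero would suggest the markup differs from the assumed format, for example in whitespace or quoting.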
Well, answer was accepted already, but anyway, here is how to do it with native DOM:
$dom = new DOMDocument;
$dom->loadHTMLFile("http://www.ihr-apotheker.de/cs1.html");
$xPath = new DOMXpath($dom);
// remove links but keep link text
foreach($xPath->query('//a') as $link) {
$link->parentNode->replaceChild(
$dom->createTextNode($link->nodeValue), $link);
}
// switch classes
foreach($xPath->query('//p/span[@class="headline"]') as $node) {
$node->removeAttribute('class');
$node->parentNode->setAttribute('class', 'headline');
}
echo $dom->saveHTML();
On a side note, HTML has elements for headings, so why not use an <h*> element instead of the semantically superfluous "headline" class?
Have you tried using str_replace?
If the placement of the <p> and <span> tags is consistent, you can simply replace one with the other:
str_replace("part to replace", "replacement", $string);