I'm looking for a regular expression in PHP that matches an anchor with a specific target="_parent" attribute on it. I would like to get anchors with text like:
preg_match_all('Text here', subject, matches, PREG_SET_ORDER);
HTML:
<a href="http://" target="_parent">
<FONT style="font-size:10pt" color=#000000 face="Tahoma">
<DIV><B>Text</B> - Text </DIV>
</FONT>
</a>
</DIV>
To be honest, the best way would be not to use a regular expression at all. A regex will miss all kinds of links, especially if you can't be sure that the links are always generated the same way.
A more robust way is to use a DOM/XML parser.
<?php
$html = '<a href="http://" target="_parent">Text here</a>';
function extractTags($html) {
    $dom = new DOMDocument;
    // collect libxml errors internally, because DOM will complain about badly formatted HTML
    libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    $sxe = simplexml_import_dom($dom);
    // select every anchor whose target attribute is "_parent"
    $nodes = $sxe->xpath("//a[@target='_parent']");
    $anchors = array();
    foreach ($nodes as $node) {
        // key: trimmed link text, value: href attribute
        $anchor = trim((string)dom_import_simplexml($node)->textContent);
        $attribs = $node->attributes();
        $anchors[$anchor] = (string)$attribs->href;
    }
    return $anchors;
}
print_r(extractTags($html));
This will output:
Array (
[Text here] => http://
)
Even using it on your example:
$html = '<a href="http://" target="_parent">
<FONT style="font-size:10pt" color=#000000 face="Tahoma">
<DIV><B>Text</B> - Text </DIV>
</FONT>
</a>
</DIV>
';
print_r(extractTags($html));
will output:
Array (
[Text - Text] => http://
)
If you feel that the HTML is still not clean enough to be used with DOMDocument, then I would recommend using a project such as HTMLPurifier (see http://htmlpurifier.org/) to first clean the HTML up completely (and remove unneeded HTML) and use the output from that to load into DOMDocument.
You should use the DOMDocument class instead of a regex. You would get a lot of false positives if you handle HTML with a regex.
<?php
$html = '<a href="http://" target="_parent">Text here</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $tag) {
if ($tag->getAttribute('target') === '_parent') {
echo $tag->nodeValue;
}
}
OUTPUT :
Text here
Related
I have a part of HTML string like below which I get from web page scraping.
$scraping_html = "<html><body>
....
<div id='ctl00_ContentPlaceHolder1_lblHdr'>some text here with &. some text here.</div>
....</body></html>";
I want to count the occurrences of & inside that particular div using PHP. Is this possible with any of the PHP preg functions? Thanks in advance.
The hard part is getting the text nodes (I assume that's where you're stuck). Depending on how reliable it needs to be you have two alternatives (just sample code, not actually tested):
Good old strip_tags():
$plain_text = strip_tags($scraping_html);
Proper DOM parser:
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($scraping_html);
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);
$plain_text = '';
foreach ($xpath->query('//text()') as $textNode) {
$plain_text .= $textNode->nodeValue;
}
To count, you have e.g. substr_count().
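To limit the count to the one div from the question, the XPath query can be narrowed before counting. This is only a minimal sketch: the sample markup is made up (with the ampersands written as &amp; so the input is well-formed), and $count is just an illustrative variable name.

```php
<?php
// Hypothetical sample input; only the second div should be counted.
$scraping_html = "<html><body>
<div id='other'>ignored &amp; skipped</div>
<div id='ctl00_ContentPlaceHolder1_lblHdr'>some text here with &amp;. some text &amp; here.</div>
</body></html>";

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($scraping_html);
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);
$plain_text = '';
// Collect only the text nodes that live inside the target div.
foreach ($xpath->query("//div[@id='ctl00_ContentPlaceHolder1_lblHdr']//text()") as $textNode) {
    // nodeValue holds the decoded text, so &amp; arrives here as a plain &.
    $plain_text .= $textNode->nodeValue;
}

// substr_count() returns the number of occurrences of the needle.
$count = substr_count($plain_text, '&');
echo $count, "\n";
```

Because the parser decodes entities, the count is taken on the real characters rather than on the raw markup.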
To get the number of & in the given example, use DOMDocument:
$html = <<<EOD
<html><body>
<div id='ctl00_ContentPlaceHolder1_lblHdr'>some text here with &. some text here.</div>
</body></html>
EOD;
$dom = new DOMDocument;
$dom->loadHTML($html);
$ele = $dom->getElementById('ctl00_ContentPlaceHolder1_lblHdr');
echo substr_count($ele->nodeValue, '&');
I want to get all the href links in the html. I came across two possible ways. One is the regex:
$input = urldecode(base64_decode($html_file));
$regexp = "href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)\s*";
if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
foreach($matches as $match) {
echo $match[2] ;//= link address
echo $match[3]."<br>" ;//= link text
}
}
And the other one is creating DOM document and parsing it:
$html = urldecode(base64_decode($html_file));
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
I don't know which of these is more efficient, and the code will be run many times, so I would like to know which one is the better choice. Thank you!
I am trying to get the value between HTML tags:
preg_match('/<span class=\"value\">(.*)<\/span>/i', $file_string, $title);
HTML:
<p class="upc">
<label>UPC/EAN/ISBN:</label>
<span class="value">746775319571</span>
</p>
You do not parse HTML with regular expressions, but use php DOM extension instead:
$html = '<p class="upc">
<label>UPC/EAN/ISBN:</label>
<span class="value">746775319571</span>
</p>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$spans = $dom->getElementsByTagName('span');
if ($spans->length > 0) {
echo $spans->item(0)->nodeValue; // outputs 746775319571
}
Online demo: http://ideone.com/9W8gsv
If having a particular class value is a required constraint, you can either perform the check manually by iterating over $spans and checking the class attribute (using DOMElement::getAttributeNode), or use DOMXPath instead.
Either way, I'm leaving it as homework, because we all know how satisfying it is to solve issues yourself!
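For anyone who wants to check their homework, here is a minimal sketch of the DOMXPath variant. It assumes the class attribute is exactly "value" (no additional class names), and $value is just an illustrative variable name.

```php
<?php
$html = '<p class="upc">
<label>UPC/EAN/ISBN:</label>
<span class="value">746775319571</span>
</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
// Select only <span> elements whose class attribute equals "value".
$matches = $xpath->query('//span[@class="value"]');

// Take the first match, or null when nothing was found.
$value = $matches->length > 0 ? trim($matches->item(0)->nodeValue) : null;
echo $value, "\n";
```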
Is this even possible...
Say I have some text with a link with a class of 'click':
<p>I am some text, i am some text, i am some text, i am some text
<a class="click" href="http://www.google.com">I am a link</a>
i am some text, i am some text, i am some text, i am some text</p>
Using PHP, get the link with class name, 'click', then get the href value?
There are a few ways to do this, the quickest is to use XPath:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodeList = $xpath->query('//a[@class="click"]');
foreach ($nodeList as $node) {
$href = $node->getAttribute('href');
$text = $node->textContent;
}
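One caveat: an exact comparison on the class attribute only matches when "click" is the sole class. If the link may carry several classes, a common XPath 1.0 idiom is to pad the attribute with spaces and search for the padded name. A minimal sketch, with made-up sample markup and an illustrative $hrefs array:

```php
<?php
// Hypothetical markup: the first link has two classes, the second only looks similar.
$html = '<p>I am some text
<a class="button click" href="http://www.google.com">I am a link</a>
<a class="clicker" href="http://example.com">Not me</a>
</p>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Pad the normalized class list with spaces, then look for " click " as a whole word.
$query = '//a[contains(concat(" ", normalize-space(@class), " "), " click ")]';

$hrefs = array();
foreach ($xpath->query($query) as $node) {
    $hrefs[] = $node->getAttribute('href');
}
print_r($hrefs);
```

The padding is what keeps "clicker" from matching, which a plain contains(@class, "click") would not prevent.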
You actually don't need to complicate your life at all:
$string = 'that html code with links';
// collect all matches in one call; looping on preg_match() with an unchanged subject would never end
if (preg_match_all('/<a class="click" href="([^"]*)">/', $string, $matches)) {
    foreach ($matches[1] as $url) {
        // print the captured group, which is the URL you're searching for
        echo $url;
    }
}
How can I replace <p><span class="headline"> with <p class="headline"><span>
in the easiest way with PHP?
$data = file_get_contents("http://www.ihr-apotheker.de/cs1.html");
$clean1 = strstr($data, '<p>');
$str = preg_replace('#(<a.*>).*?(</a>)#', '$1$2', $clean1);
$ausgabe = strip_tags($str, '<p>');
echo $ausgabe;
Before I alter the HTML from the site, I want to move the class declaration from the span to the <p> tag.
Don't parse HTML with regex!
This class should provide what you need:
http://simplehtmldom.sourceforge.net/
The reason not to parse HTML with regex is if you can't guarantee the format. If you already know the format of the string, you don't have to worry about having a complete parser.
In your case, if you know that's the format, you can use str_replace
$data = str_replace('<p><span class="headline">', '<p class="headline"><span>', $data);
Well, answer was accepted already, but anyway, here is how to do it with native DOM:
$dom = new DOMDocument;
$dom->loadHTMLFile("http://www.ihr-apotheker.de/cs1.html");
$xPath = new DOMXpath($dom);
// remove links but keep link text
foreach($xPath->query('//a') as $link) {
$link->parentNode->replaceChild(
$dom->createTextNode($link->nodeValue), $link);
}
// switch classes
foreach($xPath->query('//p/span[@class="headline"]') as $node) {
$node->removeAttribute('class');
$node->parentNode->setAttribute('class', 'headline');
}
echo $dom->saveHTML();
On a side note, HTML has elements for headings, so why not use an <h*> element instead of the semantically superfluous "headline" class?
Have you tried using str_replace?
If the placement of the <p> and <span> tags is consistent, you can simply replace one with the other:
str_replace("part to replace", "replacement", $string);
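Applied to the markup from the question, this could look like the sketch below. The sample string and the name $result are illustrative. Note that str_replace() takes the search string first and returns the new string rather than modifying its argument.

```php
<?php
// Hypothetical input in the format described in the question.
$string = '<p><span class="headline">Title</span></p>';

// Arguments are: search, replacement, subject. The result must be captured.
$result = str_replace(
    '<p><span class="headline">',
    '<p class="headline"><span>',
    $string
);

echo $result, "\n";
```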