I wrote the code below. How can I make it shorter? I mean, I don't want to use foreach every time I do a regex match. Thank you.
<?php
preg_match_all('#<article [^>]*>(.*?)<\/article>#sim', $content, $article);
foreach($article[1] as $posts) {
preg_match_all('#<img class="images" [^>]*>#si', $posts, $matches);
$img[] = $matches[0];
}
$result = array_filter($img);
foreach($result as $res) {
preg_match_all('#src="(.*?)" data-highres="(.*?)"#si', $res[0], $out);
$final[] = array(
'src' => $proxy.base64_encode($out[1][0]),
'highres' => $proxy.base64_encode($out[2][0])
);
}
?>
If you want robust code (code that always works), avoid parsing HTML with regular expressions: HTML is more complicated and unpredictable than you might think. Instead, use the built-in tools designed for this task, i.e. the DOMxxx classes.
$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($content);
libxml_use_internal_errors($state);
$xp = new DOMXPath($dom);
$imgList = $xp->query('//article//img[@src][@data-highres]');
foreach($imgList as $img) {
$final[] = [
'src' => $proxy.base64_encode($img->getAttribute('src')),
'highres' => $proxy.base64_encode($img->getAttribute('data-highres'))
];
}
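The original regex only matched images with class="images", which the XPath query above does not check. A minimal sketch of how that restriction could be added, assuming the class may appear among several class names; the sample markup and the $proxy value are made up for illustration:

```php
<?php
// Hypothetical sample input and proxy prefix, for illustration only.
$content = '<article>'
    . '<img class="images" src="a.jpg" data-highres="a-big.jpg">'
    . '<img class="thumb" src="t.jpg" data-highres="t-big.jpg">'
    . '</article>';
$proxy = 'https://proxy.example/?u=';

$dom = new DOMDocument;
$state = libxml_use_internal_errors(true); // silence warnings about HTML5 tags
$dom->loadHTML($content);
libxml_clear_errors();
libxml_use_internal_errors($state);

$xp = new DOMXPath($dom);
// Match "images" as a whole class token, then require both attributes.
$query = '//article//img'
    . '[contains(concat(" ", normalize-space(@class), " "), " images ")]'
    . '[@src][@data-highres]';

$final = [];
foreach ($xp->query($query) as $img) {
    $final[] = [
        'src'     => $proxy . base64_encode($img->getAttribute('src')),
        'highres' => $proxy . base64_encode($img->getAttribute('data-highres')),
    ];
}
print_r($final);
```

The concat/normalize-space idiom is the usual XPath 1.0 way to test for one class among several; if the attribute is always exactly "images", [@class="images"] would be enough.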
I am using Simple HTML DOM to parse an HTML file.
I have a dynamic array called links2. It can be empty, or it can hold four or more elements, depending on the case.
<?php
include('simple_html_dom.php');
$url = 'http://www.example.com/';
$html = file_get_html($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
//////////////////////////////////////////////////////////////////////////////
foreach ($doc->getElementsByTagName('p') as $link)
{
$intro2 = $link->nodeValue;
$links2[] = array(
'value' => $link->textContent,
);
$su=count($links2);
}
$word = 'document.write(';
Assume that two elements of the $links2 array contain $word. When I try to filter $links2 by removing the elements that match:
unset( $links2[array_search($word, $links2 )] );
print_r($links2);
The filter removes only one element, and array_diff doesn't solve the problem. Any suggestions?
Solved by adding a check while building the array:
foreach ($doc->getElementsByTagName('p') as $link)
{
$dont = $link->textContent;
if (strpos($dont, 'document') === false) {
$links2[] = array(
'value' => $link->textContent,
);
}
}
$su=count($links2);
echo $su;
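For an array that has already been built, note that array_search returns at most one key and compares the needle against whole elements, which here are arrays rather than strings. A minimal sketch of filtering after the fact with array_filter and a substring test; the sample data is invented for illustration:

```php
<?php
// Hypothetical sample data shaped like $links2 in the question.
$links2 = [
    ['value' => 'First paragraph'],
    ['value' => 'document.write("ad");'],
    ['value' => 'Second paragraph'],
    ['value' => 'x document.write("more");'],
];
$word = 'document.write(';

// Keep only the entries whose text does NOT contain $word,
// then reindex the result from 0.
$filtered = array_values(array_filter(
    $links2,
    function (array $item) use ($word) {
        return strpos($item['value'], $word) === false;
    }
));

print_r($filtered);
echo count($filtered), PHP_EOL;
```

This removes every matching element in one pass, regardless of how many there are.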
I am trying to extract a specific type of link from a webpage using PHP.
The links look like the following:
http://www.example.com/pages/12345667/some-texts-available-here
I want to extract all links in the above format:
maindomain.com/pages/somenumbers/sometexts
So far I can extract all the links from the webpage, but I can't get the filter above to work. How can I achieve this?
Any suggestions?
<?php
$html = file_get_contents('http://www.example.com');
//Create a new DOM document
$dom = new DOMDocument;
@$dom->loadHTML($html);
$links = $dom->getElementsByTagName('a');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
?>
You can use DOMXPath and register a PHP function with DOMXPath::registerPhpFunctions so that you can call it inside an XPath query:
function checkURL($url) {
$parts = parse_url($url);
unset($parts['scheme']);
if ( count($parts) == 2 &&
isset($parts['host']) &&
isset($parts['path']) &&
preg_match('~^/pages/[0-9]+/[^/]+$~', $parts['path']) ) {
return true;
}
return false;
}
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($filename);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPhpFunctions('checkURL');
$links = $xp->query("//a[php:functionString('checkURL', @href)]");
foreach ($links as $link) {
echo $link->getAttribute('href'), PHP_EOL;
}
In this way you extract only the links you want.
This is a slight guess, but if I got it wrong you can still see the way to do it.
foreach ($links as $link){
//Extract and show the "href" attribute.
if (preg_match("/(?:http.*)maindomain\.com\/pages\/\d+\/.*/", $link->getAttribute('href'))) {
echo $link->nodeValue;
echo $link->getAttribute('href'), '<br>';
}
}
You already use a parser, so you might step forward and use an xpath query on the DOM. XPath queries offer functions like starts-with() as well, so this might work:
$xpath = new DOMXpath($dom);
$links = $xpath->query("//a[starts-with(@href, 'maindomain.com')]");
Loop over them afterwards:
foreach ($links as $link) {
// do sth. with it here
// after all, it is a DOMElement
}
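Note that starts-with() only matches if the href really begins with that string, so an absolute URL with a scheme (http://...) would be skipped. XPath 1.0 also has no regex support, so one option is to pre-filter with contains() and do the exact check in PHP. A rough sketch, with invented sample links and a pattern that is only my reading of the "pages/number/text" format from the question:

```php
<?php
// Hypothetical sample markup, for illustration only.
$html = '<a href="http://www.example.com/pages/12345667/some-texts-available-here">A</a>'
      . '<a href="http://www.example.com/about">B</a>'
      . '<a href="http://www.example.com/pages/abc/not-numeric">C</a>';

$dom = new DOMDocument;
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($state);

$xpath = new DOMXPath($dom);

$matches = [];
// Coarse filter in XPath, precise filter in PHP.
foreach ($xpath->query("//a[contains(@href, '/pages/')]") as $link) {
    $href = $link->getAttribute('href');
    if (preg_match('~^https?://[^/]+/pages/\d+/[^/]+$~', $href)) {
        $matches[] = $href;
    }
}

print_r($matches);
```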
What I want is to use an HTML snippet as a template with placeholders: load the template, fill it with content, and return the new HTML:
$html = '<table>
<tr id="extra">
<td>###EXTRATITLE###</td>
<td>###EXTRATOTAL###</td>
</tr>
</table>';
$temp = new DOMDocument();
$temp->loadHTML($html);
$str = $temp->saveHTML($temp->getElementById('extra'));
$dom = new DOMDocument();
$dom->loadHTML($html);
$dom->preserveWhiteSpace = false;
$element = $dom->getElementById('extra');
$element->parentNode->removeChild($element);
$data = [
"key1" => "value1",
"key2" => "value2",
];
foreach ($data as $key => $row) {
$search = [ '###EXTRATITLE###', '###EXTRATOTAL###' ];
$replace = [ $key, $row ];
$el = $dom->createTextNode(str_replace($search, $replace, $str));
$foo = $dom->documentElement->firstChild;
$foo->appendChild($el);
}
echo preg_replace('~<(?:!DOCTYPE|/?(?:html|body))[^>]*>\s*~i', '', $dom->saveHTML());
The problems are the entities and the wrong placement of the child nodes. Could anyone fix this?
Assuming you have a data mapping array like this:
$data = array(
'PLACEHOLDER1' => 'data 1',
'PLACEHOLDER2' => 'data 2',
);
Here is what you could do:
$html = '<table>
<tr id="myselector">
<td>###PLACEHOLDER1###</td>
<td>###PLACEHOLDER2###</td>
</tr>
</table>';
foreach( array_keys( $data ) as $key )
{
$html = str_replace( '###'.$key.'###', $data[ $key ], $html );
}
Here is an alternative approach:
$html = '<table><tr id="myselector"></tr></table>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$tr = $doc->getElementById('myselector');
foreach ($data as $row) {
$td = $doc->createElement('td', $row['value']);
$tr->appendChild($td);
//repeat as necessary
}
It does not use placeholders, but it should produce the desired result.
If the goal is to create a more complex templating system from scratch, it might make more sense to peruse the XPath documentation and leverage the associated XPath class.
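If the placeholders should stay, the two symptoms in the question appear to come from createTextNode (which inserts the markup as literal text, hence the entities) and from appending to the document's first child rather than to the table. A minimal sketch that keeps the placeholder idea, assuming the row template is well-formed XML; the $rowTemplate string is a simplified stand-in for the question's snippet:

```php
<?php
// Hypothetical row template and data, modelled on the question.
$rowTemplate = '<tr><td>###EXTRATITLE###</td><td>###EXTRATOTAL###</td></tr>';
$data = [
    'key1' => 'value1',
    'key2' => 'value2',
];

$dom = new DOMDocument();
$dom->loadHTML('<table></table>');
$table = $dom->getElementsByTagName('table')->item(0);

foreach ($data as $key => $value) {
    // Escape the values so they cannot break the XML of the row.
    $row = strtr($rowTemplate, [
        '###EXTRATITLE###' => htmlspecialchars($key),
        '###EXTRATOTAL###' => htmlspecialchars($value),
    ]);

    // A fragment parses the string into real nodes instead of text.
    $fragment = $dom->createDocumentFragment();
    $fragment->appendXML($row);
    $table->appendChild($fragment);
}

// Serialize only the table, so no doctype/html/body stripping is needed.
$out = $dom->saveHTML($table);
echo $out, PHP_EOL;
```

Passing the node to saveHTML avoids the preg_replace cleanup step used in the question.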
I'm wondering why you're using PHP for this instead of jQuery, which makes this SUPER easy. I mean, I know you can, but I'm not sure you're doing yourself any favors.
I had a similar requirement where I wanted to use a template and get server-side values. What I ended up doing was creating a server-side array and converting it to JSON in PHP with json_encode($the_data_array);
Then I had a standing script portion of my template that used jQuery selectors to get and set the values. This was really clean and much easier to debug than trying to generate the entire HTML payload from PHP. It also meant I could more easily separate the data file from the template for better caching and updates later.
I can try to update this with a fiddler example if you're not familiar with jQuery and need some guidance. Just let me know.
I wrote a small helper function that does basic search and replace using XPath, because it lets me write manipulations that are short and at the same time easy to read and understand.
Code:
<?php
function xml_search_replace($dom, $search_replace_rules) {
if (!is_array($search_replace_rules)) {
return;
}
$xp = new DOMXPath($dom);
foreach ($search_replace_rules as $search_pattern => $replacement) {
foreach ($xp->query($search_pattern) as $node) {
$node->nodeValue = $replacement;
}
}
}
The problem is that I now need to do different search/replace operations on different parts of the XML DOM. I had hoped something like the following would work, but DOMXPath can't use a DOMDocumentFragment :(
The first part of the example below (up to the foreach loop) works like a charm. I'm looking for inspiration for an alternative approach that is still short and readable (without too much boilerplate).
Code:
<?php
$dom = new DOMDocument;
$dom->loadXml(file_get_contents('container.xml'));
$payload = $dom->getElementsByTagName('Payload')->item(0);
xml_search_replace($dom, array('//MessageReference' => 'SRV4-ID00000000001'));
$payloadXmlTemplate = file_get_contents('payload_template.xml');
foreach (array(array('id' => 'some_id_1'),
array('id' => 'some_id_2')) as $request) {
$fragment = $dom->createDocumentFragment();
$fragment->appendXML($payloadXmlTemplate);
xml_search_replace($fragment, array('//PayloadElement' => $request['id']));
$payload->appendChild($fragment);
}
Thanks to Francis Avila I came up with the following:
<?php
function xml_search_replace($node, $search_replace_rules) {
if (!is_array($search_replace_rules)) {
return;
}
$xp = new DOMXPath($node->ownerDocument);
foreach ($search_replace_rules as $search_pattern => $replacement) {
foreach ($xp->query($search_pattern, $node) as $matchingNode) {
$matchingNode->nodeValue = $replacement;
}
}
}
$dom = new DOMDocument;
$dom->loadXml(file_get_contents('container.xml'));
$payload = $dom->getElementsByTagName('Payload')->item(0);
xml_search_replace($dom->documentElement, array('//MessageReference' => 'SRV4-ID00000000001'));
$payloadXmlTemplate = file_get_contents('payload_template.xml');
foreach (array(array('id' => 'some_id_1'),
array('id' => 'some_id_2')) as $request) {
$fragment = $dom->createDocumentFragment();
$fragment->appendXML($payloadXmlTemplate);
xml_search_replace($payload->appendChild($fragment),
array('//PayloadElement' => $request['id']));
}
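One detail worth checking when passing a context node to DOMXPath::query: an expression that starts with // is absolute and still searches the whole document, while .// is relative to the context node. A small sketch of the difference; the Root and Item element names are invented for illustration:

```php
<?php
// Hypothetical document: two items, each with one PayloadElement.
$dom = new DOMDocument;
$dom->loadXML(
    '<Root><Payload>'
    . '<Item><PayloadElement>a</PayloadElement></Item>'
    . '<Item><PayloadElement>b</PayloadElement></Item>'
    . '</Payload></Root>'
);

$xp = new DOMXPath($dom);
$firstItem = $dom->getElementsByTagName('Item')->item(0);

// "//" starts at the document root, even with a context node given.
$absolute = $xp->query('//PayloadElement', $firstItem)->length;

// ".//" starts at the context node and only sees its descendants.
$relative = $xp->query('.//PayloadElement', $firstItem)->length;

echo "absolute: $absolute, relative: $relative", PHP_EOL;
```

If each replacement is meant to touch only the newly appended part, rules written as './/PayloadElement' and a context node that actually contains the new elements may be what is needed.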
$tags = array(
"applet" => 1,
"script" => 1
);
$html = file_get_contents("test.html");
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$body = $xpath->query("//body")->item(0);
I want to loop through the body of the web page and remove all unwanted tags listed in the $tags array, but I can't find a way. How can I do it?
Have you considered HTML Purifier? Writing your own HTML sanitizer is just reinventing the wheel, and it isn't easy to get right.
Furthermore, a blacklist approach is also bad; see SO/why-use-a-whitelist-for-html-sanitizing.
You may also be interested in reading how to configure allowed tags & attributes, or in testing the HTML Purifier demo.
$tags = array(
"applet" => 1,
"script" => 1
);
$html = file_get_contents("test.html");
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// $tags is keyed by tag name, so iterate over its keys
foreach (array_keys($tags) as $tag) {
$list = $xpath->query("//".$tag);
for($j=0; $j<$list->length; ++$j) {
$node = $list->item($j);
if ($node == null) continue;
$node->parentNode->removeChild($node);
}
}
$string = $dom->saveXML();
Something like that.
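The same idea can be written with a single union query instead of nested loops. This is only a sketch with invented sample markup, and it is still a blacklist, so the caveats about sanitizing mentioned above apply:

```php
<?php
// Hypothetical sample page, for illustration only.
$html = '<html><body><p>Keep</p><script>alert(1)</script>'
      . '<applet code="X"></applet><div><script>x()</script></div>'
      . '</body></html>';
$tags = ['applet', 'script'];

$dom = new DOMDocument();
$state = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($state);

$xpath = new DOMXPath($dom);

// Builds "//applet | //script" from the tag list.
$query = '//' . implode(' | //', $tags);

// Copy the result to an array first, then remove each node.
foreach (iterator_to_array($xpath->query($query)) as $node) {
    $node->parentNode->removeChild($node);
}

$clean = $dom->saveHTML();
echo $clean;
```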