PHP DOMDocument: extract links with anchor text or image alt

I wish to extract all the links on a page, each with its anchor text, or with the alt attribute of an image inside the link if the image comes first.
$html = '<a href="lien.fr">Anchor</a>';
Must return "lien.fr;Anchor"
$html = '<a href="lien.fr"><img alt="Alt Anchor">Anchor</a>';
Must return "lien.fr;Alt Anchor"
$html = '<a href="lien.fr">Anchor<img alt="Alt Anchor"></a>';
Must return "lien.fr;Anchor"
I did:
$doc = new DOMDocument();
$doc->loadHTML($html);
$out = []; // must be an array: $out[$n]['link'] is assigned below
$n = 0;
$links = $doc->getElementsByTagName('a');
foreach ($links as $element) {
$href = $img_alt = $anchor = "";
$href = $element->getAttribute('href');
$n++;
if (strpos($href, "panier?") === false) {
if ($element->firstChild->nodeName == "img") {
$imgs = $element->getElementsByTagName('img');
foreach ($imgs as $img) {
if ($anchor = $img->getAttribute('alt')) {
break;
}
}
}
if (($anchor == "") && ($element->nodeValue)) {
$anchor = $element->nodeValue;
}
$out[$n]['link'] = $href;
$out[$n]['anchor'] = $anchor;
}
}
This seems to work, but it fails when there is whitespace or indentation inside the link, as in:
$html = '<a href="link.fr">
<img src="ceinture-gris" alt="alt anchor"/>
</a>';
Here $element->firstChild is a whitespace text node, so its nodeName is "#text" rather than "img".

Something like this:
$doc = new DOMDocument();
$doc->loadHTML($html);
// Output texts that will later be joined with ';'
$out = [];
// Maximum number of items to add to $out
$max_out_items = 2;
// List of img tag attributes that will be parsed by the loop below
// (in the order specified in this array!)
$img_attributes = ['alt', 'src', 'title'];
$links = $doc->getElementsByTagName('a');
foreach ($links as $element) {
    if ($href = trim($element->getAttribute('href'))) {
        $out[] = $href;
        if (count($out) >= $max_out_items)
            break;
    }
    foreach ($element->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE &&
            $text = trim($child->nodeValue))
        {
            $out[] = $text;
            if (count($out) >= $max_out_items)
                break 2; // leave both loops, not just the inner one
        } elseif ($child->nodeName == 'img') {
            foreach ($img_attributes as $attr_name) {
                if ($attr_value = trim($child->getAttribute($attr_name))) {
                    $out[] = $attr_value;
                    if (count($out) >= $max_out_items)
                        goto Result;
                }
            }
        }
    }
}
Result:
echo $out = implode(';', $out);
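If one "href;label" pair is wanted for every link on the page, rather than only the first, the same idea can be wrapped in a function. This is a minimal sketch, not a definitive implementation: extractLinks() is a hypothetical helper name, and falling back to the link's full text content when no direct text or alt is found is my assumption about the desired behaviour.

```php
<?php
// Sketch: return "href;label" for every link, skipping whitespace-only text nodes.
// extractLinks() is a hypothetical helper name, not part of DOMDocument.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $results = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));
        if ($href === '') {
            continue; // ignore anchors without a target
        }
        $label = '';
        foreach ($a->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                $label = trim($child->nodeValue);
            } elseif ($child->nodeType === XML_ELEMENT_NODE && $child->nodeName === 'img') {
                $label = trim($child->getAttribute('alt'));
            }
            if ($label !== '') {
                break; // the first non-empty text or alt wins
            }
        }
        if ($label === '') {
            // Assumed fallback: text nested deeper inside the link.
            $label = trim($a->textContent);
        }
        $results[] = $href . ';' . $label;
    }
    return $results;
}

print_r(extractLinks("<a href=\"link.fr\">\n  <img src=\"ceinture-gris\" alt=\"alt anchor\"/>\n</a>"));
```

Because whitespace-only text nodes trim to an empty string, they are skipped and the indented example from the question yields "link.fr;alt anchor".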


Why is the "Learn More" link not linking to the page?

I'm trying to understand how this code (from another developer) is written. It has a bug that I can't seem to fix: the "Learn more" link doesn't link to the post set in the custom field.
I've tried removing the "Learn more" lines, but then the slide link points to the image itself and not to what is in the custom link field.
$slides = ONS_Slide_Custom_Post_Type::find_all('DESC');
if (isset($slides) && count($slides) > 0) {
$items = array();
foreach ($slides as $slide) {
//echo '<tt><pre>' . var_export($slide, true) . '</pre></tt>';
$item = new stdClass();
if (isset($slide->custom_data) && count($slide->custom_data) > 0) {
if (isset($slide->custom_data['ons_slide_image'])) {
$item->src = $slide->custom_data['ons_slide_image'];
}
if (isset($slide->custom_data['ons_slide_heading'])) {
$item->heading = $slide->custom_data['ons_slide_heading'];
$item->heading .= '<span class="punctuation">.</span><span class="learn_more"> »</span>';
}
if (isset($slide->custom_data['ons_slide_caption'])) {
$item->caption = $slide->custom_data['ons_slide_caption'];
$item->caption .= ' Learn more »';
}
if (isset($slide->custom_data['ons_slide_href'])) {
$item->href = $slide->custom_data['ons_slide_href'];
} else {
$item->href = "#";
}
}
$items[] = $item;
}
$carousel = new ONS_Bootstrap_Carousel($items);
echo $carousel;
}
You are already doing something with $slide->custom_data['ons_slide_href'], but AFTER the lines of code that output your anchor tag.
So try rearranging the processing a bit, like this:
$slides = ONS_Slide_Custom_Post_Type::find_all('DESC');
if (isset($slides) && count($slides) > 0) {
$items = array();
foreach ($slides as $slide) {
//echo '<tt><pre>' . var_export($slide, true) . '</pre></tt>';
$item = new stdClass();
if (isset($slide->custom_data) && count($slide->custom_data) > 0) {
if (isset($slide->custom_data['ons_slide_image'])) {
$item->src = $slide->custom_data['ons_slide_image'];
}
if (isset($slide->custom_data['ons_slide_heading'])) {
$item->heading = $slide->custom_data['ons_slide_heading'];
$item->heading .= '<span class="punctuation">.</span><span class="learn_more"> »</span>';
}
// moved this code above the anchor tag line
if (isset($slide->custom_data['ons_slide_href'])) {
$item->href = $slide->custom_data['ons_slide_href'];
} else {
$item->href = "#";
}
// Now concatenate $item->href in the anchor tag line
if (isset($slide->custom_data['ons_slide_caption'])) {
$item->caption = $slide->custom_data['ons_slide_caption'];
$item->caption .= ' <a href="' . $item->href . '">Learn more »</a>';
}
}
$items[] = $item;
}
$carousel = new ONS_Bootstrap_Carousel($items);
echo $carousel;
}
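As a minimal sketch of the reordering idea: once the href is known, it can be built into the caption's link. captionWithLink() is a hypothetical helper, not part of the original carousel code, and the exact markup of the "Learn more" link is an assumption.

```php
<?php
// Hypothetical helper: append a "Learn more" link pointing at $href to a caption.
function captionWithLink(string $caption, string $href): string
{
    // Escape the URL so that quotes in it cannot break the attribute.
    $safeHref = htmlspecialchars($href, ENT_QUOTES, 'UTF-8');
    return $caption . ' <a href="' . $safeHref . '">Learn more »</a>';
}

echo captionWithLink('Our new range', '/products/new'), "\n";
```

The href has to be resolved before this is called, which is why the ons_slide_href block is moved above the caption block.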

PHP Simple HTML DOM parser

I am working on a simple web crawler. Below is the simple HTML I am using to learn.
input.php
<ul id="nav">
<li>
Google
<ul>
<li>
Gmail
</li>
</ul>
</li>
<li>
Yahoo
<ul>
<li>
Yahoo Mail
</li>
</ul>
</li>
</ul>
I need to crawl the first anchor tag in each ul[id=nav] > li. The code I use to crawl input.php is:
<?php
include 'simple_html_dom.php';
$html = file_get_html('input.php');
foreach ($html->find('ul[id=nav]') as $navUL){
foreach ($navUL->find('li') as $navUL_LI){
echo $navUL_LI->find('a',0)->outertext."<br>";
}
}
?>
It displays all the anchor tags in my input.php. I need to display only Google and Yahoo. How can I achieve this?
In this case you can target them directly with the children() method. Example:
foreach($html->find('ul#nav') as $ul) {
foreach($ul->children() as $li) {
echo $li->children(0)->outertext . '<br/>';
}
}
Alternatively, you can use DOMDocument + DOMXpath for this too:
$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXpath($dom);
// directly target those links
$links = $xpath->query('//ul[@id="nav"]/li/a');
foreach($links as $a) {
echo $a->nodeValue . '<br/>';
}
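For reference, here is a self-contained version of the XPath approach. The markup in $str is an assumption: the href values are placeholders, since the links in the question's input.php are not shown.

```php
<?php
// Assumed markup: the hrefs below are placeholders, not taken from the question.
$str = '<ul id="nav">
  <li><a href="#google">Google</a>
    <ul><li><a href="#gmail">Gmail</a></li></ul>
  </li>
  <li><a href="#yahoo">Yahoo</a>
    <ul><li><a href="#ymail">Yahoo Mail</a></li></ul>
  </li>
</ul>';

$dom = new DOMDocument();
$dom->loadHTML($str);
$xpath = new DOMXPath($dom);

$topLevel = [];
// "/" selects direct children only, so the nested <li><a> are not matched.
foreach ($xpath->query('//ul[@id="nav"]/li/a') as $a) {
    $topLevel[] = trim($a->nodeValue);
}
echo implode(', ', $topLevel), "\n"; // prints "Google, Yahoo"
```

The single slashes are what exclude Gmail and Yahoo Mail: those anchors sit under a nested ul, not directly under a top-level li.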
<?php
include 'simple_html_dom.php';
$html = file_get_html('input.php');
foreach ($html->find('ul[id=nav]') as $navUL){
foreach ($navUL->find('li') as $navUL_LI){
if (stripos($navUL_LI->innertext, 'google') !== false || stripos($navUL_LI->innertext, 'yahoo') !== false) {
echo $navUL_LI->find('a',0)->outertext."<br>";
}
}
}
?>
I have done the same thing in Objective-C.
You can use the XML or HTML APIs to parse your HTML into an object.
If you want to do this by hand, find the opening tag and the closing tag.
After that, get the first child, then the second, and so on.
Try this:
// get the children of the element #nav, i.e. the top level lis
$lis = $html->getElementById("nav")->childNodes();
// for each child, find the first 'a' element
foreach ($lis as $li) {
$a = $li->find('a',0);
// retrieve the link text itself.
echo "link text: " . $a->innertext() . "\n";
}
See the simple-html-dom manual for details of all these methods.
You can simply achieve that by:
<?php
foreach ($html->find('ul[id=nav]') as $navUL){
foreach ($navUL->find('li') as $navUL_LI){
echo $navUL_LI->find('a',-2)->outertext."<br>";
}
}
?>
<?php
$in = '<style> .catalog-product-view .product.attribute.overview ul { margin-top: 10px; } </style><img src="/media/wysiwyg/img/misc/made-in-the-usa-doh-blue4.png"><ul><li>Ships as (12) 40 fl oz bottles</li></ul>';
function parseTags($input, $callback) {
$len = strlen($input);
$stack = [];
$tag = "";
$data = "";
$isTag = false;
$isString = false;
for ($i=0; $i<$len; $i++) {
$char = $input[$i];
if ($char == '<') {
$isTag = true;
$tag .= $char;
} else if ($char == '>') {
$tag .= $char;
if (substr($tag, 0, 2) == '</') {
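// A limit of 2 splits off the tag name; a limit of 1 would return the whole tag, attributes included.
$close = str_replace('>', '', str_replace('</', '', explode(' ', $tag, 2)[0]));
$open = str_replace('>', '', str_replace('<', '', explode(' ', end($stack), 2)[0]));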
if ($open == $close) {
$callback($tag, $data, $stack, $i, false);
array_pop($stack);
}
} else if (substr($tag, -2) == '/>') {
$callback($tag, $data, $stack, $i, false);
} else {
$callback($tag, $data, $stack, $i, true);
$stack[] = $tag;
}
$tag = "";
$data = "";
$isTag = false;
} else if ($char == '"' || $char == "'") {
if ($isString == false) {
$isString = $char;
} else if ($isString == $char && $input[$i-1] != '\\') {
$isString = false;
}
} else if ($isTag) {
$tag .= $char;
} else {
$data .= $char;
}
}
}
parseTags($in, function($tag, $data, $stack, $position, $isOpen) use (&$out) {
print_r(func_get_args());
});

How to refactor my php methods?

I have a question about simplifying my code.
I have:
public function getText($text){
if(!empty($text)){
$dom = new DomDocument();
$dom->loadHTML($text);
$xpath=new DOMXpath($dom);
$result = $xpath->query('//a');
if($result->length > 0){
$atags=$dom->getElementsByTagName('a');
foreach($atags as $atag){
$style = $atag ->getAttribute('style');
$atag->setAttribute('style',$style.' text-decoration:none;color:black;');
}
$returnText .= $dom->saveHTML();
return $returnText;
}
$result = $xpath->query('//table');
if($result->length > 0){
$tables = $dom->getElementsByTagName('table');
$inputs = $dom->getElementsByTagName('input');
foreach ($inputs as $input) {
$input->setAttribute('style','text-align:center;');
}
foreach ($tables as $table) {
$table->setAttribute('width',500);
$table->setAttribute('style','border:2px solid #8C8C8C;text-align:center;table-layout:fixed;');
}
$returnText .= $dom->saveHTML();
return $returnText;
}
}
return $text;
}
public function getTextwithIndex($text,$index=''){
if(!empty($text[$index])){
$dom = new DomDocument();
$dom->loadHTML($text[$index]);
$xpath=new DOMXpath($dom);
$result = $xpath->query('//a');
if($result->length > 0){
$atags=$dom->getElementsByTagName('a');
foreach($atags as $atag){
$style = $atag ->getAttribute('style');
$atag->setAttribute('style',$style.' text-decoration:none;color:black;');
}
$returnText .= $dom->saveHTML();
return $returnText;
}
$result = $xpath->query('//tbody');
if($result->length > 0){
$tbodies = $dom->getElementsByTagName('tbody');
$cells = $dom->getElementsByTagName('td');
$inputs = $dom->getElementsByTagName('input');
foreach ($inputs as $input) {
$input->setAttribute('style','text-align:center;');
}
foreach ($cells as $cell) {
$cell->setAttribute('style','border:1px solid black;');
}
foreach ($tbodies as $tbody) {
$table = $dom->createElement('table');
$table->setAttribute('width',500);
$table->setAttribute('style','border:2px solid #8C8C8C;text-align:center;table-layout:fixed;');
$tbody->parentNode->replaceChild($table, $tbody);
$table->appendChild($tbody);
}
$returnText .= $dom->saveHTML();
return $returnText;
}
}
return $text;
}
The differences between the two methods are $index and some modifications to my DOMDocument. It feels really cumbersome and could use some refactoring. Does anyone have any good suggestions? Thanks!
How about something like this:
public function getTextwithIndex($text, $index = '') {
    if (empty($index))
        return $this->getText($text); // not sure how $text works, so this line might be different.
    return $this->getText($text[$index]);
}
Or something like this:
public function getText($text, $index = false){
if ($index)
$text = $text[$index];
if(!empty($text)){
$dom = new DomDocument();
$dom->loadHTML($text);
$xpath=new DOMXpath($dom);
$result = $xpath->query('//a');
if($result->length > 0){
$atags=$dom->getElementsByTagName('a');
foreach($atags as $atag){
$style = $atag ->getAttribute('style');
$atag->setAttribute('style',$style.' text-decoration:none;color:black;');
}
$returnText .= $dom->saveHTML();
return $returnText;
}
$result = $xpath->query('//table');
if($result->length > 0){
if ($index) {
//do 'getTextWithIndex' dom stuff
} else {
$tables = $dom->getElementsByTagName('table');
$inputs = $dom->getElementsByTagName('input');
}
foreach ($inputs as $input) {
$input->setAttribute('style','text-align:center;');
}
foreach ($tables as $table) {
$table->setAttribute('width',500);
$table->setAttribute('style','border:2px solid #8C8C8C;text-align:center;table-layout:fixed;');
}
$returnText .= $dom->saveHTML();
return $returnText;
}
}
return $text;
}
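Another way to cut the duplication is to pull the repeated "loop over a tag and set a style" step into one helper. This is only a sketch under assumptions: appendStyle() and styleLinks() are hypothetical names, written as plain functions here so that the example runs on its own. In the class they would be private methods.

```php
<?php
// Hypothetical helper: append inline CSS to every element with the given tag name.
// Returns how many elements were touched.
function appendStyle(DOMDocument $dom, string $tag, string $css): int
{
    $count = 0;
    foreach ($dom->getElementsByTagName($tag) as $node) {
        $existing = trim($node->getAttribute('style'));
        $node->setAttribute('style', trim($existing . ' ' . $css));
        $count++;
    }
    return $count;
}

// Hypothetical stand-in for the "style the links" branch shared by both methods.
function styleLinks(string $html): string
{
    if ($html === '') {
        return $html;
    }
    $dom = new DOMDocument();
    libxml_use_internal_errors(true); // keep libxml warnings out of the output
    $dom->loadHTML($html);
    libxml_clear_errors();

    if (appendStyle($dom, 'a', 'text-decoration:none;color:black;') === 0) {
        return $html; // nothing to change, hand back the input untouched
    }
    return $dom->saveHTML();
}

echo styleLinks('<p><a href="x">y</a></p>');
```

The table and tbody branches could call the same appendStyle() helper with 'input', 'td' and 'table', which leaves only the tbody wrapping as method-specific code.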

XML parsing in php

I am parsing an XML feed, but there is a tag that contains both an image and text. I want to separate the image and the text into different columns of a table in my layout, but I don't know how to do it. Please help me. My PHP file is:
<?php
$RSS_Content = array();
function RSS_Tags($item, $type)
{
$y = array();
$tnl = $item->getElementsByTagName("title");
$tnl = $tnl->item(0);
$title = $tnl->firstChild->textContent;
$tnl = $item->getElementsByTagName("link");
$tnl = $tnl->item(0);
$link = $tnl->firstChild->textContent;
$tnl = $item->getElementsByTagName("description");
$tnl = $tnl->item(0);
$img = $tnl->firstChild->textContent;
$y["title"] = $title;
$y["link"] = $link;
$y["description"] = $img;
$y["type"] = $type;
return $y;
}
function RSS_Channel($channel)
{
global $RSS_Content;
$items = $channel->getElementsByTagName("item");
// Processing channel
$y = RSS_Tags($channel, 0); // get description of channel, type 0
array_push($RSS_Content, $y);
// Processing articles
foreach($items as $item)
{
$y = RSS_Tags($item, 1); // get description of article, type 1
array_push($RSS_Content, $y);
}
}
function RSS_Retrieve($url)
{
global $RSS_Content;
$doc = new DOMDocument();
$doc->load($url);
$channels = $doc->getElementsByTagName("channel");
$RSS_Content = array();
foreach($channels as $channel)
{
RSS_Channel($channel);
}
}
function RSS_RetrieveLinks($url)
{
global $RSS_Content;
$doc = new DOMDocument();
$doc->load($url);
$channels = $doc->getElementsByTagName("channel");
$RSS_Content = array();
foreach($channels as $channel)
{
$items = $channel->getElementsByTagName("item");
foreach($items as $item)
{
$y = RSS_Tags($item, 1);
array_push($RSS_Content, $y);
}
}
}
function RSS_Links($url, $size = 15)
{
global $RSS_Content;
$page = "<ul>";
RSS_RetrieveLinks($url);
if($size > 0)
$recents = array_slice($RSS_Content, 0, $size + 1);
foreach($recents as $article)
{
$type = $article["type"];
if($type == 0) continue;
$title = $article["title"];
$link = $article["link"];
$img = $article["description"];
$page .= "$title\n";
}
$page .="</ul>\n";
return $page;
}
function RSS_Display($url, $click, $size = 8, $site = 0, $withdate = 0)
{
global $RSS_Content;
$opened = false;
$page = "";
$site = (intval($site) == 0) ? 1 : 0;
RSS_Retrieve($url);
if($size > 0)
$recents = array_slice($RSS_Content, $site, $size + 1 - $site);
foreach($recents as $article)
{
$type = $article["type"];
if($type == 0)
{
if($opened == true)
{
$page .="</ul>\n";
$opened = false;
}
$page .="<b>";
}
else
{
if($opened == false)
{
$page .= "<table width='369' border='0'>
<tr>";
$opened = true;
}
}
$title = $article["title"];
$link = $article["link"];
$img = $article["description"];
$page .= "<td width='125' align='center' valign='middle'>
<div align='center'>$img</div></td>
<td width='228' align='left' valign='middle'><div align='left'><a
href=\"$click\" target='_top'>$title</a></div></td>";
if($withdate)
{
$date = $article["date"];
$page .=' <span class="rssdate">'.$date.'</span>';
}
if($type==0)
{
$page .="<br />";
}
}
if($opened == true)
{
$page .="</tr>
</table>";
}
return $page."\n";
}
?>
To separate the image from the text, you need to parse the HTML stored inside the description element again, as XML. Luckily it is valid XML inside that element, so you can do this straightforwardly with SimpleXML. The following code example takes the URL, reduces each item's description to its text only, and extracts the src attribute of the image to store it as an image element, so that each item ends up like this:
<item>
<title>Fake encounter: BJP backs Kataria, says CBI targeting Modi</title>
<link>http://ibnlive.in.com/news/fake-encounter-bjp-backs-kataria-says-cbi-targeting-modi/391802-37-64.html</link>
<description>The BJP lashed out at the CBI and questioned its 'shoddy investigation' into the Sohrabuddin fake encounter case.</description>
<pubDate>Wed, 15 May 2013 13:48:56 +0530</pubDate>
<guid>http://ibnlive.in.com/news/fake-encounter-bjp-backs-kataria-says-cbi-targeting-modi/391802-37-64.html</guid>
<image>http://static.ibnlive.in.com/ibnlive/pix/sitepix/05_2013/bjplive_kataria3.jpg</image>
</item>
The code-example is:
$url = 'http://ibnlive.in.com/ibnrss/top.xml';
$feed = simplexml_load_file($url);
$items = $feed->xpath('(//channel/item)');
foreach ($items as $item) {
list($description, $image) =
simplexml_load_string("<r>$item->description</r>")
->xpath('(/r|/r//@src)');
$item->description = (string)$description;
$item->image = (string)$image;
}
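The same split can be tried offline on a single item. This is a minimal sketch: the sample XML and its example.com URL are made up for illustration, and it assumes the description is well-formed XML once decoded (a stray & or an unclosed tag would make simplexml_load_string() return false).

```php
<?php
// Assumed sample item: the description holds an escaped <img> followed by text.
$xml = '<item><description>&lt;img src="http://example.com/pic.jpg" /&gt;Some summary text.</description></item>';
$item = simplexml_load_string($xml);

// Wrap the decoded description in a root element <r> so it parses as XML.
$fragment = simplexml_load_string('<r>' . $item->description . '</r>');

// The attribute query returns an array; take the first src if there is one.
$src   = $fragment->xpath('//@src');
$image = $src ? (string) $src[0] : '';

// Casting the wrapper to string yields only its own text, without the <img>.
$text = trim((string) $fragment);

echo $image, "\n", $text, "\n";
```

With the two values separated, the image URL and the text can go into different table cells.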
You can then import the SimpleXML element into a DOMElement with dom_import_simplexml(). Honestly, though, I would simply generate that small piece of HTML in a foreach over the SimpleXML items as well: you can use LimitIterator for the paging just as you could with DOMDocument, and the data you need is easily at hand with SimpleXML. It is simpler to pass the SimpleXMLElements along than to parse them into an array first and then process the array, which makes the intermediate array moot.

One result array

I'm trying to add the results of a script to an array, but when I look into it there is only one item in it. It's probably me being silly with placement.
function crawl_page($url, $depth)
{
static $seen = array();
$Linklist = array();
if (isset($seen[$url]) || $depth === 0) {
return;
}
$seen[$url] = true;
$dom = new DOMDocument('1.0');
@$dom->loadHTMLFile($url);
$anchors = $dom->getElementsByTagName('a');
foreach ($anchors as $element) {
$href = $element->getAttribute('href');
if (0 !== strpos($href, 'http')) {
$href = rtrim($url, '/') . '/' . ltrim($href, '/');
}
if(shouldScrape($href)==true)
{
crawl_page($href, $depth - 1);
}
}
echo "URL:",$url;
echo http_response($url);
echo "<br/>";
$Linklist[] = $url;
$XML = new DOMDocument('1.0');
$XML->formatOutput = true;
$root = $XML->createElement('Links');
$root = $XML->appendChild($root);
foreach ($Linklist as $value)
{
$child = $XML->createElement('Linkdetails');
$child = $root->appendChild($child);
$text = $XML->createTextNode($value);
$text = $child->appendChild($text);
}
$XML->save("linkList.xml");
}
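$Linklist[] = $url; adds a single item to the $Linklist array, and $Linklist is reset to an empty array on every call to crawl_page(), so only one item is ever written out. This line needs to be in a loop, I think.
Declare it as static $Linklist = array(); I think, but the code is awful.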
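One way to keep a single list across the recursive calls is to pass the array by reference and write the XML once, after the crawl has finished. This is a sketch under assumptions: the $pages array is a made-up in-memory stand-in for fetching real pages, and crawl() is a simplified, hypothetical version of crawl_page().

```php
<?php
// Hypothetical in-memory "site": URL => outgoing links (stands in for real fetching).
$pages = [
    'http://a.test/'  => ['http://a.test/b', 'http://a.test/c'],
    'http://a.test/b' => ['http://a.test/c'],
    'http://a.test/c' => [],
];

// Collect every visited URL into $list, which is shared by all recursive calls.
function crawl(string $url, int $depth, array $pages, array &$list): void
{
    if ($depth === 0 || in_array($url, $list, true)) {
        return; // depth exhausted or URL already seen
    }
    $list[] = $url;
    foreach ($pages[$url] ?? [] as $href) {
        crawl($href, $depth - 1, $pages, $list);
    }
}

$list = [];
crawl('http://a.test/', 2, $pages, $list);

// Build the XML once, after the crawl has finished.
$xml = new DOMDocument('1.0');
$xml->formatOutput = true;
$root = $xml->appendChild($xml->createElement('Links'));
foreach ($list as $value) {
    $child = $root->appendChild($xml->createElement('Linkdetails'));
    $child->appendChild($xml->createTextNode($value));
}
echo $xml->saveXML();
```

Building the document outside the recursive function also avoids overwriting linkList.xml on every call.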
