One result array - php

I'm trying to add the results of a script to an array, but when I look into it there is only one item in it. It's probably me being silly with placement.
function crawl_page($url, $depth)
{
    static $seen = array();
    $Linklist = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }
    $seen[$url] = true;
    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);
    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            $href = rtrim($url, '/') . '/' . ltrim($href, '/');
        }
        if (shouldScrape($href) == true) {
            crawl_page($href, $depth - 1);
        }
    }
    echo "URL:", $url;
    echo http_response($url);
    echo "<br/>";
    $Linklist[] = $url;
    $XML = new DOMDocument('1.0');
    $XML->formatOutput = true;
    $root = $XML->createElement('Links');
    $root = $XML->appendChild($root);
    foreach ($Linklist as $value) {
        $child = $XML->createElement('Linkdetails');
        $child = $root->appendChild($child);
        $text = $XML->createTextNode($value);
        $text = $child->appendChild($text);
    }
    $XML->save("linkList.xml");
}

$Linklist[] = $url; adds a single item to the $Linklist array. I think this line needs to be in a loop.

I think it should be static $Linklist = array(); so that the list survives across the recursive calls, but the code is awful.
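To see why the static keyword matters here, the following is a minimal sketch, not the asker's crawler: collect_local() and collect_static() are hypothetical stand-ins for crawl_page() that recurse on a depth counter instead of fetching pages. A plain local array is re-created on every call, while a static one is initialised once and shared by all calls.

```php
<?php
// Hypothetical stand-ins for crawl_page(); no pages are fetched.
function collect_local(int $depth): array
{
    $list = array();            // re-created on every call
    if ($depth > 0) {
        collect_local($depth - 1);
    }
    $list[] = $depth;           // only this call's item survives
    return $list;
}

function collect_static(int $depth): array
{
    static $list = array();     // initialised once, shared by all calls
    if ($depth > 0) {
        collect_static($depth - 1);
    }
    $list[] = $depth;           // every call appends to the same array
    return $list;
}

$local  = collect_local(3);
$static = collect_static(3);

echo count($local), "\n";       // one item
echo count($static), "\n";      // one item per call
```

Note that a static array keeps growing across separate top-level calls too, so passing the list by reference or returning it may be the cleaner design.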

Related

How to parse url with DOMparser using getNamedItem

I am trying to grab a URL with DOMparser, but I am stuck at getNamedItem.
How can I solve this problem? What am I missing here? Any ideas are welcome!
$url = 'https://www.31sumai.com/search/area/kansai/result/?area=16,17,18';
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$mainlink = null;
$allPTags = $DOMParser->getElementsByTagName('p');
foreach ($allPTags as $ptag) {
    $class = $ptag->attributes->getNamedItem("class");
    if ($class && $class->nodeValue == 'c-name') {
        $main = $ptag->attributes->getNamedItem("href");
        if ($main) {
            $mainlink = $main->nodeValue;
        }
    }
}
var_dump($mainlink);
It's returning null, but I already checked the website and there is a URL in that tag.
$url = 'https://lions-mansion.jp/area/kansai/';
$html = file_get_contents($url);
libxml_use_internal_errors(true);
$DOMParser = new \DOMDocument();
$DOMParser->loadHTML($html);
$mainlink = null;
$allPTags = $DOMParser->getElementsByTagName('p');
foreach ($allPTags as $ptag) {
    $class = $ptag->attributes->getNamedItem("class");
    if ($class && $class->nodeValue == 'areapageDetailList_item_btn_hp') {
        $links = $ptag->getElementsByTagName('a');
        foreach ($links as $link) {
            $hrefAttr = $link->attributes->getNamedItem("href");
            if ($hrefAttr) {
                $mainlink = $hrefAttr->nodeValue;
            }
        }
    }
}
echo $mainlink;
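The second snippet looks for the href on an a element inside the p, rather than on the p itself, since a p element normally carries no href attribute. A minimal sketch of the same idea with DOMXPath follows; the markup is made up for illustration and the class name is simply reused from the question, so nothing here is confirmed against the real page.

```php
<?php
// Made-up markup standing in for the real page.
$html = '<div><p class="c-name"><a href="/detail/1/">Name</a></p>'
      . '<p class="other"><a href="/skip/">Skip</a></p></div>';

libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
// Select the href attribute of any <a> nested in <p class="c-name">.
// The class comparison is exact, so extra class names would not match.
$attrs = $xpath->query('//p[@class="c-name"]//a/@href');

$mainlink = $attrs->length > 0 ? $attrs->item(0)->nodeValue : null;
var_dump($mainlink);
```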

php DOMDocument extract links with anchor or alt

I wish to extract all the links on a page, each with its anchor text, or with the alt attribute of an image inside the link if the image comes first.
$html = '<a href="lien.fr">Anchor</a>';
Must return "lien.fr;Anchor"
$html = '<a href="lien.fr"><img alt="Alt Anchor">Anchor</a>';
Must return "lien.fr;Alt Anchor"
$html = '<a href="lien.fr">Anchor<img alt="Alt Anchor"></a>';
Must return "lien.fr;Anchor"
I did:
$doc = new DOMDocument();
$doc->loadHTML($html);
$out = array(); // must be an array: a string cannot take $out[$n]['link']
$n = 0;
$links = $doc->getElementsByTagName('a');
foreach ($links as $element) {
    $href = $img_alt = $anchor = "";
    $href = $element->getAttribute('href');
    $n++;
    if (!strrpos($href, "panier?")) {
        if ($element->firstChild->nodeName == "img") {
            $imgs = $element->getElementsByTagName('img');
            foreach ($imgs as $img) {
                if ($anchor = $img->getAttribute('alt')) {
                    break;
                }
            }
        }
        if (($anchor == "") && ($element->nodeValue)) {
            $anchor = $element->nodeValue;
        }
        $out[$n]['link'] = $href;
        $out[$n]['anchor'] = $anchor;
    }
}
This seems to work, but not when there is whitespace or indentation inside the link, as in:
$html = '<a href="link.fr">
<img src="ceinture-gris" alt="alt anchor"/>
</a>';
Here $element->firstChild is the whitespace text node, so its nodeName will be #text rather than img.
Something like this:
$doc = new DOMDocument();
$doc->loadHTML($html);
// Output texts that will later be joined with ';'
$out = [];
// Maximum number of items to add to $out
$max_out_items = 2;
// List of img tag attributes that will be parsed by the loop below
// (in the order specified in this array!)
$img_attributes = ['alt', 'src', 'title'];
$links = $doc->getElementsByTagName('a');
foreach ($links as $element) {
    if ($href = trim($element->getAttribute('href'))) {
        $out []= $href;
        if (count($out) >= $max_out_items)
            break;
    }
    foreach ($element->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE &&
            $text = trim($child->nodeValue))
        {
            $out []= $text;
            if (count($out) >= $max_out_items)
                break;
        } elseif ($child->nodeName == 'img') {
            foreach ($img_attributes as $attr_name) {
                if ($attr_value = trim($child->getAttribute($attr_name))) {
                    $out []= $attr_value;
                    if (count($out) >= $max_out_items)
                        goto Result;
                }
            }
        }
    }
}
Result:
echo $out = implode(';', $out);
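The whitespace problem from the question can also be handled by skipping text nodes that are empty after trimming. The following is a rough sketch under that assumption, handling only the first link of the input; link_label() is a hypothetical helper name, not part of the answer above.

```php
<?php
// link_label() is a hypothetical helper, written only for illustration.
function link_label(string $html): ?string
{
    libxml_use_internal_errors(true);   // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);

    $link = $doc->getElementsByTagName('a')->item(0);
    if ($link === null) {
        return null;                     // no link in the input
    }
    $href = $link->getAttribute('href');

    foreach ($link->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $text = trim($child->nodeValue);
            if ($text !== '') {
                return $href . ';' . $text;   // real anchor text comes first
            }
            continue;                    // whitespace-only text node: skip it
        }
        if ($child->nodeName === 'img' && $child->getAttribute('alt') !== '') {
            return $href . ';' . $child->getAttribute('alt');
        }
    }
    return $href;                        // neither text nor alt found
}

$spaced = link_label("<a href=\"link.fr\">\n<img src=\"ceinture-gris\" alt=\"alt anchor\"/>\n</a>");
echo $spaced, "\n";
```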

Replacing last char (string) using regex or DOMDocument

I'm using one small script to convert from absolute links to relative ones. It is working but it needs improvement. Not sure how to proceed. Please have a look at part of the script used for this.
Script:
public function links($path) {
    $old_url = 'http://test.dev/';
    $dir_handle = opendir($path);
    while ($item = readdir($dir_handle)) {
        $new_path = $path."/".$item;
        if (is_dir($new_path) && $item != '.' && $item != '..') {
            $this->links($new_path);
        }
        // it is a file
        else {
            if ($item != '.' && $item != '..')
            {
                $new_url = '';
                $depth_count = 1;
                $folder_depth = substr_count($new_path, '/');
                while ($depth_count < $folder_depth) {
                    $new_url .= '../';
                    $depth_count++;
                }
                $file_contents = file_get_contents($new_path);
                $doc = new DOMDocument;
                @$doc->loadHTML($file_contents);
                foreach ($doc->getElementsByTagName('a') as $link) {
                    if (substr($link, -1) == "/") {
                        $link->setAttribute('href', $link->getAttribute('href').'/index.html');
                    }
                }
                $doc->saveHTML();
                $file_contents = str_replace($old_url, $new_url, $file_contents);
                file_put_contents($new_path, $file_contents);
            }
        }
    }
}
As you can see, I've added the DOMDocument code inside the while loop, but it doesn't work. What I'm trying to achieve is to append index.html to every link whose last character is /.
What am I doing wrong?
Thank you.
Is this what you want?
$file_contents = file_get_contents($new_path);
$dom = new DOMDocument();
$dom->loadHTML($file_contents);
$xpath = new DOMXPath($dom);
$links = $xpath->query("//a");
foreach ($links as $link) {
    $href = $link->getAttribute('href');
    if (substr($href, -1) === '/') {
        $link->setAttribute('href', $href."index.html");
    }
}
$new_file_content = $dom->saveHTML();
# save this wherever you want
See a demo on ideone.com.
Hint: Your call to $doc->saveHTML() leads nowhere (i.e. there's no variable capturing the output).
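A self-contained sketch of the two fixes discussed here, reading the href attribute instead of the node and capturing the return value of saveHTML(). The markup is invented for illustration; the question reads it from files on disk instead.

```php
<?php
// Made-up markup; the question reads it from files on disk instead.
$file_contents = '<p><a href="http://test.dev/docs/">Docs</a> '
               . '<a href="http://test.dev/about.html">About</a></p>';

libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($file_contents);

foreach ($dom->getElementsByTagName('a') as $link) {
    $href = $link->getAttribute('href');   // read the attribute, not the node
    if (substr($href, -1) === '/') {
        $link->setAttribute('href', $href . 'index.html');
    }
}

// saveHTML() returns the markup; it has to be captured to be used.
$new_file_contents = $dom->saveHTML();
echo $new_file_contents;
```

Only the link ending in / is changed; the str_replace() step from the question would then run on $new_file_contents rather than on the original string.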

Remove HTML Attributes using PHP

Using PHP, I want to remove all HTML attributes except
the "src" attribute of "img" tags
and
the "href" attribute of "a" tags.
My input file is an .html file which has been converted from .doc and .docx.
My output file should again be an HTML file, with the attributes removed.
Kindly help me, please.
Edit:
After trying Alexander's script as below, if I open strip.html in a code editor I don't see any changes.
<?php
$path = '/var/www/strip.html';
$html = file_get_contents($path);
$dom = new DOMDocument();
$dom->strictErrorChecking = false;
$dom->formatOutput = true;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
if (false === ($elements = $xpath->query("//img"))) die('Error');
foreach ($elements as $element) {
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;
        if ('src' !== $name) {
            $element->removeAttribute($name);
        }
    }
}
if (false === ($elements = $xpath->query("//a"))) die('Error');
foreach ($elements as $element) {
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;
        if ('href' !== $name) {
            $element->removeAttribute($name);
        }
    }
}
$dom->saveHTMLFile($path);
?>
Use the DOMDocument class to parse the HTML (processing of "a" and "img" tags):
$path = '/path/to/file.html';
$html = file_get_contents($path);
$dom = new DOMDocument();
//$dom->strictErrorChecking = false;
$dom->formatOutput = true;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
if (false === ($elements = $xpath->query("//img"))) die('Error');
foreach ($elements as $element) {
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;
        if ('src' !== $name) {
            $element->removeAttribute($name);
        }
    }
}
if (false === ($elements = $xpath->query("//a"))) die('Error');
foreach ($elements as $element) {
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;
        if ('href' !== $name) {
            $element->removeAttribute($name);
        }
    }
}
$dom->saveHTMLFile($path);
Also, read why you can't parse [X]HTML with regex and take a look at useful xpath links.
Update (processing the attributes of all tags, keeping only "src" on "img" and "href" on "a"):
$path = '/path/to/file.html';
$html = file_get_contents($path);
$dom = new DOMDocument();
//$dom->strictErrorChecking = false;
$dom->formatOutput = true;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
if (false === ($elements = $xpath->query("//*"))) die('Error');
foreach ($elements as $element) {
    for ($i = $element->attributes->length; --$i >= 0;) {
        $name = $element->attributes->item($i)->name;
        if (('img' === $element->nodeName && 'src' === $name)
            || ('a' === $element->nodeName && 'href' === $name)
        ) {
            continue;
        }
        $element->removeAttribute($name);
    }
}
$dom->saveHTMLFile($path);
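The same approach can be tried without touching any file. The sketch below runs on a made-up HTML string; the $keep whitelist is an illustrative variable that does not appear in the answer. It walks each element's attributes backwards, because removing an attribute shifts the ones after it.

```php
<?php
// Made-up input resembling Word-exported markup.
$html = '<p class="MsoNormal" style="margin:0">'
      . '<a href="page.html" target="_blank" class="x">link</a>'
      . '<img src="pic.png" width="10" alt="pic"></p>';

libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom = new DOMDocument();
$dom->loadHTML($html);

// Attributes to keep, per tag name (illustrative whitelist).
$keep = ['img' => ['src'], 'a' => ['href']];

foreach ((new DOMXPath($dom))->query('//*') as $element) {
    $allowed = $keep[$element->nodeName] ?? [];
    // Walk backwards: removing an attribute shifts the ones after it.
    for ($i = $element->attributes->length - 1; $i >= 0; $i--) {
        $name = $element->attributes->item($i)->nodeName;
        if (!in_array($name, $allowed, true)) {
            $element->removeAttribute($name);
        }
    }
}

$clean = $dom->saveHTML();   // capture the result instead of writing a file
echo $clean;
```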

concatenate innerhtml of div into string variable

I tried to concatenate the inner HTML of a div into a string variable:
games variable:
$games = '';
DOMinnerHTML function:
function DOMinnerHTML($element)
{
    $innerHTML = "";
    $children = $element->childNodes;
    foreach ($children as $child)
    {
        $tmp_dom = new DOMDocument();
        $tmp_dom->appendChild($tmp_dom->importNode($child, true));
        $innerHTML .= trim($tmp_dom->saveHTML());
    }
    return $innerHTML;
}
ExtractFromType function:
function ExtractFromType($type)
{
    $html = file_get_contents('www.site.com/' .$type);
    $dom = new domDocument;
    @$dom->loadHTML($html);
    $dom->preserveWhiteSpace = false;
    $divs = $dom->getElementsByTagName('div');
    foreach ($divs as $div) {
        if (strpos($div->getAttribute('style'),'MyString') !== false) {
            //////
            $games = $games.DOMinnerHTML($div);
            //////
        }
    }
}
code:
ExtractFromType('MyType');
echo $games; // = Nothing.
This code returns nothing.
$games is defined in the global scope, and it's not available inside ExtractFromType. Define it inside the function, then return the value:
function ExtractFromType($type) {
    $html = file_get_contents('www.site.com/' .$type);
    $dom = new domDocument;
    @$dom->loadHTML($html);
    $dom->preserveWhiteSpace = false;
    $divs = $dom->getElementsByTagName('div');
    $games = '';
    foreach ($divs as $div) {
        if (strpos($div->getAttribute('style'),'MyString') !== false) {
            $games = $games.DOMinnerHTML($div);
        }
    }
    return $games; // hand the collected HTML back to the caller
}
echo ExtractFromType('MyType');
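The scoping rule behind this answer can be seen without any DOM code. In the sketch below, append_without_return() and build_and_return() are hypothetical names: the first writes to a local $games that is discarded when the function ends, the second returns what it built.

```php
<?php
$games = '';   // global scope

function append_without_return(string $piece): void
{
    // This $games is a new local variable; the global one is untouched.
    $games = '';
    $games .= $piece;
}

function build_and_return(array $pieces): string
{
    $games = '';
    foreach ($pieces as $piece) {
        $games .= $piece;
    }
    return $games;   // hand the result back to the caller
}

append_without_return('<b>lost</b>');
$result = build_and_return(['<b>one</b>', '<i>two</i>']);

var_dump($games);    // still the empty string
echo $result, "\n";
```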
