Can't get the dom node value extracted - php

I have code that links to another site, grabs that data, and returns the string to a variable. I'm wondering why this isn't working, however?
<?php
$file = $DOCUMENT_ROOT . "http://www.sc2brasd.net";
$doc = new DOMDocument();
@$doc->loadHTMLFile($file);
$elements = $doc->getElementsByTagName('h1');
for ($i=1; $i<=7; $i++)
{
echo trim($elements->item($i)->nodeValue);
}
?>
There are seven "h1" tags that I would like to grab, but they won't echo out. An example of the string would be "Here is the test string i am trying to pull out".

This will not work because the path doesn't exist. With $DOCUMENT_ROOT prepended, it points to a file on your server instead of the remote site.
$file = $DOCUMENT_ROOT . "http://www.sc2brasd.net";
I'm not sure if loadHTMLFile() can handle URLs at all. You may need to get the document with file() and load it with DOMDocument::loadHTML.
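For what it's worth, loadHTMLFile() goes through PHP's stream wrappers, so it should accept a URL when allow_url_fopen is enabled. Below is a minimal sketch of the fetch-then-loadHTML() route, using file_get_contents() since loadHTML() expects a single string. It assumes the page really contains h1 tags; the inline $html string stands in for the fetched page so the snippet runs without network access, and the helper name extract_headings is made up for illustration.

```php
<?php
// Hypothetical helper: returns the trimmed text of every <h1> in an HTML string.
function extract_headings(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $headings = [];
    // foreach avoids the off-by-one of a hand-written index loop (items start at 0).
    foreach ($doc->getElementsByTagName('h1') as $h1) {
        $headings[] = trim($h1->nodeValue);
    }
    return $headings;
}

// In real use the markup would come from the remote site, e.g.
// $html = file_get_contents('http://www.sc2brasd.net');
$html = '<html><body><h1> First </h1><p>text</p><h1>Second</h1></body></html>';
$headings = extract_headings($html);
foreach ($headings as $heading) {
    echo $heading, "\n";
}
```

Looping with foreach also sidesteps the question's for loop, which starts at index 1 and so skips the first heading.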

Related

image scraping pointing each url to directory using php

I have the code for image scraping but what I am trying to fix here are a few things:
replace this $the_site = "url"; with my input type="text"
so instead of putting url on the code I want to put the url on my input.
I want to handle multiple folders and links: instead of putting the same code 5 times on the page, I want to point each url to its own directory.
My code downloads images from pages and saves them to a folder, so I want to put everything inside one pair of php tags.
here's my code
<?php
$the_site = "url";
$the_tag = "div"; #
$the_class = "slides";
$html = file_get_contents($the_site);
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//'.$the_tag.'[contains(@id,"'.$the_class.'")]/img') as $item) {
$img_src = $item->getAttribute('src');
//print $img_src."\n"; Ignore This
//copy($img_src,'C:\xampp\htdocs\grabIMG\download'); Ignore This
$img_name = end(explode("/",$img_src));
echo $img_name.' has downloaded<br />';
$img_content = file_get_contents($img_src);
$fp = fopen(" folder/".$img_name,"w");
fwrite($fp,$img_content);
fclose($fp);
}
?>
I have been pasting this code about 5 times in the page, each time opening new php tags, but I get this error and execution never completes:
Fatal error: Maximum execution time of 30 seconds exceeded in
C:\xampp\htdocs\grabIMG\index.php on line 106
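One way to avoid pasting the block five times might be to wrap it in a function and loop over a map of page URL to target folder. This is a rough sketch, not tested against any real site: the function names, the $jobs map and the example.com URLs are all placeholders, the XPath is the one from the question, and set_time_limit() is raised only because several downloads can exceed the 30-second default. The page URL could equally come from a form field such as $_POST['url'] after validation. Relative src values would still need to be resolved against the page URL.

```php
<?php
// Hypothetical helper: collect the src of every <img> that sits directly inside
// a $tag element whose id contains $idPart (same XPath as in the question).
function find_image_sources(string $html, string $tag, string $idPart): array
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $sources = [];
    foreach ($xpath->query('//' . $tag . '[contains(@id,"' . $idPart . '")]/img') as $img) {
        $sources[] = $img->getAttribute('src');
    }
    return $sources;
}

// Hypothetical helper: download every matching image of one page into one folder.
function download_images(string $pageUrl, string $folder, string $tag, string $idPart): void
{
    if (!is_dir($folder)) {
        mkdir($folder, 0777, true);
    }
    $html = file_get_contents($pageUrl);
    if ($html === false) {
        return; // page could not be fetched
    }
    foreach (find_image_sources($html, $tag, $idPart) as $src) {
        $content = file_get_contents($src);
        if ($content !== false) {
            file_put_contents($folder . '/' . basename($src), $content);
        }
    }
}

// Placeholder map of page URL => target folder; fill in real values.
$jobs = [
    // 'http://example.com/gallery-one' => 'folder1',
    // 'http://example.com/gallery-two' => 'folder2',
];

set_time_limit(0); // lift the 30-second limit while downloading
foreach ($jobs as $pageUrl => $folder) {
    download_images($pageUrl, $folder, 'div', 'slides');
}

// Offline check of the extraction step with inline markup.
$sample = '<div id="slides"><img src="http://example.com/a.jpg"><img src="http://example.com/b.png"></div>';
$sources = find_image_sources($sample, 'div', 'slides');
```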

Php sort multiple xmlDoc by date

I am pulling a list of blog pages from a .XML file and printing the 2 newest entries for a web page. I however have no idea how to sort the .XML files by pubDate or file_edited.
The code successfully retrieves the files and prints the two newest entries.
Here is the PHP code block that retrieves the files and prints them.
<?php
date_default_timezone_set('Europe/Helsinki');
/* XML Source URL:s */
$pages=("blog/data/other/pages.xml");
/* XML Doc Conversions */
$xmlDoc = new DOMDocument();
echo "<div class='blog_article_wrapper'>";
function myFunction($x){
// Run 2 times, skip first file and stop loop.
for ($i=1; $i<=2; $i++) {
//Get "Title
$item_title=$x->item($i)->getElementsByTagName('title')
->item(0)->childNodes->item(0)->nodeValue;
//Get "Date" from .XML document.
$item_date=$x->item($i)->getElementsByTagName('pubDate')
->item(0)->childNodes->item(0)->nodeValue;
//Get "URL" from .XML document.
$item_url=$x->item($i)->getElementsByTagName('url')
->item(0)->childNodes->item(0)->nodeValue;
//Get "Author" from .XML document.
$item_author=$x->item($i)->getElementsByTagName('author')
->item(0)->childNodes->item(0)->nodeValue;
//Format date and author
$item_date = date('d.m.Y', strtotime($item_date));
$item_author = ucfirst(strtolower($item_author));
//Get content data from the specific .XML document being iterated in the loop
$url=("blog/data/pages/" . $item_url . ".xml");
$xmlDoc = new DOMDocument();
$xmlDoc->load($url);
$y=$xmlDoc->getElementsByTagName('content')->item(0)->nodeValue;
//Limit content to 150 letters and first paragraph tag.
$start = strpos($y, '<p>="') + 9;
$length = strpos($y, '"</p>') - $start;
$src = substr($y, $start, $length);
$item_content = "\"" . (substr($src, 0, 150)) . "...\"";
// Page specific code for output comes here.
}
}
//Call loop and iterate data
$xmlDoc->load($pages);
$x=$xmlDoc->getElementsByTagName('item');
myFunction($x);
?>
Any advice, code or articles pointing in the right direction would be much appreciated.
Thank you!
I figured this out myself using another Stack Overflow question and php.net:
<?php
//Directory where files are stored.
$folder = "blog/data/pages/";
$array = array();
//scandir and populate array with filename as key and filemtime as value.
foreach (scandir($folder) as $node) {
$nodePath = $folder . DIRECTORY_SEPARATOR . $node;
if (is_dir($nodePath)) continue;
$array[$nodePath] = filemtime($nodePath);
}
//Sort entry and store two newest files into $newest
arsort($array);
$newest = array_slice($array, 0, 2);
// $newest is now populated with name of .XML document as key and filemtime as value
// Use built in functions array_keys() and array_values() to access data
?>
I can now modify the original code in the question to use only these two outputted files for retrieving the desired data.
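If the goal is to sort by pubDate from pages.xml rather than by file modification time, something along these lines might work. It is only a sketch: it assumes each item has url and pubDate children as in the question and that strtotime() can parse the date format. The inline XML and the function name newest_items are made up for illustration.

```php
<?php
// Hypothetical helper: return the $limit newest items as arrays, newest first.
function newest_items(string $xml, int $limit = 2): array
{
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $items = [];
    foreach ($doc->getElementsByTagName('item') as $item) {
        $items[] = [
            'url'  => $item->getElementsByTagName('url')->item(0)->nodeValue,
            'time' => strtotime($item->getElementsByTagName('pubDate')->item(0)->nodeValue),
        ];
    }
    // Sort descending by timestamp so the newest entry comes first.
    usort($items, function (array $a, array $b): int {
        return $b['time'] <=> $a['time'];
    });
    return array_slice($items, 0, $limit);
}

// Made-up sample data; in the question this would be
// file_get_contents('blog/data/other/pages.xml').
$xml = <<<XML
<channel>
  <item><url>first-post</url><pubDate>Mon, 02 Jan 2012 10:00:00 +0200</pubDate></item>
  <item><url>third-post</url><pubDate>Wed, 04 Jan 2012 10:00:00 +0200</pubDate></item>
  <item><url>second-post</url><pubDate>Tue, 03 Jan 2012 10:00:00 +0200</pubDate></item>
</channel>
XML;

$newest = newest_items($xml);
foreach ($newest as $entry) {
    echo $entry['url'], "\n";
}
```

Sorting on the parsed timestamp rather than the raw string matters here, because date strings in this format do not sort chronologically as text.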

Using php to get parent element of link with URL

I'm trying to implement a "find and replace" system for broken links. The problem is, for some links there are no replacements. So, I need to comment out certain li elements. You can see my code below to do this. (I'm starting with an HTML form).
<?php
$brokenlink = $_POST['brokenlink'];
$newlink = $_POST['newlink'];
$brokenlink = '"' . $brokenlink . '"';
$newlink = '"' . $newlink . '"';
$di = new RecursiveDirectoryIterator('hugedirectory');
foreach (new RecursiveIteratorIterator($di) as $filename => $file) {
// echo $filename . ' - ' . $file->getSize() . ' bytes <br/>';
$filetoedit = file_get_contents($file);
if(strpos($filetoedit, $brokenlink)) {
echo $brokenlink . "found in " . $filename . "<br/>";
$filetoedit = str_replace($brokenlink, $newlink, $filetoedit);
file_put_contents($filename, $filetoedit);
}
}
?>
What I want to accomplish is this: if I have a URL, I want to be able to find its li parent. For instance, if the user inputs http://www.espn.com in an HTML form, I want PHP to find this element on my server:
<li>Sports</li>
And replace it with this:
<!-- <li>Sports</li> -->
Is this possible? Thanks.
I would try using this to parse the DOM.
http://simplehtmldom.sourceforge.net/
You can set a class to all the ones you want comment out. Then use this tool to find those classes and comment them all out at once.
Why not use a regexp to find and replace links? It would also take care of the perhaps expensive looping over links.
Here's a regex for matching urls
http://daringfireball.net/2010/07/improved_regex_for_matching_urls
then preg_replace the broken with the new, or the broken with the commented out version of the broken link
Alternatively you can just run grep on the directory via shell_exec, that way you don't have to open / read and parse files yourself.
Also take a look at this match url pattern in php using regular expression
I suggest you construct DOMDocument with the file content and use XPath to search for the broken link node.
$dom = new DOMDocument();
@$dom->loadHTML($filetoedit);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//li/a[@href="' . $brokenlink . '"]');
for ($i = 0; $i < $nodes->length; $i++) {
$node = $nodes->item($i);
// Do whatever you want here
}
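To fill in the loop body for the commenting-out case, one option is to replace each matching li with a comment node that holds its markup. This is a sketch under assumptions: the link is a direct child of the li, the URL is passed without the surrounding quotes that the question's code adds, and comment_out_link is a made-up name. Note that saveHTML() re-serializes the whole document, so the output formatting may differ from the original file.

```php
<?php
// Hypothetical helper: wrap every <li> that directly contains a link to $url
// in an HTML comment and return the re-serialized document.
function comment_out_link(string $html, string $url): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Select the parent <li>, not the <a> itself.
    // Assumes $url contains no double quote.
    $items = $xpath->query('//li[a[@href="' . $url . '"]]');
    foreach ($items as $li) {
        // Serialize the <li>, then swap it for a comment holding that markup.
        $markup = trim($dom->saveHTML($li));
        $comment = $dom->createComment(' ' . $markup . ' ');
        $li->parentNode->replaceChild($comment, $li);
    }
    return $dom->saveHTML();
}

$html = '<ul><li><a href="http://www.espn.com">Sports</a></li>'
      . '<li><a href="http://www.php.net">PHP</a></li></ul>';
$result = comment_out_link($html, 'http://www.espn.com');
echo $result;
```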

Get anchor tags from multiple HTML Files

I am not sure if this is even possible but I am trying to extract all the anchor tag links in a few HTML files on my website. I have currently written a php script that scans a few directories and sub directories that builds an array of HTML file links. Here is that code:
$di = new RecursiveDirectoryIterator('Migration');
$migrate = array();
foreach (new RecursiveIteratorIterator($di) as $filename => $file) {
if (eregi("\.html",$file) || eregi("\.htm",$file) ) {
$migrate[] .= $filename;
}
}
This method successfully produces the HTML File links that I need. Ex:
Migration/administration/billing/Billing.htm
Migration/administration/billing/_notes/Billing.htm.mno
Migration/administration/new business/_notes/New Business.htm.mno
Migration/administration/new business/New Business.htm
Migration/account/nycds/_notes/NYCDS Index.htm.mno
Migration/account/nycds/NYCDS Index.htm
There's more links but this gives you an idea. The next part is where I am stuck. I was thinking that I would need a for loop to loop through each array element, open the file, extract the links, then store those links somewhere. I am just not sure how I would go about this process. I tried to google this question but I never seemed to get results that matched what I was looking to do. Here is the simplified for loop that I have.
var obj = <?php echo json_encode($migrate); ?>;
for(var i=0;i< obj.length;i++){
// alert(obj[i]);
}
The above code is in javascript. From what I am reading, It seems that I shouldn't be using javascript but should maybe continue using PHP. I am confused on what my next steps should be. If someone can point me in the right direction I would really appreciate it. Thank you so much for your time.
Use DOMDocument::getElementsByTagName to retrieve all <a> tags
http://www.php.net/manual/en/domdocument.getelementsbytagname.php
Example,
$doc = new DOMDocument();
$doc->loadHTMLFile("filename.html");
$anchors = $doc->getElementsByTagName('a'); //retrieve all anchor tags
foreach ($anchors as $a) { //loop anchors
echo $a->nodeValue;
}
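Building on that, if the links themselves are wanted rather than the link text, getAttribute('href') can be collected per file. This is a small sketch, assuming $migrate holds the file paths from the question; the function name extract_links and the inline sample markup are made up so the snippet runs on its own.

```php
<?php
// Hypothetical helper: return the href of every <a> tag in an HTML string.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) { // skip named anchors without a target
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}

// With the array from the question, the links could be grouped per file:
// foreach ($migrate as $path) {
//     $allLinks[$path] = extract_links(file_get_contents($path));
// }

$sample = '<p><a href="Billing.htm">Billing</a> <a name="top">Top</a> '
        . '<a href="http://www.php.net/">PHP</a></p>';
$links = extract_links($sample);
print_r($links);
```

Doing this in PHP keeps the whole job on the server, so the JavaScript loop from the question is not needed.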

In PHP, how can I get an XML attribute based on a variable?

I'm retrieving files like so (from the Internet Archive):
<files>
<file name="Checkmate-theHumanTouch.gif" source="derivative">
<format>Animated GIF</format>
<original>Checkmate-theHumanTouch.mp4</original>
<md5>72ec7fcf240969921e58eabfb3b9d9df</md5>
<mtime>1274063536</mtime>
<size>377534</size>
<crc32>b2df3fc1</crc32>
<sha1>211a61068db844c44e79a9f71aa9f9d13ff68f1f</sha1>
</file>
<file name="CheckmateTheHumanTouch1961.thumbs/Checkmate-theHumanTouch_000001.jpg" source="derivative">
<format>Thumbnail</format>
<original>Checkmate-theHumanTouch.mp4</original>
<md5>6f6b3f8a779ff09f24ee4cd15d4bacd6</md5>
<mtime>1274063133</mtime>
<size>1169</size>
<crc32>657dc153</crc32>
<sha1>2242516f2dd9fe15c24b86d67f734e5236b05901</sha1>
</file>
</files>
They can have any number of <file>s, and I'm solely looking for the ones that are thumbnails. When I find them, I want to increase a counter. When I've gone through the whole file, I want to find the middle Thumbnail and return the name attribute.
Here's what I've got so far:
//pop previously retrieved XML file into a variable
$elem = new SimpleXMLElement($xml_file);
//establish variable
$i = 0;
// Look through each parent element in the file
foreach ($elem as $file) {
if ($file->format == "Thumbnail"){$i++;}
}
//find the middle thumbnail.
$chosenThumb = ceil(($i/2)-1);
//Gloriously announce the name of the chosen thumbnail.
echo($elem->file[$chosenThumb]['name']);
The final echo doesn't work because it doesn't like having a variable choose the XML element. It works fine when I hardcode it in. Can you guess that I'm new to handling XML files?
Edit:
Francis Avila's answer from below sorted me right out!:
$sxe = simplexml_load_file($url);
$thumbs = $sxe->xpath('/files/file[format="Thumbnail"]');
$n_thumbs = count($thumbs);
$middlethumb = $thumbs[(int) ($n_thumbs/2)];
$happy_string = (string)$middlethumb['name'];
echo $happy_string;
Use XPath.
$sxe = simplexml_load_file($url);
$thumbs = $sxe->xpath('/files/file[format="Thumbnail"]');
$n_thumbs = count($thumbs);
$middlethumb = $thumbs[(int) ($n_thumbs/2)];
$middlethumbname = (string) $middlethumb['name'];
You can also accomplish this with a single XPath expression if you don't need the total count:
$thumbs = $sxe->xpath('/files/file[format="Thumbnail"][position() = floor(last() div 2) + 1]/@name');
$middlethumbname = (count($thumbs)) ? $thumbs[0]['name'] : '';
A limitation of SimpleXML's xpath method is that it can only return nodes and not simple types. This is why you need to use $thumbs[0]['name']. If you use DOMXPath::evaluate(), you can do this instead:
$doc = new DOMDocument();
$doc->load($url);
$xp = new DOMXPath($doc);
$middlethumbname = $xp->evaluate('string(/files/file[format="Thumbnail"][position() = floor(last() div 2) + 1]/@name)');
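Here is a self-contained sketch of the DOMXPath::evaluate() approach, using inline sample XML shaped like the question's data (the file names are invented). It uses last() to count the filtered thumbnails and adds 1 because XPath positions start at 1, which mirrors the zero-based (int) ($n_thumbs/2) index used earlier.

```php
<?php
// Invented sample data in the same shape as the Internet Archive listing.
$xml = <<<XML
<files>
  <file name="clip.gif"><format>Animated GIF</format></file>
  <file name="thumb_1.jpg"><format>Thumbnail</format></file>
  <file name="thumb_2.jpg"><format>Thumbnail</format></file>
  <file name="thumb_3.jpg"><format>Thumbnail</format></file>
</files>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xp = new DOMXPath($doc);

// The second predicate is applied to the already filtered thumbnails,
// so last() is the number of thumbnails, not the number of <file> elements.
$middlethumbname = $xp->evaluate(
    'string(/files/file[format="Thumbnail"][position() = floor(last() div 2) + 1]/@name)'
);
echo $middlethumbname, "\n";
```

If no thumbnail exists, string() of the empty node set yields an empty string, so no separate count check is needed.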
$elem->file[$chosenThumb] will give the $chosenThumb'th element from the main file[] not the filtered(for Thumbnail) file[], right?
$filteredFiles = array();
foreach ($elem as $file) {
if ($file->format == "Thumbnail"){
$i++;
//add this item to a new array($filteredFiles)
$filteredFiles[] = $file;
}
}
$chosenThumb = (int) ceil(($i/2)-1);
//echo($elem->file[$chosenThumb]['name']);
echo($filteredFiles[$chosenThumb]['name']);
Some problems:
Middle thumbnail is incorrectly calculated. You'll have to keep a separate array for those thumbs and get the middle one using count.
file might need to be {'file'}, I'm not sure how PHP sees this.
you don't have a default thumbnail
Code you should use is this one:
$files = new SimpleXMLElement($xml_file);
$thumbs = array();
foreach($files as $file)
if($file->format == "Thumbnail")
$thumbs[] = $file;
$chosenThumb = ceil((count($thumbs)/2)-1);
echo (count($thumbs)===0) ? 'default-thumbnail.png' : $thumbs[$chosenThumb]['name'];
/edit: but I recommend that guy's solution, to use XPath. Way easier.
