Trouble printing array of titles and urls - php

I'm not very good at PHP.
I have a folder of .html files that will change often. I want to search through the folder, parse the <h1> tags, and then print/echo each <h1> tag and its URL.
Getting the <h1> tags out of the .html files was easy enough with some Googling, but I cannot seem to print a list of <h1> titles and their corresponding URLs.
Here is what I have so far:
$url_list = glob('posts/*.html'); // Finds all files in the posts/ directory that end in .html.

foreach ($url_list as $url) { // Builds an array of post URLs and <h1> title tags.
    $post = new DOMDocument(); // Creates a new document to load the blog post into.
    $post->loadHTMLFile($url); // Loads the blog post into $post from its URL.
    $h1_tags = $post->getElementsByTagName('h1'); // Finds all <h1> tags.
    $first_h1 = $h1_tags->item(0); // Gets the first <h1> tag.
    $title = $first_h1->nodeValue; // Sets $title to the value of the first <h1> tag.
    if (!empty($title)) { // Will only run on files which have a date in their metadata.
        $post_list[$url] = $title;
        $post_list[$title] = $url;
    }
}

sort($post_list); // Sorts the list of posts in alphabetical order.

$num = 1;
foreach ($post_list as $title) {
    echo "<h2>" . ($num++) . ". {$title} = {$url}</h2>";
}

You are adding the titles and URLs into the same list, but reversed. If you build up your data as...
if (!empty($title)) { // Will only run on files which have a date in their metadata.
    $post_list[$title] = $url;
}
...then each post is only added once, and you can output it like...
foreach ($post_list as $title => $url) {
    echo "<h2>" . ($num++) . ". {$title} = {$url}</h2>";
}
Edit: also change sort() to asort() so the key/value pairs stay together (or ksort() if you want to sort by the title keys).
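For reference, here is the whole flow with those fixes applied, as a minimal sketch (assuming the same posts/ directory as above; ksort() is used so the list is ordered by title, since the titles are now the array keys):

$post_list = [];
libxml_use_internal_errors(true); // silence warnings from untidy HTML in the posts

foreach (glob('posts/*.html') as $url) {
    $post = new DOMDocument();
    $post->loadHTMLFile($url);
    $first_h1 = $post->getElementsByTagName('h1')->item(0);
    if ($first_h1 !== null && !empty($first_h1->nodeValue)) {
        $post_list[$first_h1->nodeValue] = $url; // title => url, one entry per post
    }
}

ksort($post_list); // sort by title (the array keys)

$num = 1;
foreach ($post_list as $title => $url) {
    echo "<h2>" . ($num++) . ". {$title} = {$url}</h2>";
}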

Simple html dom always loading the default first page and not the specified url

I want to scrape a few web pages. I am using PHP and the Simple HTML DOM parser.
For instance, I am trying to scrape this site: https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=5
I use this to load the URL:
$html = new simple_html_dom();
$html->load_file($url);
This loads the correct page. Then I find the next-page link; here it will be:
https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes&page=6
Only the page value has changed, from 5 to 6. The code snippet to get the next link is:
function getNextLink($_htmlTemp)
{
    //Getting the next page link
    $aNext = $_htmlTemp->find('a.next', 0);
    $nextLink = $aNext->href;
    return $nextLink;
}
The above method returns the correct link, with the page value being 6.
Now when I try to load this next link, it fetches the default first page, with the page query absent from the URL.
//After the loop we will have details of all the listings on this page -- so get the next page link
$nxtLink = getNextLink($originalHtml); //Returns string url
if (!empty($nxtLink))
{
    //Yay, we have the next link -- load the next link
    print 'Next Url: ' . $nxtLink . '<br>'; //$nxtLink has the correct value
    $originalHtml->load_file($nxtLink); //This line fetches the default page
}
The whole flow is something like this:
$html->load_file($url);

//Whole thing in a do-while loop
$originalHtml = $html;
$shouldLoop = true;

//Main array
$value = array();

do {
    $listings = $originalHtml->find('div.searchResult');
    foreach ($listings as $item)
    {
        //Some logic here
    }

    //After the loop we will have details of all the listings on this page -- so get the next page link
    $nxtLink = getNextLink($originalHtml); //Returns string url
    if (!empty($nxtLink))
    {
        //Yay, we have the next link -- load the next link
        print 'Next Url: ' . $nxtLink . '<br>';
        $originalHtml->load_file($nxtLink);
    }
    else
    {
        //No next link -- stop the loop as we have covered all the pages
        $shouldLoop = false;
    }
} while ($shouldLoop);
I have tried encoding the whole URL, and just the query parameters, but got the same result. I also tried creating new instances of simple_html_dom and then loading the file; no luck. Please help.
You need to html_entity_decode those links; I can see that they are getting mangled by simple-html-dom: the href attribute comes back with the ampersand still HTML-encoded as &amp;, so the request that goes out carries amp;page=6 instead of page=6, and the server falls back to the default first page.
$url = 'https://www.autotrader.co.uk/motorhomes/motorhome-dealers/bc-motorhomes-ayr-dpp-10004733?channel=motorhomes';
$html = str_get_html(file_get_contents($url));

while ($a = $html->find('a.next', 0)) {
    $url = html_entity_decode($a->href); // decode &amp; back to & so the page parameter survives
    echo $url . "\n";
    $html = str_get_html(file_get_contents($url));
}
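To see the mangling in isolation, here is a minimal, hypothetical snippet illustrating the diagnosis above (simple-html-dom hands back the href exactly as it appears in the source, where the ampersand is written as &amp;):

$html = str_get_html('<a class="next" href="page.php?channel=motorhomes&amp;page=6">Next</a>');
$href = $html->find('a.next', 0)->href;
echo $href . "\n";                     // page.php?channel=motorhomes&amp;page=6 (mangled)
echo html_entity_decode($href) . "\n"; // page.php?channel=motorhomes&page=6 (usable)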

scraping images from url using php

I am trying to make a page that allows me to grab and save images from another link. Here's what I want to add on my page:
a text box (to enter the URL that I want to get images from), and
a save dialog box to specify the path to save the images to.
But what I am trying to do here is save images only from that URL, and only from inside a specific element.
For example, in my code I say: go to example.com, and from inside the element with class="images", grab all images.
Note: not all images from the page, just the ones from inside the element;
whether the element has 3 images in it or 50 or 100, I don't care.
Here's what I tried, and it worked, using PHP:
<?php
$html = file_get_contents('http://www.tgo-tv.net');
preg_match_all('|<img.*?src=[\'"](.*?)[\'"].*?>|i', $html, $matches);
echo $matches[1][0];
?>
This gets the image name and path, but what I am trying to make is a save dialog box, and the code must save the image directly to that path instead of echoing it out.
Hope you understand.
Edit 2:
It's OK not to have a save dialog box; I just need to specify the save path in the code.
If you want something generic, you can use:
<?php
$the_site  = "http://somesite.com";
$the_tag   = "div"; // tag can be div, span, a, etc.
$the_class = "images";

$html = file_get_contents($the_site);

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//' . $the_tag . '[contains(@class,"' . $the_class . '")]/img') as $item) {
    $img_src = $item->getAttribute('src');
    print $img_src . "\n";
}
Usage:
Change the site and the tag, which can be a div, span, a, etc.; also change the class name.
For example, change the values to:
$the_site  = "https://stackoverflow.com/questions/23674744/what-is-the-equivalent-of-python-any-and-all-functions-in-javascript";
$the_tag   = "div";
$the_class = "gravatar-wrapper-32";
Output:
https://www.gravatar.com/avatar/67d8ca039ee1ffd5c6db0d29aeb4b168?s=32&d=identicon&r=PG
https://www.gravatar.com/avatar/24da669dda96b6f17a802bdb7f6d429f?s=32&d=identicon&r=PG
https://www.gravatar.com/avatar/24780fb6df85a943c7aea0402c843737?s=32&d=identicon&r=PG
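The question ultimately asks to save the images rather than echo them. PHP cannot open a client-side save dialog for a batch download, but since Edit 2 allows the path to be set in code, the loop above can be extended roughly like this (a sketch reusing $xpath, $the_tag and $the_class from the block above; ./images/ is a hypothetical directory that must already exist and be writable):

foreach ($xpath->query('//' . $the_tag . '[contains(@class,"' . $the_class . '")]/img') as $item) {
    $img_src = $item->getAttribute('src');
    if (strpos($img_src, '//') === 0) {
        $img_src = 'http:' . $img_src; // resolve protocol-relative URLs
    }
    $data = file_get_contents($img_src); // download the image bytes
    if ($data !== false) {
        $name = basename(parse_url($img_src, PHP_URL_PATH)); // file name without the query string
        file_put_contents('./images/' . $name, $data); // write the image to disk
    }
}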
Maybe you should try the HTML DOM Parser for PHP. I found this tool recently and, to be honest, it works pretty well. It has jQuery-like selectors, as you can see on the site. I suggest you take a look and try something like:
<?php
require_once("./simple_html_dom.php");

$html = file_get_html("http://www.example.com"); // load the page; file_get_html() is simple-html-dom's helper

foreach ($html->find("div") as $div) // start from the root and loop over the parent tag you want to search in ("div" here searches all divs)
{
    foreach ($div->find("img") as $img) // search for img tags inside each div found
    {
        echo $img->src . "<br>"; // output the img's src attribute (for <img src="www.example.com/cat.png"> you get www.example.com/cat.png)
    }
}
?>
I hope this helps, more or less.

Get and return media url (m3u8) using PHP

I have a website that hosts videos from a client. On the website the files load externally via m3u8 link.
The client would now like to have those videos on a Roku channel.
If I simply use the m3u8 link from the site, it gives an error, because the generated URL is tied to a cookie, so a client must click the link to generate a new code for them.
What I would like, if possible (and I have not seen this covered here), is to scrape the HTML page and just return the link to the Roku via a PHP script on the website.
I know how to get titles and such using pure PHP, but I am having problems returning the m3u8 link.
I do have code to show I am not looking for handouts and actually am trying.
This is what I have used for getting the title name for example.
Note: I would like to know if it is possible to have one PHP script that fills in the HTML page per URL, so I do not have to use a different PHP file for each video with the URL pre-typed in.
<?php
$html = file_get_contents('http://example.com'); //get the html returned from the following url

$movie_doc = new DOMDocument();
libxml_use_internal_errors(TRUE); //disable libxml errors

if (!empty($html)) { //if any html is actually returned
    $movie_doc->loadHTML($html);
    libxml_clear_errors(); //remove errors for yucky html

    $movie_xpath = new DOMXPath($movie_doc);

    //get all the titles
    $movie_row = $movie_xpath->query('//title');

    if ($movie_row->length > 0) {
        foreach ($movie_row as $row) {
            echo $row->nodeValue . "<br/>";
        }
    }
}
?>
There is a simple approach for this, which involves using regex.
In this example let's say the video M3u8 file is located at: http://example.com/theVideoPage
You would point the video URL Source in your XML to your PHP file.
http://thisPhpFileLocation.com
<?php
$html = file_get_contents("http://example.com/theVideoPage");

preg_match_all(
    '/(http.*m3u8)/',
    $html,
    $posts,        // will contain the matched m3u8 links
    PREG_SET_ORDER // formats data into an array of matches
);

foreach ($posts as $post) {
    $link = $post[0];
    header("Location: $link"); // redirect to the m3u8 URL
    break; // only the first match matters; later Location headers would just overwrite it
}
?>
?>
Now, if you want a URL that you can append a page ID to, it could look something like this; for a video page located at http://example.com/theVideoPage, you would request an address such as:
http://thisPhpFileLocation.com?id=theVideoPage
<?php
$id = $_GET['id'];
$html = file_get_contents("http://example.com/" . $id); // note the slash: the id is appended as a path segment

preg_match_all(
    '/(http.*m3u8)/',
    $html,
    $things,       // will contain the matched m3u8 links
    PREG_SET_ORDER // formats data into an array of matches
);

foreach ($things as $thing) {
    $link = $thing[1];

    // clear out the output buffer so the Location header can still be sent
    while (ob_get_status()) {
        ob_end_clean();
    }

    header("Location: $link"); // redirect to the m3u8 URL
    break; // only the first match is needed
}
?>
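One caveat: $id comes straight from the query string, so as written anyone can make the script fetch arbitrary paths on that host. A minimal guard, as a sketch (the slug pattern is an assumption; adjust it to whatever IDs your pages actually use):

<?php
$id = $_GET['id'] ?? '';

// accept only simple slugs such as "theVideoPage"; reject anything else
if (!preg_match('/^[A-Za-z0-9_-]+$/', $id)) {
    http_response_code(400);
    exit('Invalid id');
}

$html = file_get_contents("http://example.com/" . $id);
?>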

How to display image url from website sub pages using php code

I am using the PHP code below to display images from web pages. The code can display image URLs from the main page, but not image URLs from sub-pages.
<?php
include_once('simple_html_dom.php');

$target_url = "http://fffmovieposters.com/";
$html = new simple_html_dom();
$html->load_file($target_url);

foreach ($html->find('img') as $img)
{
    echo $img->src . "<br />";
    echo $img . "<br/>";
}
?>
If by sub-page you mean a page that http://fffmovieposters.com is linking to, then of course the script won't show any of those, since you're never loading those pages.
You basically have to write a spider that not only finds images, but also anchor tags and then repeats the process for those links. Just remember to add some filters so that you don't process pages more than once or start processing the entire internet by following external links.
Pseudo'ish code:

$todo = ['http://fffmovieposters.com'];
$done = [];
$images = [];

while ( ! empty($todo)) {
    $link = array_shift($todo);
    $done[] = $link;

    $html = get html;
    $images += find <img> tags
    $newLinks = find <a> tags

    remove all external links and all links already in $done from $newLinks
    $todo += $newLinks;
}
Or something like that...
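A runnable version of that sketch, using simple_html_dom as in the question (a minimal sketch only: it follows just the absolute links that contain the host name, ignores relative links, and caps the number of pages as a safety valve):

<?php
include_once('simple_html_dom.php');

$host   = 'fffmovieposters.com';
$todo   = ['http://fffmovieposters.com/'];
$done   = [];
$images = [];
$max_pages = 50; // safety cap so the crawl cannot run away

while (!empty($todo) && count($done) < $max_pages) {
    $link = array_shift($todo);
    $done[] = $link;

    $html = file_get_html($link); // returns false if the page cannot be fetched
    if (!$html) {
        continue;
    }

    foreach ($html->find('img') as $img) {
        $images[] = $img->src;
    }

    foreach ($html->find('a') as $a) {
        $href = html_entity_decode($a->href);
        // keep only internal links that we have not visited or queued yet
        if (strpos($href, $host) !== false && !in_array($href, $done) && !in_array($href, $todo)) {
            $todo[] = $href;
        }
    }
}

print_r(array_unique($images));
?>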

Extract image data when pulling in RSS feeds using PHP

The script I am using to pull in the feed titles is:
<?php
function getFeed($feed_url) {
    $content = file_get_contents($feed_url);
    $x = new SimpleXmlElement($content);

    echo "<ul>";
    $i = 0; // counter so we stop after five items
    foreach ($x->channel->item as $entry) {
        echo "<li><a href='$entry->link' title='$entry->title'>" . $entry->title . "</a></li>";
        $i++;
        if ($i == 5) break;
    }
    echo "</ul>";
}
?>
I would like to pull in the image for each title/article and place it next to the title.
The float part is easy; I'm having trouble getting the actual image in. I've tried markup like this: <img src="$entry->image" />, but it didn't work.
How would I go about this, please?
Following up on random's suggestion:
I'm using $item['description'] to represent the description string. Then I match the img tag, output the image tag, and finally remove the image tag from the original $content and output that, so you're left with just the description text.
<?php
$content = $item['description'];
preg_match('/(<img[^>]+>)/i', $content, $matches); // capture the first <img ...> tag
echo $matches[0]; // output the image tag
echo str_replace($matches[0], "", $content); // output the description with the image removed
?>
When displaying the feed contents, variables such as $entry->link and $entry->title work because they are valid, present and required elements in a standard RSS feed item.
To call $entry->image would require the source RSS to use and populate that optional element.
Some feeds may include that data, but most will not.
Alternatively, you can write your own function or method to scan the contents of the description element and find any suitable image to include as you need.
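Putting the two pieces together, here is a minimal sketch of getFeed() that pulls the first image out of each item's description and prints it next to the title (this assumes the feed embeds images in the description element, which, as noted above, not every feed does):

<?php
function getFeed($feed_url) {
    $content = file_get_contents($feed_url);
    $x = new SimpleXmlElement($content);

    echo "<ul>";
    $i = 0;
    foreach ($x->channel->item as $entry) {
        $img = '';
        // grab the first <img ...> tag embedded in the description, if there is one
        if (preg_match('/(<img[^>]+>)/i', (string)$entry->description, $matches)) {
            $img = $matches[0];
        }
        echo "<li>{$img}<a href='{$entry->link}' title='{$entry->title}'>{$entry->title}</a></li>";
        if (++$i == 5) break;
    }
    echo "</ul>";
}
?>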
