Does anybody have any idea how they do it? I currently use OffLiberty.com to parse Mixcloud links and get the raw MP3 URL, which I use in a custom HTML5 player for iOS compatibility. I was wondering if anyone knows exactly how their process works, so I could build something similar and cut out the middleman: my end users wouldn't have to go to an external site to get a link to the MP3 for the mix they want to post. Just a thought really; it's not terribly important if it can't be done, but it would be a nice touch :)
Note that I'm against content scraping, and you should ask the website for permission to scrape their MP3 URLs. Otherwise, if I were them, I'd block you right now, ad vitam æternam.
Anyway, you can parse the page's HTML using DOMDocument.
For example:
<?php
// just so you don't see parse errors
$internal_errors = libxml_use_internal_errors(true);
// initialize the document
$doc = new DomDocument();
// load a page
$doc->loadHTMLFile('http://www.mixcloud.com/LaidBackRadio/le-motel-on-the-road/');
// initialize XPATH for the document
$xpath = new DomXPath($doc);
// span with "data-preview-url" seems to contain MP3 url
// we request them inside a DomNodeList http://www.php.net/manual/en/class.domnodelist.php
$mp3 = $xpath->query('//span[@data-preview-url]');
foreach($mp3 as $m){
// we print the attribute value
echo $m->attributes->getNamedItem('data-preview-url')->nodeValue . '<br/>';
}
libxml_use_internal_errors($internal_errors);
I would like to get the content of an iframe, selecting it by the id or class of that iframe.
$links = "https://www.amazon.co.uk/Book-Secret-Wisdom-Prophetic-Evolution/dp/599054314X/ref=pd_sbs_14_2/257-3608675-7951114?_encoding=UTF8&pd_rd_i=599054314X&pd_rd_r=f103ccc1-7985-11e9-987c-178a1a538946&pd_rd_w=E5xsZ&pd_rd_wg=206kw&pf_rd_p=18edf98b-139a-41ee-bb40-d725dd59d1d3&pf_rd_r=MV7NZ41V278ECZM1135G&psc=1&refRID=MV7NZ41V278ECZM1135G";
$res = @file_get_contents($links);
$dom = new DomDocument();
@$dom->loadHTML($res);
$xpath = new DOMXpath($dom);
I would like to get content by id:
$dom->getElementById('iframeContent');
However, it always returns a page, not the content of that iframe.
Has anyone run into this problem? Please help.
Iframes don't have content (well, they might, but it is alternative content for when iframes are not supported by the client).
They have a src attribute containing a URL pointing at an external document.
You need to read the src attribute value, resolve it to an absolute URL (if it isn't one already), then make an HTTP request to it, parse that, and extract the data from it.
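The steps above could be sketched roughly as follows. This is a minimal illustration rather than a complete solution: the HTML snippet, the page URL and the helper name resolveUrl() are made up for the example, and the resolver only handles the common cases (absolute, protocol-relative, root-relative and simple relative paths).

```php
<?php
// Hypothetical helper: resolve $src against the URL of the page it came from.
// It does not normalise "../" segments.
function resolveUrl(string $base, string $src): string
{
    // Already absolute (has a scheme)
    if (parse_url($src, PHP_URL_SCHEME) !== null) {
        return $src;
    }
    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $host   = $parts['host'] ?? '';
    $origin = $scheme . '://' . $host . (isset($parts['port']) ? ':' . $parts['port'] : '');
    // Protocol-relative: //cdn.example.com/x
    if (strpos($src, '//') === 0) {
        return $scheme . ':' . $src;
    }
    // Root-relative: /x/y
    if (strpos($src, '/') === 0) {
        return $origin . $src;
    }
    // Relative to the directory of the base path
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $src;
}

// Stand-in for the downloaded page (assumption: the iframe has id "iframeContent")
$html    = '<html><body><iframe id="iframeContent" src="/frames/content.html"></iframe></body></html>';
$pageUrl = 'https://example.com/books/page.html';

$dom      = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select the iframe by id; a class could be matched with contains(@class, "...")
$xpath    = new DOMXPath($dom);
$iframe   = $xpath->query('//iframe[@id="iframeContent"]')->item(0);
$frameUrl = null;
if ($iframe !== null && $iframe->hasAttribute('src')) {
    $frameUrl = resolveUrl($pageUrl, $iframe->getAttribute('src'));
}
echo $frameUrl, PHP_EOL;
```

The framed document would then be fetched and parsed in a second step, for example by passing $frameUrl to loadHTMLFile() on a fresh DOMDocument.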
I am trying to find a way of displaying the text from a website on a different site.
I own both sites, and they both run on WordPress (I know this may make it more difficult). I just need a page that mirrors the text from the original page, so that when the original page is updated, the mirror updates too.
I have some experience with PHP and HTML, and I would rather not use JS.
I have been looking at some posts that suggest cURL and file_get_contents but have had no luck editing it to work with my sites.
Is this even possible?
Look forward to your answers!
Both cURL and file_get_contents() are fine for getting the full HTML output from a URL. For example, with file_get_contents() you can do it like this:
<?php
$content = file_get_contents('http://elssolutions.co.uk/about-els');
echo $content;
However, if you need just a portion of the page, DOMDocument and DOMXPath are far better options, as the latter also lets you query the DOM. Below is a working example.
<?php
// URL of the target document
$url = 'http://elssolutions.co.uk/about-els';
// The `id` of the node in the target document to get the contents of
$id = 'comp-iudvhnkb';
$dom = new DOMDocument();
// Silence `DOMDocument` errors/warnings on html5-tags
libxml_use_internal_errors(true);
// Loading content from external url
$dom->loadHTMLFile($url);
libxml_clear_errors();
$xpath = new DOMXPath($dom);
// Querying DOM for target `id`
$xpathResultset = $xpath->query("//*[@id='$id']")->item(0);
// Getting plain html
$content = $dom->saveHTML($xpathResultset);
echo $content;
How to load a URL obtained from $_SERVER['REQUEST_URI'] into domDocument?
I am trying to load a dynamic webpage into DOMDocument to be parsed for certain words. Ultimately I want to create a glossary for my site (Tiki Wiki CMS). I started very simple and right now I am only trying to load a page and parse the text for testing purposes.
I am new to DOMDocument. After reading several articles on this site and in the PHP Manual, I know that I have to load an HTML page with loadHTMLFile and then parse it with getElementById or getElementsByTagName in order to do stuff with it. It works fine for static pages, but my main problem is that I cannot hard-code a static URL into loadHTMLFile, because parsing should be performed when the site is uploaded by the user.
Here's the code that DID work:
$url = 'http://mysite.org/bbk/tiki-index.php?page=pagetext';
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$a = $dom->getElementsByTagName('a');
foreach ($a as $link) {
echo $link->nodeValue;
}
So, I thought I could use $_SERVER['REQUEST_URI'] for the job, but it did not work.
This did NOT work (no error message):
$url = $_SERVER['REQUEST_URI'];
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$a = $dom->getElementsByTagName('a');
foreach ($a as $link) {
echo $link->nodeValue;
}
After checking what the $url output was, I decided to add http://mysite.org to it to make it identical to the url that worked. However, no luck either and this time I got an internal server error.
This did NOT work either (Internal Server Error):
$url = 'http://mysite.org' . $_SERVER['REQUEST_URI'];
$dom = new DOMDocument;
$dom->loadHTMLFile($url);
$a = $dom->getElementsByTagName('a');
foreach ($a as $link) {
echo $link->nodeValue;
}
I think I am missing something substantial here and I thought it might just not be possible to use DOMDocument in this way, so I was searching the web for help again (if it is possible to use $_SERVER['REQUEST_URI'] in combination with DOMdocument at all), but I didn't find an answer. So I hope anybody here can help. Any suggestions including third party parsers etc. would be helpful, except anything that requires parsing with regex. Tiki Wiki CMS already has a glossary option done with regex, but it is very buggy.
Thanks.
UPDATE
I haven't found an answer to the problem, but I think I have an idea on where my mistake was. I was expecting $_SERVER['REQUEST_URI'] to run on a dynamic page that was not completely built yet. I ran the script on the main setup page, so I guess the html was not rendered yet, when I tried to point $_SERVER['REQUEST_URI'] to it. When I noticed that this might be the problem, I abandoned the idea of parsing the document with DomDocument and used a javascript solution that can be loaded after the document is ready.
I can think of two things that you can do (probably won't solve your problem directly, but will help you greatly with solving it):
$_SERVER['REQUEST_URI'] doesn't contain what you think it does. Try echoing or var_dumping it, and see if it matches your expectations.
Enable error reporting. The reason you are seeing a generic 500 error page is that error reporting is disabled. Enable it using error_reporting().
Also note that DOMDocument only parses HTML. If you have DOM nodes that are generated and added to the page by a client-side language, or CSS pseudo-elements, they won't show up in the parsed document unless you deploy a JS/CSS parser as well (which is not trivial).
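To make the two suggestions concrete, here is a small sketch. $_SERVER['REQUEST_URI'] holds only the path and query string, so an absolute URL has to be assembled from other $_SERVER entries. The helper name currentUrl() is made up for this example, and trusting HTTP_HOST is an assumption that is only reasonable on a server you control.

```php
<?php
// Show all errors while debugging (switch display_errors off in production).
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical helper: build the absolute URL of the current request
// from an array shaped like $_SERVER.
function currentUrl(array $server): string
{
    $https  = !empty($server['HTTPS']) && $server['HTTPS'] !== 'off';
    $scheme = $https ? 'https' : 'http';
    $host   = $server['HTTP_HOST'] ?? 'localhost';
    $uri    = $server['REQUEST_URI'] ?? '/';
    return $scheme . '://' . $host . $uri;
}

// Under a web server you would pass $_SERVER; a fake array is used here
// so the snippet also runs from the command line.
$fake = [
    'HTTP_HOST'   => 'mysite.org',
    'REQUEST_URI' => '/bbk/tiki-index.php?page=pagetext',
];
var_dump($fake['REQUEST_URI']); // path and query only, no scheme or host
echo currentUrl($fake), PHP_EOL;
```

Be aware that a script which loads its own URL with loadHTMLFile() triggers the same script again on every request, which can recurse until the server gives up. That may be one explanation for the 500 error in the question, though nothing in the thread confirms it.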
I am grabbing the contents of Google with PHP. How can I search $page for the element with the id "lga" and echo out another of its properties? Say #lga is an image, how would I echo out its source?
No, I'm not going to do this with Google; Google is strictly an example and testing page.
<body><img id="lga" src="snail.png" /></body>
I want to find the element with the id "lga" and echo out its source, so for the above code I would want to echo out "snail.png".
This is what I'm using and how I'm storing what I found:
<?php
$url = "https://www.google.com/";
$page = file($url);
foreach($page as $part){
}
?>
You can achieve this using the built-in DOMDocument class. This class allows you to work with HTML in a structured manner rather than parsing plain text yourself, and it's quite versatile:
$dom = new DOMDocument();
$dom->loadHTML($html);
To get the src attribute of the element with the id lga, you could simply use:
$imageSrc = $dom->getElementById('lga')->getAttribute('src');
Note that DOMDocument::loadHTML will generate warnings when it encounters invalid HTML. The method's doc page has a few notes on how to suppress these warnings.
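One common way to keep those warnings quiet is libxml_use_internal_errors(), sketched below on the snippet from the question. The null check is an addition for illustration: getElementById() returns null when no element has the requested id, and calling getAttribute() on null would be a fatal error.

```php
<?php
// The sample markup from the question, standing in for the downloaded page
$html = '<body><img id="lga" src="snail.png" /></body>';

$dom = new DOMDocument();
// Collect libxml errors internally instead of emitting PHP warnings
$previous = libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
// Restore whatever setting was active before
libxml_use_internal_errors($previous);

$image = $dom->getElementById('lga');
// Guard against a missing element before reading the attribute
$imageSrc = $image !== null ? $image->getAttribute('src') : null;
echo $imageSrc, PHP_EOL;
```

If the page was read with file() as in the question, the resulting array of lines would first need to be joined into one string (for example with implode()) before being passed to loadHTML().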
Also, if you have control over the website you are parsing the HTML from, it might be more appropriate to have a dedicated script to serve the information you are after. Unless you need to parse exactly what's on a page as it is served, extracting data from HTML like this could be quite wasteful.
I need to store the XML that I get from Google Analytics. The export is an XML file. I need to create a PHP script that reads the XML file from Google Analytics and stores it on my server under a user-defined name. I tried it like this:
<?php
$dom = new DOMDocument();
$dom->load('https://www.google.com/analytics/reporting/export?fmt=1&id=346044461&pdr=20100611-20100711&cmp=average&rpt=DashboardReport');
$dom->save('books3.xml');
?>
Can you help me?
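DOMDocument::load() reads the XML into $dom itself, so there is no return value you need to keep, and save() then writes that document out. If your version stores nothing useful, either the download or the parse is probably failing, so it helps to separate the two steps.
You'd need something more along the lines of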
<?php
$remoteUri = 'https://www.google.com/analytics/reporting/export?...';
$doc = new DOMDocument();
$doc->loadXML(file_get_contents($remoteUri));
$xml = $doc->saveXML($doc->documentElement);
file_put_contents($yourLocalFilePath, $xml);
or if you just want a completely verbatim copy locally:
<?php
$remoteUri = 'https://www.google.com/analytics/reporting/export?...'; // same elided URL as above
file_put_contents($yourLocalFilePath, file_get_contents($remoteUri));
The second, simpler version doesn't attempt to parse any XML and will therefore not have any clue if something is wrong with the received document.
Depending on your server, you might have to resort to more complex methods of getting the file, for example if URL wrappers for fopen aren't enabled or if your Google endpoint wants to use cookies.
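As a rough sketch of such a fallback, the following tries file_get_contents() first and switches to the cURL extension when URL wrappers are off or the request fails, then saves the payload only if it parses as XML. The helper names fetchUrl() and saveXmlIfValid() are made up for this example, and the cURL options are one reasonable choice rather than anything Google requires. An endpoint that needs a logged-in session would still need real authentication on top of this.

```php
<?php
// Hypothetical helper: fetch a URL, falling back to cURL when needed.
function fetchUrl(string $uri): string
{
    if (ini_get('allow_url_fopen')) {
        $body = @file_get_contents($uri);
        if ($body !== false) {
            return $body;
        }
    }
    if (!function_exists('curl_init')) {
        throw new RuntimeException("Cannot fetch $uri: no URL wrappers and no cURL extension");
    }
    $ch = curl_init($uri);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_COOKIEFILE     => '',   // enable in-memory cookie handling
        CURLOPT_TIMEOUT        => 30,
    ]);
    $body  = curl_exec($ch);
    $error = curl_error($ch);
    if ($body === false) {
        throw new RuntimeException("cURL failed for $uri: $error");
    }
    return $body;
}

// Hypothetical helper: write $xml to $path only if it is well-formed XML.
function saveXmlIfValid(string $xml, string $path): bool
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // loadXML() rejects an empty string, so check for that first
    $ok = $xml !== '' && $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok && file_put_contents($path, $xml) !== false;
}

// Usage, with the placeholders from the answer above:
// saveXmlIfValid(fetchUrl($remoteUri), $yourLocalFilePath);
```

Checking the payload before saving matters here because an export URL that expects a login will often answer with an HTML sign-in page instead of XML, and a verbatim copy would store that page without complaint.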