I'm building an MTDB database with PHP and need to scrape a specific tag from a URL.
The tag to get from the URL:
vars.disqus = '';
vars.lists = [];
vars.titleId = '35079';
vars.trailersPlayer = 'default';
vars.userId = '907791';
vars.title = {"id":35079,"title":"Family Vacation","trailer":35097.flv,"timing":0.50sec}
I need this part:
"id":35079,"title":"Family Vacation","trailer":35097.flv,"timing":0.50sec
My code:
$html = 'myurl';
libxml_use_internal_errors(TRUE);
$dom = new DOMDocument;
$dom->loadHTMLFile($html);
libxml_clear_errors();
$xp = new DOMXpath($dom);
$nodes = $xp->query('//script[@\'id','trailer','title');
echo $nodes->item(0)->nodeValue;
The "tag" is not HTML; it looks like JavaScript code.
To extract this string, simply use a regex:
preg_match('/title\s*=\s*\{([^}]+)}/', $str, $matches);
var_dump($matches[1]);
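As a minimal sketch of the regex approach above, the following runs the match against a hard-coded copy of the script text from the question. The `$str` sample and the `$titleData` variable are assumptions for illustration; in practice `$str` would hold the page source (for example from `file_get_contents`). The pattern is anchored on `vars.title` so that `vars.titleId` cannot match.

```php
<?php
// Sample page source (assumed); normally this comes from the fetched page.
$str = <<<'HTML'
<script>
vars.titleId = '35079';
vars.title = {"id":35079,"title":"Family Vacation","trailer":35097.flv,"timing":0.50sec}
</script>
HTML;

// Capture everything between the braces of the "vars.title = {...}" assignment.
if (preg_match('/vars\.title\s*=\s*\{([^}]+)\}/', $str, $matches)) {
    $titleData = $matches[1];
    echo $titleData, PHP_EOL;
} else {
    // preg_match returned 0 (no match) or false (error).
    $titleData = null;
    echo "vars.title not found", PHP_EOL;
}
```

Checking the return value of `preg_match` before reading `$matches[1]` avoids an undefined-index notice when the page layout changes.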
Related
I am trying to get a description of a site but without the use of the meta tags. Basically what I am trying to get is the first couple of sentences of a site.
So far I have this, but I do not know how to get the content out of the div:
$checkLinkOnPage = '{sitehere}';
$html = file_get_contents($checkLinkOnPage);
$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
// find the element whose href value you want by XPath
$nodes = $xpath->query('//*');
$approvedLinks = array();
foreach($nodes as $href) {
//Check all links see if they are valid.
$url = $href->getAttribute('html');
if($href->tagName == 'div'){
//Display first div content here
}
}
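One possible way to get at the first div's text, sketched under the assumption that the page has already been fetched into a string: query for `//div` and read `textContent` from the first match. The `$html` sample and the `$description` variable are made up for illustration.

```php
<?php
// Sample markup (assumed); normally this is the result of file_get_contents().
$html = '<html><body><p>Intro</p><div>  First sentence. Second sentence.  </div><div>Other</div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
// XPath results come back in document order, so item(0) is the first div.
$firstDiv = $xpath->query('//div')->item(0);
$description = $firstDiv !== null ? trim($firstDiv->textContent) : '';

echo $description, PHP_EOL;
```

On real pages the first div is often a wrapper or navigation block, so a more specific XPath expression is probably needed.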
All I want to do is save the first div with attribute role="main" as a string from an external URL using PHP.
So far I have this:
$doc = new DOMDocument();
@$doc->loadHTMLFile("http://example.com/");
$xpath = new DOMXPath($doc);
$elements = $xpath->query('//div[@role="main"]');
$str = "";
if ($elements->length > 0) {
$str = $elements->item(0)->textContent;
}
echo htmlentities($str);
But unfortunately the $str does not seem to be displaying the HTML tags. Just the text.
You can get the HTML via the saveHTML() method.
$str = $doc->saveHTML($elements->item(0));
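For reference, here is a self-contained sketch of that fix applied to a small inline document instead of a remote URL (the `$html` sample is an assumption). Passing a node to `saveHTML()` requires PHP 5.3.6 or later.

```php
<?php
// Sample markup (assumed) standing in for the remote page.
$html = '<html><body><div role="main"><p>Hello <b>world</b></p></div></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$elements = $xpath->query('//div[@role="main"]');

$str = '';
if ($elements->length > 0) {
    // saveHTML($node) serializes the node itself plus all of its children.
    $str = $doc->saveHTML($elements->item(0));
}

// Escape for display so the tags are visible in a browser.
echo htmlentities($str), PHP_EOL;
```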
Exactly as described in the title. Currently my code is:
<?php
$url = "remotesite.com/page1.html";
$html = file_get_contents($url);
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->loadHTML($html); // load HTML you can add $html
$elements = $doc->getElementsByTagName('div');
?>
My coding skills are very basic, so at this point I am lost and don't know how to display only the div that has id="mydiv".
If you have PHP 5.3.6 or higher you can do the following:
$url = "remotesite.com/page1.html";
$html = file_get_contents($url);
$doc = new DOMDocument(); // create DOMDocument
libxml_use_internal_errors(true);
$doc->loadHTML($html); // load HTML you can add $html
$testElement = $doc->getElementById('divIDName');
echo $doc->saveHTML($testElement);
http://php.net/manual/en/domdocument.getelementbyid.php
If you have a lower version, I believe you would need to copy the DOM node, once you have found it with getElementById, into a new DOMDocument object:
$elementDoc = new DOMDocument();
$cloned = $testElement->cloneNode(TRUE);
$elementDoc->appendChild($elementDoc->importNode($cloned,TRUE));
echo $elementDoc->saveHTML();
I have this code, which extracts all links from a website. How do I edit it so that it only extracts links that end in .mp3?
Here is the code:
preg_match_all("/\<a.+?href=(\"|')(?!javascript:|#)(.+?)(\"|')/i", $html, $matches);
Update:
A nice solution would be to use DOM together with XPath, as @zerkms mentioned in the comments:
$doc = new DOMDocument();
$doc->loadHTML($yourHtml);
$xpath = new DOMXPath($doc);
// XPath 1.0 (which DOMXPath implements) has no ends-with(), so compare the last four characters of href instead
$links = $xpath->query('//a[substring(@href, string-length(@href) - 3) = ".mp3"]/@href');
Original Answer:
I would use DOM for this:
$doc = new DOMDocument();
$doc->loadHTML($yourHtml);
$links = array();
foreach($doc->getElementsByTagName('a') as $elem) {
if($elem->hasAttribute('href')
&& preg_match('/.*\.mp3$/i', $elem->getAttribute('href'))) {
$links []= $elem->getAttribute('href');
}
}
var_dump($links);
I would prefer XPath, which is meant for querying XML/XHTML:
$DOM = new DOMDocument();
@$DOM->loadHTML($html); // use @ to suppress warnings from invalid HTML
$XPath = new DOMXPath($DOM);
$links = array();
$link_nodes = $XPath->query('//a[contains(@href, ".mp3")]');
foreach($link_nodes as $link_node) {
$source = $link_node->getAttribute('href');
// do some extra work to make sure .mp3 is at the end of the string
$links[] = $source;
}
XPath 2.0 has an ends-with() function that could replace contains(), but PHP's DOMXPath is built on libxml2, which implements only XPath 1.0. Otherwise, you might want to add an extra conditional in the loop to make sure the .mp3 is at the end of the string. It may not be necessary though.
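The extra conditional mentioned above could look roughly like this: `contains()` narrows the candidates in XPath, and a PHP-side suffix check keeps only hrefs that really end in `.mp3`. The sample `$html` is an assumption for illustration, and the case-insensitive comparison is a design choice rather than something the question asked for.

```php
<?php
// Sample markup (assumed): two real .mp3 links and two decoys.
$html = '<a href="song.mp3">A</a><a href="song.mp3.txt">B</a>'
      . '<a href="page.html">C</a><a href="b/TRACK.MP3">D</a>';

$DOM = new DOMDocument();
libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$DOM->loadHTML($html);
libxml_clear_errors();

$XPath = new DOMXPath($DOM);
$links = array();

// contains() is case-sensitive, so match both spellings used in the sample.
$link_nodes = $XPath->query('//a[contains(@href, ".mp3") or contains(@href, ".MP3")]');
foreach ($link_nodes as $link_node) {
    $source = $link_node->getAttribute('href');
    // Keep the link only if its last four characters are ".mp3" (any case).
    if (strtolower(substr($source, -4)) === '.mp3') {
        $links[] = $source;
    }
}

print_r($links);
```

Links with a query string after the extension (such as `song.mp3?x=1`) would be rejected by this check; parsing the path with `parse_url()` first is one way to handle them.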
I am having trouble decoding entities in the title of this YouTube video:
http://www.youtube.com/watch?v=p7NMsywVQhY
Here is my code:
$url = 'http://www.youtube.com/watch?v=p7NMsywVQhY';
$html = @file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;
//decode the '' in the title
$title = html_entity_decode($title,ENT_QUOTES,'UTF-8'); //does not seem to have any effect
//decode the utf data
$title = utf8_decode($title);
$title comes out fine, except that it contains question marks in place of some characters from the original title.
Thanks.
I don't know whether PHP provides a function for that, but you can use preg_replace_callback like this:
$string = preg_replace_callback('/&#x([0-9a-f]+);/i', function ($m) { return chr(hexdec($m[1])); }, $string); // chr() only handles code points below 256
Try this to force correct detection of the charset:
$doc = new DOMDocument();
@$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
$nodes = $doc->getElementsByTagName('title');
$title = $nodes->item(0)->nodeValue;
echo $title;