I have this keyword: yt-lookup-title.
I want the next 17 letters after this in a variable. So I would have:
"<a href="/watch?v=HnlC81tWoY8"
How can I archive that I get it from all lines with this Keyword?
Keywords
If you want to get the href content, you can rely on domdocument.
If I'm not mistaken, all the links (<a>) have this class yt-uix-tile-link. So you can do the following:
$dom = new DOMDocument;
// $html is a string containing the html of the page you're parsing
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
links = array ();
$nodes = $xpath->query('//a[#class="yt-uix-tile-link"]/#href');
foreach ($nodes as $node) {
$links [] = $node->nodeValue;
}
var_dump ($links);
Hope that helps
Related
I'm calling some wikipedia content two different way:
$html = file_get_contents('https://en.wikipedia.org/wiki/Sans-serif');
The first one is to call the first paragraph
$dom = new DomDocument();
#$dom->loadHTML($html);
$p = $dom->getElementsByTagName('p')->item(0)->nodeValue;
echo $p;
The second one is to call the first paragraph after a specific $id
$dom = new DOMDocument();
#$dom->loadHTML($html);
$p=$dom->getElementById('$id')->getElementsByTagName('p')->item(0);
echo $p->nodeValue;
I'm looking for a third way to call all the first part.
So I was thinking about calling all the <p> before the id or class "toc" which is the id/class of the table of content.
Any idea how to do that?
If you're just looking for the intro in plain text, you can simply use Wikipedia's API:
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=Sans-serif
If you want HTML formatting as well (excluding inner images and the likes):
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&titles=Sans-serif
You could use DOMDocument and DOMXPath with for example an xpath expression like:
//div[#id="toc"]/preceding-sibling::p
$doc = new DOMDocument();
$doc->load("https://en.wikipedia.org/wiki/Sans-serif");
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[#id="toc"]/preceding-sibling::p');
foreach ($nodes as $node) {
echo $node->nodeValue;
}
That would give you the content of the paragraphs preceding the div with id = toc.
I found a way to remove all tag attributes from a html string using php:
$html_string = "<div class='myClass'><b>This</b> is an <span style='margin:20px'>example</span><img src='ima.jpg' /></div>";
$output = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i",'<$1$2>', $html_string);
echo $output;
//<div><b>This</b> is an <span>example</span><img/></div>
But I would like to keep certain tags such as src and href. I have almost no experience with regular expresions, so any help would be really appreciated.
[maybe] Relevant update: This is parto of a process of 'cleaning' posts on a database. I am iterating through all the posts, getting the html, cleaning it, and updating it on the corresponding table.
You usually should not parse HTML using regular expressions. Instead, in PHP you should call DOMDocument::loadHTML. You can then recurse through the elements in the document and call removeAttribute. Regular expressions for HTML tags are notoriously tricky.
REF: http://php.net/manual/en/domdocument.loadhtml.php
Examples: http://coursesweb.net/php-mysql/html-attributes-php
Here's a solution for you. It will iterate over all tags in the DOM, and remove attributes which are not src or href.
$html_string = "<div class=\"myClass\"><b>This</b> is an <span style=\"margin:20px\">example</span><img src=\"ima.jpg\" /></div>";
$dom = new DOMDocument; // init new DOMDocument
$dom->loadHTML($html_string); // load the HTML
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//#*');
foreach ($nodes as $node) {
if($node->nodeName != "src" && $node->nodeName != "href") {
$node->parentNode->removeAttribute($node->nodeName);
}
}
echo $dom->saveHTML(); // output cleaned HTML
Here is another solution using xPath to filter on attribute names instead:
$dom = new DOMDocument; // init new DOMDocument
$dom->loadHTML($html_string); // load the HTML
$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//#*[local-name() != 'src' and local-name() != 'href']");
foreach ($nodes as $node) {
$node->parentNode->removeAttribute($node->nodeName);
}
echo $dom->saveHTML(); // output cleaned HTML
Tip: Set the DOM parser to UTF-8 if you are using extended character like this:
$dom->loadHTML(mb_convert_encoding($html_string, 'HTML-ENTITIES', 'UTF-8'));
I am trying to extract a complete table including the HTML tags, with XPath, that I can store in a variable, do a bit of string replacement on, then echo directly to the screen. I have found numerous posts on getting the text out of the table but I want to retain the HTML formatting since I am just going to display it (after minor modification).
At present I am extracting the table using string functions stristr, substr etc. but I would prefer to use XPath.
I can display the contents of the table with the following but it just displays the table TD fields with no formatting. It also does not store it in a variable that I can manipulate.
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$arr = $xpath->query('//table');
foreach($arr as $el) {
echo $el->textContent;
I tried this but got no output:
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$arr = $xpath->query('//table');
echo $arr->saveHTML();
Use DOMNode::C14N():
foreach($arr as $el) {
echo $el->C14N();
I am using domDocument hoping to parse this little html code. I am looking for a specific span tag with a specific id.
<span id="CPHCenter_lblOperandName">Hello world</span>
My code:
$dom = new domDocument;
#$dom->loadHTML($html); // the # is to silence errors and misconfigures of HTML
$dom->preserveWhiteSpace = false;
$nodes = $dom->getElementsByTagName('//span[#id="CPHCenter_lblOperandName"');
foreach($nodes as $node){
echo $node->nodeValue;
}
But For some reason I think something is wrong with either the code or the html (how can I tell?):
When I count nodes with echo count($nodes); the result is always 1
I get nothing outputted in the nodes loop
How can I learn the syntax of these complex queries?
What did I do wrong?
You can use simple getElementById:
$dom->getElementById('CPHCenter_lblOperandName')->nodeValue
or in selector way:
$selector = new DOMXPath($dom);
$list = $selector->query('/html/body//span[#id="CPHCenter_lblOperandName"]');
echo($list->item(0)->nodeValue);
//or
foreach($list as $span) {
$text = $span->nodeValue;
}
Your four part question gets an answer in three parts:
getElementsByTagName does not take an XPath expression, you need to give it a tag name;
Nothing is output because no tag would ever match the tagname you provided (see #1);
It looks like what you want is XPath, which means you need to create an XPath object - see the PHP docs for more;
Also, a better method of controlling the libxml errors is to use libxml_use_internal_errors(true) (rather than the '#' operator, which will also hide other, more legitimate errors). That would leave you with code that looks something like this:
<?php
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach($xpath->query("//span[#id='CPHCenter_lblOperandName']") as $node) {
echo $node->textContent;
}
i have the following code. and i want to retrieve only the a href titles , that have /movie/ within url.
function get_a_contentmovies(){
$h1count = preg_match_all("/(<a.*>)(\w.*)(<.*>)/ismU",$this->DataFromSite,$patterns);
return $patterns[2];
}
You can use DOMXpath like this:
$dom = new DomDocument();
$dom->loadHTML($string);
$xpath = new DOMXpath($dom);
$elements = $xpath->query("//a[contains(#href, '/movie/')]");
foreach($elements as $el) {
var_dump($el->getAttribute('title'));
}
Using Regex to parse (x)HTML is a bad idea. You should use a DOM parser such as DomDocument. Have a look at this topic.