search for element name using PHP simple HTML dom parser - php

I'm hoping someone can help me. I'm using PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/manual.htm) successfully, but I am now trying to find elements based on a certain name. For example, in the fetched HTML, there might be tags such as:
<p class="mattFacer">Matt Facer</p>
<p class="mattJones">Matt Jones</p>
<p class="daveSmith">DaveS Smith</p>
What I need to do is read in this HTML and capture any HTML elements whose class begins with the word "matt".
I've tried
$html = str_get_html("http://www.testsite.com");
foreach($html->find('matt*') as $element) {
echo $element;
}
but this doesn't work. It returns nothing.
Is it possible to do this? I basically want to search for any HTML element which contains the word "matt". It could be a span, div or p.
I'm at a dead end here!

$html = file_get_html("http://www.testsite.com"); // file_get_html() loads a URL; str_get_html() expects an HTML string
foreach($html->find('[class*=matt]') as $element) {
echo $element;
}
Let's try that

Maybe this?
foreach(array_merge($html->find('[class*=matt]'),$html->find('[id*=matt]')) as $element) {
echo $element;
}
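Note that [class*=matt] matches "matt" anywhere in the class. If you really need "begins with", the Simple HTML DOM manual also lists a [attribute^=value] selector. As a library-free cross-check, here is a minimal sketch that uses PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM. The inline $markup string is an assumption standing in for the fetched page, and starts-with() is case-sensitive.

```php
<?php
// Sample markup from the question; in practice this would be the fetched page.
$markup = '<p class="mattFacer">Matt Facer</p>'
        . '<p class="mattJones">Matt Jones</p>'
        . '<p class="daveSmith">DaveS Smith</p>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Any element (p, div, span, ...) whose class attribute starts with "matt".
$matches = [];
foreach ($xpath->query('//*[starts-with(@class, "matt")]') as $node) {
    $matches[] = $node->textContent;
}

echo implode("\n", $matches), "\n";
```

To cover id attributes as well, the XPath predicate could be extended with "or starts-with(@id, "matt")".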

Related

How to format plaintext in PHP Simple HTML DOM Parser?

I'm trying to extract the content of a webpage in plain text - without the html tags. Here's some sample code:
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html($url);
$result['body'] = $dom->find('body', 0)->plaintext;
The problem is that what I get in $result['body'] is very messy. The HTML was removed, sure, but sentences often merge into others since there are no spaces or periods to delimit where the text from one HTML tag ended, and text from the following tag begins.
An example:
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
Results in:
"Headerthis is a paragraphthis is another paragraph"
Desired result:
"Header. this is a paragraph. this is another paragraph"
Is there any way to format the result from plaintext or perhaps apply extra manipulation on the innertext before using plaintext to achieve clear delimiters for sentences?
EDIT:
I'm thinking of doing something like this:
foreach($dom->find('div') as $element) {
$text = $element->plaintext;
$result['body'] .= $text.'. ';
}
but there's a problem when the divs are nested, since it would add the content of the parent, which includes text from all children, and then add the content of the children, effectively duplicating the text. This can be fixed simply by checking if there is a </div> inside the $text though.
Perhaps I should try callbacks.
Possibly something like this? Tested.
<?php
require_once 'vendor/autoload.php';
$dom = \Sunra\PhpSimple\HtmlDomParser::file_get_html("index.html");
$result['body'] = implode('. ', array_map(function($element) {
return $element->plaintext;
}, $dom->find('div')));
echo $result['body'];
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
Try this code:
$result = array();
foreach($html->find('div') as $e){
$result[] = $e->plaintext;
}
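To avoid the nested-div duplication mentioned in the question, another option is to walk the text nodes rather than the divs, since each piece of text then shows up exactly once. This is only a sketch using PHP's built-in DOMDocument and DOMXPath (not Simple HTML DOM), with the sample markup from the question inlined as $markup.

```php
<?php
$markup = <<<HTML
<body>
<div class="H2">Header</div>
<div class="P">this is a paragraph</div>
<div class="P">this is another paragraph</div>
</body>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Visit each non-blank text node once, so nested divs are not duplicated.
$parts = [];
foreach ($xpath->query('//body//text()[normalize-space()]') as $textNode) {
    $parts[] = trim($textNode->nodeValue);
}

$body = implode('. ', $parts);
echo $body, "\n";
```

One caveat: inline tags such as <b> inside a paragraph also produce separate text nodes, so a sentence containing them would be split as well.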

parsing html page using php to find out text on which link is assigned

Say I have HTML code like this:
$html = "This is some stuff right here. OH MY GOSH";
I am trying to get the href values and also the anchor text of each link, i.e. the "check this out" text. I am able to get the href value with the following code:
foreach($displaybody->find('a') as $element) {
echo $element;
}
This works for me, but how do I get the "check this out" text? I did search but could not find an answer. Thanks in advance.
My actual HTML looks like this:
» Download MP4 « - <b>144p (Video Only)</b> - <span> 19.1</span> MB<br />
For this HTML, the code above returns "download mp4", and I want "download mp4 144p (video only) 19.1 mb". How do I do that?
If you are using Simple HTML DOM, then ->innertext works fine on the anchor elements that you have found:
include 'simple_html_dom.php';
$html = "This is some stuff right here. OH MY GOSH";
$displaybody = str_get_html($html);
foreach($displaybody->find('a ') as $element) {
echo $element->innertext . '<br/>';
}
If you were referring to PHP's DOMDocument, then find() is not the function you need. To target each anchor element, use ->getElementsByTagName(), then read ->nodeValue on each selected element:
$html = "This is some stuff right here. OH MY GOSH";
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach($dom->getElementsByTagName('a') as $element) {
echo $element->nodeValue . '<br/>';
}
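If you need the href and the link text together, something along these lines might help. It is a sketch with DOMDocument only, and the markup, including the junk.html and blah.html href values, is made up for the example since the real page is not shown.

```php
<?php
// Hypothetical markup: the href values are invented for this example.
$markup = 'This is some <a href="junk.html">check this out</a> stuff right here. '
        . '<a href="blah.html">OH MY GOSH</a>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();

// Map each link target to its visible text.
$links = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $links[$anchor->getAttribute('href')] = trim($anchor->textContent);
}

foreach ($links as $href => $text) {
    echo $href, ' => ', $text, "\n";
}
```

textContent also flattens any nested tags inside the anchor, which is usually what you want for display text.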

php dom parser return parent and child

I think this is a simple question but I can't sort it out. I am trying to get all heading tags with the Simple PHP DOM parser, but my code works only one way. For example:
$heading['h2']=$html->find('h2 a');//works fine
I have found some sites wrap the h2 within the a tag like this
<a href='#'><h2> my heading</h2></a>
The problem is trying to get both tags so I can display the link with it. So when I do this
$heading['h2']=$html->find('a h2');
I get the h2 fine, but the link tag is not wrapped around it. That makes sense, since the selector finds all h2 tags that are descendants of a. But how do I get the entire parent tag? What I want it to return is:
<a href='#'><h2>My Headings</h2></a>
Then I can just print the output with:
echo $heading['h2']; // and the link will be there
If the <a href="[..]"> is just the outer element, you can do it like this:
$heading['h2']=$html->find('a h2');
foreach ($heading['h2'] as $h2) {
echo $h2->parent(), "\n";
}
You could also go up the DOM tree until you reach an <a> tag:
$heading['h2']=$html->find('a h2');
foreach ($heading['h2'] as $h2) {
$a = $h2;
while ($a && $a->tag != "a") $a = $a->parent(); // climb until an <a> is found or the tree ends
if (!$a) continue; // no <a> above <h2>
echo $a, "\n";
}
Well, my first thought would be to use
$html->find('a');
But I'm guessing you have multiple links on your page. So the correct practice would then be to use an ID (or a class) to identify your link
<a href='#' id='titleLink'><h2> my heading</h2></a>
And then search for that specific ID:
$html->find('a#titleLink');
I don't know what library you're using and what syntax it supports, but I hope you get the idea anyway.
According to docs: $heading['h2']=$html->find('a > h2')->parent(); would return the anchor tag wrapping the h2, but if you have multiple 'a > h2' in the page, the find function will return an array, so try it and/or use foreach.
$info = $html->find('a,h2');
echo '<a href='.$info[0]->href.'>'.$info[1]->innertext.'</a>';
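For comparison, the same "give me the link that wraps the heading" idea can be sketched with PHP's built-in DOMDocument and DOMXPath, where the ancestor axis does the climbing. This is not Simple HTML DOM code, and the second heading in $markup is an assumption added to show that unlinked headings are skipped.

```php
<?php
$markup = '<a href="#"><h2>my heading</h2></a>'
        . '<h2>heading without a link</h2>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// For every h2, take the nearest enclosing <a>, if there is one.
$wrapped = [];
foreach ($xpath->query('//h2/ancestor::a[1]') as $anchor) {
    $wrapped[] = $dom->saveHTML($anchor); // outer HTML of the anchor
}

echo implode("\n", $wrapped), "\n";
```

Headings that are not inside a link simply produce no result, so no extra null check is needed.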

How to get html code between two <p> tag?

I want to get some HTML code between two tags, and I have two regexes for it:
1-$LinkGrabber = "<p><strong>item1:<\/strong> <span style=\"color: #ff0000;\"><strong>Full<\/strong><\/span><\/p>(.*)<p> <\/p>";
2-$linkGrabber = "<p><strong>item2<\/strong> <span style=\"color: #ff0000;\"><strong>Full<\/strong><\/span><\/p>(.*)<p> <\/p>";
The first one works fine but the second does not. Can you tell me what's different between them?
I'd say they both work fine, but they're named differently. When testing the second one, make sure to use $linkGrabber instead of the $LinkGrabber from the first example.
Don't ever use Regex to Parse HTML tags. Make use of a DOM Parser.
$dom = new DOMDocument;
@$dom->loadHTML($html); // <---- Pass your HTML source here; @ suppresses warnings from malformed markup
foreach ($dom->getElementsByTagName('p') as $tag) {
echo $tag->nodeValue; //"prints" the content of the p tag.
}
The first is looking for HTML tags that contains item1: while the second looks for item2...
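If the goal is the markup between the "item1" paragraph and the next empty paragraph, a DOM-based version could look roughly like this. It is a sketch under assumptions: the list in $markup is invented filler for whatever sits between the two paragraphs, and the closing paragraph is assumed to contain only plain whitespace (a non-breaking space would need extra handling).

```php
<?php
// Hypothetical page fragment; the <ul> stands in for the content to capture.
$markup = '<div>'
        . '<p><strong>item1:</strong> <span style="color: #ff0000;"><strong>Full</strong></span></p>'
        . '<ul><li>first link</li><li>second link</li></ul>'
        . '<p> </p>'
        . '<p>unrelated</p>'
        . '</div>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// The paragraph whose first <strong> starts with "item1" marks the start.
$start = $xpath->query('//p[starts-with(normalize-space(strong), "item1")]')->item(0);

$between = '';
if ($start !== null) {
    for ($node = $start->nextSibling; $node !== null; $node = $node->nextSibling) {
        // An empty <p> marks the end of the section.
        if ($node->nodeName === 'p' && trim($node->textContent) === '') {
            break;
        }
        $between .= $dom->saveHTML($node);
    }
}

echo $between, "\n";
```

Switching to item2 is then a matter of changing the string in the XPath predicate rather than maintaining a second regex.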

Stripping span tag from simple html dom parser

Hi, I don't want to parse the span tag, which is a child of the <a> tag I am extracting my data from.
Ex:- <a class="imp">
Some data 1 2 3
<span>
Unwanted Data
</span>
</a>
Code I am using:-
foreach($html->find('a.imp') as $value)
{
echo $value->innertext;
}
Output:-
Some data 1 2 3
Unwanted Data...
Desired output:-
Some data 1 2 3
I really don't know whether there is any function or other way to exclude the child tags?
I believe you would have to loop through your first set of results, find all span elements and set each span element's outertext to an empty string, thus removing the entire HTML for that element.
foreach($html->find('a.imp') as $value)
{
foreach($value->find('span') as $e)
{
$e->outertext = '';
}
echo $value->innertext;
}
Simple HTML DOM Parser will work:
$content = file_get_html($link);
$stuffiwant = $content->find("//a/text()");
var_dump($stuffiwant);
I don't believe Simple HTML DOM has a clean way to remove elements. In phpQuery you can:
$doc->find('a.imp span')->remove();
echo $doc->find('a.imp')->text();
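Yet another angle, if you are open to PHP's built-in DOMDocument: instead of removing the span, read only the text nodes that are direct children of the anchor. A minimal sketch, assuming the class attribute is exactly "imp" (an element with several classes would need a looser XPath test):

```php
<?php
$markup = <<<HTML
<a class="imp">
Some data 1 2 3
<span>
Unwanted Data
</span>
</a>
HTML;

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

$results = [];
foreach ($xpath->query('//a[@class="imp"]') as $anchor) {
    $own = '';
    // Keep only text that belongs directly to the anchor, not to child tags.
    foreach ($anchor->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            $own .= $child->nodeValue;
        }
    }
    $results[] = trim($own);
}

echo implode("\n", $results), "\n";
```

This leaves the document untouched, which may matter if you still need the span content elsewhere.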
