PHP Simple HTML DOM parser gives faulty data

I'm using PHP Simple HTML DOM to parse a web page with the following HTML. Notice the extra </span>-tags in each <li>.
<li>
<span class="name">
Link asdasd
</span>
</span>
</li>
<li>
<span class="name">
Link asdasd2
</span>
</span>
</li>
My queries are:
$lis = $dom->find('li');
foreach ($lis as $li) {
$spans = $li->find('span');
foreach ($spans as $span) {
echo $span->plaintext."<br>";
}
}
My output is:
Link asdasd
Link asdasd2
-----------
Link asdasd2
-----------
As you can see, find('span') finds two spans as children of the first <li>, and for the second one it takes the value from the next <span> it can find (even though that span is a child of the next <li>). Removing the trailing </span> fixes the problem.
My questions are:
Why is this happening?
How I can solve this particular case?
Everything else works well and I'm not in a position to make big changes to my script. I can change the DOM queries easily though if needed.
I am thinking about counting opening and closing tags and stripping one </span> if there are too many of them. Since they will always be <span>s, is there a smart way to check this with a regexp?

1) Simple is trying to fix your extra </span> by adding a <span> somewhere. So now you have an extra span that shouldn't be there. For the record, DOMDocument would do the same thing, although perhaps in a more predictable way.
2) Simplify:
foreach ($dom->find('li > span') as $span) {
echo $span->plaintext."<br>";
}
// Link asdasd <br> Link asdasd2 <br>
Now you've told it you only want the span that is a child of a li. Even better, do something like:
foreach ($dom->find('span.name') as $span) {
echo $span->plaintext."<br>";
}
Use those attributes, that's what they're good for.

$newTxt = preg_replace('/<\/span>\s*<\/span>/', '</span>', $txt);
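As a minimal sketch of that clean-up step, the snippet below collapses any run of consecutive </span> tags (separated only by whitespace) into a single one before the markup is handed to the parser. The function name collapse_extra_span_closers and the sample string are made up for illustration. It assumes, as the question states, that spans are never legitimately nested; with nested spans this would remove closing tags that are actually needed.

```php
<?php
// Hypothetical helper: collapse runs of "</span>" separated only by
// whitespace into a single "</span>". Assumes spans are never nested.
function collapse_extra_span_closers(string $html): string
{
    // \s* (whitespace) is what sits between the duplicated tags in the sample.
    return preg_replace('~</span>(?:\s*</span>)+~i', '</span>', $html);
}

// Sample input modelled on one <li> from the question.
$txt = "<li>\n<span class=\"name\">\nLink asdasd\n</span>\n</span>\n</li>";

$clean = collapse_extra_span_closers($txt);
echo $clean, "\n";
```

The cleaned string can then be passed to the parser in place of the raw page source.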
The method 'find(x)' is an overloaded function that can return the equivalents of:
$e->getElementById(x);
$e->getElementsById(x);
$e->getElementByTagName(x); and
$e->getElementsByTagName(x);
Your first call uses the last of these; the second call, on $li, uses the third. Which one is chosen is probably an optimization based on how you call the API. I guess you have found a bug in the API, because in both cases you were asking for the third call:
$e->getElementByTagName();

Related

Adding a class to all English text in HTML?

The requirement is to add an englishText class around all English words on a page. The problem is similar to this, but the JavaScript solutions won't work for me. I require a PHP example to solve this problem. For example, if you have this:
<p>Hello, 你好</p>
<div>It is me, 你好</div>
<strong>你好, how are you</strong>
Afterwards I need to end with:
<p><span class="englishText">Hello</span>, 你好</p>
<div><span class="englishText">It is me</span>, 你好</div>
<strong>你好, <span class="englishText">how are you</span></strong>
There are more complicated cases, such as:
<strong>你好, TEXT?</strong>
<div>It is me, 你好</div>
This should become:
<strong>你好, <span class="englishText">TEXT?</span></strong>
<div><span class="englishText">It is me</span>, 你好</div>
But I think I can sort out these edge cases once I know how to actually iterate over the document correctly.
I can't use JavaScript to solve this because:
This needs to work on browsers that don't support JavaScript
I would prefer to have the classes in place on page load so there isn't any delay in rendering the text in the correct font.
I figured the best way to iterate over the document would be using PHP Simple HTML DOM Parser.
But the problem is that if I try this:
foreach ($html->find('div') as $element)
{
// make changes here
}
My concern is that the following case will cause chaos:
<div>
Hello , 你好
<div>Hello, 你好</div>
</div>
As you can see, it's going to go into the first div and then if I process that node, I will be processing the node within that too.
Any ideas how to get around this and only select the nodes for processing once?
UPDATE
I realise now that what I effectively need is a recursive way to iterate over HTML elements with the ability to change them as I iterate over them.

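You should walk through the siblings; that way you won't get into trouble with cases like this.
Something like this:
<?php
foreach ($html->find('div') as $element)
{
    // next_sibling() returns a single node (or null), so step through them one at a time
    for ($sibling = $element->next_sibling(); $sibling; $sibling = $sibling->next_sibling()) {
        // plaintext is a property, not a method
        echo $sibling->plaintext."\n";
    }
}
?>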
Or, a much easier way in my opinion:
1. Change every <*> to "\n"."<*>" with preg_replace().
2. Make an array of lines: $lines = explode("\n", $html_string);
3. Strip the tags from each line:
foreach($lines as $line){
    $text = strip_tags($line);
    echo $text;
}
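Another way to touch each piece of text exactly once, however the elements are nested, is to iterate over text nodes instead of elements. The following is only a sketch under several assumptions: it uses PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM, the function name wrap_english_text and the wrapper id "root" are invented for illustration, and "English" is naively taken to mean runs of ASCII letters and spaces (so punctuation such as the "?" in "TEXT?" is left outside the span).

```php
<?php
// Hypothetical helper: wrap runs of ASCII letters in <span class="englishText">.
function wrap_english_text(string $html): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings quiet
    // The XML prolog is a common workaround so loadHTML reads the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="root"]')->item(0);

    // Copy the text nodes into an array first, because the loop edits the tree.
    $textNodes = iterator_to_array($xpath->query('.//text()', $root));

    foreach ($textNodes as $node) {
        if (!preg_match('/[A-Za-z]/', $node->nodeValue)) {
            continue; // nothing to wrap in this text node
        }
        // Split into letter runs (kept via DELIM_CAPTURE) and everything else.
        $parts = preg_split(
            '/([A-Za-z](?:[A-Za-z ]*[A-Za-z])?)/',
            $node->nodeValue,
            -1,
            PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
        );
        $fragment = $doc->createDocumentFragment();
        foreach ($parts as $part) {
            if (preg_match('/^[A-Za-z]/', $part)) {
                $span = $doc->createElement('span');
                $span->setAttribute('class', 'englishText');
                $span->appendChild($doc->createTextNode($part));
                $fragment->appendChild($span);
            } else {
                $fragment->appendChild($doc->createTextNode($part));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize only the children of the temporary wrapper.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// The nested-div case from the question: each "Hello" is wrapped once.
$out = wrap_english_text('<div>Hello , 你好<div>Hello, 你好</div></div>');
echo $out, "\n";
```

Because every text node belongs to exactly one parent, the nested <div> is not processed twice, which is the concern raised in the question.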

Closing element tags in php

I have just started to learn PHP from books and have come across something I don't understand. In the book they never close HTML tags. Is this correct practice, or should they be closed? Here is an example of the book's content:
<?php
$cars = array('Dodge'=>'Viper','Chevrolet'=>'Camaro','Ford'=>'Mustang');
echo '<dl><dt>Original Element Order:<dd>';
foreach($cars as $key => $value){
echo '•', $key.' '.$value;
}
?>
Could anyone tell me if this is correct and common practice?
Thanks

Some old books :). They definitely need to be closed.
You need to echo </dd></dl> after the foreach loop (the <dt> is already closed implicitly where the <dd> starts).

The end tags for the <dt> and <dd> elements are optional in HTML. The missing </dl> is a problem though.
Other issues with this fragment:
Using • for a list instead of list markup
Using something that looks like a <ul> (but is simulated with •) for key/value pairs
Having a dl with only one dt in it.
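To illustrate the points above, here is one possible way to emit the same data with every element closed explicitly and one <dt>/<dd> pair per car. This is a sketch, not the book's code: using the make as the term and the model as the description is an assumption about what the list is meant to express, and $html is just an illustrative variable name.

```php
<?php
$cars = ['Dodge' => 'Viper', 'Chevrolet' => 'Camaro', 'Ford' => 'Mustang'];

// Build a definition list with one term/description pair per entry.
$html = "<dl>\n";
foreach ($cars as $make => $model) {
    // htmlspecialchars() guards against markup characters in the data.
    $html .= '  <dt>' . htmlspecialchars($make) . '</dt>'
           . '<dd>' . htmlspecialchars($model) . "</dd>\n";
}
$html .= "</dl>\n";

echo $html;
```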

Using Simple HTML DOM to Scrape?

Simple HTML DOM is basically a PHP file you include in your pages that lets you do simple web scraping. It's good for the most part, but I can't figure out the manual as I'm not much of a coder. Are there any sites or guides out there with easier help for this? (The one at php.net is a bit too complicated for me at the moment.) Is there a better place to ask this kind of question?
The site for it is at: http://simplehtmldom.sourceforge.net/manual.htm
I can scrape stuff that has specific classes like <tr class="group">, but not for stuff that's in between. For example.. This is what I currently use...
$url = 'http://www.test.com';
$html = file_get_html($url);
foreach($html->find('tr[class=group]') as $result)
{
    $first = $result->find('td[class=category1]',0);
    $second = $result->find('td[class=category2]',0);
    echo $first.$second;
}
But here is the kind of code I'm trying to scrape.
<table>
<tr class="Group">
<td>
<dl class="Summary">
<dt>Heading 1</dt>
<dd>Cat</dd>
<dd>Bacon</dd>
<dt>Heading 2</dt>
<dd>Narwhal</dd>
<dd>Ice Soap</dd>
</dl>
</td>
</tr>
</table>
I'm trying to extract the content of each <dt> and put it to a variable. Then I'm trying to extract the content of each <dd> and put it to a variable, but nothing I tried works. Here's the best I could find, but it gives me back only the first heading repeatedly rather than going to the second.
foreach($html->find('tr[class=Summary]') as $result2)
{
echo $result2->find('dt',0)->innertext;
}
Thanks to anyone who can help. Sorry if this is not clear or that it's so long. Ideally I'd like to be able to understand these DOM commands more as I'd like to figure this out myself rather than someone here just do it (but I'd appreciate either).
TL;DR: I am trying to understand how to use the commands listed in the manual (url above). The 'manual' isn't easy enough. How do you go about learning this stuff?

I think $result2->find('dt',0) gives you back element 0, which is the first match. If you omit the index, you should get an array (or node list) of all matches instead. Also note that in your sample the Summary class is on the <dl>, not on the <tr>. Something like this:
foreach($html->find('dl[class=Summary]') as $result2)
{
    foreach ($result2->find('dt') as $node)
    {
        echo $node->innertext;
    }
}
You don't strictly need the outer loop, since there's only one such dl in your document. You could even leave it out altogether and find each dt in the document, but for tools like this, I think it's a good thing to be both flexible and strict, so you are prepared for multiple lists but don't accidentally parse dts from anywhere in the document.
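If the goal is to keep each heading together with the values that follow it, one option is to walk the children of the list in order and start a new group at every <dt>. The sketch below uses PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM so that it runs without extra files; the variable names ($source, $groups, $current) are illustrative, and the markup is a trimmed copy of the sample from the question.

```php
<?php
// Sample markup taken from the question.
$source = '<table><tr class="Group"><td>
<dl class="Summary">
<dt>Heading 1</dt><dd>Cat</dd><dd>Bacon</dd>
<dt>Heading 2</dt><dd>Narwhal</dd><dd>Ice Soap</dd>
</dl>
</td></tr></table>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings quiet
$doc->loadHTML($source);
libxml_clear_errors();

$xpath   = new DOMXPath($doc);
$groups  = [];
$current = null;

// Visit the element children of the list in document order.
foreach ($xpath->query('//dl[@class="Summary"]/*') as $node) {
    if ($node->nodeName === 'dt') {
        $current = trim($node->textContent); // start a new group
        $groups[$current] = [];
    } elseif ($node->nodeName === 'dd' && $current !== null) {
        $groups[$current][] = trim($node->textContent);
    }
}

print_r($groups);
```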

Finding the maximum occurring string within a text file

So I've seen questions asked before that are along the lines of finding the maximum occurrence of a string within a file, but all of those rely on knowing what to look for.
I have what you might almost call a flat file database that grabs a bunch of input data and basically wraps different parts of it in html span tags with referencing ids.
Each line comes out in this kind of fashion:
<p>
<span class="ip">58.106.**.***</span>
Wrote <span class='text'>some text</span>
<span class='effect1'> and caused seizures </span>
<span class='time'>23:47</span>
</p>
How would I then go about finding the 'text' span contents that occur the most times?
I.e. if I had
<p>
<span class="ip">58.106.**.***</span>
Wrote <span id='text'>woof</span>
<span class='effect1'> and caused seizures </span>
<span class='time'>23:47</span>
</p>
<p>
<span class="ip">58.106.**.***</span>
Wrote <span class='text'>meow</span>
<span class='effect1'> and caused mind-splosion </span>
<span class='time'>23:47</span>
</p>
<p>
<span class="ip">58.106.**.***</span>
Wrote <span class='text'>meow</span>
<span class='effect1'> and used no effect </span>
<span class='time'>23:47</span>
</p>
<p>
<span class="ip">58.106.**.***</span>
Wrote <span class='text'>meow</span>
<span class='effect1'> and used no effect </span>
<span class='time'>23:47</span>
</p>
the output would be 'meow'.
How would I accomplish this in php?

First off: Your format is not conducive to this type of data manipulation; you might want to consider changing it.
That said, based on this structure the logical solution would be to leverage DOMXPath, as Dani says. This could have been problematic because of all the duplicate ids in there, but in practice it works (after emitting a boatload of warnings, which is one more reason the data structure warrants revision).
Here's some code to go with the idea:
$input = '<body>'.get_input().'</body>';
$doc = new DOMDocument;
$doc->loadHTML($input); // lots of warnings, duplicate ids!
$xpath = new DOMXPath($doc);
$result = $xpath->query("//*[@id='text']/text()");
$occurrences = array();
foreach ($result as $item) {
if (!isset($occurrences[$item->wholeText])) {
$occurrences[$item->wholeText] = 0;
}
$occurrences[$item->wholeText]++;
}
// Sort the results and produce final answer
arsort($occurrences);
reset($occurrences);
echo "The most common text is '".key($occurrences).
"', which occurs ".current($occurrences)." times.";
Update (seeing as you fixed the duplicate id issue): You would simply change the XPath query to "//*[@class='text']/text()" so that it continues to match. However, this way of doing things remains inefficient, so if one or more of these apply:
you are going to do this all the time
you have lots of data
you need it to be really fast
then changing the data format is a good idea.
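For the class-based markup, the counting step can also be left to array_count_values(). The following is a small sketch rather than a drop-in replacement: $input is a shortened stand-in for the real file contents, and $texts, $counts and $mostCommon are illustrative names.

```php
<?php
// Shortened stand-in for the real data (class='text' variant).
$input = "<p>Wrote <span class='text'>woof</span></p>"
       . "<p>Wrote <span class='text'>meow</span></p>"
       . "<p>Wrote <span class='text'>meow</span></p>";

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings quiet
$doc->loadHTML('<body>' . $input . '</body>');
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Collect the text of every span with class 'text'.
$texts = [];
foreach ($xpath->query("//span[@class='text']") as $span) {
    $texts[] = trim($span->textContent);
}

// Count occurrences and sort so the most frequent value comes first.
$counts = array_count_values($texts);
arsort($counts);
$mostCommon = array_key_first($counts); // null if nothing matched (PHP 7.3+)

echo "Most common: {$mostCommon} ({$counts[$mostCommon]} times)\n";
```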

Have a look at DOMXPath, you can use an XPath query to get all the #text and then find the most used one with php.
One problem is that you used the same id several times, which is not valid HTML, so DOM might break.

How do you access Simple DOM selectors?

I can access some of the 'class' items with
$ret = $html->find('articleINfo'); and then print the first key of the returned array.
However, there are other tags I need, like <span id="firstArticle_0">, and I cannot seem to find them.
$ret = $html->find('#span=id[ etc ]');
In some cases something is returned, but it's not an array, or it is an array with empty keys.
Unfortunately I cannot use var_dump to see the object, since var_dump produces 1000 pages of unreadable junk. The code looks like this.
<div id="articlething">
<p class="byline">By Lord Byron and Alister Crowley</p>
<p>
<span class="location">GEORGIA MOUNTAINS, Canada</span> |
<span class="timestamp">Fri Apr 29, 2011 11:27am EDT</span>
</p>
</div>
<span id="midPart_0"></span><span class="mainParagraph"><p><span class="midLocation">TUSCALOOSA, Alabama</span> - Who invented cheese? Everyone wants to know. They held a big meeting. Tom Cruise is a scientologist. </p>
</span><span id="midPart_1"></span><p>The president and his family visited Chuck-e-cheese in the morning </p><span id="midPart_2"></span><p>In Russia, 900 people were lost in the balls.</p><span id="midPart_3">

Simple HTML DOM can easily be used to find a span with a specific class.
If you want all spans with class=location, then:
// create HTML DOM
$html = file_get_html($iUrl);
// get text elements
$aObj = $html->find('span[class=location]');
Then do something like:
foreach($aObj as $key=>$oValue)
{
echo $key.": ".$oValue->plaintext."<br />";
}
It worked for me using your example my output was:
label=span, class=location: Found 1
0: GEORGIA MOUNTAINS, Canada
Hope that helps. Simple HTML DOM is great for what it does and easy to use once you get the hang of it. Keep trying and you will build up a number of examples that you can use over and over again. I've scraped some pretty crazy pages, and they get easier and easier.
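The question also asks about the spans that carry ids such as midPart_0. As a hedged sketch of one way to select those, the snippet below uses PHP's built-in DOMDocument and DOMXPath (not Simple HTML DOM) and matches every span whose id starts with "midPart_"; the sample markup is a shortened version of the question's, and $ids is an illustrative name.

```php
<?php
// Shortened version of the markup from the question.
$html = '<span id="midPart_0"></span><p>Who invented cheese?</p>'
      . '<span id="midPart_1"></span><p>The president visited.</p>'
      . '<span class="location">GEORGIA MOUNTAINS, Canada</span>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Select spans by id prefix and collect their ids in document order.
$ids = [];
foreach ($xpath->query('//span[starts-with(@id, "midPart_")]') as $span) {
    $ids[] = $span->getAttribute('id');
}

print_r($ids);
```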

Try using this. Worked for me very well and extremely easy to use. http://code.google.com/p/phpquery/

The docs on the PHP Simple DOM parser are spotty on deciphering Open Graph meta tags. Here's what seems to work for me:
<?php
// grab the contents of the page
$summary = file_get_html($url);
// Get image possibilities (for example)
$img = array();
// First, if the webpage has an og:image meta tag, it's easy:
if ($summary->find('meta[property=og:image]')) {
foreach ($summary->find('meta[property=og:image]') as $e) {
$img[] = $e->attr['content'];
}
}
?>
