simple html dom - space in class name

I'm using PHP Simple HTML DOM to get elements from the source of a site (not mine). When I look for a ul whose class is "board List", it is not found. I think the space might be the problem, but I don't know how to solve it.
This is a piece of the PHP code:
$html = str_get_html($result['content']); // get the HTML of the site
$board = $html->find('.board List'); // find all elements with class="board List"; this does not work in my case, although other class names do
And this is a piece of the site's HTML:
<!-- OTHER HTML CODE BEFORE THIS --><ul class="board List"><li id="c111131" class="skin_tbl">
<table class="mback" cellpadding="0" cellspacing="0" onclick="toggleCat('c111131')"><tr>
<td class="mback_left"><div class="plus"></div><td class="mback_center"><h2 class="mtitle">presentiamoci</h2><td class="mback_right"><span id="img_c111131"></span></table>
<div class="mainbg">
<div class="title top"><div class="aa"></div><div class="bb">Forum</div><div class="yy">Statistiche</div><div class="zz">Ultimo Messaggio</div></div>
<ul class="big_list"><!-- OTHER HTML AFTER THIS -->

I solved it by removing board from the find parameter, like this:
$board = $html->find('.List');
Now the parser seems to work correctly.

With Simple HTML DOM you would probably want to use:
$html->find('*[class="board List"]', 0);
If you really want to use:
$html->find('.board.List', 0);
Then use this one.

The answer is that you cannot use spaces in class names. Spaces are the separators between classes.
If you have <div class="container wrapper-something anothersomething"></div>, then you can use .container, .wrapper-something or .anothersomething as a selector, and each of them matches that div.
In your code you have <ul class="board List">, so to get a match with a CSS selector ($html->find('{here_comes_the_css_selector}');) you can use either .board or .List as the selector. As written, '.board List' is a descendant selector: it looks for a <List> element inside an element with class board.
Therefore your line $board = $html->find('.board List'); should look more like this:
$board = $html->find('.board.List');
// matches every element that has class 'board' AND class 'List'
// it is important that there is no space between the two selectors
// or
$board = $html->find('.List');
// matches every element that has class 'List'
// or
$board = $html->find('.board');
// matches every element that has class 'board'

$board = $html->find('[class="board List"]');
With this syntax Simple HTML DOM compares against the whole class attribute value, so it finds elements that carry several classes.
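For comparison, here is a minimal sketch of the same "has both classes" test using PHP's bundled DOM extension instead of Simple HTML DOM. The sample markup and the helper name hasClass() are assumptions made for illustration; the XPath expression is the usual way to test for one space-separated class token.

```php
<?php
// Sketch with DOMDocument/DOMXPath (bundled with PHP), not Simple HTML DOM.
// The markup below is a shortened, made-up stand-in for the scraped page.
$markup = '<ul class="board List"><li>first</li></ul>'
        . '<ul class="big_list"><li>second</li></ul>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // scraped pages are rarely valid; keep warnings quiet
$doc->loadHTML($markup);
libxml_clear_errors();

// Hypothetical helper: XPath test that is true when $class is one of the
// space-separated tokens of the class attribute.
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

$xpath  = new DOMXPath($doc);
$boards = $xpath->query('//ul[' . hasClass('board') . ' and ' . hasClass('List') . ']');

foreach ($boards as $ul) {
    echo $ul->getAttribute('class'), "\n";
}
```

Because each token is tested separately, the order of the classes in the attribute does not matter, which an exact match on the attribute value cannot offer.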

Related

How to access a span tag without a class name

I have this code in my Simple HTML DOM project.
How can I access these span tags, which have no class name?
<div class="somename">
<span>This text i need </span>
<span>This text i need too </span>
</div>
How can I echo the contents of those span tags?
I have already tried this:
$html->find(".somename",0)->innertext;

I believe you are using simple_html_dom.php. If that is the case, then:
$html->find("span",0)->innertext;
should give you the first span, and
$html->find("span",1)->innertext;
should give you the second span.
$html->find("span");
returns all spans as an array, so loop over the result and read innertext on each element rather than on the array itself.
If you only want the text content of the span, use plaintext instead of innertext.
If you want to search specifically for spans inside a div with class somename, you can do it like this:
$html->find("div[class=somename] span", 0)->innertext;
Reference: http://simplehtmldom.sourceforge.net/manual.htm

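Use XPath to get those span tags. Note that SimpleXMLElement expects well-formed XML, so this works only if your markup parses as XML.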
$xml = new SimpleXMLElement($yourHtmlContents);
$result = $xml->xpath('//span');
$firstSpan = (string) $result[0];
$secondSpan = (string) $result[1];
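When the markup is ordinary HTML rather than well-formed XML, the same XPath idea can be tried with DOMDocument, which is more forgiving. The following is only a sketch: the extra unrelated span is made up to show that the query is scoped to the div.

```php
<?php
// Sketch with the bundled DOM extension; the trailing <span> is an invented extra.
$markup = '<div class="somename">'
        . '<span>This text i need </span>'
        . '<span>This text i need too </span>'
        . '</div><span>unrelated</span>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect HTML
$doc->loadHTML($markup);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$texts = [];
// Only spans that are direct children of the div with class "somename".
foreach ($xpath->query('//div[@class="somename"]/span') as $span) {
    $texts[] = trim($span->textContent);
}

print_r($texts);
```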

How to extract HTML element from a source file

I need to replace an HTML section, identified by a tag id, in source code that is a combination of HTML and PHP, and I need to do it with PHP. If it were pure HTML, a DOM parser could be used; if there were no DIV inside a DIV, I can imagine how to use preg_match. This is what I am trying to do. I have code (loaded into a string) like:
<div>
<img >
</div>
<? include(); ?>
<div id="mydiv">
<div>
<div>
<img >
</div>
</div>
</div>
and my task is to replace the content of the "mydiv" DIV with new content, e.g.
<div id="newdiv">
some text
</div>
so the string will look like this after the change:
<div>
<img >
</div>
<? include(); ?>
<div id="mydiv">
<div id="newdiv">
some text
</div>
</div>
I have already tried:
1) parsing the code with DOMDocument's loadHTML => it produces a lot of errors when PHP code is included.
2) playing around with regexes like preg_match_all('/<div id="myid"([^<]*)<\/div>/', $src, $matches), which fails when child divs are included.
The best approach I have found so far is:
1) find the id="mydiv" string
2) search for '<' and '>' chars and count them, like '<'=1 and '>'=-1 (not exactly, but it gives the idea)
3) once the sum == 0 I should be at the position of the closing tag, so I know which portion of the string to exchange
This is quite a "heavy" solution, and it can stop working when the code is different (e.g. the on-page PHP code contains those chars as well, instead of just a simple "include"). So I am looking for a better solution.

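You could try something like this. It stops at the first </div>, so it only works when the target div contains no nested divs:
$file = 'filename.php';
$content = file_get_contents($file);
$array_one = explode( '<div id="mydiv">' , $content );
$my_div_content = explode("</div>" , $array_one[1] )[0];
Or use preg_match like you said (the same limitation applies, because .*? also stops at the first </div>):
preg_match('/<div id="mydiv"(.*?)<\/div>/s', $content, $matches);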

Yes, there is. First you need a function that gets the content of the file. Let's call the file homepage.php:
$homepageString = file_get_contents('homepage.php');
Now you have a string with all the content. The next step is to use preg_replace() on the part of the code you want to change. Note that the pattern below removes only the literal text id="mydiv"; to replace the div's content, the pattern has to cover the whole element:
$newHomepageString = preg_replace('/id="mydiv"/',"", $homepageString);
Now you overwrite the existing homepage.php file with the new source code:
file_put_contents("homepage.php", $newHomepageString);
Let me know if it worked for you! :)
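The tag-counting idea from the question can be sketched by counting <div> and </div> tags instead of single '<' and '>' characters, which leaves the <? ... ?> blocks alone. This is only a sketch under stated assumptions: the function name replaceDivContent() is hypothetical, the id attribute is assumed to use double quotes, and div tags are assumed not to contain a '>' inside their attributes or to appear inside PHP strings.

```php
<?php
/**
 * Hypothetical helper: replace the inner content of <div id="$id"> by tracking
 * how deeply <div> tags are nested. Returns $src unchanged when the div or its
 * matching closing tag cannot be found.
 */
function replaceDivContent(string $src, string $id, string $replacement): string
{
    // Locate the opening tag of the target div.
    $open = '/<div\b[^>]*\bid="' . preg_quote($id, '/') . '"[^>]*>/i';
    if (!preg_match($open, $src, $m, PREG_OFFSET_CAPTURE)) {
        return $src;
    }
    $contentStart = $m[0][1] + strlen($m[0][0]);

    // Collect every later <div ...> and </div> together with its byte offset.
    preg_match_all('/<div\b[^>]*>|<\/div\s*>/i', $src, $tags, PREG_OFFSET_CAPTURE, $contentStart);

    $depth = 1; // we are inside the target div
    foreach ($tags[0] as [$tag, $offset]) {
        $depth += ($tag[1] === '/') ? -1 : 1;
        if ($depth === 0) {
            // $offset is where the matching </div> begins.
            return substr($src, 0, $contentStart) . $replacement . substr($src, $offset);
        }
    }
    return $src;
}

$src = "<div>\n<img >\n</div>\n<? include(); ?>\n<div id=\"mydiv\">\n"
     . "<div>\n<div>\n<img >\n</div>\n</div>\n</div>";

$out = replaceDivContent($src, 'mydiv', "\n<div id=\"newdiv\">\nsome text\n</div>\n");
echo $out;
```

Unlike the explode() and non-greedy preg_match() attempts, this keeps going past inner closing tags until the nesting depth returns to zero.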

How to parse multiple elements in portions for html via Simple Html Dom

I am attempting to get various elements inside of an li as shown below. I am pretty new to this so I may not be using the most efficient methods but this is where I have started...
EXAMPLE CODE SIMPLIFIED....
<li id='entry_0' title='09879879'>
<div ....>
<h2> The title text would go here </h2>
<span class='entrySize' ....> 20oz </span>
<span class='entryPrice' ....> $32.09 </span>
<span class='anotherEntry' ....> More Data I need To Grab </span>
.......
</div>
</li>
<li> .... With same structure as above .... 100's of entries like this </li>
I know how to pull individual parts separately but having trouble grasping how to do it grouped within a portion of the html.
$filename = "directory/file.html";
$html = file_get_html($filename);
for($i=0; $i<=count(entryNumber);$i++)
{
$li_id = "entry_".$i;
foreach($html->find('li[id='.$li_id.']') as $li) {
echo $li->innertext;
}
}
So this gets me the content of the list item tag with the id number as the unique attribute. I would like to grab the h2 text, entrySize, entryPrice etc. as I iterate through the list item tags. What I don't understand is how, once I have the list item's content, I can parse its inner tags and attributes. Other parts of the full HTML document may have tags with the same id or class as these, so I am breaking the document down into portions and then looking to parse one section at a time.
I would also like to pull the value of the title attribute out of the li tag.
I hope my explanation makes sense.

You should probably use a DOM parser. PHP comes bundled with one, and there are many others you could use.
http://php.net/dom
PHP Simple HTML DOM Parser
<?php
$html = file_get_contents($page);
$doc = new DOMDocument();
$doc->loadHTML($html);
// now find what you need
$items = $doc->getElementsByTagName('li');
foreach ($items as $item) {
$id = $item->getAttribute('id');
if (strpos($id, 'entry_') !== false) {
// found a matching li, grab its children
}
}
Use this as a baseline; we can't write all the code for you. From here, follow the PHP docs to grab the child values and handle them. :)
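One way to finish that baseline is to run XPath queries relative to each li, so that every lookup stays inside the current entry. The sketch below assumes a trimmed-down copy of the question's markup (the elided attributes are left out and a second, non-entry li is invented); the array keys are illustrative names only.

```php
<?php
// Trimmed-down sample based on the question; the second <li> is made up.
$markup = <<<'HTML'
<ul>
<li id='entry_0' title='09879879'>
<div>
<h2> The title text would go here </h2>
<span class='entrySize'> 20oz </span>
<span class='entryPrice'> $32.09 </span>
</div>
</li>
<li id='other'><div><h2>Not an entry</h2></div></li>
</ul>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect HTML
$doc->loadHTML($markup);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$entries = [];
foreach ($xpath->query('//li[starts-with(@id, "entry_")]') as $li) {
    // Passing $li as the context node scopes each expression to this <li>.
    $entries[] = [
        'id'    => $li->getAttribute('id'),
        'title' => $li->getAttribute('title'),
        'h2'    => trim($xpath->evaluate('string(.//h2)', $li)),
        'size'  => trim($xpath->evaluate('string(.//span[@class="entrySize"])', $li)),
        'price' => trim($xpath->evaluate('string(.//span[@class="entryPrice"])', $li)),
    ];
}

print_r($entries);
```

Scoping the queries this way also addresses the concern about the same class names appearing elsewhere in the document.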

php dom parser return parent and child

I think this is a simple question but I can't sort it, I am trying to get all heading tags with the simple php DOM parser, my code works only one way, example
$heading['h2']=$html->find('h2 a');//works fine
I have found some sites wrap the h2 within the a tag like this
<a href='#'><h2> my heading</h2></a>
The problem is trying to get both tags so I can display the link with it. So when I do this
$heading['h2']=$html->find('a h2');
I get the h2 fine, but the link tag is not wrapped around it. That makes sense, since the selector finds all h2 tags that are children of a, but how do I get the entire parent tag? What I want it to return is
<a href='#'><h2>My Headings</h2></a>
Then I can just print the output with
echo $heading['h2']; // and the link will be there

If the <a href="[..]"> is just the outer element, you can do it like this:
$heading['h2']=$html->find('a h2');
foreach ($heading['h2'] as $h2) {
echo $h2->parent(), "\n";
}
You could also go up the DOM tree until you reach an <a> tag:
$heading['h2']=$html->find('a h2');
foreach ($heading['h2'] as $h2) {
$a = $h2;
while ($a && $a->tag != "a") $a = $a->parent(); // climb until an <a> is reached
if (!$a) continue; // no <a> above <h2>
echo $a, "\n";
}

Well, my first thought would be to use
$html->find('a');
But I'm guessing you have multiple links on your page, so the better practice is to use an ID (or a class) to identify your link:
<a href='#' id='titleLink'><h2> my heading</h2></a>
And then search for that specific ID:
$html->find('a#titleLink');
I don't know what library you're using and what syntax it supports, but I hope you get the idea anyway.

According to the docs, $html->find('a > h2', 0)->parent() returns the anchor tag wrapping the first matching h2. Without the index argument, find() returns an array of every 'a > h2' match on the page, so in that case loop over it with foreach and call parent() on each element.

$info = $html->find('a,h2');
echo '<a href='.$info[0]->href.'>'.$info[1]->innertext.'</a>';
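As a cross-check on the parent-lookup idea, here is a sketch with the bundled DOM extension rather than Simple HTML DOM. It asks XPath for the nearest a ancestor of each h2; the sample markup, including the extra link and the bare heading, is made up for illustration.

```php
<?php
// Made-up sample: one heading wrapped in a link, one plain link, one bare heading.
$markup = "<a href='#intro'><h2> my heading</h2></a>"
        . "<a href='#plain'>no heading</a>"
        . "<h2>bare</h2>";

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect HTML
$doc->loadHTML($markup);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$links = [];
// For every <h2>, select its closest <a> ancestor (if it has one).
foreach ($xpath->query('//h2/ancestor::a[1]') as $a) {
    $links[] = [
        'href' => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    ];
}

print_r($links);
```

Headings without a surrounding link simply produce no result, so no extra null check is needed in the loop.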

Regular expression for DIV elements

Say I had this piece of HTML for example:
<div id="gallery2" class="galleryElement">
<h2>My Photos</h2>
<div class = "imageElement">
<h3>#Embassy - VIP </h3>
<p><b>Image URL:</b>
http://photos-p.friendster.com/photos/78/86/77426887/1_119466535.jpg</p>
<img src = "http://photos-p.friendster.com/photos/78/86/77426887/1_119466535.jpg" class = "full"/>
<img src = "http://photos-p.friendster.com/photos/78/86/77426887/1_887303260m.jpg" class = "thumbnail"/>
</div>
<div class = "imageElement">
<h3>#Embassy - VIP </h3>
<p><b>Image URL:</b>
http://photos-p.friendster.com/photos/78/86/77426887/1_119466535.jpg</p>
<img src = "http://photos-p.friendster.com/photos/78/86/774534426887/1_119466535.jpg" class = "full"/>
<img src = "http://photos-p.friendster.com/photos/78/86/774534426887/1_887303260m.jpg" class = "thumbnail"/>
</div>
</div>
I need to build the proper regular expression to parse each div class'ed as imageElement and store the contents (as text) in an array starting from the opening <div class = "imageElement"> till its ending div pair </div>. Also, there really are spaces on class = "imageElement". So far I have the expression:
\<div class = "imageElement">[\s\S\d\D]*</div>
but it only gets the whole set of elements. Thanks in advance.

This is a pretty common question here ("How do I parse this XML/HTML with a regular expression?") and I'll give you the same answer: don't.
Regular expressions are notoriously bad at this kind of thing. HTML/XML is not "regular" in the regex sense.
PHP comes with at least 3 XML parsers (SimpleXML, DOMDocument and XMLReader spring to mind) that will do this reliably. Use one of those.
Take a look at Parse HTML With PHP And DOM as an example.

It sounds like the trouble you're having is that the * is greedy, i.e. it matches as much as possible, whereas you want it to match as little as possible.
If the data inside your divs does not contain "</div>" then you can keep the parsing pretty simple. If it can contain arbitrary HTML data (specifically nested divs), you'll need to parse it more.
If it stays basic, you could do the whole thing without regex. It's a little hackish, but as long as your data stays simple and predictable, it should work really fast:
$chunks = explode('<div class = "imageElement">', $body); // delimiter first, then the string
array_shift($chunks); // drop everything before the first imageElement div
$matches = array();
foreach($chunks as $chunk) {
$pos = strpos($chunk, '</div>'); // haystack first, then the needle
if($pos !== false) {
$matches[] = substr($chunk, 0, $pos);
}
}
If you need something more flexible, use a real html parser.
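To illustrate the greediness point, here is a sketch of the regex variant with a lazy quantifier. It carries the same assumption as the explode() approach: the imageElement divs must not contain nested divs. The shortened markup and the file names a.jpg and b.jpg are placeholders.

```php
<?php
// Shortened stand-in for the gallery markup; image names are placeholders.
$body = '<div id="gallery2" class="galleryElement">'
      . '<div class = "imageElement"><h3>First</h3><img src = "a.jpg" class = "full"/></div>'
      . '<div class = "imageElement"><h3>Second</h3><img src = "b.jpg" class = "full"/></div>'
      . '</div>';

// .*? is lazy: it stops at the first </div> after each opening tag instead of the last one.
// The s modifier lets . match newlines, so multi-line divs are covered too.
preg_match_all('/<div class = "imageElement">(.*?)<\/div>/s', $body, $matches);

print_r($matches[1]);
```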
