I want to parse an HTML page for a predefined string and then build a unique reference to that location. With the help of 'Wrikken' I have come this far, but it is not working quite correctly.
$d = new DOMDocument();
$d->loadXML($xml);
$x = new DOMXPath($d);
$result = $x->evaluate("//text()[contains(.,'STRING')]/ancestor::*/@id");
$unique = null;
for($i = $result->length -1;$i >= 0 && $item = $result->item($i);$i--){
    if($x->query("//*[@id='".addslashes($item->value)."']")->length == 1){
        echo 'Unique ID is '.$item->value."\n";
        $unique = $item->value;
        break;
    }
}
if(is_null($unique)) echo 'no unique ID found';
EDIT
Let me explain the problem again. I want to make an HTML parser that can parse a page and find a unique string. I then need to find a unique placeholder for that location, so that when the data on the website changes I can find the new data. Wrikken has helped me as far as locating the string; unfortunately, the code above does not find a unique CSS selector correctly.
Not sure if this is your problem, but I normally would do the first part more like this:
//*[contains(text(),'STRING')]/ancestor::*/@id
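As a rough illustration of the whole approach, here is a minimal sketch. The sample markup and the helper names `xpathLiteral()` and `findUniqueAncestorId()` are hypothetical, not part of the question. It swaps `addslashes()` for explicit XPath quoting, because XPath 1.0 has no backslash escaping, and it relies on libxml returning the attribute nodes in document order, so walking the list backwards tries the innermost ancestor first.

```php
<?php
// Hypothetical helper: quote a PHP string as an XPath 1.0 literal.
// XPath 1.0 has no escape character, so mixed quotes need concat().
function xpathLiteral(string $value): string
{
    if (strpos($value, "'") === false) {
        return "'" . $value . "'";
    }
    if (strpos($value, '"') === false) {
        return '"' . $value . '"';
    }
    // Contains both quote types: split on ' and rejoin with concat().
    return "concat('" . implode("', \"'\", '", explode("'", $value)) . "')";
}

// Hypothetical helper: id of the nearest ancestor whose id occurs exactly once.
function findUniqueAncestorId(DOMDocument $doc, string $needle): ?string
{
    $xpath = new DOMXPath($doc);
    // @id attributes of all ancestors of text nodes containing the needle.
    $ids = $xpath->query(
        '//text()[contains(., ' . xpathLiteral($needle) . ')]/ancestor::*/@id'
    );
    // Walk backwards so the innermost ancestor is checked first.
    for ($i = $ids->length - 1; $i >= 0; $i--) {
        $id = $ids->item($i)->value;
        $count = $xpath->evaluate('count(//*[@id=' . xpathLiteral($id) . '])');
        if ($count == 1) {
            return $id;
        }
    }
    return null;
}

// Sample markup (an assumption for this sketch); note the duplicated id "dup".
$html = '<div id="main"><div id="dup"><p>other</p></div>'
      . '<div id="dup"><span id="price">STRING here</span></div></div>';

libxml_use_internal_errors(true); // duplicate ids trigger parser warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

echo findUniqueAncestorId($doc, 'STRING') ?? 'no unique ID found', "\n";
```

If several text nodes contain the string, the ids of all their ancestors end up in one list, so a stricter version would probably handle each matching text node separately.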
Related
The source of this problem is that I'm running ads on my website. My content is mainly HTML stored in a database, so I decided to place "In-Text Ads", that is, ads that are not in a fixed zone.
My solution was to explode the content by paragraphs and place the text ad in the middle of the p tags. That worked well at first: since I use CKEditor to generate the content, I assumed images, blockquotes, and other tags would be nested inside p tags (foolish of me). I then realized that images and blockquotes had disappeared from my posts. Next, I changed my code to explode using * instead of the p tag. I celebrated too soon, because now I get a lot of duplicate content: for example, if I have one image, I now get the same image 4 times, and the same happens with all other tags. I'm not sure about the source of these duplicates, but I think it has something to do with nested HTML. I looked for a solution for hours, and now I'm asking here whether somebody can help me solve this headache.
Here is my code:
//In a helper file
function splitByHTMLTagName(string $string, string $tagName = 'p')
{
$text = <<<TEXT
$string
TEXT;
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$nodes = [];
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $text);
foreach ($dom->getElementsByTagName($tagName) as $node) {
array_push($nodes, $dom->saveHTML($node));
}
libxml_clear_errors();
return $nodes;
}
//In my view
$text = nl2br($database['content']);
$nodes = splitByHTMLTagName($text, '*');
//Using var_dump($nodes); here shows the duplicates are here already.
$nodes_count = count($nodes);
$show_ad_at = -1;
$was_added = false;
if($nodes_count % 2 == 0 ){
$show_ad_at = $nodes_count /2;
}else if ($nodes_count == 1 || $nodes_count < 3){
$show_ad_at = -1; //add later
}else if ($nodes_count > 3 && $nodes_count % 2 != 0){
$show_ad_at = ceil($nodes_count/2);
}
for($i = 0; $i<count($nodes); $i++){
if(!$was_added && $i == $show_ad_at){
$was_added = true;
?>
<div>
<script></script><!--This script is provided to me, it adds the ad where it is placed, I don't show the full script, It has nothing to do with the duplicates problem-->
</div>
<?php
}
echo $nodes[$i]; //print the node that comes from $nodes array where the duplicates already exist
}
if(!$was_added){
$was_added = true;
?>
<div>
<script></script><!--This script is provided to me, it adds the ad where it is placed, I don't show the full script, It has nothing to do with the duplicates problem-->
</div>
<?php
}
What can I do?
Thanks in advance.
Postdata #1: I use codeigniter as PHP Framework
Postdata #2: My ads provider does not implement "In-Text ads" as a feature like google does.
It seems you are printing the ad block inside an if statement.
If I haven't misunderstood, your code is something like:
foreach ... {
    if (strpos($html_line, "In-Text Ads") !== FALSE) {
        print($ads_html);
    }
}
I think you should use str_replace() to insert the ad into the content, rather than print()-like functions that output the value directly.
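A likely cause of the duplicates, consistent with the asker's suspicion about nested HTML: getElementsByTagName('*') returns every element in the document, including html, body, and every nested element, and saveHTML($node) serialises each of them together with its children. A nested image is therefore printed once for itself and once more for each of its ancestors. The sketch below keeps only the top-level nodes of the fragment; the function name `splitTopLevelNodes()` and the sample markup are hypothetical.

```php
<?php
// Hypothetical replacement for splitByHTMLTagName(): return only the
// top-level nodes of an HTML fragment, each serialised with its children.
function splitTopLevelNodes(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // The parser wraps the fragment in <html><body>, so the children of
    // <body> are exactly the top-level nodes of the original fragment.
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    libxml_clear_errors();

    $nodes = [];
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $nodes; // nothing parseable in the input
    }
    foreach ($body->childNodes as $child) {
        // Skip whitespace-only text between blocks.
        if ($child->nodeType === XML_TEXT_NODE && trim($child->textContent) === '') {
            continue;
        }
        $nodes[] = $dom->saveHTML($child);
    }
    return $nodes;
}

$nodes = splitTopLevelNodes(
    '<p>One <b>bold</b></p><img src="a.png"><blockquote><p>Quote</p></blockquote>'
);
foreach ($nodes as $node) {
    echo $node, "\n";
}
```

With this split, the ad can still be inserted at the middle index of $nodes, and nested tags are emitted only as part of their parent.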
I know how to xpath and echo text off another website via tags like div id, class ,etc, using the below code. But, I don't know how to do it under more precise conditions, for example when trying to scrape and echo a bit of text that has no unique tag identifier like a div.
This below code spits out scraped data.
$doc = new DOMDocument;
// We don't want to bother with white spaces
$doc->preserveWhiteSpace = false;
// Most HTML Developers are chimps and produce invalid markup...
$doc->strictErrorChecking = false;
$doc->recover = true;
$doc->loadHTMLFile('http://www.nbcnews.com/business');
$xpath = new DOMXPath($doc);
$query = "//div[@class='market']";
$entries = $xpath->query($query);
foreach ($entries as $entry) {
echo trim($entry->textContent); // use `trim` to eliminate spaces
}
In the example source code below, I want to pull the value "21,271.97". But there's no unique tag for it, no div id. Is it possible to pull this data by identifying a keyword in the <p> that never changes, for example "DJIA all time"?
<p>DJIA All Time, Record-High Close: <font color="#0000FF">June 9,
2017</font>
(<font color="#FF0000"><b bgcolor="#FFFFCC"><font face="Verdana, Arial,
Helvetica, sans-serif" size="2">21,271.97</font></b></font>)</p>
I am wondering whether I could replace $query = "//div[@class='market']"; with something along the lines of:
$query = "//p['DJIA all time']";
Could this be possible?
I also wonder whether a loop with something like $query = "//p[='DJIA']"; could work, though I don't know exactly how to use that.
Thanks!!
It would be good to have a play with an online XPath tester - I use https://www.freeformatter.com/xpath-tester.html#ad-output
$query = "//p[contains(text(),'DJIA')]";
Although if you use the page you're after, I've found that the value seems to be the first record for...
$query = "//span[contains(@class,'market_price')]";
But the idea is the same in both cases: contains(source,value) will match a set of nodes. In the first case text() is the text of the node; the second looks for the specific class definition.
Try to use below XPath expression:
//p[contains(text(), "DJIA All Time")]//b/font
Considering provided link (http://www.nbcnews.com/business) you can get required text with
//span[text()="DJIA"]/following-sibling::span[@class="market_item market_price"]
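To try the keyword-based expression without hitting the live site, here is a minimal sketch that runs it against the snippet from the question. It uses contains(., ...) rather than contains(text(), ...), which tests the full text of the paragraph instead of only its first text node. XPath string comparison is case-sensitive, so the keyword must be spelled "DJIA All Time", as in the markup.

```php
<?php
// The markup from the question, used as a local stand-in for the page.
$html = <<<'HTML'
<p>DJIA All Time, Record-High Close: <font color="#0000FF">June 9,
2017</font>
(<font color="#FF0000"><b bgcolor="#FFFFCC"><font face="Verdana, Arial,
Helvetica, sans-serif" size="2">21,271.97</font></b></font>)</p>
HTML;

libxml_use_internal_errors(true); // tolerate invalid real-world markup
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// string() yields the text of the first matching node, or '' if none match.
$value = trim($xpath->evaluate(
    'string(//p[contains(., "DJIA All Time")]//b/font)'
));

echo $value, "\n"; // prints 21,271.97
```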
The PHP I have right now is only half working and it is a little clunky. I'm looking to display the 3 most recent press releases from an XML feed that match a specific value type. What I have right now is only looking at the first three items, and just echoing the ones that match the value. I'm also pretty sure DOM object is not the best approach here, but had issues getting xparse to work properly. Some help with the logic would be greatly appreciated.
//create new document object
$dom_object = new DOMDocument();
//load xml file
$dom_object->load("http://cws.huginonline.com/A/138060/releases_999_all.xml");
$cnt = 0;
foreach ($dom_object->getElementsByTagName('press_release') as $node) {
if($cnt == 3 ) {
break;
}
$valueID = $node->getAttribute('id');
$valueType = $node->getAttribute('type');
$headline = $dom_object->getElementsByTagName("headline");
$headlineContent = $headline->item(0)->nodeValue;
$releaseDate = $dom_object->getElementsByTagName("published");
$valueDate = $releaseDate->item(0)->getAttribute('date');
$cnt++;
if ($valueType == 5) {
echo "<div class=\"newsListItem\"> <p> $valueDate </p> <h4>$headlineContent</h4><p></p></div>";
}
}
DOM has the ability to execute Xpath expressions on the XML tree to fetch nodes and scalar values. Use DOMXpath::evaluate(), not just the DOM methods.
But the "3 most recent" is not a filter. It is a sort with a limit. You will have to read each item and keep the 3 "newest" or read all of them into a list, sort it and get the first 3 from the list.
You can do this in PHP or using XSLT.
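A minimal sketch of the filter, sort, and limit steps in PHP. The sample XML is hypothetical and only mirrors the structure the question's code implies (a type attribute, a published element with a date attribute, and a headline element); it also assumes the dates are ISO-formatted so that they sort correctly as strings. Each value is read relative to the current press_release node, whereas the question's loop calls getElementsByTagName() on the whole document and so always reads the first headline in the feed.

```php
<?php
// Hypothetical sample mirroring the structure the question's code implies.
$xml = <<<'XML'
<releases>
  <press_release id="1" type="5"><published date="2014-01-10"/><headline>A</headline></press_release>
  <press_release id="2" type="3"><published date="2014-03-01"/><headline>B</headline></press_release>
  <press_release id="3" type="5"><published date="2014-02-20"/><headline>C</headline></press_release>
  <press_release id="4" type="5"><published date="2014-04-05"/><headline>D</headline></press_release>
  <press_release id="5" type="5"><published date="2013-12-31"/><headline>E</headline></press_release>
</releases>
XML;

$document = new DOMDocument();
$document->loadXML($xml);
$xpath = new DOMXPath($document);

// 1. Filter: only releases of type 5.
$releases = [];
foreach ($xpath->evaluate('//press_release[@type = "5"]') as $node) {
    $releases[] = [
        // 2. Read values relative to the current node.
        'date'     => $xpath->evaluate('string(published/@date)', $node),
        'headline' => $xpath->evaluate('string(headline)', $node),
    ];
}

// 3. Sort newest first (assumes ISO dates, which compare as strings).
usort($releases, function (array $a, array $b): int {
    return strcmp($b['date'], $a['date']);
});

// 4. Limit to the three newest.
$latest = array_slice($releases, 0, 3);

foreach ($latest as $release) {
    echo $release['date'], ' ', $release['headline'], "\n";
}
```

If the real feed uses another date format, the comparison inside usort() would need to parse the dates first, for example with strtotime().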
this did the trick...
if ($valueType == 5)
{
if ($printCount < 3) {
echo "...";
$printCount++;
}
}
I started mixing XML with PHP today and I'm pretty bad at it, even though it looks super simple.
Right now, I'm trying to make something that sounds very easy but I can't understand how it works. I'm basically trying to create a fake mailbox for a game.
So I stored my emails in an XML file, classed by categories (received, sent, etc.). I managed to get the list of emails depending on the category, but I can't get to the part where I click on an email and it shows the content of this particular email.
Here is my simplified code:
XML :
<?php
$xmlstr = <<<XML
<mailbox>
<received>
<expediter>James</expediter>
<content>Blah blah blah</content>
</received>
<received>
<expediter>Paul</expediter>
<content>Bluh bluh bluh</content>
</received>
<sent>
<expediter>Jack</expediter>
<content>Blah blah blah</content>
</sent>
<sent>
<expediter>John</expediter>
<content>Bluh bluh bluh</content>
</sent>
</mailbox>
XML;
?>
PHP :
<?php
include 'emails.php';
$emails = new SimpleXMLElement($xmlstr);
$cat = $_GET['cat'];
if(!isset($_GET['id'])){
$i = 0;
foreach($emails->$cat as $mailbox){
echo ''.$mailbox->expediter.'<br />';
$i++;
}
}
else{
$id = $_GET['id'];
echo $emails->$cat[$id]->content;
}
?>
So if there is no ID in the URL, it shows the list of expediters with links to the emails, and if there is an ID in the URL, it should show the content of the email designated by this number.
It works if I write manually :
echo $emails->received[1]->content;
But of course, I want that part to be dynamic and it doesn't work with :
echo $emails->$cat[$id]->content;
Is there any way to do that?
Thank you!
Camille
Try this:
$a = new stdClass();
$b = new stdClass();
$b->field = 5;
$a->list = array(
1 => $b
);
print_r($a);
$param = 'list';
$id = 1;
print_r($a->list[1]->field); // outputs 5;
print_r($a->{$param}[$id]->field); // outputs 5;
The key is:
$a->{$param}[$id]->field // notice the curly brackets.
Adapting to your question, you should use:
echo $emails->{$cat}[$id]->content;
As a good practice, you might want to check if it exists first:
if(isset($emails->{$cat}[$id])){
// echo it here, after you know it exists
}
You can see it online at 3v4l example
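Applied to the mailbox itself, the curly-bracket syntax could look like the sketch below. The whitelist of categories and the integer cast are additions that are not in the question: SimpleXML uses string offsets for attribute access and integer offsets for the n-th element, and request parameters arrive as strings, so casting the id is the cautious choice. The fixed $cat and $id values stand in for $_GET.

```php
<?php
$xmlstr = <<<'XML'
<mailbox>
  <received><expediter>James</expediter><content>Blah blah blah</content></received>
  <received><expediter>Paul</expediter><content>Bluh bluh bluh</content></received>
  <sent><expediter>Jack</expediter><content>Blah blah blah</content></sent>
</mailbox>
XML;

$emails = new SimpleXMLElement($xmlstr);

// Stand-ins for $_GET['cat'] and $_GET['id'], which arrive as strings.
$cat = 'received';
$id  = '1';

// Assumption: only these categories exist; reject anything else from the URL.
$allowed = ['received', 'sent'];
// SimpleXML uses string offsets for attributes, so index with an integer.
$index = (int) $id;

if (in_array($cat, $allowed, true) && isset($emails->{$cat}[$index])) {
    $content = (string) $emails->{$cat}[$index]->content;
} else {
    $content = 'No such email';
}

echo $content, "\n"; // prints Bluh bluh bluh
```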
I have a PHP array with a lot of XML user-file URLs:
$tab_users[0]=john.xml
$tab_users[1]=chris.xml
$tab_users[n...]=phil.xml
For each user a <zoom> tag is filled or not, depending if user filled it up or not:
john.xml = <zoom>Some content here</zoom>
chris.xml = <zoom/>
phil.xml = <zoom/>
I'm trying to explore the users' data and display the first filled <zoom> tag, but randomized: each time you reload the page, the <div id="zoom"> content is different.
$rand=rand(0,$n); // $n is the number of users
$datas_zoom=zoom($n,$rand);
My PHP function
function zoom($n,$rand) {
global $tab_users;
$datas_user=new SimpleXMLElement($tab_users[$rand],null,true);
$tag=$datas_user->xpath('/user');
//if zoom found
if($tag[0]->zoom !='') {
$txt_zoom=$tag[0]->zoom;
}
... some other stuff here
// no "zoom" value found
if ($txt_zoom =='') {
echo 'RAND='.$rand.' XML='.$tab_users[$rand].'<br />';
$datas_zoom=zoom($r,$n,$rand); } // random zoom fct again and again till...
}
else {
echo 'ZOOM='.$txt_zoom.'<br />';
return $txt_zoom; // we got it!
}
}
echo '<br />Return='.$datas_zoom;
The problem is: when by chance the first XML file explored contains "zoom" information, the function returns it, but if not, nothing is returned. An example of the results when the first one happens to be the good one:
// for RAND=0, XML=john.xml
ZOOM=Anything here
Return=Some content here // we're lucky
Unlucky:
RAND=1 XML=chris.xml
RAND=2 XML=phil.xml
// then for RAND=0 and XML=john.xml
ZOOM=Anything here
// content found but Return is empty
Return=
What's wrong?
I suggest importing the values into a database table, generating a single local file or something like that. So that you don't have to open and parse all the XML files for each request.
Reading multiple files is a lot slower than reading a single file. And with a database, even the random logic can be moved to SQL.
You are currently using SimpleXML, but fetching a single value from an XML document is actually easier with DOM. SimpleXMLElement::xpath() only supports Xpath expressions that return a node list, but DOMXpath::evaluate() can return the scalar value directly:
$document = new DOMDocument();
$document->load($xmlFile);
$xpath = new DOMXpath($document);
$zoomValue = $xpath->evaluate('string(//zoom[1])');
//zoom[1] will fetch the first zoom element node in a node list. Casting the list into a string will return the text content of the first node or an empty string if the list was empty (no node found).
For the sake of this example assume that you generated an XML like this
<zooms>
<zoom user="u1">z1</zoom>
<zoom user="u2">z2</zoom>
</zooms>
In this case you can use Xpath to fetch all zoom nodes and get a random node from the list.
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXpath($document);
$zooms = $xpath->evaluate('//zoom');
$zoom = $zooms->item(mt_rand(0, $zooms->length - 1));
var_dump(
[
'user' => $zoom->getAttribute('user'),
'zoom' => $zoom->textContent
]
);
Your main issue is that you are not returning any value when there is no zoom found.
$datas_zoom=zoom($r,$n,$rand); // no return keyword here!
When you're using recursion, you usually want to "chain" return values on and on, till you find the one you need. $datas_zoom is not a global variable and it will not "leak out" outside of your function. Please read the php's variable scope documentation for more info.
Then again, you're calling the zoom function with three arguments ($r, $n, $rand) while the function only accepts two ($n and $rand). Also, $r is undefined, $n is not used at all, and you are most likely trying to use the same $rand value again and again, which obviously cannot work.
Also note that there are too many closing braces in your code.
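To make the point about chaining return values concrete, here is a minimal sketch of a recursive version that does return its result. It is not the asker's function: the name `findZoom()` is hypothetical, and in-memory XML strings stand in for john.xml, chris.xml and phil.xml so that the example is self-contained. Each tried user is removed from the pool, so the recursion always terminates.

```php
<?php
// Hypothetical in-memory stand-ins for john.xml, chris.xml and phil.xml.
$tab_users = [
    '<user><zoom>Some content here</zoom></user>',
    '<user><zoom/></user>',
    '<user><zoom/></user>',
];

function findZoom(array $users): ?string
{
    if ($users === []) {
        return null; // base case: every user has been tried
    }
    // Pick one remaining user at random and remove it from the pool,
    // so the same document is never tried twice.
    $index = array_rand($users);
    $xml = new SimpleXMLElement($users[$index]);
    unset($users[$index]);

    $text = (string) $xml->zoom;
    if ($text !== '') {
        return $text; // found a filled <zoom>
    }
    // The crucial part: pass the recursive result back up the call chain.
    return findZoom($users);
}

$datas_zoom = findZoom($tab_users);
echo 'Return=', $datas_zoom, "\n"; // prints Return=Some content here
```

Without the final return, the inner call would still find the value, but the outer call would fall off the end of the function and yield null, which matches the empty "Return=" in the question.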
I think the best approach for your problem will be to shuffle the array and then to use it like FIFO without recursion (which should be slightly faster):
function zoom($tab_users) {
// shuffle an array once
shuffle($tab_users);
// init variable
$txt_zoom = null;
// repeat until zoom is found or there
// are no more elements in array
do {
$rand = array_pop($tab_users);
$datas_user = new SimpleXMLElement($rand, null, true);
$tag=$datas_user->xpath('/user');
//if zoom found
if($tag[0]->zoom !='') {
$txt_zoom=$tag[0]->zoom;
}
} while(!$txt_zoom && !empty($tab_users));
return $txt_zoom;
}
$datas_zoom = zoom($tab_users); // your zoom is here!
Please read more about php scopes, php functions and recursion.
There's no reason for recursion. A simple loop would do.
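$datas_user = new SimpleXMLElement($tab_users[$rand], null, true);
$tag = $datas_user->xpath('/user');
// xpath() returns a plain array, so use count() rather than ->length
$max = count($tag);
while (true) {
    // rand() is inclusive on both ends, so the upper bound is $max - 1
    $test_index = rand(0, $max - 1);
    if ($tag[$test_index]->zoom != "") {
        break;
    }
}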
Of course, you might want to add a bit more logic to handle the case where NO zooms have text set, in which case the above would be an infinite loop.