Extract info from html? - php

First of all, I've seen a good deal of similar questions. I know regex or DOM can be used, but I can't find any good examples of DOM, and regex makes me pull my hair out. In addition, I need to pull out multiple values from the HTML source: some are simply element contents, some are attributes.
Here is an example of the html I need to get info from:
<div class="log">
<div class="message">
<abbr class="dt" title="time string">
DATA_1
</abbr>
:
<cite class="user">
<a class="tel" href="tel:+xxxx">
<abbr class="fn" title="DATA_2">
Me
</abbr>
</a>
</cite>
:
<q>
DATA_3
</q>
</div>
</div>
The "message" block may occur once or hundreds of times. I am trying to end up with data like this:
array(4) {
[0] => array(3) {
["time"] => "DATA_1"
["name"] => "DATA_2"
["message"] => "DATA_3"
}
[1] => array(3) {
["time"] => "DATA_1"
["name"] => "DATA_2"
["message"] => "DATA_3"
}
[2] => array(3) {
["time"] => "DATA_1"
["name"] => "DATA_2"
["message"] => "DATA_3"
}
[3] => array(3) {
["time"] => "DATA_1"
["name"] => "DATA_2"
["message"] => "DATA_3"
}
}
I tried using simplexml but it only seems to work on very simple html pages. Could someone link me to some examples? I get really confused since I need to get DATA_2 from a title attribute. What do you think is the best way to extract this data? It seems very similar to XML extraction, which I have done, but I need to use some other method.

Here is an example using DOMDocument and DOMXpath to parse your HTML.
$doc = new DOMDocument;
$doc->loadHTMLFile('your_file.html');
$xpath = new DOMXPath($doc);
$res = array();
// One result row per <div class="message">
foreach ($xpath->query('//div[@class="message"]') as $elem) {
    $res[] = array(
        // Paths are relative to the current message element
        'time' => $xpath->query('abbr[@class="dt"]', $elem)->item(0)->nodeValue,
        'name' => $xpath->query('cite/a/abbr[@class="fn"]', $elem)->item(0)->getAttribute('title'),
        'message' => $xpath->query('q', $elem)->item(0)->nodeValue,
    );
}

Can I suggest using xPath? It seems like a perfect candidate for what you want to do (but I may be misinterpreting what you're asking).
XPath will let you select particular nodes of an XML/HTML tree, and then you can operate on them from there. After that, it should be a simple task (or a tiny bit of simple regex at most). Personally, I love regex, so let me know if you need help with that.
Your XPath statements will look something like (assuming no conflicting names):
time (data 1):
/div/div/abbr/text()
name (data 2):
/div/div/cite/a/abbr/@title
message (data 3):
/div/div/q/text()
You can get more technical than this if, for example, you want to identify the elements via their attributes, but what I've given you will be pretty fast.
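As a rough sketch of identifying elements via their attributes, the following applies attribute-based XPath expressions to the sample markup from the question. It assumes PHP's bundled DOM extension; the function name extractMessages and the trimming of surrounding whitespace are illustrative choices, not something taken from the answers above.

```php
<?php
// Hypothetical helper: returns one array per <div class="message"> block.
function extractMessages(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $rows = [];
    foreach ($xpath->query('//div[@class="message"]') as $message) {
        $rows[] = [
            // string(...) yields '' instead of failing when a node is missing.
            'time'    => trim($xpath->evaluate('string(abbr[@class="dt"])', $message)),
            'name'    => $xpath->evaluate('string(cite/a/abbr[@class="fn"]/@title)', $message),
            'message' => trim($xpath->evaluate('string(q)', $message)),
        ];
    }
    return $rows;
}

$html = <<<HTML
<div class="log">
  <div class="message">
    <abbr class="dt" title="time string">DATA_1</abbr>:
    <cite class="user"><a class="tel" href="tel:+xxxx"><abbr class="fn" title="DATA_2">Me</abbr></a></cite>:
    <q>DATA_3</q>
  </div>
</div>
HTML;

print_r(extractMessages($html));
```

Using evaluate() with string(...) avoids calling item(0) on an empty node list when a message block lacks one of the elements.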


Simple way to read variables on different lines from STDIN?

I want to read two integers on two lines like:
4
5
This code works:
fscanf(STDIN,"%d",$num);
fscanf(STDIN,"%d",$v);
But I wonder if there's a shorter way to write this? (For more variables, I don't want to write a statement for each variable) Like:
//The following two lines leaves the second variable to be NULL
fscanf(STDIN,"%d%d",$num,$v);
fscanf(STDIN,"%d\n%d",$num,$v);
Update: I solved this using the method provided in the answer to read an array and list to assign variables from an array.
Consider this example:
<?php
$formatCatalog = '%d,%s,%s,%d';
$inputValues = [];
foreach (explode(',', $formatCatalog) as $formatEntry) {
fscanf(STDIN, trim($formatEntry), $inputValues[]);
}
var_dump($inputValues);
When executing and feeding it with
1
foo
bar
4
you will get this output:
array(4) {
[0] =>
int(1)
[1] =>
string(3) "foo"
[2] =>
string(3) "bar"
[3] =>
int(4)
}
Bottom line: you certainly can use loops or similar for the purpose and this can shorten your code a bit. Most of all it simplifies its maintenance. However if you want to specify a format to read with each iteration, then you do need to specify that format somewhere. That is why shortening the code is limited...
Things are different if you do not want to handle different types of input formats. In that case you can use a generic loop:
<?php
$inputValues = [];
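// Loop on fscanf()'s return value: feof() only turns true after a read has
// already failed, which would append a trailing NULL entry to the array.
while (fscanf(STDIN, '%d', $value) === 1) {
    $inputValues[] = $value;
}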
var_dump($inputValues);
Now if you feed this with
1
2
3
on standard input and then detach the input (by pressing CTRL-D for example), then the output you get is:
array(3) {
[0] =>
int(1)
[1] =>
int(2)
[2] =>
int(3)
}
The same code is obviously usable with input redirection, so you can feed a file into the script, which makes detaching the standard input unnecessary.
If you can, read the values into an array, with the first input line giving the number of values that follow:
fscanf(STDIN, "%d\n", $n);
$num=array();
while($n--){
fscanf(STDIN, "%d\n", $num[]);
}
print_r($num);
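To tie the array approach back to the original goal of separate variables, the values can be read into an array and then unpacked with list(), as the question's update describes. This is a minimal sketch, assuming one integer per line; readInts is a hypothetical helper name, and an in-memory stream stands in for STDIN so the example runs without typed input.

```php
<?php
// Hypothetical helper: reads up to $count integers, one per line, from a stream.
function readInts($stream, int $count): array
{
    $values = [];
    // Stop early if a line is missing or does not start with an integer.
    while ($count-- > 0 && fscanf($stream, '%d', $value) === 1) {
        $values[] = $value;
    }
    return $values;
}

// An in-memory stream stands in for STDIN here; pass STDIN in a real script.
$input = fopen('php://memory', 'r+');
fwrite($input, "4\n5\n");
rewind($input);

// Unpack the array into the two variables from the question.
list($num, $v) = readInts($input, 2);
var_dump($num, $v);
```

Adding a third variable then only means changing the count and the list() call, not adding another fscanf() statement.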

PHP/SimpleXML - Arrays generated differently for single child and multiple children

I'm using SimpleXML to parse an XML feed of property listings from different realtors. The relevant section of the XML feed looks something like this:
<branch name="Trustee Realtors">
<properties>
<property>
<reference>1</reference>
<price>275000</price>
<bedrooms>3</bedrooms>
</property>
<property>
<reference>2</reference>
<price>350000</price>
<bedrooms>4</bedrooms>
</property>
<property>
<reference>3</reference>
<price>128500</price>
<bedrooms>4</bedrooms>
</property>
</properties>
</branch>
<branch name="Quick-E-Realty Inc">
<properties>
<property>
<reference>4</reference>
<price>180995</price>
<bedrooms>3</bedrooms>
</property>
</properties>
</branch>
and is then converted to an array like this:
$xml = file_get_contents($filename);
$xml = simplexml_load_string($xml);
$xml_array = json_decode(json_encode((array) $xml), 1);
$xml_array = array($xml->getName() => $xml_array);
The issue I'm having is that when the array is created the data for the single listing is in a different position in the array to the multiple listings - I'm not sure exactly how to explain this, but if I var_dump() the array for the multiple items it looks like this:
array(3) {
[0]=>
array(3) {
["reference"]=>
string(4) "0001"
["price"]=>
string(6) "275000"
["bedrooms"]=>
int(3)
}
[1]=>
array(3) {
["reference"]=>
string(4) "0002"
["price"]=>
string(6) "350000"
["bedrooms"]=>
int(4)
}
[2]=>
array(3) {
["reference"]=>
string(4) "0003"
["price"]=>
string(6) "128500"
["bedrooms"]=>
int(2)
}
}
If I var_dump() the array for the single listing it looks like this:
array(3) {
["reference"]=>
string(4) "0004"
["price"]=>
string(6) "180995"
["bedrooms"]=>
int(3)
}
But what I need it to look like is this:
array(1) {
[0]=>
array(3) {
["reference"]=>
string(4) "0004"
["price"]=>
string(6) "180995"
["bedrooms"]=>
int(3)
}
}
Each of these arrays represents the property listings from a single realtor. I'm not sure whether this is just the way that SimpleXML or the json functions work but what I need is for the same format to be used (the array containing the property listing to be the value of the [0] key).
Thanks in advance!
SimpleXML is quirky like this. I used it recently trying to make configuration files "easier" to write up and found out in the process that SimpleXML doesn't always act consistently. In this case I think you will benefit from simply detecting whether a <property> is the only one in a set, and if so, wrapping it in an array by itself before sending it to your loop.
NOTE: ['root'] is there because I needed to wrap a '<root></root>' element around your XML to make my test work.
//Rebuild the properties listings
$rebuild = array();
foreach($xml_array['root']['branch'] as $key => $branch) {
$branchName = $branch['@attributes']['name'];
//Check to see if 'properties' is only one, if it
//is then wrap it in an array of its own.
if(is_array($branch['properties']['property']) && !isset($branch['properties']['property'][0])) {
//Only one property found, wrap it in an array
$rebuild[$branchName] = array($branch['properties']['property']);
} else {
//Multiple properties found
$rebuild[$branchName] = $branch['properties']['property'];
}
}
That takes care of rebuilding your properties. It feels a little hackish. But basically you are detecting for the lack of a multi-dimensional array here:
if(is_array($branch['properties']['property']) && !isset($branch['properties']['property'][0]))
If you don't find a multi-dimensional array then you explicitly make one of the single <property>. Then to test that everything was rebuilt correctly you can use this code:
//Now do your operation...whatever it is.
foreach($rebuild as $branch => $properties) {
print("Listings for $branch:\n");
foreach($properties as $property) {
print("Reference of " . $property['reference'] . " sells at $" . $property['price'] . " for " . $property['bedrooms'] . " bedrooms.\n");
}
print("\n");
}
This produces the following output:
Listings for Trustee Realtors:
Reference of 1 sells at $275000 for 3 bedrooms.
Reference of 2 sells at $350000 for 4 bedrooms.
Reference of 3 sells at $128500 for 4 bedrooms.
Listings for Quick-E-Realty Inc:
Reference of 4 sells at $180995 for 3 bedrooms.
And a dump of the rebuild will produce:
Array
(
[Trustee Realtors] => Array
(
[0] => Array
(
[reference] => 1
[price] => 275000
[bedrooms] => 3
)
[1] => Array
(
[reference] => 2
[price] => 350000
[bedrooms] => 4
)
[2] => Array
(
[reference] => 3
[price] => 128500
[bedrooms] => 4
)
)
[Quick-E-Realty Inc] => Array
(
[0] => Array
(
[reference] => 4
[price] => 180995
[bedrooms] => 3
)
)
)
I hope that helps you out getting closer to a solution to your problem.
The big massive "think outside the box" question to ask yourself here is: why are you converting the SimpleXML object to an array in the first place?
SimpleXML is not just a library for parsing XML and then using something else to manipulate it, it's designed for exactly the kind of thing you're about to do with that array.
In fact, this problem of sometimes having single elements and sometimes multiple is one of the big advantages it has over a plain array representation: for nodes that you know will be single, you can leave off the [0]; but for nodes you know might be multiple, you can use [0], or a foreach loop, and that will work too.
Here are some examples of why SimpleXML lives up to its name with your XML:
$sxml = simplexml_load_string($xml);
// Looping over multiple nodes with the same name
// We could also use $sxml->children() to loop regardless of name
// or even the shorthand foreach ( $sxml as $children )
foreach ( $sxml->branch as $branch ) {
// Access an attribute using array index notation
// the (string) is optional here, but good habit to avoid
// passing around SimpleXML objects by mistake
echo 'The branch name is: ' . (string)$branch['name'] . "\n";
// We know there is only one <properties> node, so we can take a shortcut:
// $branch->properties means the same as $branch->properties[0]
// We don't know if there are 1 or many <property> nodes, but it
// doesn't matter: we're asking to loop over them, so SimpleXML
// knows what we mean
foreach ( $branch->properties->property as $property ) {
echo 'The property reference is ' . (string)$property->reference . "\n";
}
}
Basically, whenever I see that ugly json_decode(json_encode( trick, I cringe a little, because 99 times out of 100 the code that follows is much uglier than just using SimpleXML.
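As a small check of that claim, the sketch below runs the same loop over a branch with several <property> nodes and a branch with only one, using a trimmed-down version of the feed from the question. The <root> wrapper and the function name listReferences are assumptions added so the snippet parses and runs by itself.

```php
<?php
// Hypothetical helper: collects the property references for each branch name.
function listReferences(string $xml): array
{
    $result = [];
    foreach (simplexml_load_string($xml)->branch as $branch) {
        $refs = [];
        // Iterates once per <property>, whether there is one or many.
        foreach ($branch->properties->property as $property) {
            $refs[] = (string) $property->reference;
        }
        $result[(string) $branch['name']] = $refs;
    }
    return $result;
}

$xml = <<<XML
<root>
  <branch name="Trustee Realtors">
    <properties>
      <property><reference>1</reference></property>
      <property><reference>2</reference></property>
    </properties>
  </branch>
  <branch name="Quick-E-Realty Inc">
    <properties>
      <property><reference>4</reference></property>
    </properties>
  </branch>
</root>
XML;

print_r(listReferences($xml));
```

Both branches come back as lists, so no special case for the single listing is needed.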
One possibility is reading the XML with DOM+XPath. XML cannot be converted to JSON generically, but building a specific JSON structure for a specific XML document is easy:
$dom = new DOMDocument();
$dom->loadXml($xml);
$xpath = new DOMXPath($dom);
$result = [];
foreach ($xpath->evaluate('//branch') as $branchNode) {
$properties = [];
foreach ($xpath->evaluate('properties/property', $branchNode) as $propertyNode) {
$properties[] = [
'reference' => $xpath->evaluate('string(reference)', $propertyNode),
'price' => (int)$xpath->evaluate('string(price)', $propertyNode),
'bedrooms' => (int)$xpath->evaluate('string(bedrooms)', $propertyNode)
];
}
$result[] = [
'name' => $xpath->evaluate('string(@name)', $branchNode),
'properties' => $properties
];
}
echo json_encode($result, JSON_PRETTY_PRINT);
Output: https://eval.in/154352
[
{
"name": "Trustee Realtors",
"properties": [
{
"reference": "1",
"price": 275000,
"bedrooms": 3
},
{
"reference": "2",
"price": 350000,
"bedrooms": 4
},
{
"reference": "3",
"price": 128500,
"bedrooms": 4
}
]
},
{
"name": "Quick-E-Realty Inc",
"properties": [
{
"reference": "4",
"price": 180995,
"bedrooms": 3
}
]
}
]
Use the SimpleXMLElement Class:
<?php
$xml = "<body>
<item>
<id>2</id>
</item>
</body>";
$elem = new SimpleXMLElement($xml);
if($elem->children()->count() === 1){
$id = $elem->item->addChild(0)->addChild('id',$elem->item->id);
unset($elem->item->id);
};
$array = json_decode(json_encode($elem), true);
print_r($array);
Output:
Array
(
[item] => Array
(
[0] => Array
(
[id] => 2
)
)
)
Did you use this as the loop source?
$xml_array['branch']['properties']['property']
Try this instead:
$xml_array['branch']['properties']
Leave off ['property'] at the end of the line, so that you use two segments instead of three:
<?php
$xml = file_get_contents('simple.xml');
$xml = simplexml_load_string($xml);
$xml_array = json_decode(json_encode((array) $xml), 1);
$xml_array = array($xml->getName() => $xml_array);
print_r($xml_array);
foreach($xml_array['branch']['properties'] as $a){
print_r($a);
}
?>
In order to solve this problem, you should select using XPath (as others mention), but in my opinion this is not a very familiar tool to most web developers. I created a very small Composer-enabled package which solves this problem. Credit to the Symfony package CssSelector (https://symfony.com/doc/current/components/css_selector.html), which rewrites CSS selectors to XPath selectors. My package is just a thin wrapper that deals with what you will most commonly do with XML using PHP. You can find it here: https://github.com/diversen/simple-query-selector
use diversen\querySelector;
// Load simple XML document
$xml = simplexml_load_file('test2.xml');
// Get all branches as DOM elements
$elems = querySelector::getElementsAsDOM($xml, 'branch');
foreach($elems as $elem) {
// Get attribute name
echo $elem->attributes()->name . "\n";
// Get properties as array
$props = querySelector::getElementsAsAry($elem, 'property');
print_r($props); // You will get the array structure you expect
}
You could also (if you don't care about the branch name) just do:
$elems = querySelector::getElementsAsAry($xml, 'property');
Instead of rebuilding the array, you could just test whether the parsed XML contains multiple tags or a single tag that was converted to an array:
<?php
// isset() avoids an undefined-index notice when there is only a single tag
if (isset($info[0]) && is_array($info[0])) {
foreach ($info as $fields) {
// Do something...
}
} else {
// Do something else...
}
Try it=)
$xml = simplexml_load_string($xml_raw, "SimpleXMLElement", LIBXML_NOCDATA);
$json = json_encode($xml);
$array = json_decode($json, TRUE);
$marray['RepairSheets']['RepairSheet'][0] = $array['RepairSheets']['RepairSheet'];
$array = (isset($array['RepairSheets']['RepairSheet'][0]) == true) ? $array : $marray;

Parse Xml file for comparison

OK, this is driving me crazy.
I have been trying to parse an XML file into a specific array or object so I can compare it to a similar file to test for differences.
However I have had no luck. I have been attempting to use SimpleXMLIterator and SimpleXMLElement to do this.
Here are some samples:
<xml>
//This is the first record of 1073
<viddb>
<movies>1074</movies>
<movie>
<title>10.5</title>
<origtitle>10.5</origtitle>
<year>2004</year>
<genre>Disaster</genre>
<release></release>
<mpaa></mpaa>
<director>John Lafia</director>
<producers>Howard Braunstein, Jeffrey Herd</producers>
<actors>Kim Delaney, Fred Ward, Ivan Sergei</actors>
<description>An earthquake reaching a 10.5 magnitude on the Richter scale, strikes the west coast of the U.S. and Canada. A large portion of land falls into the ocean, and the situation is worsened by aftershocks and tsunami.</description>
<path>E:\www\Media\Videos\Disaster\10.5.mp4</path>
<length>164</length>
<size>3648</size>
<resolution>640x272</resolution>
<framerate>29.97</framerate>
<videocodec>AVC</videocodec>
<videobitrate>2966</videobitrate>
<label>Roku Media</label>
<poster>images/10.5.jpg</poster>
</movie>
Here is the object this record produces using $iter = new SimpleXMLIterator($xml, 0, TRUE);
object(SimpleXMLIterator)#71 (1) {
["viddb"] => object(SimpleXMLIterator)#72 (2) {
["movies"] => string(4) "1074"
["movie"] => array(1074) {
[0] => object(SimpleXMLIterator)#73 (19) {
["title"] => string(4) "10.5"
["origtitle"] => string(4) "10.5"
["year"] => string(4) "2004"
["genre"] => string(8) "Disaster"
["release"] => object(SimpleXMLIterator)#1158 (0) {
}
["mpaa"] => object(SimpleXMLIterator)#1159 (0) {
}
["director"] => string(10) "John Lafia"
["producers"] => string(31) "Howard Braunstein, Jeffrey Herd"
["actors"] => string(35) "Kim Delaney, Fred Ward, Ivan Sergei"
["description"] => string(212) "An earthquake reaching a 10.5 magnitude on the Richter scale, strikes the west coast of the U.S. and Canada. A large portion of land falls into the ocean, and the situation is worsened by aftershocks and tsunami."
["path"] => string(37) "E:\www\Media\Videos\Disaster\10.5.mp4"
["length"] => string(3) "164"
["size"] => string(4) "3648"
["resolution"] => string(7) "640x272"
["framerate"] => string(5) "29.97"
["videocodec"] => string(3) "AVC"
["videobitrate"] => string(4) "2966"
["label"] => string(10) "Roku Media"
["poster"] => string(15) "images/10.5.jpg"
}
What I'm trying to produce (at the moment) is a single level associative array for each movie . All the examples I've read on and followed always produced an array of arrays, which is much more difficult to work with.
This is where I'm at:
$iter = new SimpleXMLIterator($xml, 0, TRUE);
Zend_Debug::dump($iter);
//so far xpath has not worked for me, I can't get $result to return anything
$result = $iter->xpath('/xml/viddb/movies/movie');
$movies = array();
for ($iter->rewind(); $iter->valid(); $iter->next()) {
foreach ($iter->getChildren() as $key => $value) {
//I can get each movie title to echo but when I try to put them into an
// array it only has the last record
echo $value->title . '<br />';
$movies['title'] = $value->title;
}
}
return $movies;
I feel like I'm missing something simple and obvious...as usual :)
[EDIT]
I found my error, I was tripping over the array of objects thing. I had to cast the data I wanted as a string to make it work how I wanted. Just for info here is what I came up with to put me on the track I wanted:
public function indexAction() {
$xml = APPLICATION_PATH . '/../data/Videos.xml';
$iter = new SimpleXMLElement($xml, 0, TRUE);
$result = $iter->xpath('//movie');
$movies = array();
foreach ($result as $key => $movie) {
$movies[$key + 1] = (string) $movie->title;
}
Zend_Debug::dump($movies, 'Movies');
}
XPath is the answer you are looking for. I think the reason your XPath isn't working is that you are looking for a movie node under the movies node, when the movies node does not have any children.
Edit: I think it might be easier to just use a foreach loop instead of the iterator. I had to look up the iterator as I had never seen it before, even though I have been using simplexml and xpath for a while. Also, I believe you should only use SimpleXMLElement if you are planning on editing the XML as well. If you simply want to read it for comparison, it is best to use simplexml_load_file. You can also change your xpath to simply:
xpath('//movie');
If you just need to compare the entire file contents, read the contents of both files into a string and do a string comparison. Otherwise, you can do the same at a lower level of the document by getting the innerXML of any node.
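For a node-level comparison, one option is to flatten each <movie> into a single-level associative array keyed by title and diff the two sets. This is only a sketch: the function names loadMovies and diffMovies, the choice of <title> as the key, and the tiny inline documents are assumptions for illustration.

```php
<?php
// Hypothetical helper: maps each movie title to a flat array of its child values.
function loadMovies(string $xml): array
{
    $movies = [];
    foreach (simplexml_load_string($xml)->xpath('//movie') as $movie) {
        $fields = [];
        foreach ($movie->children() as $name => $value) {
            // Cast to string so plain values are stored, not SimpleXML objects.
            $fields[$name] = (string) $value;
        }
        $movies[$fields['title']] = $fields;
    }
    return $movies;
}

// Hypothetical helper: returns, per title, the fields of $a that differ from $b.
function diffMovies(array $a, array $b): array
{
    $changes = [];
    foreach ($a as $title => $fields) {
        $delta = array_diff_assoc($fields, $b[$title] ?? []);
        if ($delta !== []) {
            $changes[$title] = $delta;
        }
    }
    return $changes;
}

$old = '<viddb><movie><title>10.5</title><year>2004</year><size>3648</size></movie></viddb>';
$new = '<viddb><movie><title>10.5</title><year>2004</year><size>3700</size></movie></viddb>';

print_r(diffMovies(loadMovies($new), loadMovies($old)));
```

A movie that exists only in the first set is reported with all of its fields, because it is diffed against an empty array.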

array_key_exists returning false when array clearly has key

I'm doing some content importing using the node import module in drupal. My problem is that I'm getting errors on data that looks like it should be working smoothly. This is the code at issue:
if (count($allowed_values) && !array_key_exists($item['value'], $allowed_values)) { //$allowed_values[$item['value']] == NULL) {
print "||||" . $item['value'] . "||||";
print_r($allowed_values);
And this is a sample of what is printing:
||||1||||Array ( [0] => no [1] => Zicam® Cold Remedy Nasal Gel Spray Single Hole Actuator (“Jet”) ) ||||1||||Array ( [0] => No [1] => Yes )
It looks to me like it's saying that "1" is not in the array, when the printed array clearly shows a key of 1. If I replace the existing module code with the commented-out check, no error is thrown.
Your code is not complete and I cannot reproduce the error.
Allow me to adjust your example:
<?php
$item = array('value' => 1);
$allowed_values = array(0 => 'no',1 => 'yes');
echo "needle:";
var_dump($item['value']);
echo "haystack:";
var_dump($allowed_values);
if (count($allowed_values) && !array_key_exists($item['value'], $allowed_values)) {
echo "needle has not been found or haystack is empty\n";
} else {
echo "needle has been found\n";
}
gives the desired output:
needle:int(1)
haystack:array(2) {
[0]=>
string(2) "no"
[1]=>
string(3) "yes"
}
needle has been found
PHP also works when you assign the needle a string and not an integer, because a numeric string key such as "1" is converted to the integer 1. It is some sort of lossy type conversion that can be really convenient but also a pain in the ass. Often you don't know what's going on, and errors are caused.
But still. I bet you have something wrong with your variable types.
You should dump them and see what is really in there.
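One way such a mismatch can arise (an assumption, since the question does not show the raw value) is a key that prints as 1 but is not exactly the string "1", for example because of trailing whitespace. The sketch below shows how array_key_exists() treats a few such values; normalizeKey is a hypothetical helper.

```php
<?php
$allowed_values = [0 => 'No', 1 => 'Yes'];

// A numeric string in canonical form is treated as the integer key 1.
var_dump(array_key_exists('1', $allowed_values));    // bool(true)

// Trailing whitespace or a leading zero makes it a different, string key.
var_dump(array_key_exists("1\n", $allowed_values));  // bool(false)
var_dump(array_key_exists('01', $allowed_values));   // bool(false)

// Hypothetical helper: trims the value before it is used as a key.
function normalizeKey($value): string
{
    return trim((string) $value);
}

var_dump(array_key_exists(normalizeKey("1\n"), $allowed_values)); // bool(true)
```

Because print shows "1\n" and "1" almost identically, var_dump() is the more reliable way to see what the needle really contains.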

PHPquery lib. and parsing XML

I started using the phpquery thingy, but I got lost in all that documentation.
In case someone does not know what the hell I am talking about: http://code.google.com/p/phpquery/
My question is pretty much basic.
I succeeded at loading an XML document and now I want to parse all the tags from it.
Using pq()->find('title') I can output all of the contents inside the title tags. Great!
But I want to put every <title> tag in a variable. So, let's say that there are 10 <title> tags: I want every one of them in a separate variable, like $title1, $title2 ... $title10. How can this be done?
Hope you understand the question.
TIA!
You could do it like this:
phpQuery::unloadDocuments();
phpQuery::newDocument($content);
$allTitles = [];
pq('title')->each(function ($item) use (&$allTitles) {
$allTitles[] = pq($item)->text();
});
var_dump($allTitles);
For example if there are 3 titles in the $content this var_dump will output:
array(3) {
[0] =>
string(6) "title1"
[1] =>
string(6) "title2"
[2] =>
string(6) "title3"
}
