scrape data using regex and simplehtmldom

I am trying to scrape some data from this site: http://laperuanavegana.wordpress.com/. Specifically, I want the title of each recipe and its ingredients. The ingredients are located between two specific keywords. I am trying to get this data using regex and simplehtmldom, but the output is the full HTML text rather than just the ingredients. Here is my code:
<?php
include_once('simple_html_dom.php');
$base_url = "http://laperuanavegana.wordpress.com/";
traverse($base_url);
function traverse($base_url)
{
$html = file_get_html($base_url);
$k1="Ingredientes";
$k2="Preparación";
preg_match_all("/$k1(.*)$k2/s",$html->innertext,$out);
echo $out[0][0];
}
?>
There are multiple ingredient lists on this page and I want all of them, so I am using preg_match_all().
It would be helpful if anybody could spot the bug in this code.
Thanks in advance.

When you are already using an HTML parser (even a poor one like SimpleHtmlDom), why are you trying to mess up things with Regex then? That's like using a scalpel to open up the patient and then falling back to a sharpened spoon for the actual surgery.
Since I strongly believe no one should use SimpleHtmlDom because it has a poor codebase and is much slower than libxml based parsers, here is how to do it with PHP's native DOM extension and XPath. XPath is effectively the Regex or SQL for X(HT)ML documents. Learn it, so you will never ever have to touch Regex for HTML again.
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('https://laperuanavegana.wordpress.com/2011/06/11/ensalada-tibia-de-quinua-mango-y-tomate/');
libxml_clear_errors();
$recipe = array();
$xpath = new DOMXPath($dom);
$contentDiv = $dom->getElementById('content');
$recipe['title'] = $xpath->evaluate('string(div/h2/a)', $contentDiv);
foreach ($xpath->query('div/div/ul/li', $contentDiv) as $listNode) {
$recipe['ingredients'][] = $listNode->nodeValue;
}
print_r($recipe);
This will output:
Array
(
[title] => Ensalada tibia de quinua, mango y tomate
[ingredients] => Array
(
[0] => 250gr de quinua cocida tibia
[1] => 1 mango grande
[2] => 2 tomates
[3] => Unas hojas de perejil
[4] => Sal
[5] => Aceite de oliva
[6] => Vinagre balsámico
)
)
Note that we are not parsing http://laperuanavegana.wordpress.com/ but the actual blog post. The main URL will change content whenever the blog owner adds a new post.
To get all the Recipes from the main page, you can use
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('https://laperuanavegana.wordpress.com');
libxml_clear_errors();
$contentDiv = $dom->getElementById('content');
$xp = new DOMXPath($dom);
$recipes = array();
foreach ($xp->query('div/h2/a|div/div/ul/li', $contentDiv) as $node) {
echo
($node->nodeName === 'a') ? "\n# " : '- ',
$node->nodeValue,
PHP_EOL;
}
This will output
# Ensalada tibia de quinua, mango y tomate
- 250gr de quinua cocida tibia
- 1 mango grande
- 2 tomates
- Unas hojas de perejil
- Sal
- Aceite de oliva
- Vinagre balsámico
# Flan de lúcuma
- 1 lúcuma grandota o 3 pequeñas
- 1/2 litro de leche de soja evaporada
…
and so on
Also see
How do you parse and process HTML/XML in PHP?
DOMDocument in php
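If you would rather collect the recipes into an array than echo them, a minimal sketch could look like the following. The inline markup is a hypothetical stand-in that only imitates the structure the XPath queries above rely on (the real page may differ), and treating each direct child div of #content as one post is an assumption.

```php
<?php
// Hypothetical markup imitating the structure the queries above expect.
$html = <<<HTML
<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>
<body><div id="content">
  <div><h2><a href="#">Ensalada tibia</a></h2>
    <div><ul><li>1 mango grande</li><li>2 tomates</li></ul></div></div>
  <div><h2><a href="#">Flan de lúcuma</a></h2>
    <div><ul><li>1 lúcuma grandota</li></ul></div></div>
</div></body></html>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);
$recipes = array();
// Assumption: each direct child div of #content is one post.
foreach ($xp->query('//div[@id="content"]/div') as $post) {
    $title = $xp->evaluate('string(h2/a)', $post);
    $recipes[$title] = array();
    // Collect the list items that belong to this post only.
    foreach ($xp->query('div/ul/li', $post) as $li) {
        $recipes[$title][] = trim($li->nodeValue);
    }
}
print_r($recipes);
```

Querying relative to each post node keeps every ingredient attached to its own title, which the flat union query cannot guarantee if a post has no list.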

You need to add a question mark after the .* quantifier. It makes the pattern ungreedy; otherwise it will take everything from the first $k1 to the last $k2 on the page. With the question mark, each match ends at the next $k2.
preg_match_all("/$k1(.*?)$k2/s",$html->innertext,$out);
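To see the difference without hitting the site, here is a small sketch on a made-up string. The sample text is hypothetical and only mimics two recipes on one page; preg_quote() and the u flag are additions on my side to keep the keywords safe inside the pattern.

```php
<?php
// Hypothetical sample text standing in for the page HTML (not the real site).
$text = "Ingredientes quinua, mango Preparación mezclar. "
      . "Ingredientes lúcuma, leche Preparación batir.";

$k1 = preg_quote("Ingredientes", "/");
$k2 = preg_quote("Preparación", "/");

// Greedy: (.*) runs from the first $k1 to the LAST $k2, giving one big match.
preg_match_all("/$k1(.*)$k2/su", $text, $greedy);

// Lazy: (.*?) stops at the NEXT $k2, giving one match per recipe.
preg_match_all("/$k1(.*?)$k2/su", $text, $lazy);

echo count($greedy[1]), " greedy match(es)\n";
echo count($lazy[1]), " lazy match(es)\n";
print_r(array_map('trim', $lazy[1]));
```

Note that the captured text is in index 1 of the result; index 0 still contains the keywords themselves.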

Related

Extract specification list from text (unknown format)

How can I extract a specification from a product description whose format is unknown (sometimes it is an unordered list, sometimes br elements, etc.) but which ALWAYS looks the same on the front end?
Visually it looks like this:
Some description text, sometimes it is one sentence, sometimes more..
== sometimes there is an empty line here, sometimes not ==
spec item1
spec item2
Is there a way to extract it "by its visual appearance" in PHP?
Example:
<h2> desc <br>
<br>
> <strong> T Shirt</strong><br>
> Breathable mesh fabric<br>
> Reflective detail<br>
> Flat lock seams <br>

You could try and filter your entries. I've managed to get your example into an array. It would then be a case of a little wrangling with the result:
<?php
$html =<<<HTML
<h2> desc </h2>
<br>
> <strong> T Shirt</strong><br>
> Breathable mesh fabric<br>
> Reflective detail<br>
> Flat lock seams <br>
HTML;
$no_html = strip_tags($html);
$no_entities = preg_replace('/&#?[a-z0-9]+;/i', '', $no_html);
$parts = preg_split('/\R/', $no_entities);
$trimmed_parts = array_map('trim', $parts);
var_export($trimmed_parts);
Output:
array (
0 => 'desc',
1 => '',
2 => 'T Shirt',
3 => 'Breathable mesh fabric',
4 => 'Reflective detail',
5 => 'Flat lock seams',
)
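The "little wrangling" could be sketched like this. It assumes the first non-empty line is the description and everything after it is a spec item, which may not hold when the description spans several lines.

```php
<?php
// Result of the snippet above, copied here so the example runs on its own.
$trimmed_parts = array('desc', '', 'T Shirt', 'Breathable mesh fabric',
                       'Reflective detail', 'Flat lock seams');

// Drop empty lines, then reindex from 0.
$lines = array_values(array_filter($trimmed_parts, 'strlen'));

// Assumption: first non-empty line is the description, the rest are specs.
$description = array_shift($lines);
$specs = $lines;

echo $description, "\n";
print_r($specs);
```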

This can be done with file_get_contents() and some regex processing. Make sure allow_url_fopen is enabled in php.ini so that the fopen URL wrappers work.
Refer:
http://php.net/manual/en/filesystem.configuration.php
Sample code:
<?php
$page = file_get_contents('Provide your url here');
preg_match("/regex pattern here/", $page, $agent_name);
// display agent name matches
print_r($agent_name);
Personal suggestion: using Python would simplify the process. Many packages are already available for this purpose, e.g. bs4.

Loop through repeated XML entities using PHP

I am trying to use PHP to read a large XML file (gzipped). The file consists of repeated products (actually books). Each book has 1 or more contributors. This is an example of a product.
<Product>
<ProductIdentifier>
<IDTypeName>EAN.UCC-13</IDTypeName>
<IDValue>9999999999999</IDValue>
</ProductIdentifier>
<Contributor>
<SequenceNumber>1</SequenceNumber>
<ContributorRole>A01</ContributorRole>
<PersonNameInverted>Bloggs, Joe</PersonNameInverted>
</Contributor>
<Contributor>
<SequenceNumber>2</SequenceNumber>
<ContributorRole>A01</ContributorRole>
<PersonNameInverted>Jones, John</PersonNameInverted>
</Contributor>
<Contributor>
<SequenceNumber>3</SequenceNumber>
<ContributorRole>B01</ContributorRole>
<PersonNameInverted>Other, An</PersonNameInverted>
</Contributor>
</Product>
The output I would wish for this example is
Array
(
[1] => 9999999999999
[2] => Bloggs, Joe(A01)
[3] => Jones, John(A01)
[4] => Other, An(B01)
)
My code loads the gzipped XML file and handles the repeated sequence of products with no problem, but I cannot get it to handle the repeated sequence of contributors. My code for handling the products and the first contributor is shown below. I have tried various ways of looping through the contributors but cannot seem to achieve what I need. I'm a beginner with PHP and XML, although an IT professional for many years.
$reader = new XMLReader();
//load the selected XML file to the DOM
if(!$reader->open("compress.zlib://filename.xml.gz","r")){
die('Failed to open file!');
}
while ($reader->read()):
if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'Product')
{
$xml = simplexml_load_string($reader->readOuterXML());
list($result) = $xml->xpath('//ProductIdentifier[IDTypeName = "EAN.UCC-13"]');
$line[1] = (string)$result->IDValue;
list($result) = $xml->xpath('//Contributor');
$contributorname = (string)$result->PersonNameInverted;
$role = (string)$result->ContributorRole;
$line[2] = $contributorname."(".$role.")";
echo '<pre>'; print_r($line); echo '</pre>';
}
endwhile;

Since you have several contributors, you must handle it as an array and loop on them to prepare your final variable:
<?php
$reader = new XMLReader();
// open the gzipped XML file as a stream
// (the second argument of open() is an encoding, not a file mode, so "r" is dropped)
if (!$reader->open("compress.zlib://filename.xml.gz")) {
    die('Failed to open file!');
}
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'Product') {
        $xml = simplexml_load_string($reader->readOuterXML());
        // start a fresh result for every product so entries from a previous one do not linger
        $line = array();
        list($result) = $xml->xpath('//ProductIdentifier[IDTypeName = "EAN.UCC-13"]');
        $line[1] = (string)$result->IDValue;
        // get all contributors in an array
        $contributors = $xml->xpath('//Contributor');
        $i = 2;
        // go through all contributors
        foreach ($contributors as $contributor) {
            $contributorname = (string)$contributor->PersonNameInverted;
            $role = (string)$contributor->ContributorRole;
            $line[$i] = $contributorname."(".$role.")";
            $i++;
        }
        echo '<pre>'; print_r($line); echo '</pre>';
    }
}
This will give you the following output:
Array
(
[1] => 9999999999999
[2] => Bloggs, Joe(A01)
[3] => Jones, John(A01)
[4] => Other, An(B01)
)
EDIT: Some explanation of what is wrong in your code. Instead of taking all the contributors, you take only the first one with list()
http://php.net/manual/en/function.list.php (it assigns the elements of an array to a fixed list of variables). Since you don't know how many contributors there are (I guess...), you cannot use it here.
You then assign that single contributor to $line, so you always get only the first one.
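The difference can be reproduced in isolation. This is a minimal sketch on a trimmed-down, inline version of the sample product rather than the real gzipped file:

```php
<?php
// Trimmed-down product based on the sample in the question.
$xml = simplexml_load_string(
    '<Product>'
    . '<Contributor><PersonNameInverted>Bloggs, Joe</PersonNameInverted></Contributor>'
    . '<Contributor><PersonNameInverted>Jones, John</PersonNameInverted></Contributor>'
    . '</Product>'
);

// list() with a single variable keeps only element 0 of the result array.
list($first) = $xml->xpath('//Contributor');
echo (string)$first->PersonNameInverted, "\n";

// Iterating over the full result visits every contributor.
$names = array();
foreach ($xml->xpath('//Contributor') as $contributor) {
    $names[] = (string)$contributor->PersonNameInverted;
}
print_r($names);
```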

How to Parse the ajax script in DOM Parser?

I am parsing scores from http://sports.in.msn.com/football-world-cup-2014/south-africa-v-brazil/1597383
I am able to parse all the attributes, but I am not able to parse the time.
I used
$homepages = file_get_html("http://sports.in.msn.com/football-world-cup-2014/south-africa-v-brazil/1597383");
$teama = $homepages->find('span[id="clock"]');
Kindly help me

Since that particular site loads the values dynamically (through an AJAX request), you can't really parse the value from the initial page load.
<span id="clock"></span> // this tends to be empty on initial load
Normal scraping:
$homepages = file_get_contents("http://sports.in.msn.com/football-world-cup-2014/south-africa-v-brazil/1597383");
$doc = new DOMDocument();
@$doc->loadHTML($homepages);
$xpath = new DOMXPath($doc);
$query = $xpath->query("//span[@id='clock']");
foreach($query as $value) {
echo $value->nodeValue; // the traversal is correct, but this will be empty
}
My suggestion: instead of scraping it, fetch the value through a request as well, since it is a time and keeps changing until the game has ended. You can simply use the site's own request:
$url = 'http://sports.in.msn.com/liveplayajax/SOCCERMATCH/match/gsm/en-in/1597383';
$contents = file_get_contents($url);
$data = json_decode($contents, true);
echo '<pre>';
print_r($data);
echo '</pre>';
Should yield something like (a part of it actually):
[2] => Array
(
[Code] =>
[CommentId] => -1119368663
[CommentType] => manual
[Description] => FULL-TIME: South Africa 0-5 Brazil.
[Min] => 90'
[MinExtra] => (+3)
[View] =>
[ViewHint] =>
[ViewIndex] => 0
[EditKey] =>
[TrackingValues] =>
[AopValue] =>
)
You should get the 90' by using foreach. Consider this example:
foreach($data['Commentary']['CommentaryItems'] as $key => $value) {
if(stripos($value['Description'], 'FULL-TIME') !== false) {
echo $value['Min'];
break;
}
}
Should print: 90'
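If you also want the added time and a guard against an unexpected response, a sketch along these lines might help. The JSON string is a hypothetical, shortened payload shaped like the output shown above; the live endpoint may return different keys or no longer exist.

```php
<?php
// Hypothetical, shortened payload shaped like the response shown above.
$contents = '{"Commentary":{"CommentaryItems":['
    . '{"Description":"Goal! Brazil score.","Min":"10\'","MinExtra":""},'
    . '{"Description":"FULL-TIME: South Africa 0-5 Brazil.","Min":"90\'","MinExtra":"(+3)"}'
    . ']}}';

$data = json_decode($contents, true);
// json_decode() returns null on invalid JSON, so check before indexing.
if (!is_array($data) || !isset($data['Commentary']['CommentaryItems'])) {
    throw new RuntimeException('Unexpected response');
}

$clock = null;
foreach ($data['Commentary']['CommentaryItems'] as $item) {
    if (stripos($item['Description'], 'FULL-TIME') !== false) {
        // Combine the minute with the added time, e.g. 90' (+3)
        $clock = trim($item['Min'] . ' ' . $item['MinExtra']);
        break;
    }
}
echo $clock, "\n";
```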

PHP Web scraping of Javascript generated contents [duplicate]

I am stuck with a scraping task in my project.
I want to grab the data from the link in $html, namely all the table content in the tr and td elements. Here I am trying to grab the links, but it only shows javascript: self.close()
<?php
include("simple_html_dom.php");
$html = file_get_html('http://www.areacodelocations.info/allcities.php?ac=201');
foreach($html->find('a') as $element)
echo $element->href . '<br>';
?>

Usually, this kind of page loads a bunch of JavaScript (jQuery, etc.), which then builds the interface and retrieves the data to be displayed from a data source.
So what you need to do is open that page in Firefox or similar, with a tool such as Firebug, in order to see what requests are actually being made. If you're lucky, you will find it directly in the list of XHR requests. As in this case:
http://www.govliquidation.com/json/buyer_ux/salescalendar.js
Notice that this course of action may infringe on some license or terms of use. Clear this with the webmaster/data source/copyright owner before proceeding: detecting and forbidding this kind of scraping is very easy, and identifying you is probably only slightly less so.
Anyway, if you issue the same call in PHP, you can directly scrape the data (provided there is no session/authentication issue, as seems the case here) with very simple code:
<?php
$url = "http://www.govliquidation.com/json/buyer_ux/salescalendar.js";
$json = file_get_contents($url);
$data = json_decode($json);
?>
This yields a data object that you can inspect and convert to CSV by simple looping.
stdClass Object
(
[result] => stdClass Object
(
[events] => Array
(
[0] => stdClass Object
(
[yahoo_dur] => 11300
[closing_today] => 0
[language_code] => en
[mixed_id] => 9297
[event_id] => 9297
[close_meridian] => PM
[commercial_sale_flag] => 0
[close_time] => 01/06/2014
[award_time_unixtime] => 1389070800
[category] => Tires, Parts & Components
[open_time_unixtime] => 1388638800
[yahoo_date] => 20140102T000000Z
[open_time] => 01/02/2014
[event_close_time] => 2014-01-06 17:00:00
[display_event_id] => 9297
[type_code] => X3
[title] => Truck Drive Axles @ Killeen, TX
[special_flag] => 1
[demil_flag] => 0
[google_close] => 20140106
[event_open_time] => 2014-01-02 00:00:00
[google_open] => 20140102
[third_party_url] =>
[bid_package_flag] => 0
[is_open] => 1
[fda_count] => 0
[close_time_unixtime] => 1389045600
You retrieve $data->result->events, use fputcsv() on its items converted to array form, and Bob's your uncle.
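That last step could be sketched as follows. The JSON here is a hypothetical two-event payload with only a few of the fields listed above, and php://memory is used so the example does not touch the disk; in practice you would open a real output file.

```php
<?php
// Hypothetical two-event payload with a few of the fields listed above.
$json = '{"result":{"events":['
    . '{"event_id":9297,"title":"Truck Drive Axles","open_time":"01/02/2014"},'
    . '{"event_id":9298,"title":"Tires, Parts & Components","open_time":"01/03/2014"}'
    . ']}}';
$data = json_decode($json);

// php://memory keeps the sketch self-contained; use a real path in practice.
$fp = fopen('php://memory', 'w+');
$headerWritten = false;
foreach ($data->result->events as $event) {
    $row = (array)$event; // stdClass object to associative array
    if (!$headerWritten) {
        // Passing an explicit (empty) escape character needs PHP 7.4 or later.
        fputcsv($fp, array_keys($row), ',', '"', '');
        $headerWritten = true;
    }
    fputcsv($fp, array_values($row), ',', '"', '');
}
rewind($fp);
$csv = stream_get_contents($fp);
fclose($fp);
echo $csv;
```

Writing the keys of the first event as a header row assumes every event carries the same fields in the same order.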

In the case of the second site, you have a table with several TR elements, and you want to catch the first two TD children of each TR.
By inspecting the source code you see something like this:
<tr>
<td> Allendale</td>
<td> Eastern Time
</td>
</tr>
<tr>
<td> Alpine</td>
<td> Eastern Time
</td>
So you just grab all the TR's
<?php
include("simple_html_dom.php");
$html = file_get_html('http://www.areacodelocations.info/allcities.php?ac=201');
$fp = fopen('output.csv', 'w');
if (!$fp) die("Cannot open output CSV - permission problems maybe?");
foreach($html->find('tr') as $tr) {
$csv = array(); // Start empty. A new CSV row for each TR.
// Now find the TD children of $tr. They will make up a row.
foreach($tr->find('td') as $td) {
// Get TD's innertext, but note that it is not cleaned up yet
$csv[] = $td->innertext;
}
fputcsv($fp, $csv);
}
fclose($fp);
?>
You will notice that the CSV text is "dirty". That is because the actual text is:
<td> Alpine</td>
<td> Eastern Time[CARRIAGE RETURN HERE]
</td>
So to have "Alpine" and "Eastern Time", you have to replace
$csv[] = $td->innertext;
with something like
$csv[] = trim(
html_entity_decode (
$td->innertext,
ENT_COMPAT | ENT_HTML401,
'UTF-8'
)
);
Check out the PHP man page for html_entity_decode() about character set encoding and entity handling. The above ought to work -- and an ought and fifty cents will get you a cup of coffee :-)
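One caveat, sketched below under the assumption that the cells contain a non-breaking space entity (the sample string is hypothetical): after decoding, that character is U+00A0, which trim() does not remove by default, so it has to be stripped explicitly.

```php
<?php
// Hypothetical cell content: a non-breaking space entity, text, then a line break.
$raw = "&nbsp;Eastern Time\r\n";

$decoded = html_entity_decode($raw, ENT_COMPAT | ENT_HTML401, 'UTF-8');

// trim() alone removes the line break but leaves the decoded U+00A0 in place.
$trimmedOnly = trim($decoded);

// Strip ASCII whitespace and U+00A0 from both ends.
$clean = preg_replace('/^[\s\x{00A0}]+|[\s\x{00A0}]+$/u', '', $decoded);

var_dump($trimmedOnly === 'Eastern Time');
var_dump($clean === 'Eastern Time');
```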

Flex: Send XML content with HTML entities to PHP via POST

I developed a desktop AIR application which saves data locally as XML files.
When the user clicks a button, the xml content is posted to a PHP application, where data is handled (inserted or updated).
The problem is, when I post some data containing html entities like & or >, the xml content is not parsed.
I'm using this class to parse the xml content:
https://sites.google.com/site/floweringmind/home
This is the input:
<companie_ps>
<denCompanie>Company & Friends</denCompanie>
<email>abc@abc.ro</email>
</companie_ps>
This is the wanted result:
Array
(
[0] => Array
(
[denCompanie] => Company & Friends
[email] => abc@abc.ro
)
)
This is the actual result:
Array
(
[0] => Array
(
[denCompanie] => Array
(
[#content] => Company
)
)
)
Error message: XML parse error 68 'XML_ERR_NAME_REQUIRED' at line 5, column 28 (byte index 63).
EDIT: The problem came from the fact that the post was converting html entities inside xml tags contents to actual characters. So, I used a regex to replace the special characters with their html entities. This is the code, in case someone will need it:
function _handle_match($match)
{
return '<' . $match[1] . '>' . htmlentities($match[2]) . '</' . $match[3] . '>';
}
$pattern = "/\<(.*)\>(.*?)\<\/(.*)\>/imU";
$xml = preg_replace_callback($pattern, '_handle_match', $xml);
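A possible variation, offered as a sketch and not as what the original fix used: htmlentities() with its default flags emits HTML named entities such as &eacute;, which an XML parser does not know, so htmlspecialchars() with ENT_XML1 may be the safer choice for XML content. The sample value below is made up.

```php
<?php
// Text as it might arrive after the POST turned entities back into characters.
$value = 'Company & Friends <Romania>';

// Escape only the characters XML reserves; ENT_XML1 avoids HTML-only
// named entities, which an XML parser would reject.
$escaped = htmlspecialchars($value, ENT_QUOTES | ENT_XML1, 'UTF-8');

$xml = '<companie_ps><denCompanie>' . $escaped . '</denCompanie></companie_ps>';
$doc = simplexml_load_string($xml);

// The parser turns the entities back into the original characters.
echo (string)$doc->denCompanie, "\n";
```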
