Import the content of a form in Wikipedia - PHP

I want to import the whole list of names from this page (http://nl.wikipedia.org/w/index.php?title=Samenstelling_Tweede_Kamer_2012-heden&action=edit&section=1) (from the edit form), then compare them with the names on this page (http://nl.wikipedia.org/wiki/Samenstelling_Tweede_Kamer_2012-heden) and print out the relevant links with PHP.

You have to write some code to parse the HTML from the Wikipedia site.
The PHP Simple HTML DOM Parser is the way to go to parse the HTML and get the information you need.
Once you have your data from the Wikipedia pages, you can compare them in your code.
Example to get the names (not tested, you probably need some more selectors to get exactly what you want):
ini_set('memory_limit', '160M');
require('simple_html_dom.php');

// Create DOM from URL or file
$url = 'http://nl.wikipedia.org/wiki/Samenstelling_Tweede_Kamer_2012-heden';

// Object-oriented style
$html = new simple_html_dom();
$html->load_file($url);

// Procedural style
// $html = file_get_html($url);

$items = array();
// Find each div with class "editmode" and loop through it
foreach ($html->find('div.editmode') as $article) {
    // Get all anchors inside unordered-list items
    foreach ($article->find('ul li a') as $a) {
        $items[] = "<a href='" . $a->href . "'>" . $a->plaintext . "</a>";
    }
}
print_r($items);
If you see weird characters in names (André Bosman, for example), consider declaring your charset (UTF-8) in your HTML like this:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
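Once both pages are parsed into name lists, the comparison itself is plain array work. A minimal sketch with made-up names and links (in the real script these arrays would come from the parsed pages above):

```php
<?php
// Hypothetical name lists, standing in for the two parsed Wikipedia pages
$namesFromForm = array('A. Bosman', 'B. Voorbeeld', 'C. Test');
$linksFromPage = array(
    'A. Bosman' => '/wiki/Andr%C3%A9_Bosman',
    'C. Test'   => '/wiki/C._Test',
);

// Keep only the names that appear on both pages, then print their links
$matches = array_intersect($namesFromForm, array_keys($linksFromPage));
foreach ($matches as $name) {
    echo "<a href='" . $linksFromPage[$name] . "'>" . $name . "</a>\n";
}
```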

Related

PHP - Get all images from class with simple html dom parser

I need to get all images from the infobox in a Wikipedia page. I made this code, but it gets all images from the page, not only those in the infobox; I need some help.
include("simple_html_dom.php");

$wikilink = "http://en.wikipedia.org/wiki/Aberdeen_F.C."; // Wikipedia page to parse
$html = file_get_html($wikilink);

$images_array = array();
foreach ($html->find('table.infobox vcard td, img') as $element) {
    $allimages = strtok($element->src . '|', '|');
    array_push($images_array, $allimages);
}
print_r($images_array);
The example below shows the HTML elements that I want to get.
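The comma in that selector makes it match every `img` on the page, not just those in the infobox. One way to scope the search, sketched here with DOMXPath on a small inline document (the class name `infobox` is an assumption based on Wikipedia's markup; the HTML string is made up):

```php
<?php
// Made-up page: one image inside the infobox table, one outside it
$htmlcontent = '<html><body>'
    . '<table class="infobox vcard"><tr><td><img src="crest.png"></td></tr></table>'
    . '<p><img src="elsewhere.png"></p>'
    . '</body></html>';

$doc = new DOMDocument();
$doc->loadHTML($htmlcontent);
$xpath = new DOMXPath($doc);

// Only <img> elements that are descendants of the infobox table
$images_array = array();
foreach ($xpath->query('//table[contains(@class,"infobox")]//img') as $img) {
    $images_array[] = $img->getAttribute('src');
}
print_r($images_array);
```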

Is it possible to extract full accurate image tag from a html code using DOM in php?

I'm trying to get full accurate img tags from a html code using DOM:
$content = new DOMDocument();
$content->loadHTML($htmlcontent);
$imgTags = $content->getElementsByTagName('img');
foreach ($imgTags as $tag) {
    echo $content->saveXML($tag);
}
If I had the original <img src="img">, the result would be <img src="img"/>. But I need the exact value corresponding to the original.
Is it possible to get the exact img tag using DOM, without regular expressions or third-party libraries (Simple HTML DOM)?
No. It isn't possible to do this.
However, you can achieve your goal of removing the <img> elements from an HTML document if they meet specific conditions using DOMDocument. Here's some sample code which removes images which contain the class attribute "removeme".
$htmlcontent =
    '<!DOCTYPE html><html><head><title>Example</title></head><body>'
    . '<img src="1"><img src="2" class="removeme"><img src="3"><img class="removeme" src="4">'
    . '</body></html>';

$content = new DOMDocument();
$content->loadHTML($htmlcontent);

foreach ($content->getElementsByTagName('img') as $image) {
    if ($image->getAttribute("class") == "removeme") {
        $image->parentNode->removeChild($image);
    }
}
echo $content->saveHTML();
Output:
<!DOCTYPE html> <html><head><title>Example</title></head><body><img src="1"><img src="3"></body></html>
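One nuance worth adding: the XML-style `/>` in the question comes from `saveXML()`. Since PHP 5.3.6, `saveHTML()` also accepts a node argument and uses HTML serialization, so void elements such as `<img>` are emitted without the self-closing slash. Attribute order and quoting may still be normalized, so this is close to, not guaranteed byte-identical with, the original:

```php
<?php
$content = new DOMDocument();
$content->loadHTML('<html><body><img src="img"></body></html>');

foreach ($content->getElementsByTagName('img') as $tag) {
    // saveHTML() with a node argument serializes as HTML: no trailing "/>"
    echo $content->saveHTML($tag), "\n";
}
```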

Extracting data from HTML using Simple HTML DOM Parser

For a college project, I am creating a website with some back-end algorithms, and to test these in a demo environment I require a lot of fake data. To get this data I intend to scrape some sites. One of these sites is freelancer.com. To extract the data I am using the Simple HTML DOM Parser, but so far I have been unsuccessful in my efforts to actually get the data I need.
Here is an example of the HTML layout of the page I intend to scrape. The red boxes mark the required data.
Here is the code I have written so far after following some tutorials.
<?php
include "simple_html_dom.php";
// Create DOM from URL
$html = file_get_html('http://www.freelancer.com/jobs/Website-Design/1/');
//Get all data inside the <tr> of <table id="project_table">
foreach ($html->find('table[id=project_table] tr') as $tr) {
    foreach ($tr->find('td[class=title-col]') as $t) {
        // get the outer HTML
        $data = $t->outertext;
        echo $data;
    }
}
?>
Hopefully someone can point me in the right direction as to how I can get this working.
Thanks.
The raw source code is different; that's why you're not getting the expected results.
You can check the raw source code with Ctrl+U: the data are in table[id=project_table_static], and the td cells have no attributes. Here's working code to get all the URLs from the table:
$url = 'http://www.freelancer.com/jobs/Website-Design/1/';

// Create DOM from URL
$html = file_get_html($url);

// Get all rows inside <table id="project_table_static">
foreach ($html->find('table#project_table_static tbody tr') as $i => $tr) {
    // Skip the first empty element
    if ($i == 0) {
        continue;
    }
    echo "<br/>\$i=" . $i;
    // Get the first anchor
    $anchor = $tr->find('a', 0);
    echo " => " . $anchor->href;
}

// Clear the DOM object
$html->clear();
unset($html);

Fixing unclosed HTML tags

I am working on a blog layout and I need to create an abstract of each post (say, the latest 15) to show on the homepage. The content is already formatted in HTML tags by the Textile library. If I use substr to get the first 500 characters of a post, the main problem I face is how to close the unclosed tags.
e.g
<div>.......................</div>
<div>...........
<p>............</p>
<p>...........| 500 chars
</p>
<div>
What I get is two unclosed tags, <p> and <div>. The <p> won't cause much trouble, but the <div> messes with the whole page layout. Any suggestions on how to track the opening tags and close them manually, or something similar?
There are lots of methods that can be used:
Use a proper HTML parser, like DOMDocument
Use PHP Tidy to repair the unclosed tags
Some would suggest HTML Purifier
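The PHP Tidy route, for example, repairs truncated markup in a single call. A sketch, assuming the tidy extension (ext-tidy) is installed; the input string is made up:

```php
<?php
// A truncated excerpt with a dangling <p> and <div> (made-up input)
$broken = '<div><p>First paragraph</p><p>truncated here...';
$config = array('show-body-only' => true, 'wrap' => 0);

// tidy_repair_string() closes the dangling tags for us
$fixed = tidy_repair_string($broken, $config, 'utf8');
echo $fixed;
```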
As ajreal said, DOMDocument is a solution.
Example :
$str = "
<html>
<head>
<title>test</title>
</head>
<body>
<p>error</i>
</body>
</html>
";
$doc = new DOMDocument();
@$doc->loadHTML($str); // "@" suppresses warnings from the invalid HTML
echo $doc->saveHTML();
Advantage : natively included in PHP, contrary to PHP Tidy.
You can use DOMDocument to do it, but be careful of string encoding issues. Also, you'll have to use a complete HTML document, then extract the components you want. Here's an example:
function make_excerpt ($rawHtml, $length = 500) {
    // Truncate, then append an ellipsis and "More" link
    $content = substr($rawHtml, 0, $length)
        . '… More >';
    // Detect the string encoding
    $encoding = mb_detect_encoding($content);
    // Pass it to the DOMDocument constructor
    $doc = new DOMDocument('', $encoding);
    // Must include the content-type/charset meta tag with $encoding
    // Bad HTML will trigger warnings; suppress those with "@"
    @$doc->loadHTML('<html><head>'
        . '<meta http-equiv="content-type" content="text/html; charset='
        . $encoding . '"></head><body>' . trim($content) . '</body></html>');
    // Extract the components we want
    $nodes = $doc->getElementsByTagName('body')->item(0)->childNodes;
    $html = '';
    $len = $nodes->length;
    for ($i = 0; $i < $len; $i++) {
        $html .= $doc->saveHTML($nodes->item($i));
    }
    return $html;
}
$html = "<p>.......................</p>
<p>...........
<p>............</p>
<p>...........| 500 chars";
// output fixed html
echo make_excerpt($html, 500);
Outputs:
<p>.......................</p>
<p>...........
</p>
<p>............</p>
<p>...........| 500 chars… More ></p>
If you are using WordPress you should wrap the substr() invocation in a call to wpautop - wpautop(substr(...)). You may also wish to test the length of the $rawHtml passed to the function, and skip appending the "More" link if it isn't long enough.
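The length check mentioned above might look like this (the function name is made up, and the WordPress-specific wpautop call is left out):

```php
<?php
// Hypothetical guard around the excerpt logic
function make_excerpt_guarded($rawHtml, $length = 500) {
    // Short posts are returned untouched, with no "More" link appended
    if (strlen($rawHtml) <= $length) {
        return $rawHtml;
    }
    return substr($rawHtml, 0, $length) . '… More >';
}

echo make_excerpt_guarded('<p>short post</p>');
```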

How should I parse background images and other images of a webpage with PHP (Simple HTML DOM Parser)?

How should I parse the background and other images of a webpage with PHP (Simple HTML DOM or similar)?
case 1: inline css
<div id="id100" style="background:url(/mycar1.jpg)"></div>
case 2: css inside html page
<div id="id100"></div>
<style type="text/css">
#id100{
background:url(/mycar1.jpg);
}
</style>
case 3: separate css file
<div id="id100"></div>
external.css
#id100{
background:url(/mycar1.jpg);
}
case 4: image inside img tag
A solution to case 4, as it appears in the PHP Simple HTML DOM Parser documentation:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}
Please help me parse cases 1, 2 and 3.
If more cases exist, please list them, with solutions if you can.
Thanks
For Case 1:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Get the style attribute for the element
$style = $html->getElementById("id100")->getAttribute('style');
// $style = "background:url(/mycar1.jpg)"
// You now need to feed this to a CSS parser, or use a regular
// expression to get the values you need.
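The "regular expression magic" can be as simple as pulling the path out of url(...). A sketch (the style string is the one from the example above; CSS allows optional quotes inside url(), hence the ['"]?):

```php
<?php
// Style string as returned by getAttribute('style') in case 1
$style = 'background:url(/mycar1.jpg)';

// Extract the path from url(...), with or without surrounding quotes
if (preg_match('/url\(\s*[\'"]?([^\'")]+)[\'"]?\s*\)/i', $style, $m)) {
    echo $m[1];
}
```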
For Case 2/3:
// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');
// Get the style elements
$style = $html->find('head', 0)->find('style');
// $style now contains an array of style elements within the head.
// For inline <style> elements, pass their innertext to a CSS parser.
// For external stylesheets (case 3), look for <link rel="stylesheet">
// elements instead, download the CSS file their href points to, and
// run its contents through the same parser.
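Once you have CSS text, whether from a style element's innertext or from a downloaded external stylesheet, the same extraction applies. A sketch with a made-up stylesheet string:

```php
<?php
// CSS text as it might come from a <style> element's innertext
// or from a downloaded external stylesheet (made-up example)
$css = '#id100{ background:url(/mycar1.jpg); } .other{ color:red; }';

// Find the #id100 rule, then extract the background URL from its body
if (preg_match('/#id100\s*\{([^}]*)\}/', $css, $rule)
    && preg_match('/url\(\s*[\'"]?([^\'")]+)[\'"]?\s*\)/i', $rule[1], $m)) {
    echo $m[1];
}
```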
To extract <img> from the page you can try something like:
$doc = new DOMDocument();
$doc->loadHTML("<html><body>Foo<br><img src=\"bar.jpg\" title=\"Foo bar\" alt=\"alt\"></body></html>");
$xml = simplexml_import_dom($doc);
$images = $xml->xpath('//img');
foreach ($images as $img) {
    echo $img['src'] . ' ' . $img['alt'] . ' ' . $img['title'];
}
See doc for DOMDocument for more details.
