Error when get data when using regex in php

Error when get data when using regex in php - php

I have a sample code:
<?php
$adr = 'http://www.proxynova.com/proxy-server-list/country-gb/';
$c = file_get_contents($adr);
if ($c){
$regexp = '#<td>(.*?):(\d{1,4})</td>#';
$matches = array();
preg_match_all($regexp,$c,$matches);
print_r($matches);
if (count($matches) > 0){
foreach($matches[0] as $k => $m){
$port = intval($matches[2][$k]);
$ip = trim($matches[1][$k]);
}
}
}
I using $regex = '#<td>(.*?):(\d{1,4})</td>#'; to get data inculde ip and port, but result is null, how to fix it !

You can only see it properly in the browser, but in the source it's actually scrambled; you need something like this to decode it:
function decode($str)
{
return long2ip(strtr($str, array(
'fgh' => 2,
'iop' => 1,
'ray' => 0,
)));
}
Then use it together with a DOMDocument solution like this:
$doc = new DOMDocument;
libxml_use_internal_errors(true);
$doc->loadHTML(file_get_contents('http://www.proxynova.com/proxy-server-list/country-gb/'));
$xp = new DOMXPath($doc);
foreach ($xp->query('//table[#id="tbl_proxy_list"]//tr') as $row) {
$ip = $xp->query('./td/span[#class="row_proxy_ip"]/script', $row);
$port = $xp->query('./td/span[#class="row_proxy_port"]/a', $row);
if ($ip->length && $port->length) {
if (preg_match('/decode\("([^"]+)"\)/', $ip->item(0)->textContent, $matches)) {
echo decode($matches[1]) . ':' . $port->item(0)->textContent, PHP_EOL;
}
}
}

The html source code contains ip adresses and ports separated in two columns, so that's why your regex doens't worK.

Related

How to scrape data from HTML Table in PHP

Hey I've been trying to scrape data from an html table and I'm not having much luck.
Website: https://www.dnr.state.mn.us/hunting/seasons.html
What I'm trying to do: I want to grab the contents of the table and encode it into json like
['event_title' 'Waterfowl'] and ['event_date' '09/25/21']
but I don't know how to do this, I've tried a couple different things but in the end I can't get it to work.
Code Example (Closest I got):
<?php
$dom = new DOMDocument;
$page = file_get_contents('https://www.dnr.state.mn.us/hunting/seasons.html');
$dom->loadHTML($page);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//tbody/tr') as $tr) {
$tmp = []; // reset the temporary array so previous entries are removed
foreach ($xpath->query("td[#class]", $tr) as $td) {
$key = preg_match('~[a-z]+$~', $td->getAttribute('class'), $out) ? $out[0] : 'no_class';
if ($key === "event-title") {
$tmp['event_title'] = $xpath->query("a", $td);
}
$tmp[$key] = trim($td->textContent);
}
//$tmp['event_date'] = date("M. dS 'y", strtotime(preg_replace('~\.|\d+[ap]m *~', '', $tmp['date'])));
//$result[] = $tmp;
$marray[] = array_unique($tmp);
print_r($marray);
}
//$array2 = var_export($result);
//print_r($array2[1]);
//var_export($result);
//echo "\n----\n";
//echo json_encode($result);
?>

Optimize remote page retrieving and parsing

I'm retrieving a remote page with PHP, getting a few links from that page and accessing each link and parsing it.
It takes me about 12 seconds which are way too much, and I need to optimize the code somehow.
My code is something like that:
$result = get_web_page('THE_WEB_PAGE');
preg_match_all('/<a data\-a=".*" href="(.*)">/', $result['content'], $matches);
foreach ($matches[2] as $lnk) {
$result = get_web_page($lnk);
preg_match('/<span id="tests">(.*)<\/span>/', $result['content'], $match);
$re[$index]['test'] = $match[1];
preg_match('/<span id="tests2">(.*)<\/span>/', $result['content'], $match);
$re[$index]['test2'] = $match[1];
preg_match('/<span id="tests3">(.*)<\/span>/', $result['content'], $match);
$re[$index]['test3'] = $match[1];
++$index;
}
I have some more preg_match calls inside the loop.
How can I optimize my code?
Edit:
I've changed my code to use xpath instead of regex, and it became much more slower.
Edit2:
That's my full code:
<?php
$begin = microtime(TRUE);
$result = get_web_page('WEB_PAGE');
$dom = new DOMDocument();
$dom->loadHTML($result['content']);
$xpath = new DOMXPath($dom);
// Get the links
$matches = $xpath->evaluate('//li[#class = "lasts"]/a[#class = "lnk"]/#href | //li[#class=""]/a[ #class = "lnk"]/#href');
if ($matches === FALSE) {
echo 'error';
exit();
}
foreach ($matches as $match) {
$links[] = 'WEB_PAGE'.$match->value;
}
$index = 0;
// For each link
foreach ($links as $link) {
echo (string)($index).' loop '.(string)(microtime(TRUE)-$begin).'<br>';
$result = get_web_page($link);
$dom = new DOMDocument();
$dom->loadHTML($result['content']);
$xpath = new DOMXPath($dom);
$match = $xpath->evaluate('concat(//span[#id = "header"]/span[#id = "sub_header"]/text(), //span[#id = "header"]/span[#id = "sub_header"]/following-sibling::text()[1])');
if ($matches === FALSE) {
exit();
}
$data[$index]['name'] = $match;
$matches = $xpath->evaluate('//li[starts-with(#class, "active")]/a/text()');
if ($matches === FALSE) {
exit();
}
foreach ($matches as $match) {
$data[$index]['types'][] = $match->data;
}
$matches = $xpath->evaluate('//span[#title = "this is a title" and #class = "info"]/text()');
if ($matches === FALSE) {
exit();
}
foreach ($matches as $match) {
$data[$index]['info'][] = $match->data;
}
$matches = $xpath->evaluate('//span[#title = "this is another title" and #class = "name"]/text()');
if ($matches === FALSE) {
exit();
}
foreach ($matches as $match) {
$data[$index]['names'][] = $match->data;
}
++$index;
}
?>

As others mentioned, use a parser instead (ie DOMDocument) and combine it with xpath queries. Consider the following example:
<?php
# set up some dummy data
$data = <<<DATA
<div>
<a class='link'>Some link</a>
<a class='link' id='otherid'>Some link 2</a>
</div>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
# all links
$links = $xpath->query("//a[#class = 'link']");
print_r($links);
# special id link
$special = $xpath->query("//a[#id = 'otherid']")
# and so on
$textlinks = $xpath->query("//a[startswith(text(), 'Some')]");
?>

Consider using a DOM framework for PHP. This should be way faster.
Use PHP's DOMDocument with xpath queries:
http://php.net/manual/en/class.domdocument.php
See Jan's answer for more explanation.
The following also works but is less preferable, according to the comments.
For example:
http://simplehtmldom.sourceforge.net/
an example to get all a tags on a page:
<?php
include_once('simple_html_dom.php');
$url = "http://your_url/";
$html = new simple_html_dom();
$html->load_file($url);
foreach($html->find("a") as $link)
{
// do something with the link
}
?>

PHP Regex or DOMDocument for Matching & Removing URLs?

I'm trying to extract links from html page using DOM:
$html = file_get_contents('links.html');
$DOM = new DOMDocument();
$DOM->loadHTML($html);
$a = $DOM->getElementsByTagName('a');
foreach($a as $link){
//echo out the href attribute of the <A> tag.
echo $link->getAttribute('href').'<br/>';
}
Output:
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
I would like to remove all results matching dontwantthisdomain.com, dontwantthisdomain2.com and dontwantthisdomain3.com so the output will looks like that:
http://domain1.com/page-X-on-domain-com.html
http://domain.com/page-XZ-on-domain-com.html
http://domain3.com/page-XYZ-on-domain3-com.html
Some people saying I should not use regex for html and others that it's ok. Could somebody point the best way how I can remove unwanted urls from my html file? :)

Maybe something like this:
function extract_domains($buffer, $whitelist) {
preg_match_all("#<a\s+.*?href=\"(.+?)\".*?>(.+?)</a>#i", $buffer, $matches);
$result = array();
foreach($matches[1] as $url) {
$url = urldecode($url);
$parts = #parse_url((string) $url);
if ($parts !== false && in_array($parts['host'], $whitelist)) {
$result[] = $parts['host'];
}
}
return $result;
}
$domains = extract_domains(file_get_contents("/path/to/html.htm"), array('stackoverflow.com', 'google.com', 'sub.example.com')));
It does a rough match on the all the <a> with href=, grabs what's between the quotes, then filters it based on your whitelist of domains.

None regex solution (without potential errors :-) :
$html='
http://dontwantthisdomain.com/dont-want-this-domain-name/
http://dontwantthisdomain2.com/also-dont-want-any-pages-from-this-domain/
http://dontwantthisdomain3.com/dont-want-any-pages-from-this-domain/
http://domain1.com/page-X-on-domain-com.html
http://dontwantthisdomain.com/dont-want-link-from-this-domain-name.html
http://dontwantthisdomain2.com/dont-want-any-pages-from-this-domain/
http://domain.com/page-XZ-on-domain-com.html
http://dontwantthisdomain.com/another-page-from-same-domain-that-i-dont-want-to-be-included/
http://dontwantthisdomain2.com/same-as-above/
http://domain3.com/page-XYZ-on-domain3-com.html
';
$html=explode("\n", $html);
$dontWant=array('dontwantthisdomain.com','dontwantthisdomain2.com','dontwantthisdomain3.com');
foreach ($html as $link) {
$ok=true;
foreach($dontWant as $notWanted) {
if (strpos($link, $notWanted)>0) {
$ok=false;
}
if (trim($link=='')) $ok=false;
}
if ($ok) $final_result[]=$link;
}
echo '<pre>';
print_r($final_result);
echo '</pre>';
outputs
Array
(
[0] => http://domain1.com/page-X-on-domain-com.html
[1] => http://domain.com/page-XZ-on-domain-com.html
[2] => http://domain3.com/page-XYZ-on-domain3-com.html
)

How to add rel="nofollow" to links with preg_replace()

The function below is designed to apply rel="nofollow" attributes to all external links and no internal links unless the path matches a predefined root URL defined as $my_folder below.
So given the variables...
$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';
And the content...
internal
internal cloaked link
external
The end result, after replacement should be...
internal
internal cloaked link
external
Notice that the first link is not altered, since its an internal link.
The link on the second line is also an internal link, but since it matches our $my_folder string, it gets the nofollow too.
The third link is the easiest, since it does not match the blog_url, its obviously an external link.
However, in the script below, ALL of my links are getting nofollow. How can I fix the script to do what I want?
function save_rseo_nofollow($content) {
$my_folder = $rseo['nofollow_folder'];
$blog_url = get_bloginfo('url');
preg_match_all('~<a.*>~isU',$content["post_content"],$matches);
for ( $i = 0; $i <= sizeof($matches[0]); $i++){
if ( !preg_match( '~nofollow~is',$matches[0][$i])
&& (preg_match('~' . $my_folder . '~', $matches[0][$i])
|| !preg_match( '~'.$blog_url.'~',$matches[0][$i]))){
$result = trim($matches[0][$i],">");
$result .= ' rel="nofollow">';
$content["post_content"] = str_replace($matches[0][$i], $result, $content["post_content"]);
}
}
return $content;
}

Here is the DOMDocument solution...
$str = 'internal
internal cloaked link
external
external
external
external
';
$dom = new DOMDocument();
$dom->preserveWhitespace = FALSE;
$dom->loadHTML($str);
$a = $dom->getElementsByTagName('a');
$host = strtok($_SERVER['HTTP_HOST'], ':');
foreach($a as $anchor) {
$href = $anchor->attributes->getNamedItem('href')->nodeValue;
if (preg_match('/^https?:\/\/' . preg_quote($host, '/') . '/', $href)) {
continue;
}
$noFollowRel = 'nofollow';
$oldRelAtt = $anchor->attributes->getNamedItem('rel');
if ($oldRelAtt == NULL) {
$newRel = $noFollowRel;
} else {
$oldRel = $oldRelAtt->nodeValue;
$oldRel = explode(' ', $oldRel);
if (in_array($noFollowRel, $oldRel)) {
continue;
}
$oldRel[] = $noFollowRel;
$newRel = implode($oldRel, ' ');
}
$newRelAtt = $dom->createAttribute('rel');
$noFollowNode = $dom->createTextNode($newRel);
$newRelAtt->appendChild($noFollowNode);
$anchor->appendChild($newRelAtt);
}
var_dump($dom->saveHTML());
Output
string(509) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
internal
internal cloaked link
external
external
external
external
</body></html>
"

Try to make it more readable first, and only afterwards make your if rules more complex:
function save_rseo_nofollow($content) {
$content["post_content"] =
preg_replace_callback('~<(a\s[^>]+)>~isU', "cb2", $content["post_content"]);
return $content;
}
function cb2($match) {
list($original, $tag) = $match; // regex match groups
$my_folder = "/hostgator"; // re-add quirky config here
$blog_url = "http://localhost/";
if (strpos($tag, "nofollow")) {
return $original;
}
elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
return $original;
}
else {
return "<$tag rel='nofollow'>";
}
}
Gives following output:
[post_content] =>
internal
<a href="http://localhost/mytest/go/hostgator" rel=nofollow>internal cloaked link</a>
<a href="http://cnn.com" rel=nofollow>external</a>
The problem in your original code might have been $rseo which wasn't declared anywhere.

Try this one (PHP 5.3+):
skip selected address
allow manually set rel parameter
and code:
function nofollow($html, $skip = null) {
return preg_replace_callback(
"#(<a[^>]+?)>#is", function ($mach) use ($skip) {
return (
!($skip && strpos($mach[1], $skip) !== false) &&
strpos($mach[1], 'rel=') === false
) ? $mach[1] . ' rel="nofollow">' : $mach[0];
},
$html
);
}
Examples:
echo nofollow('something');
// will be same because it's already contains rel parameter
echo nofollow('something'); // ad
// add rel="nofollow" parameter to anchor
echo nofollow('something', 'localhost');
// skip this link as internall link

Using regular expressions to do this job properly would be quite complicated. It would be easier to use an actual parser, such as the one from the DOM extension. DOM isn't very beginner-friendly, so what you can do is load the HTML with DOM then run the modifications with SimpleXML. They're backed by the same library, so it's easy to use one with the other.
Here's how it can look like:
$my_folder = 'http://localhost/mytest/go/';
$blog_url = 'http://localhost/mytest';
$html = '<html><body>
internal
internal cloaked link
external
</body></html>';
$dom = new DOMDocument;
$dom->loadHTML($html);
$sxe = simplexml_import_dom($dom);
// grab all <a> nodes with an href attribute
foreach ($sxe->xpath('//a[#href]') as $a)
{
if (substr($a['href'], 0, strlen($blog_url)) === $blog_url
&& substr($a['href'], 0, strlen($my_folder)) !== $my_folder)
{
// skip all links that start with the URL in $blog_url, as long as they
// don't start with the URL from $my_folder;
continue;
}
if (empty($a['rel']))
{
$a['rel'] = 'nofollow';
}
else
{
$a['rel'] .= ' nofollow';
}
}
$new_html = $dom->saveHTML();
echo $new_html;
As you can see, it's really short and simple. Depending on your needs, you may want to use preg_match() in place of the strpos() stuff, for example:
// change the regexp to your own rules, here we match everything under
// "http://localhost/mytest/" as long as it's not followed by "go"
if (preg_match('#^http://localhost/mytest/(?!go)#', $a['href']))
{
continue;
}
Note
I missed the last code block in the OP when I first read the question. The code I posted (and basically any solution based on DOM) is better suited at processing a whole page rather than a HTML block. Otherwise, DOM will attempt to "fix" your HTML and may add a <body> tag, a DOCTYPE, etc...

Thanks #alex for your nice solution. But, I was having a problem with Japanese text. I have fixed it as following way. Also, this code can skip multiple domains with the $whiteList array.
public function addRelNoFollow($html, $whiteList = [])
{
$dom = new \DOMDocument();
$dom->preserveWhiteSpace = false;
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$a = $dom->getElementsByTagName('a');
/** #var \DOMElement $anchor */
foreach ($a as $anchor) {
$href = $anchor->attributes->getNamedItem('href')->nodeValue;
$domain = parse_url($href, PHP_URL_HOST);
// Skip whiteList domains
if (in_array($domain, $whiteList, true)) {
continue;
}
// Check & get existing rel attribute values
$noFollow = 'nofollow';
$rel = $anchor->attributes->getNamedItem('rel');
if ($rel) {
$values = explode(' ', $rel->nodeValue);
if (in_array($noFollow, $values, true)) {
continue;
}
$values[] = $noFollow;
$newValue = implode($values, ' ');
} else {
$newValue = $noFollow;
}
// Create new rel attribute
$rel = $dom->createAttribute('rel');
$node = $dom->createTextNode($newValue);
$rel->appendChild($node);
$anchor->appendChild($rel);
}
// There is a problem with saveHTML() and saveXML(), both of them do not work correctly in Unix.
// They do not save UTF-8 characters correctly when used in Unix, but they work in Windows.
// So we need to do as follows. #see https://stackoverflow.com/a/20675396/1710782
return $dom->saveHTML($dom->documentElement);
}

<?
$str='internal
internal cloaked link
external';
function test($x){
if (preg_match('#localhost/mytest/(?!go/)#i',$x[0])>0) return $x[0];
return 'rel="nofollow" '.$x[0];
}
echo preg_replace_callback('/href=[\'"][^\'"]+/i', 'test', $str);
?>

Here is the another solution which has whitelist option and add tagret Blank attribute.
And also it check if there already a rel attribute before add a new one.
function Add_Nofollow_Attr($Content, $Whitelist = [], $Add_Target_Blank = true)
{
$Whitelist[] = $_SERVER['HTTP_HOST'];
foreach ($Whitelist as $Key => $Link)
{
$Host = preg_replace('#^https?://#', '', $Link);
$Host = "https?://". preg_quote($Host, '/');
$Whitelist[$Key] = $Host;
}
if(preg_match_all("/<a .*?>/", $Content, $matches, PREG_SET_ORDER))
{
foreach ($matches as $Anchor_Tag)
{
$IS_Rel_Exist = $IS_Follow_Exist = $IS_Target_Blank_Exist = $Is_Valid_Tag = false;
if(preg_match_all("/(\w+)\s*=\s*['|\"](.*?)['|\"]/",$Anchor_Tag[0],$All_matches2))
{
foreach ($All_matches2[1] as $Key => $Attr_Name)
{
if($Attr_Name == 'href')
{
$Is_Valid_Tag = true;
$Url = $All_matches2[2][$Key];
// bypass #.. or internal links like "/"
if(preg_match('/^\s*[#|\/].*/', $Url))
{
continue 2;
}
foreach ($Whitelist as $Link)
{
if (preg_match("#$Link#", $Url)) {
continue 3;
}
}
}
else if($Attr_Name == 'rel')
{
$IS_Rel_Exist = true;
$Rel = $All_matches2[2][$Key];
preg_match("/[n|d]ofollow/", $Rel, $match, PREG_OFFSET_CAPTURE);
if( count($match) > 0 )
{
$IS_Follow_Exist = true;
}
else
{
$New_Rel = 'rel="'. $Rel . ' nofollow"';
}
}
else if($Attr_Name == 'target')
{
$IS_Target_Blank_Exist = true;
}
}
}
$New_Anchor_Tag = $Anchor_Tag;
if(!$IS_Rel_Exist)
{
$New_Anchor_Tag = str_replace(">",' rel="nofollow">',$Anchor_Tag);
}
else if(!$IS_Follow_Exist)
{
$New_Anchor_Tag = preg_replace("/rel=[\"|'].*?[\"|']/",$New_Rel,$Anchor_Tag);
}
if($Add_Target_Blank && !$IS_Target_Blank_Exist)
{
$New_Anchor_Tag = str_replace(">",' target="_blank">',$New_Anchor_Tag);
}
$Content = str_replace($Anchor_Tag,$New_Anchor_Tag,$Content);
}
}
return $Content;
}
To use it:
$Page_Content = 'internal
internal
google
example
stackoverflow';
$Whitelist = ["http://yoursite.com","http://localhost"];
echo Add_Nofollow_Attr($Page_Content,$Whitelist,true);

WordPress decision:
function replace__method($match) {
list($original, $tag) = $match; // regex match groups
$my_folder = "/articles"; // re-add quirky config here
$blog_url = 'https://'.$_SERVER['SERVER_NAME'];
if (strpos($tag, "nofollow")) {
return $original;
}
elseif (strpos($tag, $blog_url) && (!$my_folder || !strpos($tag, $my_folder))) {
return $original;
}
else {
return "<$tag rel='nofollow'>";
}
}
add_filter( 'the_content', 'add_nofollow_to_external_links', 1 );
function add_nofollow_to_external_links( $content ) {
$content = preg_replace_callback('~<(a\s[^>]+)>~isU', "replace__method", $content);
return $content;
}

a good script which allows to add nofollow automatically and to keep the other attributes
function nofollow(string $html, string $baseUrl = null) {
return preg_replace_callback(
'#<a([^>]*)>(.+)</a>#isU', function ($mach) use ($baseUrl) {
list ($a, $attr, $text) = $mach;
if (preg_match('#href=["\']([^"\']*)["\']#', $attr, $url)) {
$url = $url[1];
if (is_null($baseUrl) || !str_starts_with($url, $baseUrl)) {
if (preg_match('#rel=["\']([^"\']*)["\']#', $attr, $rel)) {
$relAttr = $rel[0];
$rel = $rel[1];
}
$rel = 'rel="' . ($rel ? (strpos($rel, 'nofollow') ? $rel : $rel . ' nofollow') : 'nofollow') . '"';
$attr = isset($relAttr) ? str_replace($relAttr, $rel, $attr) : $attr . ' ' . $rel;
$a = '<a ' . $attr . '>' . $text . '</a>';
}
}
return $a;
},
$html
);
}

php associative arrays, regex, array

I currently have the following code :
$content = "
<name>Manufacturer</name><value>John Deere</value><name>Year</name><value>2001</value><name>Location</name><value>NSW</value><name>Hours</name><value>6320</value>";
I need to find a method to create and array as name=>value. E.g Manufacturer => John Deere.
Can anyone help me with a simple code snipped I tried some regex but doesn't even work to extract the names or values, e.g.:
$pattern = "/<name>Manufacturer<\/name><value>(.*)<\/value>/";
preg_match_all($pattern, $content, $matches);
$st_selval = $matches[1][0];

You don't want to use regex for this. Try out something like SimpleXML
EDIT
Well, why don't you start with this:
<?php
$content = "<root>" . $content . "</root>";
$xml = new SimpleXMLElement($c);
print_r($xml);
?>
EDIT 2
Despite the fact that some of the answers posted using regular expression MAY work, you should get in the habit of using the correct tool for the job and regular expressions are not the correct tool for parsing of XML.

I'm using your $content variable:
$preg1 = preg_match_all('#<name>([^<]+)#', $content, $name_arr);
$preg2 = preg_match_all('#<value>([^<]+)#', $content, $val_arr);
$array = array_combine($name_arr[1], $val_arr[1]);

This is rather simple, can be solved by regex. Should be:
$name = '<name>\s*([^<]+)</name>\s*';
$value = '<value>\s*([^<]+)</value>\s*';
$pattern = "|$name $value|";
preg_match_all($pattern, $content, $matches);
# create hash
$stuff = array_combine($matches[1], $matches[2]);
# display
var_dump($stuff);
Regards
rbo

First of all, never use regex to parse xml...
You could do this with an XPATH query...
First, wrap the content in a root tag to make the parser happy (if it doesn't already have it):
$content = '<root>' . $content . '</root>';
Then, load the document
$dom = new DomDocument();
$dom->loadXml($content);
Then, initialize the XPATH
$xpath = new DomXpath($dom);
Write your query:
$xpathQuery = '//name[text()="Manufacturer"]/follwing-sibling::value/text()';
Then, execute it:
$manufacturer = $xpath->evaluate($xpathQuery);
If I did the xpath right, it $manufacturer should be John Deere...
You can see the docs on DomXpath, a basic primer on XPath, and a bunch of XPath examples...
Edit: That won't work (PHP doesn't support that syntax (following-sibling). You could do this instead of the xpath query:
$xpathQuery = '//name[text()="Manufacturer"]';
$elements = $xpath->query($xpathQuery);
$manufacturer = $elements->item(0)->nextSibling->nodeValue;

I think this is what you're looking for:
<?php
$content = "<name>Manufacturer</name><value>John Deere</value><name>Year</name><value>2001</value><name>Location</name><value>NSW</value><name>Hours</name><value>6320</value>";
$pattern = "(\<name\>(\w*)\<\/name\>\<value\>(\w*)\<\/value\>)";
preg_match_all($pattern, $content, $matches);
$arr = array();
for ($i=0; $i<count($matches); $i++){
$arr[$matches[1][$i]] = $matches[2][$i];
}
/* This is an example on how to use it */
echo "Location: " . $arr["Location"] . "<br><br>";
/* This is the array */
print_r($arr);
?>
If your array has a lot of elements dont use the count() function in the for loop, calculate the value first and then use it as a constant.

I'll edit as my PHP is wrong, but here's some PHP (pseudo-)code to give some direction.
$pattern = '|<name>([^<]*)</name>\s*<value>([^<]*)</value>|'
preg_match_all($pattern, $content, $matches, PREG_SET_ORDER);
for($i = 0; $i < count($matches); $i++) {
$arr[$matches[$i][1]] = $matches[$i][2];
}
$arr is the array you want to store the name/value pairs.

Using XMLReader:
$content = '<name>Manufacturer</name><value>John Deere</value><name>Year</name><value>2001</value><name>Location</name><value>NSW</value><name>Hours</name><value>6320</value>';
$content = '<content>' . $content . '</content>';
$output = array();
$reader = new XMLReader();
$reader->XML($content);
$currentKey = null;
$currentValue = null;
while ($reader->read()) {
switch ($reader->name) {
case 'name':
$reader->read();
$currentKey = $reader->value;
$reader->read();
break;
case 'value':
$reader->read();
$currentValue = $reader->value;
$reader->read();
break;
}
if (isset($currentKey) && isset($currentValue)) {
$output[$currentKey] = $currentValue;
$currentKey = null;
$currentValue = null;
}
}
print_r($output);
The output is:
Array
(
[Manufacturer] => John Deere
[Year] => 2001
[Location] => NSW
[Hours] => 6320
)

We Keep Coding

PHP, A popular general-purpose scripting language that is especially suited to web development.

Error when get data when using regex in php - php

The html source code contains ip adresses and ports separated in two columns, so that's why your regex doens't worK.

Related

How to scrape data from HTML Table in PHP

Optimize remote page retrieving and parsing

PHP Regex or DOMDocument for Matching & Removing URLs?

How to add rel="nofollow" to links with preg_replace()

php associative arrays, regex, array

Categories

Resources