PHP - preg_match() result in 0 matching values

I am trying to scrape the bio and pictures from my profile page (http://about.me/fernandocaldas) for my personal website, so that whenever I change that profile, the bio on my website changes too.
The desired values are between
<script type="text/json" class="json user" data-scope="view_profile" data-lowercase_user_name="fernandocaldas">
and
</script>
Here is my code:
$thtml = file_get_contents('http://about.me/fernandocaldas');
$matchval = '/\<script type=\"text\/json\" class=\"json.*?>(.*?)\<\/script\>/i';
preg_match($matchval, $thtml, $match);
var_dump($match);
if($match){
echo "match!\n";
foreach($match[1] as $val)
{
echo $val."<br>";
}
}
But the result is always array(0) {} for the var_dump.

Regular expressions are rarely a good idea for HTML: a regex that seems to work today can fail tomorrow. [1]
Programmers frequently think: “Why should I init a parser, load the HTML and perform a lot of queries if I can do it with a single line of regex?” The answer is: “Why choose the road that leads in the wrong direction, even if it is shorter?”
In your case, using a parser also shortens your code.
First, load your HTML page, init a new DOMDocument object, load the HTML string into it and init a DOMXPath object (DOMXPath lets you run complex queries against the HTML):
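$html = file_get_contents( 'http://about.me/fernandocaldas' );
$dom = new DOMDocument();
// Collect parser warnings internally instead of printing them
libxml_use_internal_errors( true );
$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );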
Search for the element(s) with tag <script> and class “json user”:
$found = $xpath->query( '//script[@class="json user"]' );
if( !$found->length ) die( 'Error retrieving JSON' );
Put the value of the first node (the only one in your page) in a variable (I also trim it, though that is unnecessary) and decode it with json_decode():
$json = trim( $found->item(0)->nodeValue );
$user = json_decode( $json );
Now the $user object holds all the data you need: $user->first_name is your first name and $user->bio is your biography. Use print_r( $user ) to display the complete structure and see how to access each element.
Read more about DOMDocument
Read more about DOMXPath
Read why you can't parse [X]HTML with regular expressions
[1] If the HTML structure changes, a parser will fail too.
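The steps in this answer can be combined into one small function. This is only a sketch: the function name extractUserJson() and the sample markup are assumptions made for illustration, not the real about.me page, and it decodes to an array rather than an object.

```php
<?php
// Hypothetical helper: returns the decoded JSON from the first
// <script class="json user"> element, or null if none is found.
function extractUserJson(string $html): ?array
{
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $found = $xpath->query('//script[@class="json user"]');
    if ($found === false || $found->length === 0) {
        return null;
    }

    // Decode to an associative array; anything else counts as a failure.
    $data = json_decode(trim($found->item(0)->nodeValue), true);
    return is_array($data) ? $data : null;
}

// Stand-in markup (an assumption, not the real profile page).
$sample = <<<MARKUP
<html><body>
<script type="text/json" class="json user" data-scope="view_profile">
{"first_name": "Fernando", "bio": "Sample bio"}
</script>
</body></html>
MARKUP;

$user = extractUserJson($sample);
echo $user === null ? "No JSON found\n" : $user['first_name'] . "\n";
```

For the live page, the result of file_get_contents() would be passed in place of $sample.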

Related

How do I replace part of this string with a .* type of regex in php?

I am using explode to manipulate information I am scraping from a website. I am trying to eliminate something specific from the string so that it will return what I want and also add the rest of the items to the array.
$pageArray = explode('<td class="player-label"><a href="/nfl/players/antonio-brown.php?type=overall&week=draft">', $fantasyPros);
I would like to skip the antonio-brown section and use a regular expression or whatever is best to replace it so that it will not look for a specific name but every name on the list and add them to my array. Do you have any suggestions on what I should use here? I appreciate any assistance.

Seems like a parser job to me with appropriate xpath functions, e.g. not().
Consider the following code:
<?php
$data = <<<DATA
<td class="player-label">
Some brown link here
Some green link here
</td>
DATA;
$dom = new DOMDocument();
$dom->loadHTML($data);
$xpath = new DOMXPath($dom);
$green_links = $xpath->query("//a[not(contains(@href, 'antonio-brown'))]");
foreach ($green_links as $link) {
// do sth. useful here
}
?>
This selects every link whose href does not contain antonio-brown.
You can easily adjust this to td or any other element.
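As a rough sketch of that adjustment, the query can be scoped to the player-label cells and the link texts collected into an array. The function name playerLinksExcluding(), the second player and both href values are made up for illustration.

```php
<?php
// Hypothetical helper: returns the text of every <a> directly inside a
// td.player-label cell whose href does not contain $excluded.
function playerLinksExcluding(string $html, string $excluded): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Simplistic quoting: $excluded must not contain a single quote.
    $query = sprintf(
        "//td[@class='player-label']/a[not(contains(@href, '%s'))]",
        $excluded
    );

    $names = [];
    foreach ($xpath->query($query) as $link) {
        $names[] = trim($link->textContent);
    }
    return $names;
}

// Made-up sample table; the real page layout is an assumption.
$table = '<table><tr>'
    . '<td class="player-label"><a href="/nfl/players/antonio-brown.php">Antonio Brown</a></td>'
    . '<td class="player-label"><a href="/nfl/players/example-player.php">Example Player</a></td>'
    . '</tr></table>';

print_r(playerLinksExcluding($table, 'antonio-brown'));
```

Dropping the not(contains(...)) predicate from the query would select every player link instead of excluding one.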

Reading inside tag and adding an ID to the tag

I want to do the following, but I cannot work out how preg_replace can do it. So far I have only managed to insert fixed text, not to read the tag's content and then use it in the replacement.
I have a site with 300,000+ pages and we are trying to make an anchor text link sidebar. But first I need to add an ID to all my h2 tags. For example, I have <h2>this is the title</h2> and I need PHP to automatically output <h2 id="this-is-the-title">This is the title</h2> on page render.
Unfortunately, all attempts have failed. I have tried Google, but it's a hard one to search for as I am not sure what it's called.
Any ideas on what this is called, or code snippets?

The rule, as usual, is: no regular expressions for HTML. Use DOMDocument for this.
First create a function to generate the title-based id:
function value2id( $text )
{
$retval = preg_replace( '/ +/', '-', trim( $text ) );
if( preg_match( '/^[^a-z]/i', $retval ) ) $retval = "a$retval";
return $retval;
}
The function above returns an HTML5-compatible id. Earlier HTML versions restrict the characters allowed in an id more tightly, so modify the function as needed.
Then load the entire old page (I don't know whether your db holds the complete code or only the <body>) into a DOMDocument object, search for all <h2> elements and add the id attribute by calling the custom function:
$dom = new DomDocument();
libxml_use_internal_errors(1);
$dom->loadHTML( $html );
foreach( $dom->getElementsByTagName( 'h2' ) as $h2 )
{
$h2->setAttribute( 'id', value2id( $h2->nodeValue ) );
}
Now, you can print your modified HTML by:
echo $dom->saveHTML();
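The question asks for lowercase ids such as this-is-the-title, which value2id() does not produce. A possible variant, offered as a sketch only: slugify() and addHeadingIds() are hypothetical names, non-ASCII characters are simply dropped, and repeated titles get a numeric suffix so ids stay unique.

```php
<?php
// Hypothetical slug helper: lowercases the text, turns runs of other
// characters into hyphens, and prefixes "h-" if no letter comes first.
function slugify(string $text): string
{
    $slug = strtolower(trim($text));
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    $slug = trim($slug, '-');
    if ($slug === '' || !preg_match('/^[a-z]/', $slug)) {
        $slug = 'h-' . $slug;
    }
    return $slug;
}

// Hypothetical helper: adds an id attribute to every <h2> in $html.
function addHeadingIds(string $html): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $seen = [];
    foreach ($dom->getElementsByTagName('h2') as $h2) {
        $id = slugify($h2->textContent);
        // Append a counter so repeated titles still get unique ids.
        $seen[$id] = ($seen[$id] ?? 0) + 1;
        if ($seen[$id] > 1) {
            $id .= '-' . $seen[$id];
        }
        $h2->setAttribute('id', $id);
    }
    return $dom->saveHTML();
}

echo addHeadingIds('<h2>This is the title</h2><h2>This is the title</h2>');
```

Because loadHTML() receives a fragment here, saveHTML() returns it wrapped in the doctype, html and body elements that the parser adds.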

PHP XPath query returns nothing

I've recently been playing with DOMXPath in PHP and have had success with it. To get more experience, I've been grabbing certain elements from different sites. I am having trouble getting the weather marker off of this website: http://www.theweathernetwork.com/weather/cape0005
Specifically I want
//*[@id='theTemperature']
Here is what I have
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $tag){
echo $tag->nodeValue;
}
Is there something I am doing wrong here? I am able to produce actual results on other tags on the page but specifically not this one.
Thanks in advance.

You might want to improve your DOMDocument debugging skills. Here are some hints (Demo):
<?php
header('Content-Type: text/plain;');
$url = file_get_contents('http://www.theweathernetwork.com/weather/cape0005');
$dom = new DOMDocument();
@$dom->loadHTML($url);
$xpath = new DOMXPath($dom);
$tags = $xpath->query("//*[@id='theTemperature']");
foreach ($tags as $i => $tag){
echo $i, ': ', var_dump($tag->nodeValue), ' HTML: ', $dom->saveHTML($tag), "\n";
}
Output the number of the found node, I do it here with $i in the foreach.
var_dump the ->nodeValue, it helps to show what exactly it is.
Output the HTML by making use of the saveHTML function which shows a better picture.
The actual output:
0: string(0) ""
HTML: <p id="theTemperature"></p>
You can easily spot that the element is empty, so the temperature must go in from somewhere else, e.g. via javascript. Check the Network tools of your browser.

What happens is straightforward: the page contains an empty id="theTemperature" element, a placeholder to be populated by JavaScript. file_get_contents() just downloads the page without executing JavaScript, so the element remains empty. Load the page in a browser with JavaScript disabled to see it yourself.

The element you're trying to select is indeed empty. The page loads the temperature into that id through ajax. Specifically this script:
http://www.theweathernetwork.com/common/js/master/citypage_ajax.js?cb=201301231338
When you call file_get_contents(), those scripts are never executed. I'd go with guido's solution of using the RSS.
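One way to catch this kind of problem early is to check whether a matched element has any text before relying on it. The sketch below uses a hypothetical helper, placeholderState(), and made-up sample markup. It only inspects static HTML, in the same way file_get_contents() plus DOMDocument would.

```php
<?php
// Hypothetical helper: reports whether the first element with the given id
// is "missing", "empty" (likely filled in later by JavaScript) or "filled".
function placeholderState(string $html, string $id): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    // Simplistic quoting: $id must not contain a single quote.
    $nodes = $xpath->query(sprintf("//*[@id='%s']", $id));
    if ($nodes === false || $nodes->length === 0) {
        return 'missing';
    }
    return trim($nodes->item(0)->textContent) === '' ? 'empty' : 'filled';
}

// Made-up static markup resembling the situation described above.
$static = '<html><body><p id="theTemperature"></p><p id="city">Sample City</p></body></html>';

echo placeholderState($static, 'theTemperature'), "\n";
echo placeholderState($static, 'city'), "\n";
```

An "empty" result suggests looking for the data elsewhere, for example in the requests shown in the browser's network tools.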

PHP - Extracting two values from a line

I'm a beginner with regular expressions and am working on a server where I cannot install anything (does using DOM methods require installing anything?).
I have a problem that I cannot solve with my current knowledge.
I would like to extract from the line below the album id and image url.
There are more lines and other url elements in the string (file), but the album ids and image urls I need are all in strings similar to the one below:
<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">
So in this case I would like to get '774' and 'http://img255.imageshack.us/img00/000/000001.png'
I've seen multiple examples of extracting just the url or one other element from a string, but I really need to keep these both together and store these in one record of the database.
Any help is really appreciated!

Since you are new to this, I'll explain how you can use PHP's HTML parser, DOMDocument, to extract what you need. You should not use a regular expression, as they are inherently error-prone when parsing HTML and can easily produce false positives.
To start, let's say you have your HTML:
$html = '<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">';
And now, we load that into DOMDocument:
$doc = new DOMDocument;
$doc->loadHTML( $html);
Now that the HTML is loaded, it's time to find the elements we need. Let's assume the document can contain other <a> tags, so we only want the <a> tags that have an <img> tag as a direct child. Once we have the right nodes, we extract the information from them. So, let's have at it:
$results = array();
// Loop over all of the <a> tags in the document
foreach( $doc->getElementsByTagName( 'a') as $a) {
// If there are no children, continue on
if( !$a->hasChildNodes()) continue;
// Find the child <img> tag, if it exists
foreach( $a->childNodes as $child) {
if( $child->nodeType == XML_ELEMENT_NODE && $child->tagName == 'img') {
// Now we have the <a> tag in $a and the <img> tag in $child
// Get the information we need:
parse_str( parse_url( $a->getAttribute('href'), PHP_URL_QUERY), $a_params);
$results[] = array( $a_params['album'], $child->getAttribute('src'));
}
}
}
A print_r( $results); now leaves us with:
Array
(
[0] => Array
(
[0] => 774
[1] => http://img255.imageshack.us/img00/000/000001.png
)
)
Note that this omits basic error checking. One thing you can add is in the inner foreach loop, you can check to make sure you successfully parsed an album parameter from the <a>'s href attribute, like so:
if( isset( $a_params['album'])) {
$results[] = array( $a_params['album'], $child->getAttribute('src'));
}
Every function I've used in this can be found in the PHP documentation.
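The same extraction can also be written with DOMXPath, which selects the <img> children of links in a single query. This is a sketch under assumptions: the function name albumImages() is hypothetical, and the href in the sample (including the example.com host) is a stand-in, since only the album parameter and the image URL are known from the question.

```php
<?php
// Hypothetical helper: returns [album id, image src] pairs for every
// <img> that is a direct child of an <a> whose href has an album parameter.
function albumImages(string $html): array
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $results = [];
    foreach ($xpath->query('//a[@href]/img[@src]') as $img) {
        $href = $img->parentNode->getAttribute('href');
        $query = parse_url($href, PHP_URL_QUERY);
        if (!is_string($query)) {
            continue; // no query string, or an unparsable URL
        }
        parse_str($query, $params);
        if (isset($params['album'])) {
            $results[] = [$params['album'], $img->getAttribute('src')];
        }
    }
    return $results;
}

// Assumed link markup; only the <img> attributes come from the question.
$line = '<a href="http://example.com/gallery.php?album=774">'
    . '<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">'
    . '</a>';

print_r(albumImages($line));
```

Each pair keeps the album id and image URL together, so it can be written to one database record.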

If you've already narrowed it down to this line, then you can use a regex like the following:
$matches = array();
preg_match('#.+album=(\d+).+src="([^"]+)#', $yourHtmlLineHere, $matches);
Now if you
echo $matches[1];
echo " ";
echo $matches[2];
You'll get the following:
774 http://img255.imageshack.us/img00/000/000001.png

PHP DOM - stripping span tags, leaving their contents

I am looking to take markup like:
<span class="test">Some text that is <strong>bolded</strong> and contains a link.</span>
and find the best method in PHP for stripping the span so that what is left is this:
Some text that is <strong>bolded</strong> and contains a link.
I have read many of the other questions regarding parsing HTML using PHP DOM instead of regex, but have been unable to figure out a way to strip the spans with PHP DOM, leaving the HTML contents intact. The ultimate goal is to be able to strip the document of all span tags, leaving their contents. Can this be done with PHP DOM? Is there a method that provides better performance and does not rely on string parsing instead of DOM parsing?
I've used regex to do so, without any issues thus far:
/<(\/)?(span)[^>]*>/i
But my interest here is in becoming a better PHP programmer. And since it is always possible to trip up a regex with badly formatted markup, I'm looking for a better way. I have also considered using strip_tags(), doing something like the following:
public function strip_tags( $content, $tags_to_strip = array() )
{
// All Valid XHTML tags
$valid_tags = array(
'a','abbr','acronym','address','area','b','base','bdo','big','blockquote','body','br','button','caption','cite',
'code','col','colgroup','dd','del','dfn','div','dl','DOCTYPE','dt','em','fieldset','form','h1','h2','h3','h4',
'h5','h6','head','html','hr','i','img','input','ins','kbd','label','legend','li','link','map','meta','noscript',
'object','ol','optgroup','option','p','param','pre','q','samp','script','select','small','span','strong','style',
'sub','sup','table','tbody','td','textarea','tfoot','th','thead','title','tr','tt','ul','var'
);
// Remove each tag to strip from the valid_tags array
foreach ( $tags_to_strip as $tag ){
$ndx = array_search( $tag, $valid_tags );
if ( $ndx !== false ){
unset( $valid_tags[ $ndx ] );
}
}
// convert valid_tags array into param for strip_tags
$valid_tags = implode( '><', $valid_tags );
$valid_tags = "<$valid_tags>";
$content = strip_tags( $content, $valid_tags );
return $content;
}
But this is still string parsing, not DOM parsing, so if the text is malformed it could strip too much. Many people are quick to suggest Simple HTML DOM Parser, but looking at the source code, it seems to use regex to parse the HTML as well.
Can this be done with PHP5's DOM, or is there a better way to strip tags while leaving their contents intact? Would it be bad practice to use Tidy or HTML Purifier to clean the text and then use regex or Simple HTML DOM Parser on it?
Libraries like phpQuery seem to be too heavy weight for what seems like it should be a simple task.

I use the following function to remove a node without removing its children:
function DOMRemove(DOMNode $from) {
    // Move each child in front of the node, then drop the emptied node.
    // A while loop also copes with nodes that have no children at all.
    while ($from->firstChild !== null) {
        $from->parentNode->insertBefore($from->firstChild, $from);
    }
    $from->parentNode->removeChild($from);
}
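For example:
$dom = new DOMDocument;
// loadHTMLFile() parses HTML; load() expects well-formed XML
$dom->loadHTMLFile('myhtml.html');
// Copy the live node list first: removing nodes while iterating it skips elements
$nodes = iterator_to_array($dom->getElementsByTagName('span'));
foreach ($nodes as $node) {
    DOMRemove($node);
}
echo $dom->saveHTML();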
Would give you:
Some text that is <strong>bolded</strong> and contains a link.
While this:
$nodes = iterator_to_array($dom->getElementsByTagName('a'));
foreach ($nodes as $node) {
DOMRemove($node);
}
echo $dom->saveHTML();
Would give you:
<span class="test">Some text that is <strong>bolded</strong> and contains a link.</span>
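For markup held in a string rather than a file, a variant along these lines may be useful. It is a sketch, not the answer's own code: unwrapTags() is a hypothetical name, the fragment is wrapped in a temporary <div> so it has a single root, and the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD flags ask libxml not to add html/body elements or a doctype. The href="#" in the sample is assumed.

```php
<?php
// Hypothetical helper: removes every <$tagName> element from an HTML
// fragment while keeping the element's children in place.
function unwrapTags(string $html, string $tagName): string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML(
        '<div>' . $html . '</div>',
        LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $root = $dom->documentElement; // the temporary wrapper <div>

    // Copy the live node list so removals do not disturb the iteration.
    foreach (iterator_to_array($dom->getElementsByTagName($tagName)) as $node) {
        if ($node->isSameNode($root)) {
            continue; // never unwrap the temporary wrapper itself
        }
        while ($node->firstChild !== null) {
            $node->parentNode->insertBefore($node->firstChild, $node);
        }
        $node->parentNode->removeChild($node);
    }

    // Serialize only the wrapper's children, dropping the wrapper.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$markup = '<span class="test">Some text that is <strong>bolded</strong> '
    . 'and contains a <a href="#">link</a>.</span>';

echo unwrapTags($markup, 'span'), "\n";
```

Nested spans are handled as well, because each unwrapped element's children stay in the document and are visited later in the copied list.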

In my experience, every time I worked with DOM, I lost a little performance compared with simple string operations.
With your function, you filter strictly on valid XHTML tags, but you don't need a loop with manual comparisons, since you can hand the whole task to PHP's native functions.
Of course, your combination already performs very well (for me, 0.0002 milliseconds), but you could combine the functions in a single line, letting each function do its own job.
Take a look and you will understand what I'm talking about:
$text = '<span class="test">Some text that is <strong>bolded</strong> and contains a link.</span>';
$validTags = array( 'a','abbr','acronym','address','area','b','base','bdo','big','blockquote','body','br','button','caption','cite',
'code','col','colgroup','dd','del','dfn','div','dl','DOCTYPE','dt','em','fieldset','form','h1','h2','h3','h4',
'h5','h6','head','html','hr','i','img','input','ins','kbd','label','legend','li','link','map','meta','noscript',
'object','ol','optgroup','option','p','param','pre','q','samp','script','select','small','span','strong','style',
'sub','sup','table','tbody','td','textarea','tfoot','th','thead','title','tr','tt','ul','var'
);
$tagsToStrip = array( 'span' );
var_dump( strip_tags( $text, sprintf( '<%s>', implode( '><', array_diff( $validTags, $tagsToStrip ) ) ) ) );
I used your own list, but combined sprintf(), implode() and array_diff() so that each handles a specific task and together they achieve the goal.
Hope it helps.
