Extract node from XML like data without extra PHP libs - php

I am returned the following:
<links>
<image_link>http://img357.imageshack.us/img357/9606/48444016.jpg</image_link>
<thumb_link>http://img357.imageshack.us/img357/9606/48444016.th.jpg</thumb_link>
<ad_link>http://img357.imageshack.us/my.php?image=48444016.jpg</ad_link>
<thumb_exists>yes</thumb_exists>
<total_raters>0</total_raters>
<ave_rating>0.0</ave_rating>
<image_location>img357/9606/48444016.jpg</image_location>
<thumb_location>img357/9606/48444016.th.jpg</thumb_location>
<server>img357</server>
<image_name>48444016.jpg</image_name>
<done_page>http://img357.imageshack.us/content.php?page=done&l=img357/9606/48444016.jpg</done_page>
<resolution>800x600</resolution>
<filesize>38477</filesize>
<image_class>r</image_class>
</links>
I wish to extract the image_link in PHP as simply and as easily as possible. How can I do this?
Assume, I can not make use of any extra libs/plugins for PHP. :)
Thanks all

At Josh's answer, the problem was not escaping the "/" character. So the code Josh submitted would become:
$text = 'string_input';
preg_match('/<image_link>([^<]+)<\/image_link>/gi', $text, $regs);
$result = $regs[0];
Taking usoban's answer, an example would be:
<?php
// Load the file into $content
$xml = new SimpleXMLElement($content) or die('Error creating a SimpleXML instance');
$imagelink = (string) $xml->image_link; // This is the image link
?>
I recommend using SimpleXML because it's very easy and, as usoban said, it's builtin, that means that it doesn't need external libraries in any way.

You can use SimpleXML as it is built in PHP.

use regular expressions
$text = 'string_input';
preg_match('/<image_link>([^<]+)</image_link>/gi', $text, $regs);
$result = $regs[0];

Related

How do I cut this string in PHP?

I'm currently working on a script to archive an imageboard.
I'm kinda stuck on making links reference correctly, so I could use some help.
I receive this string:
>>10028949<br><br>who that guy???
In said string, I need to alter this part:
<a href="10028949#p10028949"
to become this:
<a href="#p10028949"
using PHP.
This part may appear more than once in the string, or might not appear at all.
I'd really appreciate it if you had a code snippet I could use for this purpose.
Thanks in advance!
Kenny
Disclaimer: as it'll be said in the comments, using a DOM parser is better to parse HTML.
That being said:
"/(<a[^>]*?href=")\d+(#[^"]+")/"
replaced by $1$2
So...
$myString = preg_replace("/(<a[^>]*?href=\")\d+(#[^\"]+\")/", "$1$2", $myString);
try this
>>10028949<br><br>who that guy???
Although you have the question already answered I invite you to see what would (approximately xD) be the correct approach, parsing it with DOM:
$string = '>>10028949<br><br>who that guy???';
$dom = new DOMDocument();
$dom->loadHTML($string);
$links = $dom->getElementsByTagName('a'); // This stores all the links in an array (actually a nodeList Object)
foreach($links as $link){
$href = $link->getAttribute('href'); //getting the href
$cut = strpos($href, '#');
$new_href = substr($href, $cut); //cutting the string by the #
$link->setAttribute('href', $new_href); //setting the good href
}
$body = $dom->getElementsByTagName('body')->item(0); //selecting everything
$output = $dom->saveHTML($body); //passing it into a string
echo $output;
The advantages of doing it this way is:
More organized / Cleaner
Easier to read by others
You could for example, have mixed links, and you only want to modify some of them. Using Dom you can actually select certain classes only
You can change other attributes as well, or the selected tag's siblings, parents, children, etc...
Of course you could achieve the last 2 points with regex as well but it would be a complete mess...

How to Ignore Whitespaces using preg_match()

I have a string that looks like:
">ANY CONTENT</span>(<a id="show
I need to fetch ANY CONTENT. However, there are spaces in between
</span> and (<a id="show
Here is my preg_match:
$success = preg_match('#">(.*?)</span>\s*\(<a id="show#s', $basicPage, $content);
\s* represents spaces. I get an empty array!
Any idea how to fetch CONTENT?
Use a real HTML parser. Regular expressions are not really suitable for the job. See this answer for more detail.
You can use DOMDocument::loadHTML() to parse into a structured DOM object that you can then query, like this very basic example (you need to do error checking though):
$dom = new DOMDocument;
$dom->loadHTML($data);
$span = $dom->getElementsByTagName('span');
$content = $span->item(0)->textContent;
I just had to:
">
define the above properly, because "> were too many in the page, so it didn't know which one to choose specficially. Therefore, it returned everything before "> until it hits (
Solution:
.">
Sample:
$success = preg_match('#\.">(.*?)</span>\s*\(<a id="show#s', $basicPage, $content);

Strip directory structure in HTML

I have a PHP application that reads in a bit of HTML. In this HTML there may be an img tag. What I want to do is strip the directory structure from the src of the image tag e.g.
<img src="dir1/dir2/dir3/image1.jpg>
to
<img src="image1.jpg">
Anyone have any pointers?
Thanks,
Mark
As a suggestion, rather than using regex, you may be better off using something like the SimpleXML class to traverse the HTML, that way you'd be able to find the img tags and their src attribute then change it easily. Rather than having to try and parse a whole document with regex. After you've done that you'd be able to just explode the string using the "/" delimiter and use the last value of the exploded array as the src attribute.
PHP.net's SimpleXML Manual: http://php.net/manual/en/book.simplexml.php
This is a tutorial how to change all links in a HTMl document: Scraping Links From HTML.
With a slight modification of the example, this could do it:
<?php
require('FluentDOM/FluentDOM.php');
$html = '<img src="dir1/dir2/dir3/image1.jpg">';
$fd = FluentDOM($html, 'html')->find('//img[#src]')->each(
function ($node) use ($url) {
$item = FluentDOM($node);
$item->attr('href', basename($item->attr('src')));
}
);
$fd->contentType = 'xml';
header('Content-type: text/xml');
echo $fd;
?>
If you want to try this with regexp this could work:
$subject = "dir1/dir2/dir3/image1.jpg";
$pattern = '/^.*\//';
$result = preg_replace($pattern, '', $subject);

Syntax regex to extract the source of an image

Ahoy there!
I can't "guess" witch syntax should I use to be able to extract the source of an image but simply the web address not the src= neither the quotes?
Here is my piece of code:
function get_all_images_src() {
$content = get_the_content();
preg_match_all('|src="(.*?)"|i', $content, $matches, PREG_SET_ORDER);
foreach($matches as $path) {
echo $path[0];
}
}
When I use it I got this printed:
src="http://project.bechade.fr/wp-content/uploads/2009/09/mer-300x225.jpg"
And I wish to get only this:
http://project.bechade.fr/wp-content/uploads/2009/09/mer-300x225.jpg
Any idea?
Thanks for your help.
Not exactly an answer to your question, but when parsing html, consider using a proper html parser:
foreach($html->find('img') as $element) {
echo $element->src . '<br />';
}
See: http://simplehtmldom.sourceforge.net/
$path[1] instead of $path[0]
echo $path[1];
$path[0] is the full string matched. $path[1] is the first grouping.
You could explode the string using " as a delimeter and then the second item in the array you get would be the right string:
$array = explode('"',$full_src);
$bit_you_want = $array[1];
Reworking your original function, it would be:
function get_all_images_src() {
$content = get_the_content();
preg_match_all('|src="(.*?)"|i', $content, $matches, PREG_SET_ORDER);
foreach($matches as $path) {
$src = explode('"', $path);
echo $src[1];
}
}
Thanks Ithcy for his right answer.
I guess I've been too long to respond because he deleted it, I just don't know where his answer's gone...
So here is the one I've received by mail:
'|src="(.*?)"|i' makes no sense as a
regex. try '|src="([^"]+)"|i' instead.
(Which still isn't the most robust
solution but is better than what
you've got.)
Also, what everyone else said. You
want $path1, NOT $path[0]. You're
already extracting all the src
attributes into $matches[]. That has
nothing to do with $path[0]. If you're
not getting all of the src attributes
in the text, there is a problem
somewhere else in your code.
One more thing - you should use a real
HTML parser for this, because img tags
are not the only tags with src
attributes. If you're using this code
on raw HTML source, it's going to
match not just but
tags, etc.
— ithcy
I did everything he told me to do including using a HTML parser from Bart (2nd answer).
It works like a charm ! Thank you mate...

PHP: using preg_replace with htmlentities

I'm writing an RSS to JSON parser and as a part of that, I need to use htmlentities() on any tag found inside the description tag. Currently, I'm trying to use preg_replace(), but I'm struggling a little with it. My current (non-working) code looks like:
$pattern[0] = "/\<description\>(.*?)\<\/description\>/is";
$replace[0] = '<description>'.htmlentities("$1").'</description>';
$rawFeed = preg_replace($pattern, $replace, $rawFeed);
If you have a more elegant solution to this as well, please share. Thanks.
Simple. Use preg_replace_callback:
function _handle_match($match)
{
return '<description>' . htmlentities($match[1]) . '</description>';
}
$pattern = "/\<description\>(.*?)\<\/description\>/is";
$rawFeed = preg_replace_callback($pattern, '_handle_match', $rawFeed);
It accepts any callback type, so also methods in classes.
The more elegant solution would be to employ SimpleXML. Or a third party library such as XML_Feed_Parser or Zend_Feed to parse the feed.
Here is a SimpleXML example:
<?php
$rss = file_get_contents('http://rss.slashdot.org/Slashdot/slashdot');
$xml = simplexml_load_string($rss);
foreach ($xml->item as $item) {
echo "{$item->description}\n\n";
}
?>
Keep in mind that RSS and RDF and Atom look different, which is why it can make sense to employ one of the above libraries I mentioned.

Categories