I'm a beginner with regular expressions and am working on a server where I cannot instal anything (does using DOM methods require the instal of anything?).
I have a problem that I cannot solve with my current knowledge.
I would like to extract from the line below the album id and image url.
There are more lines and other url elements in the string (file), but the album ids and image urls I need are all in strings similar to the one below:
<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">
So in this case I would like to get '774' and 'http://img255.imageshack.us/img00/000/000001.png'
I've seen multiple examples of extracting just the url or one other element from a string, but I really need to keep these both together and store these in one record of the database.
Any help is really appreciated!
Since you are new to this, I'll explain that you can use PHP's HTML parser known as DOMDocument to extract what you need. You should not use a regular expression as they are inherently error prone when it comes to parsing HTML, and can easily result in many false positives.
To start, lets say you have your HTML:
$html = '<img alt="/" src="http://img255.imageshack.us/img00/000/000001.png" height="133" width="113">';
And now, we load that into DOMDocument:
$doc = new DOMDocument;
$doc->loadHTML( $html);
Now, we have that HTML loaded, it's time to find the elements that we need. Let's assume that you can encounter other <a> tags within your document, so we want to find those <a> tags that have a direct <img> tag as a child. Then, check to make sure we have the correct nodes, we need to make sure we extract the correct information. So, let's have at it:
$results = array();
// Loop over all of the <a> tags in the document
foreach( $doc->getElementsByTagName( 'a') as $a) {
// If there are no children, continue on
if( !$a->hasChildNodes()) continue;
// Find the child <img> tag, if it exists
foreach( $a->childNodes as $child) {
if( $child->nodeType == XML_ELEMENT_NODE && $child->tagName == 'img') {
// Now we have the <a> tag in $a and the <img> tag in $child
// Get the information we need:
parse_str( parse_url( $a->getAttribute('href'), PHP_URL_QUERY), $a_params);
$results[] = array( $a_params['album'], $child->getAttribute('src'));
}
}
}
A print_r( $results); now leaves us with:
Array
(
[0] => Array
(
[0] => 774
[1] => http://img255.imageshack.us/img00/000/000001.png
)
)
Note that this omits basic error checking. One thing you can add is in the inner foreach loop, you can check to make sure you successfully parsed an album parameter from the <a>'s href attribute, like so:
if( isset( $a_params['album'])) {
$results[] = array( $a_params['album'], $child->getAttribute('src'));
}
Every function I've used in this can be found in the PHP documentation.
If you've already narrowed it down to this line, then you can use a regex like the following:
$matches = array();
preg_match('#.+album=(\d+).+src="([^"]+)#', $yourHtmlLineHere, $matches);
Now if you
echo $matches[1];
echo " ";
echo $matches[2];
You'll get the following:
774 http://img255.imageshack.us/img00/000/000001.png
Related
I want to find all href tags that include my URL in any html source.
I used this code:
preg_match_all("'<a.*?href=\"(http[s]*://[^>\"]*?)\"[^>]*?>(.*?)</a>'si", $target_source, $matches);
Example, I try to find a href tags that include http://www.emrekadan.com
How can I do it ?
I'd simply use PHP's DOM Parser for this purpose. This may seem harder than regex, but it's actually a lot more easier and is the correct way to parse HTML.
$url = 'WEBSITE_TO_SEARCH_FOR';
$searchstring = 'YOUR_SEARCH_STRING';
$dom = new DOMDocument();
#$dom->loadHTMLFile($url);
$result = array();
foreach($dom->getElementsByTagName('a') as $link) {
$href = $link->getAttribute('href');
if(stripos($href, $searchstring) !== FALSE) {
$result[] = $href;
}
}
if(!empty($result)) print_r($result);
Explanation:
Loads the given URL using loadHTMLfile() method
Finds all <a> tags and loops through them
Uses stripos() to case-insensitively check if the href contains the given search term
If it does, it's pushed into the $result array
Note: If an empty string is passed as the filename or an empty file is named, a warning will be generated. I've used # to hide that message, but it's generally regarded as a bad practice. You can add additional checks to make sure the URL exists before trying to load it.
I have:
<span>something or other</span>
<b>blarg</b>
<b>blarg and stuff</b>
<span>blarg</span>
<em>wakka wakka</em>
<em>wakka blarg</em>
<em>blarg</em>
and I just want to get the elements that ONLY contain "blarg" and no other text, so:
<b>blarg</b>
<span>blarg</span>
<em>blarg</em>
The important issue here is that I'm trying to check if blarg exists within one element alone on the page or not. I've had some general luck with regex but I'd rather do it with simple_html_dom so that I can look at child and sibling elements as well.
Does anyone know what is the simplest way to do this with simple_html_dom?
A way to do it, is to parse every tag, and test if it contains 'blarg'...
Here's a working example:
$text = '<span>something or other</span>
<b>blarg</b>
<b>blarg and stuff</b>
<span>blarg</span>
<em>wakka wakka</em>
<em>wakka blarg</em>
<em>blarg</em>';
echo "<div>Original Text: <xmp>$text</xmp></div>";
$html = str_get_html($text);
// Find all elements
$tags = $html->find('*');
foreach ($tags as $key => $tag) {
// If text in tag contains 'blarg'
if (strcmp(trim($tag->plaintext),'blarg') == 0) {
echo "<div> 'blarg' found in \$tags[$key]: <xmp>".$tag->outertext."</xmp></div>";
}
}
I don't know what you want to do with, but this may be a start :)
I have a string of data that is set as $content, an example of this data is as follows
This is some sample data which is going to contain an image in the format <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg">. It will also contain lots of other text and maybe another image or two.
I am trying to grab just the <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg"> and save it as another string for example $extracted_image
I have this so far....
if( preg_match_all( '/<img[^>]+src\s*=\s*["\']?([^"\' ]+)[^>]*>/', $content, $extracted_image ) ) {
$new_content .= 'NEW CONTENT IS '.$extracted_image.'';
All it is returning is...
NEW CONTENT IS Array
I realise my attempt is probably completly wrong but can someone tell me where I am going wrong?
Your first problem is that http://php.net/manual/en/function.preg-match-all.php places an array into $matches, so you should be outputting the individual item(s) from the array. Try $extracted_image[0] to start.
You need to use a different function, if you only want one result:
preg_match() returns the first and only the first match.
preg_match_all() returns an array with all the matches.
Using regex to parse valid html is ill-advised. Because there can be unexpected attributes before the src attribute, because non-img tags can trick the regular expression into false-positive matching, and because attribute values can be quoted with single or double quotes, you should use a dom parser. It is clean, reliable, and easy to read.
Code: (Demo)
$string = <<<HTML
This is some sample data which is going to contain an image
in the format <img src="http://www.randomdomain.com/randomfolder/randomimagename.jpg">.
It will also contain lots of other text and maybe another image or two
like this: <img alt='another image' src='http://www.example.com/randomfolder/randomimagename.jpg'>
HTML;
$srcs = [];
$dom=new DOMDocument;
$dom->loadHTML($string);
foreach ($dom->getElementsByTagName('img') as $img) {
$srcs[] = $img->getAttribute('src');
}
var_export($srcs);
Output:
array (
0 => 'http://www.randomdomain.com/randomfolder/randomimagename.jpg',
1 => 'http://www.example.com/randomfolder/randomimagename.jpg',
)
I have a little issue as my preg_match_all is not running properly.
what I want to do is extract the src parameter of all the images in the post_content from the wordpress which is a string - not a complete html document/DOM (thus cannot use a document parser function)
I am currently using the below code which is unfortunately too untidy and works for only 1 image src, where I want all image sources from that string
preg_match_all( '/src="([^"]*)"/', $search->post_content, $matches);
if ( isset( $matches ) )
{
foreach ($matches as $match)
{
if(strpos($match[0], "src")!==false)
{
$res = explode("\"", $match[0]);
echo $res[1];
}
}
}
can someone please help here...
Using regular expressions to parse an HTML document can be very error prone. Like in your case where not only IMG elements have an SRC attribute (in fact, that doesn’t even need to be an HTML attribute at all). Besides that, it also might be possible that the attribute value is not enclosed in double quote.
Better use a HTML DOM parser like PHP’s DOMDocument and its methods:
$doc = new DOMDocument();
$doc->loadHTML($search->post_content);
foreach ($doc->getElementsByTagName('img') as $img) {
if ($img->hasAttribute('src')) {
echo $img->getAttribute('src');
}
}
You can use a DOM parser with HTML strings, it is not necessary to have a complete HTML document. http://simplehtmldom.sourceforge.net/
The site I'm working on has a database table filled with glossary terms. I am building a function that will take some HTML and replace the first instances of the glossary terms with tooltip links.
I am running into a problem though. Since it's not just one replace, the function is replacing text that has been inserted in previous iterations, so the HTML is getting mucked up.
I guess the bottom line is, I need to ignore text if it:
Appears within the < and > of any HTML tag, or
Appears within the text of an <a></a> tag.
Here's what I have so far. I was hoping someone out there would have a clever solution.
function insertGlossaryLinks($html)
{
// Get glossary terms from database, once per request
static $terms;
if (is_null($terms)) {
$query = Doctrine_Query::create()
->select('gt.title, gt.alternate_spellings, gt.description')
->from('GlossaryTerm gt');
$glossaryTerms = $query->rows();
// Create whole list in $terms, including alternate spellings
$terms = array();
foreach ($glossaryTerms as $glossaryTerm) {
// Initialize with title
$term = array(
'wordsHtml' => array(
h(trim($glossaryTerm['title']))
),
'descriptionHtml' => h($glossaryTerm['description'])
);
// Add alternate spellings
foreach (explode(',', $glossaryTerm['alternate_spellings']) as $alternateSpelling) {
$alternateSpelling = h(trim($alternateSpelling));
if (empty($alternateSpelling)) {
continue;
}
$term['wordsHtml'][] = $alternateSpelling;
}
$terms[] = $term;
}
}
// Do replacements on this HTML
$newHtml = $html;
foreach ($terms as $term) {
$callback = create_function('$m', 'return \'<span>\'.$m[0].\'</span>\';');
$term['wordsHtmlPreg'] = array_map('preg_quote', $term['wordsHtml']);
$pattern = '/\b('.implode('|', $term['wordsHtmlPreg']).')\b/i';
$newHtml = preg_replace_callback($pattern, $callback, $newHtml, 1);
}
return $newHtml;
}
Using Regexes to process HTML is always risky business. You will spend a long time fiddling with the greediness and laziness of your Regexes to only capture text that is not in a tag, and not in a tag name itself. My recommendation would be to ditch the method you are currently using and parse your HTML with an HTML parser, like this one: http://simplehtmldom.sourceforge.net/. I have used it before and have recommended it to others. It is a much simpler way of dealing with complex HTML.
I ended up using preg_replace_callback to replace all existing links with placeholders. Then I inserted the new glossary term links. Then I put back the links that I had replaced.
It's working great!