I am working on moving some blog-ish articles to a new third-party home, and need to replace some existing URLs with new ones. I cannot use XML, and am being forced to use a wrapper class that requires this search to happen in regex. I'm currently having trouble regex-ing for the URLs that exist in the html. For example if the html is:
<h1><a href="http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345">Whatever</a></h1>
I need my regex to return:
http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345
The beginning part of the URL never changes (the "http://www.website.com/article/" part). However, I have no clue what the slug phrases are going to be, but I do know they will contain an unknown number of hyphens between the words. The ID number at the end of the URL could be any integer.
There are multiple links of these types in each article, and there are also other types of URLs in the article that I want to be sure are ignored, so I can't just look for phrases starting with http inside of quotes.
FWIW: I'm working in php and am currently trying to use preg_match_all to return an array of the URLs needed
Here's my latest attempt:
$array_of_urls = [];
preg_match_all('/http:\/\/www\.website\.com\/article\/[^"]*/', $variable_with_html, $array_of_urls);
var_dump($array_of_urls);
And then I get nada dumped out. Any help appreciated!!!
We, StackOverflow volunteers, must insist on enjoying the stability of a dom parser rather than regex when parsing html data.
Code:
$html=<<<HTML
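<h1><a href="http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345">Whatever</a></h1>
<p>Here is a url as plain text: http://www.website.com/article/sluggy-slug</p>
<div>Here is a qualifying link: <a href="http://www.website.com/article/slugger-sluggington-jr/666">Whatever</a></div>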
HTML;
$dom = new DomDocument();
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $item) {
$output[] = $item->getAttribute('href');
}
var_export($output);
Output:
array (
0 => 'http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345',
1 => 'http://www.website.com/article/slugger-sluggington-jr/666',
)
If for some crazy reason, the above doesn't work for your project and you MUST use regex, this should suffice:
~<a.*?href="\K[^"]+~i // using case-insensitive flag in case of all-caps syntax
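As a rough sketch of how that fallback could be applied with preg_match_all, here is the same \K idea narrowed to the article URLs from the question. The sample HTML, the variable names, and the extra slug/ID checks are my own assumptions, not part of the pattern above.

```php
<?php
// Hypothetical sample markup; only the first link should qualify.
$html = <<<HTML
<h1><a href="http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345">Whatever</a></h1>
<p>Plain text: http://www.website.com/article/sluggy-slug</p>
<div><a href="http://www.website.com/about">About</a></div>
HTML;

// \K discards everything matched so far, so only the URL itself is returned.
// The prefix is fixed, the slug is [^"/]+ and the ID is one or more digits.
$pattern = '~<a\b[^>]*?\bhref="\Khttp://www\.website\.com/article/[^"/]+/\d+(?=")~i';

preg_match_all($pattern, $html, $matches);
$urls = $matches[0];

var_export($urls);
```

The plain-text URL and the /about link are skipped because they are either not inside an href or do not follow the article/slug/ID shape.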
What I am trying to achieve
I am trying to replace the 'innerHTML' of any (in my case) html tag, that has a specific class assigned to it, within a file_get_contents() string, without altering the other content. Later I will create a file (with file_put_contents()).
I am specifically trying to avoid the use of DOMDocuments, Xpath, simple_html_dom because these alter the formatting of a document.
The class markers are just a way to mark the elements in the source, like lightbox does. Marking with a class seemed most elegant, but maybe marking elements in a different way makes the solution easier? I doubt it will make a difference though.
The code should also match when:
when class="..." contains other classes
when innerHTML contains other tags
It is not necessary but it would be amazing if it even matches if:
There is php in class="..."
php inbetween class="..." and >
What I have tried
(in counter-chronological order)
1 - trying to work with the following function I've found in other SO answers and php.net:
function preg_replace_nth($pattern, $replacement, $subject, $nth=1) {
return preg_replace_callback($pattern,
function($found) use (&$pattern, &$replacement, &$nth) {
$nth--;
if ($nth==0) return preg_replace($pattern, $replacement, reset($found) );
return reset($found);
},$subject ,$nth );
}
I am not a regex expert and in combination with the php functions it becomes, for me, very difficult, that's why I ask for help. (I've been working on this for an hour or 8.)
I tried feeding it the following regex pattern (with many small alterations):
'#(?<=class=\"classToMatch\".*?>).*?(?=</)#';
For the last 30 alterations it keeps returning:
Warning: preg_replace_callback(): Compilation failed: lookbehind assertion is not fixed length at offset xx
Things I realise that are perhaps problematic for regex:
I do not have the luxury to be able to look for a specific closing tag (e.g. </h2>) because the tag could be any element. If really necessary, maybe I should limit my request to <p>, <h(x)> and <a> elements.
I think dealing with nested elements might become problematic.
2 - working with simple_html_dom and DOMDocument
First I was delighted to see that it worked, but when I opened the source code of the edited document I was horrified because it deleted a lot of formatting.
This was the working code and should be fine for anyone working with html documents with little php and javascript.
$nth = 0; // nth occurrence (starts with 0)
$replaceWith = ''; // replacement string
$dom = new DOMDocument();
@$dom->loadHTMLFile("source.php");
// find all elements with specific class
$finder = new DomXPath($dom);
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' classname ')]");
if (!is_int($nodes->length) || $nodes->length < 1) die('No element found');
$nodeToChange = $nodes->item($nth);
$nodeToChange ->removeChild($nodeToChange ->firstChild);
$fragment = $dom->createDocumentFragment();
$fragment->appendXML($replaceWith);
$nodeToChange->appendChild($fragment);
$dom->saveHTMLFile("test.php");
3 - things with strpos etc. and I am currently considering returning to these functions.
The following regex might be helpful to you:
<(?<tag>\w*)\sclass=\"lent-editable\">(?<text>.*)</\k<tag>>
You will need to find the group name "text", which is the inner HTML you want to replace.
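Here is a minimal sketch of how the named groups could be used with preg_replace_callback. The sample input and the replacement text are made up, and I have tightened the pattern slightly (\w+ instead of \w*, a lazy .*? and the s flag) so that it stops at the first matching closing tag. It still assumes the class attribute is the only attribute and that the element does not contain a nested tag of the same name.

```php
<?php
// Hypothetical input; the class name is taken from the pattern above.
$source = '<h2 class="lent-editable">Old <em>title</em></h2>' . "\n"
        . '<p class="other">Untouched</p>';

$pattern = '~<(?<tag>\w+)\sclass="lent-editable">(?<text>.*?)</\k<tag>>~s';

// Rebuild the element around the new inner HTML and leave the rest of the string alone.
$result = preg_replace_callback(
    $pattern,
    function (array $m): string {
        return '<' . $m['tag'] . ' class="lent-editable">New content</' . $m['tag'] . '>';
    },
    $source
);

echo $result;
```

Because the rest of the string is never parsed, the formatting outside the matched element is left exactly as it was.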
I need help with a REGEX that will find a link that comes in different formats based on how it got inserted to the HTML page.
I am able to read the pages into PHP. I just can't come up with the right regex to find the URLs and isolate them.
I have a few examples of how they get inserted. Sometimes they are plain-text links, and sometimes they have an anchor tag wrapped around them. There is even the odd occasion where text that is not part of the link gets inserted without spacing.
The Article ID and Article Key are never the same. The Article Key, however, always ends with a digit. If this is possible I sure could use the help. Thanks
Here are a few examples.
http://www.example.com/ArticleDetails.aspx?ArticleID=3D10045411&AidKey=3D-2086622941
http://example.com/ArticleDetails.aspx?ArticleID=10919199&AidKey=1956996566
http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&AidKey=1998267392
This is a link description
http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736this is not part of the url.
In the end I am just looking for the URL.
http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736
DO NOT USE A REGEX! Use an XML parser...
$dom = new DOMDocument();
$dom->loadHTMLFile($pathToFile);
$finder = new DOMXpath($dom);
$anchors = $finder->query('//a[@href]');
foreach($anchors as $anchor){
$href = $anchor->getAttribute('href');
if(preg_match($regexToMatchUrls, $href)){
//do stuff
}
}
So $regexToMatchUrls would be a regex just to match the URLs you are looking for, not any of the HTML, which is much simpler. Then you can take action when a match occurs.
This regex works for me:
/http:\/\/(www\.)?example\.com\/ArticleDetails.aspx\?ArticleID=(.*?)(\&|\&amp;)AidKey=([\d\w-]*)/g
UPDATE:
I added a \d at the end of the regex.
/http:\/\/(www\.)?example\.com\/ArticleDetails.aspx\?ArticleID=(.*?)(\&|\&amp;)AidKey=([\d\w-]*)\d/g
To use it in PHP you need /.../msi
PHP Example in action: http://ideone.com/N0TKM
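For reference, a hedged sketch of how such a pattern might be run through preg_match_all. The sample text is adapted from the question, and the pattern is my own tightened variant (non-capturing groups, escaped dot), not the exact one above.

```php
<?php
// Sample text adapted from the question, including trailing text glued to the last URL.
$text = 'http://example.com/ArticleDetails.aspx?ArticleID=10919199&AidKey=1956996566 '
      . 'http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=-1998267392 '
      . 'http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736this is not part of the url.';

// [\w-]* is greedy, so the trailing \d forces a backtrack to the last digit in the run.
$pattern = '/http:\/\/(?:www\.)?example\.com\/ArticleDetails\.aspx\?ArticleID=\w+?&(?:amp;)?AidKey=[\w-]*\d/i';

preg_match_all($pattern, $text, $matches);
$urls = $matches[0];

var_export($urls);
```

One caveat: if the glued-on text itself contains digits with no space before them, those digits would be swallowed into the match, so this only works when the trailing junk is digit-free up to the next space.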
I am trying to index some content from a series of .html's that share the same format.
So I get a lot of lines like this: <a href="meh">[18] blah blah blah < a...
And the idea is to extract the number (18) and the text next to it (blah...). Furthermore, I know that every qualifying line will start with "> and end with either <a or </p. The issue stems from the need to keep all other HTML tags as part of the text (<i>, <u>, etc.).
So then I have something like this:
$docString = file_get_contents("http://whatever.com/some.htm");
$regex="/\">\ [(.*?)\ ] (<\/a>)(.) *?(<)/";
preg_match_all($regex,$docString,$match);
Let's look at $regex for a sec. Ignore its spaces; I only put them there because otherwise some characters disappear. I specify that it will start with ">. Then I do the numbers inside the [] thing. Then I single out the </a>. So far so good.
At the end, I do a (.)*?(<). This is the turning point. By leaving the last bit, (<), like that, the text will be interrupted when an underline or italics tag is found. However, if I put (<a|</p) the resulting array ends up empty. I've tried changing that to only (<a), but it seems that 2 characters mess up the whole thing.
What can I do? I've been struggling with this all day.
PHP Tidy is your friend. Don't use regexes.
Something like /">\[(.*)\](.*)(?:<(?:a|\/p))/ seems to work fine given your example and description. Perhaps adding non-capturing subpatterns does it? Please provide a counterexample wherein this doesn't work for you.
Though I agree that RegEx isn't a parser, it sounds like what you're looking for is part of a regularly behaved string - which is exactly what RegEx is strong at.
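To make that concrete, here is a small sketch along the same lines. The two sample lines are invented to fit the description in the question, and the pattern is a variant of the one above that uses lazy quantifiers and a lookahead, so treat it as an assumption about the real markup rather than a tested solution.

```php
<?php
// Hypothetical lines in the shape the question describes.
$docString = '<p><a href="meh">[18]</a> blah <i>blah</i> blah <a href="x">next</a></p>' . "\n"
           . '<p><a href="meh">[19]</a> more <u>text</u></p>';

// (\d+) is the number; the lazy (.*?) keeps inline tags and stops at the first <a or </p.
$regex = '~">\[(\d+)\](?:</a>)?\s*(.*?)\s*(?=<a\b|</p)~';

preg_match_all($regex, $docString, $matches, PREG_SET_ORDER);

// Build a number => text index from the captured groups.
$index = [];
foreach ($matches as $m) {
    $index[(int) $m[1]] = $m[2];
}

var_export($index);
```

The <i> and <u> tags survive because the terminator only accepts <a followed by a word boundary, or </p.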
As you've found, using a regex to parse HTML is not very easy. This is because HTML is not particularly regular.
I suggest using an XML parser such as PHP's DomDocument.
Create an object, then use the loadHTMLFile method to open the file. Extract your a tags with getElementsByTagName, and then read each tag's content from its nodeValue property.
It might look like
// Create a DomDocument object
$html = new DOMDocument();
// Load the url's contents into the DOM
$html->loadHTMLFile("http://whatever.com/some.htm");
// make an array to hold the text
$anchors = array();
//Loop through the a tags and store them in an array
foreach($html->getElementsByTagName('a') as $link) {
$anchors[] = $link->nodeValue;
}
One alternative to this style of XML/HTML parser is phpquery. The documentation on their page should do a good job of explaining how to extract the tags. If you know jQuery, the interface may seem more natural.
I'm trying to extract the mileage value from different eBay pages, but I'm stuck because there seem to be too many patterns, as the pages are all a bit different. Therefore I would like to know if you can help me with a better pattern.
Some examples of items are the following:
http://cgi.ebay.com/ebaymotors/1971-Chevy-C10-Shortbed-Truck-/250647101696?cmd=ViewItem&pt=US_Cars_Trucks&hash=item3a5bbb4100
http://cgi.ebay.com/ebaymotors/1987-HANDICAP-LEISURE-VAN-W-WHEEL-CHAIR-LIFT-/250647101712?cmd=ViewItem&pt=US_Cars_Trucks&hash=item3a5bbb4110
http://cgi.ebay.com/ebaymotors/ws/eBayISAPI.dll?ViewItemNext&item=250647101696
Please see the patterns at the following link (I still cannot figure out how to escape the HTML here):
http://pastebin.com/zk4HAY3T
However, they are not enough, as new patterns still seem to turn up.
Don't use regular expressions to parse HTML. Even for a relatively simple thing such as this, regular expressions make you highly dependent on the exact markup.
You can use DOMDocument and XPath to grab the value nicely, and it's somewhat more resilient to changes in the page:
$doc = new DOMDocument();
@$doc->loadHtmlFile($url);
$xpath = new DOMXpath($doc);
foreach ($xpath->query('//th[contains(., "Mileage")]/following-sibling::td') as $td) {
var_dump($td->textContent);
}
The XPath query searches for a <th> which contains the word "Mileage", then selects the <td>s following it.
You can then lop off the miles suffix and get rid of commas using str_replace or substr.
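That clean-up step might look roughly like this. The $raw value is a made-up example of what textContent could return, so adjust it to whatever the real cell contains.

```php
<?php
// Hypothetical cell text, as $td->textContent might return it.
$raw = " 123,456 miles ";

// Drop the thousands separators first.
$clean = str_replace(',', '', trim($raw));

// The (int) cast reads the leading digits and ignores the " miles" suffix.
$mileage = (int) $clean;

var_dump($mileage);
```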
This should be a bit more generic - it doesn't care what's inside the html tags. It works on all three of the links you provided.
/Mileage[^<]*<[^>]*><[^>]*>(.*?)\s*miles/i
Of course, there could be better ways depending on what other constraints you have, but this is a good starting point.
Recognizing the duplication there, you could simplify (logically, at least) a bit more:
/Mileage[^<]*(?:<[^>]*>){2}(.*?)\s*miles/i
You're looking for two html tags in a row between the words 'Mileage' and 'miles'. That's the (?:<[^>]*>){2} part. The ?: tells it not to remember that sequence, so that $matches[1] still contains the number you're looking for, and the {2} indicates that you want to match the previous sequence exactly twice.
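A quick way to try the simplified pattern is with preg_match. The HTML fragment here is invented to match the "label, two tags, value" shape, so it is only a sketch of how the real page might look.

```php
<?php
// Hypothetical fragment: the label, then exactly two tags, then the value.
$html = '<th>Mileage:</th><td>123,456 miles</td>';

$pattern = '/Mileage[^<]*(?:<[^>]*>){2}(.*?)\s*miles/i';

$value = null;
if (preg_match($pattern, $html, $matches)) {
    // Group 1 is the only capturing group, because (?:...) does not capture.
    $value = $matches[1];
}

var_dump($value);
```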
I have an html page loaded into a PHP variable and am using str_replace to change certain words with other words. The only problem is that if one of these words appears in an important piece of code then the whole thing falls to bits.
Is there any way to only apply the str_replace function to certain html tags? Particularly: p,h1,h2,h3,h4,h5
EDIT:
The bit of code that matters:
$yay = str_ireplace($find, $replace , $html);
cheers and thanks in advance for any answers.
EDIT - FURTHER CLARIFICATION:
$find and $replace are arrays containing words to be found and replaced (respectively). $html is the string containing all the html code.
A good example of it falling to bits would be if I were to find and replace a word that occurred in, e.g., the domain name. So if I wanted to replace the word 'hat' with 'cheese', any occurrence of an absolute path like
www.worldofhat.com/images/monkey.jpg
would be replaced with:
www.worldofcheese.com/images/monkey.jpg
So if the replacements could only occur in certain tags, this could be avoided.
Do not treat the HTML document as a mere string. Like you already noticed, tags/elements (and how they are nested) have meaning in an HTML page and thus, you want to use a tool that knows what to make of an HTML document. This would be DOM then:
Here is an example. First some HTML to work with
$html = <<< HTML
<body>
<h1>Germany reached the semi finals!!!</h1>
<h2>Germany reached the semi finals!!!</h2>
<h3>Germany reached the semi finals!!!</h3>
<h4>Germany reached the semi finals!!!</h4>
<h5>Germany reached the semi finals!!!</h5>
<p>Fans in Germany are totally excited over their team's 4:0 win today</p>
</body>
HTML;
And here is the actual code you would need to make Argentina happy
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//*[self::h1 or self::h2 or self::p]');
foreach( $nodes as $node ) {
$node->nodeValue = str_replace('Germany', 'Argentina', $node->nodeValue);
}
echo $dom->saveHTML();
Just add the tags whose content you want to replace to the XPath query. An alternative to using XPath would be to use DOMDocument::getElementsByTagName, which you might know from JavaScript:
$nodes = $dom->getElementsByTagName('h1');
In fact, if you know it from JavaScript, you might know a lot more of it, because DOM is actually a language agnostic API defined by the W3C and implemented in many languages. The advantage of XPath over getElementsByTagName is obviously that you can query multiple nodes in one go. The drawback is, you have to know XPath :)
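One variation worth sketching, since the question was worried about URLs being mangled: instead of assigning to the element's nodeValue (which flattens any child elements into plain text), you can select only the text nodes inside the wanted tags. The sample markup and the hat/cheese words come from the question; the query and variable names are my own assumptions.

```php
<?php
// Hypothetical markup: the href and the <div> must stay untouched.
$html = '<body><h1>Old hat</h1><p>A <a href="http://www.worldofhat.com/">hat</a> shop</p><div>hat</div></body>';

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select text nodes anywhere inside the wanted tags; attributes such as href are never visited.
$query = '//*[self::p or self::h1 or self::h2 or self::h3 or self::h4 or self::h5]//text()';
foreach ($xpath->query($query) as $textNode) {
    $textNode->nodeValue = str_ireplace('hat', 'cheese', $textNode->nodeValue);
}

$result = $dom->saveHTML();
echo $result;
```

Note that str_ireplace still matches substrings, so a word like "that" would also be changed. A word-boundary regex would be needed to avoid that.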