I'm trying to extract the mileage value from different eBay pages, but I'm stuck: the pages differ slightly, so there seem to be too many patterns. Can you help me with a better pattern?
Some examples of items are the following :
http://cgi.ebay.com/ebaymotors/1971-Chevy-C10-Shortbed-Truck-/250647101696?cmd=ViewItem&pt=US_Cars_Trucks&hash=item3a5bbb4100
http://cgi.ebay.com/ebaymotors/1987-HANDICAP-LEISURE-VAN-W-WHEEL-CHAIR-LIFT-/250647101712?cmd=ViewItem&pt=US_Cars_Trucks&hash=item3a5bbb4110
http://cgi.ebay.com/ebaymotors/ws/eBayISAPI.dll?ViewItemNext&item=250647101696
Please see the patterns at the following link (I still cannot figure out how to escape the HTML here):
http://pastebin.com/zk4HAY3T
However, they are not enough, as new patterns still seem to turn up.
Don't use regular expressions to parse HTML. Even for a relatively simple thing such as this, regular expressions make you highly dependent on the exact markup.
You can use DOMDocument and XPath to grab the value nicely, and it's somewhat more resilient to changes in the page:
$doc = new DOMDocument();
@$doc->loadHTMLFile($url); // @ silences warnings about invalid markup
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//th[contains(., "Mileage")]/following-sibling::td') as $td) {
    var_dump($td->textContent);
}
The XPath query searches for a <th> which contains the word "Mileage", then selects the <td>s following it.
You can then lop off the miles suffix and get rid of commas using str_replace or substr.
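A minimal sketch of that clean-up step, assuming the cell text looks like "123,456 miles". The function name `mileageToInt` is made up for this example.

```php
<?php
// Hypothetical helper: turn text such as "123,456 miles" into an integer.
function mileageToInt(string $text): ?int
{
    // Remove the "miles" suffix (case-insensitively) and thousands separators.
    $clean = trim(str_ireplace(['miles', ','], '', $text));

    // Only accept the result if nothing but digits is left.
    return preg_match('/^\d+$/', $clean) === 1 ? (int) $clean : null;
}

var_dump(mileageToInt('123,456 miles')); // int(123456)
var_dump(mileageToInt('n/a'));           // NULL
```

Returning null for anything that is not purely numeric makes it easy to spot pages whose markup did not match.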
This should be a bit more generic - it doesn't care what's inside the html tags. It works on all three of the links you provided.
/Mileage[^<]*<[^>]*><[^>]*>(.*?)\s*miles/i
Of course, there could be better ways depending on what other constraints you have, but this is a good starting point.
Recognizing the duplication there, you could simplify (logically, at least) a bit more:
/Mileage[^<]*(?:<[^>]*>){2}(.*?)\s*miles/i
You're looking for two html tags in a row between the words 'Mileage' and 'miles'. That's the (?:<[^>]*>){2} part. The ?: tells it not to remember that sequence, so that $matches[1] still contains the number you're looking for, and the {2} indicates that you want to match the previous sequence exactly twice.
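As a rough illustration, here is the pattern run through preg_match. The table-row markup in `$html` is an assumption made for the example; the real eBay pages may look different.

```php
<?php
// Assumed markup; the real pages may differ.
$html = '<tr><th>Mileage:</th><td>123,456 miles</td></tr>';

$pattern = '/Mileage[^<]*(?:<[^>]*>){2}(.*?)\s*miles/i';

$mileage = null;
if (preg_match($pattern, $html, $matches)) {
    // $matches[1] is the only capturing group: the raw mileage text.
    $mileage = (int) str_replace(',', '', $matches[1]);
}

var_dump($mileage); // int(123456)
```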
Related
I am working on moving some blog-ish articles to a new third-party home, and need to replace some existing URLs with new ones. I cannot use XML, and am being forced to use a wrapper class that requires this search to happen in regex. I'm currently having trouble regex-ing for the URLs that exist in the html. For example if the html is:
<h1><a href="http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345">Whatever</a></h1>
I need my regex to return:
http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345
The beginning part of the URL never changes (the "http://www.website.com/article/" part). However, I have no clue what the slug phrases are going to be, but I do know they will contain an unknown number of hyphens between the words. The ID number at the end of the URL could be any integer.
There are multiple links of these types in each article, and there are also other types of URLs in the article that I want to be sure are ignored, so I can't just look for phrases starting with http inside of quotes.
FWIW: I'm working in php and am currently trying to use preg_match_all to return an array of the URLs needed
Here's my latest attempt:
$array_of_urls = [];
preg_match_all('/http:\/\/www\.website\.com\/article\/[^"]*/', $variable_with_html, $array_of_urls);
var_dump($array_of_urls);
And then I get nada dumped out. Any help appreciated!!!
We, StackOverflow volunteers, must insist on enjoying the stability of a dom parser rather than regex when parsing html data.
Code:
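$html=<<<HTML
<h1><a href="http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345">Whatever</a></h1>
<p>Here is a url as plain text: http://www.website.com/article/sluggy-slug</p>
<div>Here is a qualifying link: <a href="http://www.website.com/article/slugger-sluggington-jr/666">Whatever</a></div>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
$output = [];
foreach ($dom->getElementsByTagName('a') as $item) {
    $output[] = $item->getAttribute('href');
}
var_export($output);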
Output:
array (
0 => 'http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345',
1 => 'http://www.website.com/article/slugger-sluggington-jr/666',
)
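The loop above collects every href in the document. Since the question mentions other kinds of links that must be ignored, the hrefs could be filtered afterwards. The following is only a sketch: the sample markup, the "about" link, and the exact shape of `$articlePattern` (slug of letters, digits and hyphens, then a numeric ID) are assumptions based on the question's description.

```php
<?php
// Assumed sample: two article links plus one link that should be ignored.
$html = '<h1><a href="http://www.website.com/article/slug-that-has-undetermined-amount-of-hyphens/12345">Whatever</a></h1>'
      . '<p><a href="http://www.website.com/about">Ignored</a></p>'
      . '<div><a href="http://www.website.com/article/slugger-sluggington-jr/666">Whatever</a></div>';

// Fixed prefix, a slug of letters, digits and hyphens, then a numeric ID.
$articlePattern = '~^http://www\.website\.com/article/[A-Za-z0-9-]+/\d+$~';

libxml_use_internal_errors(true); // tolerate imperfect markup
$dom = new DOMDocument();
$dom->loadHTML($html);

$articleUrls = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    // The regex only has to validate a single URL, not parse any HTML.
    if (preg_match($articlePattern, $href)) {
        $articleUrls[] = $href;
    }
}

var_export($articleUrls);
```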
If for some crazy reason, the above doesn't work for your project and you MUST use regex, this should suffice:
~<a.*?href="\K[^"]+~i // using case-insensitive flag in case of all-caps syntax
I am programmatically cleaning up some basic grammar in comments and other user submitted content. Capitalizing I, the first letter of sentence, etc. The comments and content are mixed with HTML as users have some options in formatting their text.
This is actually proving to be a bit more challenging than expected, especially for someone new to PHP and regex.
Is there a function like ucfirst that will ignore HTML to help capitalize sentences?
Also, any links or tutorials on cleaning up text like this in html, would be appreciated. Please leave anything you feel would help in the comments. thanks!
EDIT:
Sample Text:
<div><p>i wuz walkin thru the PaRK and found <strong>ur dog</strong>. <br />i hoPe to get a reward.<br /> plz call or text 7zero4 8two8 49 sevenseven</div>
I need for it to be (ultimately)
<div><p>I was walking through the park and found <strong>your dog<strong>. <p>I hope to get a reward.</p><p> Please call or text (704) 828-4977.</p>
I know this is going a little farther than the intended question, but my thought was to do this incrementally. ucfirst() is just one of many functions I was using to do one small cleanup at a time per scan. Even if I had to run the text 100 times through the filter, this runs on a cron run when the site has no traffic. I wish there was a discussion forum where this could continue as obviously there would be some great ideas on continuing the approach. Any thoughts on how to approach this as an overall project by all means please leave a comment.
I guess in the spirit of the question itself. ucfirst then would not be the best function for this as it could not take an argument list of things to ignore. A flag IGNORE_HTML would be great!
Given this is a PHP question, then the DOM parser recommended below sounds like the best answer? Thoughts?
You can also add a CSS pseudo-element to your desired elements like this:
div:first-letter {
text-transform: uppercase;
}
But you will probably need to change the way you print out your sentences (if you are printing them all in one huge tag), since CSS lacks the ability to detect the start of a new sentence inside a single tag :(
You should probably use a DOM parser (either the built-in one or for example this one, which is really easy to use).
Walk through all of the text nodes in your HTML and perform the clean-up with preg_replace_callback, ucfirst and a regular expression like this one:
'/(\s*)([^.?!]*)/'
This will match a string of whitespace, and then as many non-sentence-ending-punctuation characters as possible. The leading whitespace ends up in the first capturing group, and the actual sentence (starting with a letter, unless your sentence starts with ", which complicates things a bit) will be found in the second capturing group.
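A small sketch of how that pattern might be combined with preg_replace_callback and ucfirst on a plain-text string, such as the content of a single text node. The function name `capitalizeSentences` is hypothetical.

```php
<?php
// Hypothetical helper for a plain-text string (for example one text node).
function capitalizeSentences(string $text): string
{
    return preg_replace_callback(
        '/(\s*)([^.?!]*)/',
        function (array $m): string {
            // $m[1] is the leading whitespace, $m[2] the sentence body.
            return $m[1] . ucfirst($m[2]);
        },
        $text
    );
}

echo capitalizeSentences('i found your dog. i hope to get a reward! please call.');
```

Note that this treats the start of every string as the start of a sentence, so a text node that begins mid-sentence (for example inside a <strong> tag) would be capitalized as well.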
But from your question, I suppose you are already doing something like the latter and your code is just choking on HTML tags. Here is some example code to get all text nodes with the second DOM parser I linked:
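require 'simple_html_dom.php';
$html = new simple_html_dom();
$html->load($fullHtmlStr);
foreach ($html->find('text') as $textNode) {
    // Write the cleaned text back into the node; reassigning $textNode alone changes nothing
    $textNode->innertext = cleanupFunction($textNode->innertext);
}
$cleanedHtmlStr = $html->save();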
In HTML it will be very difficult to do, as you would be building some kind of HTML parser. My suggestion would be to clean up the text before it is transformed into HTML, at the moment you pull it out of the database. Or even better, clean up the database once.
This should do it:
function html_ucfirst($s) {
return preg_replace_callback('#^((<(.+?)>)*)(.*?)$#', function ($c) {
return $c[1].ucfirst(array_pop($c));
}, $s);
}
Converts
<b>foo</b> to <b>Foo</b>,
<div><p>test</p></div> to <div><p>Test</p></div>,
but also bar to Bar.
Edit: According to your detailed question, you probably want to apply this function to each sentence. You will have to parse the text first (e.g. splitting by periods).
I need help with a REGEX that will find a link that comes in different formats based on how it got inserted to the HTML page.
I am able to read the pages into PHP. I just can't come up with the right regex to find the URLs and isolate them.
I have a few examples of how they get inserted. Sometimes they are plain-text links, and some have anchor tags wrapped around them. There is even the odd occasion where text that is not part of the link gets inserted without spacing.
Both Article ID and Article Key are never the same. Article Key however always ends with a numeric. If this is possible I sure could use the help. Thanks
Here are a few examples.
http://www.example.com/ArticleDetails.aspx?ArticleID=3D10045411&AidKey=3D-2086622941
http://example.com/ArticleDetails.aspx?ArticleID=10919199&AidKey=1956996566
http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&AidKey=1998267392
This is a link description
http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736this is not part of the url.
In the end I am just looking for the URL.
http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736
DO NOT USE A REGEX! Use an XML parser...
$dom = new DOMDocument();
@$dom->loadHTMLFile($pathToFile); // @ silences warnings about invalid markup
$finder = new DOMXPath($dom);
$anchors = $finder->query('//a[@href]');
foreach ($anchors as $anchor) {
    $href = $anchor->getAttribute('href');
    if (preg_match($regexToMatchUrls, $href)) {
        // do stuff
    }
}
So $regexToMatchUrls would be a regex just to match the URLs you are looking for, not any of the HTML, which is much simpler. Then you can take action when a match occurs.
This regex works for me:
/http:\/\/(www\.)?example\.com\/ArticleDetails.aspx\?ArticleID=(.*?)(\&|\&amp;)AidKey=([\d\w-]*)/g
UPDATE:
I added a \d at the end of the regex.
/http:\/\/(www\.)?example\.com\/ArticleDetails.aspx\?ArticleID=(.*?)(\&|\&amp;)AidKey=([\d\w-]*)\d/g
To use it in PHP you need /.../msi
PHP Example in action: http://ideone.com/N0TKM
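For reference, a sketch of how the updated pattern might be used with preg_match_all. This is an adaptation, not the code behind the link: it uses ~ as the delimiter to avoid escaping slashes, escapes the dot in "ArticleDetails.aspx", and accepts both & and &amp; between the two parameters. The sample text is taken from the question.

```php
<?php
// Sample text adapted from the question.
$text = 'http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&AidKey=1998267392' . "\n"
      . 'http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736this is not part of the url.';

// The trailing \d forces the match to end on a digit, so text glued onto
// the end of the AidKey is backtracked away.
$pattern = '~http://(www\.)?example\.com/ArticleDetails\.aspx\?ArticleID=(.*?)(&amp;|&)AidKey=([\w-]*)\d~i';

preg_match_all($pattern, $text, $matches);

// $matches[0] holds the full URLs.
print_r($matches[0]);
```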
I'm writing an application for my client that uses a WYSIWYG to allow employees to modify a letter template with certain variables that get parsed out to be information for the customer that the letter is written for.
The WYSIWYG generates HTML that I save to a SQL server database. I then use a PHP class to generate a PDF document with the template text.
Here's my issue. The PDF generation class can translate b,u,i HTML tags. That's it. This is mostly okay, except I need blockquote to be translated too. I figure the best solution would be to write a regex statement that is to take the contents of each blockquote HTML block, and replace each line within the block with five spaces. The trick is that some blockquotes might contain nested blockquotes (double indenting, and what not)
But unfortunately I have never been too well versed with regex, and I spent the last 1.5 hours experimenting with different patterns and got nothing working.
Here are the gotchas:
String may or may not contain a blockquote block
String could contain multiple blockquotes
String could contain potentially any level of nesting of blockquotes blocks
We can rely on the HTML being properly formed
A sample input string would look something like this:
Dear Charlie,<br><br>We are contacting you because blah blah blah blah.<br><br><br>To login, please use this information:<blockquote>Username: someUsername<br>Password: somePassword</blockquote><br><br>Thank you.
To simplify the solution, I need to replace each HTML break inside each blockquote with 5 spaces and then the \n line break character.
You might want to check PHP Simple HTML DOM Parser out. You can use it to parse the input to an HTML DOM tree and use that.
~<blockquote>((?:[^<]*+(?:<(?!/?blockquote>)|(?R))*+)*+)</blockquote>~
You will need to run this regex recursively using preg_replace_callback:
const REGEX_BLOCKQUOTE = '~<blockquote>((?:[^<]*+(?:<(?!/?blockquote>)|(?R))*+)*+)</blockquote>~';
function blockquoteCallback($matches) {
    // doIndent() is not defined here; it should indent the contents of one blockquote
    return doIndent(preg_replace_callback(REGEX_BLOCKQUOTE, __FUNCTION__, $matches[1]));
}
$output = preg_replace_callback(REGEX_BLOCKQUOTE, 'blockquoteCallback', $input);
My regex assumes that there won't be any attributes on the blockquote or anywhere else.
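To make the idea concrete, here is a self-contained sketch. It is not the answer's exact code: the function name `indentBlockquotes` is made up, the indenting logic is one possible reading of the question (five spaces per nesting level, <br> turned into "\n"), and the pattern is a variant in which `<(?!/?blockquote>)` allows other tags such as <br> inside a blockquote.

```php
<?php
// Hypothetical helper: indents the contents of every <blockquote> by five
// spaces per nesting level and turns <br> inside them into "\n".
function indentBlockquotes(string $html): string
{
    // Recursive pattern; assumes attribute-free <blockquote> tags.
    $pattern = '~<blockquote>((?:[^<]*+(?:<(?!/?blockquote>)|(?R))*+)*+)</blockquote>~';

    return preg_replace_callback($pattern, function (array $m): string {
        // Handle nested blockquotes first, so their lines are already indented.
        $inner = indentBlockquotes($m[1]);
        $lines = preg_split('~<br\s*/?>|\n~i', $inner);
        $indented = array_map(function (string $line): string {
            return '     ' . $line;
        }, $lines);
        return "\n" . implode("\n", $indented) . "\n";
    }, $html);
}

$input = 'To login:<blockquote>Username: someUsername<br>Password: somePassword</blockquote>Thank you.';
echo indentBlockquotes($input);
```

Because the inner call runs before the outer lines are indented, a blockquote nested two levels deep ends up with ten leading spaces.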
(PS: I'll leave the "Use a DOM parser" comment to someone else.)
Regular expressions have a theory behind them, and even though modern regular expression engines can provide a 'Type - 2.5' level language, some things are still not doable. In your particular case, nesting is not easily achievable.
A simple way to explain this is to say that a regular expression can't keep a count,
i.e. it can't count the nesting level.
What you need is a limited CFG (the paren-counting type).
You need to keep a count somehow, maybe with a stack or a tree.
I am trying to index some content from a series of .html's that share the same format.
So I get a lot of lines like this: <a href="meh">[18] blah blah blah < a...
And the idea is to extract the number (18) and the text next to it (blah...). Furthermore, I know that every qualifying line will start with "> and end with either <a or </p. The issue stems from the need to keep all other HTML tags as part of the text (<i>, <u>, etc.).
So then I have something like this:
$docString = file_get_contents("http://whatever.com/some.htm");
$regex="/\">\ [(.*?)\ ] (<\/a>)(.) *?(<)/";
preg_match_all($regex,$docString,$match);
Let's look at $regex for a sec. Ignore its spaces; I only put them there because otherwise some characters disappear. I specify that it will start with ">. Then I do the numbers inside the [] thing. Then I single out the </a>. So far so good.
At the end, I do a (.)*?(<). This is the turning point. By leaving the last bit, (<), like that, the text is cut off when an underline or italics tag is found. However, if I put (<a|</p), the resulting array ends up empty. I've tried changing that to only (<a), but it seems that two characters mess up the whole thing.
What can I do? I've been struggling with this all day.
PHP Tidy is your friend. Don't use regexes.
Something like /">\[(.*)\](.*)(?:<(?:a|\/p))/ seems to work fine given your example and description. Perhaps adding non-capturing subpatterns does it? Please provide a counterexample where this doesn't work for you.
Though I agree that RegEx isn't a parser, it sounds like what you're looking for is part of a regularly behaved string - which is exactly what RegEx is strong at.
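As a hedged illustration, here is a lazy variant of that idea run through preg_match_all. The input line in `$docString` is invented to match the question's description (number in brackets, closing </a>, text with inline tags, then <a or </p), and the pattern differs from the one above in that it skips the </a> and stops at the first <a or </p via a lookahead.

```php
<?php
// Hypothetical input modelled on the question's description.
$docString = '<p><a href="meh">[18]</a> blah <i>blah</i> blah <a href="x">[19]</a> more <u>text</u></p>';

// "> then [number], the closing </a>, then lazily everything up to the
// next <a or </p (checked with a lookahead, so it is not consumed).
$regex = '~">\[(\d+)\]</a>(.*?)(?=<a|</p)~s';

preg_match_all($regex, $docString, $matches, PREG_SET_ORDER);

foreach ($matches as $m) {
    // $m[1] is the number, $m[2] the text with its inline tags intact.
    echo $m[1], ' => ', trim($m[2]), "\n";
}
```

The inline <i> and <u> tags stay in the captured text because only <a and </p end a match.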
As you've found, using a regex to parse HTML is not very easy. This is because HTML is not particularly regular.
I suggest using an XML parser such as PHP's DOMDocument.
Create an object, then use the loadHTMLFile method to open the file. Extract your a tags with getElementsByTagName, and then read each tag's content from its nodeValue property.
It might look like
// Create a DomDocument object
$html = new DOMDocument();
// Load the url's contents into the DOM
$html->loadHTMLFile("http://whatever.com/some.htm");
// make an array to hold the text
$anchors = array();
//Loop through the a tags and store them in an array
foreach($html->getElementsByTagName('a') as $link) {
$anchors[] = $link->nodeValue;
}
One alternative to this style of XML/HTML parser is phpquery. The documentation on their page should do a good job of explaining how to extract the tags. If you know jQuery, the interface may seem more natural.