PHP scraper - regular expressions


I'm trying to follow a tutorial for web scraping with PHP.
I understand roughly what's going on, but I don't get how to filter what has been scraped to get exactly what I want. For example:
<?php
$file_string = file_get_contents('page_to_scrape.html');
preg_match('/<title>(.*)<\/title>/i', $file_string, $title);
$title_out = $title[1];
?>
I see that the (.*) will retrieve everything between the title tags. Can I use regular expressions to get specific info? Say the title was "Welcome visitor #100"; how would I get the number that comes after the hash?
Or do I have to retrieve everything between the tags and then manipulate it later?

Given the title "Welcome visitor #100" and the fact a <title> tag occurs no more than once, the expression should be:
preg_match('~<title>Welcome visitor #(\d+)</title>~', ...);
A lot of people on SO would argue to never use regular expressions to parse (X)HTML; for this task, however, the above should suffice.
Although - as mentioned before - a <title> tag (should) occur no more than once, the pattern
<title>(.*)</title>
would as well match this:
<title>Welcome visitor <title>#<title>100blafoobar</title>
(.*) being the part allowing this. As soon as the page you're scraping your data from changes, the regex might stop working.
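As a minimal sketch of the specific pattern above (the $html value is made up for illustration; in practice it would come from file_get_contents), checking the return value of preg_match tells you whether the title still has the expected shape:

```php
<?php
// Hypothetical page content standing in for the scraped file.
$html = '<html><head><title>Welcome visitor #100</title></head><body></body></html>';

// preg_match() returns 1 on a match, 0 on no match, and false on error.
if (preg_match('~<title>Welcome visitor #(\d+)</title>~', $html, $matches) === 1) {
    $visitorNumber = (int) $matches[1]; // the digits after the hash
} else {
    $visitorNumber = null; // the title no longer has the expected shape
}

var_dump($visitorNumber); // int(100)
```

If the site changes the wording of its title, this falls into the else branch instead of silently returning the wrong text.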
EDIT: A method to correctly sift out multiple elements and their attributes:
$dom = new DomDocument;
$dom->loadHTML($page_content);
$elements = $dom->getElementsByTagName('a');
for ($n = 0; $n < $elements->length; $n++) {
$item = $elements->item($n);
$href = $item->getAttribute('href');
}

You would just need to change the regex to match whatever you need. If you are going to use the title more than once, it's better to save the whole thing and manipulate it later; otherwise just get what you need.
/<title>.*((?<=#)\d*).*<\/title>/i
Would specifically match a number after a hash. It would not match a number without a hash.
There are many ways to write a regex; it depends on how general or specific you want to be.
You could also write it like this to get the first number in the title:
/<title>.*?(\d+).*<\/title>/i

I would first fetch the title tag and then process the title further. The other answers contain perfectly valid solutions for this task.
Some further notes:
Please use DOMDocument for such things, since it is much safer (your regular expression might break on some specific HTML pages).
Please use the non-greedy version of .*: .*?, otherwise you will run into funny things like:
<html>
<head>
<title>a</title>
</head>
<body>
<title>test</title> <!-- not allowed in HTML, but since when do web pages actually care about that? -->
</body>
</html>
You will now match everything from the first <title> up to the last </title>, that is, a</title> ... <title>test, including everything in between.
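The difference between the greedy and the non-greedy quantifier can be checked with a short sketch. The HTML is a condensed version of the page above, and the s flag (so that . also matches newlines) is an addition that the original patterns do not use:

```php
<?php
// Same structure as the example page, with a second (invalid) <title> in the body.
$html = "<html><head><title>a</title></head>"
      . "<body><title>test</title></body></html>";

// Greedy: .* runs up to the LAST </title> in the subject.
preg_match('~<title>(.*)</title>~is', $html, $greedy);

// Non-greedy: .*? stops at the FIRST </title>.
preg_match('~<title>(.*?)</title>~is', $html, $lazy);

echo $greedy[1], "\n"; // a</title></head><body><title>test
echo $lazy[1], "\n";   // a
```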

Related

Regular Expression using Preg_Match

I'm using PHP's preg_match function...
How can I fetch the text in between tags? The following attempt doesn't fetch the value: preg_match("/^<title>(.*)<\/title>$/", $originalHTMLBlock, $textFound);
How can I find the first occurrence of the following element and fetch its contents (Bunch of Texts and Tags):
<div id="post_message_">Bunch of Texts and Tags</div>

This is starting to get boring. Regex is likely not the tool of choice for matching languages like HTML, and there are thousands of similar questions on this site to prove it. I'm not going to link to the answer everyone else always links to - do a little search and see for yourself.
That said, your first regex assumes that the <title> tag is the entire input. I suspect that that's not the case. So
preg_match("#<title>(.*?)</title>#", $originalHTMLBlock, $textFound);
has a bit more of a chance of working. Note the lazy quantifier which becomes important if there is more than one <title> tag in your input. Which might be unlikely for <title> but not for <div>.
For your second question, you only have a working chance with regex if you don't have any nested <div> tags inside the one you're looking for. If that's the case, then
preg_match("#<div id=\"post_message_\">(.*?)</div>#", $originalHTMLBlock, $textFound);
might work.
But all in all, you'd better be using an HTML parser.
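For comparison, here is a rough sketch of the parser route using PHP's bundled DOMDocument and DOMXPath. The input string is invented for illustration and deliberately contains a nested <div>, which the lazy regex above would cut short:

```php
<?php
// Hypothetical input standing in for $originalHTMLBlock.
$originalHTMLBlock = '<html><head><title>Forum thread</title></head><body>'
    . '<div id="post_message_">Bunch of <b>Texts</b> and <div>Tags</div></div>'
    . '</body></html>';

libxml_use_internal_errors(true); // collect parse warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($originalHTMLBlock);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// First <title> element, if any.
$titleNode = $xpath->query('//title')->item(0);
$title = $titleNode !== null ? $titleNode->textContent : null;

// First <div> whose id is exactly "post_message_"; nested <div>s are handled by the parser.
$divNode = $xpath->query('//div[@id="post_message_"]')->item(0);
$postText = $divNode !== null ? $divNode->textContent : null;

var_dump($title, $postText);
```

textContent returns the text with the inner tags stripped; if the markup itself is needed, $dom->saveHTML($divNode) returns the element serialized as HTML.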

Use this: <title\b[^>]*>(.*?)</title> (are you sure you need ^ and $ ?)
You can use the same kind of expression, <div\b[^>]*>(.*?)</div>, assuming you don't have a </div> tag in your Bunch of Texts and Tags text. If you do, maybe you should take a look at http://code.google.com/p/phpquery/

Select the first paragraph tag not contained in within another tag using RegEx (Perl-style)

I have this block of html:
<div>
<p>First, nested paragraph</p>
</div>
<p>First, non-nested paragraph.</p>
<p>Second paragraph.</p>
<p>Last paragraph.</p>
I'm trying to select the first, non-nested paragraph in that block. I'm using PHP's (perl style) preg_match to find it, but can't seem to figure out how to ignore the p tag contained within the div.
This is what I have so far, but it selects the contents of the first paragraph contained above.
/<p>(.+?)<\/p>/is
Thanks!
EDIT
Unfortunately, I don't have the luxury of a DOM Parser.
I completely appreciate the suggestions to not use RegEx to parse HTML, but that's not really helping my particular use case. I have a very controlled case where an internal application generated structured text. I'm trying to replace some text if it matches a certain pattern. This is a simplified case where I'm trying to ignore text nested within other text and HTML was the simplest case I could think of to explain. My actual case looks something a little more like this (But a lot more data and minified):
#[BILLINGCODE|12345|11|15|2001|15|26|50]#
[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]
#[{{Additional Details }}]#
[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]
[ITEM3|{{Escaped Description}}|1|1|9431|NONE|15]
[ITEM4|{{Escaped Description}}|1|1|5131|NONE|15]
I have to reformat a certain column of certain rows in a ton of rows similar to that. Help with my first question would help with the actual project.

Your regex won't work. Even if you had only non-nested paragraphs, your capturing parentheses would match First, non-nested ... Last paragraph.
Try:
<([^>]+)>([^<]*<(?!/?\1)[^<]*)*<\1>
and grab \2 if \1 is p.
But an HTML parser would do a better job of that imho.

How about something like this?
<p>([^<>]+)<\/p>(?=(<[^\/]|$))
Does a look-ahead to make sure it is not inside a closing tag; but can be at the end of a string. There is probably a better way to look for what is in the paragraph tags but you need to avoid being too greedy (a .+? will not suffice).

Use a three-step process. First, pray that everything is well formed. Second, remove everything that is nested.
s{<div>.*?</div>}{}gs; # HTML example (/s lets . match newlines)
s/#.*?#//gs; # 2nd example
Then get your result. Everything that is left is now not nested.
($result) = m{<p>(.*?)</p>}s; # HTML example (list context returns the capture)
($result) = m{\[(.*?)\]}; # 2nd example
(this is Perl. Don't know how different it would look in PHP).
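A rough PHP equivalent of the same idea might look like the sketch below. It uses preg_replace to strip the nested parts and preg_match to pick the first remaining item, and it assumes, as the Perl version does, that the nested blocks do not themselves contain further nesting. The sample strings are shortened from the question:

```php
<?php
$html = "<div>\n<p>First, nested paragraph</p>\n</div>\n"
      . "<p>First, non-nested paragraph.</p>\n"
      . "<p>Second paragraph.</p>\n"
      . "<p>Last paragraph.</p>";

// Remove everything that is nested (the s flag lets . match newlines).
$flat = preg_replace('~<div>.*?</div>~s', '', $html);

// The first remaining <p> is the first non-nested one.
$result = preg_match('~<p>(.*?)</p>~s', $flat, $m) === 1 ? $m[1] : null;
echo $result, "\n"; // First, non-nested paragraph.

// Same approach for the second example: drop the #...# blocks, then take the first [...] row.
$data = "#[BILLINGCODE|12345|11|15|2001|15|26|50]#\n"
      . "[ITEM1|{{Escaped Description}}|1|1|4031|NONE|15]\n"
      . "#[{{Additional Details }}]#\n"
      . "[ITEM2|{{Escaped Description}}|3|1|7331|NONE|15]";

$rows = preg_replace('~#.*?#~s', '', $data);
$firstRow = preg_match('~\[(.*?)\]~', $rows, $m2) === 1 ? $m2[1] : null;
echo $firstRow, "\n"; // ITEM1|{{Escaped Description}}|1|1|4031|NONE|15
```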

"You shouldn't use regex to parse HTML."
It is what everybody says but nobody really offers an example of how to actually do it, they just preach it. Well, thanks to some motivation from Levi Morrison I decided to read into DomDocument and figure out how to do it.
To everybody that says "Oh, it is too hard to learn the parser, I'll just use regex." Well, I've never done anything with DomDocument or XPath before and this took me 10 minutes. Go read the docs on DomDocument and parse HTML the way you're supposed to.
$myHtml = <<<MARKUP
<html>
<head>
<title>something</title></head>
<body>
<div>
<p>not valid</p>
</div>
<p>is valid</p>
<p>is not valid</p>
<p>is not valid either</p>
<div>
<p>definitely not valid</p>
</div>
</body>
</html>
MARKUP;
$DomDocument = new DOMDocument();
$DomDocument->loadHTML($myHtml);
$DomXPath = new DOMXPath($DomDocument);
$nodeList = $DomXPath->query('/html/body/p'); // <p> elements that are direct children of <body>
$yourNode = $DomDocument->saveHtml($nodeList->item(0));
var_dump($yourNode);
// output '<p>is valid</p>'

You might want to have a look at this post about parsing HTML with Regex.
Because HTML is not a regular language (and regular expressions describe only regular languages), you can't parse out arbitrary chunks of HTML using regex. Use an HTML parser; it'll get the job done considerably more smoothly than trying to hack together some regex.

Count start and end html tags

I'm looking for a way to count HTML tags in a chunk of HTML using PHP. This may not be a full web page with a doctype, body tags, etc.
For example:
If I had something like this
$string = "
<div></div>
<div style='blah'></div>
<p>hello</p>
<p>its debbie mcgee
<p class='pants'>missing p above</p>
<div></div>";
I want to pass it to a function with a tag name such as
CheckHtml( $string, 'p' );
and I would like it to tell me the number of open <p> tags and the number of close p tags </p>. I don't want it to do anything fancy beyond that (no sneaky trying to fix it).
I have tried string counts with start tags such as <p, but that too easily matches other tags that begin with the same characters and returns wrong results.
I had a look at DOMDocument, but it doesn't seem to count close tags and always expects <html> tags (although I could work around this).
Any suggestions on what to use?

To get an accurate count, you can't use string matching or regex, because of the well-known problems of parsing HTML with regex.
Nor can you use the output of a standard parser, because that's a DOM consisting of elements and all the information about the tags that were in the HTML has been discarded. End tags will be inferred even for valid HTML, and even some start tags (e.g. html, head, body, tbody) can be inferred. Moreover things like the adoption agency algorithm can result in there being more elements than there were tags in the HTML mark-up. For example <b><i></b>x</i> will result in there being two i elements in the DOM. At the same time, end tags that can't be matched with start tags are simply discarded, as indeed can start and end tags that appear in the wrong place. (e.g. <caption> not in <table> or <legend> not in <fieldset>)
The only way I can think you could do this in any way reliably is this:
There's an open source PHP library for parsing HTML called html5lib.
In there, there's a file called Tokenizer.php and at the end of that file there's a function called emitToken. At this point, the parser has done all the work of figuring out all the HTML weirdnesses¹, and the $token parameter contains all the information about what kind of token has been recognised, including start and end tags.
You could take the library and modify it so that it counts up the start and end tag tokens at that point, and then exposes those totals to your application code at the end of the parse process.
¹: That is, it's figured out the weirdnesses related to your counting problem. It hasn't begun to figure out the tree construction weirdnesses.

You can use substr_count() to return the number of times the needle substring occurs in the haystack $string.
$open_tag_count = substr_count( $string, '<p' );
$close_tag_count = substr_count( $string, '</p>' );
Be aware that '<p' also matches other tags such as <param and <pre, so you may need to modify your search to handle two specific cases:
$open_tag_count_without_attributes = substr_count( $string, '<p>' );
$open_tag_count_with_attributes = substr_count( $string, '<p ' );
$open_tag_count = $open_tag_count_without_attributes + $open_tag_count_with_attributes;
You may also wish to consider using preg_match_all(). Using a regular expression to parse HTML comes with a fairly substantial set of pitfalls, so use with caution.

substr_count seems like a good bet.
EDIT: You'll have to use preg_match_all then.
I haven't tested this, but for an idea:
function checkHTML($string, $htmlTag) {
    // preg_match_all() returns the number of matches; preg_match() would stop at the first one
    $openTags = preg_match_all('/<' . preg_quote($htmlTag, '/') . '\b[^>]*>/i', $string);
    $closeTags = preg_match_all('/<\/' . preg_quote($htmlTag, '/') . '\s*>/i', $string);
    return array($openTags, $closeTags);
}
$numberOfParagraphTags = checkHTML($string, 'p');
echo('Open Tags:' . $numberOfParagraphTags[0] . ' Close Tags:' . $numberOfParagraphTags[1]);

For the chunk of HTML, try using the DOMDocument PHP class instead of string functions. You can then use methods such as getElementsByTagName() that let you count the tags more easily and more accurately. To load your string into a DOMDocument, you could do something like this:
$doc = new DOMDocument();
$doc->loadHTML($string);
Then, to count your tags, do the following:
$tagList = $doc->getElementsByTagName($tag);
return $tagList->length;
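Put together as a self-contained sketch (the function name countElements is hypothetical), this looks roughly as follows. Note that it counts elements in the parsed tree, not tags in the source, so it returns a single number per tag name and cannot report the missing </p> from the question:

```php
<?php
// Hypothetical helper: counts ELEMENTS in the parsed tree, not tags in the source text.
function countElements(string $html, string $tag): int
{
    libxml_use_internal_errors(true); // fragments and sloppy markup trigger warnings
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    return $doc->getElementsByTagName($tag)->length;
}

// The sample string from the question.
$string = "
<div></div>
<div style='blah'></div>
<p>hello</p>
<p>its debbie mcgee
<p class='pants'>missing p above</p>
<div></div>";

echo countElements($string, 'p'), "\n";
echo countElements($string, 'div'), "\n";
```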

PHP preg_match to find and locate a dynamic URL from HTML Pages

I need help with a regex that will find a link that comes in different formats depending on how it was inserted into the HTML page.
I am able to read the pages into PHP; I'm just not able to write the right regex to find the URLs and isolate them.
I have a few examples of how they get inserted. Sometimes they are plain-text links, and sometimes they have anchor tags wrapped around them. There is even the odd occasion where text that is not part of the link gets inserted without spacing.
The Article ID and Article Key are never the same. The Article Key, however, always ends with a digit. If this is possible I sure could use the help. Thanks.
Here are a few examples.
http://www.example.com/ArticleDetails.aspx?ArticleID=3D10045411&AidKey=3D-2086622941
http://example.com/ArticleDetails.aspx?ArticleID=10919199&AidKey=1956996566
http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&AidKey=1998267392
This is a link description
http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736this is not part of the url.
In the end I am just looking for the URL.
http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736

DO NOT USE A REGEX! Use an XML parser...
$dom = new DOMDocument();
$dom->loadHTMLFile($pathToFile);
$finder = new DOMXpath($dom);
$anchors = $finder->query('//a[@href]');
foreach($anchors as $anchor){
$href = $anchor->getAttribute('href');
if(preg_match($regexToMatchUrls, $href)){
//do stuff
}
}
So $regexToMatchUrls would be a regex just to match the URLs you are looking for, not any of the HTML, which is much simpler. You can then take action when a match occurs.

This regex works for me:
/http:\/\/(www\.)?example\.com\/ArticleDetails\.aspx\?ArticleID=(.*?)(&|&amp;)AidKey=([\d\w-]*)/g
UPDATE:
I added a \d at the end of the regex.
/http:\/\/(www\.)?example\.com\/ArticleDetails\.aspx\?ArticleID=(.*?)(&|&amp;)AidKey=([\d\w-]*)\d/g
To use it in PHP you need /.../msi
PHP Example in action: http://ideone.com/N0TKM
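As a hedged sketch of how this could be run in PHP with preg_match_all: the sample text below is modelled on the question, the surrounding markup is an assumption, and both ArticleID and AidKey are assumed to be plain (optionally negative) integers. The first sample URL in the question, with its =3D sequences, is not covered by this pattern.

```php
<?php
// Sample text modelled on the question; the markup around the URLs is assumed.
$page = 'Plain: http://example.com/ArticleDetails.aspx?ArticleID=10919199&AidKey=1956996566 '
      . '<a href="http://www.example.com/ArticleDetails.aspx?ArticleID=10773616&amp;AidKey=1998267392">'
      . 'This is a link description</a> '
      . 'http://example.com/ArticleDetails.aspx?ArticleID=10975137&AidKey=701321736this is not part of the url.';

// \d+ stops before any text glued onto the end of the URL.
$pattern = '~http://(?:www\.)?example\.com/ArticleDetails\.aspx'
         . '\?ArticleID=\d+(?:&amp;|&)AidKey=-?\d+~i';

preg_match_all($pattern, $page, $matches);

// Normalise the HTML-escaped ampersand so all URLs look the same.
$urls = array_map('html_entity_decode', $matches[0]);

print_r($urls);
```

Matching digits explicitly, rather than [\d\w-]*, is what keeps the trailing "this is not part of the url" text out of the last match.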

regex , php, preg_match

I'm trying to extract the mileage value from different eBay pages, but I'm stuck: there seem to be too many patterns because the pages are all a bit different. I would therefore like to know if you can help me with a better pattern.
Some examples of items are the following :
http://cgi.ebay.com/ebaymotors/1971-Chevy-C10-Shortbed-Truck-/250647101696?cmd=ViewItem&pt=US_Cars_Trucks&hash=item3a5bbb4100
http://cgi.ebay.com/ebaymotors/1987-HANDICAP-LEISURE-VAN-W-WHEEL-CHAIR-LIFT-/250647101712?cmd=ViewItem&pt=US_Cars_Trucks&hash=item3a5bbb4110
http://cgi.ebay.com/ebaymotors/ws/eBayISAPI.dll?ViewItemNext&item=250647101696
Please see the patterns at the following link (I still cannot figure out how to escape the HTML here):
http://pastebin.com/zk4HAY3T
However, they are not enough, as there still seem to be new patterns...

Don't use regular expressions to parse HTML. Even for a relatively simple thing such as this, regular expressions make you highly dependent on the exact markup.
You can use DOMDocument and XPath to grab the value nicely, and it's somewhat more resilient to changes in the page:
$doc = new DOMDocument();
@$doc->loadHtmlFile($url);
$xpath = new DOMXpath($doc);
foreach ($xpath->query('//th[contains(., "Mileage")]/following-sibling::td') as $td) {
var_dump($td->textContent);
}
The XPath query searches for a <th> which contains the word "Mileage", then selects the <td>s following it.
You can then lop off the miles suffix and get rid of commas using str_replace or substr.
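A minimal sketch of that clean-up step, assuming the cell text looks like the made-up value below (the real pages may format it differently):

```php
<?php
// Hypothetical cell text, as $td->textContent might return it.
$text = " 86,000 miles ";

// Drop the thousands separator and the suffix, then trim the leftover whitespace.
$mileage = (int) trim(str_replace([',', 'miles'], '', $text));

var_dump($mileage); // int(86000)
```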

This should be a bit more generic - it doesn't care what's inside the html tags. It works on all three of the links you provided.
/Mileage[^<]*<[^>]*><[^>]*>(.*?)\s*miles/i
Of course, there could be better ways depending on what other constraints you have, but this is a good starting point.
Recognizing the duplication there, you could simplify (logically, at least) a bit more:
/Mileage[^<]*(?:<[^>]*>){2}(.*?)\s*miles/i
You're looking for two html tags in a row between the words 'Mileage' and 'miles'. That's the (?:<[^>]*>){2} part. The ?: tells it not to remember that sequence, so that $matches[1] still contains the number you're looking for, and the {2} indicates that you want to match the previous sequence exactly twice.
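To illustrate, here is the simplified pattern applied to a made-up table row. The markup is an assumption; the real eBay pages may differ:

```php
<?php
// Hypothetical markup with exactly two tags between "Mileage" and the value.
$html = '<tr><th>Mileage:</th><td>86,000 miles</td></tr>';

if (preg_match('/Mileage[^<]*(?:<[^>]*>){2}(.*?)\s*miles/i', $html, $matches) === 1) {
    echo $matches[1], "\n"; // 86,000
}
```

Because the group is non-capturing, the value stays in $matches[1] no matter how many tags the {2} part consumes.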
