I'm looking for a way to get all the form inputs and respective values from a page given a specific URL and form name.
function GetForm($url, $name)
{
return array
(
'field_name_1' => 'value_1',
'field_name_2' => 'value_2',
'select_field_name' => array('option_1', 'option_2', 'option_3'),
);
}
GetForm('http://www.google.com/', 'f');
Can anyone provide me with the necessary regular expressions to accomplish this?
EDIT: I understand that querying the DOM would be far more reliable, however what I'm looking for is a website agnostic solution that allows me to get all the fields of a given form. I don't believe this is possible with DOM without knowing the document nodes first, am I wrong?
I don't need a bullet proof solution, just something that works on standard web pages, for the FORM tag I've come up with the following RegEx;
'~<form.*?name=[\'"]?' . $name . '[\'"]?.*?>(.+?)</form>~is'
I believe that doing something similar for input fields won't be difficult, what I find most challenging is the RegEx for the select and option fields.
Using regex to parse HTML is probably not the best way to go.
You might take a look at DOMDocument::loadHTML, which will allow you to work with an HTML document using DOM methods (and XPath queries, for instance, if you know those).
You might also want to take a look at Zend_Dom and Zend_Dom_Query, btw, which are quite nice if you can use some parts of Zend Framework in your application.
They are used to fetch data from HTML pages when doing functional testing with Zend_Test, for instance -- and work quite well ;-)
It may seem harder in the first place... But, considering the mess some HTML pages are, it is probably a much wiser idea...
EDIT after the comment and the edit of the OP
Here are a couple of thoughts about, to begin with something "simple", an input tag :
it can spread across several lines
it can have many attributes
considering only name and value are of interest to you, you have to deal with the fact that those two can be in any possible order
attributes can have double-quotes, single-quotes, or even nothing around their values
tags / attributes can be both lower-case or upper-case
tags don't always have to be closed
Well, some of those points are not valid-HTML ; but still work in the most common web-browsers, so they have to be taken into account...
Only with those points, I wouldn't like to be the one writing the regex ^^
But I suppose there might be other difficulties I didn't think about.
On the other side, you have DOM and xpath... To get the value of an input name="q" (example is this page), it's a matter of something like this :
$url = 'http://www.google.fr/search?q=test&ie=utf-8&oe=utf-8&aq=t&rls=com.ubuntu:en-US:unofficial&client=firefox-a';
$html = file_get_contents($url);
$dom = new DOMDocument();
if (@$dom->loadHTML($html)) {
// yep, not necessarily valid-html...
$xpath = new DOMXpath($dom);
$nodeList = $xpath->query('//input[@name="q"]');
if ($nodeList->length > 0) {
for ($i=0 ; $i<$nodeList->length ; $i++) {
$node = $nodeList->item($i);
var_dump($node->getAttribute('value'));
}
}
} else {
// too bad...
}
What matters here ? The XPath query, and only that... And is there anything static/constant in it ?
Well, I say I want all <input> that have a name attribute that is equal to "q".
And it just works : I'm getting this result :
string 'test' (length=4)
string 'test' (length=4)
(I checked : there are two input name="q" on the page ^^ )
Do I know the structure of the page ? Absolutely not ;-)
I just know I/you/we want input tags named q ;-)
And that's what we get ;-)
EDIT 2 : and a bit fun with select and options :
Well, just for fun, here's what I came up for select and option :
$url = 'http://www.google.fr/language_tools?hl=fr';
$html = file_get_contents($url);
$dom = new DOMDocument();
if (@$dom->loadHTML($html)) {
// yep, not necessarily valid-html...
$xpath = new DOMXpath($dom);
$nodeListSelects = $xpath->query('//select');
if ($nodeListSelects->length > 0) {
for ($i=0 ; $i<$nodeListSelects->length ; $i++) {
$nodeSelect = $nodeListSelects->item($i);
$name = $nodeSelect->getAttribute('name');
$nodeListOptions = $xpath->query('option[@selected="selected"]', $nodeSelect); // We want options that are inside the current select
if ($nodeListOptions->length > 0) {
for ($j=0 ; $j<$nodeListOptions->length ; $j++) {
$nodeOption = $nodeListOptions->item($j);
$value = $nodeOption->getAttribute('value');
var_dump("name='$name' => value='$value'");
}
}
}
}
} else {
// too bad...
}
And I get as an output :
string 'name='sl' => value='fr'' (length=23)
string 'name='tl' => value='en'' (length=23)
string 'name='sl' => value='en'' (length=23)
string 'name='tl' => value='fr'' (length=23)
string 'name='sl' => value='en'' (length=23)
string 'name='tl' => value='fr'' (length=23)
Which is what I expected.
Some explanations ?
Well, first of all, I get all the select tags of the page, and keep their name in memory.
Then, for each one of those, I get the selected option tags that are its descendants (there's always only one, btw).
And here, I have the value.
A bit more complicated than the previous example... But still much easier than regex, I believe... Took me maybe 10 minutes, not more... And I still won't have the courage (madness ?) to start thinking about some kind of mutant regex that would be able to do that :-D
Oh, and, as a sidenote : I still have no idea what the structure of the HTML document looks like : I have not even taken a single look at its source ^^
I hope this helps a bit more...
Who knows, maybe I'll convince you regex are not a good idea when it comes to parsing HTML... maybe ? ;-)
Still : have fun !
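Putting the two examples above together, here is a rough sketch of the GetForm() the question asks for, built on DOMDocument and XPath rather than regex. It is only a sketch under a few assumptions: the function name getFormFields() is made up for this example, it takes an HTML string (fetch it with file_get_contents() first), it assumes the form name contains no double quote, and it ignores textarea and button elements.

```php
<?php
// Illustrative sketch, not an existing API: returns name => value for inputs
// and name => list of option values for selects inside the named form.
function getFormFields(string $html, string $name): array
{
    $dom = new DOMDocument();
    // Suppress warnings: real pages are rarely valid HTML.
    if (!@$dom->loadHTML($html)) {
        return [];
    }
    $xpath = new DOMXPath($dom);
    // Assumes $name contains no double quote.
    $forms = $xpath->query('//form[@name="' . $name . '"]');
    if ($forms->length === 0) {
        return [];
    }
    $form = $forms->item(0);
    $fields = [];

    // Every named input below the form, wherever it sits in the tree.
    foreach ($xpath->query('.//input[@name]', $form) as $input) {
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }

    // Every named select, with the values of all its options.
    foreach ($xpath->query('.//select[@name]', $form) as $select) {
        $options = [];
        foreach ($xpath->query('.//option', $select) as $option) {
            $options[] = $option->hasAttribute('value')
                ? $option->getAttribute('value')
                : trim($option->textContent);
        }
        $fields[$select->getAttribute('name')] = $options;
    }
    return $fields;
}

// Made-up sample markup for the demonstration.
$html = '<html><body><form name="f">'
      . '<input type="text" name="q" value="test">'
      . '<select name="lang"><option value="fr">French</option>'
      . '<option value="en" selected="selected">English</option></select>'
      . '</form></body></html>';
print_r(getFormFields($html, 'f'));
```

Note that nothing in it depends on where the form sits in the document: the only things known in advance are the form name and the tag names.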
I am making a website, and I would like to make a custom Markup type language in PHP. I want the tags to be surrounded with [ and ]. Now, I was thinking about this, like anyone would, and I could do something like this:
function formatMarkup($markup = ''){
$markup = str_replace('[color=blue]', '<span style="color: blue">', $markup);
return $markup;
}
Even though that might work, it would be more programmatically correct if it did something like explode(), but starting at every [ and ending at every ]. It would be great to find out how. Thank you for your time and effort.
EDIT:
I have decided to use preg_split(). It seems nice, and all, but I cannot get the regex. Here is my code.
EDIT #2:
I have got most of the regex done, but there are unneeded extra keys in the array. How would I fix them? Here is my new code.
I have made my Markup language. I used
$split = preg_split("/(\[|\])/", $markup);
to get the individual "tags" and used
foreach($split as $k => $v){
if(strlen($v) < 1){
continue;
}
// ... checks for each tag go here ...
}
to iterate through each of them, and check if the value is empty. Then, after that, I would do all of my checks, and parse the code blocks together, and make, line after line, the reconstructed text.
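One way to avoid the empty entries in the first place, as a rough sketch: let preg_split() capture whole [tags] as delimiters and drop empty strings with PREG_SPLIT_NO_EMPTY. The function names tokenizeMarkup() and renderMarkup() are made up for this example, and the [/color] closing tag is an assumption, since the question only shows [color=blue].

```php
<?php
// Illustrative sketch: split markup into [tag] tokens and plain-text tokens.
function tokenizeMarkup(string $markup): array
{
    // DELIM_CAPTURE keeps the [tags] in the result,
    // NO_EMPTY removes the blank strings between adjacent tags.
    return preg_split(
        '/(\[[^\]]*\])/',
        $markup,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );
}

function renderMarkup(string $markup): string
{
    $out = '';
    foreach (tokenizeMarkup($markup) as $token) {
        if (preg_match('/^\[color=([a-z]+)\]$/', $token, $m)) {
            $out .= '<span style="color: ' . $m[1] . '">';
        } elseif ($token === '[/color]') { // assumed closing tag
            $out .= '</span>';
        } else {
            // Plain text and unknown tags are escaped and passed through.
            $out .= htmlspecialchars($token);
        }
    }
    return $out;
}

echo renderMarkup('Hello [color=blue]world[/color]!'), "\n";
// prints: Hello <span style="color: blue">world</span>!
```

With this, the strlen() check in the loop is no longer needed, because no empty token ever reaches it.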
I’m writing a custom parser/data extractor for some pretty shitty HTML.
Changing the HTML is out of the question.
I will spare you the details of the hoops I’ve had to jump through but I’ve now come pretty close to my original goal. I’m using a combination of DOMDocument getElementByName, regular expression replace (I know, I know...), and XPath queries.
I need to get all the text out of the body of the document. I would like for the navigation to remain a separate entity, at least in the abstract. Here’s what I’m doing now:
$contentnodes = $xpath->query("//body//*[not(self::a)]/text()|//body//ul/li/a");
foreach ($contentnodes as $contentnode) {
$type = $contentnode->nodeName;
$content = $contentnode->nodeValue;
$output[] = array( $type, $content);
}
This works, except that of course it treats all of the links on the page differently, and I only want it to do that to the navigation.
What XPath syntax can I use so that, in the first part of that query, before the |, I tell it to get all the text nodes of body’s children except ul > li > a.
Please note that I cannot rely on the presence of p tags or h1 tags or anything sensible like that to make educated guesses about content.
Thanks
Update: @hr_117’s answer below works. I’ve also found that you can use multiple not statements like so:
//body//text()[not(parent::a/parent::li/parent::ul)][not(parent::h1)]
You may try something like this:
//body//text()[not(parent::a/parent::li/parent::ul)]|//body//ul/li/a
//body//*[not(self::a/parent::li/parent::ul)]/text()[normalize-space()]|//body//ul/li/a
(test)
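To see what the first expression selects, here is a small sketch run against made-up markup (the sample HTML is an assumption, the real page will differ). The navigation link comes back as an a element, while the text of a link outside ul > li is still returned as an ordinary text node.

```php
<?php
// Made-up sample page: one navigation link and one inline link.
$html = '<html><body><h1>Title</h1>'
      . '<ul><li><a href="/a">Nav A</a></li></ul>'
      . '<div>Intro with an <a href="/x">inline link</a> inside.</div>'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Text nodes whose parent is NOT ul > li > a, plus the navigation links themselves.
$query = '//body//text()[not(parent::a/parent::li/parent::ul)]|//body//ul/li/a';

$output = [];
foreach ($xpath->query($query) as $node) {
    $output[] = [$node->nodeName, $node->nodeValue];
}
print_r($output);
// Node names in document order: #text, a, #text, #text, #text
```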
I have a problem with getting content from an XML file into a MySQL database.
This is the code:
$objDOM = new DOMDocument('1.0', 'UTF-8');
$objDOM->load("something.xml"); $IAutnr = $objDOM->getElementsByTagName("Data");
Now, in a for loop:
for($i=$t;$i<=$max;$i++) {
$some= $objDOM->getElementsByTagName("some");
$something = $some->item($i)->nodeValue;
$some2 = $objDOM->getElementsByTagName("some2");
$something2 = $some2->item($i)->nodeValue;
Now put $something and $something2 into the database
}
Now, what happens is that everything works perfectly fine until one of the elements (some, some2...) does not exist within a "Data" tag. When that happens, the code takes the element from the next "Data" tag instead, which mixes up all my data, so I end up with values in my database that don't belong together.
I already tried for several hours to change the XML manually by putting the missing tags inside, but with thousands of data records, it is not possible.
So I need to add something to my code so that, if the tag doesn't exist, it is just skipped rather than taken from the next "Data" tag.
I actually don't even understand why it does that. Why does it just jump into the next "Data" tag?
Thank you very much for your help!
I'm only guessing here about the content of your XML structure, but I imagine it looks something like
...
<Data>
<some>a</some>
<some2>b</some2>
</Data>
<Data>
<some>c</some>
<some2>d</some2>
</Data>
...
If this is the case, you should be looping over the collection of Data elements in $IAutnr, eg
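for($i = 0, $limit = min($IAutnr->length, $max); $i < $limit; $i++) {
$data = $IAutnr->item($i);
// Search only inside the current Data element
$some = $data->getElementsByTagName('some');
// item(0) is null when the element is missing, so check the length first
$something = $some->length > 0 ? $some->item(0)->nodeValue : null;
$some2 = $data->getElementsByTagName('some2');
$something2 = $some2->length > 0 ? $some2->item(0)->nodeValue : null;
// insert
}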
Unless you need some of the more advanced features of the DOM library, I'd recommend using SimpleXML.
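For comparison, a rough SimpleXML version of the same loop. The Root wrapper element and the sample data are guesses, since the real XML was not posted; the point is that each Data element is read on its own and a missing child simply yields null.

```php
<?php
// Guessed structure: the second Data element has no <some2> child.
$xml = '<Root>'
     . '<Data><some>a</some><some2>b</some2></Data>'
     . '<Data><some>c</some></Data>'
     . '</Root>';

$root = simplexml_load_string($xml);

$rows = [];
foreach ($root->Data as $data) {
    // isset() tells us whether the child exists inside THIS Data element.
    $rows[] = [
        'some'  => isset($data->some)  ? (string) $data->some  : null,
        'some2' => isset($data->some2) ? (string) $data->some2 : null,
    ];
}
print_r($rows);
```

If the Data elements are nested deeper than directly under the root, $root->xpath('//Data') can be looped over in the same way.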
It does that because you're asking it to extract elements with tag name "some" and "some2" from the entire XML structure, so that's what it does -- it doesn't only look into the branch you intend it to, because you never tell it to do that. One way to fix it is to look at $some->item($i)->parentNode (and maybe at that node's parent, and so on) in order to properly identify the parent $something and $something2 belong to. Of course, there's no guarantee that $something and $something2 belong to the same parent, unless your XML is somehow guaranteed to present either none or both within the same branch. I know the explanation's a bit hairy, but that's the best way I could put it into words.
I have a PHP script that pulls some content off of a server, but the problem is that the line on which the content is changes every day, so I can't just pull a specific line. However, the content is contained within a div that has a unique id. Is it possible (and is it the best way) for regex to search for this unique id and then pass the line of which it's on back to my script?
Example:
HTML file:
<html><head><title>Example</title></head>
<body>
<div id="Alpha"> Blah blah blah </div>
<div id="Beta"> Blah Blah Blah </div>
</body>
</html>
So let's say that I'm looking for the line with an opening div tag with an id of alpha. The code should return 3, because on the third line is the div with the id of alpha.
At the risk of providing more up-votes for Jeff who has already crossed the mountains of madness... see here
The argument rages back and forth, but... if it's a simple one-off or little-used script you are writing then sure, use regex; if it's more complex and needs to be reliable with little future tweaking then I'd suggest using an HTML parser. HTML is a nasty, often non-regular beast to tame. Use the right tool for the job... maybe in your case it's regex, or maybe it's a full-blown parser.
Generally, NO. But if you are sure that the div will always be on one line or that there is no other div inside it, you can use it without problem. Something like /<div id=\"mydivid\">(.*?)<\/div>/ or something similar.
Otherwise, DOMDocument would be a more sane way.
EDIT See from your HTML example. My answer would be "YES". RegEx is a very good tool for this.
I assume that you have the HTML as one continuous text, not as separate lines (which would be slightly different). I also assume that you want the line number more than the line content.
Here is some rough PHP code to extract it (just to give some idea).
$HTML =
"<html><head><title>Example</title></head>
<body>
<div id=\"Alpha\"> Blah blah blah </div>
<div id=\"Beta\"> Blah Blah Blah </div>
</body>
</html>";
$ID = "Alpha";
function GetLineOfDIV($HTML, $ID) {
$RegEx_Alpha = '/\n(<div id="'.$ID.'">.*?<\/div>)\n/m';
$Index = preg_match($RegEx_Alpha, $HTML, $Match, PREG_OFFSET_CAPTURE);
if ($Index !== 1)
return -1; // no such div, so $Match[1] does not exist
$Match = $Match[1]; // Only the one in '(...)'
//$MatchStr = $Match[0]; Since you do not want it, so we comment it out.
$MatchOffset = $Match[1];
$StartLines = preg_split("/\n/", $HTML, -1, PREG_SPLIT_OFFSET_CAPTURE);
foreach($StartLines as $I => $StartLine) {
$LineOffset = $StartLine[1];
if ($MatchOffset <= $LineOffset)
return $I + 1;
}
return count($StartLines);
}
echo GetLineOfDIV($HTML, $ID);
I hope I give you some idea.
According to Jeff Atwood, you should never parse HTML using regex.
Since the line number is important to you here and not the actual contents of the div, I'd be inclined not to use regex at all. I'd probably explode() the string into an array and loop through that array looking for your marker. Like so:
<?php
$myContent = "[your string of html here]";
$myArray = explode("\n", $myContent);
$arraylen = count($myArray); // So you don't waste time counting the array at every loop
$lineNo = 0;
for($i = 0; $i < $arraylen; $i++)
{
$pos = strpos($myArray[$i], 'id="Alpha"');
if($pos !== false)
{
$lineNo = $i+1;
break;
}
}
?>
Disclaimer: I haven't got a php installation readily available to test this so some debugging may be required.
Hope this helps as I think it's probably just going to be a waste of time for you to implement a parsing engine just to do something so simple - especially if it's a one-off.
Edit: if the content is important to you at this stage too then you can use this in combination with the other answers which provide an adequate regex for the job.
Edit #2: Oh what the hey... here's my two cents:
"/<div.*?id=\"Alpha\".*?>.*?(<div.*<\/div>)*.*?<\/div>/s"
The (<div.*<\/div>) tells the regex engine that it may find nested div tags and to just incorporate them into the match if it finds them rather than just stopping at the first </div>. However this only solves the problem if there is only one level of nesting. If there's more, then regex is not for you sorry :(.
The /s also makes the dot match linebreaks so you don't have to dirty up your expressions with [\S\s] everywhere. (/m would not do that: it only changes where ^ and $ match.)
Again, sorry, I've no environment to test this in at the moment so you may need to debug.
Cheers
Iain
The fact that a unique id is involved, sounds promising, but since it will be a DIV, and not necessarily a single line of HTML, it will be difficult to construct a regular expression, and the usual objections to parsing HTML with regexes apply.
Not recommended.
Instead of RegEx, use a parser that is made especially to handle (messy) HTML. This will make your application less brittle in case the HTML changes slightly, and you don't have to hand-craft custom RegEx each time you want to pull out a new piece of data.
See this Stack Overflow page: Mature HTML Parsers for PHP
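With the built-in DOM extension, a parser-based version could look roughly like the sketch below. It relies on DOMNode::getLineNo(), which reports the line on which the node was defined; the function name getLineOfId() is made up, the id match is case-sensitive, and how accurate the reported line is for badly broken markup is something to verify on the real page.

```php
<?php
// Illustrative sketch: line number of the div with the given id, or -1.
function getLineOfId(string $html, string $id): int
{
    $dom = new DOMDocument();
    if (!@$dom->loadHTML($html)) {
        return -1;
    }
    $xpath = new DOMXPath($dom);
    // Assumes $id contains no double quote.
    $nodes = $xpath->query('//div[@id="' . $id . '"]');
    return $nodes->length > 0 ? $nodes->item(0)->getLineNo() : -1;
}

$html = "<html><head><title>Example</title></head>\n"
      . "<body>\n"
      . "<div id=\"Alpha\"> Blah blah blah </div>\n"
      . "<div id=\"Beta\"> Blah Blah Blah </div>\n"
      . "</body>\n"
      . "</html>";

echo getLineOfId($html, 'Alpha'), "\n";
```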
@OP since your requirement is that easy, you can just use string methods
$f = fopen("file","r");
if($f){
$i = 0; // line counter
while( !feof($f) ){
$i+=1;
$line = fgets($f,4096);
if (stripos($line,'<div id="Alpha">')!==FALSE){
print "line number: $i\n";
}
}
fclose($f);
}
I am trying to create a simple alert app for some friends.
Basically I want to be able to extract the "price" and "stock availability" data from a webpage like the following two:
http://www.sparkfun.com/commerce/product_info.php?products_id=5
http://www.sparkfun.com/commerce/product_info.php?products_id=9279
I have made the e-mail and SMS alert part, but now I want to be able to get the quantity and price out of the webpages (those 2 or any other ones) so that I can compare the price and quantity available and alert us to make an order if a product is between some thresholds.
I have tried some regex (found in some tutorials, but I am way too n00b for this) but haven't managed to get this working. Any good tips or examples?
$content = file_get_contents('http://www.sparkfun.com/commerce/product_info.php?products_id=9279');
preg_match('#<tr><th>(.*)</th> <td><b>price</b></td></tr>#', $content, $match);
$price = $match[1];
preg_match('#<input type="hidden" name="quantity_on_hand" value="(.*?)">#', $content, $match);
$in_stock = $match[1];
echo "Price: $price - Availability: $in_stock\n";
It's called screen scraping, in case you need to google for it.
I would suggest that you use a dom parser and xpath expressions instead. Feed the HTML through HtmlTidy first, to ensure that it's valid markup.
For example:
$html = file_get_contents("http://www.example.com");
$html = tidy_repair_string($html);
$doc = new DomDocument();
$doc->loadHtml($html);
$xpath = new DomXPath($doc);
// Now query the document:
foreach ($xpath->query('//table[@class="pricing"]//th') as $node) {
echo $node->nodeValue, "\n";
}
What ever you do: Don't use regular expressions to parse HTML or bad things will happen. Use a parser instead.
1st, this question asks for too much detail. 2nd, extracting data from a website might not be legitimate. However, I have some hints:
Use Firebug or Chrome/Safari Inspector to explore the HTML content and find the pattern of the interesting information
Test your RegEx to see if they match. You may need to do it many times (multi-pass parsing/extraction)
Write a client via cURL or, even much simpler, use file_get_contents (NOTE that some hosts disable loading URLs with file_get_contents)
For me, I'd rather use Tidy to convert to valid XHTML and then use XPath to extract data, instead of RegEx. Why? Because XHTML is not regular and XPath is very flexible. You can learn XSLT to transform.
Good luck!
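As a rough sketch of the XPath route for this case: the quantity_on_hand hidden input is taken from the regex in the question, but the price cell, its class name, the sample values and the threshold are all invented here, so the queries have to be adapted to the real page.

```php
<?php
// Stand-in markup. Only the quantity_on_hand input comes from the question;
// the price cell and its "price" class are made up for this example.
$html = '<html><body>'
      . '<table><tr><th>Price</th><td class="price">$19.95</td></tr></table>'
      . '<form><input type="hidden" name="quantity_on_hand" value="42"></form>'
      . '</body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// evaluate() with string() returns a plain string ('' when nothing matches).
$inStock   = (int) $xpath->evaluate('string(//input[@name="quantity_on_hand"]/@value)');
$priceText = $xpath->evaluate('string(//td[@class="price"])');
// Keep digits and the decimal point only, then convert to a number.
$price     = (float) preg_replace('/[^0-9.]/', '', $priceText);

$minStock = 50; // hypothetical alert threshold
if ($inStock < $minStock) {
    echo "Stock is low: $inStock left at $price\n";
}
```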
You are probably best off loading the HTML code into a DOM parser like this one and searching for the "pricing" table. However, any kind of scraping you do can break whenever they change their page layout, and is probably illegal without their consent.
The best way, though, would be to talk to the people who run the site, and see whether they have alternative, more reliable forms of data delivery (Web services, RSS, or database exports come to mind).
Here is the simplest method I know to extract data from a website. All the data I needed was inside <h3> tags, so I wrote this one.
<?php
include('simple_html_dom.php');
// Create DOM from URL, paste your destined web url in $page
$page = 'http://facebook4free.com/category/facebookstatus/amazing-facebook-status/';
$html = new simple_html_dom();
//Within $html your webpage will be loaded for further operation
$html->load_file($page);
// Find all links
$links = array();
//Within find() function, I have written h3 so it will simply fetch the content from <h3> tag only. Change as per your requirement.
foreach($html->find('h3') as $element)
{
$links[] = $element;
}
reset($links);
//$out will be having each of HTML element content you searching for, within that web page
foreach ($links as $out)
{
echo $out;
}
?>