Scrape the date from a HTML page - php

I am writing a PHP HTML page scraper and I need to find out the date the page was last updated.
I used $html = file_get_html('xyz.com') to get the HTML. One line of the HTML contains the date, like this: 10/24/2016.
I did this:
if (strpos($html, '&nbsp;') !== false) {
    if (strpos($html, ' </a>') !== false) {
        echo "How to print drawing date--here!";
    }
}
Here is the dilemma: I cannot search for 10/24/2016 literally, because I have no way of knowing what the new date will be when the site is updated. It could be 10/30/2016 or 11/12/2016.
Ideally, I would like the date to be in a string, like $date = "11/17/2016".
How do I search for the date itself?

This code will work for you:
preg_match('/\ ([0-9]{1,2}\/[0-9]{1,2}\/[0-9]{4})/', $html, $matches);
This regex searches for a date in m/d/yyyy form that is preceded by a space. Any match is stored in the $matches variable: $matches[0] holds the whole match and $matches[1] holds the date itself.

@krasipenkov was close, but the OP asked for it to be in a $date var:
$html = 'lblah
balh asdf asd
<mickey mouse="disney">f3rt6wergsdfg 1/19/2016 <more stuff="here">etc
asdf';
preg_match('/\ ([0-9]{1,2}\/[0-9]{1,2}\/[0-9]{4})/', $html, $matches);
$date = $matches[1];
echo "your date found is $date";
[see it run] http://sandbox.onlinephpfunctions.com/code/27419098cf4bc48a5ca2c683b046679b6c0af85c
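If the page might contain no date at all, or digits that only look like a date, a slightly more defensive version may help. This is only a sketch: find_first_date() is a made-up helper name, and it assumes the date is always written as month/day/four-digit year.

```php
<?php
// Hypothetical helper: returns the first m/d/yyyy date in $html, or null.
function find_first_date(string $html): ?string
{
    // \b keeps the match from starting or ending inside a longer run of digits.
    if (!preg_match('~\b(\d{1,2})/(\d{1,2})/(\d{4})\b~', $html, $m)) {
        return null;
    }
    // checkdate() rejects impossible dates such as 13/45/2016.
    if (!checkdate((int) $m[1], (int) $m[2], (int) $m[3])) {
        return null;
    }
    return $m[0];
}

$html = '<td>&nbsp;10/24/2016 </a></td>';
$date = find_first_date($html);
echo $date === null ? "No date found\n" : "Date found: $date\n";
```

Unlike the pattern above, this one does not require a space in front of the date, so it also matches a date that directly follows a tag or an entity.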

Related

PHP Convert Full Date To Short Date

I need a PHP script to loop through all .html files in a directory and in each one find the first instance of a long date (i.e. August 25th, 2014) and then adds a tag with that date in short format (i.e. <p class="date">08/25/14</p>).
Has anyone done something like this before? I'm guessing you'd explode the string and use a complex case statement to convert the month names and days to regular numbers and then implode using /.
But I'm having trouble figuring out the regular expression to use for finding the first long date.
Any help or advice would be greatly appreciated!
Here's how I'd do it in semi-pseudo-code...
Loop through all the files using whatever floats your boat (glob() is an obvious choice)
Load the HTML file into a DOMDocument, eg
$doc = new DOMDocument();
$doc->loadHTMLFile($filePath);
Get the body text as a string
$body = $doc->getElementsByTagName('body');
$bodyText = $body->item(0)->textContent; // assuming there's at least one body tag
Find your date string via this regex
preg_match('/(January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}(st|nd|rd|th)?, \d{4}/', $bodyText, $matches);
Load this into a DateTime object and produce a short date string
$dt = DateTime::createFromFormat('F jS, Y', $matches[0]);
$shortDate = $dt->format('m/d/y');
Create a <p> DOMElement with the $shortDate text content, insert it into the DOMDocument where you want and write back to the file using $doc->saveHTMLFile($filePath)
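Put together, the steps above might look roughly like the sketch below. It works on an HTML string rather than a file so it can be run on its own; add_short_date_tag() is a made-up name, and appending the tag at the end of the body is just one possible placement. The ordinal suffix is dropped before parsing so that a single format string covers both "August 25th, 2014" and "August 25, 2014".

```php
<?php
// Hypothetical helper: appends <p class="date">mm/dd/yy</p> to the body and
// returns the new HTML, or null if no long date is found.
function add_short_date_tag(string $html): ?string
{
    $doc = new DOMDocument();
    // @ silences the warnings libxml raises for imperfect markup.
    @$doc->loadHTML($html);

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }

    $months = 'January|February|March|April|May|June|July|August|September|October|November|December';
    if (!preg_match("/($months) (\d{1,2})(?:st|nd|rd|th)?, (\d{4})/", $body->textContent, $m)) {
        return null;
    }

    // Rebuild the date without the suffix; "!" resets the time fields to zero.
    $dt = DateTime::createFromFormat('!F j, Y', "$m[1] $m[2], $m[3]");
    if ($dt === false) {
        return null;
    }

    $p = $doc->createElement('p', $dt->format('m/d/y'));
    $p->setAttribute('class', 'date');
    $body->appendChild($p);

    return $doc->saveHTML();
}

echo add_short_date_tag('<html><body><p>Posted August 25th, 2014</p></body></html>');
```

To process a directory, the same function could be called on file_get_contents($path) for each path returned by glob(), writing the result back only when it is not null.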
I incorporated the helpful response above into what I already had and it seems to work. I'm sure it's far from ideal but it still serves my purpose. Maybe it might be helpful to others:
<?php
$dir = "archive";
// List the directory, skipping the "." and ".." entries.
$files = array_diff(scandir($dir), array(".", ".."));
foreach ($files as $value) {
    $filename = $dir . "/" . $value;
    echo '<br>File name is: ' . $value . "<br><br>";
    $contents = file_get_contents($filename);
    // The optional (st|nd|rd|th)? group matches dates with or without an ordinal suffix.
    if (preg_match('/(January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}(st|nd|rd|th)?, \d{4}/', $contents, $matches)) {
        echo 'the date found is: ' . $matches[0] . "<br><br>";
        $dt = DateTime::createFromFormat('F jS, Y', $matches[0]);
        $shortDate = $dt->format('m/d/y');
        $dateTag = "\n" . '<p class="date">' . $shortDate . '</p>';
        // Append the new tag to the end of the file.
        $file = fopen($filename, "a+");
        fwrite($file, $dateTag);
        fclose($file);
        echo 'Date tag added<br><br>';
    } else {
        echo "ERROR: No date found<br><br>";
    }
}
?>
The code assumes the files to modify are in a directory called "archive" that resides in the same directory as the script.
An earlier version needed two different preg_match lines because some dates are listed with the ordinal suffix (e.g. August 24th, 2005) and some are not (e.g. August 24, 2005). I couldn't quite puzzle out how to get a single preg_match to handle both.
EDIT: replaced double preg_match with single one using \d{1,2}(st|nd|rd|th)? as suggested.

Conversion of text within delimiters to valid url

I have to convert an old website to a CMS and one of the challenges I have is at present there are over 900 folders that contain up to 9 text files in each folder. I need to combine the up to 9 text files into one and then use that file as the import into the CMS.
The file concatenation and import are working perfectly.
The challenge that I have is parsing some of the text in the text file.
The text file contains a url in the form of
Some text [http://xxxxx.com|About something] some more text
I am converting this with this code
if (substr ($line1, 0, 7) !=="Replace") {
$pattern = '/\\[/';
$pattern2 = '/\\]/';
$pattern3 = '/\\|/';
$replacement = '<a href="';
$replacement3 = '">';
$replacement2='</a><br>';
$subject = $line1;
$i=preg_replace($pattern, $replacement, $subject, -1 );
$i=preg_replace($pattern3, $replacement3, $i, -1 );
$i=preg_replace($pattern2, $replacement2, $i, -1 );
$line .= '<div class="'.$folders[$x].'">'.$i.'</div>' ;
}
It may not be the most efficient code but it works and as this is a one off exercise execution time etc is not an issue.
Now to the problem that I cannot seem to code around. Some of the urls in the text files are in this format
Some text [http://xxxx.com] some more text
The pattern matching that I have above finds pattern and pattern2 but as there is no pattern3 the url is malformed in the output.
Regular expressions are not my forte. Is there a way to modify what I have above, or another way to get the correctly formatted url in my output? Or will I need to parse the output a second time, looking for the malformed url and correcting it before writing it to the output file?
You can use preg_replace_callback() to achieve this:
Find any string of the format [...]
Try to split them by the delimiter | using explode()
If the split array contains two pieces, then it means the [...] string contains two pieces: the link href and the link anchor text
If not, then it means the [...] string contains only the link href part
Format and return the link
Code:
$input = <<<EOD
Some text [http://xxxxx.com|About something] some more text
Some text [http://xxxx.com] some more text
EOD;
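$output = preg_replace_callback('#\[([^\]]+)\]#', function($m)
{
    $parts = explode('|', $m[1]);
    if (count($parts) == 2)
    {
        // [url|text]: the first piece is the href, the second the anchor text
        return sprintf('<a href="%s">%s</a>', $parts[0], $parts[1]);
    }
    else
    {
        // [url]: use the URL as both the href and the anchor text
        return sprintf('<a href="%1$s">%1$s</a>', $m[1]);
    }
}, $input);
echo $output;
Output:
Some text <a href="http://xxxxx.com">About something</a> some more text
Some text <a href="http://xxxx.com">http://xxxx.com</a> some more text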
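A variant of the same callback idea, assuming the bracketed markers are meant to become <a> elements. It is only a sketch: bracket_links_to_html() is a made-up name, and escaping with htmlspecialchars() is an extra precaution for text files that might contain quotes or angle brackets.

```php
<?php
// Hypothetical helper: converts [url|text] and [url] markers into links.
function bracket_links_to_html(string $input): string
{
    return (string) preg_replace_callback('#\[([^\]]+)\]#', function (array $m): string {
        // Limit of 2: only the first | separates the URL from the link text.
        $parts = explode('|', $m[1], 2);
        $href  = trim($parts[0]);
        // With no | present, fall back to showing the URL itself.
        $text  = isset($parts[1]) ? trim($parts[1]) : $href;

        return sprintf(
            '<a href="%s">%s</a>',
            htmlspecialchars($href, ENT_QUOTES),
            htmlspecialchars($text, ENT_QUOTES)
        );
    }, $input);
}

echo bracket_links_to_html('Some text [http://xxxx.com] some more text'), "\n";
```

Passing a limit to explode() means link text that itself contains a | character stays in one piece instead of falling through to the URL-only branch.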

I need to find a string in a string then replace that and text around it

I have a string that contains markers which I need to replace with text from a database. The string itself is stored in a database, and the markers are to be auto-filled with data from a different part of the database.
$text = '<span data-field="la_lname" data-table="user_properties">
{Listing Agent Last Name}
</span>
<br>RE: The new offer<br>Please find attached....';
If I can find the data marker with:
strpos($text, 'la_lname');
can I use that to select everything in and between the <span> and </span> tags,
so that the new string looks like this?
'Sommers<br>RE: The new offer<br>Please find attached....'
I thought I could explode the string on the <span> tags, but that opens up a lot of problems because I need to keep the text intact and formatted as it is. I just want to insert the data and leave everything else untouched.
To get what's between two parts of a string
for example if you have
<span>SomeText</span>
If you want to get SomeText then I suggest using a function that gets whatever is between two parts that you put as parameters
<?php
function getbetween($content,$start,$end) {
$r = explode($start, $content);
if (isset($r[1])){
$r = explode($end, $r[1]);
return $r[0];
}
return '';
}
$text = '<span>SomeText</span>';
$start = '<span>';
$end = '</span>';
$required_text = getbetween($text,$start,$end);
$full_line = $start.$required_text.$end;
$text = str_replace($full_line, 'WHAT TO REPLACE IT WITH HERE',$text);
You could try preg_replace or use a DOM Parser, which is far more useful for navigating HTML-like-structure.
I should add that while regular expressions should work just fine in this example, you may need to do more complex things in the future or traverse more intricate DOM structures for your replacements, so a DOM Parser is the way to go in this case.
Using PHP Simple HTML DOM Parser
$html = str_get_html('<span data-field="la_lname" data-table="user_properties">{Listing Agent Last Name}</span><br>RE: The new offer<br>Please find attached....');
// find() with an index returns a single element; without one it returns an array
$html->find('span', 0)->innertext = 'New value of span';
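The same idea can also be tried with PHP's built-in DOM extension instead of Simple HTML DOM. The following is a rough sketch rather than a drop-in solution: fill_markers() and the $values array are made up for illustration, and it assumes each marker span should be replaced entirely by plain text looked up by its data-field attribute.

```php
<?php
// Hypothetical helper: replaces every <span data-field="..."> whose field
// name is a key in $values with the matching plain-text value.
function fill_markers(string $html, array $values): string
{
    $doc = new DOMDocument();
    // Wrap the fragment in a div so it has one root; the flags ask libxml
    // not to add <html>/<body> wrappers or a doctype.
    @$doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXPath($doc);
    // iterator_to_array() takes a snapshot, since the loop modifies the tree.
    foreach (iterator_to_array($xpath->query('//span[@data-field]')) as $span) {
        $field = $span->getAttribute('data-field');
        if (array_key_exists($field, $values)) {
            $span->parentNode->replaceChild($doc->createTextNode($values[$field]), $span);
        }
    }

    // Serialize only the children of the wrapper div.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$text = '<span data-field="la_lname" data-table="user_properties">{Listing Agent Last Name}</span>'
      . '<br>RE: The new offer<br>Please find attached....';
echo fill_markers($text, ['la_lname' => 'Sommers']), "\n";
```

Spans whose field name is not in $values are left untouched, so the rest of the text keeps its markup. Text containing non-ASCII characters may need extra encoding handling that this sketch does not cover.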

How do I find scrape information between 2 tags?

I am trying to scrape information with PHP that has their data like so:
<br>1998 - <a href="http://example.com/movie/id/2345">A Night at the Roxburry<a/>
I need to get the year that is between the <br> and the <a> tag. I have gotten the title of the movie by using PHP Simple DOM HTML parser. This was the code that I used to parse the title
foreach($dom->getElementsByTagName('a') as $link){
$title = $link->getAttribute('href');
}
I tried using:
$string = '<br>1998 - <a href="http://example.com/movie/id/2345">A Night at the Roxburry<a/>';
$year = preg_match_all('/<br>(.*)<a>', $string);
But it's not finding the year that is in between the <br> and the <a> tag. Does anyone know what I could possibly do to find the year?
Try this:
<?php
$subject = '<br>1998 - <a href="http://site.com/movie/id/2345">A Night at the Roxburry<a/>';
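// The parentheses capture just the four digits, without the leading <br>.
$pattern = '/<br>([0-9]{4})/';
preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);
// With PREG_OFFSET_CAPTURE, $matches[1][0] is the year and $matches[1][1] its offset.
print_r($matches);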
?>
Note that you can change the pattern if the year is shown in some other format. If you want to see everything between the two tags, you can use $pattern = '/<br>.*<a/'; or any other pattern appropriate for you.
The expression you are using: $year = preg_match_all('/<br>(.*)<a>', $string); will find text between <br> and <a>, but in your example you do not have <a> anywhere. Try looking for text between <br> and <a like this:
preg_match('/<br>([^<]*)<a/', $string, $matches);
$year = $matches[1]; // the captured text, here "1998 - "
Note that I also changed . to [^<] to make sure it stops at the next tag. Otherwise it will also match strings like this:
<br>foo<br><br>1998 - <a href="http://site.com/movie/id/2345">A Night at the Roxburry<a
because they start with <br> and end with <a. That is probably not what you need, and your year would look like this:
foo<br><br>1998 - <a href="http://site.com/movie/id/2345">A Night at the Roxburry
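If every entry follows the same "<br>year - <a href=...>title" shape, one pattern can pull out the year, URL and title together. This is only a sketch under that assumption; extract_movies() is a made-up name.

```php
<?php
// Hypothetical helper: returns one array per entry shaped like
// <br>1998 - <a href="...">Title
function extract_movies(string $html): array
{
    // Group 1: four-digit year, group 2: href value, group 3: link text.
    $pattern = '~<br>\s*(\d{4})\s*-\s*<a\s+href="([^"]*)"[^>]*>([^<]*)~i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $movies = [];
    foreach ($matches as $m) {
        $movies[] = ['year' => (int) $m[1], 'url' => $m[2], 'title' => trim($m[3])];
    }
    return $movies;
}

$string = '<br>1998 - <a href="http://example.com/movie/id/2345">A Night at the Roxburry<a/>';
print_r(extract_movies($string));
```

PREG_SET_ORDER groups the captures per entry, which is convenient when the page lists many movies one after another.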

Bug with strtotime()

The Simple HTML DOM library is used to extract the timestamp from a webpage. strtotime is then used to convert the extracted timestamp to a MySQL timestamp.
Problem: When strtotime() is used on what looks like a valid timestamp, nothing is printed (see 2:), which means strtotime() failed and returned false. However, when Simple HTML DOM is not used, as in the 2nd example, everything works properly.
What is happening, and how can this be fixed?
Output:
1:2013-03-03, 12:06PM
2:
3:1970-01-01 00:00:00
var_dump($time)
string(25) "2013-03-03, 12:06PM"
PHP
include_once(path('app') . 'libraries/simple_html_dom.php');
// Convert to HTML DOM object
$html = new simple_html_dom();
$html_raw = '<p class="postinginfo">Posted: <date>2013-03-03, 12:06PM EST</date></p>';
$html->load($html_raw);
// Extract timestamp
$time = $html->find('.postinginfo', 0);
$pattern = '/Posted: (.*?) (.).T/s';
$matches = '';
preg_match($pattern, $time, $matches);
$time = $matches[1];
echo '1:' . $time . '<br>';
echo '2:' . strtotime($time) . '<br>';
echo '3:' . date("Y-m-d H:i:s", strtotime($time));
2nd Example
PHP (Working, without Simple HTML DOM)
// Extract posting timestamp
$time = 'Posted: 2013-03-03, 12:06PM EST';
$pattern = '/Posted: (.*?) (.).T/s';
$matches = '';
preg_match($pattern, $time, $matches);
$time = $matches[1];
echo '1:' . $time . '<br>';
echo '2:' . strtotime($time) . '<br>';
echo '3:' . date("Y-m-d H:i:s", strtotime($time));
Output (Correct)
1:2013-03-03, 12:06PM
2:1362312360
3:2013-03-03 12:06:00
var_dump($time)
string(19) "2013-03-03, 12:06PM"
According to your var_dump(), the $time string you extracted from the HTML code is 25 characters long.
The string you see, "2013-03-03, 12:06PM", is only 19 characters long.
So, where are those 6 extra characters? Well, it's pretty obvious, really: the string you're trying to parse is really "<date>2013-03-03, 12:06PM". But when you print it into an HTML document, that <date> is parsed as an HTML tag by the browser.
To see it, use the "View Source" function in your browser. Or, much better yet, use htmlspecialchars() when printing any variables that are not supposed to contain HTML code.
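One possible way to apply this is to strip any markup from the captured text before handing it to strtotime(), and to escape the raw value when printing it for debugging. The sketch below assumes the captured string looks like the one in the question; parse_posted_time() is a made-up name, and the time zone is fixed to UTC only so the example is reproducible.

```php
<?php
date_default_timezone_set('UTC'); // assumption: fixed zone for a reproducible result

// Hypothetical helper: strips markup before parsing; returns a Unix timestamp or null.
function parse_posted_time(string $raw): ?int
{
    $clean = trim(strip_tags($raw));
    $ts = strtotime($clean);
    return $ts === false ? null : $ts;
}

$time = '<date>2013-03-03, 12:06PM';

// Escaping makes the hidden <date> tag visible when viewed in a browser.
echo htmlspecialchars($time), "\n";

$ts = parse_posted_time($time);
echo $ts === null ? "Could not parse the time\n" : date('Y-m-d H:i:s', $ts) . "\n";
```

Simple HTML DOM also exposes a plaintext property on elements, which may be a simpler way to get the text without tags before running the regex.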
