PHP Convert Full Date To Short Date - php

I need a PHP script to loop through all .html files in a directory and, in each one, find the first instance of a long date (e.g. August 25th, 2014) and then add a tag with that date in short format (e.g. <p class="date">08/25/14</p>).
Has anyone done something like this before? I'm guessing you'd explode the string and use a complex case statement to convert the month names and days to regular numbers and then implode using /.
But I'm having trouble figuring out the regular expression to use for finding the first long date.
Any help or advice would be greatly appreciated!

Here's how I'd do it in semi-pseudo-code...
Loop through all the files using whatever floats your boat (glob() is an obvious choice)
Load the HTML file into a DOMDocument, eg
$doc = new DOMDocument();
$doc->loadHTMLFile($filePath);
Get the body text as a string
$body = $doc->getElementsByTagName('body');
$bodyText = $body->item(0)->textContent; // assuming there's at least one body tag
Find your date string via this regex
preg_match('/(January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}(st|nd|rd|th)?, \d{4}/', $bodyText, $matches);
Load this into a DateTime object and produce a short date string
$dt = DateTime::createFromFormat('F jS, Y', $matches[0]);
$shortDate = $dt->format('m/d/y');
Create a <p> DOMElement with the $shortDate text content, insert it into the DOMDocument where you want and write back to the file using $doc->saveHTMLFile($filePath)
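The regex and DateTime steps above can be sketched as one small function. This is a minimal sketch rather than the answerer's exact code: the function name longDateToShort is hypothetical, and it strips the ordinal suffix before parsing so that the plain 'F j, Y' format covers both "August 25th, 2014" and "August 25, 2014". The nullable return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the first long date in $text as mm/dd/yy, or null if none is found.
function longDateToShort(string $text): ?string
{
    $months = 'January|February|March|April|May|June|July|August|September|October|November|December';
    $pattern = '/(' . $months . ') (\d{1,2})(?:st|nd|rd|th)?, (\d{4})/';
    if (!preg_match($pattern, $text, $m)) {
        return null;
    }
    // Rebuild the date without the ordinal suffix, e.g. "August 25, 2014"
    $dt = DateTime::createFromFormat('!F j, Y', $m[1] . ' ' . $m[2] . ', ' . $m[3]);
    return $dt === false ? null : $dt->format('m/d/y');
}

echo longDateToShort('Posted on August 25th, 2014 by admin'), "\n"; // 08/25/14
```

Returning null when nothing matches lets the calling loop skip a file instead of calling format() on a failed parse.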

I incorporated the helpful response above into what I already had and it seems to work. I'm sure it's far from ideal but it still serves my purpose. Maybe it might be helpful to others:
<?php
$dir = "archive";
// List the directory, dropping the "." and ".." entries
$files = array_diff(scandir($dir), array(".", ".."));
foreach ($files as $value) {
    $filename = $dir . "/" . $value;
    echo '<br>File name is: ' . $value . "<br><br>";
    $contents = file_get_contents($filename);
    // The ordinal suffix is optional, so "August 24th, 2005" and "August 24, 2005" both match
    if (preg_match('/(January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}(st|nd|rd|th)?, \d{4}/', $contents, $matches)) {
        echo 'the date found is: ' . $matches[0] . "<br><br>";
        $dt = DateTime::createFromFormat('F jS, Y', $matches[0]);
        $shortDate = $dt->format('m/d/y');
        $dateTag = "\n" . '<p class="date">' . $shortDate . '</p>';
        // Append the tag to the end of the file
        $file = fopen($filename, "a+");
        fwrite($file, $dateTag);
        fclose($file);
        echo 'Date tag added<br><br>';
    } else {
        echo "ERROR: No date found<br><br>";
    }
}
?>
The code assumes the files to modify are in a directory called "archive" that resides in the same directory as the script.
Originally I needed two different preg_match lines because I found out some dates are listed with the ordinal suffix (e.g. August 24th, 2005) and some are not (e.g. August 24, 2005), and I couldn't quite puzzle out how to get a single preg_match that handles both.
EDIT: replaced double preg_match with single one using \d{1,2}(st|nd|rd|th)? as suggested.

Related

Scrape the date from a HTML page

I am writing a PHP HTML page scraper program and I need to find out the date it has been updated.
I used $html = file_get_html('xyz.com') to get the HTML. One line of the HTML has the date like this: 10/24/2016.
I did this:
if (strpos($html, '&nbsp;') !== false) {
    if (strpos($html, ' </a>') !== false) {
        echo "How to print drawing date--here!";
    }
}
Now here is the dilemma: I cannot search for 10/24/2016 because I have no way of knowing what the new date will be when the site is updated. It could be 10/30/2016 or 11/12/2016...
Ideally, I would like the date to be in a string, like $date = "11/17/2016".
How do I search the date itself?
This code will work for you:
preg_match('/\ ([0-9]{1,2}\/[0-9]{1,2}\/[0-9]{4})/', $html, $matches);
This regex searches for a date in the form m/d/yyyy that is preceded by a space. Any matches found are stored in the $matches variable.
@krasipenkov was close, but the OP asked for it to be in a $date variable:
$html = 'lblah
balh asdf asd
<mickey mouse="disney">f3rt6wergsdfg 1/19/2016 <more stuff="here">etc
asdf';
preg_match('/\ ([0-9]{1,2}\/[0-9]{1,2}\/[0-9]{4})/', $html, $matches);
$date = $matches[1];
echo "your date found is $date";
[see it run] http://sandbox.onlinephpfunctions.com/code/27419098cf4bc48a5ca2c683b046679b6c0af85c
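Neither snippet above checks whether preg_match() actually found anything, so $matches[1] raises a notice when the page contains no date. A minimal sketch that adds the check and rejects impossible dates follows. The function name findSlashDate and the sample HTML are hypothetical, and the pattern uses word boundaries instead of a leading space, which is an assumption about how the date appears in the real page.

```php
<?php
// Hypothetical helper: returns the first valid m/d/yyyy date in $html, or null.
function findSlashDate(string $html): ?string
{
    if (!preg_match_all('/\b(\d{1,2})\/(\d{1,2})\/(\d{4})\b/', $html, $matches, PREG_SET_ORDER)) {
        return null;
    }
    foreach ($matches as $m) {
        // checkdate(month, day, year) rejects candidates such as 13/45/2016
        if (checkdate((int) $m[1], (int) $m[2], (int) $m[3])) {
            return $m[0];
        }
    }
    return null;
}

$date = findSlashDate('<td>Drawing date:&nbsp;11/17/2016&nbsp;</a>');
echo $date === null ? 'No date found' : "Date found: $date";
```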

Bug with strtotime()

The Simple HTML DOM library is used to extract the timestamp from a webpage. strtotime is then used to convert the extracted timestamp to a MySQL timestamp.
Problem: When strtotime() is used on a valid timestamp, NULL is returned (see 2:). However, when Simple HTML DOM is not used, as in the 2nd example, everything works properly.
What is happening, and how can this be fixed?
Output:
1:2013-03-03, 12:06PM
2:
3:1970-01-01 00:00:00
var_dump($time)
string(25) "2013-03-03, 12:06PM"
PHP
include_once(path('app') . 'libraries/simple_html_dom.php');
// Convert to HTML DOM object
$html = new simple_html_dom();
$html_raw = '<p class="postinginfo">Posted: <date>2013-03-03, 12:06PM EST</date></p>';
$html->load($html_raw);
// Extract timestamp
$time = $html->find('.postinginfo', 0);
$pattern = '/Posted: (.*?) (.).T/s';
$matches = '';
preg_match($pattern, $time, $matches);
$time = $matches[1];
echo '1:' . $time . '<br>';
echo '2:' . strtotime($time) . '<br>';
echo '3:' . date("Y-m-d H:i:s", strtotime($time));
2nd Example
PHP (Working, without Simple HTML DOM)
// Extract posting timestamp
$time = 'Posted: 2013-03-03, 12:06PM EST';
$pattern = '/Posted: (.*?) (.).T/s';
$matches = '';
preg_match($pattern, $time, $matches);
$time = $matches[1];
echo '1:' . $time . '<br>';
echo '2:' . strtotime($time) . '<br>';
echo '3:' . date("Y-m-d H:i:s", strtotime($time));
Output (Correct)
1:2013-03-03, 12:06PM
2:1362312360
3:2013-03-03 12:06:00
var_dump($time)
string(19) "2013-03-03, 12:06PM"
According to your var_dump(), the $time string you extracted from the HTML code is 25 characters long.
The string you see, "2013-03-03, 12:06PM", is only 19 characters long.
So, where are those 6 extra characters? Well, it's pretty obvious, really: the string you're trying to parse is really "<date>2013-03-03, 12:06PM". But when you print it into an HTML document, that <date> is parsed as an HTML tag by the browser.
To see it, use the "View Source" function in your browser. Or, much better yet, use htmlspecialchars() when printing any variables that are not supposed to contain HTML code.
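The diagnosis above can be reproduced without Simple HTML DOM. This sketch hard-codes the 25-character string from the question; using strip_tags() to remove the stray tag is one possible fix and is an assumption here, not something the answer prescribes.

```php
<?php
$time = '<date>2013-03-03, 12:06PM'; // what the regex in the question actually captured

var_dump(strlen($time));               // int(25), not 19
echo htmlspecialchars($time), "\n";    // &lt;date&gt;2013-03-03, 12:06PM

// One possible fix: drop the markup before handing the string to strtotime()
$clean = strip_tags($time);
var_dump($clean);                      // string(19) "2013-03-03, 12:06PM"
var_dump(strtotime($clean) !== false); // bool(true)
```

Tightening the regex so that it never captures the opening tag would work just as well.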

PHP substr doesn't echo anything

I'm having problems with this code, and the PHP function 'substr' is playing up. I just don't get it. Here's a quick introduction to what I'm trying to achieve. I have this massive XML document with email subscribers from Joomla. I'm trying to import it into Mailchimp, but Mailchimp has some rules for the syntax of imported email lists. At the moment the syntax is like this:
<subscriber>
<subscriber_id>615</subscriber_id>
<name><![CDATA[NAME OF SUBSCRIBER]]></name>
<email>THE_EMAIL@SOMETHING.COM</email>
<confirmed>1</confirmed>
<subscribe_date>THE DATE</subscribe_date>
</subscriber>
I want to make a simple PHP-script that takes all those emails and outputs them like this:
[THE_EMAIL@SOMETHING.COM] [NAME OF SUBSCRIBER]
[THE_EMAIL@SOMETHING.COM] [NAME OF SUBSCRIBER]
[THE_EMAIL@SOMETHING.COM] [NAME OF SUBSCRIBER]
[THE_EMAIL@SOMETHING.COM] [NAME OF SUBSCRIBER]
If I can do that, then I can just copy paste it into Mailchimp.
Now here's my PHP-script, so far:
$fileName = file_get_contents('emails.txt');
foreach(preg_split("/((\r?\n)|(\r\n?))/", $fileName) as $line){
if(strpos($line, '<name><![CDATA[')){
$name = strpos($line, '<name><![CDATA[');
$nameEnd = strpos($line, ']]></name>', $name);
$nameLength = $nameEnd-$name;
echo "<br />";
echo " " . strlen(substr($line, $name, $nameLength));
echo " " . gettype(substr($line, $name, $nameLength));
echo " " . substr($line, $name, $nameLength);
}
if(strpos($line, '<email>')){
$var1 = strpos($line, '<email>');
$var2 = strpos($line, '</email>', $var1);
$length = $var2-$var1;
echo substr($line, $var1, $length);
}
}
The email if-statement works as it should. It checks whether there's an '<email>' tag on the line, and if there is, it finds the end tag and outputs the email with substr.
The name if-statement is annoying me. It should do the same thing as the email if-statement, but it doesn't. The length is the correct length (I've checked). The type is the correct type (I've checked). But when I try to echo it, nothing happens. The script still runs, but it doesn't write anything.
I've played around with it quite a lot and seem to have tried everything - but I can't figure it out.
Warning
This function may return Boolean FALSE, but may also return a non-Boolean value which evaluates to FALSE. Please read the section on Booleans for more information. Use the === operator for testing the return value of this function.
You should be using if(strpos($line,'...') !== false) {
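To make the warning concrete, here is a small sketch with a made-up sample line in which the tag starts at offset 0. It also moves the start offset past the opening tag, so that the tag itself is not part of the extracted text. Raw markup such as <name><![CDATA[ in the output may be what makes the name seem to vanish in a browser, though that is a guess and is not confirmed in the thread.

```php
<?php
$line = '<email>someone@example.com</email>';

$start = strpos($line, '<email>');
var_dump($start);            // int(0): found at the very start of the line
var_dump((bool) $start);     // bool(false): a plain if (strpos(...)) would skip this line
var_dump($start !== false);  // bool(true): the strict comparison gets it right

if ($start !== false) {
    $start += strlen('<email>');               // skip past the opening tag itself
    $end = strpos($line, '</email>', $start);
    echo substr($line, $start, $end - $start), "\n"; // someone@example.com
}
```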
That aside, your file seems to be XML, so you should use an XML parser lest you fall under the pony he comes.
DOMDocument is a good one. You could do something like this:
$dom = new DOMDocument();
$dom->load("emails.txt");
$subs = $dom->getElementsByTagName('subscriber');
$count = $subs->length;
for ($i = 0; $i < $count; $i++) {
$sub = $subs->item($i);
echo $sub->getElementsByTagName('email')->item(0)->nodeValue;
echo " ";
echo $sub->getElementsByTagName('name')->item(0)->nodeValue;
echo "\n";
}
This will output the names and emails in the format you described.
So there's a few things wrong with this, including the strpos command which will actually return 0 if it finds the tag at the beginning of the line, which doesn't appear to be what you intend.
Also, if the XML is not formatted exactly as you have, with each opening and closing tag on the one line, then your logic will fail as well.
It's not a good idea to re-invent XML processing for this reason...
Here, as others have proposed, is a better solution to the problem*.
$xml = simplexml_load_file('emails.txt');
foreach ($xml->subscriber as $sub) {
    // Note that SimpleXML is aware of CDATA, and only outputs the text
    echo '[' . $sub->email . '] [' . $sub->name . ']' . "\n";
}
*This assumes that your XML is valid, i.e. the "subscriber" blocks are contained in a single parent at the top level. You can of course use the SimpleXML documentation to adjust for your use case.
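For a version that can be run without the original file, the sketch below parses an inline string instead of emails.txt. The sample subscribers and the <subscribers> wrapper element are invented for illustration; only the inner structure follows the question.

```php
<?php
// Sample data invented for illustration, following the structure shown in the question
$xmlString = <<<'XML'
<subscribers>
  <subscriber>
    <subscriber_id>615</subscriber_id>
    <name><![CDATA[Jane Doe]]></name>
    <email>jane@example.com</email>
  </subscriber>
  <subscriber>
    <subscriber_id>616</subscriber_id>
    <name><![CDATA[John Roe]]></name>
    <email>john@example.com</email>
  </subscriber>
</subscribers>
XML;

$xml = simplexml_load_string($xmlString);
if ($xml === false) {
    exit("Could not parse the XML\n");
}

$lines = array();
foreach ($xml->subscriber as $sub) {
    // Casting to string returns the element's text, CDATA content included
    $lines[] = '[' . trim((string) $sub->email) . '] [' . trim((string) $sub->name) . ']';
}
echo implode("\n", $lines), "\n";
```

This prints one "[email] [name]" line per subscriber, which is the layout the question asks for.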

Characters intermittently disappearing from PHP-built XML

The XML output from this loop was failing to validate but the validator was giving me different errors each time. Each time it had to do with the opening < of an element closure being missing. A different one each time...
Every time I refresh and re-validate the output there is at least one of these and it has never yet been in the same member record.
Initially I was adding CDATA tags everywhere, which is why you will see many of them wrapping things where they should not be needed.
The XML is built by this loop:
if ($members) {
$xml = '<api><response status="ok"><users>';
foreach ($members as $m) {
$join_date = date("Y-m-d H:i:s", $m->join_date);
list($md) = $mdObj->retrieve("member_id = '$m->member_id'");
$join_date = ($m->join_date > 0) ? date("Y-m-d H:i:s", $m->join_date) : '0000-00-00 00:00:00';
$address = preg_replace('/\R/', '', $md->m_field_id_3);
$xml .= "<user id=\"$m->member_id\"><admin>0</admin><name><![CDATA[$m->username]]></name><company>$md->m_field_id_9</company><company_id>$md->m_field_id_28</company_id><address><![CDATA[$address]]></address><city>$md->m_field_id_5</city><region>$md->m_field_id_6</region><postal_code>$md->m_field_id_7</postal_code><email><![CDATA[$m->email]]></email><phone>$md->m_field_id_10</phone><first>$md->m_field_id_1</first><last>$md->m_field_id_1 $md->m_field_id_2</last><url></url><description><![CDATA[]]></description><status>active</status><date>$join_date</date><modified>0000-00-00 00:00:00</modified></user>";
}
$xml .= '</users></response></api>';
return $xml;
}
Has anyone seen this before? Have any advice?
Here's a little PHP info:
PHP Version 5.2.17
Linux foo.foo.com 2.6.18-274.17.1.el5 #1 SMP Wed Jan 4 22:45:44 EST 2012 x86_64
Build Date Feb 8 2012 14:19:50
I suspect the database entries you're including in your XML might contain unescaped characters which have special meaning, e.g. &, <, >, " and ', which need to be encoded.
I would also break up that long string into
$xml .= "<user id=\"" . $m->member_id . "\"><admin>0</admin><name><![CDATA[";
$xml .= $m->username . "]]></name><company>" . $md->m_field_id_9 . "</company>";
$xml .= "<company_id>" . $md->m_field_id_28 . "</company_id><address><![CDATA[";
$xml .= $address . "]]></address><city>" . $md->m_field_id_5 . "</city><region>";
$xml .= $md->m_field_id_6 . "</region><postal_code>" . $md->m_field_id_7;
$xml .= "</postal_code><email><![CDATA[" . $m->email . "]]></email><phone>";
$xml .= $md->m_field_id_10 . "</phone><first>" . $md->m_field_id_1 . "</first>";
$xml .= "<last>" . $md->m_field_id_1 . " " . $md->m_field_id_2 . "</last><url></url>";
$xml .= "<description><![CDATA[]]></description><status>active</status><date>";
$xml .= $join_date . "</date><modified>0000-00-00 00:00:00</modified></user>";
and then use str_replace() to specifically encode the above-mentioned characters.
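As an alternative to hand-written str_replace() calls, htmlspecialchars() can do the encoding in one step. This is a sketch under assumptions: the helper name xmlEscape and the sample value are hypothetical, the ENT_XML1 flag needs PHP 5.4 or later (newer than the 5.2.17 mentioned in the question), and the scalar type hints need PHP 7.

```php
<?php
// Hypothetical helper: escapes a value for use as XML element text or attribute content.
function xmlEscape(string $value): string
{
    // Encodes &, <, > and both quote characters; ENT_XML1 selects XML-style entities
    return htmlspecialchars($value, ENT_XML1 | ENT_QUOTES, 'UTF-8');
}

$company = 'Smith & Sons <Ltd>';
echo '<company>' . xmlEscape($company) . '</company>', "\n";
// <company>Smith &amp; Sons &lt;Ltd&gt;</company>
```

Values already wrapped in CDATA sections do not need this treatment, as long as they cannot contain the sequence ]]>.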
What could be happening is that your data contains invisible characters, most notably DEL characters. I suppose that would cause this precise behaviour.
To check, loop over each character in the string and print the character code to check if a string contains any hidden whitespace.
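The check described above might look like the following sketch. The helper name findControlChars and the sample string are hypothetical. The function works on bytes, so it reports ASCII control codes and DEL but does not try to interpret multi-byte UTF-8 characters.

```php
<?php
// Hypothetical helper: maps byte offset => character code for every control character in $s.
function findControlChars(string $s): array
{
    $found = array();
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $code = ord($s[$i]);
        // 0-31 are ASCII control codes, 127 is DEL; tab, LF and CR are skipped as ordinary whitespace
        if (($code < 32 && !in_array($code, array(9, 10, 13), true)) || $code === 127) {
            $found[$i] = $code;
        }
    }
    return $found;
}

// The sample string hides a DEL character (0x7F) at byte offset 9
print_r(findControlChars("<city>Bos\x7Fton</city>"));
```

An empty result for every field would point away from hidden characters in the data.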
This appears to be a bug in Chrome's view source routine on a large XML file. XML obtained from the same source via IE and FireFox was valid across repeated tests.
Additionally Chrome's normal view did not display these aberrations and did not report errors in the XML in normal view.

Switch gettext translated language with original language

I started my PHP application with all text in German, then used gettext to extract all strings and translate them to English.
So, now I have a .po file with all msgids in German and msgstrs in English. I want to switch them, so that my source code contains the English as msgids for two main reasons:
More translators will know English, so it is only appropriate to serve them up a file with msgids in English. I could always switch the file before I give it out and after I receive it, but naaah.
It would help me to write English object & function names and comments if the content text was also English. I'd like to do that, so the project is more open to other Open Source collaborators (more likely to know English than German).
I could do this manually and this is the sort of task where I anticipate it will take me more time to write an automated routine for it (because I'm very bad with shell scripts) than do it by hand. But I also anticipate despising every minute of manual computer labour (feels like an oxymoron, right?) like I always do.
Has someone done this before? I figured this would be a common problem, but couldn't find anything. Many thanks ahead.
Sample Problem:
<title><?=_('Routinen')?></title>
#: /users/ruben/sites/v/routinen.php:43
msgid "Routinen"
msgstr "Routines"
I thought I'd narrow the problem down. The switch in the .po-file is no issue of course, it is as simple as
preg_replace('/msgid "(.+)"\nmsgstr "(.+)"/', "msgid \"\$2\"\nmsgstr \"\$1\"", $str);
The problem for me is the routine that searches my project folder files for _('$msgid') and substitutes _('msgstr') while parsing the .po-file (which is probably not even the most elegant way, after all the .po-file contains comments which contain all file paths where the msgid occurs).
After fooling around with akirk's answer a little, I ran into some more problems.
Because I have a mixture of _('xxx') and _("xxx") calls, I have to be careful about (un)escaping.
Double quotes " in msgids and msgstrs have to be unescaped, but the slashes can't be stripped, because it may be that the double quote was also escaped in PHP
Single quotes have to be escaped when they're replaced into PHP, but then they also have to be changed in the .po-file. Luckily for me, single quotes only appear in English text.
msgids and msgstrs can have multiple lines, then they look like this
msgid ""
"line 1\n"
"line 2\n"
msgstr ""
"line 1\n"
"line 2\n"
plural forms are of course skipped at the moment, but in my case that's not an issue
poedit wants to remove strings as obsolete that seem successfully switched and I have no idea why this happens in (many) cases.
I'll have to stop working on this for tonight. Still it seems using the parser instead of RegExps wouldn't be overkill.
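For the .po file itself, the multi-line case in the list above can still be handled with one regular expression if each quoted line is treated as a unit. The following is only a sketch under assumptions: the function name swapPoEntries is hypothetical, the file must use Unix line endings, and plural entries (msgid_plural) are simply left untouched. It does not address the quoting problems on the PHP side.

```php
<?php
// Hypothetical helper: swaps msgid and msgstr for every singular entry in a .po file.
function swapPoEntries(string $po): string
{
    $po = rtrim($po, "\n") . "\n"; // make sure the last entry ends with a newline
    // Each group is the first quoted string plus any continuation lines
    $pattern = '/^msgid ((?:".*"\n)+)msgstr ((?:".*"\n)+)/m';
    return preg_replace_callback($pattern, function (array $m) {
        // Leave the header (empty msgid) and untranslated entries (empty msgstr) alone
        if (trim($m[1]) === '""' || trim($m[2]) === '""') {
            return $m[0];
        }
        return 'msgid ' . $m[2] . 'msgstr ' . $m[1];
    }, $po);
}

$po = <<<'PO'
msgid "Routinen"
msgstr "Routines"

msgid ""
"Zeile 1\n"
"Zeile 2\n"
msgstr ""
"line 1\n"
"line 2\n"
PO;

echo swapPoEntries($po);
```

Comments and source references above each entry are not matched by the pattern, so they pass through unchanged.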
I built on akirk's answer and wanted to preserve what I came up with as an answer here, in case somebody has the same problem.
This is not recursive, but that could easily change of course. Feel free to comment with improvements, I will be watching and editing this post.
$po = file_get_contents("locale/en_GB/LC_MESSAGES/messages.po");
$translations = array(); // german => english
$rawmsgids = array(); // find later
$msgidhits = array(); // record success
$msgstrs = array(); // find later
preg_match_all('/msgid "(.+)"\nmsgstr "(.+)"/', $po, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$german = str_replace('\"','"',$match[1]); // unescape double quotes (could misfire if you escaped double quotes in PHP _("bla") but in my case that was one case versus many)
$english = str_replace('\"','"',$match[2]);
$en_sq_e = str_replace("'","\'",$english); // escape single quotes
$translations['_(\''. $german . '\''] = '_(\'' . $en_sq_e . '\'';
$rawmsgids['_(\''. $german . '\''] = $match[1]; // find raw msgid with searchstr as key
$translations['_("'. $match[1] . '"'] = '_("' . $match[2] . '"';
$rawmsgids['_("'. $match[1] . '"'] = $match[1];
$translations['__(\''. $german . '\''] = '__(\'' . $en_sq_e . '\'';
$rawmsgids['__(\''. $german . '\''] = $match[1];
$translations['__("'. $match[1] . '"'] = '__("' . $match[2] . '"';
$rawmsgids['__("'. $match[1] . '"'] = $match[1];
$msgstrs[$match[1]] = $match[2]; // msgid => msgstr
}
foreach (glob("*.php") as $file) {
$code = file_get_contents($file);
$filehits = 0; // how many replacements per file
foreach($translations AS $msgid => $msgstr) {
$hits = 0;
$code = str_replace($msgid,$msgstr,$code,$hits);
$filehits += $hits;
if($hits!=0) $msgidhits[$rawmsgids[$msgid]] = 1; // this serves to record if the msgid was found in at least one incarnation
elseif(!isset($msgidhits[$rawmsgids[$msgid]])) $msgidhits[$rawmsgids[$msgid]] = 0;
}
// file_put_contents($file, $code); // be careful to test this first before doing the actual replace (and do use a version control system!)
echo "$file : $filehits <br>";
echo $code;
}
/* debug */
$found = array_keys($msgidhits, 1, true);
foreach($found AS $mid) echo $mid . " => " . $msgstrs[$mid] . "\n\n";
echo "Not Found: <br>";
$notfound = array_keys($msgidhits, 0, true);
foreach($notfound AS $mid) echo $mid . " => " . $msgstrs[$mid] . "\n\n";
/*
following steps are still needed:
* convert plurals (ngettext)
* convert multi-line msgids and msgstrs (format mentioned in question)
* resolve uniqueness conflict (msgids are unique, msgstrs are not), so you may have duplicate msgids (poedit finds these)
*/
See http://code.activestate.com/recipes/475109-regular-expression-for-python-string-literals/ for a good python-based regular expression for finding string literals, taking escapes into account. Although it's python, this might be quite good for multiline strings and other corner cases.
See http://docs.translatehouse.org/projects/translate-toolkit/en/latest/commands/poswap.html for a ready, out-of-the-box base language swapper for .po files.
For instance, the following command line will convert a German-based Spanish translation to an English-based Spanish translation. You just have to ensure that your new base language (English) is 100% translated before starting the conversion:
poswap -i de-en.po -t de-es.po -o en-es.po
And finally, to swap the English .po file to a German .po file, use swappo:
http://manpages.ubuntu.com/manpages/hardy/man1/swappo.1.html
After swapping files, some manual polishing of resultant files might be required. For instance headers might be broken and some duplicate texts might occur.
So if I understand you correctly you'd like to replace all German gettext calls with English ones. To replace the contents in the directory, something like this could work.
$po = file_get_contents("translation.pot");
$translations = array(); // german => english
preg_match_all('/msgid "(.+)"\nmsgstr "(.+)"/', $po, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
$translations['_("'. $match[1] . '")'] = '_("' . $match[2] . '")';
$translations['_(\''. $match[1] . '\')'] = '_(\'' . $match[2] . '\')';
}
foreach (glob("*.php") as $file) {
$code = file_get_contents($file);
$code = str_replace(array_keys($translations), array_values($translations), $code);
//file_put_contents($file, $code);
echo $code; // be careful to test this first before doing the actual replace (and do use a version control system!)
}
