I have an array of symbols (not only characters, but also syllables, such as 'p', 'pa', etc.) and I'm trying to come up with a good algorithm to identify words that can be created by concatenating those symbols.
e.g. given the array of symbols ('p', 'pa', 'aw'), the string 'paw' would be a positive match.
This is my current implementation (too slow):
function isValidWord($word,&$symbols){
$nodes = array($word);
while (count($nodes)>0){
$node = array_shift($nodes);
$nodeExpansions = array();
$nodeLength = strlen($node);
if (in_array($node,$symbols)) { return true; }
for ($len=$nodeLength-1;$len>0;$len--){
if (in_array(substr($node, 0, $len), $symbols)){
$nodeExpansions[] = substr($node, $len-$nodeLength);
}
}
$nodes = array_merge($nodeExpansions,$nodes);
}
return false;
}
It doesn't seem like a difficult problem, it's just a depth-first search implementation on an acyclic? tree, but I'm struggling to come up with an implementation which is both memory and CPU efficient. Where can I find resources to learn about this kind of problem?
Also, here is a link to a script for testing it and comparing it to the solutions proposed in the comments below: http://ideone.com/zQ9Cie
And here is an album showing captures of really odd results: how can my current iterative method be 12x faster than the recursive one (proposed by @Waleed Khan) when I run them on my dev server, but 2x slower when I run them on my production server, considering both servers have almost identical configurations? (One is an EC2 micro instance and the other a VirtualBox container, but they both have the same OS, config, updates, PHP version and config, number of cores and available RAM.)
Not sure whether it's very efficient, but I guess I would create a loop with an inner loop which goes through the given array containing the symbols.
<?php
$aSymbols = array('p', 'pa', 'aw');
$aDatabase = array('paw');
$aMatches = array();
for ($iCounter = 0; $iCounter < count($aSymbols); $iCounter++)
{
for ($yCounter = 0; $yCounter < count($aSymbols); $yCounter++)
{
$sString = $aSymbols[$iCounter].$aSymbols[$yCounter];
if (in_array($sString, $aDatabase))
{
$aMatches[] = $sString;
}
}
}
?>
The if query can be replaced by a regex query, too.
As @Waleed Khan suggested, I've tried improving my algorithm using a Trie structure for the dictionary instead of a plain array to speed up the search for matches.
function generateTrie(&$dictionary){
if (is_string($dictionary)){
$dictionary = array($dictionary);
}
if (!is_array($dictionary)){
throw new Exception(
"Invalid input argument for \$dictionary (must be array)",
500
);
}
$trie = array();
$dictionaryCount = count($dictionary);
for ($i=0;$i<$dictionaryCount;$i++){
$word = $dictionary[$i];
if (!is_string($word)){
throw new Exception(
"Invalid input argument for \$word (must be string)",
500
);
}
$wordLength = strlen($word);
$subTrie = &$trie;
for ($j=1;$j<$wordLength;$j++){
if (array_key_exists($subWord = substr($word,0,$j),$subTrie)){
$subTrie = &$subTrie[$subWord];
}
}
if (array_key_exists($word,$subTrie)){
continue;
}
$keys = array_keys($subTrie);
if (!array_key_exists($word,$subTrie)) {
$subTrie[$word] = array();
}
foreach ($keys as $testWordForPrefix){
if (substr($testWordForPrefix,0,$wordLength) === $word){
$subTrie[$word][$testWordForPrefix] = &$subTrie[$testWordForPrefix];
unset($subTrie[$testWordForPrefix]);
}
}
}
return $trie;
}
/**
* Checks if word is on dictionary trie
*/
function inTrie($word, &$trie){
$wordLen = strlen($word);
$node = &$trie;
$found = false;
for ($i=1;$i<=$wordLen;$i++){
$index = substr($word,0,$i);
if (isset($node[$index])){
$node = &$node[$index];
$found = true;
} else {
$found = false;
}
}
return $found;
}
/**
* Checks if a $word is a concatenation of valid $symbols using inTrie()
*
* E.g. `$word = 'paw'`, `$symbols = array('p', 'pa', 'aw')` would return
* true, because `$word = 'p'.'aw'`
*
*/
function isValidTrieWord($word,&$trie){
$nodes = array($word);
while (count($nodes)>0){
$node = array_shift($nodes);
if (inTrie($node,$trie)) { return true; }
$nodeExpansions = array();
$nodeLength = strlen($node);
for ($len=$nodeLength-1;$len>0;$len--){
if (inTrie(substr($node, 0, $len), $trie)){
$nodeExpansions[] = substr($node, $len-$nodeLength);
}
}
$nodes = array_merge($nodeExpansions,$nodes);
}
return false;
}
It doesn't make much of a difference for small dictionary sizes (where preg_match is still the fastest implementation by several orders of magnitude). But for medium-sized dictionaries (~10000 symbols) where longer symbols are usually a combination of shorter ones, preg breaks and the other two implementations can take close to 25 seconds per word of 2-6 symbols, while the Trie search takes only about 1 second. That's close enough for my needs (checking whether a given password is a combination of symbols from a given dictionary).
(See the whole script on http://ideone.com/zQ9Cie)
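The searches above can revisit the same suffix of the word many times. A different technique that avoids this is to mark which positions in the word are reachable and look each candidate symbol up in a hash set. The following is only a sketch of that idea, not code from the linked script; the function name canSegment is made up for illustration.

```php
<?php
/**
 * Hypothetical helper (not from the original post): returns true when $word
 * can be written as a concatenation of entries of $symbols.
 */
function canSegment(string $word, array $symbols): bool
{
    if ($word === '' || count($symbols) === 0) {
        return false;
    }

    // Array keys give constant-time membership tests, unlike in_array().
    $set = array_fill_keys($symbols, true);
    $maxLen = max(array_map('strlen', $symbols));
    $n = strlen($word);

    // $reachable[$i] is true when the first $i bytes of $word can be segmented.
    $reachable = array_fill(0, $n + 1, false);
    $reachable[0] = true;

    for ($i = 0; $i < $n; $i++) {
        if (!$reachable[$i]) {
            continue; // no segmentation ends here, nothing to extend
        }
        // No symbol is longer than $maxLen, so longer slices need not be tried.
        $limit = min($maxLen, $n - $i);
        for ($len = 1; $len <= $limit; $len++) {
            if (isset($set[substr($word, $i, $len)])) {
                $reachable[$i + $len] = true;
            }
        }
        if ($reachable[$n]) {
            return true; // the end of the word has been reached
        }
    }

    return $reachable[$n];
}

var_dump(canSegment('paw', array('p', 'pa', 'aw')));  // 'p' . 'aw'
var_dump(canSegment('pawn', array('p', 'pa', 'aw'))); // no symbol covers the 'n'
```

Each position is expanded at most once, so the work is bounded by the word length times the longest symbol length, regardless of how many ways the prefixes overlap.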
Results on my local dev VM:
Results on my AWS EC2 test server:
I've written the following block of code to find if a word exists in a grid of nodes.
function findWord() {
$notInLoc = [];
if ($list = $this->findAll("a")) {
foreach($list as $node) {
$notInLoc[] = $node->loc;
if ($list2 = $this->findAllConnectedTo($node, "r", $notInLoc)) {
foreach($list2 as $node2) {
$notInLoc[] = $node2->loc;
if ($list3 = $this->findAllConnectedTo($node2, "t", $notInLoc)) {
foreach($list3 as $node3) {
return true;
}
}
}
}
}
}
return false;
}
This "works" and passes all my 3-letter word test cases because I've hard-coded the characters I'm looking for and I know how long the word is. But what I need to do is pass in any word, regardless of length and letters, and return true if I found the word against all these restrictions.
To summarize the algorithm here:
1) I find all the nodes that contain the first character "a" and get a list of those nodes. That's my starting point.
2) For each "a" I'm looking for all the "r"s that are connected to it but not in a location I'm already using. (Each node has a location key, and that key is stored in the notInLoc array while looking through it. I realize that this may break, though, because notInLoc is only reset the first time I enter the function, so every time I go through the foreach it keeps pushing the same location in.)
3) Once I've found all the "r"s connected to the "a" I'm currently on, I check to see if there are any "t"s connected to the "r"s. If there is at least 1 "t" connected, then I know the word has been found.
I'm having trouble refactoring this to make it dynamic. I'll give you the idea I was working with, but it is broken.
function inner($word, $list, $i = 0, $notInLoc = []) {
$i++;
foreach($list as $node) {
$notInLoc[] = $node->loc;
if ($list2 = $this->findAllConnectedTo($node, $word[$i], $notInLoc)) {
if ($i == (strlen($word) - 1)) {
return true;
} else {
$this->inner($word, $list2, $i, $notInLoc);
}
}
}
return false;
}
function findWord2($word) {
if ($list = $this->findAll($word[0])) {
return $this->inner($word, $list);
}
return false;
}
I understand that there are other ways to solve problems like this, but I need it to work using only the functions findAll which returns all nodes with a specific value, or false and findAllConnectedTo which returns all nodes with a specific value connected to a node that are not contained on the "Do Not Use" notInLoc list.
You need to pass the result up through all the nested calls to the top. A found word does eventually return true, but that value is discarded one level up, where the loop simply continues and then returns false. Try this:
if ($list2 = $this->findAllConnectedTo($node, $word[$i], $notInLoc)) {
if ($i == strlen($word) - 1 || $this->inner($word, $list2, $i, $notInLoc)) {
return true;
}
}
Next I'd take care of $word being needlessly passed around. It stays the same in every call; only the index changes.
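Putting the pieces together, a complete recursive search could look like the sketch below. The Grid class, its node layout (loc, value, neighbors) and the bodies of findAll and findAllConnectedTo are stand-ins invented so the example runs on its own; only their contract (a list of nodes or false) comes from the question. Each branch works on its own copy of the used locations, which also avoids the notInLoc problem mentioned in the question.

```php
<?php
// Hypothetical stand-in for the poster's class.
class Grid
{
    private $nodes; // objects with ->loc, ->value and ->neighbors (list of locs)

    public function __construct(array $nodes)
    {
        $this->nodes = $nodes;
    }

    // All nodes holding $value, or false when there are none.
    public function findAll($value)
    {
        $found = array_values(array_filter($this->nodes, fn($n) => $n->value === $value));
        return $found ?: false;
    }

    // All nodes holding $value that touch $node and are not on the $notInLoc list.
    public function findAllConnectedTo($node, $value, array $notInLoc)
    {
        $found = array_values(array_filter(
            $this->nodes,
            fn($n) => $n->value === $value
                && in_array($n->loc, $node->neighbors, true)
                && !in_array($n->loc, $notInLoc, true)
        ));
        return $found ?: false;
    }

    // $list holds the candidates already matched against $word[$i].
    private function inner($word, array $list, $i, array $used)
    {
        if ($i === strlen($word) - 1) {
            return count($list) > 0; // the last letter has a candidate
        }
        foreach ($list as $node) {
            $path = $used;         // per-branch copy, so siblings do not
            $path[] = $node->loc;  // see each other's locations
            $next = $this->findAllConnectedTo($node, $word[$i + 1], $path);
            if ($next && $this->inner($word, $next, $i + 1, $path)) {
                return true;       // propagate success to the caller
            }
        }
        return false;
    }

    public function findWord($word)
    {
        if ($word === '') {
            return false;
        }
        $list = $this->findAll($word[0]);
        return $list ? $this->inner($word, $list, 0, array()) : false;
    }
}

// A row of three cells: a - r - t
$grid = new Grid(array(
    (object) array('loc' => 0, 'value' => 'a', 'neighbors' => array(1)),
    (object) array('loc' => 1, 'value' => 'r', 'neighbors' => array(0, 2)),
    (object) array('loc' => 2, 'value' => 't', 'neighbors' => array(1)),
));
var_dump($grid->findWord('art'));
var_dump($grid->findWord('ara'));
```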
I know many of the users have asked this type of question but I am stuck in an odd situation.
I am trying to implement logic where multiple occurrences of a specific pattern, each with a unique identifier, are replaced with database content if a match is found.
My regex pattern is
'/{code#(\d+)}/'
where the '\d+' captures the unique identifier used in the above-mentioned pattern.
My Php code is:
<?php
$text="The old version is {code#1}, The new version is {code#2}, The stable version is {code#3}";
$newsld=preg_match_all('/{code#(\d+)}/',$text,$arr);
$data = array("first Replace","Second Replace", "Third Replace");
echo $data=str_replace($arr[0], $data, $text);
?>
This works, but it is not at all dynamic: the numbers after the # in the pattern are IDs (1, 2 and 3) and their respective data is stored in the database.
How could I fetch the content for each ID mentioned in the pattern from the DB and replace the entire pattern with that content?
I really can't see a way to do it. Thank you in advance.
It's not that difficult if you think about it. I'll be using PDO with prepared statements. So let's set it up:
$db = new PDO( // New PDO object
'mysql:host=localhost;dbname=projectn;charset=utf8', // Important: utf8 all the way through
'username',
'password',
array(
PDO::ATTR_EMULATE_PREPARES => false, // Turn off prepare emulation
PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION
)
);
This is the most basic setup for our DB. Check out this thread to learn more about emulated prepared statements and this external link to get started with PDO.
We got our input from somewhere, for the sake of simplicity we'll define it:
$text = 'The old version is {code#1}, The new version is {code#2}, The stable version {code#3}';
Now there are several ways to achieve our goal. I'll show you two:
1. Using preg_replace_callback():
$output = preg_replace_callback('/{code#(\d+)}/', function($m) use($db) {
$stmt = $db->prepare('SELECT `content` FROM `footable` WHERE `id`=?');
$stmt->execute(array($m[1]));
$row = $stmt->fetch(PDO::FETCH_ASSOC);
if($row === false){
return $m[0]; // Default value is the code we captured if there's no match in the DB
}else{
return $row['content'];
}
}, $text);
echo $output;
Note how we use use() to get $db inside the scope of the anonymous function. global is evil.
Now the downside is that this code is going to query the database for every single code it encounters to replace it. The advantage would be setting a default value in case there's no match in the database. If you don't have that many codes to replace, I would go for this solution.
2. Using preg_match_all():
if(preg_match_all('/{code#(\d+)}/', $text, $m)){
$codes = $m[1]; // For sanity/tracking purposes
$inQuery = implode(',', array_fill(0, count($codes), '?')); // Nice common trick: https://stackoverflow.com/a/10722827
$stmt = $db->prepare('SELECT `content` FROM `footable` WHERE `id` IN(' . $inQuery . ')');
$stmt->execute($codes);
$rows = $stmt->fetchAll(PDO::FETCH_ASSOC);
$contents = array_map(function($v){
return $v['content'];
}, $rows); // Get the content in a nice (numbered) array
$patterns = array_fill(0, count($codes), '/{code#(\d+)}/'); // Create an array of the same pattern N times (N = the amount of codes we have)
$text = preg_replace($patterns, $contents, $text, 1); // Do not forget to limit a replace to 1 (for each code)
echo $text;
}else{
echo 'no match';
}
The problem with the code above is that it replaces the code with an empty value if there's no match in the database. This could also shift up the values and thus could result in a shifted replacement. Example (code#2 doesn't exist in db):
Input: foo {code#1}, bar {code#2}, baz {code#3}
Output: foo AAA, bar CCC, baz
Expected output: foo AAA, bar , baz CCC
The preg_replace_callback() version works as expected. Maybe you could think of a hybrid solution. I'll leave that as homework for you :)
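One possible hybrid, sketched here under some assumptions: collect the ids with preg_match_all(), fetch all rows in a single query keyed by id, then let preg_replace_callback() look each code up in that map and fall back to the original code when there is no row. The helper name replaceCodes and the $fetchByIds callable are invented for this sketch; an array-backed fetcher stands in for the database so it runs on its own.

```php
<?php
/**
 * Hypothetical helper: $fetchByIds receives a list of ids and must return
 * an array mapping id => content for the ids it found.
 */
function replaceCodes(string $text, callable $fetchByIds): string
{
    $regex = '/{code#(\d+)}/';
    if (!preg_match_all($regex, $text, $m)) {
        return $text; // nothing to replace
    }

    // One lookup for all distinct ids.
    $ids = array_values(array_unique($m[1]));
    $contentById = $fetchByIds($ids);

    return preg_replace_callback($regex, function ($match) use ($contentById) {
        // Keep the original {code#N} when the id has no content.
        return $contentById[$match[1]] ?? $match[0];
    }, $text);
}

// With PDO the fetcher could look like this (table and columns as in the answer):
// $fetch = function (array $ids) use ($db) {
//     $in = implode(',', array_fill(0, count($ids), '?'));
//     $stmt = $db->prepare('SELECT `id`, `content` FROM `footable` WHERE `id` IN(' . $in . ')');
//     $stmt->execute($ids);
//     return $stmt->fetchAll(PDO::FETCH_KEY_PAIR); // id => content
// };

// Array-backed stand-in so the sketch runs without a database (code#2 is missing).
$fake = array(1 => 'AAA', 3 => 'CCC');
$fetch = function (array $ids) use ($fake) {
    return array_intersect_key($fake, array_flip($ids));
};

$output = replaceCodes('foo {code#1}, bar {code#2}, baz {code#3}', $fetch);
echo $output; // foo AAA, bar {code#2}, baz CCC
```

Because the replacement is looked up by id rather than by position, a missing row cannot shift the other replacements.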
Here is another variant on how to solve the problem: As access to the database is most expensive, I would choose a design that allows you to query the database once for all codes used.
The text you've got could be represented with various segments, that is any combination of <TEXT> and <CODE> tokens:
The old version is {code#1}, The new version is {code#2}, ...
<TEXT_____________><CODE__><TEXT_______________><CODE__><TEXT_ ...
Tokenizing your string buffer into such a sequence allows you to obtain the codes used in the document and index which segments a code relates to.
You can then fetch the replacements for each code and then replace all segments of that code with the replacement.
Let's set this up and define the input text, your pattern and the token types:
$input = <<<BUFFER
The old version is {code#1}, The new version is {code#2}, The stable version is {code#3}
BUFFER;
$regex = '/{code#(\d+)}/';
const TOKEN_TEXT = 1;
const TOKEN_CODE = 2;
Next comes the part that splits the input into tokens. I use two arrays for that: one stores the type of each token ($tokens; text or code) and the other contains the string data ($segments). The input is copied into a buffer and the buffer is consumed until it is empty:
$tokens = [];
$segments = [];
$buffer = $input;
while (preg_match($regex, $buffer, $matches, PREG_OFFSET_CAPTURE, 0)) {
if ($matches[0][1]) {
$tokens[] = TOKEN_TEXT;
$segments[] = substr($buffer, 0, $matches[0][1]);
}
$tokens[] = TOKEN_CODE;
$segments[] = $matches[0][0];
$buffer = substr($buffer, $matches[0][1] + strlen($matches[0][0]));
}
if (strlen($buffer)) {
$tokens[] = TOKEN_TEXT;
$segments[] = $buffer;
$buffer = "";
}
Now all the input has been processed and is turned into tokens and segments.
Now this "token-stream" can be used to obtain all codes used. Additionally all code-tokens are indexed so that with the number of the code it's possible to say which segments need to be replaced. The indexing is done in the $patterns array:
$patterns = [];
foreach ($tokens as $index => $token) {
if ($token !== TOKEN_CODE) {
continue;
}
preg_match($regex, $segments[$index], $matches);
$code = (int)$matches[1];
$patterns[$code][] = $index;
}
Now that all codes have been obtained from the string, a database query can be formulated to obtain the replacement values. I mock that functionality by creating a result array of rows, which should do for the example. Technically you'd fire a SELECT ... FROM ... WHERE code IN (12, 44, ...) query that fetches all results at once. I fake this by calculating a result:
$result = [];
foreach (array_keys($patterns) as $code) {
$result[] = [
'id' => $code,
'text' => sprintf('v%d.%d.%d%s', $code * 2 % 5 + $code % 2, 7 - 2 * $code % 5, 13 + $code, $code === 3 ? '' : '-beta'),
];
}
Then it's only left to process the database result and replace those segments the result has codes for:
foreach ($result as $row) {
foreach ($patterns[$row['id']] as $index) {
$segments[$index] = $row['text'];
}
}
And then do the output:
echo implode("", $segments);
And that's it then. The output for this example:
The old version is v3.5.14-beta, The new version is v4.3.15-beta, The stable version is v2.6.16
The whole example in full:
<?php
/**
* Simultaneous Preg_replace operation in php and regex
*
* @link http://stackoverflow.com/a/29474371/367456
*/
$input = <<<BUFFER
The old version is {code#1}, The new version is {code#2}, The stable version is {code#3}
BUFFER;
$regex = '/{code#(\d+)}/';
const TOKEN_TEXT = 1;
const TOKEN_CODE = 2;
// convert the input into a stream of tokens - normal text or fields for replacement
$tokens = [];
$segments = [];
$buffer = $input;
while (preg_match($regex, $buffer, $matches, PREG_OFFSET_CAPTURE, 0)) {
if ($matches[0][1]) {
$tokens[] = TOKEN_TEXT;
$segments[] = substr($buffer, 0, $matches[0][1]);
}
$tokens[] = TOKEN_CODE;
$segments[] = $matches[0][0];
$buffer = substr($buffer, $matches[0][1] + strlen($matches[0][0]));
}
if (strlen($buffer)) {
$tokens[] = TOKEN_TEXT;
$segments[] = $buffer;
$buffer = "";
}
// index which tokens represent which codes
$patterns = [];
foreach ($tokens as $index => $token) {
if ($token !== TOKEN_CODE) {
continue;
}
preg_match($regex, $segments[$index], $matches);
$code = (int)$matches[1];
$patterns[$code][] = $index;
}
// lookup all codes in a database at once (simulated)
// SELECT id, text FROM replacements_table WHERE id IN (array_keys($patterns))
$result = [];
foreach (array_keys($patterns) as $code) {
$result[] = [
'id' => $code,
'text' => sprintf('v%d.%d.%d%s', $code * 2 % 5 + $code % 2, 7 - 2 * $code % 5, 13 + $code, $code === 3 ? '' : '-beta'),
];
}
// process the database result
foreach ($result as $row) {
foreach ($patterns[$row['id']] as $index) {
$segments[$index] = $row['text'];
}
}
// output the replacement result
echo implode("", $segments);
Given following (infix) expression:
(country = be or country = nl) and
(language = en or language = nl) and
message contains twitter
I'd like to create the following 4 infix notations:
message contains twitter and country = be and language = en
message contains twitter and country = be and language = nl
message contains twitter and country = nl and language = en
message contains twitter and country = nl and language = nl
So, basically, I would like to get rid of all OR's.
I already have a postfix notation for the first expression, so I'm currently trying to process that to get the desired notation. This particular situation, however, causes trouble.
(For illustration purposes, the postfix notation for this query would be:)
country be = country nl = or language en = language nl = or and message twitter contains and
Does anyone know of an algorithm to achieve this?
Break the problem into two steps: postfix to multiple postfix, postfix to infix. Each step is performed by "interpreting" a postfix expression.
For the postfix to multiple postfix interpreter: the stack values are collections of postfix expressions. The interpretation rules are as follows.
<predicate>: push a one-element collection containing <predicate>.
AND: pop the top two collections into C1 and C2. With two nested loops,
create a collection containing x y AND for all x in C1 and y in C2.
Push this collection.
OR: pop the top two collections into C1 and C2. Push the union of C1 and C2.
For the postfix to infix interpreter: the stack values are infix expressions.
<predicate>: push <predicate>.
AND: pop two expressions into x and y. Push the expression (x) and (y).
These steps could be combined, but I wanted to present two examples of this technique.
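The two interpreters can be sketched in PHP as follows. This is an illustration under one assumption: each predicate is already a single token (for example 'country = be'), so the postfix input is an array of predicate strings and the operators 'and' and 'or'. The function names expandOr and toInfix are made up for the sketch.

```php
<?php
// Step 1: one postfix expression containing 'or' -> list of OR-free postfix expressions.
// Each stack value is a collection (array) of postfix expressions (arrays of tokens).
function expandOr(array $postfix): array
{
    $stack = array();
    foreach ($postfix as $token) {
        if ($token === 'and') {
            $c2 = array_pop($stack);
            $c1 = array_pop($stack);
            $combined = array();
            foreach ($c1 as $x) {
                foreach ($c2 as $y) {
                    $combined[] = array_merge($x, $y, array('and')); // x y AND
                }
            }
            $stack[] = $combined;
        } elseif ($token === 'or') {
            $c2 = array_pop($stack);
            $c1 = array_pop($stack);
            $stack[] = array_merge($c1, $c2); // union of both collections
        } else {
            $stack[] = array(array($token)); // one-element collection
        }
    }
    return array_pop($stack);
}

// Step 2: OR-free postfix expression -> infix string.
function toInfix(array $postfix): string
{
    $stack = array();
    foreach ($postfix as $token) {
        if ($token === 'and') {
            $y = array_pop($stack);
            $x = array_pop($stack);
            $stack[] = "($x) and ($y)";
        } else {
            $stack[] = $token;
        }
    }
    return array_pop($stack);
}

$query = array(
    'country = be', 'country = nl', 'or',
    'language = en', 'language = nl', 'or',
    'and',
    'message contains twitter',
    'and',
);

$results = array_map('toInfix', expandOr($query));
echo implode("\n", $results), "\n";
```

For the query from the question this yields four expressions, one per combination of country and language, each fully parenthesized.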
It might be easiest to work with a tree representation. Use the shunting yard algorithm to build a binary tree representing the equation. A node in the tree might be:
class Node {
const OP = 'operator';
const LEAF = 'leaf';
public $type = null; // Will be either Node::OP or Node::LEAF
public $op = null; // could be 'or', 'and' or 'contains'
public $value = null; // used for a leaf, e.g. 'twitter'
public $left = null;
public $right = null;
}
although you could use sub-classes. In the shunting yard algorithm you want to change the output steps to produce a tree.
Once you have a tree representation you need several algorithms.
First you need an algorithm to copy a tree
public function copy($node) {
if($node->type == Node::LEAF) {
$node2 = new Node();
$node2->type = Node::LEAF;
$node2->value = $node->value;
return $node2;
}
else {
// Copy both subtrees recursively so the result shares no nodes with the original
$left = $this->copy($node->left);
$right = $this->copy($node->right);
$node2 = new Node();
$node2->type = Node::OP;
$node2->op = $node->op;
$node2->left = $left;
$node2->right = $right;
return $node2;
}
}
Next the algorithm to find the first 'or' operator node.
function findOr($node) {
if($node->type == Node::OP && $node->op == 'or') {
return $node;
} else if($node->type == Node::OP ) {
$leftRes = findOr($node->left);
if( is_null($leftRes) ) {
$rightRes = findOr($node->right); // will be null or a found node
return $rightRes;
} else {
return $leftRes; // found one on the left, no need to walk rest of tree
}
} else {
return null;
}
}
and finally an algorithm copyLR giving either the left (true) or right (false) branch. It behaves as copy unless the node matches $target when either the left or right branch is returned.
public function copyLR($node,$target,$leftRight) {
if($node === $target) { // identity check: is this the exact node we are splitting on?
if($leftRight)
return $this->copy($node->left);
else
return $this->copy($node->right);
}
else if($node->type == Node::LEAF) {
$node2 = new Node();
$node2->type = Node::LEAF;
$node2->value = $node->value;
return $node2;
}
else {
// Recurse with copyLR so the target is still replaced deeper in the tree
$left = $this->copyLR($node->left,$target,$leftRight);
$right = $this->copyLR($node->right,$target,$leftRight);
$node2 = new Node();
$node2->type = Node::OP;
$node2->op = $node->op;
$node2->left = $left;
$node2->right = $right;
return $node2;
}
}
The pieces are now put together
$root = parse(); // result from the parsing step
$queue = array($root);
$output = array();
while( count($queue) > 0) {
$base = array_shift($queue);
$target = findOr($base);
if(is_null($target)) {
$output[] = $base; // no or operators found so output
} else {
// an 'or' operator found
$left = copyLR($base,$target,true); // copy the left
$right = copyLR($base,$target,false); // copy the right
array_push($queue, $left); // push both onto the end of the queue
array_push($queue, $right);
}
}
I have a particularly large graph, making it nearly impossible to traverse using recursion because of the excessive amount of memory it uses.
Below is my depth-first function, using recursion:
public function find_all_paths($start, $path)
{
$path[] = $start;
if (count($path)==25) /* Only want a path of maximum 25 vertices*/ {
$this->stacks[] = $path;
return $path;
}
$paths = array();
for($i = 0; $i < count($this->graph[$start])-1; $i++) {
if (!in_array($this->graph[$start][$i], $path)) {
$paths[] = $this->find_all_paths($this->graph[$start][$i], $path);
}
}
return $paths;
}
I would like to rewrite this function so it is non-recursive. I assume I will need to make a queue of some sort, and pop off values using array_shift() but in which part of the function, and how do I make sure the queued vertices are preserved (to put the final pathway on $this->stacks)?
It doesn't take exponential space: the number of paths in a tree is equal to the number of leaves, since every leaf has exactly one path from the root.
Here is a DFS simple search for an arbitrary binary tree:
// DFS: Parent-Left-Right
public function dfs_search ( $head, $key )
{
$stack = array($head);
$solution = array();
while (count($stack) > 0)
{
$node = array_pop($stack);
if ($node->val == $key)
{
$solution[] = $node;
}
// Push right first so that left is popped (visited) first
if ($node->right != null)
{
array_push($stack, $node->right);
}
if ($node->left != null)
{
array_push($stack, $node->left);
}
}
return $solution;
}
What you need to find all paths in a tree is simply Branch & Fork, meaning whenever you branch, each branch takes a copy of the current path. Here is a single-statement recursive branch & fork I wrote:
// Branch & Fork
public function dfs_branchFork ( $node, $path )
{
// array_merge() appends; the + operator on arrays would merge by key and drop entries
return array_merge(
array($path),
$node->right != null ? $this->dfs_branchFork($node->right, array_merge($path, array($node))) : array(),
$node->left != null ? $this->dfs_branchFork($node->left, array_merge($path, array($node))) : array()
);
}
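For the graph in the question, the same idea can be written without recursion by keeping partial paths on an explicit stack. The sketch below is a standalone function rather than the original method: the name findAllPaths, passing the adjacency list as a parameter, and the $maxLen parameter (25 in the question) are assumptions made for illustration.

```php
<?php
/**
 * Hypothetical iterative variant: collects every simple path that starts
 * at $start and contains exactly $maxLen vertices.
 *
 * $graph maps each vertex to the list of its neighbours.
 */
function findAllPaths(array $graph, $start, int $maxLen): array
{
    $found = array();
    $stack = array(array($start)); // each entry is one partial path

    while (count($stack) > 0) {
        $path = array_pop($stack); // LIFO order gives depth-first traversal

        if (count($path) === $maxLen) {
            $found[] = $path;      // complete path, do not extend further
            continue;
        }

        $last = $path[count($path) - 1];
        foreach ($graph[$last] ?? array() as $next) {
            if (!in_array($next, $path, true)) {
                $extended = $path; // arrays are copied by value
                $extended[] = $next;
                $stack[] = $extended;
            }
        }
    }

    return $found;
}

$graph = array(
    'a' => array('b', 'c'),
    'b' => array('a', 'c'),
    'c' => array('a', 'b'),
);
$paths = findAllPaths($graph, 'a', 3);
print_r($paths);
```

Using array_pop() keeps the traversal depth-first; array_shift() on the same array would turn it into a breadth-first search, which tends to hold many more partial paths in memory at once.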
I'm using cURL to pull a webpage from a server. I pass it to Tidy and throw the output into a DOMDocument. Then the trouble starts.
The webpage contains about three thousand (yikes) table tags, and I'm scraping data from them. There are two kinds of tables, where one or more type B follow a type A.
I've profiled my script using microtime(true) calls. I've placed calls before and after each stage of my script and subtracted the times from each other. So, if you'll follow me through my code, I'll explain it, share the profile results, and point out where the problem is. Maybe you can even help me solve the problem. Here we go:
First, I include two files. One handles some parsing, and the other defines two "data structure" classes.
// Imports
include('./course.php');
include('./utils.php');
Includes are inconsequential as far as I know, and so let's proceed to the cURL import.
// Execute cURL
$response = curl_exec($curl_handle);
I've configured cURL to not time out, and to post some header data, which is required to get a meaningful response. Next, I clean up the data to prepare it for DOMDocument.
// Run about 25 str_replace calls here, to clean up
// then run tidy.
$html = $response;
//
// Prepare some config for tidy
//
$config = array(
'indent' => true,
'output-xhtml' => true,
'wrap' => 200);
//
// Tidy up the HTML
//
$tidy = new tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();
$html = $tidy;
Up until now, the code has taken about nine seconds. Considering this to be a cron job, running infrequently, I'm fine with that. However, the next part of the code really barfs. Here's where I take what I want from the HTML and shove it into my custom classes. (I plan to stuff this into a MySQL database too, but this is a first step.)
// Get all of the tables in the page
$tables = $dom->getElementsByTagName('table');
// Create a buffer for the courses
$courses = array();
// Iterate
$numberOfTables = $tables->length;
for ($i=1; $i <$numberOfTables ; $i++) {
$sectionTable = $tables->item($i);
$courseTable = $tables->item($i-1);
// We've found a course table, parse it.
if (elementIsACourseSectionTable($sectionTable)) {
$course = courseFromTable($courseTable);
$course = addSectionsToCourseUsingTable($course, $sectionTable);
$courses[] = $course;
}
}
For reference, here's the utility functions that I call:
//
// Tell us if a given element is
// a course section table.
//
function elementIsACourseSectionTable(DOMElement $element){
$tableHasClass = $element->hasAttribute('class');
$tableIsCourseTable = $element->getAttribute("class") == "coursetable";
return $tableHasClass && $tableIsCourseTable;
}
//
// Takes a table and parses it into an
// instance of the Course class.
//
function courseFromTable(DOMElement $table){
$secondRow = $table->getElementsByTagName('tr')->item(1);
$cells = $secondRow->getElementsByTagName('td');
$course = new Course;
$course->startDate = valueForElementInList(0, $cells);
$course->endDate = valueForElementInList(1, $cells);
$course->name = valueForElementInList(2, $cells);
$course->description = valueForElementInList(3, $cells);
$course->credits = valueForElementInList(4, $cells);
$course->hours = valueForElementInList(5, $cells);
$course->division = valueForElementInList(6, $cells);
$course->subject = valueForElementInList(7, $cells);
return $course;
}
//
// Takes a row and parses it into an
// instance of the Section class.
//
function sectionFromRow(DOMElement $row){
$cells = $row->getElementsByTagName('td');
//
// Skip any row with a single cell
//
if ($cells->length == 1) {
# code...
return NULL;
}
//
// Skip header rows
//
if (valueForElementInList(0, $cells) == "Section" || valueForElementInList(0, $cells) == "") {
return NULL;
}
$section = new Section;
$section->section = valueForElementInList(0, $cells);
$section->code = valueForElementInList(1, $cells);
$section->openSeats = valueForElementInList(2, $cells);
$section->dayAndTime = valueForElementInList(3, $cells);
$section->instructor = valueForElementInList(4, $cells);
$section->buildingAndRoom = valueForElementInList(5, $cells);
$section->isOnline = valueForElementInList(6, $cells);
return $section;
}
//
// Take a table containing course sections,
// parse it and put the results into a
// given course object.
//
function addSectionsToCourseUsingTable(Course $course, DOMElement $table){
$rows = $table->getElementsByTagName('tr');
$numRows = $rows->length;
for ($i=0; $i < $numRows; $i++) {
$section = sectionFromRow($rows->item($i));
// Make sure we have an array to put sections into
if (is_null($course->sections)) {
$course->sections = array();
}
// Skip "meta" rows, since they're not really sections
if (is_null($section)) {
continue;
}
$course->addSection($section);
}
return $course;
}
//
// Returns the text from a cell
// with a
//
function valueForElementInList($index, $list){
$value = $list->item($index)->nodeValue;
$value = trim($value);
return $value;
}
This code takes 63 seconds. That's over a minute for a PHP script to pull data from a webpage. Sheesh!
I've been advised to split up the workload of my main work loop, but considering the homogenous nature of my data, I'm not entirely sure how. Any suggestions on improving this code are greatly appreciated.
What can I do to improve my code execution time?
It turns out that my loop is terribly inefficient.
Using a foreach cut time in half to about 31 seconds. But that wasn't fast enough. So I reticulated some splines and did some brainstorming with about half of the programmers that I know how to poke online. Here's what we found:
Using DOMNodeList's item() accessor is linear, so calling it for every index makes the whole loop quadratic. Removing the first element after each iteration makes the loop faster, because we then always access the first element of the list. This brought me down to 8 seconds.
After playing some more, I realized that the ->length property of DOMNodeList is just as bad as item(), since it also incurs linear cost. So I changed my for loop to this:
$table = $tables->item(0);
while ($table != NULL) {
$table = $tables->item(0);
if ($table === NULL) {
break;
}
//
// We've found a section table, parse it.
//
if (elementIsACourseSectionTable($table)) {
$course = addSectionsToCourseUsingTable($course, $table);
}
//
// Skip the last table if it's not a course section
//
else if(elementIsCourseHeaderTable($table)){
$course = courseFromTable($table);
$courses[] = $course;
}
//
// Remove the first item from the list
//
$first = $tables->item(0);
$first->parentNode->removeChild($first);
//
// Get the next table to parse
//
$table = $tables->item(0);
}
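An alternative that leaves the document intact is to select the tables with DOMXPath and walk the result with foreach, so there are no item() calls by index at all. This is only a sketch on a tiny made-up document; the markup and the "header" class are invented, only the "coursetable" class comes from the code above, and whether it beats the removal approach on the real page would need measuring.

```php
<?php
// Tiny invented document standing in for the scraped page.
$html = '<html><body>'
      . '<table class="header"><tr><td>Course A</td></tr></table>'
      . '<table class="coursetable"><tr><td>Section 1</td></tr></table>'
      . '<table class="coursetable"><tr><td>Section 2</td></tr></table>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select only the section tables and iterate them in one pass.
$sectionCells = array();
foreach ($xpath->query('//table[@class="coursetable"]') as $table) {
    $sectionCells[] = trim($table->textContent);
}

print_r($sectionCells);
```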
Note that I've done some other optimizations in terms of targeting the data I want, but the relevant part is how I handle progressing from one item to the next.