Add id to parser html element? - php

I'm going to write a tool that extracts data from soccerway.com. In effect, I want to build a historical archive.
As you can see, the data is grouped into football seasons, so there would be a database for 2015/2016, 2016/2017 and so on. My plan is to take the data, insert it into the database and then run a cron job that updates these values. The problem is that I need some kind of key to identify each record so that it can be updated. So far I have only thought about how the parser works, and I do not know how to create a key for each parsed item. For example, take the league standings from the link I have provided: once the data has been inserted, how can a cron job later check whether there are updates and replace the values?
I know that, to detect updates, I could use the lastUpdate header field and save it somewhere in the database, then have the cron job check this field for each league. The most important point, however, is identifying the values to be updated, because I have no id to reference.
Any ideas?

While parsing the data, you can store the date and time of the forthcoming matches and schedule the script to run then (there won't be any updates in the meantime). If you parse the HTML directly, it shouldn't take long.

Related

reading and storing a list as an array

I have a script that receives json data from various sources and processes it.
I have a list in a database and also as a text file of known good sources. The list has thousands of records.
Before processing I want to compare the source value from json with the source value in the list. Data is received every 10 sec. The list does not change often.
At the moment I can make this work either by querying the database for the sources list or by reading the list from a text file. However, it seems redundant to do this every 10 sec upon receiving json, since the list is going to be the same 99% of the time.
The question is: what is a good way to do this?
Assuming you have more than read access to this DB, and given that the records do not change often, you could add a trigger on the DB that fires on any change to the list. Have the trigger set a single row in a new table called "listUpdated" to True.
Load the list into an array in your PHP and check incoming data against it. Every 10 seconds you only need to check whether the "listUpdated" field has been set to True. If it has, reload your array and set the value back to False.
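The trigger-and-flag idea can be sketched as follows. This is only an illustration under stated assumptions: it uses Python with an in-memory SQLite database instead of PHP and the asker's actual DB, and the `sources` table, the `changed` column and the `SourceCache` class are hypothetical names. Only "listUpdated" comes from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sources (name TEXT PRIMARY KEY);            -- hypothetical list table
CREATE TABLE listUpdated (id INTEGER PRIMARY KEY CHECK (id = 1),
                          changed INTEGER NOT NULL);     -- single-row flag table
INSERT INTO listUpdated (id, changed) VALUES (1, 1);     -- force the first load
CREATE TRIGGER sources_ins AFTER INSERT ON sources
BEGIN UPDATE listUpdated SET changed = 1 WHERE id = 1; END;
CREATE TRIGGER sources_upd AFTER UPDATE ON sources
BEGIN UPDATE listUpdated SET changed = 1 WHERE id = 1; END;
CREATE TRIGGER sources_del AFTER DELETE ON sources
BEGIN UPDATE listUpdated SET changed = 1 WHERE id = 1; END;
""")


class SourceCache:
    """Keeps the list in memory and reloads it only when the flag is set."""

    def __init__(self, connection):
        self.conn = connection
        self.sources = set()
        self.reloads = 0  # counts full reloads, for illustration only

    def known(self, source):
        (changed,) = self.conn.execute(
            "SELECT changed FROM listUpdated WHERE id = 1").fetchone()
        if changed:
            rows = self.conn.execute("SELECT name FROM sources")
            self.sources = {row[0] for row in rows}
            self.conn.execute("UPDATE listUpdated SET changed = 0 WHERE id = 1")
            self.conn.commit()
            self.reloads += 1
        return source in self.sources


cache = SourceCache(conn)
conn.execute("INSERT INTO sources (name) VALUES ('feed-a')")
print(cache.known("feed-a"))  # flag was set, so the list is loaded first
print(cache.known("feed-b"))  # flag is clear, so only the cheap check runs
```

Each incoming message then costs one single-row read instead of a query over thousands of records.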

SQLite3: is there any way to know the number of entries that have been added?

I was looking into using a database to store a generated link to that database entry that holds more information about the database entry. So you would see a bit of the database, then click on the entry and open a new page that holds more information about that entry.
What I was looking for was something to keep track of the number of entries that have been entered, even if one of the entries has been removed. I know SQLite3 has count, but I haven't seen anything that would keep track of this. I was thinking that, to reach my goal, I would have to set a counter, write it to a file and read that counter back when making a new entry. I'm just wondering if anyone knows something else I can do instead of reading/writing a file for one number.
It should be noted that this is on a server that can be shut down and restarted. The user must enter the information that goes into the database, and the server will log it for the user. I don't want to ever repeat the same entry number.
I have mainly used PHP, HTML, and Python for the current project I am working on.
I looked into this out of curiosity because you can do 'post save' and 'pre save' in most ORM-based webapps.
"A trigger may be specified to fire whenever a DELETE, INSERT, or UPDATE of a particular database table occurs"
https://sqlite.org/lang_createtrigger.html
CREATE TRIGGER aft_insert AFTER INSERT ON emp_details
BEGIN
    INSERT INTO emp_log(emp_id, salary, edittime)
    VALUES(NEW.employee_id, NEW.salary, current_date);
END;
It seems like the answer I was looking for was built into SQLite3, which is the best kind of answer. AUTOINCREMENT in SQLite3 lets me do what I was looking for. It keeps track of the highest entry number that has ever been used, so I can generate a link from the ROWID column that I declare as AUTOINCREMENT and never repeat a number, even after entries have been deleted.
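A small sketch of that behaviour, using Python's built-in sqlite3 module. The `entries` table and the `add_entry` helper are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# INTEGER PRIMARY KEY AUTOINCREMENT makes SQLite remember the largest id
# ever handed out (in its sqlite_sequence table), so ids are never reused.
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY AUTOINCREMENT, info TEXT)")


def add_entry(info):
    """Insert one entry and return the id SQLite assigned to it."""
    cur = conn.execute("INSERT INTO entries (info) VALUES (?)", (info,))
    conn.commit()
    return cur.lastrowid


first = add_entry("first")
second = add_entry("second")
third = add_entry("third")

# Remove the newest entry, then add another one.
conn.execute("DELETE FROM entries WHERE id = ?", (third,))
conn.commit()
fourth = add_entry("fourth")

print(first, second, third, fourth)  # the deleted id 3 is not handed out again
```

Without the AUTOINCREMENT keyword, SQLite would normally pick the current maximum id plus one, so the deleted number could come back.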

RSS aggregator; how to insert only new items

A tutorial here shows how to build an aggregator in PHP, but I'm having some trouble finding the best way to avoid inserting the same items into my database.
If I were to run the script on http://visualwebsiteoptimizer.com/split-testing-blog/feed/ and then run it again in 5 minutes, it would just insert the same items again.
That tutorial just has a specified interval at which it reloads the RSS feed and saves all the items.
I was wondering if RSS implements some request header that will only send the items after a certain date. I see here that I could use lastBuildDate and maybe ignore channels that have a date older than the last fetch, but it doesn't say whether that field is mandatory.
My question here is: how can I check RSS feeds regularly and insert them into a database without inserting the same item more than once?
I'm thinking the only way to do it is to check whether a record already exists using link, and only insert if it doesn't. I know link is optional, but I won't save items that don't have one anyway. This seems a bit inefficient though; checking before every insert might be fine in the beginning, but when the database starts filling up it might get very slow.
You might have to use a few different strategies depending on how well the site you are consuming has implemented the spec.
First I would try adding a unique index on the database for the GUID value. GUIDs by their nature should be unique (http://en.wikipedia.org/wiki/Globally_unique_identifier). Then, depending on which DB you are using, you should be able to use syntax like INSERT IGNORE INTO... or INSERT ... ON DUPLICATE KEY UPDATE... and just have the update clause not really do anything.
If some sites don't have a guid field (I am assuming you will end up consuming more than just the example), you could add the unique index on the siteId field plus either the time or the title. Both are less than ideal, of course. Contacting the site owner to get them to implement a guid might work too ;)
You could also run an md5 hash on the post content and store that alongside the post; that should stop duplicates too.
How big are you expecting the DB to get? With proper indexing I would have thought that it would have to be huge before it runs slow. A lookup that uses indexes on siteId, guid, time and/or hash, is limited to just 1 row and selects just the rowId should be quick enough, especially if you can get your script to run from the command line / on a cron job rather than through a webserver.
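As a rough sketch of the unique-index approach, here is a version in Python with SQLite. SQLite spells the statement INSERT OR IGNORE (MySQL's equivalent is INSERT IGNORE). The `items` table, the `dedupe_key` column and both helper functions are assumptions made for the example, and the md5 fallback is used only when an item has no guid.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE items (
    id INTEGER PRIMARY KEY,
    site_id INTEGER NOT NULL,
    dedupe_key TEXT NOT NULL,      -- guid if present, otherwise a content hash
    title TEXT,
    link TEXT,
    UNIQUE (site_id, dedupe_key)   -- the unique index does the duplicate check
)""")


def dedupe_key(item):
    """Prefer the feed's guid; fall back to an md5 of title plus content."""
    if item.get("guid"):
        return "guid:" + item["guid"]
    text = item.get("title", "") + item.get("content", "")
    return "md5:" + hashlib.md5(text.encode("utf-8")).hexdigest()


def store_items(site_id, items):
    """Insert items, silently skipping ones already stored. Returns new rows."""
    inserted = 0
    for item in items:
        cur = conn.execute(
            "INSERT OR IGNORE INTO items (site_id, dedupe_key, title, link) "
            "VALUES (?, ?, ?, ?)",
            (site_id, dedupe_key(item), item.get("title"), item.get("link")))
        inserted += cur.rowcount  # 1 when inserted, 0 when ignored
    conn.commit()
    return inserted


feed = [
    {"guid": "post-1", "title": "First post", "link": "http://example.com/1"},
    {"title": "No guid here", "content": "body text", "link": "http://example.com/2"},
]
print(store_items(1, feed))  # first run stores both items
print(store_items(1, feed))  # second run finds nothing new
```

No separate "does it exist?" query is needed; the index lookup happens as part of the insert.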

Save changes in contenteditable with timestamp

I have a <div contenteditable="true" /> that the user can write in that is pretty much unlimited in length.
The data in the div is saved on change with a timestamp in a MySQL database.
Now my goal is to have a little note on the left that tells the user when each part of the document has been created (resolution should be in days).
Now, the question is: How can I save the information (what part has changed when) best?
I considered the following options so far which both seem improvable:
Every time the user visits the site, I insert a flag at the end of the document (e.g. an empty span with a class and a data attribute that stores the start of editing). This flag is then saved into the database when the save script is called. This option would make it very easy to show the date on the side: I would just put the date notes at the same height as the empty span, and the span tells me the date. Downsides are: the user might accidentally delete the timestamp span, and if the user doesn't close the window for a long time no new timestamp spans are inserted (this could probably be avoided by inserting new timestamp spans every X minutes, so the accidental deletion is the more relevant problem).
Trying to do a string diff comparison each time the data is passed to the saving script and only saving the diff with a timestamp. Then, when the page is loaded, put all parts together in the right order and use Javascript to put the date notes in the right place. This sounds like a lot of overhead to me though, and when older parts are changed two parts could become one, etc. All in all this option sounds very complicated.
Any input / ideas / suggestions highly appreciated!
What you are trying to implement is the feature called "annotate" or "blame" in the source code control world (though you just want the date of update rather than date+author or simply author).
To do that properly, you need a way to make diffs (php-diff might do the job) and obtain the versions of the text.
There are two strategies:
store the latest version and keep only deltas (such as unified diffs, my preference)
store all the versions and compute deltas on the fly
Once you have your current version and the list of deltas (you can definitely shorten the list if there are more than, say, a few dozen deltas, and let the user ask for more if it is really important), you compose the deltas together. This is where the annotation phase happens, because while composing you can remember which version each line comes from. Composing is pretty simple in fact: start from the latest version. All lines added by the patch leading to it come from the latest version; the other lines still need to be explained, so repeat with the next older patch until you reach the oldest patch that you want to process. The remaining lines come from that version or an earlier one, so a flag saying 'at or before' that version can be used.
Google has a diff library for Javascript, so you could do all the hard work on the user's machine. It includes the patch part as well. I did not find an annotate/blame library.
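For illustration, here is a minimal annotate sketch in Python using the standard difflib module. It is a simplified variant rather than the exact procedure described above: it assumes all versions are stored in full (the second strategy) and walks from the oldest version to the newest, carrying each line's date forward while the line is unchanged. The `annotate` function and the sample data are hypothetical.

```python
import difflib


def annotate(versions):
    """Return (date, line) pairs for the newest version.

    `versions` is a list of (date, text) tuples ordered oldest to newest.
    Each line is labelled with the date of the version that introduced it.
    """
    annotated = []  # (date, line) pairs for the version processed so far
    for date, text in versions:
        new_lines = text.splitlines()
        old_lines = [line for _, line in annotated]
        matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
        result = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Unchanged lines keep the date they already had.
                result.extend(annotated[i1:i2])
            elif tag in ("replace", "insert"):
                # New or rewritten lines get the date of this version.
                result.extend((date, line) for line in new_lines[j1:j2])
            # "delete" contributes nothing to the new version.
        annotated = result
    return annotated


history = [
    ("2024-01-01", "intro\nbody"),
    ("2024-01-02", "intro\nbody\nmore"),
    ("2024-01-03", "intro\nbody changed\nmore"),
]
for date, line in annotate(history):
    print(date, line)
```

The resulting dates can be rendered next to the corresponding lines; with day resolution, consecutive lines sharing a date can be collapsed into one note.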
One way that you could do it is by having a table for revisions of the div. Every so often, you can add a new entry into this table, containing the content and the timestamp, and therefore keep track of all the revisions of editing. Then, to find out what is changed, you can simply compare two of the entries to find out what has been added, stayed the same, and removed.
You should save this information when the user submits the information. The moment that the user wants to see the information, there should be hardly any computation.
In the backend you create two tables. In one table, let's call it 'currentdocs', you always store the latest version of the data. When the user loads the document, all the information comes from this table 'currentdocs'.
In the other table, let's call it 'docsintime', you store every save. It has a foreign key to the table 'currentdocs', and you can find the last row in this table 'docsintime' by selecting the maximum id with that foreign key. A select statement can be something like:
select id from docsintime where cur_key = x order by id desc limit 1;
In both tables you store, for each relevant part, the latest timestamp at which it was changed.
When a new document is saved, you fetch the last saved version from the table 'docsintime'. You compare all relevant parts with the data from that record. If a part does not differ, you copy its timestamp into the new record to be saved. If it does differ, you create a new timestamp for that part.
After the comparison you save the new record in both tables 'currentdocs' and 'docsintime': you update the row in 'currentdocs' and insert a new record in 'docsintime'. Only a brand-new document causes an insertion in the table 'currentdocs'.
On the next request for that document you only have to collect the information from the table 'currentdocs'. And the process starts all over again.
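The comparison step on save might look like the sketch below. It assumes that a "relevant part" is a paragraph and that parts are compared by position; both are assumptions, since the answer does not define how the document is split. `carry_timestamps` is a hypothetical helper, and the database writes are left out.

```python
def carry_timestamps(previous, new_parts, now):
    """Return (text, timestamp) pairs for the version being saved.

    `previous` holds the (text, timestamp) pairs of the last saved version,
    `new_parts` holds the texts of the parts being saved, and `now` is the
    timestamp to use for parts that are new or have changed.
    """
    stamped = []
    for index, text in enumerate(new_parts):
        if index < len(previous) and previous[index][0] == text:
            # Unchanged part: keep the timestamp it already had.
            stamped.append((text, previous[index][1]))
        else:
            # Changed or newly added part: stamp it with the current time.
            stamped.append((text, now))
    return stamped


last_saved = [("Intro paragraph", "2024-01-01"), ("Second paragraph", "2024-01-01")]
incoming = ["Intro paragraph", "Second paragraph, edited", "Third paragraph"]
for text, stamp in carry_timestamps(last_saved, incoming, "2024-01-05"):
    print(stamp, text)
```

Because the timestamps are computed at save time, displaying the document later needs no diffing at all.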

Basic version control for MySQL table

I'm trying to setup a (I thought) fairly simple versioning system for static html pages on a site. The goal is to keep previous versions of the content, then restore to them if needed (I guess basically creating a new version that's a duplicate of an old one), and optionally to toss out data older than X versions ago.
The table's setup is fairly straightforward:
id
reference_id (string/used to determine what page the item pertains to)
content (document/html page sized amount of data)
e_user (user who changed it last)
e_timestamp (when it was changed)
I just want to have something setup to create a previous version for each edit to the content, then be able to restore to it if needed.
What's the best method for accomplishing this? Should everything be in the same table, or spread across a few different ones?
I read through a few pages on the subject, but a lot of them seemed like overkill for what I'm trying to accomplish (e.g. http://www.jasny.net/articles/versioning-mysql-data/ ).
Are there any platforms/guides that will help me in this endeavor?
Ideally you would want everything in the same table with something in your query to get the correct version, however you should be careful how you do this as an inefficient query will put extra load on your server. If normally you would select a single item like this:
SELECT * FROM your_table WHERE id = 42
This would then become:
SELECT * FROM your_table
WHERE id = 42
AND e_timestamp < '2010-10-12 15:23:24'
ORDER BY e_timestamp DESC
LIMIT 1
Add an index on (id, e_timestamp) so that this query performs efficiently.
Selecting multiple rows in a single query is trickier and requires a groupwise-maximum approach, but it can be done.
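Both queries can be tried out with a short Python and SQLite sketch. It assumes, as the answer does, that all versions of a page share the same id and are told apart by e_timestamp. The sample rows and the two helper functions are made up for the example; the SQL itself is plain enough to carry over to MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE your_table (id INTEGER NOT NULL, content TEXT, e_timestamp TEXT NOT NULL);
CREATE INDEX idx_id_time ON your_table (id, e_timestamp);
INSERT INTO your_table VALUES
    (42, 'first draft',  '2010-10-01 09:00:00'),
    (42, 'second draft', '2010-10-10 09:00:00'),
    (42, 'third draft',  '2010-10-20 09:00:00'),
    (43, 'other page',   '2010-10-05 09:00:00');
""")


def version_as_of(page_id, moment):
    """Content of the newest version saved before `moment`, or None."""
    row = conn.execute(
        "SELECT content FROM your_table "
        "WHERE id = ? AND e_timestamp < ? "
        "ORDER BY e_timestamp DESC LIMIT 1",
        (page_id, moment)).fetchone()
    return row[0] if row else None


def latest_versions():
    """Groupwise maximum: the newest version of every page in one query."""
    return conn.execute("""
        SELECT t.id, t.content
        FROM your_table AS t
        JOIN (SELECT id, MAX(e_timestamp) AS latest
              FROM your_table GROUP BY id) AS m
          ON m.id = t.id AND m.latest = t.e_timestamp
        ORDER BY t.id""").fetchall()


print(version_as_of(42, "2010-10-12 15:23:24"))
print(latest_versions())
```

The join against the per-id maximum is one common way to write a groupwise maximum; other formulations exist.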
You can use a technique called "auditing". You would set up audit tables. Then you would either write it into your code or set up triggers on the DB side so that every time a change is made, an entry is added to the appropriate audit table. Then you can go back through the audit table and see things like:
"Oh, yesterday Sue went in and fixed a typo"
"Uh oh, Steve wiped out an entire paragraph by accident earlier today while trying to rewrite this section"
Your primary table doesn't have to keep all that history, so it can stay slim. If you ever need to look at the history and, say, roll something back, you can look in your audit table and do that. You can set up the audit table however you want, so each audit row can hold the entire content BEFORE the edit, and not just what was edited. That should make "rolling back" fairly easy.
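A trigger-based audit table could be sketched like this, again in Python with SQLite rather than MySQL. The table names `pages` and `pages_audit`, the trigger, and the helpers `edit_page` and `restore` are hypothetical; the column names follow the question's table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (
    reference_id TEXT PRIMARY KEY, content TEXT, e_user TEXT, e_timestamp TEXT);
CREATE TABLE pages_audit (
    audit_id INTEGER PRIMARY KEY AUTOINCREMENT,
    reference_id TEXT, old_content TEXT, old_user TEXT, old_timestamp TEXT);
-- Every content change copies the row as it was BEFORE the edit.
CREATE TRIGGER pages_audit_update AFTER UPDATE OF content ON pages
BEGIN
    INSERT INTO pages_audit (reference_id, old_content, old_user, old_timestamp)
    VALUES (OLD.reference_id, OLD.content, OLD.e_user, OLD.e_timestamp);
END;
""")


def edit_page(reference_id, content, user):
    """Normal edit; the trigger records the previous state automatically."""
    conn.execute(
        "UPDATE pages SET content = ?, e_user = ?, e_timestamp = datetime('now') "
        "WHERE reference_id = ?", (content, user, reference_id))
    conn.commit()


def restore(audit_id, user):
    """Roll back by saving an old version again as a new edit."""
    ref, old_content = conn.execute(
        "SELECT reference_id, old_content FROM pages_audit WHERE audit_id = ?",
        (audit_id,)).fetchone()
    edit_page(ref, old_content, user)


conn.execute("INSERT INTO pages VALUES ('about', 'Hello wrld', 'sue', datetime('now'))")
edit_page("about", "Hello world", "sue")   # Sue fixes a typo
edit_page("about", "", "steve")            # Steve wipes the content by accident
restore(2, "sue")                          # audit row 2 holds 'Hello world'
print(conn.execute("SELECT content FROM pages").fetchone()[0])
```

Since a restore is itself an update, it leaves its own audit row, so the accidental wipe stays visible in the history.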
Add a version column and a delete column (bool) and create some functions that compare the versions of rows with the same id. You'll definitely want to be able to easily find the current version and the previous version. To get rid of the data you'll want to write another function that sorts all of the versions of id, figures out which are old enough to be deleted, and marks them for deletion by another function. You'll probably want to have an option to make certain pages immune to deletion or postpone it.
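One possible shape for the marking and deleting functions, sketched in Python with SQLite. The `page_versions` table, the `protected` column (standing in for the "immune to deletion" option) and both functions are assumptions for illustration; the answer only names the version and delete columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE page_versions (
    id INTEGER NOT NULL,               -- shared by all versions of one page
    version INTEGER NOT NULL,
    content TEXT,
    deleted INTEGER NOT NULL DEFAULT 0,
    protected INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (id, version)
)""")


def mark_old_versions(keep):
    """Flag every version older than the newest `keep` versions of its page."""
    conn.execute("""
        UPDATE page_versions SET deleted = 1
        WHERE protected = 0
          AND version <= (SELECT MAX(v.version) FROM page_versions AS v
                          WHERE v.id = page_versions.id) - ?""", (keep,))
    conn.commit()


def purge_marked():
    """Physically remove the rows flagged by mark_old_versions."""
    conn.execute("DELETE FROM page_versions WHERE deleted = 1")
    conn.commit()


# Page 1 has five versions; page 2 has three and is protected from pruning.
for v in range(1, 6):
    conn.execute("INSERT INTO page_versions (id, version, content) VALUES (1, ?, ?)",
                 (v, "page 1 v%d" % v))
for v in range(1, 4):
    conn.execute("INSERT INTO page_versions (id, version, content, protected) "
                 "VALUES (2, ?, ?, 1)", (v, "page 2 v%d" % v))
conn.commit()

mark_old_versions(keep=2)
purge_marked()
print(conn.execute("SELECT id, version FROM page_versions ORDER BY id, version").fetchall())
```

Keeping marking and purging as separate steps leaves room to review or postpone a deletion before any data is lost.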
