Best way to check if content of page has been changed? - php

I have a crawler that crawls hundreds of thousands of pages and indexes/parses their contents. One thing I'm struggling with is checking, in an efficient way, whether the content of a page has been updated, without having to re-crawl it and compare the whole page.
Obviously I could just load the whole page, re-parse everything, and compare it all to what I have stored in my database. However, that is very inefficient and uses a lot of computing power, resulting in high hosting bills.
I'm thinking of comparing hashes. The problem with this is that if the page has changed by a single byte or character, the hash will be different. So, for example, if the page displays the current date, the hash would be different every single time and tell me that the content has been updated.
So... how would you do this? Would you look at the size of the HTML in KB? Would you look at the string length and say that if, for example, the length has changed by more than 5%, the content has "changed"? Or is there some kind of hashing algorithm where the hash stays the same if only small parts of the string/content have changed?

You could try using the value of the "Last-Modified" header in the response from the server. Parsing this into a nice object would allow for simple date comparisons, letting you check whether you should re-scrape. For example (in Python, using the brilliant requests library):
import requests
from email.utils import parsedate_to_datetime

r = requests.get('http://en.wikipedia.org/wiki/Monty_Python')
site_last_modified_date = r.headers["Last-Modified"]
# parse the header into a datetime, then compare it with the last recorded date
last_modified = parsedate_to_datetime(site_last_modified_date)
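Since the question is tagged php, here is a minimal sketch of the same idea in PHP (the URL and stored date are placeholders). It sends the stored date back as an If-Modified-Since header, so a well-behaved server can answer 304 Not Modified with no body at all, and you skip the download and re-parse entirely. Note that not every server honors these headers, so treat a 200 as "possibly changed" rather than "definitely changed".

<?php
// Sketch: conditional GET. $lastCrawled would come from your database.
$lastCrawled = gmdate('D, d M Y H:i:s \G\M\T', strtotime('2020-08-01 00:00:00'));

$ch = curl_init('http://en.wikipedia.org/wiki/Monty_Python');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, ['If-Modified-Since: ' . $lastCrawled]);
$body = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($status === 304) {
    // Server says nothing changed since $lastCrawled: skip re-parsing.
} else {
    // 200: re-parse $body and store the new crawl date.
}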

Related

Is there a way to store multiple records rather than using multiple rows in MySQL?

I would like to make full use of MySQL for the purposes of a (web) application I have developed for a chiropractor.
So far I have been storing one row per year for what are called progress notes. The table structure looks something like this: (progress_note_id, patient_id, date (Y-0-0), progress_note). When the client wishes to append to the current year's progress notes, he simply clicks at the top of a textarea (HTML, using the TinyMCE JavaScript library) to add a new entry date along with the shorthand notes, which go at the beginning of the column (progress_note). So far it's been working OK. With 900+ clients (est.), there could potentially be 1,300+ progress notes for each year since the beginning of the application (2018).
Now the client wishes to be able to see previous progress notes (history) but be unable to modify any of them, while still being able to write new ones. The solution I have come up with is to use XML inside the textarea and use PHP to separate the new notes from the old ones.
My problem, however, is that if I have to convert my entire table from yearly to daily rows, it could take a lot of time and energy to split multiple notes into single rows (est. 10x), which could end up being 13,000+ rows. I realize that no matter which method I choose, it is going to be a lot of work. Another way around this, I found, might be to store multiple records as XML in a MySQL column; to append, all I would need is PHP to parse the XML and add a new child node at the beginning. Each progress note is 255-500 chars, and even in the worst-case scenario, if a patient were seen 52 times a year (once every week), there shouldn't be too large an overhead.
Is this the correct way of solving this problem? I do wish to stick with a MySQL DB, and I realize that MySQL is not intended for XML. For some clarification, what I hope to accomplish is the same thing I currently do with progress notes, but with XML, ordered newest -> oldest.
<xml_result>
  <progress_note>
    <date>2020-08-16</date>
    <content></content>
  </progress_note>
</xml_result>
Thank you for your time and for any suggestions.
Firstly, 13,000+ rows is not a problem for MySQL. In most cases, MySQL can handle 10M+ records in a single instance of a web application with good performance.
Secondly, you can store either XML or JSON in a text field and handle the decoding in your application.
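For example, a minimal sketch of the JSON variant using PDO (the table, column, and credential names here are hypothetical). New notes are prepended in the application, so older entries are never touched:

<?php
// Sketch: all progress notes for one patient/year live in a TEXT column
// as a JSON array, newest first. Names below are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=clinic', 'user', 'pass');

// Read the existing notes for one patient/year.
$stmt = $pdo->prepare('SELECT progress_note FROM progress_notes WHERE patient_id = ? AND year = ?');
$stmt->execute([42, 2020]);
$notes = json_decode($stmt->fetchColumn() ?: '[]', true);

// Prepend the new note; old entries remain read-only by convention.
array_unshift($notes, ['date' => '2020-08-16', 'content' => 'New shorthand notes']);

$upd = $pdo->prepare('UPDATE progress_notes SET progress_note = ? WHERE patient_id = ? AND year = ?');
$upd->execute([json_encode($notes), 42, 2020]);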

How to use php to store and display any visitor's input

I want to create a very simple site, but unfortunately my php skills are weak. Basically, when a user shows up, I want to have a page with text and a blinking cursor (I can probably figure the cursor part out myself, but feel free to suggest). When a user types, I want it to show the text as they type, and when they hit enter (or click something/whatever), the text just typed will be sent to a database and then the page will update with that new text, for anybody else to see. The cursor will then be blinking on the next line down. So basically it's like a really simple wiki, where anyone can add anything, but nobody can ever remove what has been typed before. No logging in or anything. Can someone suggest the best way to go about this? I assume it will require a php call to the database to display the initial page, then another php request to send data, then another php request to display the new page. I just don't know the details. Thanks so much!
Bonus question 1: How can the page be updated dynamically, so if A sends text while B is typing, B sees the text A sent on B's page immediately?
Bonus question 2: What sorts of issues might arise if this database grows extremely large (say, millions of words), and how might I address these up front? If necessary, I could show only a small chunk of the (text-only) database on any given page, then have pagination.
If you only have one page, you don't need a database. All you need to do is keep a text file on the server (use fopen() and related functions) that only gets appended to. If you have multiple pages, a simple table with id (INTEGER) and filetext (LONGBLOB) will do. (Note that LONGBLOB has a limit of 2^32 bytes.)
For the user's browser part, you'll need JavaScript and AJAX to inform the server of any updates: just post to a PHP script that (1) accepts the input and (2) appends it to the file, as in the sketch below.
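A minimal sketch of such an append endpoint (the file and field names are placeholders); flock() keeps two simultaneous submissions from interleaving:

<?php
// Sketch: accept a posted line and append it to a flat file.
$line = trim($_POST['line'] ?? '');
if ($line !== '') {
    $fp = fopen('board.txt', 'a');
    if (flock($fp, LOCK_EX)) {                  // serialize concurrent writers
        fwrite($fp, htmlspecialchars($line) . "\n");
        flock($fp, LOCK_UN);
    }
    fclose($fp);
}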
Bonus question 1: How can the page be updated dynamically, so if A sends text while B is typing, B sees the text A sent on B's page immediately?
Also use an AJAX call to fetch new content (e.g., if you assign line numbers, the browser just tells the script the last line it read, and the script returns all new lines past that point).
I assume it will require a php call to the database to display the initial page, then another php request to send data, then another php request to display the new page.
Pretty much. But only send the last 50 lines or so of the file when the browser first visits it; you don't want to crash the browser.
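A sketch of the matching read side, combining both points above: on the first load the browser gets only the tail of the file, and after that it passes the last line number it has seen (the ?since parameter is a made-up name):

<?php
// Sketch: return lines after ?since=N, or just the last 50 lines
// on first load so a huge file doesn't crash the browser.
$lines = file('board.txt', FILE_IGNORE_NEW_LINES) ?: [];
$since = isset($_GET['since']) ? (int) $_GET['since'] : null;

$out = ($since === null)
    ? array_slice($lines, -50)      // first load: tail only
    : array_slice($lines, $since);  // everything past line N

header('Content-Type: application/json');
echo json_encode(['next' => count($lines), 'lines' => $out]);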
Bonus question 2: What sorts of issues might arise if this database grows extremely large (say, millions of words), and how might I address these up front? If necessary, I could show only a small chunk of the (text-only) database on any given page, then have pagination.
Think in terms of bytes, not words: millions of words means many megabytes of text, and you'll likely run into performance issues. You could cap file sizes, or split the storage into multiple files at a certain size, so you don't have to scan past content that will rarely be fetched.

Showing a changing php value with Javascript?

I've got a PHP script that continually grows an array as its results are updated. It executes for a very long time on purpose, as it needs to filter a few million strings.
As it loops through results, it prints out strings and fills up the page until the scroll bar is super tiny. Instead of printing out the strings, I want to show just the number of successful results, updating dynamically as the PHP script continues. I did echo(count($array)); and found the number at 1,232,907... 1,233,192... 1,234,874 and so forth, printed out on many lines.
So, how do I display this increasing php variable as a single growing number on my webpage with Javascript?
Have your PHP script store that number somewhere, then use AJAX to retrieve it every so often.
You need to find a way to interface with the process, to get the current state out of it. Your script needs to export the status periodically, e.g. by writing it to a database.
The easiest way is to write the status to a text file every so often and poll this text file periodically using AJAX.
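A sketch of that text-file approach ($strings and keep() stand in for your data and filter; status.txt is a made-up name). The long-running loop overwrites one small status file every few thousand items:

<?php
// worker.php (sketch): the long-running loop publishes its progress.
$matches = 0;
foreach ($strings as $i => $s) {
    if (keep($s)) {            // keep(): your filter, hypothetical
        $matches++;
    }
    if ($i % 5000 === 0) {     // overwrite the status file every 5000 items
        file_put_contents('status.txt', (string) $matches, LOCK_EX);
    }
}

And the tiny endpoint the page polls every few seconds:

<?php
// status.php (sketch): returns the current count for the AJAX poller.
header('Content-Type: application/json');
echo json_encode(['count' => (int) @file_get_contents('status.txt')]);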
You can use the Forever Frame technique. Basically, you have a main page containing an iframe. The iframe loads gradually, intermittently adding an additional script tag. Each script tag modifies the content of the parent page.
There is a complete guide available.
That said, there are many good reasons to consider doing more pre-computation (e.g. in a cron job) to avoid doing the actual work during the request.
This isn't quite what you're looking for (I'm as interested in an answer to this as you are), but a solution I've found works is to keep track of the count server-side and only print every 1000/5000/whatever number works best, rather than one by one.
I'd suggest having a PHP script that returns the value in JSON format. Then, from another PHP page, you can make an AJAX call to fetch the JSON value and display it. The AJAX call can be scheduled to run every 5 seconds or so, depending on how fast your numbers change. The iframe approach, though easier, is a bit outdated.

Is it possible to parse a web page from the client side for a large number of words and if so, how?

I have a list of keywords, about 25,000 of them. I would like people who add a certain <script> tag to their web page to have these keywords transformed into links. What would be the best way to go about achieving this?
I have tried the simple JavaScript approach (an array with lots of elements, regexing/replacing each one), and it obviously slows down the browser.
I could always process the content server-side if there was a way, from the client, to send the page's content to a cross-domain server script (I'm partial to PHP but it could be anything) but I don't know of any way to do this.
Any other working solution is also welcome.
I would have the remote site add a JavaScript file that uses AJAX to connect to your site and fetch a list of only specific terms. Which terms?
Categories: If this is for advertising (where this concept has been done a lot), let them specify what category their site falls into and group your terms into those categories, then send only those groups of terms. It would be in their best interest to choose the right categories, because the more links they have, the more income they can generate.
Indexing: If that wouldn't work, then the first time someone loads the page, your server could index a copy of it, matching all the words on their page against your terms; for any subsequent loads you would have a ready-made list of terms to send, based on what their page contains. Ideally, a background process would then re-index their pages every day or every few days to catch updates. You could also have the script take a hash of the page contents and, if it has changed at all, update your indexed copy, as in the sketch below.
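That hashing idea might look something like this (a sketch; the URL and storage path are placeholders):

<?php
// Sketch: re-index a client's page only when its content hash changes.
$html = file_get_contents('http://example.com/client-page');
$newHash = md5($html);

$oldHash = @file_get_contents('hashes/client-page.md5'); // stored on the last run
if ($newHash !== $oldHash) {
    // The page changed: re-index its words against your term list here,
    // then remember the new hash for next time.
    file_put_contents('hashes/client-page.md5', $newHash);
}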
I'm sure there are other methods; which is best is really just a matter of preference. Try looking at a few other advertising-link sites/scripts and see how they do it.

Find duplicate content using MySQL and PHP

I am facing a problem developing my web app; here is the description:
This web app (still in alpha) is based on user-generated content (usually short articles, although their length can become quite large, about one quarter of a screen). Every user submits at least 10 of these articles, so the number should grow pretty fast. By their nature, about 10% of the articles will be duplicates, so I need an algorithm to catch them.
I have come up with the following steps:
1. On submission, fetch the length of the text and store it in a separate table (article_id, length). The problem is that the articles are encoded using PHP's special_entities() function, and users post content with slight modifications (someone will miss a comma or an accent, or even skip some words).
2. Then retrieve all the entries from the database with a length in the range new_post_length ±5% (should I use another threshold, keeping in mind the human factor in article submission?).
3. Fetch the first 3 keywords and compare them against the articles fetched in step 2.
4. With a final array of the most probable matches, compare the new entry using PHP's levenshtein() function.
This process must be executed on article submission, not using cron. However, I suspect it will create a heavy load on the server.
Could you provide any ideas, please?
Thank you!
Mike
Text similarity/plagiarism/duplicate detection is a big topic. There are so many algorithms and solutions.
Levenshtein will not work in your case: you can only use it on small texts (due to its complexity, it would kill your CPU).
Some projects use "adaptive local alignment of keywords" (you will find info on that on Google).
Also, you can check this (Check the 3 links in the answer, very instructive):
Cosine similarity vs Hamming distance
Hope this will help.
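To make the cosine-similarity option concrete, here is a minimal bag-of-words sketch in PHP ($newArticle, $existingArticle, and the 0.9 threshold are placeholders; real systems normalize far more aggressively, e.g. stripping stop words):

<?php
// Sketch: cosine similarity between two texts over word-frequency vectors.
// Returns 1.0 for identical word distributions, 0.0 for no overlap.
function cosine_similarity($a, $b) {
    $va = array_count_values(str_word_count(strtolower($a), 1));
    $vb = array_count_values(str_word_count(strtolower($b), 1));

    $dot = 0.0;
    foreach ($va as $word => $n) {
        if (isset($vb[$word])) {
            $dot += $n * $vb[$word];
        }
    }
    $sq = function ($n) { return $n * $n; };
    $norm = sqrt(array_sum(array_map($sq, $va))) * sqrt(array_sum(array_map($sq, $vb)));

    return $norm > 0 ? $dot / $norm : 0.0;
}

// e.g. flag as a probable duplicate above some threshold
if (cosine_similarity($newArticle, $existingArticle) > 0.9) {
    // probable duplicate
}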
I'd like to point out that git, the version control system, has excellent algorithms for detecting duplicate or near-duplicate content. When you make a commit, it will show you the files modified (regardless of renames) and what percentage changed.
It's open source, and largely written in small, focused C programs. Perhaps there is something you could use.
You could design your app to reduce the load by not having to check text strings and keywords against all other posts in the same category. What if you had users submit the third-party content they are referencing as URLs? See Tumblr's implementation: basically, there is a free-form text field so each user can comment and create the narrative portion of their post, but then there are formatted fields depending on the type of reference the user is adding (video, image, link, quote, etc.). An improvement on Tumblr would be letting the user add as many or as few types of formatted content as they want in any given post.
Then you are only checking against known types like a URL or embedded video code. Combine that with rexem's suggestion to have users classify their posts by category or genre of some kind, and you'll have a much smaller scope to search for duplicates.
Also, if you give each user some way of posting to their own "stream", it doesn't matter if many people duplicate the same content. Give people a way to vote posts up from the individual streams to a main "front page" stream, so the community can regulate when they see duplicate items. Instead of an up/down vote like Digg or Reddit, you could add a way for people to merge/append posts to related posts (letting them sort and manage the content as an activity in your app rather than making it an issue of behind-the-scenes processing).
