URL Pattern Matching (PHP)? - php

(Programming Language: PHP v5.3)
I am working on this website where I make search on specific websites using google and bing search APIs.
The Project:
A user can select a website to search from a drop-down list. We have an admin panel on this website. If the admin wants to add a new website to the drop-down list, he has to provide two sample URLs from the site as shown below.
On the submit of form a code goes through input and generates a regex that we later use for pattern matching. The regex is stored in database for later use.
In a different form the visiting user selects a website from the drop-down list. He then enters the search "query" in a text box. We fetch results as JSON using search APIs(as mentioned above) where we use the following query syntax as search string:
"site:website query"
(where we replace "website" with the website user chose for search and replace "query" with user's search query).
The Problem
Now what we have to do is get the best match of the url. The reason for doing a pattern match is that some times there are unwanted links in search results. For example lets say I search on website "www.example.com" for an article names "abcd". Search engines might return these two urls:
1) www.example.com/articles/854/abcd
2) www.example.com/search/abcd
The first url is the one that I want. Now I have two issues to resolve.
1) I know that the code that I wrote to make a regex pattern from sample URLs is never going to be perfect considering that the admin adds websites on regular basis. There can never be enough conditions to check for creating a pattern for different websites from same code. Is there a better way to do this or regex is my only option?
2) I am developing on a machine running Windows 7 OS. preg_match_all() returns results here. But when I move the code to server which is running Linux OS, preg_match_all() does not return any results for the same parameters? I can't seem to get why that is happening. Anyone knows why is this happening?
I have been working on web technologies for only past few weeks, so I don't know if I have better options than regex. I would be very grateful if you could assist me or guide me towards resources where I can find solution for my problems.

About question 1:
I can't quite grasp what you're trying to accomplish so I can't give any valid opinion.
Regarding question 2:
If both servers are running the same version of PHP, the regex library used ought to be the same. You can test this, however, by making a mock static file or string to test against the regex and see if the results are the same.
Since you're grabbing results from the search engines and then parsing them, the data retrieve might not be the same. Google/Bing change part of the data regarding the OS you use and that might alter preg results.

Related

Search On Website using php variable - DOM Parsing

I want to search on website pragmatically using PHP like as we search on website manually, enter query on search box press search and result came out.
Suppose I want to search on this website by products names or model number that are stored in my csv file.
if the products number or model number match with website data then result page should be displayed ..
I search on below question but not able to implement.
Creating a 'robot' to fill form with some pages in
Autofill a form of another website and send it
Please let me know how we can do this PHP ..
Thanks
You want to create a “crawler” for websites.
There are some things to consider first:
You code will never be generic. Each site has proper structure and you can not assume any thing (Example: craigslist “encode” emails with a simple method)
You need to select an objective (Emails ? Items information ? Links ?)
PHP is by far one of the worst languages to do that.
I’ll suggest using C# and the library called AgilityHtmlPack. It allows you to parse HTML pages as XML documents (So you can do XPath expressions and more to retrieve information).
It surely can be done in PHP, but I think it will take at least 10x time in php compared to c#.

How to compare strings based on caracters similarity in SQL?

I'm working on redirecting people if they type a "not really wrong url".
For example I have a good URL http://www.website.com/category/foo-bar-if-bar-foo/.
This one works so if a user enter to my website with it, I can retrieve the article corresponding.
But if someone enter to my website with a not really wrong url like http://www.website.com/category/foo-bar-foo/ because an another website has referenced a wrong url, I should redirect him to the right one instead of having a 404 status code...
So how should I do this? and Most important, should I do this ?
I actually use Eloquent with Laravel 4.2.
Thank you in advance.
EDIT
I was wrong about stackoverflow, thanks for your comment. It uses the unique ID of a post.
EDIT 2
I Looked at SOUNDEX function in SQL, it's really good if there is a small difference like a character or two missing. But if my url is as broken as my example, it's not working anymore obviously. But thanks it's gonna be usefull.
Just thinking off the top of my head, you could create a SQL table (with Full-Text indexing enabled) containing all your paths (it might already exist).
In the event that a 404 is triggered, hijack that and do a MATCH (Full Text Search) and return the path with the highest scoring MATCH (you can also consider using a score threshold to prevent nonsensical matches).

AmazonECS php not finding products when code moved to alternate domain

I'm running AmazonECS.class.php on a site, but when I copy the code over to a new website, the api returns no results for the same search string that works on the first site. The api seems to be working, I can var_dump the $response and it has all the fields, but the results are "nothing found".
Are the AWS_API_KEY or AWS_API_SECRET_KEY tied to a domain? Do I have to apply for new ones, and if I do, will the old ones still work?
Incredibly, the problem was that the search texts were different in a small way. The one that worked spelled out a state name and the one that didn't used a state abbreviation. It fails on a manual search on Amazon's website also. Why on earth isn't their search more robust and fuzzy?

Regex PHP tweet filter separate into columns

I'm working on my school project about showing public transport delays in a table on my website.
I'm crawling a twitter profile, using Rstudio, Postgreql, PHP, cmd
I'm trying to filter tweets from a twitter profile(public transport profile, showing delays etc.). twitter.com/TotalOVNL
I'm using php with postgresql . I already have the tweets in a table on my website using SQL Query see link for screenshot http://prntscr.com/385squ . I would like to use Regex to filter the tweettext. I dont want to show all of the tweets' text from the tweet, just the ones containg delays and seperate it into few collumns. Ive made in paint an example of the table that i would like to have on my website, see the link for screenshot http://prntscr.com/3866u6.
I know that i have to use regex, but dont know the language that well.
Would someone be able to help me?
There doesn't seem to be a pattern in the last two (pt number and delay).
I tried to find the best pattern matching possible, hope this helps. Look at the match groups in the bottom right to find the matches.
#(\w+)\s#(\w+)\sis op (\d+-\d+)\s(\d+:\d+) (\w+)lijn
Example on some of the tweets on regex101.

Which third party search engine (free) should I use?

As the title says, I need a search engine... for mysql searching.
My website is PHP based.
I was going with sphinx but my hosting company doesn't support full-text indexes!
So a search engine to be used without full-text!
It should be pretty powerful, and must include atleast these functions below:
When searching for 'bmw 520' only matches where these two words come in exactly this order is returned. not matches for only 'bmw' or only '520'.
When searching for 'bmw 330ci' results as the above will be returned, but, WITH AND WITHOUT the ci extension. There are a nr of extensions in cars as you all know (i, ci, si, fi etc).
I want the 'minus sign' to 'exclude' all returns containing the word after the sign, ex: 'bmw -330' will return all 'bmw' results without the '330' ones. (a NOT instead of minus sign is also ok)
all special character accents like 'é' are converted to their simple values, in this case 'e'.
list of words to ignore completely in the search
Thanks guys!
The Zend_Lucene search competent works fairly well. I am not sure how it would cope with your second requirement, however if you customized the tokenized you should be able to do it by treating a change from letters to numbers as a new word.
The one I am really not sure about is the top requirement. Given how it is indexed, order becomes irreverent in the search, so you may not be able to do it without heavy editing of Lucene, writing a filter (using lucene to pull the matches, then checking the order), or writing your own solution. All of these will slow the search down, and add load to your server.
There is also solr, but I have never used it and don't know anything about it. Sphinx was another one, but I see you have already ruled that out.
Xapian is very good (very comprehensive) if you have the time for the initial setup.
It functions as you would expect a search engine to work, tell the indexer what bits of information to index under what namespace/table/object (Page, Profile, Products etc), then issue a query for your users based on keywords, it also supports google style tags e.g. "profile:Mark icecream" would search my profile for the word icecream, i seem to remember it supporting ranges too for data you specify as numeric.
Can be used in local mode which can offer spelling modifications (Did you mean?), or remote mode that many sites can index to and query from.
What really saved me one time was the ability to attach transient non searchable data to an indexed item, e.g. attaching the DB id to all data indexed for that record, very good for then going and getting the whole record from the DB when your matches come back from xapian.
I have used a couple of Search Engines on my site during it's time, but in the next rebuild I'm planning to move to Google Site Search.
There are several reasons for this:
Users are very familiar with the Google style of search result listings which improves usability and hence click-through rates
The Google engine is very good at guessing when to use the page description and when to use a fragment of the page (it also very good at getting relevant fragments compared to some other engines)
It's used by thousands of very popular websites
Google is the most popular search engine around so you know their technology is both reliable and accurate
Google Site Search begins at $100 per annum for 1000 pages or less (and a limit on queries)
or you can use the free Google Custom Search Engine (but this has much less customizability)

Categories