MySQL exact URL search - php

So I'm trying to merge two databases of company information (Table A and Table B from here on out) where the most common (and reliable) single reference point is the website URL. Table A is up-to-date, and Table B is to be updated.
I've extracted the URLs from Table A and cleaned them up using PHP (about 6000 URLs) and the plan is to find and update some information in Table B based on the URLs found (but not the URL itself).
In Table A the URLs are all either domain.com or www.domain.com or www.subdomain.domain.com without http:// or any trailing /'s or other URL data. In Table B they are raw URLs which might contain any extra information with them such as http:// etc.
Now I've tried searching for the company by the URL in Table B like so:
SELECT * FROM companies WHERE website LIKE '%$url1%' OR website LIKE '%$url2%'...
While this works, it is also pulling out information that isn't correct. For example, I don't have bt.com (or any variation of) in the list from Table A, yet it is matching on it in Table B (there is a www.corporate.bt.com in Table A which I think it is matching on).
So, how can I stop this from happening? It's clearly finding something LIKE it in the URL list, but I only want to match on the exact string. So in the example above, if I'm searching for www.corporate.bt.com it should only return that if it finds it within a string (http://www.corporate.bt.com/ is fine, http://bt.com/ is not)
Also, what would be the best possible way of performing this action with a dataset this large? Table A has around 6,000 URLs, Table B has 14,000 (not all of Table A will be in Table B).

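LIKE matches substrings, so it won't give you an exact match. You can use MySQL's REGEXP with word-boundary markers instead, which only matches when the URL appears as a complete token in the searched field (on MySQL 8.0.4 and later, which uses ICU regular expressions, write \\b in place of [[:<:]] and [[:>:]]):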
SELECT * FROM companies WHERE website REGEXP '[[:<:]]$url1[[:>:]]' OR
website REGEXP '[[:<:]]$url2[[:>:]]'
Or, if the field holds only a single bare URL, you can use the = operator:
SELECT * FROM companies WHERE website = '$url1' OR website = '$url2'
UPDATE
You can extend the REGEXP search so that you only pass in the server name (e.g. domain.com, domain1.com, abc.domain.com); see the query below:
$url = "domain.com";
$url1 = "domain1.com";
SELECT * FROM companies WHERE
website REGEXP '^(htt(p|ps):\/\/|htt(p|ps):\/\/www\.)($url)$' OR
website REGEXP '^(htt(p|ps):\/\/|htt(p|ps):\/\/www\.)($url1)$'

So it turns out that I hadn't filtered the list of addresses in Table A well enough, and a URL of 'http' had slipped through, which meant that every URL containing 'http' was being found...
So I added another filter which checked for the presence of a . in the URL, which ensured that it was at least something.something
if (strpos($domain, ".") !== false) {
// It has a .
}
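One way to avoid substring false positives altogether is to reduce each raw Table B URL to its bare host name in PHP and compare that for equality against the cleaned Table A list. The following is only a minimal sketch of that idea, not the approach from the answer above; the helper names extractHost and matchHosts and the sample row shape (a 'website' key) are assumptions made for illustration.

```php
<?php
// Hypothetical helper: reduce a raw URL such as "http://www.corporate.bt.com/about"
// to its lowercase host name, or return null if no plausible host is found.
function extractHost(string $rawUrl): ?string
{
    $rawUrl = trim($rawUrl);
    if ($rawUrl === '') {
        return null;
    }
    // parse_url() only reports a host when a scheme is present, so add one if missing.
    if (!preg_match('#^[a-z][a-z0-9+.-]*://#i', $rawUrl)) {
        $rawUrl = 'http://' . $rawUrl;
    }
    $host = parse_url($rawUrl, PHP_URL_HOST);
    if (!is_string($host) || strpos($host, '.') === false) {
        return null; // rejects junk such as "http" or "N/A"
    }
    return strtolower($host);
}

// Hypothetical helper: return the Table B rows whose host is exactly one of the Table A hosts.
function matchHosts(array $tableAHosts, array $tableBRows): array
{
    // Flip to a hash map so each lookup is a key check instead of a scan of ~6,000 entries.
    $wanted = array_flip(array_map('strtolower', $tableAHosts));
    $matches = [];
    foreach ($tableBRows as $row) {
        $host = extractHost($row['website']);
        if ($host !== null && isset($wanted[$host])) {
            $matches[] = $row;
        }
    }
    return $matches;
}
```

With this comparison, www.corporate.bt.com matches http://www.corporate.bt.com/ but not http://bt.com/. Note that domain.com and www.domain.com are treated as different hosts, so both forms would need to be in the Table A list if both should match.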

Related

CodeIgniter Query on table without primary key

For some customers I am working on a project which is using a MySQL database.
I need to implement a search functionality which should be able to search in the database the devices with all the features selected. I am using CodeIgniter. The problem is the structure of the table.
I've found out that the table contains 2 columns: ID_D (the id of the device) and ID_F (the id of the feature). Basically the table doesn't contain any primary key (that's why I cannot execute any join at all).
So, it's also possible that a device id can appear in 10 rows for each feature it has. When I execute the search, I have a list of the features ID and I should be able to read only the devices with all the features selected.
if (isset($feature_array)) {
foreach($feature_array as $key => $row) {
$this->db->where('f_id',$row['f_id']);
}
}
Naturally, something like that won't work. Any ideas?
I think this might solve your problem, in case the feature ids are unique and do not contain spaces. In case they do contain spaces you should select a separator, that is not in the range of the feature ids.
Basically the code uses the MySQL GROUP_CONCAT function to concatenate all feature ids of a device and matches the resulting string against every searched feature. This finds all devices that support at least the given set of features.
CI itself does not support that function, so it is added through a workaround in the select method.
Also, this might be a bit load-intensive if it is called on a large table. Maybe someone else has a faster solution?
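// Passing FALSE as the second argument stops CodeIgniter from escaping the GROUP_CONCAT expression.
$this->db->select("ID_D, GROUP_CONCAT(ID_F SEPARATOR ' ') AS id_f_concatenated", FALSE);
foreach ($feature_array as $row) {
    $id = (int) $row['id_f'];
    // HAVING rather than WHERE: the alias only exists after grouping.
    // The id must be delimited by a space or the string boundary on both sides.
    $this->db->having("id_f_concatenated REGEXP '(^| ){$id}( |$)'", NULL, FALSE);
}
$this->db->group_by('ID_D');
$result = $this->db->get('tablename');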
EDIT:
Changed the LIKE to a REGEXP to make sure that the feature id 112 is not matched by the id 12, without making the expression too complicated.
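A different technique that avoids string matching is to count the matching feature rows per device and keep only the devices whose count equals the number of requested features. This is a hedged sketch of that alternative, not the code from the answer above; the function name buildAllFeaturesQuery is hypothetical, the feature ids are assumed to be integers, and the table name is assumed to come from trusted code rather than user input.

```php
<?php
// Hypothetical helper: build a query returning the devices that have ALL given features.
// Assumes a table with columns ID_D and ID_F and integer feature ids.
function buildAllFeaturesQuery(string $table, array $featureIds): string
{
    // Cast to int (guards against injection) and drop duplicates so the count is right.
    $ids = array_values(array_unique(array_map('intval', $featureIds)));
    if ($ids === []) {
        throw new InvalidArgumentException('At least one feature id is required.');
    }
    $list = implode(', ', $ids);
    $count = count($ids);
    // A device qualifies only if it has one row for every requested feature.
    return "SELECT ID_D FROM {$table} "
         . "WHERE ID_F IN ({$list}) "
         . "GROUP BY ID_D "
         . "HAVING COUNT(DISTINCT ID_F) = {$count}";
}
```

In CodeIgniter the resulting string could be passed to $this->db->query(). Because the comparison is on whole ids, 12 can never match 112.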

get row in mysql using data from friendly url and using regular expression

I need to write code for getting data from a database by a friendly url.
I have a table company with the field title for storing some info about a company. I want to get data by title using friendly url. E.g. example.com/company/aurum-1. First I tried to change some defined symbols to - :
function seoUrl2CompanyName($string) {
$string = preg_replace("/[-ecsui]/","%",$string);
return $string."%";
}
[-ecsui] is used because my native language has non-standard symbols like šįėęų which I can't use in my friendly URL, so I tried to change them to % and use the following MySQL query to find the company by title:
$SQL = "SELECT * FROM company
WHERE title LIKE '".seoUrl2CompanyName($_GET['company'])."'";
But with this logic I run into trouble when the SELECT returns more than one row. E.g.
example.com/company/aurum -> seoUrl2CompanyName('aurum') -> a%r%m% ->
LIKE 'a%r%m%' -> 24 rows in my table match this pattern
My goal is to find the fastest way to look up a company in the company table by name, using data from the URL.
I would take the suggestion from @AgeDeO but expand your SQL so that it matches both the company name and the ID you get from your URL:
$SQL = "SELECT *
FROM company
WHERE title LIKE '".seoUrl2CompanyName($_GET['company'])."'
AND ID = ".(int) $myId;
With these two conditions you should get only one row, and you can be sure that nobody can simply replace the 1 with a 2 and get another company's data.
NEVER EVER DO THIS:
I want get data by title
What are you going to do when the company name changes? What when two companies have the same name? What happens with spaces, special characters etc...
I understand that you want friendly urls and that is possible, just add the company name as dummy data in the url. Show the company name but do not use it.
Use example.com/1/company/aurum-1 instead, where the 1 is the actual company id.
Beware that it is fairly easy to guess other companies' URLs this way: if I change the 1 to a 2, I could access another company. If you do not want this, make sure you check permissions on page load.
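As a rough sketch of the id-in-URL idea, the numeric id can be pulled out of the path and the trailing name ignored. The function name companyIdFromPath is hypothetical, and the /ID/company/name layout is taken from the example URL above.

```php
<?php
// Hypothetical helper: pull the numeric company id out of a path like "/1/company/aurum-1".
// The trailing name is decorative and deliberately ignored.
function companyIdFromPath(string $path): ?int
{
    if (preg_match('#^/(\d+)/company(?:/[^/]*)?/?$#', $path, $m)) {
        return (int) $m[1];
    }
    return null; // path does not follow the expected layout
}
```

The returned integer would then go into a lookup by ID (ideally through a prepared statement), followed by whatever permission check the page needs.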

MySQL Query to Replace http:// and www in website field

I have a column in my database for website URL and there are many different types of results. Some with www, some with http:// and some without.
I really need to clean up the field, so is there a query I can run to:
Replace all domains with just domain.com format. So remove any www or http://'s
If there are any fields with an invalid format like "N/A" or something, so anything without a ".", I need to empty them.
And then of course I will update my PHP code to automatically strip it from now on. But for the current entries I need to clean those up.
You can use the REPLACE function to achieve your first point - see the other answers for this. However, I would seriously consider leaving www in the entries as is; because, as the first comment points out, there are actual differences. You might also miss url's like www2.domain.com for example. If you wanted to display them in your app, you can simply remove them in the text presentation (by substringing after the first '.' for example) but leave the href consistent (if displayed as links).
Your second point can be achieved using the INSTR or LOCATE functions.
Simply:
UPDATE table SET url = 'N/A' WHERE LOCATE('.', url) = 0
Read more about both functions here
UPDATE table SET column = REPLACE(column, 'http://', '');
UPDATE table SET column = REPLACE(column, 'www.', '');
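For the PHP side mentioned in the question (stripping new entries automatically), a minimal sketch might look like the following. The function name cleanWebsite is hypothetical, and handling https:// as well as http:// is an assumption beyond what the question lists. Unlike REPLACE, which replaces every occurrence in the string, the patterns here are anchored to the start of the value.

```php
<?php
// Hypothetical helper: strip a leading scheme and a leading "www." label,
// and return an empty string for values without a dot (e.g. "N/A").
function cleanWebsite(string $value): string
{
    $value = strtolower(trim($value));
    $value = preg_replace('#^https?://#', '', $value); // leading http:// or https:// only
    $value = preg_replace('#^www\.#', '', $value);      // leading "www." only, so www2. survives
    $value = rtrim($value, '/');
    return strpos($value, '.') === false ? '' : $value;
}
```

Whether to strip www at all is a judgment call, for the reasons given in the first answer above.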

user-friendly URLs reliable with the database

I have a database containing a table named songs with a field title.
Now If my url is http://www.foo.com/songs/xxx (xxx = the title of the song),
apache is silently redirecting to a page that looks similar to : /song.php?title=xxx.
To tidy up the URLs I convert spaces into underscores (because I know some browsers display %20 instead of a space, not%20really%20user%20friendly%20ya%20know%20what%20i%20mean).
There's a snag: if the title contains both spaces and underscores (e.g. DJ_underscore fx) and the script converts it into DJ_underscore_fx, the SQL:
select * from songs where songs.title=xxx
can't find it.
here's the sketch to be more specific:
a script fetches the different titles in the database
converts all the spaces into underscores ( e.g. name_of the song -> name_of_the_song )
echoes them as links ( e.g. name_of_the_song )
the user clicks on the link and requests the document
apache silently redirects ( e.g. /songs/name_of_the_song -> /song.php?title=name_of_the_song )
song.php fetches the specific data ( e.g. select * from songs where songs.title=name_of_the_song )
ok you see that there's no entry in the database that looks like name_of_the_song but name_of the song.
How can I manage the whole so that my URL remains clear and the title field is not restricted to a certain amount of values (can have spaces, underscore, dashes, well anything)?
Use something like /1234/name-of-page/ where 1234 is the primary key ID of the row and name-of-page is ignored by your script.
This gives a link directly to the primary key of the entry in the table, which will give you several benefits:
No need to have duplicate ID fields.
Fast indexing on SELECT queries.
You still get the readability and SEO benefits of a "pretty" URL.
You might notice that StackOverflow itself does exactly this:
/questions/8211267/user-friendly-urls-reliable-with-the-database/
Which probably gets re-written to something like:
question.php?id=8211267
Just add another field that will keep the exact name used in URL. And when you have some "duplicates" - just append them with _2, _3 etc or give a way for user to edit and give another name manually.
What you're trying to achieve is definitely the wrong way: you could have hundreds of variations to look up in your database, and it is also bad for SEO.
Start by setting a rule that all URLs use _ in place of spaces; that's how most site URLs are done (digg.com being an example).
Then create a separate field that stores the URL, e.g.
title | url
song name | song_name
Then do your lookup based on the URL field.
For legacy reasons you could also replace any spaces with _ in your lookup script when you receive the title from the GET before doing the database query.
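The separate URL field, together with the _2, _3 suffix idea from the earlier answer, could be sketched like this. The function name makeUniqueSlug, the lowercase ASCII-only rule, and the 'untitled' fallback are assumptions made for illustration; titles with non-ASCII letters would need transliteration first.

```php
<?php
// Hypothetical helper: turn a title into a value for the "url" column and make it
// unique against the values already stored there.
function makeUniqueSlug(string $title, array $existingSlugs): string
{
    // Collapse every run of characters other than a-z and 0-9 into one underscore.
    $base = trim(preg_replace('/[^a-z0-9]+/', '_', strtolower($title)), '_');
    if ($base === '') {
        $base = 'untitled'; // assumed fallback for titles with no usable characters
    }
    $slug = $base;
    $suffix = 2;
    // Append _2, _3, ... until the value is not taken.
    while (in_array($slug, $existingSlugs, true)) {
        $slug = $base . '_' . $suffix;
        $suffix++;
    }
    return $slug;
}
```

Because the stored value is what gets looked up, the original title can keep its spaces, underscores, and dashes untouched.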
Well, if you want spaces in the URL, they will be URI-encoded in transit. Rather than replacing every _ with a space, just use a URI decoder (can't remember the exact name); that would still allow spaces to be typed. For the text shown in the link, can't you do a str_replace to convert %20 into spaces?
Either that, or have a computer-friendly version of the title (that uses underscores instead of spaces) and a user-friendly column that does have the spaces.

Two-part MySQL question: Accessing specific MySQL row, and column performance

I have a table with about 150 websites listed in it with the columns "site_name", "visible_name" (basically a formatted name), and "description." For a given page on my site, I want to pull site_name and visible_name for every site in the table, and I want to pull all three columns for the selected site, which comes from the $_GET array (a URL parameter).
Right now I'm using 2 queries to do this, one that says "Get site_name and visible_name for all sites" and another that says "Get all 3 fields for one specific site." I'm guessing a better way to do it is:
SELECT * FROM site_list;
thus reducing to 1 query, and then doing the rest post-query, which brings up 2 questions:
The "description" field for each site is about 200-300 characters. Is it bad from a performance standpoint to pull this for all 150 sites if I'm only using it for 1 site?
How do I reference the specific row from the MySQL result set for the site specified in the URL? For example, if the URL is "mysite.com/results?site_name=foo" how do I do the post-query equivalent of SELECT * FROM site_list where site_name=foo; ?
I don't know how to get the data for "site_name=foo" without looping through the entire result array and checking to see if site_name matches the URL parameter. Isn't there a more efficient way to do it?
Thanks,
Chris
PS: I noticed a similar question on stackoverflow and read through all the answers but it didn't help in my situation, which is why I'm posting this.
I believe what you do now, keeping separate queries for the list of sites with just titles and for the detailed view with the description of a single given site, is good. You don't pull any unneeded data, and both queries are very simple and therefore fast.
It is possible to combine both your queries into one, using left join, something maybe like:
SELECT s1.site_name, s1.visible_name, s2.description
FROM site_list s1
LEFT JOIN
( SELECT site_name, description
FROM site_list
WHERE site_name = 'this site should go with description' ) s2
ON s2.site_name = s1.site_name
resulting in all sites without matching name having NULL as description, you could even sort it using
ORDER BY description DESC, site_name
to get the site with the description as the first fetched row, thus eliminating the need to iterate through the results to find it. But MySQL would have to do a lot more work to give you this result, negating any possible gain you could hope for. So basically stick to what you have now; it's good.
Generally, it's good practice to have an 'id' field in the table as an auto_increment value. Then, you would:
SELECT id,url,display_name FROM table;
and you'd have the 'id' value there to later:
SELECT * FROM table WHERE id=123;
That's probably your most efficient method if you had WAAAY more entries in the table.
However, with only 150 rows in the table, you're probably just fine doing
SELECT * FROM table;
and only accessing that last field for a matching row based on your criteria.
If you only need the description for the site named foo you could just query the database with SELECT * FROM site_list WHERE site_name = 'foo' LIMIT 1
Otherwise you would have to loop though the result array and do a string comparison on site_name to find the correct description.
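If the single SELECT * route is taken, the search loop can be avoided by re-keying the fetched rows by site_name once. This is a small sketch using made-up sample rows in place of a real result set; the helper name findSite is hypothetical.

```php
<?php
// Sample rows standing in for the result of "SELECT * FROM site_list" (made-up data).
$rows = [
    ['site_name' => 'foo', 'visible_name' => 'Foo', 'description' => 'About foo'],
    ['site_name' => 'bar', 'visible_name' => 'Bar', 'description' => 'About bar'],
];

// Re-key the rows by site_name so a single site can be read without a search loop.
$bySiteName = array_column($rows, null, 'site_name');

// Hypothetical helper: return the row for the requested site, or null if it is unknown.
function findSite(array $bySiteName, string $requested): ?array
{
    return $bySiteName[$requested] ?? null;
}

// In a real page the requested name would come from $_GET['site_name'].
$selected = findSite($bySiteName, 'foo');
```

The full list is still available in $rows for rendering site_name and visible_name, while $selected carries the description for the chosen site.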
