I have a column in my database for website URL and there are many different types of results. Some with www, some with http:// and some without.
I really need to clean up the field, so is there a query I can run to:
Replace all domains with just the domain.com format, i.e. remove any www or http:// prefix.
If there are any fields with an invalid format like "N/A", i.e. anything without a ".", I need to empty them.
And then of course I will update my PHP code to automatically strip it from now on. But for the current entries I need to clean those up.
You can use the REPLACE function to achieve your first point; see the other answers for this. However, I would seriously consider leaving www in the entries as is because, as the first comment points out, there are actual differences. You might also miss URLs like www2.domain.com, for example. If you want to display them in your app, you can simply remove the prefix in the text presentation (by taking the substring after the first '.', for example) but leave the href consistent (if displayed as links).
Your second point can be achieved using the INSTR or LOCATE functions.
Simply:
UPDATE `table` SET url = '' WHERE LOCATE('.', url) = 0;
Read more about both functions in the MySQL documentation on string functions.
UPDATE table SET column = REPLACE(column, 'http://', '');
UPDATE table SET column = REPLACE(column, 'www.', '');
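For stripping new entries automatically going forward, the same rules can be sketched in application code. The question mentions PHP; the sketch below uses Python purely for illustration. The function name clean_domain, the lower-casing, and the removal of any path are assumptions, not part of the original answers. Unlike a plain REPLACE, it only removes a www. at the very start of the value, so www2.domain.com is left alone.

```python
import re

def clean_domain(raw: str) -> str:
    """Return a bare 'domain.com'-style host, or '' if the value has no dot."""
    value = raw.strip().lower()  # lower-casing is an assumption
    # Drop a leading scheme such as http:// or https://
    value = re.sub(r'^[a-z][a-z0-9+.-]*://', '', value)
    # Drop a single leading "www." only (www2.domain.com is left alone)
    if value.startswith('www.'):
        value = value[4:]
    # Remove any path, query string, or fragment (also an assumption)
    value = re.split(r'[/?#]', value, maxsplit=1)[0]
    # Anything without a "." is treated as invalid and emptied
    return value if '.' in value else ''

for sample in ['http://www.domain.com/', 'www2.domain.com', 'N/A']:
    print(repr(sample), '->', repr(clean_domain(sample)))
```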
The table I have to update (images_list is the name of the table) has over 500 rows containing a certain link, which I have to replace with a URL pointing to a local folder.
For example, a field will contain www.google.com/img/test-more-text.gif and this has to be replaced with /image/test-more-text.gif. The prefix is exactly the same for each row; the only variable part is the image name (test-more-text.gif in the example above).
I've looked up multiple tutorials, but the only things I can find replace the complete field, whereas I need to keep the suffix, so to speak.
Each image obviously has a different name as well, so I can't simply do
UPDATE images_list
SET image_link = '/image/test-more-text.gif'
WHERE image_link = 'www.google.com/img/test-more-text.gif'
I know how to lookup text with the LIKE statement but I've never had to update something like this before.
If anyone knows how to do this, that would save me a ton of work.
Use the REPLACE function:
UPDATE images_list
SET image_link = REPLACE(image_link, 'www.google.com/img/', '/image/')
WHERE image_link LIKE 'www.google.com/img/%';
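To check the statement before running it against real data, it can be tried on a throwaway table. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption; REPLACE and LIKE behave the same way for this case), and the second and third sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE images_list (image_link TEXT)')
conn.executemany(
    'INSERT INTO images_list (image_link) VALUES (?)',
    [('www.google.com/img/test-more-text.gif',),
     ('www.google.com/img/another.gif',),      # hypothetical row
     ('/image/already-local.gif',)],           # hypothetical row, must stay untouched
)

# Only rows with the old prefix are touched, and everything
# after the prefix (the image name) is kept.
conn.execute(
    "UPDATE images_list "
    "SET image_link = REPLACE(image_link, 'www.google.com/img/', '/image/') "
    "WHERE image_link LIKE 'www.google.com/img/%'"
)

rows = [r[0] for r in conn.execute('SELECT image_link FROM images_list ORDER BY rowid')]
print(rows)
```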
So I'm trying to merge two databases of company information (Table A and Table B from here on out) where the most common (and reliable) single reference point is the website URL. Table A is up-to-date, and Table B is to be updated.
I've extracted the URLs from Table A and cleaned them up using PHP (about 6000 URLs) and the plan is to find and update some information in Table B based on the URLs found (but not the URL itself).
In Table A the URLs are all either domain.com or www.domain.com or www.subdomain.domain.com without http:// or any trailing /'s or other URL data. In Table B they are raw URLs which might contain any extra information with them such as http:// etc.
Now I've tried searching for the company by the URL in Table B like so:
SELECT * FROM companies WHERE website LIKE '%$url1%' OR website LIKE '%$url2%'...
While this works, it is also pulling out information that isn't correct. For example, I don't have bt.com (or any variation of) in the list from Table A, yet it is matching on it in Table B (there is a www.corporate.bt.com in Table A which I think it is matching on).
So, how can I stop this from happening? It's clearly finding something LIKE it in the URL list, but I only want to match on the exact string. So in the example above, if I'm searching for www.corporate.bt.com it should only return that if it finds it within a string (http://www.corporate.bt.com/ is fine, http://bt.com/ is not)
Also, what would be the best possible way of performing this action with a dataset this large? Table A has around 6,000 URLs, Table B has 14,000 (not all of Table A will be in Table B).
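LIKE won't give you an exact match, but you can use MySQL's REGEXP with word-boundary markers, which only matches the URL as a whole word inside the searched field. Note that the [[:<:]] and [[:>:]] markers are only supported before MySQL 8.0.4; later versions use \\b instead.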
SELECT * FROM companies WHERE website REGEXP '[[:<:]]$url1[[:>:]]' OR
website REGEXP '[[:<:]]$url2[[:>:]]'
Or, if the field holds only a single URL, you can use the = operator:
SELECT * FROM companies WHERE website = '$url1' OR website = '$url2'
UPDATE
Here you can extend the REGEXP search and pass in only the server name, e.g. domain.com, domain1.com, abc.domain.com; see the query below.
$url = "domain.com";
$url1 = "domain1.com";
SELECT * FROM companies WHERE
website REGEXP '^(htt(p|ps):\/\/|htt(p|ps):\/\/www\.)($url)$' OR
website REGEXP '^(htt(p|ps):\/\/|htt(p|ps):\/\/www\.)($url1)$'
So it turns out that I hadn't filtered the list of addresses in Table A well enough, and a URL of 'http' had slipped through, which meant that every URL containing 'http' was being found...
So I added another filter which checked for the presence of a . in the URL, which ensured that it was at least something.something
if (strpos($domain, ".") !== false) {
// It has a .
}
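Putting the pieces of this thread together, one possible approach for a dataset of this size is to reduce every raw Table B URL to its host name once and then compare hosts exactly, rather than running thousands of LIKE clauses. The sketch below is in Python for illustration only; host_of and match_companies are hypothetical names, and it assumes that www.domain.com and domain.com should count as different hosts, as in the Table A data described above.

```python
from urllib.parse import urlsplit

def host_of(raw_url: str) -> str:
    """Extract the lower-cased host name from a raw URL, or '' if there is none."""
    raw_url = raw_url.strip()
    # urlsplit only recognises the host when the string has a scheme or starts with //
    if '://' not in raw_url:
        raw_url = '//' + raw_url
    host = urlsplit(raw_url).hostname or ''
    # Same sanity check as above: a usable host contains at least one "."
    return host if '.' in host else ''

def match_companies(table_a_hosts, table_b_urls):
    """Return the Table B URLs whose host is exactly one of the Table A hosts."""
    wanted = {h.lower() for h in table_a_hosts if '.' in h}
    return [u for u in table_b_urls if host_of(u) in wanted]

table_a = ['www.corporate.bt.com', 'http']  # 'http' is the stray entry that slipped through
table_b = ['http://www.corporate.bt.com/', 'http://bt.com/']
print(match_companies(table_a, table_b))
```

Because the lookup uses a set, each Table B row costs one host extraction and one hash lookup, regardless of how many URLs Table A contains.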
I have a database containing a table named songs with a field title.
Now if my URL is http://www.foo.com/songs/xxx (xxx = the title of the song),
apache is silently redirecting to a page that looks similar to : /song.php?title=xxx.
To embellish the URLs I convert spaces into underscores (cause I know some browser display %20 instead of space, not%20really%20user%20friendly%20ya%20know%20what%20i%20mean).
There's a snag cause if the title contains spaces and underscores (e.g. DJ_underscore fx) and the script converts it into DJ_underscore_fx the sql :
select * from songs where songs.title=xxx
can't find it.
here's the sketch to be more specific:
a script fetches the different titles in the database
converts all the spaces into underscores ( e.g. name_of the song -> name_of_the_song )
echo them as links ( e.g. name_of_the_song )
the user clicks on the link and requests the document
apache is silently redirecting ( e.g. /songs/name_of_the_song -> /song.php?title=name_of_the_song )
song.php fetches the specific data ( e.g. select * from songs where songs.title=name_of_the_song )
OK, you see that the database has no entry that looks like name_of_the_song, only name_of the song.
How can I manage the whole so that my URL remains clear and the title field is not restricted to a certain amount of values (can have spaces, underscore, dashes, well anything)?
Use something like /1234/name-of-page/ where 1234 is the primary key ID of the row and name-of-page is ignored by your script.
This gives a link directly to the primary key of the entry in the table, which will give you several benefits:
No need to have duplicate ID fields.
Fast indexing on SELECT queries.
You still get the readability and SEO benefits of a "pretty" URL.
You might notice that StackOverflow itself does exactly this:
/questions/8211267/user-friendly-urls-reliable-with-the-database/
Which probably gets re-written to something like:
question.php?id=8211267
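A minimal sketch of that routing idea, written in Python for illustration: the pattern and the parse_id name are assumptions, and in the original setup the equivalent work would be done by the Apache rewrite rule and song.php.

```python
import re

# Matches paths such as /1234/name-of-page/ ; the slug part is optional and ignored.
ROUTE = re.compile(r'^/(\d+)(?:/[^/]*)?/?$')

def parse_id(path: str):
    """Return the numeric primary key from the path, or None if it does not match."""
    m = ROUTE.match(path)
    return int(m.group(1)) if m else None

print(parse_id('/1234/name-of-page/'))
print(parse_id('/1234/some-other-title/'))
print(parse_id('/name-of-page/'))
```

Only the number is used for the lookup, so the title can contain any characters and can even change later without breaking old links.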
Just add another field that will keep the exact name used in URL. And when you have some "duplicates" - just append them with _2, _3 etc or give a way for user to edit and give another name manually.
What you're trying to achieve is definitely the wrong way: you could have hundreds of variations to look up in your database, and it is also bad for SEO.
Start by setting a rule that all URLs use _ in place of spaces, which is how most sites' URLs are done (digg.com being an example).
Then create a separate field that stores the URL, e.g.
title | url
song name | song_name
Then do your lookup based on the URL field.
For legacy reasons you could also replace any spaces with _ in your lookup script when you receive the title from the GET before doing the database query.
Well, if you want spaces in the URL, they will be URI-encoded in transit. Rather than replacing every _ with a space, just run the title through a URL decoder (urldecode() in PHP); that still allows spaces to be typed. For the text shown in the link, can't you use str_replace to convert %20 into spaces?
Either that, or have a computer-friendly version of the title (one that uses underscores instead of spaces) and a user-friendly column that does have the spaces.
I have had this problem for a while.
Let's say we have a movies website,
and we have a movie named Test-movies123! in the database.
Now what I would do is make a URL watch/test-movie123-{$id}/ and then query the DB with the ID.
The issue with this is that the ID shouldn't be there; how can I get around this?
If I take test-movie123 from the URL and search for it, I won't find it because it has no !, unless I use LIKE, but that's not very reliable...
Could anyone suggest anything? It would be much appreciated.
Well, you could create a rule for taking the movie title and turning it into a slug. So, you'd know that you always lowercased the title, removed anything other than letters, numbers and dashes, and converted whitespace into a single dash.
Then store that in another column in your database, and be sure you are forcing uniqueness. Take the URL and search that column from that.
From that point you just have to deal with what happens if you have a second video uploaded that produces the exact same slug. There are a number of options for this ... append a random number slug, increment a number and append it, etc.
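The slug rule and the "increment a number and append it" option described above could look roughly like this. It is a sketch in Python for illustration; slugify and unique_slug are hypothetical names, the rule only keeps ASCII letters and digits, and the taken set stands in for a unique column in the database.

```python
import re

def slugify(title: str) -> str:
    """Lowercase, turn whitespace runs into one dash, drop other punctuation."""
    slug = title.lower().strip()
    slug = re.sub(r'\s+', '-', slug)          # whitespace -> single dash
    slug = re.sub(r'[^a-z0-9-]', '', slug)    # keep letters, digits, dashes
    slug = re.sub(r'-{2,}', '-', slug)        # collapse repeated dashes
    return slug.strip('-')

def unique_slug(title: str, taken: set) -> str:
    """Append -2, -3, ... until the slug is not already in `taken`."""
    base = slugify(title)
    candidate, n = base, 2
    while candidate in taken:
        candidate = f'{base}-{n}'
        n += 1
    taken.add(candidate)
    return candidate

taken = set()
print(unique_slug('Test Movies123!', taken))
print(unique_slug('Test-Movies123', taken))
```

In a real application the uniqueness check would still need a UNIQUE index on the slug column, since two uploads can race each other.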
To do that, you may have in your database something like "test-movies123" as the primary key.
Imagine you have a control panel, you insert movies in a form.
Then use the title Test Movies123! to save it in the database like this example:
id: AUTO_INCREMENT NUMBER
keyname: sanityTitle("Test Movies123!") <-- this should save "test-movies123"
title: "Test Movies123!"
stuff: "blablabla"
Note that sanityTitle() will be your own function to prepare friendly URLs from titles.
Then your url will look like
watch/test-movies123/ using regex control in URLs
or
watch/?id=test-movies123 raw
You will search for the INDEXED or PRIMARY key, "keyname" in the table, it will output 1 row, with all your stuff.
A group of people have been inconsistently entering data for a while.
Some people will enter this:
101mxeGte - TS 200-10
And other people will enter this
101mxeGte-TS-200-10
The sad thing is, those are supposed to be identical records.
They will also search inconsistently. If a record was entered one way, some people will search the other way.
Now, I know all about how you can fix data entry for the future, but that's NOT what I am asking about. I want to know how it is possible to:
Leave the data alone, but...
Search for the right thing.
Am I asking for the impossible here?
The best thing I found so far was a suggestion to simply muck about with the existing data, using the REPLACE function in MySQL.
I am uncomfortable with this option, as it means it will certainly actively piss off half of the users. The unfocused angst of all is less than the active ire of half.
The problem is that it has to go both ways:
Entering spaces in the query has to find both space and not-space entries,
and NOT entering spaces ALSO has to find both space and not-space entries.
Thanks for any help you can offer!
The "ideal" solution is pretty straightforward:
1. Decide what the canonical way of representing a record is
2. When someone saves a record, canonicalize it before saving
3. When someone searches for a record, canonicalize the input before searching for it
You could also write a small program to convert all existing data to the canonical form (you will have the code for it anyway, as "canonicalize" in steps 2 and 3 requires that you write code that does so).
Edit: some specific information on how to canonicalize
With the sample data you give, the algorithm might be:
Replace all spaces with hyphens
Replace all runs of one or more hyphens with a single hyphen (a regex would be easiest for this -- actually, a regex can do both steps in one go)
Is there any practical problem with this approach?
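The two-step algorithm above really can be done with a single regex. A minimal sketch in Python (the function name canonicalize is just illustrative, and the hyphen is assumed to be the canonical separator):

```python
import re

def canonicalize(value: str) -> str:
    """Collapse every run of spaces and/or hyphens into a single hyphen."""
    return re.sub(r'[ -]+', '-', value.strip())

print(canonicalize('101mxeGte - TS 200-10'))
print(canonicalize('101mxeGte-TS-200-10'))
```

Both sample records from the question come out as the same string, so applying the function to saved values and to search input makes them comparable with a plain equality test.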
Trim whitespaces from BOTH the existing data and the input of the search. That way the intended record(s) will always be returned. Hope your data size is small, though, because it's going to perform pretty poorly.
Edit: by "existing data" I meant "the query of existing data". My answer was based on assumption that the actual data could not be touched (which might not be correct).
If it were up to me, I'd have the data in the database updated with REPLACE, and on future searches remove all spaces from the input when dealing with the given row.
Presumably your users enter the search terms (or record details, when creating a record) in an HTML form, which then goes to a PHP script. It looks like your data can always be written in a way that contains no spaces, so why don't you do this:
Run a query that strips spaces from the existing data
Add code in the PHP script(s) that receives the form(s), so that it strips spaces from submitted data - whether that data is to be used for search or for writing new data.
Edit: I guess you would also need to change some spaces to hyphens. Shouldn't be too hard to write logic to accomplish that.
Something like this.
pseudo code:
$myinput = mysql_real_escape_string('101mxeGte-TS-200-10');
// Strip spaces and hyphens from both the column and the quoted input before comparing
$query = " SELECT * FROM table1
WHERE REPLACE(REPLACE(f1, ' ', ''),'-','')
= REPLACE(REPLACE('$myinput', ' ', ''),'-','') ";
Alternatively you might write your own function to trim the data so it can be compared.
DELIMITER $$
CREATE FUNCTION myTrim(AStr VARCHAR(255)) RETURNS VARCHAR(255) DETERMINISTIC
BEGIN
-- VARCHAR needs an explicit length in MySQL; 255 is an assumed size
DECLARE Result VARCHAR(255);
SET Result = REPLACE(AStr, ' ','');
SET Result = ......
.....
RETURN Result;
END$$
DELIMITER ;
And then use this in your select
$query = " SELECT * FROM table1
WHERE MyTrim(f1) = MyTrim('$myinput') ";
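The nested REPLACE comparison can be tried out on a scratch table. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption), passes the search term as a bound parameter instead of interpolating it into the SQL string, and includes one invented row (999abc-XY-1) that should never match.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE table1 (f1 TEXT)')
conn.executemany(
    'INSERT INTO table1 (f1) VALUES (?)',
    [('101mxeGte - TS 200-10',), ('101mxeGte-TS-200-10',), ('999abc-XY-1',)],
)

def search(term: str):
    """Find rows equal to `term` once spaces and hyphens are removed from both sides."""
    sql = ("SELECT f1 FROM table1 "
           "WHERE REPLACE(REPLACE(f1, ' ', ''), '-', '') "
           "    = REPLACE(REPLACE(?, ' ', ''), '-', '') "
           "ORDER BY rowid")
    return [row[0] for row in conn.execute(sql, (term,))]

print(search('101mxeGte-TS-200-10'))
print(search('101mxeGte TS 200 10'))
```

The stored data is left untouched, and either spelling of the search term finds both variants. The trade-off is that the expression on f1 prevents an ordinary index from being used, so every row is scanned.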
have you ever heard of SQL's LIKE?
http://dev.mysql.com/doc/refman/4.1/en/string-comparison-functions.html
there's also regex
http://dev.mysql.com/doc/refman/4.1/en/regexp.html#operator_regexp
101mxeGte - TS 200-10
101mxeGte-TS-200-10
how about this?
SELECT 'justalnums' REGEXP '101mxeGte[[:blank:]]*(\-[[:blank:]]*)?TS[[:blank:]-]*200[[:blank:]-]*10'
digits can be represented by [0-9] and alphas as [a-z] or [A-Z] or [a-zA-Z]
Append a + to match one or more of them. Parens allow you to group, and even capture what is inside them and reuse it later in a replace or something else.
RLIKE is the same as REGEXP.
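Writing such a pattern by hand for every search gets tedious, so it could be generated from whatever the user typed. The sketch below does this in Python for illustration; tolerant_pattern is a hypothetical helper, and since Python's re module has no [[:blank:]] class, a space-or-tab class is used instead.

```python
import re

def tolerant_pattern(term: str) -> re.Pattern:
    """Build a regex in which any separator run in `term` matches any run of blanks/hyphens."""
    # Split the term on runs of spaces, tabs, or hyphens and escape each piece
    parts = [re.escape(p) for p in re.split(r'[ \t-]+', term.strip()) if p]
    # Between pieces, allow zero or more blanks or hyphens
    return re.compile(r'^' + r'[ \t-]*'.join(parts) + r'$')

pattern = tolerant_pattern('101mxeGte-TS-200-10')
print(bool(pattern.match('101mxeGte - TS 200-10')))
print(bool(pattern.match('101mxeGte-TS-200-11')))
```

Like the hand-written pattern above, this also accepts the value with no separators at all, which may or may not be desirable.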