I've been working on this simple script all day trying to figure it out. I'm new to regex so please keep that in mind. On top of that, I've tried just about anything and everything I could to get this to work.
As a learning exercise (so please don't point me to the API), I'm trying to download a TSV file from Yahoo Site Explorer via either cURL or file_get_contents (both work; I'm just experimenting with each) and then use a regex to extract only the URL column. I realize I might have more luck with other functions, but I can't find anything dealing with TSV and now it's become a challenge. I've literally spent the entire day trying to get this correct.
So a URL would be:
https://siteexplorer.search.yahoo.com/search?p=www.google.com&bwm=i&bwmo=&bwmf=s
And my regex currently looks like this (I know it's horrible...it's probably the millionth attempt):
preg_match_all('((http(s?)://?(([^/]+(\/.+))))^[\t]$)', $dl, $matches);
My issue right now is that there are 4 columns: TITLE, URL, SIZE and FORMAT. I'm able to strip out everything from the first (TITLE) and last (FORMAT) columns, but I cannot seem to strip out the SIZE column, or to stop requiring the trailing slash, since some of the linking sites' URLs don't end in one.
Another thing - I've actually accomplished getting JUST the URL to appear, but they all had ending slashes which leave out links from, say, Twitter.
Any help would be greatly appreciated!
Don't know much about PHP, but this regex works in python (should be the same in PHP):
".+?\t(.+?)\t.*"
Just match it and get the content of group 1. FWIW, code in Python:
import re
import fileinput

# group 1 captures the second tab-separated column (the URL)
urlre = re.compile(".+?\t(.+?)\t.*")
for line in fileinput.input():
    m = urlre.match(line)
    if m:
        print(m.group(1))
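For reference, the same idea can be written directly in PHP. This is only a minimal sketch: `$dl` here is a made-up two-line sample standing in for the downloaded file, with the TITLE, URL, SIZE and FORMAT columns described in the question.

```php
<?php
// Made-up sample standing in for the downloaded TSV (TITLE, URL, SIZE, FORMAT).
$dl = "Google\thttp://www.google.com/\t1234\ttext/html\n"
    . "Twitter\thttp://twitter.com/someuser\t567\ttext/html\n";

// [^\t\r\n]*\t skips the first column; ([^\t\r\n]*) captures the second one.
// The m flag lets ^ match at the start of every line, not just the string.
preg_match_all('/^[^\t\r\n]*\t([^\t\r\n]*)/m', $dl, $matches);

// Group 1 holds one URL per line, with or without a trailing slash.
$urls = $matches[1];
print_r($urls);
```

Using negated character classes instead of `.+?` keeps each capture inside its own column, so the SIZE column never leaks into the match.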
Personally, I'd split the lines by tab. For example:
$stuff = file_get_contents($url);
$urls = array();
// split the whole file by newlines, to get an array of lines
$lines = explode("\n", $stuff);
// loop through the lines
foreach ($lines as $line) {
    // split by tab
    $parts = explode("\t", $line);
    // skip blank or short lines that have no URL column
    if (count($parts) < 2) {
        continue;
    }
    // put the URLs in a list
    $urls[] = $parts[1];
    // or keep track of them by title
    $urls[$parts[0]] = $parts[1];
    // or whatever...
}
Just use parse_url or parse_str instead. Always look for an alternative to regular expressions first, as they can be slow.
OK, my goal is to remove all images (and their tags) that I specify in an array or group. It should remove the entire image tag, whether or not it is contained in a link.
So far I have this somewhat working, but it is far from perfect: this version only removes images that are not inside a link. I need it to work both ways.
So if we have <img src="test1.gif" width="235">, it must remove that even if the tag contains other attributes and even if it is surrounded by a link, as long as the image name matches.
So any images in the group must be completely removed, along with their tags and any links that wrap them.
This is what I have so far.
#<img[^>]+src=".*?(test1.gif|test2.png|test3.jpg)"[^>]+>?#i
Ultimately, what I am trying to do is not as simple as I hoped, so I am hoping some regex gurus can help with this task. I can't find anything on here or the net; most examples just replace all images on a page, not specific images. Note: my reason for needing a regex is that this must work in other code that's based around preg_replace, and yes, I know that's not the best way to do it.
UPDATED added this as example sorry for any confusion.
This all PHP Based!
So this var will have all the images that we need to replace. with nothing.
$m_rimg = "imagewatever.gif|test.jpg|animage.png";
preg_replace('#<img[^>]+src=".*?('.$m_rimg.')"[^>]+>?#i','');
This almost works, but not correctly: it must also remove images wrapped in a link, removing the link along with the image. So basically I need my modified pattern to work with both <img src="whatever.gif" width=""> and <a href="..."><img src="whatever.gif" width=""></a>, but it must only remove the images that match the list in the var, not all images. Replacing every image I can already do; this is more complex.
I hope this better explains it.
UPDATED 04/25/15
OK, I tried the last answer that was added; details below.
I had to add some \ escapes so I did not get a parse error. For anyone looking to do something similar:
This worked great. I just modified what you gave me like this:
"#(?:<a\b[^>]*?>)?(<img[^>]+src=[\"'][^>]*?($m_rimg)['\"][^>]*>)(?:<\/a>)?#is"
I did not use preg_quote. I'm not sure why, but it did not work at all with preg_quote; without it, it works so far in the tests I just did.
I was told not to use |, but that is what seems to work. How else would you suggest doing it?
As for this being flagged by some as a duplicate of another answered question, I do not think that's the case. I looked at the answer to that question and it is not the same: it does not do the exact thing I need, which is to match what's in my var. While it is regex related, it did not help. I tried to find something on here that worked for my needs well before posting.
I got a helpful answer to my problem from one user, who understood why I was doing it this way. I hope this is now acceptable to lift the dupe status, as my goal was not to offend those who don't think I should use a regex as part of an HTML parser script.
Try something like:
$DOM = new DOMDocument();
// loadHTML() copes with ordinary HTML; loadXML() only accepts well-formed XML
$DOM->loadHTML('HTML_DOCUMENT');
// copy the live node list first, so removing nodes does not skip any
$list = iterator_to_array($DOM->getElementsByTagName('img'));
foreach ($list as $img) {
    $src = $img->getAttribute('src');
    // only match if src ends with one of the listed file names
    if (stringEndsWith($src, 'test1.gif') ||
        stringEndsWith($src, 'test2.gif') ||
        stringEndsWith($src, 'test3.gif')) {
        // a node has to be removed through its parent
        $img->parentNode->removeChild($img);
    }
}
function stringEndsWith($haystack, $ending, $caseInsensitivity = false)
{
    if ($ending === '') {
        return true;
    }
    // compare the tail of $haystack with $ending
    $tail = substr($haystack, -strlen($ending));
    if ($caseInsensitivity)
        return strcasecmp($tail, $ending) === 0;
    else
        return $tail === $ending;
}
Or as you state you still need a regex way to remove <img> tags based on the alternative list inside a $m_rimg variable, and any <a> tags wrapped around, so use this:
$re = '#(?:<a\b[^>]*?>)?(<img[^>]+src=["\'][^>]*?('.$m_rimg.')[\'"][^>]*>)(?:<\/a>)?#is';
$str = "<img\n att=\"value\"\n src=\"sometext3456..,gjyg&&&test1.gif\" />\n\n<img src=\"imagewatever.gif\">";
$result = preg_replace($re, "", $str);
Mind that all the items in your variable must be preg_quoted, but not the | symbols.
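A minimal sketch of that quoting step, assuming the file names start out as a PHP array (the `$images` list and the sample markup below are made up for illustration). Each name is passed through `preg_quote()` on its own and the results are joined with an unquoted `|`:

```php
<?php
// Hypothetical list of image file names to strip.
$images = ['test1.gif', 'imagewatever.gif', 'an.image(1).png'];

// Quote each name separately (passing '#' because it is the delimiter),
// then join with a plain | so it still acts as alternation.
$m_rimg = implode('|', array_map(function ($name) {
    return preg_quote($name, '#');
}, $images));

$re = '#(?:<a\b[^>]*?>)?(<img[^>]+src=["\'][^>]*?(' . $m_rimg . ')[\'"][^>]*>)(?:<\/a>)?#is';

$str = '<p>keep</p><a href="/x"><img src="img/test1.gif" width="235"></a>'
     . '<img src="other.gif"><img src="imagewatever.gif">';

// The linked image and the bare listed image are removed; other.gif stays.
$result = preg_replace($re, '', $str);
echo $result;
```

Calling `preg_quote()` on the whole `a|b|c` string at once would escape the `|` characters too, which would turn the alternation into one long literal and may be why it appeared not to work.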
I am currently attempting to drag some information out of a .txt file using a php script. I have been reading about regex's and thought this would be ideal. To give you some idea the format of the text in the .txt file is as follows:
Data Rate: 20 Hz
Digital I/O Channels:
CH1_IN,0,CH2_OUT,1,CH3_IN,0,CH4_OUT,1,CH5_IN,0,CH6_IN,0,CH7_IN,0,CH8_OUT,0,CH9_IN,0,
CH10_IN,0,CH11_OUT,1,CH12_IN,0,CH13_IN,0,CH14_IN,0,CH15_IN,0,CH16_IN,0,
QEA: Enabled
I am trying to pull out the following detail for each channel:
CH(number)_(IN or OUT),(integer)
As described in various posts and some tutorials I have tried using preg_split but haven't been able to get it to work as I want. My understanding is that something like that shown below should work, although it is likely I have not used it correctly:
$log_file_data = file_get_contents('Log.txt');
$channel_detail = preg_split("/CH[0-9]{2}_[A-Z],[0-1]{1}/",$log_file_data);
My intention is that this would split the text nicely into portions as described earlier but as expected it just pretty much spews out the complete text file. Am I using the correct method or does it not suit what I am looking to achieve?
Any guidance would be appreciated.
You don't need preg_split actually but preg_match_all with an improved regex:
$line = <<< EOF
CH1_IN,0,CH2_OUT,1,CH3_IN,0,CH4_OUT,1,CH5_IN,0,CH6_IN,0,CH7_IN,0,CH8_OUT,0,CH9_IN,0,
CH10_IN,0,CH11_OUT,1,CH12_IN,0,CH13_IN,0,CH14_IN,0,CH15_IN,0,CH16_IN,0,
EOF;
if (preg_match_all('/CH([0-9]+)_(IN|OUT),([01])/', $line, $arr))
    print_r($arr);
Your channel #, IN/OUT and next number is available in groups #1, #2 and #3
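If one record per channel is easier to work with than one array per group, the `PREG_SET_ORDER` flag can be passed to `preg_match_all`. A small sketch, using a shortened sample line and a made-up `$channels` array layout:

```php
<?php
// Shortened sample of the channel data from the log file.
$line = "CH1_IN,0,CH2_OUT,1,CH10_IN,0,CH11_OUT,1,";

// PREG_SET_ORDER groups the captures per match instead of per group.
preg_match_all('/CH([0-9]+)_(IN|OUT),([01])/', $line, $matches, PREG_SET_ORDER);

$channels = [];
foreach ($matches as $m) {
    // $m[1] = channel number, $m[2] = direction, $m[3] = state
    $channels[(int) $m[1]] = ['direction' => $m[2], 'state' => (int) $m[3]];
}
print_r($channels);
```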
You really don't need regex at all. Exploding on ',' will yield an array with the channel names at the even indexes (0, 2, 4, ...) and, at each following odd index, the integer that belongs to the channel just before it.
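A rough sketch of the explode() approach, assuming the channel lines have already been isolated from the rest of the log (the `$data` sample is shortened and the pairing loop is just one way to do it):

```php
<?php
// Only the channel lines from the log, shortened for the example.
$data = "CH1_IN,0,CH2_OUT,1,CH3_IN,0,\nCH10_IN,0,CH11_OUT,1,\n";

// Drop the line breaks, then split on commas. The trailing comma leaves
// an empty last element; filtering with strlen removes it but keeps "0".
$parts = array_values(array_filter(
    explode(',', str_replace(["\r", "\n"], '', $data)),
    'strlen'
));

$channels = [];
// Pair each channel name with the value that follows it.
for ($i = 0; $i + 1 < count($parts); $i += 2) {
    $channels[$parts[$i]] = (int) $parts[$i + 1];
}
print_r($channels);
```

Note that `array_filter()` without a callback would also discard the "0" values, which is why `strlen` is passed explicitly.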
Cheers
Something I have noticed on the StackOverflow website:
If you visit the URL of a question on StackOverflow.com:
"https://stackoverflow.com/questions/10721603"
The website adds the name of the question to the end of the URL, so it turns into:
"https://stackoverflow.com/questions/10721603/grid-background-image-using-imagebrush"
This is great, I understand that this makes the URL more meaningful and is probably good as a technique for SEO.
What I wanted to Achieve after seeing this Implementation on StackOverflow
I wish to implement the same thing with my website. I am happy using a header() 301 redirect in order to achieve this, but I am attempting to come up with a tight script that will do the trick.
My Code so Far
// Set the title of the page article (This could be from the database). Trimming any spaces either side
$original_name = trim(' How to get file creation & modification date/times in Python with-dash?');
// Replace any characters that are not A-Za-z0-9 or a dash with a space
$replace_strange_characters = preg_replace('/[^\da-z-]/i', " ", $original_name);
// Replace any spaces (or multiple spaces) with a single dash to make it URL friendly
$replace_spaces = preg_replace("/([ ]{1,})/", "-", $replace_strange_characters);
// Remove any leading or trailing dashes, and any runs of two or more dashes
$removed_dashes = preg_replace("/^([\-]{0,})|([\-]{2,})|([\-]{0,})$/", "", $replace_spaces);
// Show the finished name on the screen
print_r($removed_dashes);
The Problem
I have created this code and it works fine by the looks of things; it makes the string URL-friendly and readable to the human eye. However, I would like to see whether it is possible to simplify it or "tighten it up" a bit, as I feel my code is probably overcomplicated.
It is not so much that I want it put onto one line, because I could do that by nesting the functions into one another, but I feel that there might be an overall simpler way of achieving it - I am looking for ideas.
In summary, the code achieves the following:
Removes any "strange" characters and replaces them with a space
Replaces any spaces with a dash to make it URL friendly
Returns a string without any spaces, with words separated with dashes and has no trailing spaces or dashes
String is readable (doesn't contain percent signs and + symbols, as it would after simply using urlencode())
Thanks for your help!
Potential Solutions
I found out whilst writing this article that I am looking for what is known as a URL 'slug', and they are indeed useful for SEO.
I found this library on Google code which appears to work well in the first instance.
There is also a notable question on this on SO which can be found here, which has other examples.
I tried to play with preg like you did. However it gets more and more complicated when you start looking at foreign languages.
What I ended up doing was simply trimming the title, and using urlencode
$url_slug = urlencode($title);
Also I had to add these before the urlencode() call:
$title = str_replace('/','',$title); //Apache doesn't like this character even encoded
$title = str_replace('\\','',$title); //Apache doesn't like this character even encoded
There are also 3rd party libraries such as: http://cubiq.org/the-perfect-php-clean-url-generator
Indeed, you can do that:
$original_name = ' How to get file creation & modification date/times in Python with-dash?';
$result = preg_replace('~[^a-z0-9]++~i', '-', $original_name);
$result = trim($result, '-');
To deal with other alphabets you can use this pattern instead:
~\P{Xan}++~u
or
~[^\pL\pN]++~u
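Put together, the Unicode-aware variant might look like the sketch below. `slugify()` is a hypothetical helper name, not a built-in, and the input is assumed to be UTF-8:

```php
<?php
// slugify() is a hypothetical helper name, not a built-in function.
function slugify(string $title): string
{
    // Collapse every run of characters that are neither letters nor
    // digits (in any alphabet) into one dash; the u flag enables UTF-8.
    $slug = preg_replace('~[^\pL\pN]++~u', '-', $title);

    // preg_replace() returns null if $title is not valid UTF-8.
    if ($slug === null) {
        return '';
    }

    // Strip the dashes left over from leading or trailing punctuation.
    return trim($slug, '-');
}

echo slugify(' How to get file creation & modification date/times in Python with-dash?');
// prints: How-to-get-file-creation-modification-date-times-in-Python-with-dash
```

Because a whole run of unwanted characters becomes a single dash, no separate pass is needed to collapse repeated dashes.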
I've used regex for ages but somehow I managed to never run into something like this.
I'm looking to do some bulk search/replace operations within a file where I need to replace some data within tag-like elements. For example, converting <DelayEvent>13A</DelayEvent> to just <DelayEvent>X</DelayEvent> where X might be different for each.
The current way I'm doing this is such:
$new_data = preg_replace('|<DelayEvent>(\w+)</DelayEvent>|', '<DelayEvent>X</DelayEvent>', $data);
I can shorten this a bit to:
$new_data = preg_replace('|(<DelayEvent>)(\w+)(</DelayEvent>)|', '${1}X${3}', $data);
But really all I want to do is simulate a "replace text between tags T with X".
Is there a way to do such a thing? In essence I'm trying to prevent having to match all the surrounding data and reassembling it later. I just want to replace a given matched sub-expression with something else.
Edit: The data is not XML, although it does have what appear to be tag-like elements. I know better than to parse HTML and XML with RegEx. ;)
It is possible using lookarounds:
$new_data = preg_replace('|(?<=<DelayEvent>)\w+(?=</DelayEvent>)|', 'X', $data);
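When the replacement differs per match, the same lookaround pattern can be combined with `preg_replace_callback`. This is only a sketch: the `$codes` lookup table and the sample `$data` are invented for illustration.

```php
<?php
$data = '<DelayEvent>13A</DelayEvent><Other>13A</Other><DelayEvent>7B</DelayEvent>';

// Hypothetical lookup table mapping old values to new ones.
$codes = ['13A' => 'WX', '7B' => 'CREW'];

$new_data = preg_replace_callback(
    '|(?<=<DelayEvent>)\w+(?=</DelayEvent>)|',
    function (array $m) use ($codes) {
        // $m[0] is only the text between the tags; keep it if unmapped.
        return $codes[$m[0]] ?? $m[0];
    },
    $data
);
echo $new_data;
```

Since the lookarounds consume nothing, the callback never sees the surrounding tags and nothing has to be reassembled.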
I'm trying to find a reliable solution to extract a url from a string of characters. I have a site where users answer questions and in the source box, where they enter their source of information, I allow them to enter a url. I want to extract that url and make it a hyperlink. Similar to how Yahoo Answers does it.
Does anyone know a reliable solution that can do this?
All the solutions I have found work for some URL's but not for others.
Thanks
John Gruber has spent a fair amount of time perfecting the "one regex to rule them all" for link detection. Using preg_replace() as mentioned in the other answers, using the following regex should be one of the most accurate, if not the most accurate, method for detecting a link:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
If you only wanted to match HTTP/HTTPS:
(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
$string = preg_replace('/https?:\/\/[^\s"<>]+/', '<a href="$0">$0</a>', $string);
It only matches http/https, but that's really the only protocol you want to turn into a link. If you want others, you can change it like this:
$string = preg_replace('/(https?|ssh|ftp):\/\/[^\s"]+/', '<a href="$0">$0</a>', $string);
There are a lot of edge cases with URLs: a URL could contain brackets, or lack a protocol, and so on. That's why a regex alone is not enough.
I created a PHP library that could deal with lots of edge cases: Url highlight.
You could extract urls from string or directly highlight them.
Example:
<?php
use VStelmakh\UrlHighlight\UrlHighlight;
$urlHighlight = new UrlHighlight();
// Extract urls
$urlHighlight->getUrls("This is example http://example.com.");
// return: ['http://example.com']
// Make urls as hyperlinks
$urlHighlight->highlightUrls('Hello, http://example.com.');
// return: 'Hello, <a href="http://example.com">http://example.com</a>.'
For more details see readme. For covered url cases see test.
Yahoo! Answers does a fairly good job of link identification when the link is written properly and separate from other text, but it isn't very good at separating trailing punctuation. For example The links are http://example.com/somepage.php, http://example.com/somepage2.php, and http://example.com/somepage3.php. will include commas on the first two and a period on the third.
But if that is acceptable, then patterns like this should do it:
\<http:[^ ]+\>
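If the trailing punctuation is not acceptable, one option is to match loosely and then peel sentence punctuation off the end of each match in a callback. A minimal sketch, where `linkify()` is a hypothetical helper name and the input is assumed to be plain text rather than existing HTML:

```php
<?php
// linkify() is a hypothetical helper name.
function linkify(string $text): string
{
    return preg_replace_callback(
        '~https?://[^\s"<>]+~i',
        function (array $m) {
            // Peel sentence punctuation off the end of the match.
            $url = rtrim($m[0], '.,;:!?');
            $tail = substr($m[0], strlen($url));
            // Escape the URL before placing it inside the attribute.
            $safe = htmlspecialchars($url, ENT_QUOTES);
            return '<a href="' . $safe . '">' . $safe . '</a>' . $tail;
        },
        $text
    );
}

echo linkify('The links are http://example.com/somepage.php, and http://example.com/somepage3.php.');
```

The stripped punctuation is appended after the closing tag, so the sentence still reads the same while the commas and the final period stay outside the links.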
It looks like Stack Overflow's parser is better. Is it open source?
This code worked for me.
function makeLink($string){
    /*** make sure there is an http:// on all URLs ***/
    $string = preg_replace("/([^\w\/])(www\.[a-z0-9\-]+\.[a-z0-9\-]+)/i", "$1http://$2",$string);
    /*** make all URLs links ***/
    $string = preg_replace("/([\w]+:\/\/[\w\-?&;#~=\.\/@]+[\w\/])/i","<a target=\"_blank\" href=\"$1\">$1</a>",$string);
    /*** make all emails hot links ***/
    $string = preg_replace("/([\w\-?&;#~=\.\/]+@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?))/i","<a href=\"mailto:$1\">$1</a>",$string);
    return $string;
}