preg_replace function for multiple matches in PHP

I am trying to get the base64 data for images inserted with the Summernote JS editor.
So far I have managed to get this:
preg_replace('#<img\ssrc="data:image/([\w]+);base64,([a-zA-Z0-9+/-_=]+)"\sstyle="width:\s[0-9]+px;">#',"IMAGE REMOVED FROM CHROME\r\n", $content,-1, $counter);
as well as a few variations, because the position of the style, data-filename and src attributes changes depending on the browser.
1) Is there an easier way to do this?
For example, if I have all the components of the img tag (src, style and data-filename), can I just match the whole tag? I could write out all the variations, but if I were to run $content = preg_replace 10 times for 10 different variations, isn't that extremely slow just to find one match? And it gets increasingly slower if my string is extremely long.
2) I need to pull out the base64 string to save it as an image. How can I use the regex above to help me fopen, fwrite and fclose it?
Thanks in advance.

You might use something like this:
<img(\s+src="data:image/([\w]+);base64,([\w+/-_=]+)")?(\s+style="width:\s*([\d]+)px;")?\s*>
You can look at this working example.
This way, it also covers the case where the src and/or style attributes are absent.
It's up to you to refine it further, for example:
change the 2 attribute groups to non-capturing, depending on whether you find them useful or not
make their contents partially optional, or even richer (e.g. style might include other properties)
Some additional points:
to cover any case, I replaced \s with \s+ between attributes and, conversely, with \s* inside style
I simplified the [a-zA-Z0-9+/-_=]+ and [0-9]+ expressions, which become [\w+/-_=]+ and [\d]+ (note that /-_ inside a character class is read as a range from / to _, so write [\w+/=-]+ if you only want the base64 alphabet)
I added capturing parentheses around the latter (from your question, it seems that you need to get the width when available)
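For the second part of the question (saving the image), here is a minimal sketch using preg_replace_callback, so that a single pass both extracts the base64 data and replaces the tag. The function name extract_inline_images, the file naming scheme and the target directory are all assumptions for illustration. The pattern also differs from the one above: it matches the whole img tag and looks for the data URI anywhere inside it, which is one way to avoid writing a variation per browser.

```php
<?php
// Hypothetical helper (not part of Summernote or PHP): saves every inline
// base64 image found in $content into $dir and replaces the tag with a placeholder.
function extract_inline_images(string $content, string $dir, ?array &$saved = null): string
{
    $saved = [];
    // Match the whole <img ...> tag and look for the data URI anywhere inside
    // it, so the order of src, style and data-filename does not matter.
    $pattern = '#<img\b[^>]*?\bsrc="data:image/(\w+);base64,([A-Za-z0-9+/=]+)"[^>]*>#i';

    $result = preg_replace_callback($pattern, function (array $m) use ($dir, &$saved) {
        $binary = base64_decode($m[2], true);
        if ($binary === false) {
            return $m[0]; // not valid base64: leave the tag untouched
        }
        // File naming is an assumption: image_0.png, image_1.jpeg, ...
        $path = $dir . '/image_' . count($saved) . '.' . strtolower($m[1]);
        $handle = fopen($path, 'wb');
        fwrite($handle, $binary);
        fclose($handle);
        $saved[] = $path;
        return "IMAGE REMOVED\r\n";
    }, $content);

    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $content;
}
```

Calling extract_inline_images($content, '/some/writable/dir', $saved) returns the cleaned content and fills $saved with the paths that were written.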


PHP Data Scraping

I would like to scrape some data from a website with PHP, using
preg_match("/ /i/s", $contents, $matches);
The website I am trying to get data from looks like this
https://www.spareroom.co.uk/flatshare/?search_id=592135669&
I would like to scrape the line that says:
Showing 1-17 of 17 results
I want to use (.*?) to get the total number of properties (in this case 17) for a website to show this information separately.
How can I use preg_match when the data I am scraping varies according to the amount of properties available?
I look forward to any assistance.
David
Going by the example page it looks like this line appears once on a page. If it appears multiple times you may want preg_match_all to return multiple results. Another tricky thing about doing this is changes that get made to a web page from time to time. So here is a solution that will work right now, but you can also tweak things to account for changes in the web page (something I can't tell from a single example):
preg_match( "#<.*?>\s*(\d+)\s*<.*?>\s+results#i", $page, $results );
So I use the i flag to make it case insensitive. That way if they capitalize "results" or something it won't break.
<.*?>
Keep in mind that you are going to be getting the HTML code which has tags you can't see from the front. In this case there are strong tags around the total. But maybe they will change this to a different tag in the future? So I just used open/close angle brackets with wildcards for the contents. Oh and the question mark is so it's not greedy and stops at the closest angle bracket.
\s*
This looks for 0 or more spaces. Right now there is a single space between the strong tag and the total. What if they remove that space or add more? This should cover you in both cases.
(\d+)
The parentheses are what capture content into the $results array. Inside them, the pattern says 1 or more digits, so only numbers.
\s*
Like earlier, looking for 0 or more space characters.
<.*?>
This is to match the closing strong tag but takes account that they could later use a different closing tag.
\s+results
This is looking for one or more spaces before the word results. We know there has to be at least one, but they could make changes in the future that will put more spaces in there (even though the webpage will only display one).
$results will have two elements. The first one will be the entire matched phrase, and the second element will contain just the captured part (between the parentheses).
There are a million variations you can do to account for variations in the HTML, but this is one that maybe can get you started and you could tweak.
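As a quick check, here is a small sketch that runs the same expression against a sample string. The markup in $page is an assumption modelled on the line quoted in the question, not a copy of the real page.

```php
<?php
// Sample markup is an assumption; the real page may differ.
$page = '<p>Showing <strong>1-17</strong> of <strong>17</strong> results</p>';

if (preg_match('#<.*?>\s*(\d+)\s*<.*?>\s+results#i', $page, $results)) {
    // $results[1] holds the captured digits.
    $total = (int) $results[1];
} else {
    // The line was not found, e.g. because the page layout changed.
    $total = null;
}

var_dump($total);
```

Checking the return value of preg_match before reading $results[1] avoids an undefined index when the site changes its markup.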

changing www*.com to a clickable URL with REGEX

I'm working on a web page and regex keeps coming up as the best way to handle string manipulation for an issue I'm trying to resolve. Unfortunately, regex is not exactly trivial and I've been having trouble. Any help is appreciated.
I would like to make strings entered from a PHP form into clickable links. I've received help with my first challenge, which was how to make strings starting with http, https or ftp into clickable links:
function make_links_clickable($message){
return preg_replace('!(((f|ht)tp(s)?://)[-a-zA-Zа-яА-Я()0-9#:%_+.~#?&;//=]+)!i', '<a href="$1">$1</a>', $message);
}
$message = make_links_clickable($message);
And this works well. When I look at it (and do some research), the best that I can glean from the syntax is that the first piece matches ftp, http and https, then :, then //, followed by a wide range of allowed characters. I would like to know how I can:
1) Make links starting with www, or ending with .com/.net/.org/etc clickable (like google.com, or www.google.com - leaving out the http://)
2) Change youtube links like
"https://www.youtube.com/watch?v=examplevideo"
into
"<iframe width="560" height="315" src="//www.youtube.com/embed/examplevideo" frameborder="0" allowfullscreen></iframe>"
I think these two cases are basically doing the same kind of thing, but figuring out is not intuitive. Any help would be deeply appreciated.
The first regular expression there is made to match almost everything that follows ftp://, http://, https:// that occurs, so it might be best to implement the others as separate expressions since they'll only be matching hostnames.
For number 1, you'll need to decide how strictly you wish to match different TLDs (.com/.net/etc). For example, you can explicitly match them like this:
(www\.)?[a-z0-9\-]+\.(com|net|org)
However, that will only match URLs that end in .com, .net, or .org. If you want all top-level domains and only the valid ones, you'll need to manually write them all in to the end of that. Alternatively, you can do something like this,
(www\.)?[a-z0-9\-]+\.[a-z]{2,6}
which will accept anything that looks like a URL and ends with a dot followed by any combination of 2 to 6 letters (long enough for .museum and .travel). However, this will also match strings like "fgs.fds". Depending on your application, you may need to add more characters to [a-z] to support extended alphabets.
Edit (2 Aug 14): As pointed out in the comments below, this won't match TLDs like .co.uk. Here's one that will:
(www\.)?[a-z0-9\-]+\.([a-z]{2,3}(\.?[a-z]{2,3})?)
Instead of any string between two and six characters (following a period), this will match any two to three letters, then another two to three (if present), with or without a dividing period.
It'd be redundant, but you could instead remove the question mark after www on the second option, then do both tests; that way, you can match any string ending in a common TLD, or a string that begins with "www." and is followed by any characters with one period separating them, like "gpspps.cobg". It would still match sites that might not actually exist, but at least it would look like a URL.
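To show how the first, explicit-TLD option could be used, here is a rough sketch. The function name link_bare_hosts, the TLD list, the added lookbehind and the http:// prefix in the href are all assumptions; the lookbehind is there so hosts that already sit inside a full URL or an e-mail address are left alone.

```php
<?php
// Hypothetical helper: wraps bare hostnames such as google.com or
// www.example.org in an anchor tag, prefixing http:// in the href.
function link_bare_hosts(string $message): string
{
    // (?<![\w/.@-]) rejects a host preceded by a word character, slash, dot,
    // at sign or hyphen, i.e. one that is part of a longer URL or address.
    // The TLD alternation is only an illustrative subset.
    $pattern = '~(?<![\w/.@-])((?:www\.)?[a-z0-9\-]+\.(?:com|net|org|co\.uk))\b~i';

    return preg_replace($pattern, '<a href="http://$1">$1</a>', $message);
}

echo link_bare_hosts('see google.com and www.example.org now');
```

Because of the lookbehind, running this after make_links_clickable should not touch the links that function already created.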
For the YouTube one, I went a little question mark crazy.
(?i:(?:(?:http(?:s)?://)?(?:www\.)?)?youtu(?:\.be/|be\.com/watch\?(?:[a-z0-9_\-\%\&\=]){0,}?v\=))([a-zA-Z0-9_\-]{11})
EDIT: I just tried to use the above regex in one of my own projects, but I encountered some errors with it. I changed it a little and I think this version may be better:
(?i:(?:(?:http(?:s)?://)?(?:www\.)?)?youtu(?:\.be/|be\.com/watch\?(?:[a-z0-9_\-\%\&\=]){0,})?)(?:v=)?([a-zA-Z0-9_\-]{11})
For those not familiar with regular expressions: parts wrapped in parentheses, ( ...regex... ), are stored as groups, which can be selectively picked out of matched strings. Groups that begin with ?:, as most of the ones up there do, e.g. (?:www\.), are not captured. Because the end of that regex was left as a normal, "captured" group, ([a-zA-Z0-9_\-]{11}), if you pass the $matches argument to functions like preg_match, you can use $matches[1] to get the YouTube ID of the video, then work with it however you'd like. Also note that the regex only matches 11 characters for the ID, which is why the 12-character 'examplevideo' from your question comes back as 'examplevide'.
This regex will match pretty much any of the current youtube url formats including incorrect cases, and out of (normal) order parameters:
http://youtu.be/dQw4w9WgXcQ
https://www.youtube.com/watch?v=dQw4w9WgXcQ
http://www.youtube.com/watch?v=dQw4w9WgXcQ&feature=featured
http://www.youtube.com/watch?feature=featured&v=dQw4w9WgXcQ
http://WWW.YouTube.Com/watch?v=dQw4w9WgXcQ
http://YouTube.Com/watch?v=dQw4w9WgXcQ
www.youtube.com/watch?v=dQw4w9WgXcQ
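Putting that together, here is a minimal sketch that pulls the ID out with preg_match and builds the iframe from the question. The function names youtube_id and youtube_embed are hypothetical; the expression is the revised one from above, only wrapped in ~ delimiters because it contains slashes.

```php
<?php
// Hypothetical helper: returns the 11-character video ID, or null if the
// string does not look like a YouTube URL.
function youtube_id(string $url): ?string
{
    $pattern = '~(?i:(?:(?:http(?:s)?://)?(?:www\.)?)?youtu(?:\.be/|be\.com/watch\?(?:[a-z0-9_\-\%\&\=]){0,})?)(?:v=)?([a-zA-Z0-9_\-]{11})~';

    return preg_match($pattern, $url, $matches) ? $matches[1] : null;
}

// Hypothetical helper: builds the embed markup shown in the question.
function youtube_embed(string $url): ?string
{
    $id = youtube_id($url);
    if ($id === null) {
        return null;
    }
    return '<iframe width="560" height="315" src="//www.youtube.com/embed/' . $id
        . '" frameborder="0" allowfullscreen></iframe>';
}

echo youtube_embed('http://www.youtube.com/watch?feature=featured&v=dQw4w9WgXcQ');
```

Extracting the ID first and building the markup separately keeps any trailing parameters such as &feature=featured out of the output.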

Is it possible to write a regex which checks if a string (javascript & php code) is minified?

Is it possible to write a regular expression which checks if a string (some code) is minified?
Many PHP/JS obfuscators remove white space chars (among other things).
So, the final minified code sometimes looks like this:
PHP:
$a=array();if(is_array($a)){echo'ok';}
JS:
a=[];if(typeof(a)=='object'&&(a instanceof Array)){alert('ok')}
In both cases there are no space chars before and after "{", "}", ";", etc. There are also some other patterns which can help. I am not expecting a high-accuracy regex; I just need one which checks whether at least 100 chars of the string look like minified code.
Thanks in advance.
PURPOSES: web malware scanner
I think a minifier will strip all newline characters, although there might possibly be one at the end of the file still if the minified code was pasted back in a text editor. Something like this will probably be fairly accurate:
/^[^\n\r]+(\r\n?|\n)?$/
That just tests that there are no newline characters in the whole thing except for possibly one at the end. So no guarantees, but I think it will work well on any longish block of code.
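In PHP that test might look like the sketch below. The function name looks_single_line and the 100-character minimum (taken from the question) are assumptions.

```php
<?php
// Hypothetical helper: true when $code is at least $minLength bytes long and
// contains no line break except an optional one at the very end.
function looks_single_line(string $code, int $minLength = 100): bool
{
    return strlen($code) >= $minLength
        && preg_match('/^[^\n\r]+(\r\n?|\n)?$/', $code) === 1;
}

var_dump(looks_single_line(str_repeat('a=1;', 30)));
```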
The short answer is "no", regex cannot do this.
Your best bet will probably be to do a statistical analysis of the source files, and compare against some known heuristics. For instance, by comparing the variable names against those often found in minimized code. A minimized file probably has a lot of one-character variable names, for instance... and won't have two-character variable names until all the one-character variable names are exhausted... etc.
Another option would be simply to run the source file through a minimizer, and see if the output is sufficiently different from the input. If not, it was probably already minimized.
But I have to agree with sg3s's final sentence: If you can explain why you need this, we can probably provide more useful answers to your actual needs.
No. The syntax and intent of the code don't change, and some people who are very familiar with PHP and/or JS write simple functions on one line without any whitespace at all (me :s).
What you could do is count all the whitespace characters in a string, though this would also be unreliable, since some constructs simply need whitespace, like x instanceof y. Also, not all minified code is crammed into a single line (see jQuery UI), so you can't really count on that either.
Maybe you can explain why you need to know this and we can try and find an alternative?
You can't tell whether it was minified or just written like that by hand (which probably only applies to smaller scripts). But you can check whether it contains unnecessary whitespace.
Take a look at an open source obfuscator/minifier and see what rules it uses to remove whitespace. Checking whether those rules were applied should work; if the regex gets too complex, a simple parser might be needed.
Just make sure that string literals like a="if ( b )" are excluded.
Run it through a parser for that particular language (even a prettifier might work fine) and modify it to count the number of unused characters. Use the percentage of unused chars vs. number of chars in documents as a test for minification. I don't think you can do this accurately with regex, although counting whitespace vs. document content might be okay.
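The cruder "whitespace vs. document content" idea can be sketched in a few lines. The function names and the 5% threshold are assumptions picked for illustration, not measured values, and this does not exclude string literals.

```php
<?php
// Hypothetical helper: fraction of bytes in $code that are whitespace.
function whitespace_ratio(string $code): float
{
    $length = strlen($code);
    if ($length === 0) {
        return 0.0;
    }
    // preg_match_all() returns the number of matches.
    return preg_match_all('/\s/', $code) / $length;
}

// Hypothetical helper: the threshold is a guess and should be tuned on real samples.
function looks_minified(string $code, float $threshold = 0.05): bool
{
    return strlen($code) >= 100 && whitespace_ratio($code) < $threshold;
}

var_dump(looks_minified(str_repeat('$a=array();if(is_array($a)){echo\'ok\';}', 4)));
```

For a malware scanner this is probably best treated as one signal among several, since hand-written dense code can score the same way.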

PHP: Regex replace while ignoring content between html tags

I'm looking for a regular expressions string that can find a word or regex string NOT between html tags.
Say I want to replace (alpha|beta) in: the first two letters in the greek alphabet are alpha and <b>beta</b>
I only want it to replace alpha, because beta is between <> tags. So ignore (<(.*?)>(.*?)<\/(.*?)>)
:)
I didn't test the logic used in this page - http://www.phpro.org/examples/Get-Text-Between-Tags.html But I can confirm the logical point made at the top of the page in big bold letters that says you shouldn't do what you're trying to do with regex.
Html is not uniform and edge cases will always bite you in the rear if you use regular expressions to handle the content of those tags in any real world situation. So unless your markup is extremely simplistic, uniform, 100% accurate, only contains html (not css, javascript or garbage) then your best bet is a dom parser library.
And really, many DOM parser libraries have problems too, but you'll be miles ahead of the regex counterparts. The best way to get the text content of tags is to render the HTML in a browser and access the innerText property of the given DOM node (or have a human copy and paste the contents out manually), but that isn't always an option :D
It's maybe the 'wrong' way, but it works: when I need to do something similar, I first do a preg_replace_callback to find what I don't want to match and encode it with something like base64.
Then I can happily run an ordinary preg_replace on the result, knowing that it has no chance of matching the strings I want to ignore. Then unscramble using the same pattern in preg_replace_callback, this time sending the matches to be base64 decoded.
I often do this when automatically adding keyword or glossary links or tooltips to a text - I scramble the HTML tags themselves so that I don't try to create a link or a tooltip within the title of an anchor tag or somewhere equally ridiculous, for example.
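A minimal sketch of that mask, replace, unmask approach follows. The function name replace_outside_tags and the \x01 / \x02 marker bytes are assumptions; like the description above, it hides only the tags themselves (including their attributes), not the text between an opening and a closing tag.

```php
<?php
// Hypothetical helper: applies $pattern => $replacement only to text that is
// outside HTML tags.
function replace_outside_tags(string $pattern, string $replacement, string $html): string
{
    // 1. Hide every tag behind marker bytes plus its base64 encoding.
    $masked = preg_replace_callback('/<[^>]*>/', function (array $m) {
        return "\x01" . base64_encode($m[0]) . "\x02";
    }, $html);

    // 2. Run the ordinary replacement on the masked text.
    $replaced = preg_replace($pattern, $replacement, $masked);

    // 3. Decode the hidden tags back to their original form.
    return preg_replace_callback('/\x01([A-Za-z0-9+\/=]*)\x02/', function (array $m) {
        return base64_decode($m[1]);
    }, $replaced);
}

echo replace_outside_tags('/\b(alpha|beta)\b/', '[$1]', 'see <a title="alpha">alpha</a>');
```

One caveat: base64 output is made of letters and digits, so a very short or very loose pattern could still hit the encoded text by accident; anchoring the pattern with \b, as here, makes that less likely.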

How to get sentences from the website html

Hello, I want to extract all sentences from an HTML document. How can I do that? There are many conditions: first we need to strip tags, then we need to identify sentences, which may end with . or ? or !, and there are also things like email addresses and website addresses that may have a . in them. How do we write a script like this?
It's called programming ;). Start by dividing the task in simpler sub-tasks and implement those. For example, in your case, I'd design the program like this:
Download and parse the HTML document
Extract all text content (pay special attention to <script> and <style> elements)
Merge the text content to one long string
Solve the problem of finding sentences in a string (likely, just parse until you find a stop character in ".!?" and then start a new sentence)
Discard false positives (Like empty sentences, number-only sentences etc.)
First you should strip certain tags which are inline formatting elements, like:
I <b>strongly</b> agree.
But you should leave in block-level elements, like DIV and P, because they are even stronger delimiters than . ? and !
Then you have to process the content in these block-level elements. Typically there are navigation links with one word, which you might want to filter out later, so it is not the right choice to strip away the block structure of the document.
At this point you can safely use the regex pattern to identify blocks:
>([^<]+)<
When you have your blocks, you can filter out the short ones (navigation elements) and split the big ones (paragraphs of text) using your sentence delimiters.
There are interesting questions about when a full stop signals the end of a sentence and when it is just a decimal point, but I leave that to you. :)
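To make the steps from both answers concrete, here is a rough sketch that works on an HTML string. The function name extract_sentences, the list of block-level tags and the "must contain a letter" filter are assumptions, and it uses strip_tags plus a few regular expressions instead of a full HTML parser to stay short. It only splits after . ! or ? when whitespace follows, which is one simple way to keep e-mail addresses, URLs and decimals such as 3.14 in one piece.

```php
<?php
// Hypothetical helper: returns the sentences found in an HTML string.
function extract_sentences(string $html): array
{
    // Drop <script> and <style> blocks entirely: their content is not prose.
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', ' ', $html);
    // Put a space before block-level tags so adjacent blocks do not fuse
    // together once the tags are stripped. The tag list is an illustrative subset.
    $html = preg_replace('#<(/?(?:p|div|br|li|h[1-6]|td|tr))\b#i', ' <$1', $html);
    // Strip all tags, decode entities, collapse runs of whitespace.
    $text = html_entity_decode(strip_tags($html));
    $text = trim(preg_replace('/\s+/', ' ', $text));
    // Split after ., ! or ? only when whitespace follows.
    $parts = preg_split('/(?<=[.!?])\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    // Discard false positives: fragments without any letter (e.g. "42.").
    return array_values(array_filter($parts, function (string $s) {
        return preg_match('/[a-z]/i', $s) === 1;
    }));
}

print_r(extract_sentences('<p>I <b>strongly</b> agree. Mail me at a@b.com today!</p><div>42.</div>'));
```

Abbreviations such as "e.g. this" will still be split in the wrong place; handling those needs a list of exceptions or a proper sentence tokenizer.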
