Matching a pattern using web scraping


I want to collect all the problems a user has solved on a website, using a regular expression to extract only the problem code. For example, if PROBLEM is the HTML code, I want to get only PROBLEM from it. This is the PHP function I wrote to accomplish this.
public function filter($s, $u) {
//$u -> username
//$s -> string containing html code
$reg= "/[^<a href=\"\/status\/(?:[A-Z]|[0-9]|\_|[a-z]|\.)*\"$u]/";
$solved = preg_split($reg, $s, -1, PREG_SPLIT_NO_EMPTY)
return $solved[0];
}
The regular expression doesn't seem to be correct, and I only get /[^ when I print $reg. Also, I am not sure whether preg_split() is the right function for this. Please help.

The following works for me. Note that instead of searching for the PROBLEM text in the link as you were attempting, I search for the PROBLEM text that is being highlighted. The $u parameter no longer seems necessary.
public function filter($s) {
// $s set to 'PROBLEM';
$reg = '#(?P<problem>.+)#i';
preg_match($reg, $s, $matches);
return $matches['problem'];
}

The outer brackets in your regular expression are out of place.
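
To illustrate the point about the brackets: wrapping the whole pattern in [^ ... ] turns it into a negated character class, so it matches single characters rather than the link. Below is a minimal sketch using preg_match_all() with a capture group instead. The markup shape (/status/CODE,username/) and the function name solved_problems are assumptions made up for illustration, since the real page's HTML is not shown in the question.

```php
<?php
// Hypothetical markup: <a href="/status/CODE,username/">CODE</a>
// The real page structure is not shown in the question, so adjust the pattern to match it.
function solved_problems(string $html, string $user): array
{
    // preg_quote() escapes regex metacharacters in the username (and the ~ delimiter).
    $pattern = '~<a href="/status/([A-Za-z0-9_.]+),' . preg_quote($user, '~') . '/?">~';
    preg_match_all($pattern, $html, $matches);
    return $matches[1]; // contents of the first capture group: the problem codes
}

$html = '<a href="/status/TEST,alice/">TEST</a> <a href="/status/INTEST,alice/">INTEST</a>';
print_r(solved_problems($html, 'alice'));
```

Using ~ as the delimiter avoids having to escape every / in the path.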

Related

Add text in front of found number, string manipulation

I am parsing a big file with a lot of data in PHP, and I got stuck on one case:
If my string contains a number in curly brackets, I need to add some text in front of it.
Example:
{4316} Test
Should be:
{=ATTRVAL("4316")} Test
Or in the middle of the string:
Some random text {2323} and {3232} I got here.
Should be:
Some random text {=ATTRVAL("2323")} and {=ATTRVAL("3232")} I got here.
I tried so far with a lot of string functions, but no luck at this time.
public static function parseStringWithAttributeValue($attributeValue)
{
preg_match_all('!\d+!', $attributeValue, $matches);
$string = '';
foreach ($matches as $match)
{
// string .= $match
}
return $string;
}
I first tried to extract only the numbers and then build the new text, but that is the wrong logic. Ideally this would be done with preg_replace, if that is possible, but I have had no luck so far.
I also tried str_replace, but my knowledge only goes so far.
If anyone has an idea of what approach to take, I would be happy to get any suggestion.

Try this:
public static function parseStringWithAttributeValue($attributeValue){
return preg_replace('/{(\d+)}/i', '{=ATTRVAL("$1")}', $attributeValue);
}
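
A quick way to check the replacement against the examples from the question. This is only a sketch: the standalone function name wrap_attribute_ids is made up for the demo, and the braces are escaped here for readability.

```php
<?php
// Wrap every {digits} token as {=ATTRVAL("digits")}.
function wrap_attribute_ids(string $text): string
{
    // $1 in the replacement refers to the digits captured by (\d+).
    return preg_replace('/\{(\d+)\}/', '{=ATTRVAL("$1")}', $text);
}

echo wrap_attribute_ids('{4316} Test'), PHP_EOL;
// prints: {=ATTRVAL("4316")} Test
echo wrap_attribute_ids('Some random text {2323} and {3232} I got here.'), PHP_EOL;
// prints: Some random text {=ATTRVAL("2323")} and {=ATTRVAL("3232")} I got here.
```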

Search and Replace URL - Regex?

I'm trying to create a WordPress shortcode (the WordPress part of it isn't that relevant) that will search within some specified text for a link and replace it with one that I specify. For example:
[scode]Click on this link[scode]
[scode]Click on this link[scode]
...will be changed to:
[scode]Click on this link[scode]
I'm trying to put together a function that will search for links and replace them with the one that I specify. Here's what I have right now:
// Adds [hide] shortcode for hiding content from non-registered users.
function hide_text( $atts,$content) {
if ( is_user_logged_in () ) {
return $content;
}
else {
$pattern = '(?<=href=("|\'))[^"\']+(?=("|\'))';
$newurl = "http://replacementurl.com";
$content = preg_replace($pattern,$newurl,$content);
echo $content;
}
}
add_shortcode( 'hide', 'hide_text' );
This just crashes the site, though. I'm not a PHP expert (much less an expert on regex), but are there at least any glaring irregularities in my code?
UPDATE:
I ran debug on the site and found out from the log that there was an extra } in there. Now the site isn't crashing, but the content being echoed is blank... Code updated above

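There is a syntax error in your pattern: the inner double quotes are not escaped. The pattern also needs delimiters before preg_replace() will accept it. Change it to:
$pattern = "/(?<=href=(\"|'))[^\"']+(?=(\"|'))/";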
Errors:
$pattern = "(?<=href=("|'))[^"']+(?=("|'))";
                      ^--    ^--not escaped

http://replcaement url.com: pretty sure this is spelled incorrectly, and there isn't a ; at the end of the line.
It looks like you've written the regex correctly for the most part, but you also need to escape some reserved characters; see @Akam's answer.
I suggest using preg_quote().
(?<=href=("|'))[^"']+(?=("|'))
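
Putting the points above together, here is a minimal sketch of just the replacement step, without the WordPress parts. The function name replace_link_targets is hypothetical. Inside the actual shortcode callback you would likely also want to return the result instead of echoing it, since shortcode callbacks are generally expected to return their output.

```php
<?php
// Replace the target of every href="..." or href='...' with $newUrl.
function replace_link_targets(string $content, string $newUrl): string
{
    // The ~ characters are the pattern delimiters that preg_* functions require.
    $pattern = '~(?<=href=["\'])[^"\']+(?=["\'])~i';
    // A callback avoids $ or \ in the new URL being read as backreferences.
    $result = preg_replace_callback($pattern, function () use ($newUrl) {
        return $newUrl;
    }, $content);
    return $result ?? $content; // preg_replace_callback() returns null on error
}

$html = '<a href="http://old.example/page">Click on this link</a>';
echo replace_link_targets($html, 'http://replacementurl.com'), PHP_EOL;
```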

Preg_match somehow not finding part of string

I am having a problem with preg_match: it is not returning anything, while according to http://gskinner.com/RegExr/?=35ls9 it should be functioning properly.
This is my current code:
$string == <a class="twitter-timeline" href="https://twitter.com/...." data-widget-id="352777062139922223">....</a>...
It's simply the embed code Twitter gives you when creating a widget. It was also included in the example.
$string = get_field('twitter_feed'); //contains the string.
preg_match('/data-widget-id="([0-9]*)"/', $string, $match);
var_dump($match);
It's probably something really simple that I am missing. Hopefully somebody is able to help me with this problem.
edit: added the sample string.

Test it with the following string. I did, and it works fine:
$string = 'data-widget-id="352777062139922223"';
Make sure that get_field is returning a string in that form.
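
One way to check what the field really contains is to dump it and compare it with the literal. The sketch below shows one possible cause, offered purely as an assumption (the question does not confirm it): if the stored value has HTML-encoded quotes, the pattern cannot match until the value is decoded. The sample markup is shortened and made up for the demo.

```php
<?php
$pattern = '/data-widget-id="([0-9]*)"/';

// Made-up, shortened embed code for the demo.
$plain = '<a class="twitter-timeline" data-widget-id="352777062139922223">Tweets</a>';
// Simulates a field whose value was stored with HTML entities (&quot; instead of ").
$encoded = htmlspecialchars($plain);

$plainHit   = preg_match($pattern, $plain, $m1);
$encodedHit = preg_match($pattern, $encoded, $m2);
$decodedHit = preg_match($pattern, html_entity_decode($encoded), $m3);

var_dump($encoded);                             // inspect the raw value, as you would for get_field()
var_dump($plainHit, $encodedHit, $decodedHit);  // which variants matched
```

If var_dump() on your real field shows &quot; or other entities, decoding it first is worth trying.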

preg_match to array. PHP

Calling all the PHP helpers out there.
So basically I would like to give the function preg_match a variable (it can contain a couple thousand lines of code) and have it search using a wildcard plus strings on either side of the wildcard.
For example I would like to search for strings that look like this <a href="*.pdf">
I would then like the function to return every match (along with the html shiz around the wildcard, this is to catch any directory structures too) in an array that I can loop through using a foreach(){} loop.
I'm guessing this is possible, would anyone have the time to help me with this?
I've checked through all the preg_match lit' and through the answers on here, but I can't seem to get the patterns correct. Thanks in advance.
Peace out.

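unset($matches);
preg_match_all('/<a href="[^"]+\.pdf">/', $text, $matches);
// $matches[0] holds every full match; looping over $matches itself would iterate capture groups instead.
foreach ($matches[0] as $match)
{
$shiz = $match;
// Your code here ...
}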
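
If you also want the path on its own, a capture group plus the PREG_SET_ORDER flag gives one array per match, which is easier to loop over. A minimal sketch with made-up sample HTML:

```php
<?php
// Made-up sample input for the demo.
$text = '<a href="docs/a.pdf">A</a> <a href="b.html">B</a> <a href="/files/c.pdf">C</a>';

// With PREG_SET_ORDER each element of $matches is one match:
// [0] is the whole tag, [1] is the captured path.
preg_match_all('/<a href="([^"]+\.pdf)">/', $text, $matches, PREG_SET_ORDER);

$tags  = [];
$paths = [];
foreach ($matches as $m) {
    $tags[]  = $m[0];
    $paths[] = $m[1];
}
print_r($paths);
```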

Extract URLs from text in PHP

I have this text:
$string = "this is my friend's website http://example.com I think it is coll";
How can I extract the link into another variable?
I know it should be done with a regular expression, probably preg_match(), but I don't know how.

Probably the safest way is to use code from WordPress. Download the latest version (currently 3.1.1) and see wp-includes/formatting.php. It has a function named make_clickable that takes plain text as its parameter and returns a formatted string. You can borrow its code for extracting URLs, though it's pretty complex.
This one line regex might be helpful.
preg_match_all('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $string, $match);
But this regex still can't remove some malformed URLs (ex. http://google:ha.ckers.org ).
See also:
How to mimic StackOverflow Auto-Link Behavior

I tried to do as Nobu said and use WordPress, but it had too many dependencies on other WordPress functions. Instead I took Nobu's regular expression for preg_match_all() and turned it into a function using preg_replace_callback(), which now replaces all links in a text with clickable links. It uses an anonymous function, so you'll need PHP 5.3, or you can rewrite the code to use an ordinary function instead.
<?php
/**
* Make clickable links from URLs in text.
*/
function make_clickable($text) {
$regex = '#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#';
return preg_replace_callback($regex, function ($matches) {
return "<a href='{$matches[0]}'>{$matches[0]}</a>"; // no backslashes: \' inside double quotes would emit a literal backslash
}, $text);
}

URLs have a quite complex definition — you must decide what you want to capture first. A simple example capturing anything starting with http:// and https:// could be:
preg_match_all('!https?://\S+!', $string, $matches);
$all_urls = $matches[0];
Note that this is very basic and could capture invalid URLs. I would recommend catching up on POSIX and PHP regular expressions for more complex things.
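
As a small illustration of what "very basic" means here: \S+ also swallows punctuation that directly follows a URL. The sketch below trims common trailing punctuation afterwards. The sample sentence and the set of trimmed characters are assumptions, and trimming like this would also cut URLs that legitimately end in one of those characters.

```php
<?php
$string = 'See http://example.com, then https://example.org/path.';

preg_match_all('!https?://\S+!', $string, $matches);
$raw = $matches[0]; // ['http://example.com,', 'https://example.org/path.']

// Strip trailing punctuation that most likely belongs to the sentence, not the URL.
$all_urls = array_map(function ($url) {
    return rtrim($url, '.,;:!?)');
}, $raw);

print_r($all_urls);
```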

The code that worked for me (especially if you have several links in your $string):
$string = "this is my friend's website https://www.example.com I think it is cool, but this one is cooler https://www.stackoverflow.com :)";
$regex = '/\b(https?|ftp|file):\/\/[-A-Z0-9+&@#\/%?=~_|$!:,.;]*[A-Z0-9+&@#\/%=~_|$]/i';
preg_match_all($regex, $string, $matches);
$urls = $matches[0];
// go over all links
foreach($urls as $url)
{
echo $url.'<br />';
}
Hope that helps others as well.

If the text you extract the URLs from is user-submitted and you're going to display the result as links anywhere, you have to be very, VERY careful to avoid XSS vulnerabilities, most prominently "javascript:" protocol URLs, but also malformed URLs that might trick your regexp and/or the displaying browser into executing them as Javascript URLs. At the very least, you should accept only URLs that start with "http", "https" or "ftp".
There's also a blog entry by Jeff where he describes some other problems with extracting URLs.
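
A minimal sketch of the scheme check described above, combined with escaping on output. The function names is_safe_link and render_link are made up for illustration, and this is not a complete XSS defence, just the "at the very least" part.

```php
<?php
// Accept only URLs whose scheme is on a short allowlist.
function is_safe_link(string $url): bool
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    return is_string($scheme)
        && in_array(strtolower($scheme), ['http', 'https', 'ftp'], true);
}

// Render a URL as a link only if it passes the check; always escape it.
function render_link(string $url): string
{
    $safe = htmlspecialchars($url, ENT_QUOTES);
    if (!is_safe_link($url)) {
        return $safe; // show as plain text instead of a clickable link
    }
    return "<a href=\"{$safe}\">{$safe}</a>";
}

echo render_link('https://example.com/'), PHP_EOL;
echo render_link('javascript:alert(1)'), PHP_EOL;
```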

preg_match_all('/[a-z]+:\/\/\S+/', $string, $matches);
This is an easy way that works for a lot of cases, but not all. All the matches are put in $matches. Note that this does not cover links in anchor elements (<a href=""...), but that wasn't in your example either.

You could do it like this:
<?php
$string = "this is my friend's website http://example.com I think it is coll";
echo explode(' ',strstr($string,'http://'))[0]; //"prints" http://example.com

preg_match_all("/a[\s]+[^>]*?href[\s]?=[\s\"\']+".
"(.*?)[\"\']+.*?>"."([^<]+|.*?)?<\/a>/",
$var, $matches); // pass $matches normally; call-time &$matches is a fatal error since PHP 5.4
$links = $matches[1]; // first capture group: the href values
foreach($links as $link)
{
print($link."<br>");
}

You could try this to find the link and rewrite it as an anchor (adding the href):
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
// The Text you want to filter for urls
$text = "The text you want to filter goes here. http://example.com";
if(preg_match($reg_exUrl, $text, $url)) {
echo preg_replace($reg_exUrl, "<a href=\"{$url[0]}\">{$url[0]}</a> ", $text);
} else {
echo "No url in the text";
}
refer here: http://php.net/manual/en/function.preg-match.php

There are a lot of edge cases with URLs: a URL could contain brackets, or omit the protocol, and so on. That's why a regex alone is not enough.
I created a PHP library that deals with lots of edge cases: Url highlight.
Example:
<?php
use VStelmakh\UrlHighlight\UrlHighlight;
$urlHighlight = new UrlHighlight();
$urlHighlight->getUrls("this is my friend's website http://example.com I think it is coll");
// return: ['http://example.com']
For more details see readme. For covered url cases see test.

Here is a function I use. I can't remember where it came from, but it seems to do a pretty good job of finding links in the text and turning them into links.
You can change the function to suit your needs. I just wanted to share this as I was looking around and remembered I had this in one of my helper libraries.
function make_links($str){
$pattern = '(?xi)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
return preg_replace_callback("#$pattern#i", function($matches) {
$input = $matches[0];
$url = preg_match('!^https?://!i', $input) ? $input : "http://$input";
return '<a href="' . $url . '">' . "$input</a>";
}, $str);
}
Use:
$subject = "this is a link http://google:ha.ckers.org maybe don't want to visit it?";
echo make_links($subject);
Output
this is a link http://google:ha.ckers.org maybe don't want to visit it?

<?php
preg_match_all('/(href|src)[\s]?=[\s\"\']?+(.*?)[\s\"\']+.*?/', $webpage_content, $link_extracted);

This regex works great for me, and I have checked it with many types of URL:
<?php
$string = "Thisregexfindurlhttp://www.rubular.com/r/bFHobduQ3n mixedwithstring";
preg_match_all('/(https?|ssh|ftp):\/\/[^\s"]+/', $string, $url);
$all_url = $url[0]; // Returns Array Of all Found URL's
$one_url = $url[0][0]; // Gives the First URL in Array of URL's
?>
The URLs I checked it with can be found here: http://www.rubular.com/r/bFHobduQ3n

public function find_links($post_content){
$reg_exUrl = "/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/";
// Check if there is a url in the text
if(preg_match_all($reg_exUrl, $post_content, $urls)) {
// make the urls hyper links,
foreach($urls[0] as $url){
$post_content = str_replace($url, ' <a href="' . $url . '">LINK</a> ', $post_content);
}
//var_dump($post_content);die(); //uncomment to see result
//return text with hyper links
return $post_content;
} else {
// if no urls in the text just return the text
return $post_content;
}
}
