Unique regex replacement with capture - php

I have a text (html code) with many images like:
<img src="X" attributes />
I need each src value to be replaced by a unique identifier of the form CID:#, where # is the unique value. The src values are not necessarily all different; some of them may be equal.
Below is the code with the regular expression that matches the images. How do I make the replacement?
PS: I need to store in an array the relation between each unique code created and the string it replaced. For instance, I need to know that id 345 corresponds to the URL "img/xxx.jpg".
preg_match_all('/<img src=["\']([^>\'"]*)["\']([^>]*)/', $html, $matches);
$url_image = array();
$attr_image = array();
$cid = array();
foreach ($matches[1] as $i => $img){
    $url_image[$i] = $matches[1][$i];  // first capture group: the src value
    $attr_image[$i] = $matches[2][$i]; // second capture group: the remaining attributes
    //How to replace the src value with the value of $cid?
    $cid[$i] = "CID:".date('YmdHis').'.'.time().$i;
}

It's generally a very bad idea to modify HTML/XML with regular expressions. It's nearly impossible to get right and tends to have unpleasant unintended side effects later.
You'd be much better off using something like the Tidy extension and a DOMDocument to parse the result and perform the attribute replacements you need to do.
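A minimal sketch of the DOMDocument route, for illustration only. The sample markup, the variable names ($cidToUrl, $counter) and the use of uniqid() to build the identifier are assumptions, not part of the question; the Tidy clean-up step is left out.

```php
<?php
// Sample input (assumed); note that the same URL appears twice.
$html = '<p><img src="img/a.jpg" alt="A" /><img src="img/b.jpg" /><img src="img/a.jpg" /></p>';

$dom = new DOMDocument();
// Collect parser warnings instead of printing them for imperfect markup.
libxml_use_internal_errors(true);
// The two flags keep the fragment from being wrapped in <html><body> and a doctype.
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
libxml_clear_errors();

$cidToUrl = array(); // unique id => original src value
$counter = 0;
foreach ($dom->getElementsByTagName('img') as $img) {
    // The running counter keeps ids distinct even when two src values are equal.
    $cid = 'CID:' . uniqid() . '.' . $counter++;
    $cidToUrl[$cid] = $img->getAttribute('src');
    $img->setAttribute('src', $cid);
}

$newHtml = $dom->saveHTML();
print_r($cidToUrl);
```

Because the attribute is changed on the parsed element itself, quoting style and attribute order in the source do not matter.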

Here is the solution used:
preg_match_all('/<img src=["\']([^>\'"]*)["\']([^>]*)/', $html, $matches);
$url_image = array();
$attr_image = array();
$cid = array();
foreach ($matches[1] as $i => $img){
    $url_image[$i] = $matches[1][$i];
    $attr_image[$i] = $matches[2][$i];
    // In date(), 'i' is minutes; 'm' would repeat the month
    $cid[$i] = "CID:".date('YmdHis').'.'.time().$i;
    // Escape every regex metacharacter in the URL, not only "/"
    $tag_img = preg_quote($img, '/');
    //Replace each specific occurrence
    $html = preg_replace('/'.$tag_img.'/', $cid[$i], $html, 1);
}
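If a regex is used anyway, preg_replace_callback can do the matching, the replacement and the bookkeeping in one pass. The following is only a sketch under assumptions: the sample markup, the $cidToUrl map and the $counter variable are made up for illustration, and the pattern still expects src to be the first attribute, as in the question.

```php
<?php
// Sample input (assumed); the same URL appears twice and one tag uses single quotes.
$html = '<img src="img/a.jpg" alt="A" /> <img src=\'img/b.jpg\' /> <img src="img/a.jpg" />';

$cidToUrl = array(); // unique id => original src value
$counter = 0;

$newHtml = preg_replace_callback(
    // Group 1: tag start, group 2: the quote character, group 3: the src value.
    '/(<img\s+src=)(["\'])([^"\'>]*)\2/i',
    function (array $m) use (&$cidToUrl, &$counter) {
        // The counter makes each id unique within this run, even for equal URLs.
        $cid = 'CID:' . date('YmdHis') . '.' . $counter++;
        $cidToUrl[$cid] = $m[3];
        // Rebuild the matched text with the id in place of the URL.
        return $m[1] . $m[2] . $cid . $m[2];
    },
    $html
);

echo $newHtml, PHP_EOL;
print_r($cidToUrl);
```

Each match is rewritten exactly where it was found, so the same URL appearing elsewhere in the text (in a link, for example) is left untouched.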

Related

Regex to get all href tags from text

I have a huge text that contains normal text and href attributes. I want to retrieve all the href values using regular expressions.
I tried href="([^"]*)" but it returns only one href value.
$result[] = $util->execute(self::$queryToGetContentFromPagesEng3); //getting text from database
foreach ($result as $temp) {
if(preg_match("href=\"([^\"]*)\"",$temp)) {
$storeUrl []=$temp;
}
}
I need the result like this:
href=/public/coursecontent/2017-08-03-12-bhnhlwdjzyblelskiard.docx
href=/public/coursecontent/2016-07-07-07-rncsuatxhkkbeomysbmk.docx
My first point would be that regular expressions may well not be the path you want to take in this case.
If you do continue with it, use preg_match_all instead of preg_match to find every occurrence rather than only the first. Inside your foreach, run preg_match_all on each piece of text and array_merge the captured matches into your $storeUrl array. Note that the pattern also needs delimiters, e.g. '/href="([^"]*)"/'.
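A rough sketch of that preg_match_all approach. The two sample strings stand in for the rows fetched from the database and are assumptions, as is the case-insensitive flag.

```php
<?php
// Stand-in for the text returned by the database query (assumed sample data).
$result = array(
    '<p>See <a href="/public/coursecontent/2017-08-03-12-bhnhlwdjzyblelskiard.docx">one</a></p>',
    '<p><a href="/public/coursecontent/2016-07-07-07-rncsuatxhkkbeomysbmk.docx">two</a> and text</p>',
);

$storeUrl = array();
foreach ($result as $temp) {
    // $matches[1] holds the contents of the first capture group for every match.
    if (preg_match_all('/href="([^"]*)"/i', $temp, $matches)) {
        $storeUrl = array_merge($storeUrl, $matches[1]);
    }
}

print_r($storeUrl);
```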
However, I believe a simpler and most likely more reliable approach is to parse the HTML and work from the DOM. In your case it comes down to something like this:
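$storeUrl = array();
foreach ($result as $temp) {
    $dom = new DOMDocument();
    // @ silences warnings about imperfect real-world markup
    @$dom->loadHTML($temp);
    $xpath = new DOMXPath($dom);
    // "//a[@href]" selects every <a> element that has an href, at any depth
    $hrefs = $xpath->query('//a[@href]');
    foreach ($hrefs as $href) {
        $storeUrl[] = $href->getAttribute('href');
    }
}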
Since the title is js regex...
const myString = '...'
const regex = /href=".+?"/gi;
const regex2 = /(?<=href=").+?(?=")/gi;
//regex2 is without 'href' and "
myString.match(regex);

PHP regex for image name with numbers

I have images with names such as:
img-300x300.jpg
img1-250x270.jpg
These names will be stored in a string variable. My image is in Wordpress so it will be located at e.g.
mywebsite.com/wp-content/uploads/2012/11/img-300x300.jpg
and I need the string to be changed to
mywebsite.com/wp-content/uploads/2012/11/img.jpg
I need a PHP regular expression which would return img.jpg and img1.jpg as the names.
How do I do this?
Thanks
Addition
Sorry guys, I had tried this but it didn't work
$string = 'img-300x300.jpg'
$pattern = '[^0-9\.]-[^0-9\.]';
$replacement = '';
echo preg_replace($pattern, $replacement, $string);
You can do this with PHP's native string functions alone.
<?php
function genLink($imagelink)
{
    // File name without the directory part, e.g. "img-300x300.jpg"
    $img1 = basename($imagelink);
    // Everything before the last "-" plus the extension after the last "."
    $img = substr($img1,0,strrpos($img1,'-')).substr($img1,strrpos($img1,'.'));
    // Re-attach the directory part
    $modifiedlink = substr($imagelink,0,strrpos($imagelink,'/'))."/".$img;
    return $modifiedlink;
}
echo genLink('mywebsite.com/wp-content/uploads/2012/11/flower-img-color-300x300.jpg');
OUTPUT :
mywebsite.com/wp-content/uploads/2012/11/flower-img-color.jpg
You can do that as:
(img\d*)-([^.]*)(\..*)
and \1\3 will contain what you want.
Demo: http://regex101.com/r/vU2mD4
Or, replace (img\d*)-([^.]*)(\..*) with \1\3
Maybe this?
(\w+)-[^.]+?(\.\w+)
$1$2 will give you what you want.
search : \-[^.]+
replace with : ''
(.[^\-]*)(?:.[^\.]*)\.(.*)
group 1 - name before "-"
group 2 - extension. (everything after ".")
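One thing to watch with the looser patterns above: applied to the full URL, the "-" in wp-content matches first, so they are safer on the file name alone. The sketch below shows both options; the helper name stripSizeSuffix and the stricter WIDTHxHEIGHT pattern are assumptions for illustration, not taken from any of the answers.

```php
<?php
// Hypothetical helper: removes a trailing "-<width>x<height>" right before the extension.
function stripSizeSuffix(string $path): string
{
    // The lookahead requires the suffix to sit directly before the final ".ext".
    return preg_replace('/-\d+x\d+(?=\.[A-Za-z0-9]+$)/', '', $path);
}

$url = 'mywebsite.com/wp-content/uploads/2012/11/img-300x300.jpg';

// Option 1: stricter pattern, safe on the whole URL.
$strict = stripSizeSuffix($url);

// Option 2: the looser "-[^.]+" pattern, applied to the file name only.
$loose = dirname($url) . '/' . preg_replace('/-[^.]+/', '', basename($url));

echo $strict, PHP_EOL;
echo $loose, PHP_EOL;
echo stripSizeSuffix('img1-250x270.jpg'), PHP_EOL;
```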
As long as there is only one - and one . then explode() should work great for this:
<?php
// array of image names
$images = array();
$images[] = 'img-300x300.jpg';
$images[] = 'img1-250x270.jpg';
// array to store new image names
$new_names = array();
// loop through images
foreach($images as $v)
{
// explode on dashes
// so we would have something like:
// $explode1[0] = 'img';
// $explode1[1] = '300x300.jpg';
$explode1 = explode('-',$v);
// explode the second piece on the period
// so we have:
// $explode2[0] = '300x300';
// $explode2[1] = 'jpg';
$explode2 = explode('.',$explode1[1]);
// now bring it all together
// this translates to
// img.jpg and img1.jpg
$new_names[] = $explode1[0].'.'.$explode2[1];
}
echo '<pre>'.print_r($new_names, true).'</pre>';
?>
That's an interesting question, and since you are using php, it can be nicely solved with a branch reset (a feature of Perl, PCRE and a few other engines).
Search: img(?|(\d+)-\d{3}x\d{3}|-\d{3}x\d{3})\.jpg
Replace: img\1.jpg
The benefit of this solution, compared with a vague replacement, is that we are sure that we are matching a file whose name matches the format you specified.
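For illustration, here is how the branch-reset pattern might be applied with preg_replace. The sample names are assumptions, and $1 is used as the PHP spelling of \1 in the replacement.

```php
<?php
// Sample names (assumed); the last one does not follow the img...-WxH.jpg format.
$names = array('img-300x300.jpg', 'img1-250x270.jpg', 'notes-2012.txt');

// (?| ... ) is a branch reset: both alternatives share group number 1.
// In the second alternative group 1 does not take part, so $1 is empty.
$pattern = '/img(?|(\d+)-\d{3}x\d{3}|-\d{3}x\d{3})\.jpg/';

// With an array subject, preg_replace returns an array of results.
$renamed = preg_replace($pattern, 'img$1.jpg', $names);

print_r($renamed);
```

The third name is left alone, which is the point made above: only names in the expected format are rewritten.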

php regex - Scraping images from javascript object

I'm trying to scrape images from the markup of certain webpages. These webpages all have a slideshow, and the image sources are contained in JavaScript objects on the page. I'm thinking I need to call file_get_contents("http://www.example.com/page/1"); and then have a preg_match_all() function to which I can pass a phrase (i.e. "\"LargeUrl\":\"" or "\"Description\":\"") and get the string of characters up to the next quotation mark it finds.
var photos = {};
photos['photo-391094'] = {"LargeUrl": "http://www.example.org/images/1.png","Description":"blah blah balh"};
photos['photo-391095'] = {"LargeUrl": "http://www.example.org/images/2.png","Description":"blah blah balh"};
photos['photo-391096'] = {"LargeUrl": "http://www.example.org/images/3.png","Description":"blah blah balh"};
I have this function, but it returns the entire line after the input phrase. How can I modify it to return whatever follows the input phrase up to the next quotation mark? Or am I doing this all wrong and there's a better way?
$page = file_get_contents("http://www.example.org/page/1");
$word = "\"LargeUrl\":\"";
if(preg_match_all("/(?<=$word)\S+/i", $page, $matches))
{
echo "<pre>";
print_r($matches);
echo "</pre>";
}
Ideally the function would return an array like the following if I passed in "\"LargeUrl\":\""
$matches[0] = "http://www.example.org/images/1.png";
$matches[1] = "http://www.example.org/images/2.png";
$matches[2] = "http://www.example.org/images/3.png";
You can use parentheses to capture the parts you're interested in. A simple regex to do it is
$word = '"LargeUrl":';
// \s* allows for optional whitespace between the colon and the opening quote
$pattern = preg_quote($word, '/') . '\s*"([^"]+)"';
preg_match_all("/$pattern/", $page, $matches);
print_r($matches[1]);
There is definitely a regex that will match each image URL, but you could also, if it's easier for you, match the whole object and then json_decode() the matched string.
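A sketch of the json_decode() idea, assuming each object literal is valid JSON, sits on one line and ends at the first "};" (which holds for the sample in the question but not necessarily for every page). The inline $page string replaces the file_get_contents() call so the example runs on its own.

```php
<?php
// Stand-in for the fetched page (sample taken from the question).
$page = <<<'JS'
var photos = {};
photos['photo-391094'] = {"LargeUrl": "http://www.example.org/images/1.png","Description":"blah blah balh"};
photos['photo-391095'] = {"LargeUrl": "http://www.example.org/images/2.png","Description":"blah blah balh"};
photos['photo-391096'] = {"LargeUrl": "http://www.example.org/images/3.png","Description":"blah blah balh"};
JS;

// Capture the {...} literal assigned to each photos['...'] entry.
preg_match_all('/photos\[\'[^\']+\'\]\s*=\s*(\{.*?\});/', $page, $m);

$urls = array();
foreach ($m[1] as $json) {
    // Decode to an associative array; skip anything that is not valid JSON.
    $photo = json_decode($json, true);
    if (is_array($photo) && isset($photo['LargeUrl'])) {
        $urls[] = $photo['LargeUrl'];
    }
}

print_r($urls);
```

Once decoded, the other fields such as Description are available from the same array without a second pattern.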
I have a solution for you: use the following code and you will get the result you need.
preg_match_all('/{"LargeUrl":(.*?)"(.*?)"/', $page, $result, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($result[0]); $i++) {
echo "<pre>";
echo $result[2][$i];
echo "</pre>";
}
Thanks......p2c

php preg_replace words last part

$str='<p>http://domain.com/1.html?u=1234576</p><p>http://domain.com/2.html?u=2345678</p><p>http://domain.com/3.html?u=3456789</p><p>http://domain.com/4.html?u=56789</p>';
$str = preg_replace('/.html\?(.*?)/','.html',$str);
echo $str;
I need to get
<p>http://domain.com/1.html</p>
<p>http://domain.com/2.html</p>
<p>http://domain.com/3.html</p>
<p>http://domain.com/4.html</p>
That is, remove ?u=*number* from the end of every URL. Thanks.
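The lazy group (.*?) at the end of the pattern matches as little as possible, which is nothing, so only ".html?" gets replaced. Anchor the group to the following "<" and escape the dot. Change this line:
$str = preg_replace('/.html\?(.*?)/','.html',$str);
into this:
$str = preg_replace('/\.html\?(.*?)</','.html<',$str);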
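To see the difference side by side, here is a short sketch. The variant that uses [^<]* instead of a lazy group followed by "<" is an assumption added for comparison, not the answer's exact pattern.

```php
<?php
$str = '<p>http://domain.com/1.html?u=1234576</p><p>http://domain.com/2.html?u=2345678</p>';

// A trailing lazy group matches the empty string, so only ".html?" is replaced
// and the query value is left behind.
$lazy = preg_replace('/\.html\?(.*?)/', '.html', $str);

// Consuming everything up to the next "<" removes the whole query string.
$fixed = preg_replace('/\.html\?[^<]*/', '.html', $str);

echo $lazy, PHP_EOL;
echo $fixed, PHP_EOL;
```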
An alternative to the other answers:
preg_replace("/<p>([^?]*)\?[^<]*<\/p>/", "<p>$1</p>", $input);
This will match all types of URLs with query variables, not only the ones pointing at .html files.
For example, it also handles these values:
<p>http://domain.com/1.php?u=1234576</p>
<p>http://domain.com?u=1234576</p>
<p>http://domain.com</p>
<p>http://domain.com/pages/users?uid=123</p>
With an output of:
<p>http://domain.com/1.php</p>
<p>http://domain.com</p>
<p>http://domain.com</p>
<p>http://domain.com/pages/users</p>
This code loads the URLs into an array so they can be handled on the fly:
$str = '<p>http://domain.com/1.html?u=1234576</p><p>http://domain.com/2.html?u=2345678</p><p>http://domain.com/3.html?u=3456789</p><p>http://domain.com/4.html?u=56789</p>';
$str = str_replace("<p>","",$str);
$links = preg_split('`\?.*?</p>`', $str,-1,PREG_SPLIT_NO_EMPTY);
foreach($links as $v) {
echo "<p>".$v."</p>";
}

Screen Scraping

Hi, I'm trying to implement a screen-scraping scenario on my website and have the following so far. What I'm ultimately trying to do is replace "ResultsDetails.aspx?" with "results-scrape-details/" in every link in the $results variable, then output the result. Can anyone point me in the right direction?
<?php
$url = "http://mysite:90/Testing/label/stuff/ResultsIndex.aspx";
$raw = file_get_contents($url);
$newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
$content = str_replace($newlines, "", html_entity_decode($raw));
$start = strpos($content,"<div id='pageBack'");
$end = strpos($content,'</body>',$start) + 6;
$results = substr($content,$start,$end-$start);
$pattern = 'ResultsDetails.aspx?';
$replacement = 'results-scrape-details/';
preg_replace($pattern, $replacement, $results);
echo $results;
Use a DOM tool like PHP Simple HTML DOM. With it you can find all the links you're looking for with a jQuery-like syntax.
// Create DOM object from HTML source
$dom = file_get_html('http://www.domain.com/path/to/page');
// Iterate all links whose href starts with ResultsDetails.aspx
foreach ($dom->find('a[href^=ResultsDetails.aspx]') as $node) {
    // Rewrite only the prefix so the rest of the href is kept
    $node->href = str_replace('ResultsDetails.aspx?', 'results-scrape-details/', $node->href);
}
// Output modified DOM
echo $dom->outertext;
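The ? character has special meaning in regexes: either escape it (and add pattern delimiters) and keep using preg_replace, or replace the preg_replace call with str_ireplace() (I'd recommend the latter approach as it is also more efficient). Either way, assign the return value back to $results, since neither function modifies its argument in place.
(and should the html_entity_decode call really be there?)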
C.
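Both options can be sketched as follows. The sample $results string is an assumption standing in for the scraped markup, and str_replace is used here as the case-sensitive sibling of str_ireplace.

```php
<?php
// Stand-in for the scraped fragment (assumed sample data).
$results = '<a href="ResultsDetails.aspx?id=7">Result 7</a> <a href="Other.aspx?id=1">Other</a>';

// Option 1: plain string replacement, no escaping needed.
$viaString = str_replace('ResultsDetails.aspx?', 'results-scrape-details/', $results);

// Option 2: regex with delimiters and the "." and "?" escaped.
$viaRegex = preg_replace('/ResultsDetails\.aspx\?/', 'results-scrape-details/', $results);

// Both functions return the new string; the original variable is unchanged
// until the result is assigned back to it.
$results = $viaString;
echo $results, PHP_EOL;
```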
