Find all URLs in a string and encode the query string? - php

I would like to find all URLs in a string (a cURL response) and then URL-encode them, query string included. For example, given this URL found in the response:
http://www.example.com/index.php?favoritecolor=blue&favoritefood=sharwarma
I want to replace it with its encoded form (so far I can only do this for a single URL):
http%3A%2F%2Fwww.example.com%2Findex.php%3Ffavoritecolor%3Dblue%26favoritefood%3Dsharwarma
I need to do this for every URL found in the HTML page that cURL returns.
Thank you in advance; I have searched for hours.

This will do what you want if your cURL result is an HTML page and you only want the links from a elements (not images or other clickable elements).
$xml = new DOMDocument();
// $html should be your cURL result
$xml->loadHTML($html);
// or load the page directly by passing its URL to loadHTMLFile
// $xml->loadHTMLFile("http://...");
// this array will contain all links
$links = array();
// loop through all "a" elements
foreach ($xml->getElementsByTagName("a") as $link) {
    // URL-encode the link's URL and add it to the array
    $links[] = urlencode($link->getAttribute("href"));
}
// now do whatever you want with that array
The $links array will contain all the links found in the page in URL-encoded format.
Edit: if you instead want to replace all links in the page while keeping everything else, it is better to use DOMDocument than regular expressions (related: why you shouldn't use regex to handle HTML). Here is an edited version of my code that replaces every link with its URL-encoded equivalent and then saves the page into a variable:
$xml = new DOMDocument();
// $html should be your cURL result
$xml->loadHTML($html);
// loop through all "a" elements
foreach ($xml->getElementsByTagName("a") as $link) {
    // get the original (non URL-encoded) link
    $original = $link->getAttribute("href");
    // set the link to its URL-encoded form
    $link->setAttribute("href", urlencode($original));
}
// save the modified page to a variable
$page = $xml->saveHTML();
// now do whatever you want with that modified page, for example you can "echo" it
echo $page;
Code based on this.
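For reference, here is a minimal self-contained sketch of the first approach that can be run without a cURL call. The $html sample is a hypothetical stand-in for the response body, and the libxml_use_internal_errors() call is an addition that is not in the answer above: it keeps the warnings that loadHTML() raises on imperfect markup from being printed.

```php
<?php
// Hypothetical sample standing in for the cURL response body.
$html = '<html><body><p><a href="http://www.example.com/index.php?favoritecolor=blue&amp;favoritefood=sharwarma">link</a></p></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// URL-encode the href of every "a" element.
$encoded = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $encoded[] = urlencode($a->getAttribute('href'));
}

echo $encoded[0], "\n";
```

Note that getAttribute() returns the attribute value with HTML entities already decoded, so the &amp; in the markup is encoded as %26, not as %26amp%3B.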

An alternative to using PHP's DOM extension directly is simplehtmldom, which is easy to use:
// $data is expected to be a simple_html_dom object
function decodes($data){
    foreach($data->find('a') as $hres){
        // replace each link's href with its URL-encoded form
        $hres->href = urlencode($hres->href);
    }
    return $data;
}

Related

Replace a string before the user sees the page

I am trying to create a custom CMS. Every page has a unique ID, and every page contains a placeholder string (<--UNIQUEID-->) at the place where the CMS text should go.
I am trying to replace that string with the text that is saved in a database for that page, but I can't get it to work. I am trying this with DOM documents.
I have this at the moment:
This is before the <html> tag:
ob_start();
And after the </html> tag:
if ((($html = ob_get_clean()) !== false) && (ob_start() === true))
{
    $dom = new DOMDocument();
    $dom->loadHTML($html); // load the output HTML
    /* your specific search and replace logic goes here */
    $StringToReplace = '<--754764-->';
    $ReplacementString = 'test';
    str_replace($StringToReplace, $ReplacementString, $html);
    echo $dom->saveHTML(); // output the replaced HTML
}
It is showing the page, but it's not showing the replacement string text.
You're trying to do two things and getting confused in the process.
When you load your HTML buffered output into a DOMDocument object (via DOMDocument::loadHTML), the state of that object is now the parsed HTML. You then replace your string into $html itself, and then output the HTML from the DOMDocument.
By the time you reach the str_replace call, the internal state of the DOMDocument is independent of $html, so the replacement has no effect on what saveHTML() outputs. On top of that, str_replace returns the new string rather than modifying $html in place, and your code discards that return value.
If you're certain that the comment will be of exactly that form, you can assign the result of str_replace back to $html and then echo $html;. This also saves you from having to worry about your output being compliant and parsing properly (DOMDocument is stricter than most browsers when it comes to that).
The code you posted doesn't use the DOMDocument object to do any transformation of the document. It just parses the HTML and then generates another document that is functionally identical to the original.
You just don't need the DOMDocument object.
The str_replace() does the expected transformation but the value it returns is completely ignored. You have to echo it in order to get the desired result.
The following code is enough:
if (($html = ob_get_clean()) !== false) {
    /* your specific search and replace logic goes here */
    $StringToReplace = '<--754764-->';
    $ReplacementString = 'test';
    echo str_replace($StringToReplace, $ReplacementString, $html);
}
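The point about the return value can be checked in isolation. The sketch below is only an illustration: the echoed markup is a hypothetical stand-in for the page template, and the placeholder and replacement text are the ones from the question.

```php
<?php
// Capture the page output; the echoed markup stands in for the real template.
ob_start();
echo '<html><body><--754764--></body></html>';
$html = ob_get_clean();

// str_replace() returns the modified string and leaves $html untouched,
// so the result has to be assigned (or echoed) to have any effect.
$output = str_replace('<--754764-->', 'test', $html);

echo $output;
```

After this runs, $html still contains the placeholder and only $output holds the replaced text, which is why the original code printed the page unchanged.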

String of file_get_html can't be edited?

Consider this simple piece of code, working normally using the PHP Simple HTML DOM Parser, it outputs current community.
<?php
//PHP Simple HTML DOM Parser from simplehtmldom.sourceforge.net
include_once('simple_html_dom.php');
//Target URL
$url = 'http://stackoverflow.com/questions/ask';
//Getting content of $url
$doo = file_get_html($url);
//Passing the variable $doo to $abd
$abd = $doo ;
//Trying to find the word "current community"
echo $abd->find('a', 0)->innertext; //Output: current community.
?>
Consider this other piece of code, same as above but I add an empty space to the parsed html content (in the future, I need to edit this string, so I just added a space here to simplify things).
<?php
//PHP Simple HTML DOM Parser from simplehtmldom.sourceforge.net
include_once('simple_html_dom.php');
//Target URL
$url = 'http://stackoverflow.com/questions/ask';
//Getting content of $url
$doo = file_get_html($url);
//Passing the variable $doo to $abd - and adding an empty space.
$abd = $doo . " ";
//Trying to find the word "current community"
echo $abd->find('a', 0)->innertext; //Outputs: nothing.
?>
The second code gives this error:
PHP Fatal error: Call to undefined function file_get_html() in /home/name/public_html/code.php on line 5
Why can't I edit the string returned by file_get_html? I need to edit it for several important reasons (like removing some scripts before processing the HTML content of the page). I also do not understand why it reports that file_get_html() could not be found (the first example shows that we are importing the correct parser).
Additional note:
I have tried all those variations:
include_once('simple_html_dom.php');
require_once('simple_html_dom.php');
include('simple_html_dom.php');
require('simple_html_dom.php');
file_get_html() returns an object, not a string. Attempting to concatenate a string to an object will call the object's __toString() method if it exists, and the operation returns a string. Strings do not have a find() method.
If you want to do what you have described, read the file contents and concatenate the extra string first:
$content = file_get_contents('someFile.html');
$content .= "someString";
$domObject = str_get_html($content);
Alternatively, read the file with file_get_html() and manipulate it with the DOM API.
$doo is not a string! It's an object, an instance of Simple HTML DOM. You can't call -> methods on strings, only on objects, and you cannot treat this object like a string. $abd in your code is the result of concatenating an object with a string; this results in either a string or an error, depending on the details of the object. What it certainly does not do is result in a usable object, so you can't do $abd->find().
If you want to modify the content of the page, do it using the DOM API which the object gives you.
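As an illustration of editing through the DOM API rather than through string concatenation, here is a minimal sketch that removes script elements before reading a link. It uses PHP's built-in DOMDocument instead of Simple HTML DOM so that it runs without the library, and the $raw markup is a hypothetical sample rather than the real page.

```php
<?php
// Hypothetical page content; in practice this would be the fetched HTML.
$raw = '<html><body><script>alert(1);</script><a href="#">current community</a></body></html>';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($raw);
libxml_clear_errors();

// Copy the live node list into an array first: removing nodes while
// iterating the live list directly can skip elements.
$scripts = iterator_to_array($doc->getElementsByTagName('script'));
foreach ($scripts as $script) {
    $script->parentNode->removeChild($script);
}

$scriptsLeft = $doc->getElementsByTagName('script')->length;
$firstLinkText = $doc->getElementsByTagName('a')->item(0)->textContent;

echo $firstLinkText, "\n";
```

The document stays an object the whole time, so it can still be queried after the edit.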

Search through links and identify RSS source with Regex, PHP or Javascript

I'm building a news / blog aggregator that's focusing on the Syrian conflict, and I would like to be able to identify the source. It's a simple site, and the aggregator is an external javascript that pulls RSS from my Yahoo Pipes. My problem is that I cannot find a way to identify the source (i.e. CNN, BBC, etc)
So I figured if I scan the document and identify the href source, I would be able to do something.
Let's say that we have <a href="http://foxnews.com/blahblahblah.php">. I would like to do something like IF href == http://foxnews.com { logo(fox); }.
I'm not sure if I'm even "thinking right", but I'd really like to find my way around this problem. Any suggestions? Or is there author info in my RSS pipe that I'm missing?
http://pipes.yahoo.com/pipes/pipe.run?_id=e9fdf79f13be013e7c3a2e4a7d0f2900&_render=rss
RSS feeds are just XML, so the first thing you would do is find an XML parser for the language that you are wanting to use.
PHP has SimpleXML built in and it's fast and easy to use.
You'd use that to pull out all the links like this.
foreach ($xml->channel->item as $key => $item) {
    $link = $item->link;
}
That's simple to understand: $xml represents the root <rss> element, its <channel> child holds all of the news <item> tags, and we loop through those to pull out each child <link> element.
Then once I'd got that far, I realised it wouldn't take me much more to do the whole thing for you. I strip the links down to just the domain by replacing http:// with an empty string. And then exploding the string using / as the delimiter. Doing this splits the string into chunks that are pulled from between the slashes. Therefore, the first chunk is our domain.
<?php
$url = 'http://pipes.yahoo.com/pipes/pipe.run?_id=e9fdf79f13be013e7c3a2e4a7d0f2900&_render=rss';
$xml = simplexml_load_file($url);
foreach ($xml->channel->item as $key => $item) {
    $link = $item->link;
    // strip the scheme, then take everything before the first slash
    $link = str_replace("http://", "", $link);
    $parts = explode('/', $link);
    $domain = $parts[0];
    print($domain . "<br/>");
}
?>
This code gives me an output of:
www.ft.com
www.dailystar.com.lb
www.ft.com
www.ft.com
www.ft.com
www.ft.com
www.dailystar.com.lb
www.bbc.co.uk
....
Then it's a case of PHP switch statements to get the desired outcome for each link. Like so:
switch($domain) {
    case "www.bbc.co.uk":
        // Do BBC stuff
        break;
    case "www.dailystar.com.lb":
        // Do daily star stuff
        break;
    default:
        // Do something for domains that aren't covered above
        break;
}
Good luck!
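A possible variation on the domain extraction above is PHP's built-in parse_url(), which also copes with https:// links that the str_replace approach leaves untouched. This is only a sketch: source_from_link() and the logo names in the lookup table are hypothetical, chosen for illustration.

```php
<?php
// Hypothetical helper: maps a feed item's link to a short source name.
function source_from_link($link) {
    // parse_url() returns null (or false for badly malformed input)
    // when there is no host component.
    $host = parse_url($link, PHP_URL_HOST);
    if (!is_string($host)) {
        return 'unknown';
    }
    $host = strtolower($host);
    // Treat www.example.com and example.com as the same source.
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }
    // Illustrative lookup table; extend it with the sources in your feed.
    $sources = array(
        'bbc.co.uk'        => 'bbc',
        'dailystar.com.lb' => 'dailystar',
        'ft.com'           => 'ft',
    );
    return isset($sources[$host]) ? $sources[$host] : 'unknown';
}

echo source_from_link('https://www.bbc.co.uk/news/world'), "\n";
```

A lookup array like this does the same job as the switch statement and is easier to extend as new sources appear in the feed.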

Why isn't PHP continuing to run through these urls?

I have the following PHP. Basically, I'm getting similar data from multiple pages of a website (the current number of homeruns from a website that has a bunch of baseball player profiles). The JSON that I'm bringing in has all of the URLs to all of the different profiles that I'm looking to grab from, and so I need PHP to run through the URLs and grab the data. However, the following PHP only gets the info from the very first URL. I'm probably making a stupid mistake. Can anyone see why it's not going through all the URLs?
include('simple_html_dom.php');
$json = file_get_contents("http://example.com/homeruns.json");
$elements = json_decode($json);
foreach ($elements as $element){
    $html = new simple_html_dom();
    $html->load_file($element->profileurl);
    $currenthomeruns = $html->find('.homeruns .current',0);
    echo $element->name, " currently has the following number of homeruns: ", strip_tags($currenthomeruns);
    return $html;
}
Wait... You are using return $html. Why? Return is going to break out of your function, thus stopping your foreach.
If you are indeed trying to get the $html out of your function for ALL of the elements, you should push each $html into an array and then return that array after the loop.
Because you return. return leaves the current function, method, or script, and with it every enclosing loop. As of PHP 5.5 you can use yield to make the function behave like a generator, but that is out of scope here.
Unless your braces are off, you return at the end of the first pass through the loop, so it never reaches the second element.
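The "collect in an array, return after the loop" pattern suggested above can be sketched as follows. The names are hypothetical: collect_homeruns() is an illustrative wrapper, and $fakeFetch stands in for the simple_html_dom lookup so that the example runs without the library or network access.

```php
<?php
// Hypothetical wrapper: gathers one result per player, then returns them all.
function collect_homeruns(array $players, $fetchHomeruns) {
    $results = array();
    foreach ($players as $player) {
        // Store the result and keep looping instead of returning here.
        $results[$player->name] = call_user_func($fetchHomeruns, $player->profileurl);
    }
    // Return only once, after every player has been processed.
    return $results;
}

// Sample data shaped like the decoded JSON from the question.
$players = json_decode('[{"name":"Alice","profileurl":"http://example.com/alice"},
                         {"name":"Bob","profileurl":"http://example.com/bob"}]');

// Stand-in for loading the profile page and reading ".homeruns .current".
$fakeFetch = function ($url) {
    $known = array('http://example.com/alice' => 12, 'http://example.com/bob' => 7);
    return $known[$url];
};

$homeruns = collect_homeruns($players, $fakeFetch);
foreach ($homeruns as $name => $count) {
    echo $name, " currently has the following number of homeruns: ", $count, "\n";
}
```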

Blog display code, keeping other content in post

Alright, I have some code that finds a <code></code> tag set and cleans up any code inside of it so that it displays instead of functioning like regular code. Everything works, but my problem is this: how can I find the tag set (or multiple tag sets) inside, say, $content, clean the code, and still keep ALL of the other content? Here is my code. It checks for matches and cleans each one it finds, but after cleaning it has no way to put the result back into its original position in $content. ($content is being grabbed from a form.)
<?php
preg_match_all("'<code>(.*?)</code>'si", $html, $match);
if ($match) {
    foreach ($match[1] as $snippet) {
        $fixedCode = htmlspecialchars($snippet, ENT_QUOTES);
    }
}
?>
What do I do with $fixedCode, now that it is clean?
Using regex for parsing HTML is bad. I'd suggest getting familiar with a DOM parser, such as PHP's DOM module.
The DOM extension allows you to operate on XML documents through the DOM API with PHP 5.
Using the DOM module, in order to get the HTML/data from <code> tags in the document, you'd want to do something like this:
<?php
//So many variables!
$html = "<div> Testing some <code>code</code></div><div>Nother div, nother <code>Code</code> tag</div>";
$dom_doc = new DOMDocument;
$dom_doc->loadHTML($html);
$code = $dom_doc->getElementsByTagName('code');
foreach ($code as $scrap) {
    echo htmlspecialchars($scrap->nodeValue, ENT_QUOTES), "<br />";
}
?>
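The example above extracts the contents of the <code> tags but does not put the cleaned text back into the post. If you stay with the regular expression from the question, one option is preg_replace_callback(), which substitutes each match in place and leaves the rest of the content alone. This is a sketch under that assumption: escape_code_blocks() is a hypothetical name, and the pattern does not handle nested <code> tags or tags with attributes.

```php
<?php
// Hypothetical helper: escapes the contents of every <code>...</code> pair
// and leaves everything outside those tags untouched.
function escape_code_blocks($content) {
    return preg_replace_callback(
        "'<code>(.*?)</code>'si",
        function ($match) {
            // $match[1] is the text between the tags.
            return '<code>' . htmlspecialchars($match[1], ENT_QUOTES) . '</code>';
        },
        $content
    );
}

$content = 'Intro <code><b>bold</b></code> outro';
$fixed = escape_code_blocks($content);

echo $fixed, "\n";
```

Because the callback returns the replacement for each match, the cleaned snippet lands exactly where the original one was.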
