I have pulled the text of a page from a URL and need a way to extract specific characters from it.
The line I need to extract from is:
<p align="center"><a href="http://sitexplosion.com/?rid=1256" target="_blank">http://sitexplosion.com/?rid=1256</a></p>
The text I need is the number 1256, that is, everything after ?rid= and before " target="_blank">
That number will change and will be anywhere from 1 to 6 characters long.
If something like this has been posted already, I apologize. I have been scouring the net for the last 3 hours trying to find an answer of some sort.
If you can show me how to pull those characters from that line, I have got the rest already going.
Thanks in advance!
How about this one:-
$strout = "<p align='center'><a href='http://sitexplosion.com/?rid=1256' target='_blank'>http://sitexplosion.com/?rid=1256</a></p>";
// Position of the first character after "?rid="
$startsAt = strpos($strout, "?rid=") + strlen("?rid=");
// Position of the quote that closes the href attribute
$endsAt = strpos($strout, "' target", $startsAt);
$result = substr($strout, $startsAt, $endsAt - $startsAt);
echo $result;
Output:-
1256
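Why not use an HTML parser such as DOMDocument to extract the links, then read each link's query string with parse_url()?
$html = '
<p align="center"><a href="http://sitexplosion.com/?rid=1256" target="_blank">http://sitexplosion.com/?rid=1256</a></p>
<p align="center"><a href="http://sitexplosion.com/?rid=123456" target="_blank">http://sitexplosion.com/?rid=123456</a></p>
<p align="center"><a href="http://sitexplosion.com/" target="_blank">http://sitexplosion.com/</a></p>
';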
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$link_ids = array();
foreach ($dom->getElementsByTagName('a') as $link) {
    if ($query = parse_url($link->getAttribute('href'), PHP_URL_QUERY)) {
        // Strip the parameter name, leaving only the id
        $link_ids[] = str_replace('rid=', '', $query);
    }
}
print_r($link_ids);
/*
Array
(
[0] => 1256
[1] => 123456
)
*/
hope it helps
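The str_replace('rid=', ...) step above only works while rid is the sole query parameter. A minimal sketch of a variant that reads the parameter by name with parse_str(); the function name extractRids() and the extra ref=home parameter are made up for illustration and are not part of the original page:

```php
<?php
// Hypothetical helper: returns the "rid" value of every link found in $html.
function extractRids(string $html): array
{
    libxml_use_internal_errors(true);   // tolerate sloppy markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();

    $rids = [];
    foreach ($dom->getElementsByTagName('a') as $link) {
        $query = parse_url($link->getAttribute('href'), PHP_URL_QUERY);
        if (!is_string($query)) {
            continue;                   // this link has no query string
        }
        parse_str($query, $params);     // split the query string into an array
        if (isset($params['rid']) && is_string($params['rid']) && ctype_digit($params['rid'])) {
            $rids[] = $params['rid'];   // keep numeric ids only
        }
    }
    return $rids;
}

// Assumed sample markup: one link with an extra parameter, one without any id.
$html = '<p align="center"><a href="http://sitexplosion.com/?rid=1256&amp;ref=home" target="_blank">link</a></p>'
      . '<p align="center"><a href="http://sitexplosion.com/">no id</a></p>';
print_r(extractRids($html));
```

Links without a rid parameter are skipped, so the result holds only the ids that were actually present.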
Well this is much shorter
$string = strip_tags('<p align="center">http://sitexplosion.com/?rid=1256</p>');
echo str_replace('http://sitexplosion.com/?rid=','',$string);
The source of this problem is that I'm running ads on my website. My content is mainly HTML stored in a database, so I decided to place "In-Text Ads", that is, ads that are not in a fixed zone.
My solution was to explode the content by paragraphs and place the text ad in the middle of the p tags. That seemed to work well: since I use CKEditor to generate the content, I assumed images, blockquotes, and other tags would be nested inside p tags (foolish of me). I then realized that images and blockquotes had disappeared from my posts. Next, I changed my code to explode using * instead of the p tag. I celebrated too soon, because now I get a lot of duplicate content: for example, if I have one image, I now get the same image 4 times, and the same happens with all other tags. I'm not sure about the source of these duplicates, but I think it has something to do with nested HTML. I looked for a solution for hours, and now I'm here asking whether somebody can help me solve this headache.
Here is my code:
//In a helper file
function splitByHTMLTagName(string $string, string $tagName = 'p')
{
$text = <<<TEXT
$string
TEXT;
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$nodes = [];
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $text);
foreach ($dom->getElementsByTagName($tagName) as $node) {
array_push($nodes, $dom->saveHTML($node));
}
libxml_clear_errors();
return $nodes;
}
//In my view
$text = nl2br($database['content']);
$nodes = splitByHTMLTagName($text, '*');
//Using var_dump($nodes); here shows the duplicates are here already.
$nodes_count = count($nodes);
$show_ad_at = -1;
$was_added = false;
if($nodes_count % 2 == 0 ){
$show_ad_at = $nodes_count /2;
}else if ($nodes_count == 1 || $nodes_count < 3){
$show_ad_at = -1; //add later
}else if ($nodes_count > 3 && $nodes_count % 2 != 0){
$show_ad_at = ceil($nodes_count/2);
}
for($i = 0; $i<count($nodes); $i++){
if(!$was_added && $i == $show_ad_at){
$was_added = true;
?>
<div>
<script></script><!--This script is provided to me, it adds the ad where it is placed, I don't show the full script, It has nothing to do with the duplicates problem-->
</div>
<?php
}
echo $nodes[$i]; //print the node that comes from $nodes array where the duplicates already exist
}
if(!$was_added){
$was_added = true;
?>
<div>
<script></script><!--This script is provided to me, it adds the ad where it is placed, I don't show the full script, It has nothing to do with the duplicates problem-->
</div>
<?php
}
What can I do?
Thanks in advance.
P.S. #1: I use CodeIgniter as my PHP framework.
P.S. #2: My ads provider does not implement "In-Text ads" as a feature like Google does.
It seems you are printing the ads block inside an if statement.
If I haven't misunderstood, your code is like
foreach ... {
    if (strpos($html_line, "In-Text Ads") !== FALSE) {
        print($ads_html);
    }
}
I think you should use str_replace() to insert the ad into the content, instead of a print()-like function, if that is what you are using to output the value.
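A likely cause of the duplicates described in the question: getElementsByTagName('*') returns every element in the document, including nested ones and the html and body wrappers that loadHTML() adds, and saveHTML($node) serializes each node together with its descendants, so a nested image or paragraph is emitted once per ancestor. A minimal sketch, assuming the ad only needs to go between top-level blocks; splitTopLevelNodes() is a hypothetical name, not part of the poster's code or of CodeIgniter:

```php
<?php
// Hypothetical helper: returns the top-level nodes of an HTML fragment, each serialized once.
function splitTopLevelNodes(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // Same UTF-8 hint as in the question's helper
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    libxml_clear_errors();

    $nodes = [];
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $nodes;                  // nothing was parsed
    }
    // Only the direct children of <body>; nested tags stay inside their parent
    foreach ($body->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE && trim($child->nodeValue) === '') {
            continue;                   // skip whitespace between blocks
        }
        $nodes[] = $dom->saveHTML($child);
    }
    return $nodes;
}

// Assumed sample content with nested tags
$content = '<p>First <b>bold</b> paragraph</p><blockquote><p>Quoted</p></blockquote><p><img src="a.png" alt=""></p>';
$nodes = splitTopLevelNodes($content);
echo count($nodes), "\n";
```

The returned array can then be fed to the existing loop in the view, which decides where to place the ad from the node count.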
I'm trying to scrape data from a website, and I'm stuck on the ratings.
They have something like this:
<div class="rating-static rating-10 margin-top-none margin-bottom-sm"></div>
<div class="rating-static rating-13 margin-top-none margin-bottom-sm"></div>
<div class="rating-static rating-46 margin-top-none margin-bottom-sm"></div>
Here rating-10 is actually one star, rating-13 is two stars, and rating-46 will be five stars in my script.
The rating can range from 0 to 50.
My plan is to create a switch: if the class number is in the range 1-10, I will know that it is one star, 11-20 is two stars, and so on.
Any ideas or help will be appreciated.
Try this
<?php
$data = '<div class="rating-static rating-10 margin-top-none margin-bottom-sm"></div>';
$dom = new DOMDocument;
$dom->loadHTML($data);
// Take the first div and split its class attribute on spaces
$div = $dom->getElementsByTagName('div')->item(0);
$div_class = $div->getAttribute('class');
$final_data = explode(" ", $div_class);
echo $final_data[1]; // rating-10
?>
This will give you the expected output.
I had a similar project. This should be the way to do it if you want to parse the whole HTML page:
$dom = new DOMDocument();
$dom->loadHTML($html); // The HTML Source of the website
foreach ($dom->getElementsByTagName('div') as $node){
if($node->getAttribute("class") == "rating-static"){
$array = explode(" ", $node->getAttribute("class"));
$ratingArray = explode("-", $array[1]); // $array[1] is rating-10
//$ratingArray[1] would be 10
// do whatever you like with the information
}
}
You will probably need to change the if condition to an strpos check. I haven't tested this script, but getAttribute("class") returns the whole class string, so the exact comparison above would not match. The if statement would then be:
if(strpos($node->getAttribute("class"), "rating-static") !== false)
FYI, try using QueryPath for future parsing needs. It's just a wrapper around PHP's DOM parser and works really well.
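The switch planned in the question (1-10 is one star, 11-20 is two stars, and so on) can be replaced by rounding up the rating divided by 10. A minimal sketch, assuming the class string always carries exactly one rating-N token; starsFromClass() is a hypothetical name:

```php
<?php
// Hypothetical helper: maps a class string such as "rating-static rating-13 ..." to a star count.
function starsFromClass(string $class): ?int
{
    // "rating-static" is skipped because the hyphen must be followed by digits
    if (!preg_match('/\brating-(\d+)\b/', $class, $m)) {
        return null;                    // no rating class present
    }
    $rating = (int) $m[1];              // 0..50 according to the question
    return (int) ceil($rating / 10);    // 1-10 => 1, 11-20 => 2, ..., 41-50 => 5
}

foreach ([10, 13, 46, 0] as $n) {
    $class = "rating-static rating-$n margin-top-none margin-bottom-sm";
    echo $n, ' => ', starsFromClass($class), "\n";
}
```

The class string passed in would come from $node->getAttribute("class") in either of the DOM loops above.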
I'm writing a simple web crawler to grab some links from a site.
I need to check the returned links to make sure I selectively collect what I want.
For example, here are a few links returned from http://www.polygon.com/
[0] http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide#comments
[1] http://www.polygon.com/videos
[2] http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide
[3] http://www.polygon.com/features
So links 0 and 2 are the ones I want to grab; 1 and 3 I don't want. There's an obvious visual distinction between the links, so how would I compare them?
How would I check to make sure I don't return 1 and 3? Ideally I'd like to be able to input something so it could adapt to any site.
I was thinking I need to check that the link goes past /2015/ etc., but I'm pretty lost.
Here's the PHP code I'm using to grab links:
<?php
$source_url = 'http://www.polygon.com/';
$html = file_get_contents($source_url);
$dom = new DOMDocument;
@$dom->loadHTML($html); // @ suppresses warnings from malformed HTML
$links = $dom->getElementsByTagName('a');
foreach ($links as $link) {
$input_url = $link->getAttribute('href');
echo $input_url . "<br>";
}
?>
It looks like regular expressions would be helpful here.
You could say, for instance:
/* if $input_url contains a 4 digit year, slash, number(s), slash, number(s) */
if (preg_match("/\/20\d\d\/\d+\/\d+\/",$input_url)) {
echo $input_url . "<br>";
}
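The same check can be wrapped in a small filter so the pattern can be swapped per site, which is what the question asks for. A minimal sketch, assuming article paths start with /YYYY/M/D/ as in the Polygon links listed; looksLikeArticle() and its $pattern parameter are hypothetical names:

```php
<?php
// Hypothetical helper: true when the URL's path matches the per-site article pattern.
function looksLikeArticle(string $url, string $pattern = '#^/20\d\d/\d+/\d+/#'): bool
{
    // Match against the path only, so fragments like #comments are ignored
    $path = parse_url($url, PHP_URL_PATH);
    return is_string($path) && preg_match($pattern, $path) === 1;
}

$links = [
    'http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide#comments',
    'http://www.polygon.com/videos',
    'http://www.polygon.com/2015/5/15/8613113/destiny-queens-wrath-bounties-ether-key-guide',
    'http://www.polygon.com/features',
];

// Keep only the article links and reindex the result
$articles = array_values(array_filter($links, 'looksLikeArticle'));
print_r($articles);
```

For another site, a different regular expression can be passed as the second argument instead of editing the function.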
In the testing environment, $html is 20 to 30 or more lines of HTML produced by a cURL (scrape) query to another page/site, but for simplicity I reduced it to this simple example in the question.
I need to echo the DIV with ID "keepthis" and all its content with the HTML structure intact, but delete everything before it and after it. The DIV with ID "deletethis" will always have that ID. I have looked at multiple posts involving substr / explode / trim, but I cannot find or get to work a method that deletes everything TO THE RIGHT in $html starting from position 0 of <div id="deletethis">.
That div (deletethis) is not located at a fixed number of characters into the code. I am able to get the "delete everything before DIV (keepthis)" part to work, just not the other side. Any help would be appreciated.
$html = '<h1>hello world</h1><div id="keepthis"> Sample content</div><div id="deletethis">a bunch of other dynamic html here</div>';
$x = substr($html, strpos($html, '<div id="keepthis">')); //cleans up the BEFORE code
echo $x;
So, based on the link, try this:
$html = '<h1>hello world</h1><div id="keepthis"> Sample content</div><div id="deletethis">a bunch of other dynamic html here</div>';
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXpath($dom);
$result = $xpath->query('//div[@id="keepthis"]');
if ($result->length > 0) {
var_dump($result->item(0)->nodeValue);
}
Warning: nodeValue will not output the tags, but you can iterate through the children of $result->item(0) to get them.
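Since the question asks for the div with its HTML structure intact, one option is to pass the matched node to DOMDocument::saveHTML(), which serializes the node together with its tags and children. A minimal sketch using the sample string from the question; the variable name $kept is made up:

```php
<?php
$html = '<h1>hello world</h1><div id="keepthis"> Sample content</div><div id="deletethis">a bunch of other dynamic html here</div>';

libxml_use_internal_errors(true);       // scraped pages are rarely valid HTML
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$result = $xpath->query('//div[@id="keepthis"]');

$kept = '';
if ($result->length > 0) {
    // Serialize only the matched div, tags included; siblings are left out
    $kept = $dom->saveHTML($result->item(0));
}
echo $kept;
```

This avoids having to locate the start of the deletethis div by character position at all.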
string rtrim ( string $str [, string $character_mask ] )
This function returns a string with whitespace stripped from the end of str.
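Without the second parameter, rtrim() will strip these characters: an ordinary space (" "), a tab ("\t"), a new line ("\n"), a carriage return ("\r"), the NUL byte ("\0"), and a vertical tab ("\x0B").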
I'm currently working on a script to archive an imageboard.
I'm kinda stuck on making links reference correctly, so I could use some help.
I receive this string:
<a href="10028949#p10028949">&gt;&gt;10028949</a><br><br>who that guy???
In said string, I need to alter this part:
<a href="10028949#p10028949"
to become this:
<a href="#p10028949"
using PHP.
This part may appear more than once in the string, or might not appear at all.
I'd really appreciate it if you had a code snippet I could use for this purpose.
Thanks in advance!
Kenny
Disclaimer: as will surely be said in the comments, a DOM parser is the better tool for parsing HTML.
That being said:
"/(<a[^>]*?href=")\d+(#[^"]+")/"
replaced by $1$2
So...
$myString = preg_replace("/(<a[^>]*?href=\")\d+(#[^\"]+\")/", "$1$2", $myString);
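A short usage sketch of that call, covering both cases from the question (several links in one string, and no link at all). The sample posts are assumptions: the real markup may carry more attributes on the a tag, which the [^>]*? part of the pattern allows for.

```php
<?php
// Same pattern as above, written with single quotes so no escaping is needed
$pattern = '/(<a[^>]*?href=")\d+(#[^"]+")/';

// Assumed sample posts: one with two quote links, one with none
$posts = [
    '<a href="10028949#p10028949">&gt;&gt;10028949</a><br><a href="10028950#p10028950">&gt;&gt;10028950</a>',
    'no quote links here',
];

$results = [];
foreach ($posts as $post) {
    // $count receives the number of replacements made in this post
    $fixed = preg_replace($pattern, '$1$2', $post, -1, $count);
    $results[] = [$count, $fixed];
    echo $count, ' replaced: ', $fixed, "\n";
}
```

A string without matching links comes back unchanged, so the call is safe to run on every post.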
try this
>>10028949<br><br>who that guy???
Although your question has already been answered, I invite you to see what would (approximately xD) be the correct approach, parsing it with DOM:
$string = '<a href="10028949#p10028949">&gt;&gt;10028949</a><br><br>who that guy???';
$dom = new DOMDocument();
$dom->loadHTML($string);
$links = $dom->getElementsByTagName('a'); // All the links (a DOMNodeList object, not an array)
foreach ($links as $link) {
    $href = $link->getAttribute('href'); // getting the href
    $cut = strpos($href, '#');
    if ($cut === false) {
        continue; // no fragment in this link, leave it unchanged
    }
    $new_href = substr($href, $cut); // cutting the string at the #
    $link->setAttribute('href', $new_href); // setting the good href
}
$body = $dom->getElementsByTagName('body')->item(0); // selecting everything
$output = $dom->saveHTML($body); // serializing it to a string (includes the <body> tags)
echo $output;
The advantages of doing it this way are:
More organized / Cleaner
Easier to read by others
You could, for example, have mixed links and only want to modify some of them. Using DOM you can select only certain classes
You can change other attributes as well, or the selected tag's siblings, parents, children, etc...
Of course you could achieve the last 2 points with regex as well but it would be a complete mess...