Retrieve a text with certain class name from PHP url - php

How can I use PHP to get the text of an element with a certain class name from another page?
I have an array of URLs like this:
$url_array = array(
'https://www.example.com/item/32',
'https://www.example.com/item/33',
'https://www.example.com/item/34'
);
This is really difficult to explain, so I made a not-so-beautiful sketch of the process:
The first row of bubbles represents the items of $url_array, each of which contains a different URL.
Now I need a method to read each URL and get its content.
Each page returns a div element that has an <a> element with an href URL, but that URL is different every time.
Next I want to get the content behind the <a> element's URL. It should return the text content of a <span> or <p> tag that has text-class as its class.
How could I implement this approach in PHP code?
I have tried this but it ain't working:
$htmlAsString = "index.php";
$doc = new DOMDocument();
$doc->loadHTML($htmlAsString);
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query('//a[@class="class-name"]/@href');
for ($i = 0; $i < $nodeList->length; $i++) {
$url_price = $nodeList->item($i)->value . "<br/>\n";
$retrieve_text_begin = explode('<div class="text-property">',
$url_price);
$retrieve_text_end = explode('</div>', $retrieve_text_begin[1]);
echo $retrieve_text_end[0];
}
I know that the $htmlAsString = "index.php"; might be the problem.
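The process described above (read each URL, find the link with a given class, then read the text on the linked page) could be sketched roughly as follows. This is only a minimal sketch, assuming PHP 7.1 or later, absolute href values, and the class names class-name and text-class from the question. The helper names xpathFor, linksByClass, textByClass and collectTexts are hypothetical, and the page fetcher is passed in as a callable so that the parsing can be tried without network access.

```php
<?php
// Load an HTML string and return a DOMXPath for it, keeping parser warnings quiet.
function xpathFor(string $html): DOMXPath
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    $doc->loadHTML($html); // expects HTML markup, not a file name such as "index.php"
    libxml_clear_errors();
    return new DOMXPath($doc);
}

// Build an XPath condition that matches $class as one whole word of the class attribute.
function classCondition(string $class): string
{
    return sprintf('contains(concat(" ", normalize-space(@class), " "), " %s ")', $class);
}

// Return the href of every <a> element that carries the given class.
function linksByClass(string $html, string $class): array
{
    $hrefs = [];
    foreach (xpathFor($html)->query('//a[' . classCondition($class) . ']/@href') as $attr) {
        $hrefs[] = $attr->value;
    }
    return $hrefs;
}

// Return the trimmed text of every <span> or <p> element that carries the given class.
function textByClass(string $html, string $class): array
{
    $cond = classCondition($class);
    $texts = [];
    foreach (xpathFor($html)->query('//span[' . $cond . '] | //p[' . $cond . ']') as $node) {
        $texts[] = trim($node->textContent);
    }
    return $texts;
}

// For each start URL, follow its matching links and collect the texts found there.
// $fetch is any callable that maps a URL to an HTML string.
function collectTexts(array $urls, callable $fetch): array
{
    $result = [];
    foreach ($urls as $url) {
        $result[$url] = [];
        foreach (linksByClass($fetch($url), 'class-name') as $href) {
            $result[$url] = array_merge($result[$url], textByClass($fetch($href), 'text-class'));
        }
    }
    return $result;
}
```

With allow_url_fopen enabled, a call such as collectTexts($url_array, 'file_get_contents') would be one way to run it against real pages; relative links would first have to be resolved against the page URL.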

Related

Get Title from specific link - php

I am trying to get the title from an AniList URL. This particular code works for the MyAnimeList website, but on the AniList website it keeps returning 'AniList', which is the name of the website. I believe the website in question updates the meta tags after the page has loaded, using jQuery. However, sites like Facebook and Discord are able to get the title of a series, while my code can't.
Here is the code I am using.
For example, here is a random URL from the AniList website:
https://anilist.co/anime/527/Pocket-Monsters/
myfunction('https://anilist.co/anime/527/Pocket-Monsters/');
function myfunction($form_value)
{
$html = file_get_contents_curl($form_value);
//parsing begins here:
$doc = new DOMDocument();
@$doc->loadHTML($html);
$nodes = $doc->getElementsByTagName('title');
//get and display what you need:
$title = $nodes->item(0)->nodeValue;
$metas = $doc->getElementsByTagName('meta');
for ($i = 0; $i < $metas->length; $i++)
{
$meta = $metas->item($i);
if($meta->getAttribute('property') == 'og:title')
{$title = $meta->getAttribute('content');}
if($meta->getAttribute('property') == 'og:site_name')
$site_name = $meta->getAttribute('content');
}
return $title;
}
and it returns
AniList
whereas this is the meta tag:
<meta property="og:title" content="Pokémon" data-vue-meta="true">
So I am expecting it to return
Pokémon
Should I be using another website to get the desired result?
"AniList" is the title as given in the page's markup. If you see anything else in your browser, check whether the application overrides the title using JavaScript. If this is the case, a pure PHP approach won't be able to read the page's final title. You either need to run the whole page in a browser and read the output from there, or use a proper API.
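To check what the server actually sends before any JavaScript runs, the og:title can be read from the raw HTML string. The following is a minimal sketch, assuming PHP 7.1 or later; the function name ogTitleFromHtml and the sample markup are hypothetical, and the sample stands in for a page whose static markup only carries the site name.

```php
<?php
// Return the og:title found in the HTML exactly as it was delivered, or null if absent.
function ogTitleFromHtml(string $html): ?string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);
    // The XML prolog is a commonly used hint so that libxml reads the input as UTF-8.
    $doc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $content = $xpath->query('//meta[@property="og:title"]/@content');
    return $content->length > 0 ? $content->item(0)->value : null;
}

// Hypothetical static markup, before any script has rewritten the meta tags.
$static = '<html><head><title>AniList</title>'
    . '<meta property="og:title" content="AniList"></head><body></body></html>';
echo ogTitleFromHtml($static), "\n";
```

If this prints the site name for the downloaded markup while the browser shows the series title, the title is being set client-side, which matches the explanation above.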

PHP getElementsByTagName('*') avoid duplicate nodes | "In Text ads" by separating content nodes

The source of this problem is that I'm running ads on my website. My content is mainly HTML stored in a database, so I decided to place "In-Text Ads", that is, ads that are not in a fixed zone.
My solution was to explode the content by paragraphs and place the text ad in the middle of the p tags. Since I use CKEditor to generate the content, I thought images, blockquotes, and other tags would be nested inside p tags (foolish of me), and at first this seemed to work pretty well. I now realize that images and blockquotes disappeared from my posts. So I changed my code to split on * instead of on the p tag. I celebrated too soon, because now I get a lot of duplicate content: for example, if I have one image, I now get the same image 4 times, and the same goes for all other tags. I'm not sure about the source of these duplicates, but I think it has something to do with nested HTML. I looked for a solution for hours, and now I'm here asking whether somebody can help me solve this headache.
Here is my code:
//In a helper file
function splitByHTMLTagName(string $string, string $tagName = 'p')
{
$text = <<<TEXT
$string
TEXT;
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$nodes = [];
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $text);
foreach ($dom->getElementsByTagName($tagName) as $node) {
array_push($nodes, $dom->saveHTML($node));
}
libxml_clear_errors();
return $nodes;
}
//In my view
$text = nl2br($database['content']);
$nodes = splitByHTMLTagName($text, '*');
//Using var_dump($nodes); here shows the duplicates are here already.
$nodes_count = count($nodes);
$show_ad_at = -1;
$was_added = false;
if($nodes_count % 2 == 0 ){
$show_ad_at = $nodes_count /2;
}else if ($nodes_count == 1 || $nodes_count < 3){
$show_ad_at = -1; //add later
}else if ($nodes_count > 3 && $nodes_count % 2 != 0){
$show_ad_at = ceil($nodes_count/2);
}
for($i = 0; $i<count($nodes); $i++){
if(!$was_added && $i == $show_ad_at){
$was_added = true;
?>
<div>
<script></script><!--This script is provided to me, it adds the ad where it is placed, I don't show the full script, It has nothing to do with the duplicates problem-->
</div>
<?php
}
echo $nodes[$i]; //print the node that comes from $nodes array where the duplicates already exist
}
if(!$was_added){
$was_added = true;
?>
<div>
<script></script><!--This script is provided to me, it adds the ad where it is placed, I don't show the full script, It has nothing to do with the duplicates problem-->
</div>
<?php
}
What can I do?
Thanks in advance.
P.S. #1: I use CodeIgniter as my PHP framework.
P.S. #2: My ads provider does not implement "In-Text ads" as a feature like Google does.
It seems you are printing the "ads block" inside an if statement.
If I haven't misunderstood, your code is something like
foreach ... {
    if (strpos($html_line, "In-Text Ads") !== FALSE) {
        print($ads_html);
    }
}
I think you should use str_replace() instead of print()-like functions, if you are using something like print() when outputting the value...
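Regarding the duplicates themselves: getElementsByTagName('*') returns every element in the document, including the html and body wrappers that loadHTML adds and every nested descendant, so a nested tag is saved once on its own and once more inside each of its ancestors. One way to avoid this is to walk only the direct children of body. This is a minimal sketch, assuming PHP 7 or later; the function name splitTopLevelNodes is hypothetical.

```php
<?php
// Split an HTML fragment into its top-level nodes only, so nested tags are not repeated.
function splitTopLevelNodes(string $html): array
{
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML('<?xml encoding="utf-8" ?>' . $html);
    libxml_clear_errors();

    $nodes = [];
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return $nodes; // nothing was parsed into a body
    }
    foreach ($body->childNodes as $child) {
        // Skip whitespace-only text between blocks.
        if ($child->nodeType === XML_TEXT_NODE && trim($child->textContent) === '') {
            continue;
        }
        // Each child is serialized together with everything nested inside it.
        $nodes[] = $dom->saveHTML($child);
    }
    return $nodes;
}
```

In the view, this would replace the call splitByHTMLTagName($text, '*'). Note that any tag inserted at the top level (for example by nl2br) becomes a node of its own.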

DomXpath and foreach. How to get a preview of the captured elements?

I am learning to work with DOMXPath in PHP. I was using regex (but here on Stack Overflow I was discouraged from using it to capture HTML). I confess that for me it is not so simple, and the DOM has its limits (when there are spaces in tag names, and also in error handling). If someone can help me with the PHP command to get a preview of the captured elements and check whether everything is right, I would appreciate it. If you have suggestions for improving the code, you're welcome to share them. The code below was based on a question on Stack Overflow itself.
<?php
$doc = new DOMDocument;
libxml_use_internal_errors(true);
// Deleting whitespace (if any)
$doc->preserveWhiteSpace = false;
@$doc->loadHTML(file_get_contents ('http://www.imdb.com/search/title?certificates=us:pg_13&genres=comedy&groups=top_250'));
$xpath = new DOMXPath($doc);
// Starting from the root element
$grupos = $xpath->query(".//*[@class='lister-item mode-advanced']");
// Creating an array and then looping with the elements to be captured (image, title, and link)
$resultados = array();
foreach($grupos as $grupo) {
$i = $xpath->query(".//*[@class='loadlate']//@src", $grupo);
$t = $xpath->query(".//*[@class='lister-item-header']//a/text()", $grupo);
$l = $xpath->query(".//*[@class='lister-item-header']//a/@href", $grupo);
$resultados[] = $resultado;
}
// What command should I use to have a preview of the results and check if everything is ok?
print_r($resultados);
OK, so here is your code with two corrections. First, I'm adding a subarray with the elements to $resultados, and second, I'm using a foreach instead of print_r/var_dump.
BTW, doesn't IMDb offer an API?
<?php
ini_set('display_errors', 1);
error_reporting(-1);
$doc = new DOMDocument;
libxml_use_internal_errors(true);
// Deleting whitespace (if any)
$doc->preserveWhiteSpace = false;
$doc->loadHTML(file_get_contents ('http://www.imdb.com/search/title?certificates=us:pg_13&genres=comedy&groups=top_250'));
//$doc->loadHTML($HTML);
$xpath = new DOMXPath($doc);
// Starting from the root element
$grupos = $xpath->query(".//*[@class='lister-item mode-advanced']");
// Creating an array and then looping with the elements to be captured (image, title, and link)
$resultados = array();
foreach($grupos as $grupo) {
$i = $xpath->query(".//*[@class='loadlate']//@src", $grupo);
$t = $xpath->query(".//*[@class='lister-item-header']//a/text()", $grupo);
$l = $xpath->query(".//*[@class='lister-item-header']//a/@href", $grupo);
$resultados[] = ['i' => $i[0], 't' => $t[0], 'l' => $l[0]];
}
// What command should I use to have a preview of the results and check if everything is ok?
//var_dump($resultados);
foreach($resultados as $r){
echo "\n-----------\n";
echo $r['i']->value."\n";
echo $r['t']->textContent."\n";
echo $r['l']->value."\n";
}
You can play with it here:
https://3v4l.org/hal0G

Inserting numerical ID's in paragraphs (PHP/MySQL DB Query)

I have a pretty ordinary query that displays articles stored in a database table (field = 'Article')...
while ($row = $stm->fetch())
{
$Content = $row['Article'];
}
echo $Content;
I'd like to know how I can modify the display so that every paragraph has a numerical ID. For example, the first paragraph would be [p id="1"], the second one [p id="2"] and so on. However, it would be even better if the last paragraph displayed as [p id="Last"].
(Sorry, I forgot how to post inline code, so I replaced the tags (e.g. <) with brackets.)
My goal is to simply get more control over my content. For example, there are certain items that I want to include after the first paragraph on some pages, and I might want to include a certain feature before paragraph#4 on one special page.
ON EDIT... Neither of the methods suggested below worked for me, but it's probably because I simply didn't implement them correctly; the code in both examples isn't familiar to me. At any rate, I'm bookmarking this page so I can learn more about those scripts.
In the meantime, I finally found a regex solution. (I think preg_replace is another word for regex, right?)
This inserts a numerical ID in each paragraph tag:
$c = 1;
$r = preg_replace('/(<p( [^>]+)?>)/ie', '"<p\2 id=\"" . $c++ . "\">"', $Article);
$Article = $r;
This changes the ID in the last paragraph tag to "Last"...
$c = 1;
$r = preg_replace('/(<p( [^>]+)?>)/ie', '"<p\2 id=\"" . $c++ . "\">"', $Article);
$r = preg_replace('/(<p.*?)id="'.($c-1).'"(>)/i', '\1id="Last"\2', $r);
$Article = $r;
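Note that the e modifier used in the patterns above was deprecated in PHP 5.5 and removed in PHP 7.0, so on current PHP versions the same idea needs preg_replace_callback. The following is a minimal sketch of that variant, assuming PHP 7 or later; the function name numberParagraphs is hypothetical, and paragraphs that already have an id attribute are not handled.

```php
<?php
// Give every <p> tag a running numeric id; the last paragraph gets id="Last".
function numberParagraphs(string $article): string
{
    // \s or > must follow "<p", so tags such as <pre> are not matched.
    $pattern = '/<p(\s[^>]*)?>/i';
    $total = (int) preg_match_all($pattern, $article);
    $count = 0;

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use (&$count, $total): string {
            $count++;
            $id = ($count === $total) ? 'Last' : (string) $count;
            // $m[1] holds the existing attributes, if the tag had any.
            return '<p' . ($m[1] ?? '') . ' id="' . $id . '">';
        },
        $article
    );

    return $result ?? $article; // fall back to the input if the regex engine fails
}

echo numberParagraphs('<p>a</p><p class="x">b</p><p>c</p>'), "\n";
// prints: <p id="1">a</p><p class="x" id="2">b</p><p id="Last">c</p>
```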
Assuming your HTML is well-formed, you could use the SimpleXMLElement class to do so:
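$sxe = new SimpleXMLElement($row['Article']); // needs a single root element wrapping the paragraphs
$i = 0;
$p = null;
foreach ($sxe->children() as $p) {
    $p->addAttribute('id', (string) ++$i); // number the children from 1
}
if ($p !== null) {
    $p['id'] = 'Last'; // overwrite the ID of the last paragraph
}
echo $sxe->asXML(); // the output includes an XML declaration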
If it isn't well-formed, you could use the DOMDocument class instead:
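$dom = new DOMDocument;
$dom->loadHTML($row['Article']);
$i = 0;
$p = null;
foreach ($dom->getElementsByTagName('p') as $p) {
    $p->setAttribute('id', (string) ++$i); // number the paragraphs from 1
}
if ($p !== null) {
    $p->setAttribute('id', 'Last'); // overwrite the ID of the last paragraph
}
echo $dom->saveHTML();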

Can't access parent node of an element via DOM

I've got a table with 6 columns, and within the first column, there's a div with some info I don't need. So, I want to delete all the divs, but I keep getting a "Trying to get property of non-object" error. Here's the code:
$dom = new DomDocument();
@$dom->loadHTML($html); //I've acquired this page via curl
$tbl = $dom->getElementsByTagName('table')->item(4); //The fourth table in the page
$div = $tbl->getElementsByTagName('div');
for ($i = 0; $i < $td->length-1; $i++){
$chld = $div->item($i);
$prnt = $chld->parentNode; <-- here I get the error
$prnt->removeChild($chld);
}
Can you help me? Either by pointing the mistake I've made or giving me a hint at how to do it.
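A few details in the snippet above likely contribute to the error. The loop bound uses $td, which is never defined. item() is zero-based, so item(4) is the fifth table and returns null if the page has fewer tables. And getElementsByTagName() returns a live list, so removing nodes while counting upwards shifts the remaining items and item($i) can come back null. One way around this is sketched below, assuming PHP 7 or later; the function name removeDivs and the sample markup are hypothetical.

```php
<?php
// Remove every <div> nested inside the given element and return how many were removed.
function removeDivs(DOMElement $container): int
{
    // Copy the live node list to a plain array first, so that removing
    // nodes does not shift the items that are still to be visited.
    $divs = iterator_to_array($container->getElementsByTagName('div'));
    foreach ($divs as $div) {
        $div->parentNode->removeChild($div);
    }
    return count($divs);
}

// Hypothetical sample markup standing in for the page fetched via curl.
$html = '<table><tr><td><div>unwanted</div>keep</td><td>other</td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$tbl = $dom->getElementsByTagName('table')->item(0); // zero-based index
if ($tbl !== null) {
    removeDivs($tbl);
}
echo $dom->saveHTML();
```

Checking the result of item() for null before using it also guards against pages where the expected table is missing.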
