PHP Regex replace link if it does not have data attribute - php

I need to loop through a bunch of HTML code and remove the <a> </a> tags from all links which DONT include the data attribute data-link="keepLink"
Here is an example of body value I need to modify:
<p><a data-link=\"keepLink\" href=\"[1|9999|16|191967|256]\">Daily Racing Link</a></p>\r\n<br>\n <strong>OFFER – Get up to a £400 deposit bonus when you sign up with Fanduel.</strong>
After the modification I need it to look like (so the offer link is removed):
<p><a data-link=\"keepLink\" href=\"[1|9999|16|191967|256]\">Daily Racing Link</a></p>\r\n<br>\n <strong>OFFER – Get up to a £400 deposit bonus when you sign up with Fanduel.</strong>
So far I have managed to get the first half of the link removing if it doesn't include a data-link="keepLink" attribute. But the closing </a> is still present.
Here is the regex I have used:
$result["body_value"] = preg_replace('/<a (?![^>]*data-link="keepLink").*?>/i', '', $result["body_value"]);
So the new body value looks like:
<p><a data-link=\"keepLink\" href=\"[1|9999|16|191967|256]\">Daily Racing Link</a></p>\r\n<br>\n <strong>OFFER – Get up to a £400 deposit bonus when you sign up with Fanduel</a>.</strong>

The DOMDocument extension is available by default in PHP. It is presumably faster and is designed exactly for what you are trying to achieve. You can use it to load your document and search for any links without a data-link attribute like this:
$dom = new DOMDocument;
$dom->loadHTMLFile('http://www.example.com'); // load the file
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a[not(#data-link=\'keepLink\')]'); // search for links that do not have the 'data-link' attribute set to 'keepLink'
foreach($nodes as $element){
$textInside = $element->nodeValue; // get the text inside the link
$parentNode = $element->parentNode; // save parent node
$parentNode->replaceChild(new DOMText($textInside), $element); // remove the element
}
$myNewHTML = $dom->saveHTML(); // see http://php.net/manual/ro/domdocument.savehtml.php for limitations such as auto-adding of doc-type
echo $myNewHTML;
Proof of concept: https://3v4l.org/ejatQ.
Please bear in mind that this will take only the text values inside the elements without a data-link='keepLink' attribute value.

If you are set on regex and don't want to use a parser.
Try this
<a (?!data-link=)[^>]*>((?!<\/a>).*?)<\/a>
And replace it by $1. To keep your link-text.
See https://regex101.com/r/wKQk4p/2
Please say if you need any further explaination.

Related

How to get xPath nodeValue USD dollar amount

I'm trying to start from the <span> element that has text Value when transacted
Then get its parent <div> and get following sibling which is a <div> and from that <div> get the text of the child <span>.
From what I can tell, the code is correct and should echo $1,034.29.
It echos $0.00 instead.
What am I missing here?
php code:
$a = new DOMXPath($doc);
$dep_val_txt = $a->query("//span[contains(text(), 'Value when transacted')]");
$dep_val_nxt_elem = $a->query("parent::div", $dep_val_txt[0]);
$dep_val_elem = $a->query("following-sibling::*[1]", $dep_val_nxt_elem[0]);
$dep_val = $dep_val_elem->item(0)->childNodes->item(0)->nodeValue;
echo $dep_val;
html code:
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh sc-1n72lkw-0 bKaZjn" opacity="1">Value when transacted</span>
</div>
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh u3ufsr-0 gXDEBk" opacity="1">$1,034.29</span>
</div>
In case someone else stumbles upon this question in the future, I will summarize the solution which was concluded by conversation with OP in the comments:
The issue here is not with the DOM selectors, as observed by the fact that his output is $0.00 even though he is not formatting the value to appear as a currency. This led me to believe that the website being scraped is in fact using placeholder values which are updated on the client side using Javascript. The reason this cannot be resolved with selectors is because the DOM received by PHP will be the initial render, which does not contain the values we wish to scrape.
So the solution is to examine the website being scraped to determine where and how the values are being fetched before being added to the DOM on the client side. For example, if the website is using an API call to fetch the values, one can simply use the same API to fetch the intended data without having to scrape the HTML DOM at all.
If you follow OPs question literally
start from the <span> element that has text "Value when transacted"
get its parent <div>
get following sibling which is a <div>
get the text of the child <span>
then the xpath expression should be
//span[text()='Value when transacted']/parent::div/following-sibling::div/span
You might find it easier and faster to process using a regex to match the price, here's a quick example in PHP:
<?php
// Your input HTML (as per your example)
$inputHtml = <<<HTML
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh sc-1n72lkw-0 bKaZjn" opacity="1">Value when transacted</span>
</div>
<div class="sc-8sty72-0 cyLejs">
<span class="sc-1ryi78w-0 bFGdFC sc-16b9dsl-1 iIOvXh u3ufsr-0 gXDEBk" opacity="1">$1,034.29</span>
</div>
HTML;
$matches = [];
// Look for any div > span element which contains a string starting with $ and then match a number (allowing for a , or . within the price matched).
if (preg_match_all('#<div.*>\s*<span.*?>\$([0-9.,]+)</span>\s*</div>#mis', $inputHtml, $matches)) {
echo 'Price found: ' . $matches[1][0] . PHP_EOL;
}
Console output from this:
Price found: 1,034.29

scraping images from url using php

i am trying to make a page that allows me to grab and save images from another link , so here's what i want to add on my page:
text box (to enter url that i want to get images from).
save dialog box to specify the path to save images.
but what i am trying to do here i want to save images only from that url and from inside specific element.
for example on my code i say go to example.com and from inside of element class="images" grab all images.
notes: not all images from the page, just from inside the element
whether element has 3 images in it or 50 or 100 i don't care.
here's what i tried and worked using php
<?php
$html = file_get_contents('http://www.tgo-tv.net');
preg_match_all( '|<img.*?src=[\'"](.*?)[\'"].*?>|i',$html, $matches );
echo $matches[ 1 ][ 0 ];
?>
this gets image name and path but what i am trying to make is a save dialog box and the code must save image directly into that path instead of echo it out
hope you understand
Edit 2
it's ok of Not having save dialog box. i must specify save path from the code
If you want something generic, you can use:
<?php
$the_site = "http://somesite.com";
$the_tag = "div"; #
$the_class = "images";
$html = file_get_contents($the_site);
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//'.$the_tag.'[contains(#class,"'.$the_class.'")]/img') as $item) {
$img_src = $item->getAttribute('src');
print $img_src."\n";
}
Usage:
Change the site, tag, which can be a div, span, a, etc. also change the class name.
For example, change the values to:
$the_site = "https://stackoverflow.com/questions/23674744/what-is-the-equivalent-of-python-any-and-all-functions-in-javascript";
$the_tag = "div"; #
$the_class = "gravatar-wrapper-32";
Output:
https://www.gravatar.com/avatar/67d8ca039ee1ffd5c6db0d29aeb4b168?s=32&d=identicon&r=PG
https://www.gravatar.com/avatar/24da669dda96b6f17a802bdb7f6d429f?s=32&d=identicon&r=PG
https://www.gravatar.com/avatar/24780fb6df85a943c7aea0402c843737?s=32&d=identicon&r=PG
Maybe you should try HTML DOM Parser for PHP. I've found this tool recently and to be honest it works pretty well. It was JQuery-like selectors as you can see on the site. I suggest you to take a look and try something like:
<?php
require_once("./simple_html_dom.php");
foreach ($html->find("<tag>") as $<tag>) //Start from the root (<html></html>) find the the parent tag you want to search in instead of <tag> (e.g "div" if you want to search in all divs)
{
foreach ($<tag>->find("img") as $img) //Start searching for img tag in all (divs) you found
{
echo $img->src . "<br>"; //Output the information from the img's src attribute (if the found tag is <img src="www.example.com/cat.png"> you will get www.example.com/cat.png as result)
}
}
?>
I hope i helped you less or more.

How to parse multiple elements in portions for html via Simple Html Dom

I am attempting to get various elements inside of an li as shown below. I am pretty new to this so I may not be using the most efficient methods but this is where I have started...
EXAMPLE CODE SIMPLIFIED....
<li id='entry_0' title='09879879'>
<div ....>
<h2> The title text would go here </h2>
<span class='entrySize' ....> 20oz </span>
<span class='entryPrice' ....> $32.09 </span>
<span class='anotherEntry' ....> More Data I need To Grab </span>
.......
</div>
</li>
<li> .... With same structure as above .... 100's of entries like this </li>
I know how to pull individual parts separately but having trouble grasping how to do it grouped within a portion of the html.
$filename = "directory/file.html";
$html = file_get_html($filename);
for($i=0; $i<=count(entryNumber);$i++)
{
$li_id = "entry_".$i;
foreach($html->find('li[id='.$li_id.']') as $li) {
echo $li->innertext;
}
}
So this gets me the content in the line item tag with the id number as the unique attribute. I would like to grab the h2 text, entrySize, entryPrice etc as I iterate through the line item tags. What I don't understand is once I have the line item tag content how can I parse through that line item inner tags and attributes. There maybe other parts of the full HTML document that has tags with same id, class as these throughout the document so I am breaking this down to portions and than looking to parse each section at a time.
I would also like to pull the title attribute out of the title tag for the li tag.
I hope my explanation make sense.
You should probably use a DOM parser. PHP comes bundled with one, and there are many other's you could use.
http://php.net/dom
PHP Simple HTML DOM Parser
<?php
$html = file_get_content($page);
$doc = new DOMDocument();
$doc->loadHTML($html);
// now find what you need
$items = $dom->getElementsByTagName('li');
foreach ($items as $item) {
$id = $item->getAttribute('id');
if (strpos($id, 'item_') !== false) {
// found matchin li, grab its children
}
}
Use this as a baseline, we can't write all the code for you. Check out the PHP docs to finish this :) From what I have so far, you need to follow the docs to make it grab the child values, and handle them.

Simple HTML DOM Parser - Get all plaintex rather than text of certain element

I tried all the solutions posted on this question. Although it is similar to my question, it's solutions aren't working for me.
I am trying to get the plain text that is outside of <b> and it should be inside the <div id="maindiv>.
<div id=maindiv>
<b>I don't want this text</b>
I want this text
</div>
$part is the object that contains <div id="maindiv">.
Now I tried this:
$part->find('!b')->innertext;
The code above is not working. When I tried this
$part->plaintext;
it returned all of the plain text like this
I don't want this text I want this text
I read the official documentation, but I didn't find anything to resolve this:
Query:
$selector->query('//div[#id="maindiv"]/text()[2]')
Explanation:
// - selects nodes regardless of their position in tree
div - selects elements which node name is 'div'
[#id="maindiv"] - selects only those divs having the attribute id="maindiv"
/ - sets focus to the div element
text() - selects only text elements
[2] - selects the second text element (the first is whitespace)
Note! The actual position of the text element may depend on
your preserveWhitespace setting.
Manual: http://www.php.net/manual/de/class.domdocument.php#domdocument.props.preservewhitespace
Example:
$html = <<<EOF
<div id="maindiv">
<b>I dont want this text</b>
I want this text
</div>
EOF;
$doc = new DOMDocument();
$doc->loadHTML($html);
$selector = new DOMXpath($doc);
$node = $selector->query('//div[#id="maindiv"]/text()[2]')->item(0);
echo trim($node->nodeValue); // I want this text
remove the <b> first:
$part->find('b', 0)->outertext = '';
echo $part->innertext; // I want this text

Add a span tag to Title without javascript

UPDATE:
Yes I am Using PHP in my pages.
Hello Friends I was thinking..... Is there a way to add a <span> tag to the title without using javascript?
May be using Regex or php or some other method. I dont really know.
Let me explain....
My HTML is like this:
<h3 class="title">The Title Goes Here</h3>
What I want is to automatically add a span tag, so the the final HTML looks like this.
<h3 class="title"><span>The </span>Title Goes Here</h3>
I want to wrap only the first word of the title in a <span> tag.
I know this can easily be dont using Javascript but I am looking for a non-javascript solution.
Please Help!
You can do this with DOMDocument in PHP if you don't want to do it with the javascript DOM:
$html = '<h3 class="title">The Title Goes Here</h3>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$xp = new DOMXPath($doc);
foreach($xp->query('//h3[#class="title"]') as $parent) {
$title = $parent->nodeValue;
list($first, $rest) = explode(' ', $title, 2);
$span = new DOMElement('span', $first. ' ');
$parent->nodeValue = $rest;
$parent->insertBefore($span, $parent->firstChild);
}
foreach($doc->getElementsByTagName('body')->item(0)->childNodes as $node)
{
echo $doc->saveHTML($node);
}
My answer is that the cannot be done. You can't manipulate a page in the browser without JavaScript. This can only be achieved by editing the page on the server manually, or by dynamically generating it using PHP logic, or an equivalent solution, of which there are many.
If you are doing this for a corporate solution that is only used on a single corporate standard browser, you could look into building a plugin for the browser.

Categories