Altering <a> in over 3000 Posts with DomDocument - Need Advice - php

I used php DomDocument to extract all the a tags from over 3000 posts and collected them in a database as follows -
I used domDocument C14N() function to fill the existing_link table.
id | existing_link | replacement_link
1 | <a class="class1" href="domain1.com" rel="nofollow">Domain1.com</a> | Domain2.com
My initial thought was to simply use Laravel's Str::replace() to find and replace the links using above table. But, C14N() did something I did not think of. It put the link's attributes in alphabetical order. That is, while the link in my post exists as -
Domain1.com
The C14N() function saved it with attribute order changed (class -> href -> rel)! Look existing_link in above table.
As a result, I cannot use Laravel's Str::replace() to quickly replace links; even though they technically are the same links; they are not the same strings.
Each post in my DB can have multiple links to be replaced based on the table I've prepared. My best attempt so far is as follows -
$new_links = DB::table('links')->get();
foreach ($new_links as $new_link)
{
$post = Post::where('id', $new_link->id)->first();
$post_body = $post->body;
$domDocument = new \DOMDocument();
$domDocument->loadHTML($post_body, LIBXML_NOERROR);
// Pull the links in the post body
$old_links = $domDocument->getElementsByTagName('a');
foreach ($old_links as $old_link)
{
if($old_link->C14N() == $new_link->existing_link)
{
// Perform the replacement. I can't figure out how to do this using DOMDOcument.
}
}
}
I'd really appreciate any help with the final replacement of links using DOMDocument or any other approach. I look forward to your responses and thank you for your time.

Related

Scaping IFrame inside a HTML page with values loaded using Ajax request

I need to scrape this HTML page using PHP ...
http://www.cittadellasalute.to.it/index.php?option=com_content&view=article&id=6786:situazione-pazienti-in-pronto-soccorso&catid=165:pronto-soccorso&Itemid=372
... I need to extract the numbers for the rows "Rosso", "Giallo", Verde" and "Bianco" (note that these numbers are dynamic so they can change when you refresh the page but it doesn't matter....).
I've seen that these rows are inside some IFrames (for example ... http://listeps.cittadellasalute.to.it/?id=01090201 ), and the values are loaded using an ajax request (for examples http://listeps.cittadellasalute.to.it/gtotal.php?id=01090101).
Are there some solutions to scrape directly (I'd like to avoid to parse singular jsons ....), these values from the original HTML page using PHP and $xpath->query?
Suggestions / examples?
I think the problem is that the values aren't in the original page, they are built once the page is loaded. So you would need to use something which will honour all the Javascript functionality (i.e. Selinium webdriver) which is a bit overkill for what you want to do (I assume). Much easier to directly process the IFrame.
You could extract the URL's of the IFrames from the original page ...
$url = "http://www.cittadellasalute.to.it/index.php?option=com_content&view=article&id=6786:situazione-pazienti-in-pronto-soccorso&catid=165:pronto-soccorso&Itemid=372";
$pageContents = file_get_contents($url);
$page = simplexml_load_string($pageContents, "SimpleXMLElement", LIBXML_NOERROR | LIBXML_ERR_NONE);
$ns = $page->getDocNamespaces();
$page->registerXPathNamespace('def', array_values($ns)[0]);
$iframes = $page->xpath("//def:iframe");
foreach ( $iframes as $frame ) {
echo "iframe:".$frame['src'].PHP_EOL;
}
Which gives (just now)
iframe:http://listeps.cittadellasalute.to.it/?id=01090101
iframe:http://listeps.cittadellasalute.to.it/?id=01090201
iframe:http://listeps.cittadellasalute.to.it/?id=01090301
iframe:http://listeps.cittadellasalute.to.it/?id=01090302
You can then process these pages.

Get attributes from post data in PHP

I'm using laravel to create a simple social network. Users can type # in the post area to get a list of their friends to mention them. Every mention in a link like this (using zurb/tribute from github)
<a type="mention" href="/user/Jordan" title="Jordan">Jordan</a>
Normal links other than mentions won't have type='mention'
Now when I get the post and insert it into the database I need to get a list of users mentioned in the post. I'm looking for links which have the type ='mention' and if there's any I want to get the title of everyone to insert into the notification system. What PHP code do I need to add in this if statement?
if(stristr(request('post'),' type="mention" ')){
}
Aside from using an AST (Abstract Syntax Tree), your best bet would be to either use DOM on PHP Side, e.g.:
$string = '<a type="mention" href="/user/Jordan" title="Jordan">Jordan</a>';
$doc = new \DOMDocument();
$doc->loadHTML($string);
$xpath = new \DOMXPath($doc);
foreach ($xpath->query("//a[#type='mention']") as $a) {
$href = $a->getAttribute('href');
$title = $a->getAttribute('title');
echo sprintf("Found mention of %s with href of %s\n", $title, $href);
}
However, I probably wouldn't be sending the A node back to the server. You should consider working out some way to make it a display-only feature implemented on the browser side, and simply send the "#jordan" string back to the server.

Fetch data from site Page by Page & go through sub links

URL : http://www.sayuri.co.jp/used-cars
Example : http://www.sayuri.co.jp/used-cars/B37753-Toyota-Wish-japanese-used-cars
Hey guys , need some help with one of my personal projects , I've already wrote the code to fetch data from each single car url (example) and post on my site
Now i need to go through the main url : sayuri.co.jp/used-cars , and :
1) Make an array / list / nodes of all the urls for all the single cars in it , then run my internal code for each one to fetch data , then move on to the next one
I already have the code to save each url into a log file when completed (don't think it will be necessary if it goes link by link without starting from the top but will ensure no repetition.
2) When all links are done for the page , it should move to the next page and do the same thing until the end ( there are 5-6 pages max )
I've been stuck on this part since last night and would really appreciate any help . Thanks
My code to get data from the main url :
$content = file_get_contents('http://www.sayuri.co.jp/used-cars/');
// echo $content;
and
$dom = new DOMDocument;
$dom->loadHTML($content);
//echo $dom;
I'm guessing you already know this since you say you've gotten data from the car entries themselves, but a good point to start is by dissecting the page's DOM and seeing if there are any elements you can use to jump around quickly. Most browsers have page inspection tools to help with this.
In this case, <div id="content"> serves nicely. You'll note it contains a collection of tables with the required links and a <div> that contains the text telling us how many pages there are.
Disclaimer, but it's been years since I've done PHP and I have not tested this, so it is probably neither correct or optimal, but it should get you started. You'll need to tie the functions together (what's the fun in me doing it?) to achieve what you want, but these should grab the data required.
You'll be working with the DOM on each page, so a convenience to grab the DOMDocument:
function get_page_document($index) {
$content = file_get_contents("http://www.sayuri.co.jp/used-cars/page:{$index}");
$document = new DOMDocument;
$document->loadHTML($content);
return $document;
}
You need to know how many pages there are in total in order to iterate over them, so grab it:
function get_page_count($document) {
$content = $document->getElementById('content');
$count_div = $content->childNodes->item($content->childNodes->length - 4);
$count_text = $count_div->firstChild->textContent;
if (preg_match('/Page \d+ of (\d+)/', $count_text, $matches) === 1) {
return $matches[1];
}
return -1;
}
It's a bit ugly, but the links are available inside each <table> in the contents container. Rip 'em out and push them in an array. If you use the link itself as the key, there is no concern for duplicates as they'll just rewrite over the same key-value.
function get_page_links($document) {
$content = $document->getElementById('content');
$tables = $content->getElementsByTagName('table');
$links = array();
foreach ($tables as $table) {
if ($table->getAttribute('class') === 'itemlist-table') {
// table > tbody > tr > td > a
$link = $table->firstChild->firstChild->firstChild->firstChild->getAttribute('href');
// No duplicates because they just overwrite the same entry.
$links[$link] = "http://www.sayuri.co.jp{$link}";
}
}
return $links;
}
Perhaps also obvious, but these will break if this site changes their formatting. You'd be better off asking if they have a REST API or some such available for long term use, though I'm guessing you don't care as much if it's just a personal project for tinkering.
Hope it helps prod you in the right direction.

Google news feed content

So lets say i have a google news feed, like this: https://news.google.com/news/feeds?pz=1&cf=all&ned=no_no&hl=no&q=%22something%22&output=atom&num=1
Grabbing the title, author and link would be easy, but how would i go around getting say the first 200 characters of the content? its full of html, and mixed in with the title and author aswell.
i could use strip_tags on it, but it would still be a mess.
Any way to make google return a ['description'] maybe?
or is there perhaps any other good news feeds that gives me the content in a way thats easier to manage?
[edit]
Update on how i ended up doing it.
$news = #simplexml_load_string(file_get_contents('https://news.google.com/news/feeds?pz=1&cf=all&ned=no_no&hl=no&q=%22molde+fotballklubb%22+OR+%22tornekrattet%22+OR+%22mfk%22+OR+%22oddmund+bjerkeset%22+-%22moss%22&output=atom&num=1'), 'SimpleXMLElement', LIBXML_NOCDATA);
$data = get_object_vars($news->{'entry'});
$test = explode('<font size="-1">', $data['content']);
$link = get_object_vars($data['link']);
$return['title'] = strip_tags($test[0]);
$return['author'] = strip_tags($test[1]);
$return['description'] = strip_tags($test[2]);
$return['link'] = $link['#attributes']['href'];
It is still not working properly, but thats because the feed gives me the content in different ways all the time. Sometimes the content of the news article itself will just be metadata like the authors and image descriptions.
And the breaking it up by html tags when the html have changes from time to time causes some problems. But i cant figure out any othe way of doing it with this feed.
You could try loading the HTML in a DOMDocument instance and extract the parts you need, or use a wrapper for it like Goutte which makes it a lot easier to extract portions you need.
http://php.net/manual/en/class.domdocument.php
https://github.com/fabpot/Goutte

Updating the XML file using PHP script

I'm making an interface-website to update a concert-list on a band-website.
The list is stored as an XML file an has this structure :
I already wrote a script that enables me to add a new gig to the list, this was relatively easy...
Now I want to write a script that enables me to edit a certain gig in the list.
Every Gig is Unique because of the first attribute : "id" .
I want to use this reference to edit the other attributes in that Node.
My PHP is very poor, so I hope someone could put me on the good foot here...
My PHP script :
Well i dunno what your XML structure looks like but:
<gig id="someid">
<venue></venue>
<day></day>
<month></month>
<year></year>
</gig>
$xml = new SimpleXmlElement('gig.xml',null, true);
$gig = $xml->xpath('//gig[#id="'.$_POST['id'].'"]');
$gig->venue = $_POST['venue'];
$gig->month = $_POST['month'];
// etc..
$xml->asXml('gig.xml)'; // save back to file
now if instead all these data points are attributes you can use $gig->attributes()->venue to access it.
There is no need for the loop really unless you are doing multiple updates with one post - you can get at any specific record via an XPAth query. SimpleXML is also a lot lighter and a lot easier to use for this type of thing than DOMDOcument - especially as you arent using the feature of DOMDocument.
You'll want to load the xml file in a domdocument with
<?
$xml = new DOMDocument();
$xml->load("xmlfile.xml");
//find the tags that you want to update
$tags = $xml->getElementsByTagName("GIG");
//find the tag with the id you want to update
foreach ($tags as $tag) {
if($tag->getAttribute("id") == $id) { //found the tag, now update the attribute
$tag->setAttribute("[attributeName]", "[attributeValue]");
}
}
//save the xml
$xml->save();
?>
code is untested, but it's a general idea

Categories