Drupal URL structure for scraping - php

I am trying to scrape a Drupal site with a Python script to collect past music gigs.
With a WordPress site I would iterate through URLs like this:
http://wordpressevents.com/?p=1
...
http://wordpressevents.com/?p=10000
...and that would get me forwarded to a page (if there's one there) that I could scrape. The actual URL would be something like:
http://wordpressevents.com/music/some-band-youve-never-heard-of/
My Drupal site also has sections (e.g. /gigs/ or /classical/ etc).
Is there any way to find out what its URLs might be, so that I can scrape it with Python and BeautifulSoup (other suggestions welcome)?
Ideally, I would find out what the structure is...
http://drupalevents.com/drupost?=1
...
http://drupalevents.com/drupost?=10000
etc.
But maybe it doesn't work like this?

In Drupal the only guaranteed content URL structure is /node/[some number].
So the best way to do this for an arbitrary Drupal site is to start at /node/1 and go up from there, incrementing by 1 each time. Alternatively, look at the source of the newest page on the site and find its node ID in the class attribute of the body tag; then you know the highest number and can work backwards. For example, for node/185324 the body could have the class node-185324. This might not be there, as the body classes depend on how the site was set up.
Most sites also use the Pathauto module to give pages something a bit friendlier than node/123.
Pathauto uses tokens that the site builder specifies to generate nice URLs for content. One common pattern is /content/[node:title]. I doubt this will really help you, but at least it tells you something about how the Drupal site is set up.
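To make the /node/N idea concrete, here is a minimal sketch in Python using only the standard library. It assumes the site answers on /node/N and that the theme puts a class such as node-123 or page-node-123 on the body tag; the function names and the class pattern are assumptions for illustration, not something every Drupal site guarantees.

```python
import re
import urllib.error
import urllib.request


def node_urls(base_url, start, stop):
    """Yield candidate /node/N URLs for N in [start, stop]."""
    base = base_url.rstrip("/")
    for nid in range(start, stop + 1):
        yield f"{base}/node/{nid}"


# Assumed pattern: themes often put "node-123" or "page-node-123" on <body>.
BODY_CLASS_RE = re.compile(r'<body[^>]*class="([^"]*)"', re.IGNORECASE)
NODE_ID_RE = re.compile(r"(?:^|\s)(?:page-)?node-(\d+)(?=\s|$)")


def extract_node_id(html):
    """Return the node id found in the body class, or None if absent."""
    body = BODY_CLASS_RE.search(html)
    if body is None:
        return None
    match = NODE_ID_RE.search(body.group(1))
    return int(match.group(1)) if match else None


def fetch(url):
    """Return (final_url, html), or None if the node is missing or forbidden."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            # geturl() is the address after redirects, e.g. the friendly alias.
            return resp.geturl(), resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError:
        return None  # e.g. 404 or 403: no public node at this id
```

You would loop over node_urls(...), call fetch() on each URL, and hand the HTML to your parser of choice. Pausing between requests (for example with time.sleep) keeps the load on the site reasonable.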

Related

the right way to show content in themes

I'm new to Drupal 7 and I'm having a hard time with theme coding; more specifically, I don't know how I should show predefined content. The project's previous developer said that I absolutely shouldn't hard-code any content or links directly in template files and that I should put them in modules/blocks (with regions) instead. That would be OK, but the design I'm coding now is complex and has a lot of content, and writing a module for each thing takes too much time.
I have very similar design to this one:
http://classter-html.themerex.net/
So, what is the best and right way to show content (and links) in templates? Of course, I could just hard-code it, but I'm the kind of person who follows good practices.
Let's look at the example site you gave. In Drupal it would be set up something like this:
The carousel at the top: Slick
Classter Team: A View of content type "employee" displayed in a block.
The photos: A View of the files, or of a photo content type. Or one of multiple purpose built photo album modules.
Application Features: A block with custom HTML-code.
Things that only show up on one page (like the start page) can be done using full HTML in that node's content. Views and blocks with HTML usually solve the rest.
If you are sure that some content will never change (it doesn't need to be translated, for instance), I don't see why you shouldn't hard-code it.
Regarding hard-coded links, use root-relative paths (a path starting with / rather than a full URL with the domain), so they still work if you move the site to another domain.

Hide segment in URL but give code access to hidden segment

I'm using Structure and have a "Supernav" page with multiple children that will make up the supernav for the site. I thought this would be a nice way to have all pages on the site accessible to the client via one location: the Structure UI.
If you visit any of the child pages in the "supernav" group the URL comes out like this:
http://website.com/supernav/prospective-students
I'd love to be able to remove the supernav segment of those URLs so that it ends up being:
http://website.com/prospective-students
I don't even want the supernav segment to appear in the status bar when you hover over these links on the page. Is this possible? With CodeIgniter this comes down to a simple routing rule, but I don't know if that's an option with EE.
Appreciate any help I can get!
This may be a bit after the fact, but have you considered using NavEE for this sort of situation and replacing Structure wholesale? You can build multiple navigations and don't have to "hide" the content. I love Structure, but you would have to use .htaccess to get the results you want, as well as some routing stash/embeds.
You could use Freebie for this as well.
Take control of your URLs — define segments that you want EE to ignore
completely. Use 'freebie' segments to trigger template behavior, build
dynamic archives inside Structure, or just build special URLs for
analytics purposes. Freebie allows you to use segments in powerful,
flexible ways without the hassle of dealing with strict URL parsing
(like Structure's).
You just add supernav to the Freebie settings (under Freebie segments) and supernav will then be ignored. If you still need to run a conditional off of supernav, which would normally be {segment_1}, you can use {freebie_1} instead. Read more about that in the add-on's documentation over at Devot:ee and in this post by its creator.
With that said, I'm not positive if you can output the nav using normal Structure tags and get it to display all children of supernav without supernav itself still being in the URL. To get around this you would need to hard code your navigation (or use NavEE or Taxonomy.)
I hope someone can verify if Structure has a tag/param, or not, which addresses this issue as I'm not really sure either way.

Building PHP WYSIWYG Editor

I am building a web application in which the user may add a page, edit the layout, drag drop element, resize element, format the text, edit the element attribute etc.
In the page the user may include (retrieve) dynamic data, like maybe data from database, data generated by php code, etc.
I have played around with CakePHP and jQuery lately and tried to build this app, but I got stuck on how to display the PHP code appropriately. I looked into the CakePHP core code, found out about output buffering, and tried to use output buffering to parse the PHP code and regex to display it, but I would probably be reinventing the wheel if I wrote the parser myself.
What I am asking is:
OK, to be more simple and specific: how do I save and load a page created by the user, especially if the page contains PHP code? Is there any method other than writing my own parser, or perhaps a library that parses PHP code?
That's all for now. Does anyone have an idea of how to implement this, or know of a page/website or some sample code I could use as a reference?
Thanks
I'm not sure you'll find any good answer here about that.
Whoa, I don't know where to start. I'll start with number 3. You want widgets. That means you have to create widget classes or objects that have a template or something that makes them drawable (well, kind of). If I were you, they would be loaded from JavaScript and not really from PHP. Each widget would in some way be an individual application loaded into a div using JavaScript.
Point 2. You wanted widgets. When you add widgets to your page, you have to save some information, such as position, title, dimensions and so on. You may even save creation parameters. For example, a ListWidget may be started with a different ItemProvider. That way you don't have to write 1000 widgets, only one that shows different content. So now you have widgets, dimensions and positions, which leads us to point 1.
Point 1. Once you have your widgets, positions and dimensions, you send the data you used to create them, associated with the page, to the server. That leads us to point 2 again.
Once you have saved a page, you can display it by retrieving all its widgets with their parameters and so on. That leaves you two options:
Generate JavaScript that will recreate the saved widgets.
Generate HTML with all the widgets.
Option 1 is simpler, since option 2 won't bind the HTML to JavaScript by itself. Option 2, on the other hand, is better since there is only one request to the server.
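As a rough illustration of saving widgets as data rather than as code, here is a minimal sketch in Python (the same shape would work in PHP). The Widget fields, the "list" kind and the provider parameter are hypothetical names chosen for the example; render_html corresponds to the second option, generating the HTML on the server.

```python
import json
from dataclasses import asdict, dataclass, field
from html import escape


@dataclass
class Widget:
    kind: str      # hypothetical widget type, e.g. "list"
    title: str
    x: int         # position in pixels
    y: int
    width: int     # dimensions in pixels
    height: int
    params: dict = field(default_factory=dict)  # creation parameters


def save_page(widgets):
    """Serialize a page's widgets to a JSON string for storage."""
    return json.dumps([asdict(w) for w in widgets])


def load_page(payload):
    """Rebuild Widget objects from the stored JSON string."""
    return [Widget(**item) for item in json.loads(payload)]


def render_html(widgets):
    """Emit one absolutely positioned <div> per widget."""
    parts = []
    for w in widgets:
        style = (f"position:absolute;left:{w.x}px;top:{w.y}px;"
                 f"width:{w.width}px;height:{w.height}px")
        parts.append(
            f'<div class="widget widget-{escape(w.kind)}" style="{style}">'
            f"<h2>{escape(w.title)}</h2></div>"
        )
    return "\n".join(parts)
```

Because only data is stored (type, position, size, parameters), nothing the user saves ever has to be parsed or executed as PHP; the server decides how each widget type is drawn.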
Oh, and one last thing: you should set yourself some limits. This kind of thing can get very complicated and, unfortunately, not turn out that great. Take Drupal, for example. It does lots of cool stuff, but as soon as you install lots of modules, Drupal turns into some sort of memory-eating monster. And almost all the time you don't really need that much dynamic content. Fixed layouts will work nicely almost 99% of the time.
I also have to say this: if you try to create an application that gives users as much power as a scientist who could raise a seven-legged cat, I think you're going to be playing with really obscure forces!

Very basic HTML/scripting/active page question

A friend has asked me for help with her website design. Although I know a fair amount about the basics behind HTML, XML, PHP, ASP.NET, JavaScript, etc., I'm not really comfortable sitting down and coding from scratch. All of the work I do is in Java, C++, and so on.
My friend would like to add a vertically scrolling marquee to her site - no problem, there is code for that all over the internet. Here is the tricky part - she would like the text to be dynamically pulled from another website. This isn't like a simple text file, either - it's a list of names from a specific blog post, so there would be a lot of text processing involved to wade through all of the other markup, and extract the relevant info.
The way I see it, here are her options -
1) Write some kind of Perl script or the like that is set to run daily. The script visits the blog, extracts the necessary info, and then updates the HTML file's marquee text with the new info.
2) Some sort of active page written in ASP or PHP that dynamically builds the marquee (and the rest of the site) each time the site is visited, basically doing the work of the Perl script on every visit. This seems like it has the potential to be somewhat slow.
Per my understanding, those are her only options. Am I correct? Is there no simple way to do this in JavaScript that I am just missing? I know you can reference an image to be dynamically pulled with the marquee, but this isn't that simple...
Thanks.
EDIT: I guess where I was going with my question was this: Unless I implement this statically, this is going to be fairly involved, right? I believe it is over my head. This is why I would like to simply copy/paste the text list into the html document. It would need to be updated every time the blog does, but that only appears to happen every few months, so that's not a large chore. I realize this is a lazy solution, but this is from someone very inexperienced in web development.
For reference, this is the SPECIFIC blog post which the text will come from, and my friend would ONLY like to display that list of names that begins when you scroll several paragraphs down.
http://truthnottasers.blogspot.com/2008/04/what-follows-are-names-where-known.html
It depends on what the list of names looks like, i.e. how much intelligence is needed to parse it. But this is something that could fairly easily be pulled, parsed and displayed using Ajax, for example with jQuery.
All the blogs I have ever seen have an RSS feed. Why not just grab the feed?... Google provides javascript that does only this.
Google Ajax Feed API
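For what it's worth, reading titles and links out of a feed takes very little code on the server side too. A minimal sketch in Python using only the standard library, assuming the feed is plain RSS 2.0 (an Atom feed uses namespaced element names and would need different lookups); rss_items is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def rss_items(feed_xml):
    """Return (title, link) pairs from an RSS 2.0 document string."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        # findtext returns the default when the child element is missing.
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        items.append((title, link))
    return items
```

Since the names in question live inside the body of a single post, the feed only gets you to the post's content; the list itself would still have to be picked out of that content.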
The RSS suggestion sounds good. If you can't get it in the RSS you could screen scrape the content.
If you could do it with JavaScript, I think it would suffer from the same resource issues as your once-a-day Perl script and the every-load ASP/PHP approach, since it would still have to fetch the web content by making a call to the other website.
Another option is to use ASP.NET and enable caching, so that when other visitors come to the site it serves up the cached page instead of building the page all over again. You can set this to cache for 24 hours or so. I'm sure other server languages have similar features. Basically this would be the same as your once-a-day Perl method, but kept within a web framework.
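The 24-hour caching idea is not tied to ASP.NET. A rough sketch in Python, assuming a hypothetical fetch callable that does the slow download and parsing; CachedFetcher and its parameter names are invented for illustration and are not part of any framework.

```python
import time


class CachedFetcher:
    """Cache the result of an expensive fetch for a fixed time-to-live."""

    def __init__(self, fetch, ttl_seconds=24 * 60 * 60, clock=time.time):
        self._fetch = fetch        # callable that does the slow work
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the cache can be tested
        self._value = None
        self._expires_at = None    # None means nothing is cached yet

    def get(self):
        """Return the cached value, refreshing it once the TTL has passed."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

Only the first visitor after the cache expires pays for the remote call; everyone else gets the stored copy, which is the same trade-off as the daily script.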
Another hacky solution would be to use an iframe and frame the content with javascript so that it only shows the content you want to show. Of course you'll have no control over the formatting (background, fonts) of the iframe and if the content gets bigger or changes position you'll have problems.

How to organize a PHP blog

So, currently I'm organizing my blog by filename: to create a post, I enter the name of the file. Rather than storing posts in the database, I store them in PHP files. Each time I create a post, a new row is created in the table with the filename and a unique ID. To reference the post (e.g. for comments), I get the name of the current file, then search the entries table for a matching filename. The post ID of the comment matches the ID of that post.
Obviously this isn't the standard way of organizing a blog, but I do it this way for a few reasons:
Clean URLs (even cleaner than mod_rewrite can provide, from what I've read)
I always have a hard copy of the post on my machine
Easier to remember the URL of a specific post (kind of a part of clean URLs)
Now I know that the standard way would be to store each post in the database. I know how to do this, but clean URLs are the main problem. So now to my questions:
Is there anything WRONG with the way I'm doing it now, or could any problems arise from it in the future?
Can the same level of clean URLs that I get now be achieved with mod_rewrite? If so, links are appreciated.
I will be hosting this on a web host. Do only certain web hosts provide access to the necessary files for mod_rewrite, or is it generally standard on all web hosts?
Thanks so much guys!
P.S. To be clear, I don't plan on using a blogging engine.
As cletus said, this is similar to Movable Type. There's nothing inherently wrong with storing your data in files.
One thing that comes to mind is: how much are you storing in the files? Just the post content, or does each PHP file contain a copy of the entire design of the page as opposed to using a base template? How difficult would it be to change the design later on? This may or may not be a problem.
What exactly are you looking for in terms of clean URLs? Rewrite rules are quite powerful and flexible. By using mod_rewrite in conjunction with a main PHP file that answers all requests, you can pretty much have any URL format you want, including user-friendly URLs without obscure ID numbers or even file extensions.
Edit:
Here is how it would work with mod_rewrite and a main PHP file that processes requests:
Web server passes all requests (e.g., /my-post-title) to, say, index.php
index.php parses the request path ("my-post-title")
index.php looks up "my-post-title" in the database's "slug" or "friendly name" (whatever you want to call it) column and locates the appropriate row that way
Retrieve the post from the database
Apply a template to the post data
Return the completed page to the client
This is essentially how systems like Drupal and WordPress work.
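The steps above can be sketched in a few lines. This is Python rather than PHP, purely to show the shape of a single entry point; the POSTS dict stands in for the database table with its slug column, and all names and sample data are assumptions for illustration.

```python
from html import escape
from string import Template
from urllib.parse import urlsplit

# Stand-in for a database table keyed by its "slug" column (sample data).
POSTS = {
    "my-post-title": {"title": "My Post Title", "body": "Hello, world."},
}

PAGE = Template("<html><head><title>$title</title></head>"
                "<body><h1>$title</h1><p>$body</p></body></html>")


def slug_from_request(request_uri):
    """Turn '/my-post-title/?x=1' into 'my-post-title'."""
    return urlsplit(request_uri).path.strip("/")


def handle(request_uri):
    """Return (status, html) for a request, as a single index.php would."""
    post = POSTS.get(slug_from_request(request_uri))
    if post is None:
        return 404, "<h1>Not found</h1>"
    html = PAGE.substitute(title=escape(post["title"]),
                           body=escape(post["body"]))
    return 200, html
```

The URL never exposes an ID or a file extension; the slug is the only key the visitor sees.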
Also, regarding how Movable Type works, it's been a while since I've used it so I might be wrong, but I believe it stores all posts in the database. When you hit the publish button, it generates plain HTML files by pulling post data from the database and inserting it into a template. This is incredibly efficient when your site is under heavy load - there are no scripts running when a visitor opens up your website, and the server can keep up with heavy visitation when it only needs to serve up static files.
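A publish step of that kind can be quite small. Here is a minimal sketch in Python of writing one static HTML file per post from a template; it is an illustration of the general idea under assumed names (publish, PAGE, the posts dict), not a description of how Movable Type is actually implemented.

```python
from html import escape
from pathlib import Path
from string import Template

PAGE = Template("<html><body><h1>$title</h1><p>$body</p></body></html>")


def publish(posts, out_dir):
    """Write one static HTML file per post; return the paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for slug, post in posts.items():
        target = out / f"{slug}.html"
        page = PAGE.substitute(title=escape(post["title"]),
                               body=escape(post["body"]))
        target.write_text(page, encoding="utf-8")
        written.append(target)
    return written
```

Changing the design then means editing the template and publishing again, rather than touching every post file.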
So obviously you've got a lot of options when figuring out how your solution should work. The one you proposed sounds fine, though you might want to give careful consideration to how you'll maintain a large number of posts in individual files, particularly if you want to change the design of the entire site later on. You might want to consider a templating engine like Smarty, and just store post data (no layout tags) in your individual files, for instance. Or just use some basic include() statements in your post files to suck in headers, footers, nav menus, etc.
What you're describing is kind of like how Movable Type works. The issues you'll need to cover are:
Syndication: RSS/Atom;
Sitemap: for Google;
Commenting; and
Tagging and filtering content.
It's not unreasonable not to use a database. If I were to do that I'd be using a templating engine like Smarty that does a better job of caching the results than PHP will out of the box.
