Find a match, then grab some HTML before and after it - PHP

In PHP, I am scraping some HTML from one of my other external sites. I'm performing the scrape and getting all of the page HTML into a PHP string. I need to find the first .png file in this string. From that match, I need to work backwards to the "http" that begins the URL AND forwards to just before the characters "\u002522" that follow it. Any ideas?
So:
<html><head><title>Hello</title></head><body><p>Here's a nice image</p><img src="http://www.exampleurl.com/image.png?id=35435646&v=5647\\u002522"/></body></html>
Would turn into:
http://www.exampleurl.com/image.png?id=35435646&v=5647
I've looked everywhere for a way to combine all these things at once, but with no luck :(

I have used this before and it worked great for me. How to extract img src, title and alt from html using php?
Then just clean up the URL and split on //.
Let me know if I need to be more specific.
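A single regular expression can combine these steps. Here is a minimal sketch against the sample markup above, assuming the URL always starts with "http" and ends just before the backslash that begins the \u002522 escape:

```php
<?php
// Sample HTML as in the question; '\\\\' in a single-quoted string
// yields the literal two backslashes shown in the scraped source.
$html = '<html><head><title>Hello</title></head><body>'
      . '<p>Here\'s a nice image</p>'
      . '<img src="http://www.exampleurl.com/image.png?id=35435646&v=5647\\\\u002522"/>'
      . '</body></html>';

// Match from "http" through ".png" and any query string, stopping
// before the first backslash (the start of the \u002522 escape).
if (preg_match('~(https?://[^"]*?\.png[^"\\\\]*)~i', $html, $m)) {
    echo $m[1]; // http://www.exampleurl.com/image.png?id=35435646&v=5647
}
```

If the terminator varies, the linked DOM approach plus a string cleanup on the src attribute is more robust than a one-shot regex.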

Related

Parse HTML and replace content in DIV

I want to know how I can find a DIV tag in an HTML page. This is because I want to replace the links inside that DIV with different links. I do not understand what exact code I require.
First, note that PHP won't do anything client-side, but you probably already know that.
You should use file_get_contents to read the web page into a string (or whatever your HTML-parsing library provides).
There is already a question that explains the options for parsing HTML: Robust and Mature HTML Parser for PHP
If that doesn't fit your needs, try searching Google for "php html parsing"; I found several libraries that way.
For example, this library lets you find all tags: http://simplehtmldom.sourceforge.net/
Note that this is not a great approach overall; I suggest you change your HTML page into a PHP page and insert some code in place of the A tags. That will make everything easier.
Finally, if the HTML page is static (it doesn't change), you can easily use line counting: read the contents from line X to line Y, output your customized A tags, and then read from line J to the end of the file.
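A minimal sketch of the DOM approach, using PHP's built-in DOMDocument; the id "menu" and the replacement URL here are invented for illustration:

```php
<?php
// Sketch: locate a <div> by id and rewrite every link inside it.
// The id "menu" and the new URL are made-up placeholders.
$html = '<html><body><div id="menu">'
      . '<a href="http://old.example.com/page">A link</a>'
      . '</div></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html); // the HTML parser registers id attributes for getElementById

$div = $doc->getElementById('menu');
foreach ($div->getElementsByTagName('a') as $a) {
    $a->setAttribute('href', 'http://new.example.com/page');
}
echo $doc->saveHTML();
```

If the div has no id, you can walk getElementsByTagName('div') and match on class or position instead.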
Good luck anyway.

Short snippet summarizing a webpage?

Is there a clean way of grabbing the first few lines of a given link that summarizes that link? I have seen this being done in some online bookmarking applications but have no clue on how they were implemented. For instance, if I give this link, I should be able to get a summary which is roughly like:
I'll admit it, I was intimidated by MapReduce. I'd tried to read explanations of it, but even the wonderful Joel Spolsky left me scratching my head. So I plowed ahead trying to build decent pipelines to process massive amounts of data
Nothing complex at first sight, but grabbing these is the challenging part. Just the first few lines of the actual post should be fine. Should I just use a raw approach of grabbing the entire HTML and parsing the meta tags, or something fancy like that (which obviously and unfortunately is not generalizable to every link out there), or is there a smarter way to achieve this? Any suggestions?
Update:
I just found InstaPaper do this but am not sure if it is getting the information from RSS feeds or some other way.
Well, first of all I would suggest you use PHP with a DOM parser class; this will make it a lot easier to get the tag contents you need.
// simplehtmldom must be included first
include 'simple_html_dom.php';

// Get HTML from URL or file
$html = file_get_html('http://www.google.com/');
// Find all paragraphs
$paragraphs = $html->find('p');
// Echo the first paragraph
echo $paragraphs[0];
The problem is that a lot of sites have poorly structured HTML, and some are built on tables. The key to getting around this is deciding which tags you will consider the website description. I would try to get the meta description tag; if it does not exist, look for the first paragraph.
Your best bet is to pull from the meta description tag. Most blog platforms will stuff the user- or system-provided excerpt of the post in there, as will a lot of CMS platforms. Then, if that meta tag isn't present, I would just fall back to the title or pick a paragraph of appropriate depth.
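The fallback chain described above can be sketched with the built-in DOMDocument; the sample markup and function name are invented for illustration:

```php
<?php
// Sketch: prefer the <meta name="description"> content and fall back
// to the first paragraph when no such tag exists.
function summarize($html) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // @ silences warnings from messy real-world HTML

    // First choice: the meta description tag.
    foreach ($doc->getElementsByTagName('meta') as $meta) {
        if (strtolower($meta->getAttribute('name')) === 'description') {
            return $meta->getAttribute('content');
        }
    }

    // Fallback: the text of the first paragraph.
    $paragraphs = $doc->getElementsByTagName('p');
    return $paragraphs->length ? trim($paragraphs->item(0)->textContent) : '';
}

echo summarize('<head><meta name="description" content="A post about MapReduce."></head>');
echo summarize('<body><p>I plowed ahead trying to build decent pipelines.</p></body>');
```

A real implementation would also truncate long descriptions and skip boilerplate paragraphs (navigation, cookie notices).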

Read external HTML page and then find data within

I'm playing around with an idea, and I'm stuck at this one part. I want to read an external HTML page and then extract the data held within two <dd> tags. I've been using file_get_contents with good results, but I'm at a loss as to how to accomplish that last part. The two tags I want to extract the value from are always enclosed within a particular <div>; I was wondering if that might help?
In my mind it reads the entire HTML file into a string, then dumps all the data up until this one particular <div>, and dumps all the data after the closing </div>. Is that possible? I think this needs regex syntax, which I've never used yet. So any tips, links, or examples would be great! I can provide more info as necessary.
Maybe this could help:
http://simplehtmldom.sourceforge.net/
You are complicating this way too much. Simply load the page content and then search it with the proper regex (preg_match()). Something like this will do fine:
preg_match('~<tag id="foobar">(?P<content>.*?)</endtag>~is', $input, $matches);
If you use HTQL COM to query the page, the query is: <dd>1:tx
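Applied to the <dd>-inside-a-<div> case described above, a two-step regex works; the id "stats" and the sample markup are invented placeholders:

```php
<?php
// Sketch: isolate one particular <div>, then grab the <dd> values inside it.
$html = '<div id="stats"><dl>'
      . '<dt>Height</dt><dd>10m</dd>'
      . '<dt>Width</dt><dd>4m</dd>'
      . '</dl></div>';

// Step 1: capture the contents of the target <div>.
// Step 2: capture every <dd>...</dd> within that slice.
if (preg_match('~<div id="stats">(.*?)</div>~is', $html, $div)
    && preg_match_all('~<dd>(.*?)</dd>~is', $div[1], $dd)) {
    print_r($dd[1]); // $dd[1] === array('10m', '4m')
}
```

Note the non-greedy `.*?` on the div capture assumes no nested divs; with nesting, a DOM parser is the safer tool.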

Extract all text from a HTML page without losing context

For a translation program I am trying to get 95%-accurate text from an HTML file in order to translate the sentences and links.
For example:
<div>Overflow <span>Texts <b>go</b> here</span></div>
Should give me 2 results to translate:
Overflow
Texts <b>go</b> here
Any suggestions or commercial packages available for this problem?
I'm not exactly sure what you're asking, but look at simplehtmldom. Specifically, see the "Extract Contents from HTML" tab under Quick Start on the front page (can't link to it directly, sigh). With that you can extract the text of a website without all those pesky tags.
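For reference, the built-in DOM gives a similar result to simplehtmldom's ->plaintext; note this flattens all tags, so keeping inline tags like <b> (as the question asks) would require walking child nodes yourself:

```php
<?php
// Sketch: strip all tags but keep the text, using the built-in DOM.
$doc = new DOMDocument();
$doc->loadHTML('<div>Overflow <span>Texts <b>go</b> here</span></div>');
echo $doc->textContent; // all text nodes, concatenated in document order
```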

Replicate Digg's Image-Suggestions from Submitted URL with PHP

So I'm looking for ideas on how to best replicate the functionality seen on Digg. Essentially, you submit the URL of a page of interest; Digg then crawls the DOM to find all of the IMG tags (likely selecting only those above a certain height/width), creates thumbnails from them, and asks you which one you would like to represent your submission.
While there's a lot going on there, I'm mainly interested in the best method to retrieve the images from the submitted page.
While you could try to parse the web page fully, HTML can be such a mess that you would be best off with something close but imperfect:
Extract everything that looks like an image tag reference.
Try to fetch the URL.
Check if you got an image back.
Just looking for and capturing the content of src="..." would get you there. Some basic manipulation to deal with relative vs. absolute image references and you're there.
Obviously anytime you fetch a web asset on demand from a third party you need to take care you aren't being abused.
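The src-capture and relative-vs-absolute handling above can be sketched as follows; the base URL and sample markup are assumed example values:

```php
<?php
// Sketch: capture every src="..." value, then resolve relative
// references against the submitted page's base URL.
$base = 'http://example.com/articles/';
$html = '<img src="/logo.png"><img src="photo.jpg">'
      . '<img src="http://cdn.example.com/a.png">';

preg_match_all('~<img[^>]+src=["\']([^"\']+)["\']~i', $html, $m);

$parts = parse_url($base);
$urls  = array();
foreach ($m[1] as $src) {
    if (preg_match('~^https?://~i', $src)) {
        $urls[] = $src;                                             // already absolute
    } elseif ($src[0] === '/') {
        $urls[] = $parts['scheme'] . '://' . $parts['host'] . $src; // root-relative
    } else {
        $urls[] = $base . $src;                                     // document-relative
    }
}
print_r($urls);
```

Each resolved URL would then be fetched (e.g. with cURL) and checked, for instance with getimagesize(), to confirm it really is an image of usable dimensions.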
I suggest cURL + regexp.
You can also use PHP Simple HTML DOM Parser which will help you search all the image tags.
