I need to pull a section of text from an HTML page that is not on my local site and have it parsed as a string; specifically, the last column from this page. I assume I would have to copy the source of the page into a variable and then set up a regex search to navigate to that table row. Is that the most efficient way of doing it? Which PHP functions would that entail?
Scrape the page HTML with file_get_contents() (the ini setting allow_url_fopen must be enabled) or with a system tool such as curl or wget.
Run a regular expression over it to match the desired part. In this case you could just match the <td> cells, e.g. preg_match("/<td.*?>(.*?)<\/td>/si", $html, $matches); (not tested). Note that preg_match() stops at the first match; use preg_match_all() if you need every cell.
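A minimal, untested sketch of those two steps, aimed at the question's last-column case (the URL is a placeholder, and the pattern assumes each row's final cell is immediately followed by its closing </tr>):

    <?php
    // Step 1: fetch the remote page (requires allow_url_fopen = On).
    $html = file_get_contents('http://example.com/somepage'); // placeholder URL

    // Step 2: capture the cell that directly precedes each closing </tr>,
    // i.e. the last column of every row.
    if (preg_match_all('/<td[^>]*>(.*?)<\/td>\s*<\/tr>/si', $html, $matches)) {
        foreach ($matches[1] as $cell) {
            echo trim(strip_tags($cell)), "\n";
        }
    }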
If you can use URL fopen, then a simple file_get_contents('http://somesite.com/somepage') would suffice. There are various libraries out there for web scraping, which is the name for what you're trying to do; they tend to be more flexible than a pile of regular expressions (regexes are notorious for struggling with complicated HTML/XML).
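For comparison, a rough sketch of the non-regex route using PHP's bundled DOM extension (the URL is the same placeholder, and it assumes the wanted data sits in the last <td> of each table row):

    <?php
    $dom = new DOMDocument();
    // @ suppresses warnings that real-world HTML usually triggers.
    @$dom->loadHTML(file_get_contents('http://somesite.com/somepage'));

    foreach ($dom->getElementsByTagName('tr') as $row) {
        $cells = $row->getElementsByTagName('td');
        if ($cells->length > 0) {
            // Last column of each row.
            echo trim($cells->item($cells->length - 1)->textContent), "\n";
        }
    }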
First question in a long while! I need to find any and all URLs in a string returned from a Facebook page request (I'm requesting the website of a page via the Graph API) and put the values into an array that I subsequently display in a DataTables js table.
Anyhow, I'm having issues: when I build the JSON data for the DataTable, it breaks in some cases:
http://socialinsightlab.com/datatable_fpages.json
The issue is the website field containing erroneous characters, structure, whitespace, etc.
Anyhow, I found the perfect regex for finding all websites in the field (there can be more than one website listed in the return).
The regex is
(?i)\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}\/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))
When I try to assign it to a PHP variable for use with preg_match_all, I can't: the variable won't accept the regex string, presumably because the pattern contains quotes.
So my question is: how can I extract only the URLs found in the website field and assign them to a variable so I can add them to the DataTable?
I need to be able to just return websites and nothing more.
Any ideas?
Thanks
Jonathan
This regex is specifically made as a solution to this problem:
(?:https?:\/\/|www)[^"\s]+
If you don't want to deal with all this quote escaping, you can do the following:
Save regex to a file, say, regex.txt.
Read this file into variable and trim: $regex = trim(file_get_contents("regex.txt"));
Use it with preg_match() / preg_match_all() etc., as in the sketch below.
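Putting the steps together, a sketch under two assumptions: $websiteField stands in for the Graph API website value, and a ~ delimiter is added because PCRE requires one and ~ does not occur in the pattern:

    <?php
    // Read the pattern verbatim; trim() strips the trailing newline.
    $regex = trim(file_get_contents('regex.txt'));

    // Wrap it in ~ delimiters, since the pattern itself contains slashes.
    if (preg_match_all('~' . $regex . '~', $websiteField, $matches)) {
        $urls = $matches[0]; // every URL found in the website field
        print_r($urls);
    }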
I'm looking for a simple way to scrape any webpage for the presence of certain keywords. I have a list of words such as {Apple, Banana, Pear, Pineapple} and a list of links. I need to search each page for my list of words and return which ones are present at each link. For example, for a link:
http://www.xyz.com
I should search that page and return a vector of binary variables 0 1 1 0, where each binary variable corresponds to the presence or absence of the corresponding search key in the list. I am having trouble finding a way to search a webpage since I am new to PHP. What is the best way to scrape a webpage and get back only the relevant text on the page (i.e. no HTML tags, CSS, JavaScript metadata, etc.)? I have tried curl and file_get_contents, but they returned pretty ugly representations of the webpage. Can anyone provide a snippet that returns the text on a page so I can search that returned text?
Thanks in advance!
One of the main examples of curl not working is the page https://plus.google.com/107630561301274451844/about?gl=us&hl=en
I am trying to find the keyword IL on it, and it returns irrelevant text for me to search within.
Look into using something pre-built
This will do what you're looking for: http://simplehtmldom.sourceforge.net/
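If you would rather not add a library, a minimal sketch with PHP's built-in DOM extension can produce the binary vector from the question. The script/style stripping is my addition, and stripos() does substring matching, so a short keyword like "IL" may need a \b word-boundary regex instead:

    <?php
    function keywordVector($url, array $keywords) {
        $dom = new DOMDocument();
        @$dom->loadHTML(file_get_contents($url)); // @ silences warnings on messy HTML

        // Remove script and style elements so only visible text remains.
        foreach (array('script', 'style') as $tag) {
            $nodes = $dom->getElementsByTagName($tag);
            while ($nodes->length > 0) {
                $nodes->item(0)->parentNode->removeChild($nodes->item(0));
            }
        }

        $text = $dom->textContent;
        $vector = array();
        foreach ($keywords as $word) {
            $vector[] = (stripos($text, $word) !== false) ? 1 : 0;
        }
        return $vector;
    }

    print_r(keywordVector('http://www.xyz.com', array('Apple', 'Banana', 'Pear', 'Pineapple')));

(Pages such as the Google+ one build much of their content with JavaScript after load, which is likely why curl returns text that looks irrelevant; no server-side HTML parser will see that content.)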
I want to know how to find a number on a remote website and make it a variable.
For example, if I want to find the stock quote for "AMZN", I would use curl or file_get_contents on the page "http://stock-quotes.com/AMZN" and store the result in a string variable called $contents.
Now that I have $contents, how would I find the AMZN quote? I was thinking of using a regular expression to narrow down the line, e.g. finding "AMZN=35 points", and then performing another operation to delete the "AMZN=" and " points" from the start and end of the string so that "35" is all that's left.
Is that how people do it?
1.) DOM (DOMDocument)
2.) SimpleXML
3.) preg_match() (see the sketch below)
4.) strpos()
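For option 3, a small sketch against the hypothetical "AMZN=35 points" format from the question; the capturing group extracts the number in one step, so there is no separate pass to delete "AMZN=" and " points":

    <?php
    $contents = file_get_contents('http://stock-quotes.com/AMZN'); // hypothetical URL

    // The parentheses capture just the number.
    if (preg_match('/AMZN=(\d+(?:\.\d+)?) points/', $contents, $m)) {
        $quote = $m[1]; // "35"
        echo $quote;
    }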
What I've always done (say, in spidering, etc.) is use the simple_html_dom library in PHP and then inspect the markup of the site.
The downside, as mentioned before, is that if the markup changes you'll need to modify your code, but that's usually fairly easy, and if you use a source with informative markup (consistent class names on the elements you need, etc.), it's even easier.
Library link: http://simplehtmldom.sourceforge.net/
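Its usage is roughly as follows; the URL is a placeholder and the CSS-style selector is a guess, so inspect the real markup and adjust:

    <?php
    include 'simple_html_dom.php';

    $html = file_get_html('http://stock-quotes.com/AMZN'); // placeholder URL

    // find() takes CSS-like selectors; 'span.price' is an assumed class name.
    foreach ($html->find('span.price') as $element) {
        echo trim($element->plaintext), "\n";
    }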
I have a static HTML site with around 100 HTML files. I want to build a search feature: when the user types a word and hits search, the related content containing that keyword should be displayed. Is it possible to do this without any server-side scripting? And can it be implemented with jQuery or plain JavaScript? Please let me know if you have any ideas!
Thanks in advance.
Possible? Yes. You could download all the files via AJAX, save their contents in an array of strings, and search the array.
The performance, however, would be dreadful. If you need full-text search, then for any decent performance you will need a database and a dedicated full-text search engine.
Three approaches:
A series of AJAX indexing requests: very slow, not recommended
Use a DB to store key terms/page references and perform a full-text search (sketched below)
Use off-the-shelf functionality, such as that offered by Google
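A sketch of the second option (all names here are invented; it assumes MySQL with a FULLTEXT index on the content column):

    <?php
    // Assumes: ALTER TABLE pages ADD FULLTEXT(content);
    $pdo = new PDO('mysql:host=localhost;dbname=site', 'user', 'pass');

    $stmt = $pdo->prepare(
        'SELECT url, title FROM pages
         WHERE MATCH(content) AGAINST (:q IN NATURAL LANGUAGE MODE)'
    );
    $stmt->execute(array(':q' => $_GET['q']));

    foreach ($stmt as $row) {
        echo $row['url'], ' - ', $row['title'], "\n";
    }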
The only way this can work is if you have a list of all the pages on the page you are searching from. So you could do this:
var pages = ["page1.htm", "page2.htm", /* ... */];
and so on. The problem with that is that to search for the results, the browser would need to issue a GET request for every page:
for (var i = 0; i < pages.length; i++) {
    $.get(pages[i], function (result) { searchThisPage(result); });
}
Doing that 100 times would mean a long wait for the user. Another way I can think of is to have all the content served in an array:
var pages = {
    "index"       : "Some content for the index",
    "second_page" : "Some content for the second page"
    // etc.
};
Then each page could reference this one script to get all the content, include the content for itself in its own content section, and use the rest for searching. If you have a lot of data, this would be a lot to load in one go when the user first arrives at your site.
The final option I can think of is to use the Google search API: http://code.google.com/apis/customsearch/v1/overview.html
Quite simply - no.
Client-side JavaScript runs in the client's browser. The client has no way to know about the contents of the other documents within your domain. If you want to do a search, you'll need to do it server-side and then return the appropriate HTML to the client.
The only way to technically do this client-side would be to send the client all the data about all of the documents and have it do the searching via some JS function. That is ridiculously inefficient; it's easier, lighter-weight, and more efficient to simply maintain a search database on the server (likely through some nicely packaged third-party library) and use that.
Some useful resources:
http://johnmc.co/llum/how-to-build-search-into-your-site-with-jquery-and-yahoo/
http://tutorialzine.com/2010/09/google-powered-site-search-ajax-jquery/
http://plugins.jquery.com/project/gss
If your site is allowing search engine indexing, then fcalderan's approach is definitely the simplest approach.
If not, it is possible to generate a text file that serves as an index of the HTML files. This would be only rudimentarily successful, but it is possible. You could use something like the keywording approach in Toby Segaran's book to build a JSON text file, then use jQuery to load the text file, find the instances of the keywords, de-duplicate the resulting filenames, and display the results.
Which method is the most efficient for translating bunches of text/web pages that include HTML? I want to translate the text but keep the HTML.
Also, should I keep the words in a database or an array?
When you say "translating", do you mean from one language to another? If so, you can use regular expressions to capture the data between the opening and closing tags of your HTML without losing the markup. I'm not sure, however, why you would want to store your data in a database unless you were going to retrieve it at a later point.
If this is translation on the fly, it will always be faster to keep your data in memory, in your array, or simply to update the HTML while you loop through the data and eliminate the need for an array altogether.
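A sketch of that capture-the-text-keep-the-markup idea: preg_split() with PREG_SPLIT_DELIM_CAPTURE separates tags from text runs, a stand-in dictionary lookup plays the role of the real translator, and the tags are reassembled untouched:

    <?php
    // Stand-in translator: a word map instead of a real translation service.
    function translateText($text) {
        $dictionary = array('Hello' => 'Bonjour', 'world' => 'monde');
        return strtr($text, $dictionary);
    }

    $html = '<p>Hello <b>world</b></p>';

    // Split into tags and text runs; DELIM_CAPTURE keeps the tags in the result.
    $parts = preg_split('/(<[^>]+>)/', $html, -1,
                        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    foreach ($parts as &$part) {
        if ($part[0] !== '<') {   // a text run, not markup
            $part = translateText($part);
        }
    }
    unset($part);

    echo implode('', $parts);     // <p>Bonjour <b>monde</b></p>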