I'm looking for a simple way to scrape any webpage for the presence of certain keywords. I have a list of words such as {Apple, Banana, Pear, Pineapple} and a list of links. I need to search each page for my list of words and return which ones are present. For example, for a link:
http://www.xyz.com
I should search that page and return a vector of binary variables 0 1 1 0, where each binary variable corresponds to the presence or absence of the corresponding search key in the list. I am having trouble finding a way to search a webpage since I am new to PHP. What is the best way to scrape a webpage so I get back only the relevant text on the page (i.e. no HTML tags, CSS, or JavaScript metadata)? I have tried cURL and file_get_contents, but they returned fairly ugly representations of the webpage. Can anyone provide a snippet that returns the text on a page so I can search that returned text?
Thanks in advance!
One of the main examples of cURL not working is the page https://plus.google.com/107630561301274451844/about?gl=us&hl=en
I am trying to find the keyword IL on it, and it returns irrelevant text for me to search within.
Look into using something pre-built
This will do what you're looking for: http://simplehtmldom.sourceforge.net/
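If you would rather not pull in a library, here is a minimal sketch in plain PHP of the binary-vector idea from the question: strip the page down to visible text, then check each keyword. The function names are placeholders, and the fetched URL is replaced by an inline string so the sketch is self-contained; strip_tags() is a cruder tool than simplehtmldom for messy real-world markup.

```php
<?php
// Reduce a page to visible text: drop <script>/<style> blocks, then all tags.
function pageText(string $html): string {
    $html = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#si', ' ', $html);
    return html_entity_decode(strip_tags($html));
}

// One 0/1 entry per keyword, in order. stripos() is case-insensitive.
function keywordVector(string $text, array $keywords): array {
    $vector = [];
    foreach ($keywords as $word) {
        $vector[] = (stripos($text, $word) !== false) ? 1 : 0;
    }
    return $vector;
}

// In practice: $html = file_get_contents('http://www.xyz.com');
$html = '<html><script>var x=1;</script><body>Banana bread and Pear tart</body></html>';
$keywords = ['Apple', 'Banana', 'Pear', 'Pineapple'];
echo implode(' ', keywordVector(pageText($html), $keywords)); // prints: 0 1 1 0
```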
Related
I am working on a content-oriented website and have to implement web search. I am thinking of auto-suggest search, like this:
How can it be done?
I want suggestions following the search term as in the image; I am using the LAMP stack.
Please suggest some methods to implement this.
Here are the steps:
1. Write PHP code that takes the search keywords and returns results in JSON format.
2. Create a form in HTML.
3. On every keystroke in the search box, take the search keywords and make an AJAX request to the search code you made in step 1.
4. Display the search response you received in JSON format.
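A minimal sketch of step 1: the candidate titles come from a plain array here so the example runs standalone; in production they would come from a prepared MySQL query (e.g. WHERE title LIKE 'term%'), and the AJAX call from step 3 would pass the typed prefix as a GET parameter.

```php
<?php
// Return up to $limit titles that start with the typed prefix, as the
// suggest endpoint would. Names here are placeholders for your own.
function suggest(array $titles, string $term, int $limit = 10): array {
    $matches = array_filter($titles, function ($t) use ($term) {
        return $term !== '' && stripos($t, $term) === 0; // prefix match
    });
    return array_slice(array_values($matches), 0, $limit);
}

$titles = ['PHP basics', 'PHP and MySQL', 'Python intro'];
// Step 3's AJAX request would supply the term via $_GET['term']:
echo json_encode(suggest($titles, 'php')); // ["PHP basics","PHP and MySQL"]
```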
http://www.bewebdeveloper.com/tutorial-about-autocomplete-using-php-mysql-and-jquery
To achieve this on your website, you need to know about AJAX and databases in PHP (or another server-side language). Then you can use full-text search in SQL to do the query. So:
PHP mysqli
AJAX
Full-text search (MATCH ... AGAINST)
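For the full-text piece, the query shape looks like the following. The table name pages and its FULLTEXT index on (title, body) are assumptions about your schema; the connection details are placeholders.

```php
<?php
// MATCH ... AGAINST requires a FULLTEXT index on the listed columns
// (e.g. ALTER TABLE pages ADD FULLTEXT(title, body)).
$sql = 'SELECT title FROM pages
        WHERE MATCH(title, body) AGAINST(? IN NATURAL LANGUAGE MODE)
        LIMIT 10';

// Wired up with mysqli:
// $db   = new mysqli('localhost', 'user', 'pass', 'mydb');
// $stmt = $db->prepare($sql);
// $stmt->bind_param('s', $searchTerm);
// $stmt->execute();
echo $sql;
```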
I want to search a website programmatically using PHP, the same way we search a website manually: enter a query in the search box, press search, and the results come out.
Suppose I want to search this website by the product names or model numbers that are stored in my CSV file.
If the product name or model number matches the website's data, then the result page should be displayed.
I looked at the questions below but was not able to implement them:
Creating a 'robot' to fill form with some pages in
Autofill a form of another website and send it
Please let me know how we can do this in PHP.
Thanks
You want to create a “crawler” for websites.
There are some things to consider first:
Your code will never be generic. Each site has its own structure, and you cannot assume anything (example: Craigslist “encodes” emails with a simple method)
You need to select an objective (Emails ? Items information ? Links ?)
PHP is by far one of the worst languages to do that.
I’ll suggest using C# and the Html Agility Pack library. It allows you to parse HTML pages as XML documents (so you can use XPath expressions and more to retrieve information).
It can surely be done in PHP, but I think it will take at least 10x the time in PHP compared to C#.
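That said, PHP's built-in DOMDocument and DOMXPath classes give you the same XPath workflow without any external library. A small illustrative sketch (the HTML and the XPath expression are made up for the example):

```php
<?php
$html = '<html><body><a href="/item/1">Widget</a><a href="/item/2">Gadget</a></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate real-world malformed HTML
$doc->loadHTML($html);
libxml_clear_errors();

// XPath query, as with Html Agility Pack: all links that have an href.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//a[@href]') as $a) {
    $links[$a->getAttribute('href')] = $a->textContent;
}
echo json_encode($links);
```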
I run my own game, and I can use PHP to get an updated value of how many users are online at the current time. I want to create an updating string of text that shows how many users are online. In-game, it's programmed to update the value every 20 seconds.
The problem is that my website can only use HTML, and that's about as far as it goes for how much customization I have. The other option is Flash, which I have zero clue on how to use.
The HTML doesn't seem to execute PHP inside of it, so I'm really unsure how to approach this.
I just need the HTML to grab the text output from a PHP URL on my website, basically the same way you use HTML to embed an image. It's 100% readable, and it's just a single string that I need to grab to show how many users are online. Is there any way to do this, or am I out of luck?
You can try using <iframe src="#url"></iframe> or <embed src="#url"></embed> if you want this to be done using only HTML.
I need to pull a section of text from an HTML page that is not on my local site and then have it parsed as a string; specifically, the last column from this page. I assume I would have to copy the page source into a variable and then set up a regex to navigate to that table row. Is that the most efficient way of doing it? Which PHP functions would that entail?
Scrape the page HTML with file_get_contents() (requires the ini setting allow_url_fopen to be true) or a system tool like curl or wget.
Run a regular expression to match the desired part. You could just match any <td>s in this case, as these values are the first occurrences of table cells, e.g. preg_match("/<td.*?>(.*?)<\/td>/si", $html, $matches); (not tested)
If you can use URL fopen, then a simple file_get_contents('http://somesite.com/somepage') would suffice. There are various libraries out there to do web scraping, which is the name for what you're trying to do. They might be more flexible than a bunch of regular expressions (regexes are known for having a tough time parsing complicated HTML/XML).
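As a concrete sketch of the fetch-then-match steps above (the sample markup stands in for the remote page, and as noted, a DOM parser is more robust than regex for messy real-world HTML):

```php
<?php
// In practice: $html = file_get_contents('http://somesite.com/somepage');
$html = '<table>
           <tr><td>Name</td><td>42</td></tr>
           <tr><td>Other</td><td>77</td></tr>
         </table>';

// Last cell of each row: a <td> whose content has no tags ([^<]*),
// followed only by whitespace before the closing </tr>.
preg_match_all('#<td[^>]*>([^<]*)</td>\s*</tr>#si', $html, $matches);
echo implode(',', $matches[1]); // prints: 42,77
```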
I have an HTML site with around 100 HTML files, and I want to build a search engine for it. If the user types any word and presses search, I want to display the related content containing that keyword. Is it possible to do this without any server-side scripting? Can it be implemented using jQuery or JavaScript? Please let me know if you have any ideas!
Thanks in advance.
Possible? Yes. You can download all the files via AJAX, save their contents in an array of strings, and search the array.
The performance however would be dreadful. If you need full text search, then for any decent performance you will need a database and a special fulltext search engine.
Three means:
Series of AJAX indexing requests: very slow, not recommended
Use a DB to store key terms/page references and perform a fulltext search
Utilise off-the-shelf functionality, such as that offered by Google
The only way this can work is if you have a list of all the pages on the page you are searching from. So you could do this:
pages = new Array("page1.htm","page2.htm"...)
and so on. The problem with that is that to search for the results, the browser would need to do a GET request for every page:
for (var i in pages)
$.get(pages[i], function (result) { searchThisPage(result) });
Doing that 100 times would mean a long wait for the user. Another way I can think of is to have all the content served in an array:
pages = {
    "index": "Some content for the index",
    "second_page": "Some content for the second page",
    // etc.
}
Then each page could reference this one script to get all the content, include the content for itself in its own content section, and use the rest for searching. If you have a lot of data, this would be a lot to load in one go when the user first arrives at your site.
The final option I can think of is to use the Google search API: http://code.google.com/apis/customsearch/v1/overview.html
Quite simply - no.
Client-side javascript runs in the client's browser. The client does not have any way to know about the contents of the documents within your domain. If you want to do a search, you'll need to do it server-side and then return the appropriate HTML to the client.
The only way to technically do this client-side would be to send the client all the data about all of the documents and have them do the searching via some JS function. That is ridiculously inefficient; it's easier, lighter-weight, and more efficient to simply maintain a search database on the server (likely through a nicely-packaged third-party library) and use that.
Some useful resources:
http://johnmc.co/llum/how-to-build-search-into-your-site-with-jquery-and-yahoo/
http://tutorialzine.com/2010/09/google-powered-site-search-ajax-jquery/
http://plugins.jquery.com/project/gss
If your site is allowing search engine indexing, then fcalderan's approach is definitely the simplest approach.
If not, it is possible to generate a text file that serves as an index of the HTML files. This would probably be only rudimentarily successful, but it is possible. You could use something like the keywording in Toby Segaran's book to build a JSON text file. Then use jQuery to load the text file, find the instances of the keywords, dedupe the resulting filenames, and display the results.
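A rough sketch of building such an index server-side (or offline) in PHP; the file names and the 3-letter word cutoff are arbitrary choices for the example:

```php
<?php
// Map each word to the list of files it appears in, then emit JSON
// for jQuery to fetch and search client-side.
function buildIndex(array $pages): array {
    $index = [];
    foreach ($pages as $name => $html) {
        // In practice: $html = file_get_contents($name);
        $text = strtolower(strip_tags($html));
        preg_match_all('/[a-z]{3,}/', $text, $m);   // words of 3+ letters
        foreach (array_unique($m[0]) as $word) {
            $index[$word][] = $name;
        }
    }
    return $index;
}

$pages = [
    'page1.htm' => '<p>Apple pie recipe</p>',
    'page2.htm' => '<p>Apple and banana smoothie</p>',
];
// file_put_contents('index.json', json_encode(buildIndex($pages)));
echo json_encode(buildIndex($pages)['apple']); // ["page1.htm","page2.htm"]
```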