Removing element from wordreference API in PHP - php

How can I remove the elements I want (for example, the search box) from the page returned by the link below, using PHP?
Link : http://api.wordreference.com/38159/enes/hello
NOTE: in my site I first request what you see in the link above and paste it in a div, if this info is of any use

To search an HTML tree for certain elements, which is what you seem to want, use DOMXPath, which searches the tree of an XML document.
Have a look at: http://www.php.net/manual/en/domxpath.query.php
Remember however that searching an XML tree is not the cheapest of operations, so use it with caution.
Writing XPath queries can be tricky, but Firebug can help you form them: in Firebug, select the element you want and hover over the "CSS path" that Firebug shows; the tooltip contains the XPath for the same element.
Well, assuming that you want, for example, to get the content, try:
$word = "hello";
$doc = new DOMDocument;
// We don't want to bother with white spaces.
$doc->preserveWhiteSpace = false;
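// preg_replace is used for security: everything except word characters (letters, digits, underscore) and spaces is removed.
$url = "http://api.wordreference.com/38159/enes/" . preg_replace( "/[^\w ]/", "", $word );
// Assuming the response is HTML rather than well-formed XML, use the HTML loader; load() expects XML.
$doc->loadHTMLFile( $url );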
$xpath = new DOMXPath($doc);
// We start from the root element.
$entries = $xpath -> query( "//html/body/div/div[3]" );
// do some stuff with the elements found...
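The snippet above only finds nodes; since the question asks about removing them, here is a minimal sketch of the removal step. The markup, the id "search" and the helper name removeMatching() are made up for illustration, so adjust the XPath query to whatever the real page contains.

```php
<?php
// Hypothetical markup standing in for the fetched page.
$html = '<html><body><div id="search"><input name="q"></div>'
      . '<div id="content"><p>hello = hola</p></div></body></html>';

/**
 * Remove every node matched by $query and return the remaining HTML.
 * removeMatching() is an illustrative helper, not part of the DOM extension.
 */
function removeMatching(string $html, string $query): string
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Copy the matches into an array before modifying the tree.
    $nodes = iterator_to_array($xpath->query($query));
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }
    return $doc->saveHTML();
}

echo removeMatching($html, '//div[@id="search"]');
```

The returned string is the whole serialized document, so you may still want to pick out only the part you paste into your div.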

Related

preg_match - find forms that contain word A and not B

I'm using PHP preg_match to find and parse specific forms on page. Until now my code worked well but I found that some unneeded forms now contain the word that I actually need, so these unneeded forms are parsed too. Put simply, I need to parse all forms that contain word "findme" and NOT contain word "ignoreme" in URL.
Here's my preg_match:
<form action=("|\'?)([0-9a-zA-Z:\/\._~\-\?=]+)findme(.*?)\/form>
Unfortunately, if form URL is like /some_url/ignoreme/findme/whatever, the code still parses it, which I don't want. How should I modify the code?
$dom = new DOMDocument;
$dom->loadHTML($yourHTML);
$xpath = new DOMXPath($dom);
$formNodeList = $xpath->query('//form[contains(@action, "findme") and not(contains(@action, "ignoreme"))]');
foreach($formNodeList as $formNode) {
// do what you want with the DOMNode (see php manual)
}
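To see the XPath filter in action, here is a small self-contained sketch. The three forms and the helper name matchingFormActions() are invented for the example; only the first form should be selected.

```php
<?php
// Hypothetical page with three forms.
$html = '<html><body>'
      . '<form action="/some_url/findme/whatever"></form>'
      . '<form action="/some_url/ignoreme/findme/whatever"></form>'
      . '<form action="/other"></form>'
      . '</body></html>';

/**
 * Return the action attribute of every form whose action contains
 * "findme" but not "ignoreme". Illustrative helper only.
 */
function matchingFormActions(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $query = '//form[contains(@action, "findme") and not(contains(@action, "ignoreme"))]';

    $actions = [];
    foreach ($xpath->query($query) as $form) {
        $actions[] = $form->getAttribute('action');
    }
    return $actions;
}

print_r(matchingFormActions($html));
```

Note that contains() is case-sensitive, so an action containing "FindMe" would not be picked up by this query.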

php regex selecting url from html source

I'm new to stackoverflow and from South Korea.
I'm having difficulties with regex with php.
I want to select all the urls from user submitted html source.
The restrictions I want are as follows.
Select URLs EXCEPT:
URLs that are within tags. For example, if the HTML source is like below,
<a href="http://aaa.com">http://aaa.com</a>
neither http://aaa.com should be selected.
URLs right after " or =
Here is my current regex stage.
/(?<![\"=])https?\:\/\/[^\"\s<>]+/i
but with this regex, I can't achieve the first rule.
I tried to add negative lookahead at the end of my current regex like
/(?<![\"=])https?\:\/\/[^<>\"\s]+(?!<\/a>)/i
It still chooses the second url in the a tag like below.
http://aaa.co
We don't have a developer Q&A community like Stack Overflow in Korea, so I really hope someone can help with this simple-looking regex issue!
Seriously consider using PHP's DOMDocument class. It does reliable HTML parsing. Doing this with regular expressions is error prone, more work, and slower.
The DOM works just like in the browser and you can use getElementsByTagName to get all links.
I got your use case working with this code using the DOM (try it here: http://3v4l.org/5IFof):
<?php
$html = <<<HTML
<a href="http://aaa.com">http://aaa.com</a>
HTML;
$dom = new DOMDocument();
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $link) {
var_dump($link->getAttribute('href'));
// Output: http://aaa.com
}
Don't use regex. Use the DOM.
$html = '<a href="http://aaa.com">http://aaa.com</a>';
$dom = new DOMDocument;
$dom->loadHTML($html);
foreach ($dom->getElementsByTagName('a') as $a) {
if($a->hasAttribute('href')){
echo $a->getAttribute('href');
}
//$a->nodeValue; // If you want the text in <a> tag
}
Seeing as you're not trying to extract urls that are the href attribute of an a node, you'll want to start by getting the actual text content of the dom. This can be easily done like so:
$dom = new DOMDocument;
$dom->loadHTML($htmlString);
$root = $dom->getElementsByTagName('body')->item(0); // get the outer tag; for a full document, this is body
$text = $root->textContent; // no tags, no attributes, just the text
An alternative approach would be this:
$text = strip_tags($htmlString); // gets rid of the markup
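Both approaches above keep the link text of a tags, so a URL used as link text would still show up in $text. One possible way around that, sketched below, is to drop the a elements first and only then run a regex over the remaining text. The helper name urlsOutsideLinks() and the sample markup are my own assumptions, and the URL pattern is deliberately simple (it would, for instance, swallow a trailing full stop).

```php
<?php
/**
 * Return URLs that appear in the text of $html but not inside <a> elements.
 * Illustrative helper only.
 */
function urlsOutsideLinks(string $html): array
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    // Drop every <a> element so neither its href nor its link text remains.
    // The node list is live, so walk it backwards while removing.
    $links = $dom->getElementsByTagName('a');
    for ($i = $links->length - 1; $i >= 0; $i--) {
        $link = $links->item($i);
        $link->parentNode->removeChild($link);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    $text = $body ? $body->textContent : '';

    // Plain-text URLs only; attributes are already gone at this point.
    preg_match_all('~https?://[^\s<>"]+~i', $text, $matches);
    return $matches[0];
}

// Hypothetical input: one bare URL and one URL inside a link.
$html = '<p>See http://bbb.com/page for details. '
      . '<a href="http://aaa.com">http://aaa.com</a></p>';
print_r(urlsOutsideLinks($html));
```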

RegExp: Finding all links on page w/ nofollow

I'm trying to write a RegEx which finds all links on a webpage with the rel="nofollow" attribute. Mind you, I'm a RegEx newb so please don't be too harsh on me :)
This is what I got so far:
$link = "/<a href=\"([^\"]*)\" rel=\"nofollow\">(.*)<\/a>/iU";
Obviously this is very flawed. Any link with any other attribute or styled a little differently (single quotes) won't be matched.
You should really use a DOM parser for this purpose, as any regex-based solution will be error-prone for this kind of HTML parsing. Consider code like this:
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your html
$xpath = new DOMXPath($doc);
// returns a list of all links with rel=nofollow
$nlist = $xpath->query("//a[@rel='nofollow']");
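An exact match on the rel attribute misses links such as rel="external nofollow". As a possible refinement, the sketch below uses the common XPath idiom for matching one token in a space-separated attribute and then loops over the result. The helper name nofollowLinks() and the sample links are made up; the match is still case-sensitive, so rel="NoFollow" would be missed.

```php
<?php
/**
 * Return href and text for every link whose rel attribute
 * contains the token "nofollow". Illustrative helper only.
 */
function nofollowLinks(string $html): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Pad rel with spaces so "nofollow" is matched as a whole token.
    $query = "//a[contains(concat(' ', normalize-space(@rel), ' '), ' nofollow ')]";

    $result = [];
    foreach ($xpath->query($query) as $a) {
        $result[] = [
            'href' => $a->getAttribute('href'),
            'text' => trim($a->textContent),
        ];
    }
    return $result;
}

// Hypothetical links: different quoting, attribute order and extra rel tokens.
$html = '<a href="/a" rel="nofollow">A</a>'
      . "<a rel='external nofollow' class='x' href='/b'>B</a>"
      . '<a href="/c">C</a>';
print_r(nofollowLinks($html));
```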
Try this:
$link = "/<(a)[^>]*rel\s*=\s*(['\"])nofollow\\2[^>]*>(.*?)<\/\\1>/i";

how would I screen scrape a page like this using file_get_contents and preg_match?

I have a page with many HTML lines like this:
<ul><li><a href='a_silly_link_that_changes_each_line.php'>the_content_i_need</a></li></ul>
Now as you can see, there's a link in that line, which unfortunately changes on each line.
So I need a way to scrape the content in that line, without letting the link get in the way.
I've also tried to scrape like this: .php'>(*.)</a></li></ul> but that's no good, as it returns a lot of unwanted content.
Also, because there are many lines on the page that I need to take the content from, could I just loop through, somehow?
I'm using preg_match and file_get_contents but am open to other suggestions. :)
From: PHP Parse HTML code
Use something like:
$str = '<ul><li><a src="test.html">linky</a></li></ul>';
$DOM = new DOMDocument;
$DOM->loadHTML($str);
$items = $DOM->getElementsByTagName('ul');
for($i =0;$i<$items->length;$i++){
$ul = $items->item($i);
$li=$ul->firstChild;
if($li->nodeName=='li' && $li->firstChild->nodeName=='a'){
//do something with $li->firstChild->nodeValue
}
}
In this case, $li->firstChild->nodeValue will be linky.
That should do it :)
Try using
$matches = array();
preg_match_all( "~\\.php'>(.*?)</a></li></ul>~", file_get_contents( $filename ), $matches, PREG_SET_ORDER );
This will match all the links in your file. The quantifier *? means "match zero or more characters, but as few as possible" (a lazy quantifier), so you won't get any unwanted content.
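To make the difference between the lazy and the greedy quantifier concrete, here is a small sketch that runs both patterns over the same input. The two-link sample string and the helper name captureLinkTexts() are assumptions for the example; the real page would come from file_get_contents.

```php
<?php
// Hypothetical input: two list entries on a single line.
$html = "<ul><li><a href='one.php'>first</a></li></ul>"
      . "<ul><li><a href='two.php'>second</a></li></ul>";

/**
 * Run $pattern over $html and return what the first capture group matched.
 * Illustrative helper only.
 */
function captureLinkTexts(string $pattern, string $html): array
{
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

// Lazy: each match stops at the nearest closing </a></li></ul>.
$lazy = captureLinkTexts("~\\.php'>(.*?)</a></li></ul>~", $html);

// Greedy: the single match runs on to the last closing </a></li></ul>.
$greedy = captureLinkTexts("~\\.php'>(.*)</a></li></ul>~", $html);

print_r($lazy);
print_r($greedy);
```

With the lazy pattern you get one capture per link; with the greedy one you get a single capture that swallows the markup between the first and the last link.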

Remove urls using PHP

I'd like to only remove the anchor tags and the actual urls.
For instance, test www.example.com would become test.
Thanks.
I often use:
$string = preg_replace("/<a[^>]+>/i", "", $string);
Also remember that strip_tags can remove all the tags from a string except the ones specified in a "white list". That's not what you want, but I mention it for completeness.
EDIT: I found the original source where I got that regex. I want to cite the author, for fairness: http://bavotasan.com/tutorials/using-php-to-remove-an-html-tag-from-a-string/
You should consider using PHP's DOM library for this job.
Regex is not the right tool for HTML parsing.
Here is an example:
// Create a new DOM Document to hold our webpage structure
$xml = new DOMDocument();
// Load the html's contents into DOM
$xml->loadHTML($html);
$links = $xml->getElementsByTagName('a');
// Loop through each <a> tag and replace it with its text content
for ($i = $links->length - 1; $i >= 0; $i--) {
$linkNode = $links->item($i);
$lnkText = $linkNode->textContent;
$newTxtNode = $xml->createTextNode($lnkText);
$linkNode->parentNode->replaceChild($newTxtNode, $linkNode);
}
Note:
It's important to loop in reverse here: the node list is live, so when replaceChild swaps in a new node with a different name from the old one, the old node is removed from the list, and a forward loop would leave some of the links unreplaced.
This code doesn't remove URLs from the text inside a node; you can apply the preg_replace from nico's answer to $lnkText before the createTextNode line. It's always better to isolate the parts you need from the HTML using the DOM, and then use regular expressions on those text-only parts.
To complement gd1's answer, this will remove all the URLs:
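Putting the two notes together, a minimal sketch of the combined approach could look like this. The helper name stripLinksAndUrls() and the sample input are assumptions, the URL pattern only covers addresses that start with www. (optionally preceded by http:// or https://), and the function returns plain text rather than HTML.

```php
<?php
/**
 * Replace each <a> element with its text, minus any www. URL in that text,
 * and return the resulting plain text. Illustrative helper only.
 */
function stripLinksAndUrls(string $html): string
{
    $dom = new DOMDocument();
    libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();

    $links = $dom->getElementsByTagName('a');
    // Walk backwards because the node list shrinks as links are replaced.
    for ($i = $links->length - 1; $i >= 0; $i--) {
        $link = $links->item($i);
        // Strip URLs from the link text before it becomes a text node.
        $text = preg_replace('~(?:https?://)?www\.[a-z0-9.\-]+~i', '', $link->textContent);
        $link->parentNode->replaceChild($dom->createTextNode($text), $link);
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    return $body ? trim($body->textContent) : '';
}

// Hypothetical input modelled on the example in the question.
echo stripLinksAndUrls('<p><a href="http://www.example.com">test www.example.com</a></p>');
```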
// http(s)://
$txt = preg_replace('|https?://www\.[a-z\.0-9]+|i', '', $txt);
// only www.
$txt = preg_replace('|www\.[a-z\.0-9]+|i', '', $txt);
