Use a regex to get text from html source code

I have PHP code that stores the HTML source of a site in a variable, and I want to extract just two links from that source.
The first link is in the content attribute of a meta tag:
<meta property="og:image" content="http://img.xxx.xx/vid/xxx/b7950d611f934f0eef95c1cd010348e3.jpg"/>
And the second is inside a script call:
jw.load([{ file: 'http://vrbx105.xxx.xx/U7yvQnLiA_m5mhE9MUHf3w/1477628604/vl107aeb2d7db53f91fc6ad2e76fe11e49.mp4', provider: 'http' }]);
I need to get only those two links; they change every time the page is reloaded:
http://img.xxx.xx/vid/xxx/b7950d611f934f0eef95c1cd010348e3.jpg
http://vrbx105.xxx.xx/U7yvQnLiA_m5mhE9MUHf3w/1477628604/vl107aeb2d7db53f91fc6ad2e76fe11e49.mp4

If you insist on a regex, here's one for the first link: https://regex101.com/r/CHpfDY/1
And here's the second: https://regex101.com/r/VVF0Gf/1
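Since the linked patterns are not reproduced here, the following is only a minimal sketch of what such expressions could look like. Both patterns are assumptions written against the two snippets in the question (attribute order, quoting, and the jw.load call shape), not the ones behind the regex101 links.

```php
<?php
// Sample markup modelled on the two snippets in the question.
$html = <<<'HTML'
<meta property="og:image" content="http://img.xxx.xx/vid/xxx/b7950d611f934f0eef95c1cd010348e3.jpg"/>
<script>
jw.load([{ file: 'http://vrbx105.xxx.xx/U7yvQnLiA_m5mhE9MUHf3w/1477628604/vl107aeb2d7db53f91fc6ad2e76fe11e49.mp4', provider: 'http' }]);
</script>
HTML;

$imageUrl = null;
$videoUrl = null;

// Assumed pattern: content="..." directly follows property="og:image".
if (preg_match('/<meta\s+property="og:image"\s+content="([^"]+)"/i', $html, $m)) {
    $imageUrl = $m[1];
}

// Assumed pattern: the first single-quoted value after "file:" inside jw.load([{ ... }]).
if (preg_match('/jw\.load\(\[\{\s*file:\s*\'([^\']+)\'/', $html, $m)) {
    $videoUrl = $m[1];
}

echo $imageUrl, "\n", $videoUrl, "\n";
```

If the site reorders the meta attributes or switches quote styles, these patterns will stop matching, which is the usual argument for a DOM parser on the HTML part.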

Unless you have a JavaScript parser for PHP handy, you can at least get rid of the regular expression for the HTML search. Something like this should work, though it's hard to test without the URL...
<?php
$url1 = $url2 = null;
$dom = new DOMDocument();
$dom->loadHTMLFile("http://example.com/example.html");
$xpath = new DOMXPath($dom);
// XPath selects attributes with @ (here @property and @content)
$metanode = $xpath->query("//meta[@property='og:image']/@content");
if ($metanode->length) {
    $url1 = $metanode->item(0)->value;
}
$scriptnode = $xpath->query("//script");
foreach ($scriptnode as $script) {
    // Scan each script line by line for the jw.load() call
    $array = explode("\n", $script->nodeValue);
    foreach ($array as $line) {
        if (preg_match("/jw\.load... file: '(.*?)'/", $line, $matches)) {
            $url2 = $matches[1];
            break 2; // leave both loops once the URL is found
        }
    }
}
echo $url1;
echo $url2;

Related

PHP file_get_contents not showing url link

I'm having an issue with PHP's file_get_contents(). I have a text file of links and a foreach loop that should display the content of each link on the same web page, but it isn't working. Please take a look at the code:
<?php
$urls = file("links.txt");
foreach($urls as $url) {
file_get_contents($url);
echo $url;
}
The content of links.txt is: https://www.google.com
Result: Only a String displaying "https://www.google.com"
Another code that works is :
$url1 = file_get_contents('https://google.com');
echo $url1;
This code returns google's homepage, but I need to use first method with loops to provide multiple links.
Any idea?

Here's one way of combining the things you already had implemented:
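// FILE_IGNORE_NEW_LINES strips the trailing newline that file() otherwise keeps on each URL
$urls = file("links.txt", FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);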
foreach($urls as $url) {
$contents = file_get_contents($url);
echo $contents;
}
Both file and file_get_contents return a value; what you needed to do was store the return value of the latter in a variable, then output that variable with echo.
In fact, you don't even need the variable: this...
$urls = file("links.txt");
foreach($urls as $url) {
echo file_get_contents($url);
}
... should have been sufficient too.
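One detail worth checking when the list holds several links: file() keeps the trailing newline on every line unless told otherwise, and a URL with a newline attached may fail to load. The sketch below is one way to guard against that. The helper name fetchAll is hypothetical, and the demo reads a local temporary file instead of a real URL so that it runs without network access.

```php
<?php
// Hypothetical helper: read one URL (or path) per line and fetch each one.
function fetchAll(string $listFile): array
{
    // FILE_IGNORE_NEW_LINES drops the "\n" that file() would otherwise keep.
    $lines = file($listFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return [];
    }

    $results = [];
    foreach ($lines as $line) {
        $target = trim($line); // also removes a stray "\r" from Windows line endings
        if ($target === '') {
            continue;
        }
        // file_get_contents() returns false on failure; keep that so callers can check.
        $results[$target] = @file_get_contents($target);
    }
    return $results;
}

// Demo with local temporary files standing in for links.txt and a remote page.
$page = tempnam(sys_get_temp_dir(), 'page');
file_put_contents($page, 'hello');
$list = tempnam(sys_get_temp_dir(), 'list');
file_put_contents($list, $page . "\n\n");

$fetched = fetchAll($list);
echo $fetched[$page], "\n";

unlink($page);
unlink($list);
```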

PHP echo content from <tags>

I've searched around and around and I'm not sure how this really works.
I have the tags
<taghere>content</taghere>
and I want to pull out the "content" so I can put it in an if statement, since the "content" varies depending on the page
i.e
<taghere>HelloWorld</taghere>
$content = //function that returns the text between <taghere> and </taghere>
if($content == "HelloWorld")
{
//execute function;
}
else if($content =="Bonjour")
{
//execute separate function
}
I tried using preg but it doesn't seem to work; it just returns whatever value is in the lines field instead of giving me the information within the tags.

If I understand your question correctly, you want the data INSIDE the tag "taghere".
If you are parsing HTML, you should use DOMDocument
Try something similar to this:
<?php
// Assuming your content (the HTML where those tags are found) is available as $html
$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html); // loads your HTML
libxml_clear_errors();
// Note: Tag names are case sensitive
$nodes = $doc->getElementsByTagName('taghere');
// Echo the text of the first matching element, if there is one
if ($nodes->length > 0) {
    echo $nodes->item(0)->textContent;
}
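To connect this back to the if/else in the question, the lookup can be wrapped in a small function. This is only a sketch: firstTagText is a hypothetical helper name, and it assumes the first matching element is the one you care about.

```php
<?php
// Hypothetical helper: return the text of the first <$tag> element, or null if none exists.
function firstTagText(string $html, string $tag): ?string
{
    $doc = new DOMDocument();
    // Non-standard tags such as <taghere> trigger parser warnings; collect them silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $doc->getElementsByTagName($tag)->item(0);
    return $node === null ? null : trim($node->textContent);
}

$content = firstTagText('<div><taghere>HelloWorld</taghere></div>', 'taghere');

if ($content === 'HelloWorld') {
    $branch = 'english';
} elseif ($content === 'Bonjour') {
    $branch = 'french';
} else {
    $branch = 'unknown';
}
echo $branch, "\n";
```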

You can use DOMDocument and loadXML to do this:
<?php
function doAction($word = "") {
    $html = "<taghere>$word</taghere>";
    $doc = new DOMDocument();
    $doc->loadXML($html);
    // Look up the elements by tag name; use your desired tag here
    $hTwo = $doc->getElementsByTagName('taghere');
    if ($hTwo->item(0)->nodeValue == "HelloWorld") {
        echo "1";
    } else if ($hTwo->item(0)->nodeValue == "Bonjour") {
        echo "2";
        // execute separate function
    }
}
doAction("Bonjour");

You cannot do it like that. Technically it is possible but it's more than an overkill. And you mixed up PHP with HTML in a way that doesn't work.
To achieve the thing that you want you have to do something like this:
$content = 'something';
if ($content === 'something') {
//do something
}
if ($content === 'something else') {
//do something else
}
echo '<tag>'. $content . '</tag>' ;
Of course you can change $content in the ifs.

Don't forget, you can always add an ID to a tag so you can reference it with JavaScript.
<tag id='tagid'>blah blah blah </tag>
<script>
document.getElementById('tagid')
</script>
This might be a much simpler way to get what you are thinking about than some of the other responses.

I don't know what regex you tried, and therefore not what went wrong with it. It might have been the escaping of the <.
<?php
// Non-greedy (.*?) stops at the first closing tag
if (preg_match('#\<taghere>(.*?)\</taghere>#', $document, $a)) {
    $content = $a[1];
}
?>
I'm assuming there is only one such tag in the document.

How to return the link from background url with simple dom html?

I am trying to get the link of a background
<div class="mine" style="background: url('http://www.something.com/something.jpg')"></div>
I am using find('div.mine')
$link = find('div.mine');
$link returns the html code containing all the
How do I parse so it returns only the link?

That syntax isn't quite correct. You're doing $link = find('div.mine'); but that should be $link = $yourHTML->find('div.mine'); instead.
Get all the divs with the class name mine first, loop through them, and get the style attributes. Now you'll have a string like:
background: url('http://www.something.com/something.jpg')
You could then use a CSS Parser (recommended way), or a regular expression to grab just the URL part from that string.
if(preg_match('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $link, $matches)) {
$image_url = $matches[0];
}
Full code:
$html = file_get_html('file.html');
$divs = $html->find('div.mine');
foreach ($divs as $div) {
$link = $div->style;
}
if(preg_match('#\bhttps?://[^\s()<>]+(?:\([\w\d]+\)|([^[:punct:]\s]|/))#', $link, $matches)) {
$image_url = $matches[0];
}
echo $image_url;
Output:
http://www.something.com/something.jpg
The URL matching regex pattern is from Wordpress' make_clickable function in wp-includes/formatting.php. See this post for the complete implementation.
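If the style value always has the url(...) form shown in the question, a narrower pattern aimed at that form may be easier to reason about than a general URL matcher. The sketch below uses PHP's built-in DOMDocument rather than Simple HTML DOM so that it runs without extra libraries; cssUrl is a hypothetical helper name, and the pattern is an assumption that does not cover every valid CSS url() syntax.

```php
<?php
// Hypothetical helper: pull the url(...) value out of a CSS declaration string.
function cssUrl(string $style): ?string
{
    // Accepts url('...'), url("...") and unquoted url(...).
    if (preg_match('/url\(\s*[\'"]?([^\'")]+)[\'"]?\s*\)/i', $style, $m)) {
        return trim($m[1]);
    }
    return null;
}

$html = '<div class="mine" style="background: url(\'http://www.something.com/something.jpg\')"></div>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$urls = [];
// Match divs whose class list contains the token "mine".
$query = "//div[contains(concat(' ', normalize-space(@class), ' '), ' mine ')]";
foreach ($xpath->query($query) as $div) {
    $url = cssUrl($div->getAttribute('style'));
    if ($url !== null) {
        $urls[] = $url;
    }
}
print_r($urls);
```

Collecting into an array also covers the case where several div.mine elements exist, instead of keeping only the last one.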

Try using the substr() function to extract the text.

change variable with GET method

I have a page test.php in which I have a list of names:
name1: 992345
name2: 332345
name3: 558645
name4: 434544
On another page, test1.php?id=name2, the result should be:
332345
I've tried this PHP code:
<?php
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTMLFile("/test.php");
$xpath = new DOMXpath($doc);
$elements = $xpath->query("//*@".$_GET["id"]."");
if (!is_null($elements)) {
foreach ($elements as $element) {
$nodes = $element->childNodes;
foreach ($nodes as $node) {
echo $node->nodeValue. "\n";
}
}
}
?>
I need to be able to change the name through the GET parameter, as in test1.php?id=name4
The result should be different now.
434544
Is there another way, because mine doesn't work?

Here is another way to do it.
<?php
/* The file function reads your text file into an array, one line per element. */
$doc = file("test.php");
$id = $_GET["id"];
/* Show your array. You can remove this part after you
 * are sure your text file is read correctly. */
echo "Seeking id: $id<br>";
echo "Elements:<pre>";
print_r($doc);
echo "</pre>";
/* This part searches for the GET variable. file() returns false on failure. */
if ($doc !== false) {
    foreach ($doc as $line) {
        if (strpos($line, $id) !== false) {
            $search = $id.": ";
            $replace = '';
            echo str_replace($search, $replace, $line);
        }
    }
} else {
    echo "No elements.";
}
?>

There is a completely different way to do this, using PHP combined with JavaScript (not sure if that's what you're after and if it can work with your app, but I'm going to write it). You can change your test.php to read the GET parameter (it can be POST as well, you'll see), and according to that, output only the desired value, probably from the associative array you have hard-coded in there. The JavaScript approach will be different and it would involve making a single AJAX call instead of DOM traversing using PHP.
So, in short: make an AJAX call to test.php, which then outputs the desired value based on the GET or POST parameter.
jQuery AJAX here; native JS tutorial here.
Just let me know if this won't work for your app, and I'll delete my answer.
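The server-side half of that idea, looking a name up in an associative array, could be sketched as follows. parseNameList is a hypothetical helper, and the list is hard-coded as a string here so the example runs on the command line. An exact key lookup also avoids a substring search matching name1 inside a longer name such as name10.

```php
<?php
// Hypothetical helper: turn "name: value" lines into an associative array.
function parseNameList(string $text): array
{
    $map = [];
    foreach (preg_split('/\R/', $text) as $line) {
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && trim($parts[0]) !== '') {
            $map[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $map;
}

$list = "name1: 992345\nname2: 332345\nname3: 558645\nname4: 434544\n";
$map = parseNameList($list);

// On the real page the id comes from the query string; the fallback is only for CLI runs.
$id = $_GET['id'] ?? 'name2';
echo $map[$id] ?? 'No such id', "\n";
```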

Parse Website for URLs

Just wondering if someone can help me further with the following. I want to parse the URLs on this website: http://www.directorycritic.com/free-directory-list.html?pg=1&sort=pr
I have the following code:
<?PHP
$url = "http://www.directorycritic.com/free-directory-list.html?pg=1&sort=pr";
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $input, $matches)) {
// $matches[2] = array of link addresses
// $matches[3] = array of link text - including HTML code
}
?>
At present this does nothing. What I need it to do is scrape all the URLs in the table across all 16 pages and output them to a text file. I would really appreciate some help with how to amend the code above to do that.

Use HTML Dom Parser
$html = file_get_html('http://www.example.com/');
// Find all links
$links = array();
foreach($html->find('a') as $element)
$links[] = $element->href;
Now links array contains all URLs of given page and you can use these URLs to parse further.
Parsing HTML with regular expressions is not a good idea. Here are some related posts:
Using regular expressions to parse HTML: why not?
RegEx match open tags except XHTML self-contained tags
EDIT:
Some Other HTML Parsing tools as described by Gordon in comments below:
phpQuery
Zend_Dom
QueryPath
FluentDom

You really shouldn’t use regular expressions to parse HTML as it’s too error-prone.
Better use an HTML parser like the one of PHP’s DOM library:
$code = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($code);
$links = array();
foreach ($doc->getElementsByTagName('a') as $element) {
if ($element->hasAttribute('href')) {
$links[] = $element->getAttribute('href');
}
}
Note that this will collect the URI references as they appear in the document and not as an absolute URI. You might want to resolve them before.
It seems that PHP doesn’t provide an appropriate library (or I haven’t found it yet). But see RFC 3986 – Reference Resolution and my answer on Convert a relative URL to an absolute URL with Simple HTML DOM? for further details.
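For the part of the question about all 16 pages and a text file, the DOM approach could be extended roughly as below. This is a sketch under assumptions: extractLinks and scrapePages are hypothetical names, the pg=1..16 query scheme is inferred from the URL in the question, and the links are written as found, without resolving relative references. Only extractLinks is exercised here, on a small sample string, so the example runs offline.

```php
<?php
// Hypothetical helper: return every href found in an HTML string.
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate invalid markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $links[] = $a->getAttribute('href');
        }
    }
    return $links;
}

// Hypothetical driver: fetch pages 1..$pages and append their links to $outFile.
function scrapePages(string $baseUrl, int $pages, string $outFile): void
{
    for ($pg = 1; $pg <= $pages; $pg++) {
        $html = @file_get_contents($baseUrl . '?pg=' . $pg . '&sort=pr');
        if ($html === false || $html === '') {
            continue; // skip pages that could not be fetched
        }
        $links = extractLinks($html);
        if ($links !== []) {
            file_put_contents($outFile, implode(PHP_EOL, $links) . PHP_EOL, FILE_APPEND);
        }
    }
}

$sample = '<table><tr><td><a href="http://example.com/a">A</a></td>'
        . '<td><a name="x">no href</a></td><td><a href="/b">B</a></td></tr></table>';
$links = extractLinks($sample);
print_r($links);
```

A call such as scrapePages('http://www.directorycritic.com/free-directory-list.html', 16, 'links.txt') would then cover the paginated list, assuming the site still serves those pages.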

Try this method
function getinboundLinks($domain_name) {
ini_set('user_agent', 'NameOfAgent (http://localhost)');
$url = $domain_name;
$url_without_www=str_replace('http://','',$url);
$url_without_www=str_replace('www.','',$url_without_www);
$url_without_www= str_replace(strstr($url_without_www,'/'),'',$url_without_www);
$url_without_www=trim($url_without_www);
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
$inbound=0;
$outbound=0;
$nonfollow=0;
if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
foreach($matches as $match) {
# $match[2] = link address
# $match[3] = link text
//echo $match[3].'<br>';
if(!empty($match[2]) && !empty($match[3])) {
if(strstr(strtolower($match[2]),'URL:') || strstr(strtolower($match[2]),'url:') ) {
$nonfollow +=1;
} else if (strstr(strtolower($match[2]),$url_without_www) || !strstr(strtolower($match[2]),'http://')) {
$inbound += 1;
echo '<br>inbound '. $match[2];
}
else if (!strstr(strtolower($match[2]),$url_without_www) && strstr(strtolower($match[2]),'http://')) {
echo '<br>outbound '. $match[2];
$outbound += 1;
}
}
}
}
$links['inbound']=$inbound;
$links['outbound']=$outbound;
$links['nonfollow']=$nonfollow;
return $links;
}
// ************************Usage********************************
$Domain='http://zachbrowne.com';
$links=getinboundLinks($Domain);
echo '<br>Number of inbound Links '.$links['inbound'];
echo '<br>Number of outbound Links '.$links['outbound'];
echo '<br>Number of Nonfollow Links '.$links['nonfollow'];
