PHP - Extract data from string with regex

I need help with this operation. I have a string like this:
<!doctype html> <html> <head> <meta charset="utf-8"> <title>Formatting the report</title><meta http-equiv="refresh" content="5;url=/file/xslt/download/?fileName=somename.pdf"> </head>
I need to extract the fileName parameter. How can I do this?
I think it is possible with regex, but I do not know regex well.
Thanks!

Try this. The pattern below captures the file name:
/fileName=(.+?)"/
<?php
// Single quotes are required here because the HTML itself contains double quotes
$subject = '<!doctype html> <html> <head> <meta charset="utf-8"> <title>Formatting the report</title><meta http-equiv="refresh" content="5;url=/file/xslt/download/?fileName=somename.pdf"> </head>';
$pattern = '/fileName=(.+?)"/';
preg_match($pattern, $subject, $matches);
print_r($matches);
?>
$matches[1] contains the file name.

Try something along the lines of:
$str = '<!doctype html> <html> <head> <meta charset="utf-8"> <title>Formatting the report</title><meta http-equiv="refresh" content="5;url=/file/xslt/download/?fileName=somename.pdf"> </head>';
preg_match('#fileName=(.*)"#', $str, $matches);
print_r($matches);
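One thing to watch with the (.*) version: it is greedy, so it runs up to the last double quote in the string. That happens to work for the sample above, but it may capture too much if more attributes follow. Here is a small sketch of the difference, assuming a made-up string with an extra meta tag after the refresh tag (the extra tag is my own illustration, not from the question):

```php
<?php
// Hypothetical input: a second tag follows, so more double quotes come after the file name
$str = '<meta http-equiv="refresh" content="5;url=/file/xslt/download/?fileName=somename.pdf"> <meta name="x" content="y">';

// Greedy: (.*) extends to the LAST double quote in the string
preg_match('#fileName=(.*)"#', $str, $greedy);

// Negated character class: stops at the first double quote or ampersand
preg_match('#fileName=([^"&]+)#', $str, $strict);

print_r($greedy);
print_r($strict);
```

The negated class also stops at "&", which should help if other query parameters come after fileName.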

PHP Simple HTML DOM is a clean way to traverse HTML and find elements using selectors similar to jQuery selectors.
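As a rough sketch of the parser-based route, here is a version that uses PHP's built-in DOMDocument and DOMXPath instead of the Simple HTML DOM library (that substitution is mine, to avoid a third-party dependency). It assumes the refresh tag's content attribute has the "seconds;url=..." shape shown in the question:

```php
<?php
$html = '<!doctype html> <html> <head> <meta charset="utf-8"> <title>Formatting the report</title><meta http-equiv="refresh" content="5;url=/file/xslt/download/?fileName=somename.pdf"> </head>';

// Collect parser warnings internally instead of printing them
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$meta = $xpath->query('//meta[@http-equiv="refresh"]')->item(0);

$fileName = null;
if ($meta !== null && preg_match('/url=(.+)$/i', $meta->getAttribute('content'), $m)) {
    // $m[1] is the redirect URL; pull its query string apart
    parse_str((string) parse_url($m[1], PHP_URL_QUERY), $params);
    $fileName = $params['fileName'] ?? null;
}

echo $fileName, "\n";
```

This is longer than a single regex, but it does not depend on where the quotes fall in the markup.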

Related

How do i make Xpath 1.0 query case insensitive

In PHP, I'm currently making an XPath query, but I need to make it case insensitive.
I'm using XPath 1.0, which as far as I can tell means I have to use the translate() function, but I'm unsure how to do this.
Here is my test PHP file:
$html = <<<'HTML'
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=utf-8">
<meta NAME="Description" content="Test Case">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<Link Rel="Canonical" href="http://www.testsite.com/" />
<Title>My Title</Title>
</head>
<Body>
Test Case
</Body>
</html>
HTML;
$domDoc = new DOMDocument();
$domDoc->loadHTML('<?xml encoding="utf-8" ?>' . $html);
// Canonical link
$xpath = new DOMXPath($domDoc);
$canonicalTags = $xpath->query('//link[@rel=\'canonical\']'); // Returns nothing
//some use translate(WhatVariable?, 'ABCDEFGHIJKLMNOPQRSTUVWXYZÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞŸŽŠŒ', 'abcdefghijklmnopqrstuvwxyzàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿžšœ')
var_dump($canonicalTags);
Any help would be greatly appreciated. Thanks.
Basically, translate() is used to convert the dynamic value you need to compare to all lower case (or all upper case). In this case, you want to apply translate() to the rel attribute value and compare the result to the lower-case literal "canonical" (formatted for readability):
//link[
translate(@rel, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') = 'canonical'
]
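Plugged into PHP, that expression could look like the sketch below. It assumes a cut-down version of the test document from the question; the variable names are only illustrative:

```php
<?php
// Reduced version of the question's HTML, with mixed-case tag, attribute and value
$html = '<html><head><Link Rel="Canonical" href="http://www.testsite.com/" /></head><body>Test Case</body></html>';

libxml_use_internal_errors(true);
$domDoc = new DOMDocument();
$domDoc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($domDoc);

// translate() lower-cases the attribute value before the comparison
$query = "//link[translate(@rel, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz') = 'canonical']";
$canonicalTags = $xpath->query($query);

foreach ($canonicalTags as $tag) {
    echo $tag->getAttribute('href'), "\n";
}
```

Only the attribute value needs translate() here, because the HTML parser already lower-cases element and attribute names when loading the document.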

How to output Chinese in HTML file

I have a form that inserts some Chinese words into a database, and that works fine. The table charset is UTF8. The problem appears when I select this data and send it by mail as an HTML attachment.
Then the Chinese text doesn't display properly. How can I fix the charset before sending the data by mail? Should I use some headers, and will that work?
My code looks like this:
//$attachedBodyContent is data from database that contains some chinese words
Mail::send(
"emails.applicationTemplate",
$data,
function($message) use ($data, $template, $subject, $attachedBodyContent) {
$message->to($data['email'], $data['name'])
->from($template['from'],$template['from_name'])
->subject($subject)
->attachData($attachedBodyContent,'YourApplicationData.html');
}
);
When you generate the .html attachment, you should include this in its <head>:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
You can use the following code to wrap your content in a document with that <head>:
<?php
$header = '<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>';
$footer = '</body>
</html>';
$allContent = $header.$attachedBodyContent.$footer;
?>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
This should do it. For further information, check the link:
http://www.inventpartners.com/chinese-chars

Replace all <link> tags containing given href attribute with Regex or DOM

I'm struggling with this. The idea is to replace all <link> tags, containing specific href attribute inside given string (which comes from a buffer and it is regular HTML, but malformed sometimes).
I've tried to use the PHP DOM approach, also the SimpleHTMLDOM parser library, so far nothing works for me (the problem is that DOM approach returns only links inside <body> element, but not those in <head> section of the page), so I decided to use regex.
Here is the non-working PHP DOM approach code:
function remove_css_links($string = "", $css_files = array()) {
$css_files = array("http://www.example.com/css/css.css?ver=2.70","style.css?ver=3.8.1");
$xml = new DOMDocument();
$xml->loadHTML($string);
$link_list = $xml->getElementsByTagName('link');
$link_list_length = $link_list->length;
//The cycle
for ($i = 0; $i < $link_list_length; $i++) {
$attributes = $link_list->item($i)->attributes;
$href = $attributes->getNamedItem('href');
if (in_array($href->value, $css_files)) {
//Remove the HTML node
}
}
$string = $xml->saveHTML();
return $string;
}
Here is the regex code. I know that regex is not recommended for parsing HTML, but let's not discuss that here and now:
$html_text = '
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="shortcut icon" href="http://www.example.com/favicon.ico" />
<link rel="alternate" type="application/rss+xml" title="Website » Feed" href="/feed/" />
<link rel=\'stylesheet\' href=\'http://www.example.com/css/css.css?ver=2.70\' type=\'text/css\' media=\'all\' /></head>
<body>...some content...
<link rel=\'stylesheet\' id=\'css\' href=\'style.css?ver=3.8.1\' type=\'text/css\' media=\'all\' />
</body></html>
';
$url = preg_quote("http://www.example.com/css/css.css?ver=2.70");
$pattern = "~<link([^>]+) href=".$url."/?>~";
$link = preg_replace($pattern, "", $html_text);
The problem with the regex is that the href attribute can be anywhere inside the <link> tag, and the pattern must not match every kind of <link> tag: I do not want to remove the shortcut icon or alternate links, or anything whose href differs from the given URL. Note that the <link> tags use different kinds of quotes, single and/or double.
However, I'm open to suggestions, and if it is possible to make the DOM approach work rather than use regex, that's OK.
OK, so here you are :
<?php
$html_text = '
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="shortcut icon" href="http://www.example.com/favicon.ico" />
<link rel="alternate" type="application/rss+xml" title="Website » Feed" href="/feed/" />
<link rel="stylesheet" href="http://www.example.com/css/css.css?ver=2.70" type="text/css" media="all" /></head>
<body>...some content...
<link rel="stylesheet" id="css" href="style.css?ver=3.8.1" type="text/css" media="all" />
</body></html>
';
$d = new DOMDocument();
@$d->loadHTML($html_text); // "@" suppresses warnings about malformed HTML
$xpath = new DOMXPath($d);
$result = $xpath->query("//link");
foreach ($result as $link)
{
$href = $link->getAttribute("href");
if ($href=="whatyouwanttofilter")
{
$link->parentNode->removeChild($link);
}
}
$output= $d->saveHTML();
echo $output;
?>
Tested and working. Have fun! :-)
The general idea is :
Load your HTML into a DOMDocument
Look for link nodes, using XPath
Loop through the nodes
Depending on the node's href attribute, delete the node (actually, remove the child from its... parent - well, yep, that's the php way... lol)
After doing all the cleaning-up, re-save the HTML and get it back into a string
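Put together for the original question, the steps above could be sketched like this. It borrows the remove_css_links name and the example URLs from the question; the signature, the strict in_array() comparison and the shortened sample HTML are my own assumptions:

```php
<?php
// Sketch: remove every <link> whose href is listed in $cssFiles
function remove_css_links(string $html, array $cssFiles): string
{
    libxml_use_internal_errors(true); // tolerate malformed markup quietly
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Copy the matches into an array so removing nodes cannot disturb the loop
    $links = iterator_to_array($xpath->query('//link[@href]'));

    foreach ($links as $link) {
        if (in_array($link->getAttribute('href'), $cssFiles, true)) {
            $link->parentNode->removeChild($link);
        }
    }

    return $doc->saveHTML();
}

$html = '<!DOCTYPE html><html><head>'
    . '<link rel="shortcut icon" href="http://www.example.com/favicon.ico" />'
    . "<link rel='stylesheet' href='http://www.example.com/css/css.css?ver=2.70' type='text/css' />"
    . '</head><body>...some content...'
    . "<link rel='stylesheet' id='css' href='style.css?ver=3.8.1' type='text/css' />"
    . '</body></html>';

$clean = remove_css_links($html, [
    'http://www.example.com/css/css.css?ver=2.70',
    'style.css?ver=3.8.1',
]);

echo $clean;
```

Because the query runs against the whole document, it should pick up <link> tags in both <head> and <body>, and the quote style of the attributes does not matter once the HTML is parsed.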

string's result is different after load in domdocument

I want to get the same result after loading the string into DOMDocument. How can I do that?
echo "Café";
$s = <<<HTML
<html>
<head>
</head>
<body>
Café
</body>
</html>
HTML;
$d = new domdocument;
$d->loadHTML($s);
echo $d->textContent;
The first echo's result is: Café
The second echo's result is: CafÃ©
You need to mark your HTML as UTF-8 encoded:
$s = <<<HTML
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
Café
</body>
</html>
HTML;
$d = new domdocument;
$d->loadHTML($s);
echo $d->textContent;
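If you cannot add a meta tag to the markup, a different technique (not the one above) is to convert every non-ASCII character to a numeric entity before loading, so the parser's guess about the encoding no longer matters. A minimal sketch, assuming the mbstring extension is available and the input string is UTF-8:

```php
<?php
// "\u{00E9}" is the letter e with an acute accent, written as an escape so the bytes are UTF-8
$s = "<html><head></head><body>Caf\u{00E9}</body></html>";

// Map every code point from 0x80 upwards to a numeric character reference
$ascii = mb_encode_numericentity($s, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

$d = new DOMDocument();
$d->loadHTML($ascii);

// Read the text back from the <body> element; DOM strings come back as UTF-8
$text = trim($d->getElementsByTagName('body')->item(0)->textContent);
echo $text;
```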
Your problem is encoding.
For the first echo, you output the text in your default encoding,
but in the text rendered through DOMDocument,
the "é" (e with an accent) is split into two characters.
I don't know how to force the right encoding on DOMDocument,
but I am sure this is your problem.
Hope that helps,
best of luck.
With the first echo, before the HTML, you send headers with your server's default encoding. Any encoding set afterwards is ignored.
You must first echo
the <html> tag, the encoding declaration, etc.,
and then echo any other values.

Converting russian characters from upper case to lower case in php

I'm trying to change the case of russian characters from upper to lower.
function toLower($string) {
echo strtr($string,'ЁЙЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЯЧСМИТЬБЮ','ёйцукенгшщзхъфывапролджэячсмитьбю');
};
This is the function I used and the output looks something like this
ЁЙ## ёѹ##`
Can anybody help me with this?
Thanks in advance
$result = mb_strtolower($orig, 'UTF-8');
(assuming the data is in utf-8)
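The likely reason the strtr() version garbles the text is that its three-argument form works byte by byte, while each Cyrillic letter takes two bytes in UTF-8. A small sketch of the difference, assuming the script file itself is saved as UTF-8 and mbstring is enabled:

```php
<?php
$orig = 'ЁЙЦУКЕН';

// strlen() counts bytes, mb_strlen() counts characters
$bytes = strlen($orig);
$chars = mb_strlen($orig, 'UTF-8');

// mb_strtolower() understands multi-byte characters
$lower = mb_strtolower($orig, 'UTF-8');

echo $bytes, ' bytes, ', $chars, ' characters', "\n";
echo $lower, "\n";
```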
Specify the charset within the HTML and use mb_strtolower() to convert case:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 TRANSITIONAL//EN">
<html>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<title></title>
</head>
<body>
<?php
$string = 'ЦУКЕНГШЩЗХЪФЫВАПРОЛДЖЭЯЧСМИТЬБЮ' ;
echo mb_strtolower($string, 'UTF-8');
?>
</body>
</html>
With the meta-tag it looks like this:
цукенгшщзхъфывапролджэячсмитьбю
Without the meta-tag it looks like this
цукенгшщзхъфывапролджÑÑчÑмитьбю
