Scraping HN Front Page - Handling Simple HTML Dom Error


I'm using 'Simple HTML Dom' to scrape the HN Front Page (news.ycombinator.com), which works great most of the time.
However, every now and then they promote a job/company that lacks the elements that the scraper is looking for, i.e. score, username and number of comments.
This, of course, breaks the array and thus the output of my script:
<?php
// 2012-02-12 Maximilian (Extract news.ycombinator.com's Front Page)
// Set the header during development
//header ("content-type: text/xml");
// Call the external PHP Simple HTML DOM Parser (http://simplehtmldom.sourceforge.net/manual.htm)
include('lib/simple_html_dom.php');
date_default_timezone_set('Europe/Berlin');
// Download 'news.ycombinator.com' content
//$tmp = file_get_contents('http://news.ycombinator.com');
//file_put_contents('get.tmp', $tmp);
// Retrieve the content
$html = file_get_html('tc.tmp');
// Set the extraction pattern for each item
$title = $html->find("tr td table tr td.title a");
$score = $html->find("tr td.subtext span");
$user = $html->find("tr td.subtext a[href^=user]");
$link = $html->find("tr td table tr td.title a");
$time = $html->find("tr td.subtext");
$additionals = $html->find("tr td.subtext a[href^=item?id]");
// Construct the feed by looping through the items
for($i=0;$i<29;$i++) {
$cr=1;
// Check if the item points to an external website
if (!strstr($link[$i]->href,'http')) {
$url = 'http://news.ycombinator.com/'.$link[$i]->href;
$description = "Join the discussion on Hacker News.";
} else {
$url = $link[$i]->href;
// Getting content here
if (empty($abstract)) {
$description ="Failed to load any relevant content. Please try again later.";
} else {
$description = $abstract;
}
}
// Put all the items together
$result .= '<item><id>f'.$i.'</id><title>'.htmlspecialchars(trim($title[$i]->plaintext)).'</title><description><![CDATA['.$description.']]></description><pubDate>'.str_replace(' | '.$additionals[$i]->plaintext,'',str_replace($score[$i]->plaintext.' by '.$user[$i]->plaintext.' ','',$time[$i]->plaintext)).'</pubDate><score>'.$score[$i]->plaintext.'</score><user>'.$user[$i]->plaintext.'</user><comments>'.$additionals[$i]->plaintext.'</comments><id>'.substr($additionals[$i]->href,8).'</id><discussion>http://news.ycombinator.com/'.$additionals[$i]->href.'</discussion><link>'.htmlspecialchars($url).'</link></item>';
}
$output = '<rss><channel><id>news.ycombinator.com Frontpage</id><buildDate>'.date('Y-m-d H:i:s').'</buildDate>'.$result.'</channel></rss>';
file_put_contents('tc.xml', $output);
?>
Here's an example of the correct output
<item>
<id>f0</id>
<title>Show HN: Bootswatch, free swatches for your Bootstrap site</title>
<description><![CDATA[Easy to Install Simply download the CSS file from the swatch of your choice and replace the one in Bootstrap. No messing around with hex values. Whole New Feel We've all been there with the black bar and blue buttons. See how a splash of color and typography can transform the feel of your site. Modular Changes are contained in just two LESS files, enabling modification and ensuring forward compatibility.]]></description>
<pubDate>3 hours ago</pubDate>
<score>196 points</score>
<user>parkov</user>
<comments>30 comments</comments>
<id>3594540</id>
<discussion>http://news.ycombinator.com/item?id=3594540</discussion>
<link>http://bootswatch.com</link>
</item>
<item>
<id>f1</id>
<title>Louis CK inspires Jim Gaffigan to sell comedy special for $5 online</title>
<description><![CDATA[Dear Internet Friends,Inspired by the brilliant Louis CK, I have decided to debut my all-new hour stand-up special on my website, Jimgaffigan.com.Beginning sometime in April, “Jim Gaffigan: Mr. Universe” will be available exclusively for download for only $5. A dollar from each download will go directly to The Bob Woodruff Foundation; a charity dedicated to serving injured Veterans and their families.I am confident that the low price of my new comedy special and the fact that 20% of each $5 download will be donated to this very noble cause will prevent people from stealing it. Maybe I’m being naïve, but I trust you guys.]]></description>
<pubDate>57 minutes ago</pubDate>
<score>25 points</score>
<user>rkudeshi</user>
<comments>4 comments</comments>
<id>3595285</id>
<discussion>http://news.ycombinator.com/item?id=3595285</discussion>
<link>http://www.whosay.com/jimgaffigan/content/218011</link>
</item>
And here's an example of incorrect output. Note that the elements are not empty, thus I cannot seem to catch the error and simply jump to the next item. Everything past the promotion post will break:
<item>
<id>f14</id>
<title>Build the next Legos: We're hiring an iOS Developer & Web Developer (YC S11)</title>
<description><![CDATA[Interested in building the next generation of toys on digital devices such as the iPad? That’s what we’re doing here at Launchpad Toys with apps like Toontastic (Named one of the “Top 10 iPad Apps of 2011” by the New York Times and was recently added to the iTunes Hall of Fame) and an awesom]]><![CDATA[e suite of others we have under development. We’re looking for creative and playful coders that have made games or highly visual apps/sites in the past for our two open development positions. As a kid, you probably played with Legos endlessly and grew up to be a hacker because you still love building things. Sounds like you? Email us at howdy#launchpadtoys.com with a couple links to some projects and code that we can look at along with your resume.]]></description>
<pubDate>2 hours ago</pubDate>
<score>14 points</score>
<user>bproper</user>
<comments>7 comments</comments>
<id>3594944</id>
<discussion>http://news.ycombinator.com/item?id=3594944</discussion>
<link>http://launchpadtoys.com/blog/2012/02/iosdeveloper-webdeveloper/</link>
</item>
<item>
<id>f15</id>
<title>SOPA foe Fred Wilson supports a blacklist on pirate sites</title>
<description><![CDATA[VC Fred Wilson says Google, Bing, Facebook, and Twitter should warn people when they try to log in at known pirate sites: "We don't need legislation." Fred Wilson says: If they try to pass antipiracy legislation, it will once again be 'war.' (Credit: Greg Sandoval/CNET) Fred Wilson, a well-known ven]]><![CDATA[ture capitalist from New York, says he's in favor of creating a blacklist for Web sites found to traffic in pirated films, music, and other intellectual property. The co-founder of Union Square Ventures told a gathering of media executives at the Paley Center for Media yesterday that he believes a good antipiracy measure would be for Google, Twitter, Facebook, and other major sites to issue warnings to people when they try to connect with a known pirate site. Fred Wilson, a co-founder of Union Square Ventures, says 'Our children have been taught to steal.' (Credit: Union Square Ventures) Wilson favors establishing an independent group to create a "black and white list." "The blacklist are those sites we all know are bad news," he told the audience in New York.]]></description>
<pubDate>14 points by bproper 2 hours ago | 7 comments</pubDate>
<score>24 points</score>
<user>andrewcross</user>
<comments>12 comments</comments>
<id>3594558</id>
<discussion>http://news.ycombinator.com/item?id=3594558</discussion>
<link>http://news.cnet.com/8301-31001_3-57377862-261/post-sopa-influential-tech-investor-favors-blacklisting-pirate-sites/</link>
</item>
So here's my question: How can I handle a situation where a particular element is missing and find() doesn't throw an error? Do I have to start from scratch, or is there a better approach in scraping the HN front page?
For anyone curious, here's the whole XML file: http://thequeue.org/api/tc.xml

You have to work in chunks to handle that. There seems to be a dummy spacer element that can help you split the page:
$news = preg_split('/<tr style="height:5px"><\/tr>/',$html->find('tbody',2)->innertext);
Then use subselectors on each chunk:
foreach($news as $article){
$article = str_get_html($article);
// No upvote arrow found, so it's not a valid article
if(count($article->find('img')) === 0){
continue;
}
}
For the other elements, use the same selectors as before.

Well, thanks to Ivan's train of thought, I am now splitting the initially scraped HTML into an array, with each element representing a post. Then, looping over every post, I check whether the upvote arrow image exists. If it doesn't, the post is not added to the result. In the end everything is stitched back together and the sponsored post is left out. Here's the code:
$clean = '';
$array = explode('<tr style="height:5px"></tr>',$html);
foreach ($array as $post) {
// Keep only posts that contain the upvote arrow
if (strstr($post,'grayarrow.gif') !== false) {
$clean .= $post;
}
}
unset($array);
$html = str_get_html($clean.'</body></html>');
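Another way to avoid the index drift is to look up score, user and comments relative to each subtext cell instead of in page-wide arrays. The following is only a rough sketch using PHP's built-in DOMDocument and DOMXPath rather than Simple HTML DOM; the sample markup and the function name extractStories are made up for illustration and only approximate the real front page.

```php
<?php
// Hypothetical sample mimicking two front-page rows: a normal story and a job ad.
$sample = <<<HTML
<table>
<tr><td class="title"><a href="http://bootswatch.com">Show HN: Bootswatch</a></td></tr>
<tr><td class="subtext"><span>196 points</span> by <a href="user?id=parkov">parkov</a> 3 hours ago | <a href="item?id=3594540">30 comments</a></td></tr>
<tr><td class="title"><a href="http://example.com/jobs">We're hiring (YC S11)</a></td></tr>
<tr><td class="subtext">2 hours ago</td></tr>
</table>
HTML;

// Illustrative helper (not part of any library): returns only complete stories.
function extractStories(string $html): array
{
    libxml_use_internal_errors(true);   // tolerate sloppy real-world markup
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    $stories = [];
    foreach ($xpath->query('//td[@class="subtext"]') as $subtext) {
        // Query relative to this cell, so a missing field cannot shift other rows.
        $score    = $xpath->query('span', $subtext)->item(0);
        $user     = $xpath->query('a[starts-with(@href, "user")]', $subtext)->item(0);
        $comments = $xpath->query('a[starts-with(@href, "item?id")]', $subtext)->item(0);
        if ($score === null || $user === null || $comments === null) {
            continue;   // job/promoted post: skip it
        }
        // The title link sits in the table row directly before the subtext row.
        $title = $xpath->query('../preceding-sibling::tr[1]/td[@class="title"]/a', $subtext)->item(0);
        if ($title === null) {
            continue;
        }
        $stories[] = [
            'title'    => trim($title->textContent),
            'link'     => $title->getAttribute('href'),
            'score'    => trim($score->textContent),
            'user'     => trim($user->textContent),
            'comments' => trim($comments->textContent),
        ];
    }
    return $stories;
}

$stories = extractStories($sample);
print_r($stories);
```

Because every lookup is scoped to one row, a post without score, user or comments is simply skipped and the rows after it stay aligned.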

Related

Strip the contents of page via CURL in PHP [duplicate]

This question already has answers here:
How do you parse and process HTML/XML in PHP?
(31 answers)
Closed 7 years ago.
I want to load the contents of my website via cURL. However, I don't want the entire website contents; I just want the body part of it.
The link to my website is: www.sanjosespartan.com/blog.
The code for cURL is:
[insert_php]
$ch = curl_init("http://www.sanjosespartan.com/blog/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HEADER, 0);
echo curl_exec($ch);
curl_close($ch);
[/insert_php]
This loads all the CSS and other data in my URL.
But I want to echo only this part:
"<div class="entry-content">
<p>Welcome to ‘Books for Geeks’!!</p>
<p>BooksForGeeks offers you over 10 million titles across categories such as Children’s Books, Business & Economics, Indian Writing and Literature & Fiction.</p>
<p>Reading books is the favourite pastime of many people. If you’re bitten by the book-bug too, then there is a massive collection of books for you to read. From bestsellers to new & future releases, the choices are exhaustive when you shop onlineat India’s Largest Bookstore.</p>
<p>From books for dummies, to textbooks for students, there are a wide variety of books. You can explore the young adults books store if you’re looking to gift a nice book to a teenager, where you can find books from the best-selling series.</p>
<p>Innumerable books are divided under various categories like action & adventure, business & economics, comics & mangas, crime, thriller & mystery, fiction, humour, and romance. You can browse by genre when you buy online making it more convenient for you to narrow down your choices. Then there are biographies and true accounts bestsellers as well. These books are available in different formats like hardcover, paperback, and board book.</p>
</div>"
How can I achieve this? Strip HTML tags?

You'll want to use a DOM parser such as https://github.com/tj/php-selector to parse the document and select the contents of the <body> tag.

Download simple_html_dom.php from http://sourceforge.net/projects/simplehtmldom/files/simple_html_dom.php/download
and try this code:
error_reporting(-1);
// Adjust the path to wherever you saved simple_html_dom.php
include($_SERVER['DOCUMENT_ROOT'].'/marketplace/simple_html_dom.php');
$html = file_get_html('http://www.sanjosespartan.com/blog/');
// Print every div whose class is entry-content
foreach ($html->find('div[class=entry-content]') as $div) {
echo $div;
}
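If you would rather not add a third-party library, roughly the same thing can be sketched with PHP's built-in DOMDocument and DOMXPath. The $page string below is a made-up stand-in for the downloaded blog HTML, and extractEntryContent is just an illustrative name.

```php
<?php
// Hypothetical page standing in for the downloaded blog HTML.
$page = '<html><head><style>p{color:red}</style></head><body>'
      . '<div id="header">Menu</div>'
      . '<div class="entry-content"><p>Welcome to Books for Geeks!</p></div>'
      . '</body></html>';

// Illustrative helper: returns the markup of the first entry-content div, or null.
function extractEntryContent(string $html): ?string
{
    libxml_use_internal_errors(true);   // real pages are rarely valid HTML
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // Match the class as a whole word, even when the div carries several classes.
    $query = '//div[contains(concat(" ", normalize-space(@class), " "), " entry-content ")]';
    $div = $xpath->query($query)->item(0);

    return $div === null ? null : $doc->saveHTML($div);
}

$content = extractEntryContent($page);
echo $content;
```

In real use, $page would be the string returned by curl_exec() with CURLOPT_RETURNTRANSFER set, as in the question.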

Specification of mark-up format included in facebook open graph text

When I am performing Open Graph requests, some of the responses that I expect to be plain text contain some kind of markup. For example, when I request the Name and Description of an album, the description contains something like \u0040[12412421421:124:The Link]. (The \u0040 is actually the @ sign.)
In this case it seems that what it is saying is that the 'The Link' should be a hyperlink to a facebook page with ID 12412421421. I presume there is similar kind of markup for hashtags and external URLs.
I have been looking for official documentation of this format, but I can't seem to find any (I might be searching with the wrong keywords).
Is there any online documentation that describes this? Better still, is there a PHP library or function already available somewhere that converts this text into its HTML equivalent?
I am using this Facebook PHP SDK, but it doesn't seem to offer any such function. (Not sure if there is anything in the new version 4.0, but I can't use it anyway for now because it requires PHP 5.4+ and my host is currently still on 5.3.)

It's true that the PHP SDK doesn't provide anything to deal with these links and the documentation doesn't document that either. However the API gives all the information you need in the description field itself, so here is what you could do:
$description = "Live concert with @[66961492640:274:Moonbootica] "
. "in @[106078429431815:274:London, United Kingdom]! #music #house";
function get_html_description($description) {
return
// 1. Handle tags (pages, people, etc.)
preg_replace_callback("/@\[([0-9]*):([0-9]*):(.*?)\]/", function($match) {
// Link to the tagged object by its ID (the URL pattern is an assumption)
return '<a href="https://www.facebook.com/'.$match[1].'">'.$match[3].'</a>';
},
// 2. Handle hashtags
preg_replace_callback("/#(\w+)/", function($match) {
// Link to the hashtag page (the URL pattern is an assumption)
return '<a href="https://www.facebook.com/hashtag/'.$match[1].'">'.$match[0].'</a>';
},
// 3. Handle line breaks
str_replace("\n", "<br />", $description)));
}
// Display HTML
echo get_html_description($description);
While 2. and 3. handle hashtags and line breaks, part 1. of the code splits the tag @[ID:TYPE:NAME] into 3 groups of information (id, type, name) before generating HTML links from the page IDs and names:
Live concert with Moonbootica in London, United Kingdom! #music #house
Live concert with Moonbootica in London, United Kingdom!
#music #house
FYI, even if it's not very useful, here are the meanings of the types:
an app (128),
a page (274),
a user (2048).
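If you need the tag data itself rather than HTML, the same pattern can feed a small parser. This is only a sketch, assuming the tags use the @[ID:TYPE:NAME] form and that the three type codes listed above are the only ones of interest; TAG_TYPES and parseTags are names invented for the example.

```php
<?php
// Type codes as listed above; anything else is reported as "unknown".
const TAG_TYPES = [128 => 'app', 274 => 'page', 2048 => 'user'];

// Illustrative helper: returns one array per @[ID:TYPE:NAME] tag found.
function parseTags(string $description): array
{
    $tags = [];
    preg_match_all('/@\[(\d+):(\d+):(.*?)\]/', $description, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $tags[] = [
            'id'   => $m[1],   // kept as a string: IDs can exceed 32-bit integers
            'type' => TAG_TYPES[(int) $m[2]] ?? 'unknown',
            'name' => $m[3],
        ];
    }
    return $tags;
}

$description = "Live concert with @[66961492640:274:Moonbootica] "
             . "in @[106078429431815:274:London, United Kingdom]! #music #house";
$tags = parseTags($description);
print_r($tags);
```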

The @ marks a tag of someone. A Facebook ID makes no distinction between a fan page and a single person, so you have to deal with it in PHP yourself. The @ should be the only character that marks a tagged person/page.

The markup is used to reference a fanpage.
Example:
"description": "Event organised by @[303925999750490:274:World Next Top Model MALTA]\nPhotography by @[445645795469650:274:Pixbymax Photography]"
The 303925999750490 is the fanpage ID. The World Next Top Model MALTA is the name of fanpage. (Don't know what the 274 means)
When you render this on your page, you can render like this:
Event organised by World Next Top Model MALTA
Photography by Pixbymax Photography

Scraping HTML using php

I am trying to scrape the following text from the given text
Scrape:
Promise Me This (Between Breaths, #4)
The image with the src as http://d.gr-assets.com/books/1402555544l/22077246.jpg
A new love will test the boundaries of passion between a privileged boy next door and the tattooed, blue-haired girl who helps him embrace his wild side...\n\n\nNate has developed quite a playboy reputation around campus. It\'s not that he doesn\'t respect or trust women; he doesn\'t trust himself. The men in Nate’s family are prone to abusive behavior—a dirty secret that Nate’s been running from his entire life—so Nate doesn\'t do relationships. But he can’t help himself around one girl…\n\nJessie is strong, independent, and works at a tattoo parlor. Nate can’t resist getting close to her, even if it’s strictly a friendship. But it doesn\'t take long for Nate to admit that what he wants with Jessie is more than just friendly.\n\nWith Jessie, he can be himself and explore what he’s always felt was a terrifying darkness inside him. Even when Nate begins to crave her in a way that both shocks and horrifies him, Jessie still wants to know every part of him. Testing their boundaries together will take a trust that could render them inseparable… or tear them apart
HTML:
<div class="leftAlignedImage bookBox">
<div class="coverWrapper" id="bookCover646987_22077246">
<img alt="Promise Me This (Between Breaths, #4)" class="bookImage" src="https://i.stack.imgur.com/NXMoh.jpg" title="" width="115" />
</div>
<script type="text/javascript">
//<![CDATA[
var newTip = new Tip($('bookCover646987_22077246'), "\n\n <h2><a href=\"http://www.goodreads.com/book/show/22077246-promise-me-this?from_choice=false&from_home_module=false\" class=\"readable\">Promise Me This (Between Breaths, #4)<\/a><\/h2>\n\n <div>\n by <a href=\"/author/show/7060187.Christina_Lee\" class=\"authorName\">Christina Lee<\/a><span title=\"Goodreads Author!\">*<\/span>\n <\/div>\n <div class=\"smallText uitext darkGreyText\">\n <span class=\"minirating\"><span class=\"stars staticStars\"><a class=\"staticStar p10\" size=\"12x12\" title=\"4.13 of 5 stars\">4.13 of 5 stars<\/a><a class=\"staticStar p10\" size=\"12x12\" title=\"4.13 of 5 stars\"><\/a><a class=\"staticStar p10\" size=\"12x12\" title=\"4.13 of 5 stars\"><\/a><a class=\"staticStar p10\" size=\"12x12\" title=\"4.13 of 5 stars\"><\/a><a class=\"staticStar p3\" size=\"12x12\" title=\"4.13 of 5 stars\"><\/a><\/span> 4.13 avg rating — 388 ratings<\/span>\n — published 2014\n <\/div>\n\n <div class=\"addBookTipDescription\">\n \n<span id=\"freeTextContainer3494377565927542800\" class=\"elementOne\">\n A new love will test the boundaries of passion between a privileged boy next door and the tattooed, blue-haired girl who helps him embrace his wild side...\n\n\nNate has developed quite a playboy reputation around campus. It\'s not that he doesn\'t respect or trust women; he doesn\'t trust himself. The men<\/span>\n <span id=\"freeText3494377565927542800\" class=\"elementTwo\" style=\"display:none\">\n A new love will test the boundaries of passion between a privileged boy next door and the tattooed, blue-haired girl who helps him embrace his wild side...\n\n\nNate has developed quite a playboy reputation around campus. It\'s not that he doesn\'t respect or trust women; he doesn\'t trust himself. The men in Nate’s family are prone to abusive behavior—a dirty secret that Nate’s been running from his entire life—so Nate doesn\'t do relationships. 
But he can’t help himself around one girl…\n\nJessie is strong, independent, and works at a tattoo parlor. Nate can’t resist getting close to her, even if it’s strictly a friendship. But it doesn\'t take long for Nate to admit that what he wants with Jessie is more than just friendly.\n\nWith Jessie, he can be himself and explore what he’s always felt was a terrifying darkness inside him. Even when Nate begins to crave her in a way that both shocks and horrifies him, Jessie still wants to know every part of him. Testing their boundaries together will take a trust that could render them inseparable… or tear them apart.<\/span>\n <a data-text-id=\"3494377565927542800\" href=\"#\" onclick=\"swapContent($(this));; return false;\">...more<\/a>\n <\/div>\n\n\n\n", { style: 'addbook', stem: 'leftMiddle', hook: { tip: 'leftMiddle', target: 'rightMiddle' }, hideOn: false, width: 400, hideAfter: 0.05, delay: 0.35 });
$('bookCover646987_22077246').observe('prototip:shown', function() {
if (this.up('#box')) {
$$('div.prototip')[0].setStyle({zIndex: $('box').getStyle('z-index')});
} else {
$$('div.prototip')[0].setStyle({zIndex: 6000});
}
});
newTip['wrapper'].addClassName('prototipAllowOverflow');
$('bookCover646987_22077246').observe('prototip:shown', function () {
$$('div.prototip').each(function (e) {
if ($('bookCover646987_22077246').hasClassName('ignored')) {
e.setStyle({'display': 'none'});
return;
}
e.setStyle({'overflow': 'visible'});
});
});
$('bookCover646987_22077246').observe('prototip:hidden', function () {
$$('span.elementTwo').each(function (e) {
if (e.getStyle('display') !== 'none') {
var lessLink = e.next();
swapContent(lessLink);
}
});
});
//]]>
</script>
</div>
I am new to PHP and XAMPP and have already searched the internet for help, but to no avail.
I have started Apache from the XAMPP control panel and made a save.php page in which I wrote the following:
<?php
$html = file_get_contents('http://www.goodreads.com/genres/new_releases/art');
libxml_use_internal_errors( true);
$doc = new DOMDocument;
$doc->loadHTML( $html);
$xpath = new DOMXpath( $doc);
$node = $xpath->query( '//div[@name="coverWrapper"]')->item( 0);
echo $node->textContent;
?>
This gives me an error on line 11:
Error: Trying to get property of non-object in C:\xampp\htdocs\xampp\ind\save.php on line 11

For something as simple as this, I'd save myself the headache and skip XPath. You're already reading the HTML into a text string, so it'll probably be easier to process $html as a string. For example:
You know the title of the book on the page you're looking at is between class=\"readable\" (which only shows up the one time in the document) and <\/a>.
For the image, there is only a single img tag, so the src attribute that follows it should always belong to that tag, and code similar to the following will slice it out pretty quickly.
$imgStart = stripos($html, '<img'); // position of the start of the img tag
$srcStart = stripos($html, 'src="', $imgStart); // position of its src attribute
$srcStart += 5; // Offset for the chars src="
$srcEnd = strpos($html, '"', $srcStart); // position of the closing quote
$imgSrc = substr($html, $srcStart, $srcEnd - $srcStart);
Super robust? No... But you're screen-scraping, and there's no real robust way to do that, since you're always depending so much on the precise structure or syntax of someone else's code.
Also be sure that the terms of use of the site you're using allow scraping. Lots of sites really frown on that.
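If you do want to stay with DOMXPath, note that in the markup from the question coverWrapper is a class, not a name attribute, which would explain why the query matched nothing and item(0) returned null. A minimal sketch, using a trimmed copy of the question's HTML as an inline string instead of fetching the live page:

```php
<?php
// Trimmed copy of the markup from the question.
$html = '<div class="leftAlignedImage bookBox">'
      . '<div class="coverWrapper" id="bookCover646987_22077246">'
      . '<img alt="Promise Me This (Between Breaths, #4)" class="bookImage" '
      . 'src="https://i.stack.imgur.com/NXMoh.jpg" title="" width="115" />'
      . '</div></div>';

libxml_use_internal_errors(true);   // suppress warnings from imperfect markup
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Select on @class, since the div has no name attribute.
$img = $xpath->query('//div[@class="coverWrapper"]/img')->item(0);

$title = null;
$src = null;
if ($img instanceof DOMElement) {
    $title = $img->getAttribute('alt');   // book title
    $src   = $img->getAttribute('src');   // cover image URL
    echo $title, "\n", $src, "\n";
} else {
    // item(0) returns null when nothing matched; check before using it.
    echo "No cover image found\n";
}
```

The description text lives inside a JavaScript string in the script block, so it is not reachable as DOM nodes this way and would still need string handling.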

How to get the entire YouTube Video description, php, gdata

I have php code that correctly retrieves, using the YouTube api, the title, video url, viewcount, video date, last comment date, and the first 160 characters of the description. I can't seem to figure out how to get the entire description. I know it is there in the xml retrieved, because I have dumped that. So how come I am only getting 160 chars?
The entire description is truncated at 157 chars, and "..." is added, so that by the time I echo it or var_dump it, it is 160 chars. Here is my complete test code (without title, video url, etc etc).
<?php
$feedURL = 'http://gdata.youtube.com/feeds/api/videos?q=phone&v=2&fields=entry[yt:statistics/@viewCount > 10000]&start-index=1&max-results=1';
$sxml = simplexml_load_file($feedURL);
foreach ($sxml->entry as $entry) {
$media = $entry->children('http://search.yahoo.com/mrss/');
echo $media->group->description;
}
?>
This is what displays on the page:
FREE TuTiTu's Games: http://www.tutitu.tv/index.php/games FREE TuTiTu's Coloring pages at: http://www.tutitu.tv/index.php/coloring Join us on Facebook: https...
When I get the xml this way:
gdata.youtube.com/feeds/api/videos/JI-5kh_4gO0?v=2&alt=json-in-script&callback=youtubeFeedCallback&prettyprint=true
The entire description looks like this:
"media$description": {
"$t": "FREE TuTiTu's Games: http://www.tutitu.tv/index.php/games\nFREE TuTiTu's Coloring pages at: http://www.tutitu.tv/index.php/coloring\nJoin us on Facebook: https://www.facebook.com/TuTiTuTV\nTuTiTu's T-Shirts: http://www.zazzle.com/TuTiTu?rf=238778092083495163\n\nTuTiTu - The toys come to life\n\nTuTiTu - \"The toys come to life\" is a 3D animated television show targeting 2-3 year olds. Through colorful shapes TuTiTu will stimulate the children's imagination and creativity. On each episode TuTiTu's shapes will transform into a new and exciting toy.",
"type": "plain"
},
I'm sure I am missing something basic, but when I've looked for a solution, I have not found it.
Thanks for any help.

These 2 different types of API requests will return a different description size.
I assume it's a way to limit the total response size.
1) doing a search as in: http://gdata.youtube.com/feeds/api/videos?q=phone&v=2&fields=entry&alt=json&prettyprint=true will return the short video description.
2) doing a video request as in: http://gdata.youtube.com/feeds/api/videos/JI-5kh_4gO0?v=2&alt=json&prettyprint=true will return the long video description.
BTW: API version 3 will allow you to request a list of video IDs in one request (to get their long descriptions).
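Reading the description from a single-video response works the same way as in the question's loop. Here is a small sketch with a made-up, heavily trimmed entry held in a string, so it runs without calling the API; the real response contains many more elements.

```php
<?php
// Hypothetical, heavily trimmed stand-in for a single-video feed entry.
$xml = <<<XML
<entry xmlns="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
  <title>Sample video</title>
  <media:group>
    <media:description type="plain">First line of the description
Second line of the description</media:description>
  </media:group>
</entry>
XML;

$entry = simplexml_load_string($xml);

// Switch to the media: namespace, exactly as in the question's code.
$media = $entry->children('http://search.yahoo.com/mrss/');
$description = (string) $media->group->description;

// Escape for HTML and keep the line breaks when displaying.
echo nl2br(htmlspecialchars($description));
```

For a real request, simplexml_load_file() on the per-video URL would take the place of simplexml_load_string($xml).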

$media->group->{'media$description'} should do the trick

PHP Markdown tagging last chunk of content as h3

I'm using PHP Markdown (version 1.0.1n, updated October 2009) to display text saved to a database in markdown format. I'm running into a strange issue where it's tagging the last chunk of every entry as an H3. When I search the markdown.php file, though, there isn't a single instance of H3.
Here are two pieces of text from my database:
Since its launch, major CPG brands, endemic as well as non-endemic, have flocked to retail websites to reach consumers deep in the purchase funnel through shopping media. In this session, you will hear about:
- The prioritization of shopping media for CPG brands.
- A case study of brands on Target.com on how this retailer (and others) have introduced a new channel for brand marketers to engage consumers where they are making the majority of purchase decisions: online.
- How CPG brands are leveraging real-time data from shopping media to capture consumer insights and market trends.
In this one, it is tagging the LI items correctly, but inside the final LI it's tagging the actual text as H3.
Beyond the actual money she saves, this consumer is both empowered and psychologically gratified by getting the best value on her everyday purchases. It is essential for both marketers and retailers to focus on what motivates and activates this consumer.
Diane Oshin will share insights on what influences her shopping behavior and then identify specific tools that activate her to buy.
In this one, the entire paragraph starting with Diane Oshin is tagged as an H3.
Here's the really odd thing: when I do a view source, both of them are tagged correctly; it's only when using Inspect Element that I see the H3. However, it's obvious in the actual display that the H3 tag is being applied:
example 1
example 2
Can anyone help me out?
update
Per a comment below, I looked for instances of H tags. I found these functions, but don't know if this is what could be causing the issue or not. They are the only place in the entire file that appears to be creating a header tag of any kind.
function doHeaders($text) {
# Setext-style headers:
# Header 1
# ========
#
# Header 2
# --------
#
$text = preg_replace_callback('{ ^(.+?)[ ]*\n(=+|-+)[ ]*\n+ }mx',
array(&$this, '_doHeaders_callback_setext'), $text);
# atx-style headers:
# # Header 1
# ## Header 2
# ## Header 2 with closing hashes ##
# ...
# ###### Header 6
#
$text = preg_replace_callback('{
^(\#{1,6}) # $1 = string of #\'s
[ ]*
(.+?) # $2 = Header text
[ ]*
\#* # optional closing #\'s (not counted)
\n+
}xm',
array(&$this, '_doHeaders_callback_atx'), $text);
return $text;
}
function _doHeaders_callback_setext($matches) {
# Terrible hack to check we haven't found an empty list item.
if ($matches[2] == '-' && preg_match('{^-(?: |$)}', $matches[1]))
return $matches[0];
$level = $matches[2]{0} == '=' ? 1 : 2;
$block = "<h$level>".$this->runSpanGamut($matches[1])."</h$level>";
return "\n" . $this->hashBlock($block) . "\n\n";
}
function _doHeaders_callback_atx($matches) {
$level = strlen($matches[1]);
$block = "<h$level>".$this->runSpanGamut($matches[2])."</h$level>";
return "\n" . $this->hashBlock($block) . "\n\n";
}

I could not reproduce what you describe with the version you've been given:
<?php
include(__DIR__.'/php-markdown/markdown.php');
$testText = 'Since its launch, major CPG brands, endemic as well as non-endemic, have flocked to retail websites to reach consumers deep in the purchase funnel through shopping media. In this session, you will hear about:
- The prioritization of shopping media for CPG brands.
- A case study of brands on Target.com on how this retailer (and others) have introduced a new channel for brand marketers to engage consumers where they are making the majority of purchase decisions: online.
- How CPG brands are leveraging real-time data from shopping media to capture consumer insights and market trends.
';
$resultText = Markdown($testText);
var_dump($resultText);
The output looks pretty much as you would expect:
string(649) "<p>Since its launch, major CPG brands, endemic as well as non-endemic, have flocked to retail websites to reach consumers deep in the purchase funnel through shopping media. In this session, you will hear about:</p>
<ul>
<li><p>The prioritization of shopping media for CPG brands.</p></li>
<li><p>A case study of brands on Target.com on how this retailer (and others) have introduced a new channel for brand marketers to engage consumers where they are making the majority of purchase decisions: online.</p></li>
<li><p>How CPG brands are leveraging real-time data from shopping media to capture consumer insights and market trends.</p></li>
</ul>
"
I assume something else is tampering with the data before it gets into the markdown parser, or afterwards. Based on this data, though, the markdown parser does not create the <h3> tags. You must look somewhere else :(
