I have an HTML table, generated by another website, that I'm trying to convert to a PHP array.
I cannot convert it using SimpleXML because the generated table markup is not valid and causes a lot of errors. I also need to keep some attributes of the table's td elements and remove the others.
What would be the most efficient way of doing this? Or do you know of any PHP class that could help me achieve this?
BTW: What I'm trying to do is convert a school schedule to a PHP array that I can work with afterwards.
Here is an example of the data I retrieve: http://paste2.org/p/1869193
Btw, using PHP's strip_tags(), I already remove the unnecessary tags such as span and font.
You can also use PHP's Tidy if installed (it is by default on some installs) - it not only cleans up the HTML, but also lets you traverse the DOM:
http://www.php.net/manual/en/book.tidy.php
You can find a list of HTML parsers in the answers to the following question on SO:
Robust and Mature HTML Parser for PHP
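As a rough sketch of the idea, here is how the built-in DOM extension (rather than Tidy) could turn an invalid table into an array while keeping only selected td attributes. The function name tableToArray, the sample markup, and the kept attribute names are made up for illustration; they are not taken from the linked paste.

```php
<?php
// Sketch: convert a (possibly invalid) HTML table into a PHP array.
// tableToArray and the default attribute list are hypothetical choices.

function tableToArray(string $html, array $keepAttributes = ['rowspan', 'colspan']): array
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($dom->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->childNodes as $node) {
            // Skip whitespace text nodes and anything that is not a cell.
            if (!($node instanceof DOMElement) || !in_array($node->nodeName, ['td', 'th'], true)) {
                continue;
            }
            // textContent drops nested tags such as <font> and keeps their text.
            $cell = ['text' => trim($node->textContent)];
            foreach ($keepAttributes as $name) {
                if ($node->hasAttribute($name)) {
                    $cell[$name] = $node->getAttribute($name);
                }
            }
            $cells[] = $cell;
        }
        $rows[] = $cells;
    }
    return $rows;
}

// Deliberately sloppy markup: unclosed <td>, <tr> and a presentational attribute.
$html = '<table><tr><td rowspan="2" bgcolor="red"><font>Math</font><td>Room 12</tr>'
      . '<tr><td>Room 14</table>';

$schedule = tableToArray($html);
print_r($schedule);
```

Note that getElementsByTagName('tr') also returns rows of nested tables, so a real schedule page may need a narrower selection.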
I can get all image elements in the DOM like this:
$element->getElementsByTagName('img')
How can I get all image and iframe elements in the DOM in one request? Something like:
$element->getElementsByTagName('img, iframe')
getElementsByTagName cannot match several tag names in one call; it accepts one tag name at a time. There are two ways to achieve this:
Call it twice, once per tag:
$element->getElementsByTagName('img')
$element->getElementsByTagName('iframe')
OR
Use JavaScript in the browser:
document.querySelectorAll('img, iframe')
PS: I would not recommend parsing the DOM in PHP on every request, because that can slow down page generation.
Hope this helps
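If you want to stay in PHP and use a single query, one option not mentioned above is DOMXPath, whose | operator unions two node sets. This is only a sketch with made-up sample markup:

```php
<?php
// Sample markup, invented for this example.
$html = '<div><img src="a.png"><p>text</p><iframe src="frame.html"></iframe><img src="b.png"></div>';

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// One query that matches every <img> and every <iframe> in the document.
$nodes = $xpath->query('//img | //iframe');

$found = [];
foreach ($nodes as $node) {
    $found[] = $node->nodeName . ':' . $node->getAttribute('src');
}
print_r($found);
```

To search only below a given $element, pass it as the context node and make the paths relative: $xpath->query('.//img | .//iframe', $element).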
There is a similar question to this on StackOverflow, but my question is a little different.
I have selected the image element I want by its class. Earlier, I used
$element->src
to get the value of the src attribute, but now the site has replaced it with 'data-src'.
I do not have the full contents of the tag, hence I cannot use preg_replace. I have the required element; I just want to be able to do something like
$element->data-src
I am trying to do this using PHP Simple HTML DOM Parser, but no luck yet.
Try using
$element->{'data-src'}
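The reason the braces are needed is that PHP reads $element->data-src as ($element->data) minus the constant src. The sketch below shows the syntax on a hypothetical stand-in class (FakeElement is made up here); as far as I know, PHP Simple HTML DOM Parser exposes attributes through a similar magic __get, but treat that as an assumption.

```php
<?php
// FakeElement is a hypothetical stand-in for a parser element that
// exposes HTML attributes as magic properties.
final class FakeElement
{
    private array $attributes;

    public function __construct(array $attributes)
    {
        $this->attributes = $attributes;
    }

    public function __get(string $name)
    {
        // Unknown attributes simply yield null in this sketch.
        return $this->attributes[$name] ?? null;
    }
}

$element = new FakeElement(['data-src' => 'photo.jpg']);

// Braces let a property name contain characters such as "-".
$src = $element->{'data-src'};
echo $src, PHP_EOL;
```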
I'm creating my own blog in PHP and want to know your opinions on how I should format my post content.
Currently I store the post content as plain text, fetch it when necessary, then wrap each line in P tags. I did this in case I wanted to change the way I format my text in the future; it would save me the dilemma of having to remove all P tags from the posts in the DB.
The problem I have with this method is that if I want to add extra formatting, e.g. lists, those would also be wrapped in P tags, which is not correct.
How would you do this? Would you store the text as plain text in the DB, or would you add the HTML formatting and store that in the DB too?
I'd prefer not to store unnecessary HTML in the DB, but I'm not sure of a way around it.
I think the best way would be to keep the HTML in the DB. Without HTML you would have too much work parsing the text.
See how it's done in other blog tools. I know that Joomla, for example, keeps all HTML in the DB. I know Joomla isn't a blog tool :) but still...
WordPress stores HTML in the DB. You say you are concerned about storing 'unnecessary' HTML in the DB. What makes it unnecessary? I think it is the opposite. You may have headings or bold or italic text in your post. If you store it as plain text, how do you save this formatting? How are you saving the lists you mentioned?
I see it as a better practice to store raw user input in the database, and format it on output, caching the result if it is needed. That way you can change the way you are parsing things easily without having to regex-replace anything inside the database. You can also store the raw input in one column, and the formatted HTML in another one.
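A minimal sketch of formatting on output, assuming a made-up convention where blocks are separated by blank lines and a block whose lines all start with "- " is a list. The function name formatPost is hypothetical; a real blog would more likely use a Markdown or Textile library.

```php
<?php
// Sketch: the DB keeps the raw text, HTML is produced only when rendering.
// The "- " list convention is an assumption invented for this example.

function formatPost(string $raw): string
{
    // Normalise line endings, then split into blocks on blank lines.
    $raw = str_replace(["\r\n", "\r"], "\n", trim($raw));
    $blocks = preg_split('/\n{2,}/', $raw, -1, PREG_SPLIT_NO_EMPTY);

    $html = [];
    foreach ($blocks as $block) {
        $lines = explode("\n", $block);

        // A block is a list when no line lacks the "- " prefix.
        $nonListLines = array_filter($lines, fn($line) => strpos($line, '- ') !== 0);

        if (count($nonListLines) === 0) {
            $items = array_map(
                fn($line) => '<li>' . htmlspecialchars(substr($line, 2)) . '</li>',
                $lines
            );
            $html[] = '<ul>' . implode('', $items) . '</ul>';
        } else {
            // Ordinary paragraph: escape, keep single line breaks as <br>.
            $html[] = '<p>' . nl2br(htmlspecialchars($block)) . '</p>';
        }
    }
    return implode("\n", $html);
}

$raw = "Hello world\n\n- first\n- second";
echo formatPost($raw), PHP_EOL;
```

Because the raw text stays untouched, changing these rules later only means changing formatPost (and clearing any cached HTML).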
I assume that you are formatting your raw text with the Markdown or the Textile syntax?
If you store HTML in your DB, you are only a few steps away from your current situation:
you can use strip_tags() to remove the HTML formatting, and for bigger changes you can run HTML Tidy on your code to remap tags and classes.
I need to get the data out of all of the table cells in the 4th row of the 4th table on an HTML page. After researching for a while, it seems that using DOMXPath is the best way to parse the HTML file. However, no IDs or classes are used anywhere in the file. What would be the best way to get the data out of these cells?
Thanks in advance.
You can specify an index when fetching with XPath. In your case
/html/body/table[4]/tbody/tr[4]/td
Note that an XPath index is not zero-based, but one-based.
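Here is a sketch of this with DOMDocument and DOMXPath, using generated sample tables. Two things in it are my own choices: it selects the table with (//table)[4] so that nesting under body does not matter, and it leaves tbody out of the path, because DOMDocument does not insert a tbody that is missing from the source the way browsers do.

```php
<?php
// Build a sample page with 4 tables of 4 rows each (made-up data).
$html = '<html><body>';
for ($t = 1; $t <= 4; $t++) {
    $html .= '<table>';
    for ($r = 1; $r <= 4; $r++) {
        $html .= "<tr><td>t{$t}r{$r}c1</td><td>t{$t}r{$r}c2</td></tr>";
    }
    $html .= '</table>';
}
$html .= '</body></html>';

$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);

// 4th table in document order, then the row that is the 4th <tr> of its parent.
$cells = $xpath->query('(//table)[4]//tr[4]/td');

$values = [];
foreach ($cells as $cell) {
    $values[] = trim($cell->textContent);
}
print_r($values);
```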
If you are familiar with jQuery syntax, have you looked into phpQuery?
Which method is the most efficient for translating bunches of text/web pages that include HTML? I want to translate the text, but keep the HTML.
Also, should I keep the words in a database or an array?
When you say "translating", do you mean from one language to another? If so, you can use regular expressions to capture the data between open and closing tags of your HTML without losing the markup. I'm not sure however why you would want to store your data in a database, unless you were going to retrieve it at a later point?
If this is for translation on the fly, it will always be faster to keep your data in memory: either in an array, or by updating the HTML directly while you loop through the data, which eliminates the need for an array altogether.
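As an alternative to regular expressions, here is a sketch that walks the DOM text nodes and replaces words from an in-memory array, so tags and attributes are never touched. The function name translateHtml and the tiny dictionary are invented for the example, and the sample is ASCII only; non-ASCII input would need explicit encoding handling with loadHTML.

```php
<?php
// Sketch: translate only the text nodes of an HTML fragment.
// translateHtml and the dictionary contents are hypothetical.

function translateHtml(string $html, array $dictionary): string
{
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // Wrap the fragment so its nodes can be found again after parsing.
    $dom->loadHTML('<div id="root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $root = $xpath->query('//div[@id="root"]')->item(0);

    // Replace words inside text nodes only; markup stays as parsed.
    foreach ($xpath->query('.//text()', $root) as $textNode) {
        $textNode->nodeValue = strtr($textNode->nodeValue, $dictionary);
    }

    // Serialise the children of the wrapper, not the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$dictionary = ['Hello' => 'Bonjour', 'world' => 'monde'];
$translated = translateHtml('<p class="Hello">Hello <b>world</b></p>', $dictionary);
echo $translated, PHP_EOL;
```

The class attribute keeps its original value "Hello" because only text nodes are rewritten.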