def resizeiframe(elem, width=280):
    '''
    Resize an iframe to have a mobile-friendly width

    If elem contains (or is) an iframe element, set its width to a
    mobile-friendly value.  The height attribute, if present, is
    scaled appropriately to preserve the original aspect ratio.

    This was originally created for resizing iframe-based embedded
    YouTube videos.

    TODO: see the TODOs on resizeobject, most of which also apply to
    this filter
    '''
    iframe_elem = findonetag(elem, 'iframe')
    if iframe_elem is not None:
        setwidth(iframe_elem, width)

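# The docstring above says the height is scaled to preserve the aspect
# ratio, but setwidth itself is not shown in this section.  The sketch
# below is one plausible way to do that scaling, not the actual helper.
# The name scale_to_width is hypothetical, and the standard library's
# xml.etree.ElementTree stands in for lxml so the example runs without
# third-party packages.

```python
import xml.etree.ElementTree as ET


def scale_to_width(elem, width):
    """Set elem's width; scale its height (if present) by the same factor.

    Hypothetical illustration only; not the module's setwidth helper.
    """
    old_width = elem.get('width')
    old_height = elem.get('height')
    if old_width is not None and old_height is not None:
        try:
            ratio = float(width) / float(old_width)
            elem.set('height', str(int(round(float(old_height) * ratio))))
        except (ValueError, ZeroDivisionError):
            # Sizes such as "100%" or a zero width cannot be scaled;
            # leave the height attribute untouched in that case.
            pass
    elem.set('width', str(width))


# A 560x315 embed scaled down to a 280-pixel-wide column.
iframe = ET.fromstring(
    '<iframe width="560" height="315" src="https://example.com/embed"/>')
scale_to_width(iframe, 280)
```

# Non-numeric sizes are deliberately left alone rather than guessed at.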
def noimgsize(elem):
    '''
    Strip the height and width attributes from the first child img tag

    This filter searches for the first img in the element, and removes
    any sizing attributes.  This is useful if you have a large source
    image, and want to use a "width: 100%" trick in CSS to make it
    span any device.

    @param elem : Element representing an html tag
    @type  elem : lxml.html.HtmlElement
    '''
    img_elem = findonetag(elem, 'img')
    if img_elem is not None:
        for a in ('height', 'width'):
            if a in img_elem.attrib:
                del img_elem.attrib[a]

def resizeobject(elem, width=280):
    '''
    Resize something embedded in an object tag to have a mobile-friendly width

    If elem contains (or is) an OBJECT element, set its width to a
    mobile-friendly value.  The height attribute, if present, is
    scaled appropriately to preserve the original aspect ratio.  This
    is done for both the "object" tag and any "embed" tag that may be
    present inside it.

    TODO: This operates on only the first object; if there are several
          object elements within elem, those beyond the first are
          ignored.  The best fix is probably to find and operate on
          all of them.
    '''
    object_elem = findonetag(elem, 'object')
    if object_elem is not None:
        setwidth(object_elem, width)
        embed_elem = object_elem.find('.//embed')
        if embed_elem is not None:
            setwidth(embed_elem, width)

def table2divgroupsgs(elem, specmapgen, omit_whitespace=True):
    '''
    Apply the table2divgroups filter with a dynamically generated spec map

    This filter is much like table2divgroups.  However, instead of
    taking an explicit spec map argument, table2divgroupsgs takes a
    callable that generates the spec map.  This callable, specmapgen,
    accepts a table element as its single argument, and returns a
    spec map.

    @param elem            : Element to operate on
    @type  elem            : lxml.html.HtmlElement

    @param specmapgen      : Callable that generates a spec map
    @type  specmapgen      : function: HtmlElement -> type(specmap)

    @param omit_whitespace : Whether to omit cells just containing content that would render as whitespace in the browser
    @type  omit_whitespace : bool
    '''
    table_elem = findonetag(elem, 'table')
    specmap = specmapgen(table_elem)
    return _table2divgroups(elem, table_elem, specmap, omit_whitespace)

def table2divgroups(elem, specmap, omit_whitespace=True):
    '''
    Extract blocks arranged in a table grid as more semantic elements

    Table-based layouts sometimes lead to a grid of elements
    semantically spanning some set of rows and columns.  This filter
    helps extract them into a clearer semantic organization.

    Let's try to make this concrete.  Consider this html:

    <table>
      <tbody>
        <tr>
          <td>CONTACT US</td>
          <td> </td>
          <td> </td>
          <td> </td>
        </tr>
        <tr>
          <td>123 Main Str</td>
          <td> </td>
          <td>OUR TEAM</td>
          <td> </td>
        </tr>
        <tr>
          <td>Springfield, IL</td>
          <td> </td>
          <td>Mike Smith</td>
          <td><img src="/mike-smith.jpg"/></td>
        </tr>
        <tr>
          <td>1-800-BUY-DUFF</td>
          <td> </td>
          <td>Jen Jones</td>
          <td><img src="/jen-jones.jpg"/></td>
        </tr>
        <tr>
          <td> </td>
          <td> </td>
          <td>Scruffy</td>
          <td><img src="/scruffy-the-dog.jpg"/></td>
        </tr>
      </tbody>
    </table>

    Schematically, this would render as something like this (with ___
    indicating a content-free TD cell):

      CONTACT US         ___    ___           ___
      123 Main Str       ___    OUR TEAM      ___
      Springfield, IL    ___    Mike Smith    <img src="/mike-smith.jpg"/>
      1-800-BUY-DUFF     ___    Jen Jones     <img src="/jen-jones.jpg"/>
      ___                ___    Scruffy       <img src="/scruffy-the-dog.jpg"/>

    There are two clear semantic elements here.  From a mobile design
    perspective, it would be great to parse them more like this:

    <div class="mwu-elem-table2divgroups-group" id="mwu-elem-contact">
      <div>CONTACT US</div>
      <div>123 Main Str</div>
      <div>Springfield, IL</div>
      <div>1-800-BUY-DUFF</div>
    </div>

    ... and:

    <div class="mwu-elem-table2divgroups-group" id="mwu-elem-ourteam">
      <div>
        <div>OUR TEAM</div>
      </div>
      <div>
        <div>Mike Smith</div>
        <div><img src="/mike-smith.jpg"/></div>
      </div>
      <div>
        <div>Jen Jones</div>
        <div><img src="/jen-jones.jpg"/></div>
      </div>
      <div>
        <div>Scruffy</div>
        <div><img src="/scruffy-the-dog.jpg"/></div>
      </div>
    </div>

    That's exactly what this filter can do.  You'll need to specify
    what the semantic groups are, and how to extract them from a table
    grid.

    The specmap argument is a list of Spec instances.  Each Spec
    object defines a rectangular block of cells, spanning 1 or more
    rows and 1 or more columns in the source table.  It also defines a
    DOM ID name (corresponding to 'mwu-elem-contact' and
    'mwu-elem-ourteam' above).

    See the Spec class documentation for more details, but briefly,
    one way to define a group of cells is with these four numbers:

      (tr_start, td_start, tr_end, td_end)

    These integers are 0-based row and column indices; as the example
    below shows, the end indices are inclusive.  So a specmap for the
    above would read:

    specmap = [
      Spec(idname('contact'), 0, 0, 3, 0),
      Spec(idname('ourteam'), 1, 2, 4, 3),
    ]

    By default, any TD cells that would render as whitespace in the
    browser are omitted.  Set omit_whitespace=False if you don't want
    these cells discarded.

    TODO: make the above paragraph true even if a TD element contains,
          say, an empty SPAN

    If the extracted cells are one-dimensional (i.e. a single column
    or row), the group will be a list of DIVs (as in the "contact us"
    example).  But if the cells extend over more than one row and
    column in the source table, they will be organized in divs by row,
    as in the "our team" example.

    @param elem            : Element to operate on
    @type  elem            : lxml.html.HtmlElement

    @param specmap         : Specification of what groups of cells to extract
    @type  specmap         : list of Spec

    @param omit_whitespace : Whether to omit cells just containing content that would render as whitespace in the browser
    @type  omit_whitespace : bool
    '''
    table_elem = findonetag(elem, 'table')
    return _table2divgroups(elem, table_elem, specmap, omit_whitespace)
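
# The (tr_start, td_start, tr_end, td_end) addressing described in the
# table2divgroups docstring can be illustrated on a plain list-of-lists
# grid.  This is a minimal sketch under stated assumptions, not the
# module's _table2divgroups: CellBlock and extract_block are
# hypothetical names standing in for Spec and the real extraction, the
# cells are plain strings rather than TD elements, and the end indices
# are treated as inclusive because the docstring's example implies it.

```python
from collections import namedtuple

# Hypothetical stand-in for the Spec class; only the four-number form is modeled.
CellBlock = namedtuple('CellBlock', 'idname tr_start td_start tr_end td_end')


def extract_block(grid, block, omit_whitespace=True):
    """Return the cells of grid covered by block.

    One-dimensional blocks (a single row or column) come back as a flat
    list; two-dimensional blocks come back as a list of rows.
    """
    def keep(cell):
        return not omit_whitespace or cell.strip() != ''

    rows = [
        [cell for cell in row[block.td_start:block.td_end + 1] if keep(cell)]
        for row in grid[block.tr_start:block.tr_end + 1]
    ]
    if block.tr_start == block.tr_end or block.td_start == block.td_end:
        # Single row or single column: flatten to a simple list of cells.
        return [cell for row in rows for cell in row]
    return rows


# The table from the docstring, with image cells reduced to file names.
grid = [
    ['CONTACT US',      ' ', ' ',          ' '],
    ['123 Main Str',    ' ', 'OUR TEAM',   ' '],
    ['Springfield, IL', ' ', 'Mike Smith', 'mike-smith.jpg'],
    ['1-800-BUY-DUFF',  ' ', 'Jen Jones',  'jen-jones.jpg'],
    [' ',               ' ', 'Scruffy',    'scruffy-the-dog.jpg'],
]

contact = extract_block(grid, CellBlock('contact', 0, 0, 3, 0))
ourteam = extract_block(grid, CellBlock('ourteam', 1, 2, 4, 3))
```

# Here contact is a flat list of four cells and ourteam is a list of
# four rows, mirroring the two DIV layouts shown in the docstring.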