Python, lxml and removing the outer tag when using lxml.html.tostring(el)

I am using the code below to get all the HTML content of a section so I can save it to a database:

el = doc.get_element_by_id('productDescription') 
lxml.html.tostring(el)

The product description is wrapped in a tag that looks like this:

<div id='productDescription'> 

    <THE HTML CODE I WANT> 

</div>

This code works great and gives me all the HTML, but how do I remove the outer layer, i.e. the opening <div id='productDescription'> and the closing </div>?

2012-02-14 Tampa

You can convert each child to a string individually:

text = el.text or ''  # el.text is None when the element has no leading text
text += ''.join(lxml.html.tostring(child, encoding='unicode') for child in el.iterchildren())

Or an even hackier way:

el.attrib.clear()
el.tag = '|||'  # placeholder tag; note that this modifies el in place
text = lxml.html.tostring(el, encoding='unicode', with_tail=False)
assert text.startswith('<'+el.tag+'>') and text.endswith('</'+el.tag+'>')
text = text[len('<'+el.tag+'>'):-len('</'+el.tag+'>')]

2012-02-14 19:24:56 jfs

If your productDescription div contains mixed text/element content, e.g.

<div id='productDescription'> 
    the 
    <b> html code </b> 
    i want 
</div>

you can get the content (as a string) using an xpath('node()') traversal:

s = ''
for node in el.xpath('node()'):
    if isinstance(node, str):  # text node (basestring on Python 2)
        s += node
    else:
        s += lxml.html.tostring(node, with_tail=False, encoding='unicode')

2012-02-15 14:07:32 mykhal

What is 'basestring'? – nHaskins

Here is a function that does what you want.

import lxml.etree

def strip_outer(xml):
    """ 
    >>> xml = '''<math xmlns="http://www.w3.org/1998/Math/MathML" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1998/Math/MathML   http://www.w3.org/Math/XMLSchema/mathml2/mathml2.xsd"> 
    ... <mrow> 
    ...  <msup> 
    ...  <mi>x</mi> 
    ...  <mn>2</mn> 
    ...  </msup> 
    ...  <mo> + </mo> 
    ...  <mi>x</mi> 
    ... </mrow> 
    ... </math>''' 
    >>> so = strip_outer(xml) 
    >>> so.splitlines()[0]=='<mrow>' 
    True 

    """ 
    # Rename the default namespace: with xmlns= the tag is namespace-qualified
    # and strip_tags would not match the plain name 'math'.
    xml = xml.replace('xmlns=', 'xmlns:x=')
    xml = '<root>\n' + xml + '\n</root>'  # strip_tags can't strip the root element, so wrap it
    rx = lxml.etree.XML(xml)
    lxml.etree.strip_tags(rx, 'math')  # strip <math> together with all its attributes
    uc = lxml.etree.tostring(rx, encoding='unicode')
    uc = '\n'.join(uc.splitlines()[1:-1])  # remove the temporary <root> again
    return uc.strip()

2013-04-20 16:22:12

Use a regexp.

import re
from lxml.html import tostring

def strip_outer_tag(html_fragment):
    outer_tag = re.compile(r'^<[^>]+>(.*?)</[^>]+>$', re.DOTALL)
    return outer_tag.search(html_fragment).group(1)

# encoding='unicode' makes tostring return str rather than bytes
html_fragment = strip_outer_tag(tostring(el, encoding='unicode'))

2017-04-02 00:52:57 bl79
