How to remove an element in lxml

Question 1

I need to completely remove elements, based on the contents of an attribute, using Python's lxml. Example:

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    pass  # remove this element from the tree

print(et.tostring(tree, pretty_print=True).decode())

I'd like this to print:

<groceries>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

Is there a way to do this without building the output by hand in a temporary variable, like this?

newxml="<groceries>\n"
for elt in tree.xpath("//fruit[@state='fresh']"):
  newxml+=et.tostring(elt, encoding="unicode")  # "unicode" returns str instead of bytes

newxml+="</groceries>"

Question 2

Use the remove() method of an xmlElement:

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
  bad.getparent().remove(bad)     # here I grab the parent of the element to call the remove directly on it

print(et.tostring(tree, pretty_print=True, xml_declaration=True).decode())

Compared with @Acorn's version, mine works even if the elements to remove are not directly under the root node of your XML.
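
To illustrate why going through the parent matters, here is a minimal sketch. The one-line document and the names nested and removed_from_root are made up for this example. In lxml, calling remove() on the root with an element that is not its direct child raises ValueError, while getparent().remove() works at any depth:

```python
import lxml.etree as et

# Hypothetical document: the rotten fruit is a grandchild of the root.
tree = et.fromstring(
    "<groceries><punnet><fruit state='rotten'>strawberry</fruit></punnet></groceries>"
)
nested = tree.xpath("//fruit[@state='rotten']")[0]

try:
    tree.remove(nested)  # fails: nested is not a direct child of <groceries>
    removed_from_root = True
except ValueError:
    removed_from_root = False

# Removing through the element's own parent works regardless of depth.
nested.getparent().remove(nested)

result = et.tostring(tree, encoding="unicode")
print(removed_from_root)
print(result)
```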

Question 3

You're looking for the remove function. Call remove() on the parent of the element you want to delete and pass it that element.

import lxml.etree as et

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="rotten">strawberry</fruit>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree=et.fromstring(xml)

for bad in tree.xpath("//fruit[@state='rotten']"):
    bad.getparent().remove(bad)

print(et.tostring(tree, pretty_print=True).decode())

Result:

<groceries>
  <fruit state="fresh">pear</fruit>
  <punnet>
    <fruit state="fresh">blueberry</fruit>
  </punnet>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>

Question 4

I ran into this situation:

<div>
    <script>
        some code
    </script>
    text here
</div>

div.remove(script) also removes the "text here" part, which I did not intend.

Following the answer here, I found that etree.strip_elements is a better solution for me: its with_tail=(bool) parameter controls whether the text after the element is removed as well.

But I still don't know whether this can be combined with an XPath filter on the tag. I'm just posting this for information.

Here is the documentation:

strip_elements(tree_or_element, *tag_names, with_tail=True)

Delete all elements with the provided tag names from a tree or subtree. This will remove the elements and their entire subtree, including all their attributes, text content and descendants. It will also remove the tail text of the element unless you explicitly set the with_tail keyword argument option to False.

Tag names can contain wildcards as in _Element.iter.

Note that this will not delete the element (or ElementTree root element) that you passed even if it matches. It will only treat its descendants. If you want to include the root element, check its tag name directly before even calling this function.

Example usage:
   strip_elements(some_element,
       'simpletagname',             # non-namespaced tag
       '{http://some/ns}tagname',   # namespaced tag
       '{http://some/other/ns}*',   # any tag from a namespace
       lxml.etree.Comment           # comments
       )
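
As a quick check of the with_tail option, here is a minimal sketch applied to the div/script case above. The markup is condensed onto one line for the example, and the names div and result are just illustrative:

```python
from lxml import etree

# The mixed-content case from above: a <script> followed by tail text.
div = etree.fromstring("<div><script>some code</script>text here</div>")

# with_tail=False removes the <script> element but keeps the text after it.
etree.strip_elements(div, "script", with_tail=False)

result = etree.tostring(div, encoding="unicode")
print(result)
```

Note that strip_elements selects by tag name only, so it cannot express a condition on an attribute value.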

Question 5

As already mentioned, you can use the remove() method to delete (sub)elements from the tree:

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
  bad.getparent().remove(bad)

But it removes the element together with its tail, which is a problem if you are processing mixed-content documents such as HTML:

<div><fruit state="rotten">avocado</fruit> Hello!</div>

becomes

<div></div>

Which I suppose is not always what you want :) I have written a helper function that removes just the element and keeps its tail:

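def remove_element(el):
    parent = el.getparent()
    if el.tail and el.tail.strip():  # tail is None when no text follows the element
        prev = el.getprevious()
        if prev is not None:  # a childless element is falsy, so compare with None
            prev.tail = (prev.tail or '') + el.tail
        else:
            parent.text = (parent.text or '') + el.tail
    parent.remove(el)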

for bad in tree.xpath("//fruit[@state=\'rotten\']"):
    remove_element(bad)

This way the tail text is kept:

<div> Hello!</div>
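
The same idea also has to handle a removed element that has a previous sibling, in which case the tail is appended to that sibling's tail instead of the parent's text. Below is a self-contained sketch of that case. The helper name remove_keep_tail and the extra <b> element are assumptions made for this example, not part of the original snippet:

```python
from lxml import etree

def remove_keep_tail(el):
    """Remove el from its parent but keep non-blank text that follows it."""
    parent = el.getparent()
    if el.tail and el.tail.strip():
        prev = el.getprevious()
        if prev is not None:
            # Attach the tail to the preceding sibling.
            prev.tail = (prev.tail or "") + el.tail
        else:
            # No preceding sibling: the tail becomes part of the parent's text.
            parent.text = (parent.text or "") + el.tail
    parent.remove(el)

div = etree.fromstring(
    '<div><b>Hi</b><fruit state="rotten">avocado</fruit> Hello!</div>'
)
for bad in div.xpath(".//fruit[@state='rotten']"):
    remove_keep_tail(bad)

result = etree.tostring(div, encoding="unicode")
print(result)
```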

Question 6

You can also use lxml's html module to solve this:

from lxml import html

xml="""
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>
"""

tree = html.fromstring(xml)

print("//BEFORE")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

for i in tree.xpath("//fruit[@state='rotten']"):
    i.drop_tree()

print("//AFTER")
print(html.tostring(tree, pretty_print=True).decode("utf-8"))

It should output this:

//BEFORE
<groceries>
  <fruit state="rotten">apple</fruit>
  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>
  <fruit state="rotten">mango</fruit>
  <fruit state="fresh">peach</fruit>
</groceries>


//AFTER
<groceries>

  <fruit state="fresh">pear</fruit>
  <fruit state="fresh">starfruit</fruit>

  <fruit state="fresh">peach</fruit>
</groceries>
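
The blank lines in the AFTER output are there because drop_tree() keeps the tail text (here, the whitespace) of each dropped element. A minimal sketch of the same behaviour on mixed content, reusing the avocado markup from the earlier answer (the names div and result are illustrative):

```python
from lxml import html

div = html.fromstring('<div><fruit state="rotten">avocado</fruit> Hello!</div>')

# drop_tree() removes the element and its children but joins its tail
# to the previous sibling or to the parent's text.
for el in div.xpath(".//fruit[@state='rotten']"):
    el.drop_tree()

result = html.tostring(div, encoding="unicode")
print(result)
```

The text after the removed element should survive, which is the same effect as a hand-written tail-preserving helper.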