How to programmatically iterate through subscripts, superscripts, and equations found in a Word document

12

I have a few Word documents, each containing a few hundred pages of scientific data that includes:

Chemical formulas (H2SO4 with all the proper subscripts and superscripts)
Scientific numbers (exponents formatted using superscripts)
Lots of mathematical equations, written using the mathematical equation editor in Word

The problem is that storing this data in Word is not efficient for us, so we want to store all this information in a database (MySQL). We want to convert the formatting to LaTeX.

Is there a way to iterate through all the subscripts, superscripts, and equations within a Word document using VBA?

microsoft-word microsoft-word-2007 vba

— garras

Have you considered extracting the XML data from the document itself? All Microsoft Office 2007+ documents (.docx) are basically zipped XML files. You can retrieve the data using an XML parser.

— James Mertz

It was too long to post as a comment, so I added it as an answer.

— James Mertz

12

Yes, there is. I would suggest using PowerShell, as it handles Word files pretty well. I think it will be the easiest way.

More on automating Word with PowerShell here: http://www.simple-talk.com/dotnet/.net-tools/com-automation-of-office-applications-via-powershell/

I dug a little deeper and found this PowerShell script:

param([string]$docpath,[string]$htmlpath = $docpath)

$srcfiles = Get-ChildItem $docPath -filter "*.doc"
$saveFormat = [Enum]::Parse([Microsoft.Office.Interop.Word.WdSaveFormat], "wdFormatFilteredHTML");
$word = new-object -comobject word.application
$word.Visible = $False

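function saveas-filteredhtml
    {
        # Open the current .doc file ($doc is set by the ForEach loop below)
        $opendoc = $word.documents.open($doc.FullName);
        # $(...) makes PowerShell expand the BaseName property inside the string, not just $doc
        $opendoc.saveas([ref]"$htmlpath\$($doc.BaseName).html", [ref]$saveFormat);
        $opendoc.close();
    }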

ForEach ($doc in $srcfiles)
    {
        Write-Host "Processing :" $doc.FullName
        saveas-filteredhtml
        $doc = $null
    }

$word.quit();

Save it as a .ps1 file and run it with:

convertdoc-tohtml.ps1 -docpath "C:\Documents" -htmlpath "C:\Output"

It will save every .doc file in the specified directory as an HTML file. So I have a .doc file containing your H2SO4 with subscripts, and after the PowerShell conversion the output is as follows:

<html>

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 14 (filtered)">
<style>
<!--
 /* Font Definitions */
 @font-face
    {font-family:Calibri;
    panose-1:2 15 5 2 2 2 4 3 2 4;}
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
    {margin-top:0in;
    margin-right:0in;
    margin-bottom:10.0pt;
    margin-left:0in;
    line-height:115%;
    font-size:11.0pt;
    font-family:"Calibri","sans-serif";}
.MsoChpDefault
    {font-family:"Calibri","sans-serif";}
.MsoPapDefault
    {margin-bottom:10.0pt;
    line-height:115%;}
@page WordSection1
    {size:8.5in 11.0in;
    margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
    {page:WordSection1;}
-->
</style>

</head>

<body lang=EN-US>

<div class=WordSection1>

<p class=MsoNormal><span lang=PL>H<sub>2</sub>SO<sub>4</sub></span></p>

</div>

</body>

</html>

As you can see, the subscripts have their own tags in the HTML, so the only thing left is to parse the file in bash or C++ to cut out everything from <body> to </body>, convert it to LaTeX, and strip the remaining HTML tags afterwards.

Code from http://blogs.technet.com/b/bshukla/archive/2011/09/27/3347395.aspx

So I developed a parser in C++ that looks for the HTML subscript and replaces it with the LaTeX subscript.

The code:

#include <iostream>
#include <fstream>
#include <string>
#include <sstream>
#include <vector>

using namespace std;

 vector < vector <string> > parse( vector < vector <string> > vec, string filename )
{
        /*
                PARSES SPECIFIED FILE. EACH WORD SEPARATED AND
                PLACED IN VECTOR FIELD.

                REQUIRED INCLUDES:
                                #include <iostream>
                                #include <fstream>
                                #include <string>
                                #include <sstream>
                                #include <vector>

            EXPECTS: TWO DIMENSIONAL VECTOR
                     STRING WITH FILENAME
            RETURNS: TWO DIMENSIONAL VECTOR
                     vec[lines][words]
        */
        string vword;
        ifstream vfile;
        string tmp;

         // OPENING FILE. c_str() PROVIDES THE C STRING THAT
        //  ifstream::open EXPECTS, SO NO MANUAL COPY IS NEEDED
        vfile.open( filename.c_str() );
        if (vfile.is_open())
        {
                while ( vfile.good() )
                {
                        getline( vfile, vword );
                        vector < string > vline;
                        vline.clear();

                        for (int i = 0; i < vword.length(); i++)
                        {
                                tmp = "";
                                 // PARSING CONTENT. OMITTING SPACES AND TABS
                                //
                                while (i < vword.length() && vword[i] != ' ' && vword[i] != '\t')
                                        tmp += vword[i++];
                                if( tmp.length() > 0 ) vline.push_back(tmp);
                        }
                        if (!vline.empty())
                                vec.push_back(vline);
                }
                vfile.close();
        }
        else cout << "Unable to open file " << filename << ".\n";
        return vec;
}

int main()
{
        vector < vector < string > > vec;
        vec = parse( vec, "parse.html" );

        bool body = false;
        for (int i = 0; i < vec.size(); i++)
        {
                for (int j = 0; j < vec[i].size(); j++)
                {
                        if ( vec[i][j] == "<body") body=true;
                        if ( vec[i][j] == "</body>" ) body=false;
                        if ( body == true )
                        {
                                for ( int k=0; k < vec[i][j].size(); k++ )
                                {
                                        if (k+4 < vec[i][j].size() )
                                        {
                                                if (    vec[i][j][k]   == '<' &&
                                                        vec[i][j][k+1] == 's' &&
                                                        vec[i][j][k+2] == 'u' &&
                                                        vec[i][j][k+3] == 'b' &&
                                                        vec[i][j][k+4] == '>' )
                                                {

                                                        string tmp = "";
                                                        while (vec[i][j][k+5] != '<')
                                                        {
                                                                tmp+=vec[i][j][k+5];
                                                                k++;
                                                        }
                                                        tmp = "_{" + tmp + "}";
                                                        k=k+5+5;
                                                        cout << tmp << endl;
                                                }
                                                else cout << vec[i][j][k];
                                        }
                                        else cout << vec[i][j][k];
                                }
                                cout << endl;
                        }
                }
        }
        return 0;
}

For the HTML file:

<html>

<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=Generator content="Microsoft Word 14 (filtered)">
<style>
<!--
 /* Font Definitions */
 @font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
 /* Style Definitions */
 p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin-top:0in;
        margin-right:0in;
        margin-bottom:10.0pt;
        margin-left:0in;
        line-height:115%;
        font-size:11.0pt;
        font-family:"Calibri","sans-serif";}
.MsoChpDefault
        {font-family:"Calibri","sans-serif";}
.MsoPapDefault
        {margin-bottom:10.0pt;
        line-height:115%;}
@page WordSection1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.WordSection1
        {page:WordSection1;}
-->
</style>

</head>

<body lang=EN-US>

<div class=WordSection1>

<p class=MsoNormal><span lang=PL>H<sub>2</sub>SO<sub>4</sub></span></p>

</div>

</body>

</html>

The output is:

<body
lang=EN-US>
<div
class=WordSection1>
<p
class=MsoNormal><span
lang=PL>H_{2}
SO_{4}
</span></p>
</div>

It is not ideal, of course, but treat it as a proof of concept.
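
For comparison, the same idea (keep the body, turn <sub> and <sup> into LaTeX scripts, drop the other tags) can be sketched in a few lines of Python. This is only a rough sketch, not the parser above: the function names are made up for illustration, it assumes the simple filtered HTML shown earlier with no nested subscripts, and it does not escape LaTeX special characters.

```python
import html
import re

def html_fragment_to_latex(fragment):
    """Turn <sub>/<sup> into LaTeX scripts and drop every other tag."""
    flags = re.IGNORECASE | re.DOTALL
    text = re.sub(r"<sub>(.*?)</sub>", r"_{\1}", fragment, flags=flags)
    text = re.sub(r"<sup>(.*?)</sup>", r"^{\1}", text, flags=flags)
    # Remove whatever tags are left (p, span, div, ...).
    text = re.sub(r"<[^>]+>", "", text)
    # Decode entities such as &amp; and trim surrounding whitespace.
    return html.unescape(text).strip()

def body_to_latex(page):
    """Convert only the part between <body ...> and </body>."""
    match = re.search(r"<body[^>]*>(.*?)</body>", page,
                      flags=re.IGNORECASE | re.DOTALL)
    return html_fragment_to_latex(match.group(1)) if match else ""
```

Calling body_to_latex on the text of the HTML file shown above should give H_{2}SO_{4} on a single line, without the leftover tags and line breaks of the proof of concept.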

— mnmnc

3

You can extract the XML directly from any Office document that is 2007+. This is done as follows:

Rename the file from .docx to .zip
Extract the file using 7zip (or some other extraction program)
For the actual contents of the document, look in the extracted folder under the word subfolder for the document.xml file. That should contain all of the document's content.
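
The rename-and-extract steps can also be done in code, since a .docx file can be opened as a zip archive without renaming it. A minimal Python sketch, assuming the main body lives at word/document.xml as described above; the function name read_document_xml is a placeholder chosen for this example:

```python
import zipfile

def read_document_xml(docx_path):
    """Return the raw bytes of word/document.xml from a .docx file.

    docx_path may be a file name or an open binary file object.
    """
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")
```

The returned bytes can then be handed to any XML parser.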


I created a sample document, and in the body tags I found this (note that I threw it together quickly, so the formatting might be a bit off):

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:body>
    <w:p w:rsidRDefault="000E0C3A" w:rsidR="008B5DAA">
        <w:r>
            <w:t xml:space="preserve">This </w:t>
        </w:r>
        <w:r w:rsidRPr="000E0C3A">
            <w:rPr>
                <w:vertAlign w:val="superscript"/>
            </w:rPr>
            <w:t>is</w:t>
        </w:r>
        <w:r>
            <w:t xml:space="preserve"> a </w:t>
        </w:r>
        <w:r w:rsidRPr="000E0C3A">
            <w:rPr>
                <w:vertAlign w:val="subscript"/>
            </w:rPr>
            <w:t>test</w:t>
        </w:r>
        <w:r>
            <w:t>.</w:t>
        </w:r>
    </w:p>
</w:body>

It looks like the <w:t> tag is for text, <w:rPr> is the definition of the font, and <w:p> is a new paragraph.

The equivalent in Word looks like this:

[Image: the sample document as it appears in Word]
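
Building on the tags identified above, here is a hedged Python sketch that walks the runs (<w:r>) of each paragraph and wraps text marked with <w:vertAlign> in LaTeX script syntax. It assumes the full document.xml, which declares the w: namespace on its root element (the fragment above omits that declaration), and the function names are illustrative only. Equations written with Word's equation editor are stored as separate Office Math (m:oMath) elements and are not handled here.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the "w:" prefix in document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

def paragraph_to_latex(paragraph):
    """Convert one <w:p> element to text with LaTeX-style scripts."""
    parts = []
    for run in paragraph.iter(f"{{{W}}}r"):
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        if not text:
            continue
        # <w:vertAlign w:val="superscript|subscript"/> lives in the run properties.
        align = run.find("w:rPr/w:vertAlign", NS)
        kind = align.get(f"{{{W}}}val") if align is not None else None
        if kind == "superscript":
            parts.append("^{" + text + "}")
        elif kind == "subscript":
            parts.append("_{" + text + "}")
        else:
            parts.append(text)
    return "".join(parts)

def document_xml_to_latex(xml_text):
    """Return one converted string per paragraph in document.xml."""
    root = ET.fromstring(xml_text)
    return [paragraph_to_latex(p) for p in root.iter(f"{{{W}}}p")]
```

For the sample paragraph above (with the namespace declared), this should produce a single string: This ^{is} a _{test}.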

— James Mertz

2

I have been looking at a different approach from the one mnmnc is pursuing.

My attempts to save a test Word document as HTML were not successful. I have found in the past that Office-generated HTML is so full of chaff that picking out the bits you want is next to impossible. I found that to be the case here. I also had a problem with equations. Word saves equations as images. For each equation there will be two images, one with an extension of WMZ and one with an extension of GIF. If you display the HTML file with Google Chrome, the equations look okay but not wonderful; the appearance matches the GIF file when displayed with an image display/edit tool that can handle transparent images. If you display the HTML file with Internet Explorer, the equations look perfect.

Additional information

I should have included this information in the original answer.

I created a small Word document which I saved as HTML. The three panels in the image below show the original Word document, the HTML document as displayed by Microsoft Internet Explorer, and the HTML document as displayed by Google Chrome.

[Image: the original Word document, the HTML as displayed by IE, and the HTML as displayed by Chrome]

As explained above, the difference between the IE and Chrome images is the result of the equations being saved twice, once in WMZ format and once in GIF format. The HTML is too big to show here.

The HTML created by the macro is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" 
                   "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head><body>
<p>Some ordinary text.</p>
<p>H<sub>2</sub>SO<sub>4</sub>.</p>
<p>Abc &amp; def &gt; ghi &lt; jkl</p>
<p>x<sup>3</sup>+ x<sup>2</sup>+3x+4=0.</p><p></p>
<p><i>Equation</i>  </p>
<p>Mno</p>
<p><i>Equation</i></p>
</body></html>

Which displays as:

[Image: the HTML created by the macro, as displayed by IE]

I have not attempted to convert the equations, since the free MathType Software Development Kit apparently includes routines that convert to LaTeX.

The code is pretty basic, so there are not many comments. Ask if anything is unclear. Note: this is an improved version of the original code.

Sub ConvertToHtml()

  Dim FileNum As Long
  Dim NumPendingCR As Long
  Dim objChr As Object
  Dim PathCrnt As String
  Dim rng As Word.Range
  Dim WithinPara As Boolean
  Dim WithinSuper As Boolean
  Dim WithinSub As Boolean

  FileNum = FreeFile
  PathCrnt = ActiveDocument.Path
  Open PathCrnt & "\TestWord.html" For Output Access Write Lock Write As #FileNum

  Print #FileNum, "<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 Frameset//EN""" & _
                  " ""http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd"">" & _
                  vbCr & vbLf & "<html xmlns=""http://www.w3.org/1999/xhtml"" " & _
                  "xml:lang=""en"" lang=""en"">" & vbCr & vbLf & _
                  "<head><meta http-equiv=""Content-Type"" content=""text/html; " _
                  & "charset=utf-8"" />" & vbCr & vbLf & "</head><body>"

  For Each rng In ActiveDocument.StoryRanges

    NumPendingCR = 0
    WithinPara = False
    WithinSub = False
    WithinSuper = False

    Do While Not (rng Is Nothing)
      For Each objChr In rng.Characters
        If objChr.Font.Superscript Then
          If Not WithinSuper Then
            ' Start of superscript
            Print #FileNum, "<sup>";
            WithinSuper = True
          End If
        ElseIf WithinSuper Then
          ' End of superscript
          Print #FileNum, "</sup>";
          WithinSuper = False
        End If
        If objChr.Font.Subscript Then
          If Not WithinSub Then
            ' Start of subscript
            Print #FileNum, "<sub>";
            WithinSub = True
          End If
        ElseIf WithinSub Then
          ' End of subscript
          Print #FileNum, "</sub>";
          WithinSub = False
        End If
        Select Case objChr
          Case vbCr
            NumPendingCR = NumPendingCR + 1
          Case "&"
            Print #FileNum, CheckPara(NumPendingCR, WithinPara) & "&amp;";
          Case "<"
            Print #FileNum, CheckPara(NumPendingCR, WithinPara) & "&lt;";
          Case ">"
            Print #FileNum, CheckPara(NumPendingCR, WithinPara) & "&gt;";
          Case Chr(1)
            Print #FileNum, CheckPara(NumPendingCR, WithinPara) & "<i>Equation</i>";
          Case Else
            Print #FileNum, CheckPara(NumPendingCR, WithinPara) & objChr;
        End Select
      Next
      Set rng = rng.NextStoryRange
    Loop
  Next

  If WithinPara Then
    Print #FileNum, "</p>";
    WithinPara = False
  End If

  Print #FileNum, vbCr & vbLf & "</body></html>"

  Close FileNum

End Sub

Function CheckPara(ByRef NumPendingCR As Long, _
                   ByRef WithinPara As Boolean) As String

  ' Have a character to output.  Check paragraph status, return
  ' necessary commands and adjust NumPendingCR and WithinPara.

  Dim RtnValue As String

  RtnValue = ""

  If NumPendingCR = 0 Then
    If Not WithinPara Then
      CheckPara = "<p>"
      WithinPara = True
    Else
      CheckPara = ""
    End If
    Exit Function
  End If

  If WithinPara And (NumPendingCR > 0) Then
    ' Terminate paragraph
    RtnValue = "</p>"
    NumPendingCR = NumPendingCR - 1
    WithinPara = False
  End If
  Do While NumPendingCR > 1
    ' Replace each pair of CRs with an empty paragraph
    RtnValue = RtnValue & "<p></p>"
    NumPendingCR = NumPendingCR - 2
  Loop
  RtnValue = RtnValue & vbCr & vbLf & "<p>"
  WithinPara = True
  NumPendingCR = 0

  CheckPara = RtnValue

End Function

— Tony Dallimore

Good job. Will it work for multiple files, or do you have to place it inside the file you want to convert?

— mnmnc

@mnmnc. Thanks. I think your solution is impressive, although it is probably clear that I do not believe a solution that starts with Microsoft HTML will work. As a result of a Stack Overflow question, I am working on converting Excel to HTML because Microsoft's PublishObjects creates HTML that is unacceptable for most (all?) smartphones. I have little experience with Word VBA; I am best with Excel and Outlook VBA and used to be good with Access VBA. All of these allow a macro in one file to access other files, so I am sure the same is true of Word.

— Tony Dallimore

0

The simplest way to do this is with just the following lines in VBA:

Sub testing()
With ActiveDocument.Content.Find
 .ClearFormatting
 .Format = True
 .Font.Superscript = True
 .Execute Forward:=True
End With

End Sub

This finds superscripted text; each call to .Execute moves to the next match, so call it in a loop to visit every occurrence. If you want to do something with the match, just add it to the method call. For example, to find the word "super" in superscript and turn it into "super found", use:

Sub testing()

With ActiveDocument.Content.Find
 .ClearFormatting
 .Format = True
 .Font.Superscript = True
 .Execute Forward:=True, Replace:=wdReplaceAll, _
 FindText:="super", ReplaceWith:="super found"
End With

End Sub

— soandos