Simple reading and writing of docx documents with docx4j

=Start=

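Background:

A short note on how to use docx4j to read and write common Office documents (a docx file is used as the example here).

Body:

Reference answer:

Creating a docx file from scratch was covered in an earlier article, so it is not repeated here; this post only records how to read the content of a docx file. The core of it is simply traversing the w:t nodes that hold the document text, but for a newcomer like me, not knowing where to start was the biggest barrier to entry.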
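To see where those w:t nodes actually live, it can help to look at the raw XML first. The following is a minimal JDK-only sketch, not part of the original code: it assumes the main document part sits at the conventional path word/document.xml inside the docx zip container and is encoded as UTF-8. The class name DocxZipSketch is made up for illustration, and readAllBytes needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration only; docx4j does all of this (and much more) for you.
final class DocxZipSketch {

    /** Returns the raw XML of the main document part, assumed to be word/document.xml. */
    static String readMainDocumentXml(Path docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                // Assumption: the part is UTF-8 encoded, which is what Word normally writes.
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxZipSketch.java <file.docx>");
            return;
        }
        // Dumps the XML; the document text sits inside <w:t> elements.
        System.out.println(readMainDocumentXml(Paths.get(args[0])));
    }
}
```

Running it against the 20220423.docx file used below should print one long XML string in which each piece of text is wrapped in w:p, w:r and w:t elements.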

package com.example;

import org.docx4j.jaxb.Context;
import org.docx4j.jaxb.XPathBinderAssociationIsPartialException;
import org.docx4j.openpackaging.exceptions.Docx4JException;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.wml.*;

// docx4j releases built on Jakarta XML Binding use jakarta.xml.bind instead of javax.xml.bind.
import javax.xml.bind.JAXBElement;
import javax.xml.bind.JAXBException;
import java.io.File;
import java.math.BigInteger;
import java.util.List;

/**
 * @author ixyzero
 * Created on 2022-04-23
 */
public class opOfficeTextMark {
    private static boolean isDebug = false;

    public static void main(String[] args) {
        String filePath = "20220423.docx";
        // docx4jCodeExample.docxCreate is not defined in this post; presumably it is the
        // create-from-scratch helper from the earlier article.
        docx4jCodeExample.docxCreate(filePath, "add some content here.");

        WordprocessingMLPackage wordMLPackage;
        try {
            wordMLPackage = WordprocessingMLPackage.load(new File(filePath));
        } catch (Docx4JException e) {
            e.printStackTrace();
            return; // nothing to read or mark if loading failed
        }
        printWordContent(wordMLPackage);

        String markStr = "this is hidden text.";
        textMark(wordMLPackage, markStr);

        try {
            wordMLPackage.save(new File(String.format("new-%s", filePath)));
        } catch (Docx4JException e) {
            e.printStackTrace();
        }
        // Print again: the output should now also contain the hidden text.
        printWordContent(wordMLPackage);
    }

    /**
     * Use the XPath expression //w:t to get all text nodes from the main document part.
     * <w:p> is a paragraph, <w:r> is a run (character formatting), <w:t> is the text content.
     * <w:t> is nested inside <w:r>, and <w:r> is nested inside <w:p>.
     * Runs marked with <w:vanish/> (hidden text) are returned too, because the query
     * selects <w:t> directly and ignores run properties.
     */
    private static void printWordContent(WordprocessingMLPackage wordMLPackage) {
        MainDocumentPart mainDocumentPart = wordMLPackage.getMainDocumentPart();
        String textNodesXPath = "//w:t";
        List<Object> textNodes;
        try {
            textNodes = mainDocumentPart.getJAXBNodesViaXPath(textNodesXPath, true);
        } catch (JAXBException | XPathBinderAssociationIsPartialException e) {
            e.printStackTrace();
            return; // avoid iterating over a null list
        }
        for (Object obj : textNodes) {
            Text text = (Text) ((JAXBElement<?>) obj).getValue();
            System.out.println(text.getValue());
        }
    }

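    // Append a hidden-text run to the first and the last paragraph of the given docx document.
    private static void textMark(WordprocessingMLPackage docxPackage, String markStr) {
        P firstP = getFirstParagraph(docxPackage);
        P lastP = getLastParagraph(docxPackage);

        firstP.getContent().add(createHiddenRun(markStr));
        if (firstP != lastP) {
            // Use a separate run object so one JAXB node is not shared by two parents.
            lastP.getContent().add(createHiddenRun(markStr));
        }
    }

    // Build a run carrying <w:vanish/> (hidden text) with the smallest font size.
    private static R createHiddenRun(String markStr) {
        Text text = Context.getWmlObjectFactory().createText();
        text.setValue(markStr);

        RPr runProperties = new RPr();
        runProperties.setVanish(new BooleanDefaultTrue());
        HpsMeasure sz = new HpsMeasure();
        sz.setVal(BigInteger.valueOf(1L)); // w:sz is measured in half-points
        runProperties.setSz(sz);

        R run = Context.getWmlObjectFactory().createR();
        run.setRPr(runProperties);
        run.getContent().add(text);
        return run;
    }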

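    // Only top-level paragraphs are considered; paragraphs inside tables are not visited.
    private static P getFirstParagraph(WordprocessingMLPackage wordMLPackage) {
        List<Object> contents = wordMLPackage.getMainDocumentPart().getContent();
        for (Object obj : contents) {
            if (obj instanceof P) {
                return ((P) obj);
            }
        }
        // Not found: create one and attach it, otherwise content added to it is lost on save.
        P created = Context.getWmlObjectFactory().createP();
        contents.add(created);
        return created;
    }

    private static P getLastParagraph(WordprocessingMLPackage wordMLPackage) {
        List<Object> contents = wordMLPackage.getMainDocumentPart().getContent();
        for (int i = contents.size() - 1; i >= 0; i--) {
            if (contents.get(i) instanceof P) {
                return ((P) contents.get(i));
            }
        }
        // Not found: create one and attach it, otherwise content added to it is lost on save.
        P created = Context.getWmlObjectFactory().createP();
        contents.add(created);
        return created;
    }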
}
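
The comment on printWordContent notes that //w:t also returns text from hidden runs. To make that concrete without a docx4j dependency, here is a minimal sketch that uses only the JDK's XPath API on a hand-written WordprocessingML fragment. The class name WtXPathSketch and the second XPath expression are my own illustration, not something from docx4j; the filter also ignores the case where w:vanish carries w:val="false" and hidden formatting inherited from styles.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
final class WtXPathSketch {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Evaluates an XPath that uses the "w" prefix and returns the text of every matched node. */
    static List<String> select(String documentXml, String expression) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed so that prefixed steps such as w:t can match
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(documentXml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override public String getNamespaceURI(String prefix) {
                return "w".equals(prefix) ? W_NS : XMLConstants.NULL_NS_URI;
            }
            @Override public String getPrefix(String uri) {
                return W_NS.equals(uri) ? "w" : null;
            }
            @Override public Iterator<String> getPrefixes(String uri) {
                return W_NS.equals(uri)
                        ? Collections.singletonList("w").iterator()
                        : Collections.<String>emptyIterator();
            }
        });

        NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>"
                + "<w:r><w:t>visible text</w:t></w:r>"
                + "<w:r><w:rPr><w:vanish/></w:rPr><w:t>this is hidden text.</w:t></w:r>"
                + "</w:p></w:body></w:document>";
        // All text nodes, hidden or not.
        System.out.println(select(xml, "//w:t"));
        // Only text whose run has no w:vanish in its run properties.
        System.out.println(select(xml, "//w:r[not(w:rPr/w:vanish)]/w:t"));
    }
}
```

The first call prints both strings and the second prints only the visible one. The same filtered expression could in principle be passed to getJAXBNodesViaXPath, but I have not verified that here.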

Reference links:

Introduction To Docx4J
https://www.baeldung.com/docx4j

How to use the getJaxbElement method in docx4j
https://www.tabnine.com/code/java/methods/org.docx4j.openpackaging.parts.DocPropsCustomPart/getJaxbElement

Show all text of a docx in a stringBuilder with docx4j
https://stackoverflow.com/questions/26117645/show-all-text-of-a-docx-in-a-stringbuilder-with-docx4j
https://www.docx4java.org/forums/docx-java-f6/is-it-possible-to-extract-all-text-also-tab-and-hyphen-t1996.html

I am using docx4j for reading .docx files and I need to get the paragraph of a document and replace strings
https://stackoverflow.com/questions/13199900/i-am-using-docx4j-for-reading-docx-files-and-i-need-to-get-the-paragraph-of-a-d

Exporting Word documents with docx4j (tag replacement, image insertion, table generation)
https://blog.csdn.net/qq_31905135/article/details/80431042

Working with Word in Java (docx4j)
https://www.cnblogs.com/marydon20170307/p/14757039.html
https://blog.csdn.net/weixin_34295316/article/details/86022702

java docx4j: basic docx4j operations
https://blog.csdn.net/weixin_28932161/article/details/114308517

Converting Word documents from Docx format to Doc format with Docx4j in Java
https://blog.csdn.net/qq_41394352/article/details/123302153

Converting Word to PDF with docx4j
https://www.csdn.net/tags/MtTaMgysMzg3ODAtYmxvZwO0O0OO0O0O.html

https://github.com/plutext/docx4j/tree/master/docs

=END=

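Statement: unless otherwise noted, all articles on ixyzero.com are original. When reposting, please credit this article's address as a link. Thanks!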
https://ixyzero.com/blog/archives/5229.html

4 thoughts on "Simple reading and writing of docx documents with docx4j"

  1. Wordprocessing Text
    http://officeopenxml.com/WPtext.php
    `
    A run defines a non-block region of text with a common set of properties. It is specified with the <w:r> element. The properties of the run are specified with the <w:rPr> element, which is the first element of the <w:r>. Runs most commonly contain text elements <w:t> (which contain the actual literal text of a paragraph), but they may also contain such special content as symbols, tabs, hyphens, carriage returns, drawings, breaks, and footnote references. See Special Content.

    `
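
    The quoted passage points out that a run can hold more than w:t elements, which means a plain //w:t query drops tabs and line breaks. Below is a minimal JDK-only sketch of one way to keep them when flattening runs to text. The class name RunContentSketch and the mapping (w:tab to a tab character, w:br to a newline) are my own assumptions for illustration; paragraph boundaries and other special content are ignored.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only.
final class RunContentSketch {
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Concatenates the content of every w:r in document order, keeping tabs and breaks. */
    static String runsToText(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed for getElementsByTagNameNS and getLocalName
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        StringBuilder out = new StringBuilder();
        NodeList runs = doc.getElementsByTagNameNS(W_NS, "r");
        for (int i = 0; i < runs.getLength(); i++) {
            for (Node child = runs.item(i).getFirstChild(); child != null; child = child.getNextSibling()) {
                if (child.getNodeType() != Node.ELEMENT_NODE || !W_NS.equals(child.getNamespaceURI())) {
                    continue;
                }
                switch (child.getLocalName()) {
                    case "t":
                        out.append(child.getTextContent());
                        break;
                    case "tab":
                        out.append('\t');
                        break;
                    case "br":
                        out.append('\n');
                        break;
                    default:
                        break; // w:rPr and any other special content are skipped in this sketch
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<w:p xmlns:w=\"" + W_NS + "\"><w:r><w:rPr><w:b/></w:rPr>"
                + "<w:t>Name</w:t><w:tab/><w:t>Value</w:t><w:br/><w:t>next line</w:t>"
                + "</w:r></w:p>";
        System.out.println(runsToText(xml));
    }
}
```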
