Working with PDF Files in Java Using PDFBox


=Start=

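Background:

A while back I read an article that reminded me of this topic. I recently had time to run some simple tests and study the relevant code and features, so I am collecting brief notes here for future reference.

Main text:

Reference solution: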

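Common requirements for reading and writing PDF files:

  • Read the text content of a PDF file
  • Add specific "invisible" text to every page
  • Add a specific string watermark to every page
  • Read the metadata of a PDF file
  • Write specific metadata into a PDF file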

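Other special cases:

  • How can a watermark be rotated, centered, and so on?
  • How can multiple PDF files be merged into one?
  • How can a table be created?
  • How can images be extracted?
  • …

Below are code snippets covering the common read/write requirements above; implementations of the other features will be added later as actual needs arise.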

<!-- https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox -->
<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.30</version>
</dependency>

package com.example;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

/**
 * @author ixyzero
 * Created on 2023-12-01
 */
public class PdfBoxWatermark {
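    // Takes a PDF file and returns its text content as a string
    static String getPdfText(File pdfFile) throws IOException {
        // try-with-resources closes the document even if text extraction fails
        try (PDDocument doc = PDDocument.load(pdfFile)) {
            return new PDFTextStripper().getText(doc);
        }
    }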

    // Adds the given text to every page of the document (when the font size, color and position are conspicuous, the result is effectively a text watermark)
    public static void addText(PDDocument document, String text) throws IOException {
        // Before adding the text, write a custom metadata entry
        document.getDocumentInformation().setCustomMetadataValue("customFieldName", "customStringValueHere");

        // Iterate over all pages in the PDF
        for (int i = 0; i < document.getNumberOfPages(); i++) {
            PDPage page = document.getPage(i);

            // System.out.println(page.getMetadata()); // metadata is usually stored at the document level rather than the page level, so this typically returns null
            System.out.println(page.getRotation()); // rotation angle; 0 when the page is not rotated
            System.out.println(page.getMediaBox().getHeight()); // page height; about 842 points for A4 (PDF units of 1/72 inch, not pixels)
            System.out.println(page.getMediaBox().getWidth()); // page width; about 595 points for A4

            // Create a PDPageContentStream in append mode (compress = true, resetContext = true);
            // try-with-resources closes the stream even if an exception is thrown
            try (PDPageContentStream contentStream = new PDPageContentStream(document, page, PDPageContentStream.AppendMode.APPEND, true, true)) {
                // Set the font, font size and text color
                contentStream.setFont(PDType1Font.TIMES_ROMAN, 48); // font and font size
                contentStream.setNonStrokingColor(1f, 0f, 0f); // text fill color (red)

                // Add the text watermark
                contentStream.beginText();
                contentStream.newLineAtOffset(200, 1); // text position, in points from the page origin (normally the lower-left corner)
                contentStream.showText(text); // the text to add
                contentStream.endText();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // System.out.println(getPdfText(new File("original.pdf")));

        // Load the PDF file into a PDDocument; try-with-resources closes it when done
        try (PDDocument document = PDDocument.load(new File("original.pdf"))) {
            // Read and print some of the document's metadata fields
            System.out.println(document.getDocumentInformation().getMetadataKeys());
            System.out.println(document.getDocumentInformation().getCreator());
            // The creation date is optional, so guard against null before calling getTime()
            java.util.Calendar creationDate = document.getDocumentInformation().getCreationDate();
            System.out.println(creationDate == null ? null : creationDate.getTime());
            System.out.println(document.getDocumentInformation().getKeywords());
            System.out.println(document.getDocumentInformation().getTitle());
            System.out.println(document.getDocumentInformation().getProducer());

            // Add the text watermark
            addText(document, "ixyzero.com");

            // Read back custom metadata (a key that does not exist returns null)
            System.out.println(document.getDocumentInformation().getCustomMetadataValue("fieldNameNotExists"));
            System.out.println(document.getDocumentInformation().getCustomMetadataValue("customFieldName"));

            // Save the modified PDF file
            document.save(new File("output-pdfbox.pdf"));
        }

        // Extract the text of the saved file to confirm the watermark text is present
        System.out.println(getPdfText(new File("output-pdfbox.pdf")));
    }

}
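
The first of the special cases listed earlier (rotating and centering a watermark) mostly comes down to geometry. The following is a minimal sketch of that geometry only, using the JDK's `java.awt.geom` classes; `WatermarkGeometry` and its methods are hypothetical names, not part of PDFBox. It assumes the text width in points is already known (in PDFBox 2.x it should be obtainable as `font.getStringWidth(text) / 1000 * fontSize`) and it centers the midpoint of the baseline, ignoring the font height.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

/** Geometry helper for a centered, rotated watermark (hypothetical class, not part of PDFBox). */
final class WatermarkGeometry {

    /**
     * Returns the baseline start point such that text of the given width, rotated
     * counter-clockwise by angleDegrees, has the midpoint of its baseline at the page center.
     * All lengths are in PDF points (1/72 inch).
     */
    static Point2D.Double baselineStart(double pageWidth, double pageHeight,
                                        double textWidth, double angleDegrees) {
        double theta = Math.toRadians(angleDegrees);
        double cx = pageWidth / 2.0;
        double cy = pageHeight / 2.0;
        // Step back half the text width along the rotated baseline direction
        return new Point2D.Double(cx - (textWidth / 2.0) * Math.cos(theta),
                                  cy - (textWidth / 2.0) * Math.sin(theta));
    }

    /** Builds the text-space-to-page transform: rotate first, then move to the start point. */
    static AffineTransform textTransform(double pageWidth, double pageHeight,
                                         double textWidth, double angleDegrees) {
        Point2D.Double start = baselineStart(pageWidth, pageHeight, textWidth, angleDegrees);
        AffineTransform at = AffineTransform.getTranslateInstance(start.x, start.y);
        at.rotate(Math.toRadians(angleDegrees));
        return at;
    }

    public static void main(String[] args) {
        // A4 page in points, a 200-point-wide string, rotated 45 degrees
        AffineTransform at = textTransform(595, 842, 200, 45);
        // The midpoint of the baseline in text space is (textWidth / 2, 0)
        Point2D mid = at.transform(new Point2D.Double(100, 0), null);
        System.out.println(mid); // should be (approximately) the page center
    }
}
```

In `addText`, the start point and angle computed this way could replace the fixed `newLineAtOffset(200, 1)` call, for example via `contentStream.setTextMatrix(Matrix.getRotateInstance(theta, tx, ty))` with `theta` in radians.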
References:

Adding watermarks to PDFs with SpringBoot: 5 ways to implement it
https://mp.weixin.qq.com/s/gphcp_L80OzOXxsPAijgig

The Apache PDFBox library is an open source Java tool for working with PDF documents.
https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox

How to extract text from a PDF file with Apache PDFBox
https://stackoverflow.com/questions/23813727/how-to-extract-text-from-a-pdf-file-with-apache-pdfbox

Adding Text to an Existing PDF Document
https://www.tutorialspoint.com/pdfbox/pdfbox_adding_text.htm

PDFBox Working with Metadata
https://www.javatpoint.com/pdfbox-working-with-metadata

How to get raw text from pdf file using java
https://stackoverflow.com/questions/18098400/how-to-get-raw-text-from-pdf-file-using-java

A4 size in pixels
https://www.a4-size.com/a4-size-in-pixels/

PdfBox 2.0.0 write text at given postion in a page
https://stackoverflow.com/questions/36449776/pdfbox-2-0-0-write-text-at-given-postion-in-a-page

Rotate watermark text at 45 degree angle across the center Apache PDFBox
https://stackoverflow.com/questions/53108150/rotate-watermark-text-at-45-degree-angle-across-the-center-apache-pdfbox

Using Pdfbox to Align Text in the Center
https://copyprogramming.com/howto/how-to-center-a-text-using-pdfbox

How to add metadata to PDF document using PDFbox?
https://stackoverflow.com/questions/40295264/how-to-add-metadata-to-pdf-document-using-pdfbox

=END=


5 comments on "Working with PDF Files in Java Using PDFBox"

  1. Getting information about a file in Java
    Get Information About a PDF in Java
    https://www.baeldung.com/java-pdf-info
    `
    1. Overview
    In this tutorial, we’ll get to know different ways of getting information about a PDF file using the iText and PDFBox libraries in Java.

    2. iText Library

    2.2. Getting the Number of Pages
    PdfReader.getNumberOfPages()

    2.3. Getting the PDF Metadata
    PdfReader.getInfo()

    2.4. Knowing the PDF Password Protection
    PdfReader.isEncrypted()

    3. PDFBox Library

    3.2. Getting the Number of Pages
    PDDocument.getNumberOfPages()

    3.3. Getting the PDF Metadata
    PDDocumentInformation info = PDDocument.getDocumentInformation();

    3.4. Knowing the PDF Password Protection
    boolean isEncrypted = PDDocument.isEncrypted();
    `

  2. 6. Adding Watermarks to Existing PDF
    iText PDF library made it easy to add watermarks to existing PDFs. We’ll first load our PDF document into our program. And use the iText library to manipulate our existing PDF.
    https://www.baeldung.com/java-watermarks-with-itext
    https://github.com/eugenp/tutorials/blob/master/libraries-files/src/main/java/com/baeldung/iTextPDF/StoryTime.java
    `
    func main
    func createPdf
    func addWatermarkToExistingPdf
    func createWatermarkParagraph
    func addWatermarkToPage
    func addWatermarkToExistingPage
    `

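  3. Several ways to generate PDFs in Java
    https://www.jianshu.com/p/e0a54d5e63c3
    `
    A summary of ways to generate PDFs with Java:

    A. itext-PdfStamper pdfStamper (commonly known as "filling in a template")
    B. itext-Document document (writing the document in ordinary code)
    C. wkhtmltopdf (using an external tool)

    Comparison
    Method | Pros | Cons
    A | Simple code | The template must be provided first; field lengths are fixed and inflexible
    B | The template can be adjusted in code, though styling is less flexible than with C | More back-end code to maintain
    C | Template styling can be adjusted freely on the front end | More front-end code to maintain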
    `

  4. How to identify and remove hidden text from the PDF using PDFBox java
    https://stackoverflow.com/questions/63936154/how-to-identify-and-remove-hidden-text-from-the-pdf-using-pdfbox-java?rq=3
    `
    So let’s extend the text stripper by a color filtering option. This in particular means adding operator processors for color setting instructions as the PDFTextStripper by default ignores them
    `

    PDFBox – Removing invisible text (by clip/filling paths issue)
    https://stackoverflow.com/questions/47908124/pdfbox-removing-invisible-text-by-clip-filling-paths-issue
    `
    The reason why the PDFVisibleTextStripper from this answer the OP referenced does not work is that the calculation of the end of a character baseline end in the overwritten processTextPosition does not take page rotation into account. If you change that method, though, to only test the start of each character baseline and ignore the end, it works fairly good for the document at hand:
    `

    Methods for Adding Hidden Text to a PDF Document
    https://copyprogramming.com/howto/how-to-insert-invisible-text-into-a-pdf
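
    One way to add "invisible" text, as opposed to the visible red text in the article's `addText`, is PDF text rendering mode 3 (neither fill nor stroke). The sketch below assumes the PDFBox 2.0.x dependency from the article is on the classpath; `InvisibleTextDemo` and its method names are hypothetical. It builds a one-page PDF in memory and extracts the text again. Plain text extraction is expected to still return the hidden string, which is why the answers quoted above have to filter it out explicitly.

```java
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.graphics.state.RenderingMode;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

/** Hypothetical demo class; assumes PDFBox 2.0.x (the version used in the article). */
final class InvisibleTextDemo {

    /** Builds a one-page PDF in memory whose only text is drawn with rendering mode 3. */
    static byte[] buildPdfWithHiddenText(String hiddenText) throws IOException {
        try (PDDocument doc = new PDDocument();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            PDPage page = new PDPage();
            doc.addPage(page);
            try (PDPageContentStream cs = new PDPageContentStream(doc, page)) {
                cs.beginText();
                cs.setFont(PDType1Font.HELVETICA, 12);
                // NEITHER = no fill and no stroke, so the glyphs are not painted
                cs.setRenderingMode(RenderingMode.NEITHER);
                cs.newLineAtOffset(50, 700);
                cs.showText(hiddenText);
                cs.endText();
            }
            doc.save(out);
            return out.toByteArray();
        }
    }

    /** Extracts all text from an in-memory PDF. */
    static String extractText(byte[] pdfBytes) throws IOException {
        try (PDDocument doc = PDDocument.load(pdfBytes)) {
            return new PDFTextStripper().getText(doc);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = buildPdfWithHiddenText("hiddenmark123");
        System.out.println(extractText(pdf));
    }
}
```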

  5. Page orientation 页面方向
    https://en.wikipedia.org/wiki/Page_orientation
    `
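    Landscape mode is when the width is greater than the height, commonly called "widescreen mode";
    portrait mode is when the height is greater than the width, commonly called "vertical-screen mode";
    `
    What is landscape, and what is portrait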
    https://blog.csdn.net/k7arm/article/details/48085423

    What is the difference between portrait mode and landscape mode?
    https://pc.net/helpcenter/portrait_and_landscape_mode
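
    The definitions above can be applied to a PDF page using the values the article's code already prints: media box width, media box height, and page rotation. This is a minimal sketch with a hypothetical `PageOrientation` helper (plain Java, no PDFBox types); it assumes that a rotation of 90 or 270 degrees swaps the displayed width and height.

```java
/** Classifies page orientation from media box size and rotation (hypothetical helper). */
final class PageOrientation {

    enum Kind { PORTRAIT, LANDSCAPE, SQUARE }

    /**
     * @param width           media box width in points
     * @param height          media box height in points
     * @param rotationDegrees page rotation, expected to be a multiple of 90
     */
    static Kind of(float width, float height, int rotationDegrees) {
        // Normalize to 0..359 so that negative values such as -90 are handled too
        int r = ((rotationDegrees % 360) + 360) % 360;
        // A quarter turn swaps width and height as the page is displayed
        boolean swapped = (r == 90 || r == 270);
        float displayedWidth = swapped ? height : width;
        float displayedHeight = swapped ? width : height;
        if (displayedWidth > displayedHeight) {
            return Kind.LANDSCAPE;
        }
        if (displayedHeight > displayedWidth) {
            return Kind.PORTRAIT;
        }
        return Kind.SQUARE;
    }

    public static void main(String[] args) {
        System.out.println(of(595f, 842f, 0));  // unrotated A4
        System.out.println(of(595f, 842f, 90)); // A4 rotated by a quarter turn
    }
}
```

    With PDFBox, the arguments would come from `page.getMediaBox().getWidth()`, `page.getMediaBox().getHeight()` and `page.getRotation()`.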
