Class PDFText2HTML

java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
org.apache.pdfbox.tools.PDFText2HTML

public class PDFText2HTML extends org.apache.pdfbox.text.PDFTextStripper
Wrap stripped text in simple HTML, trying to form HTML paragraphs. Paragraphs broken by pages, columns, or figures are not mended.
Author:
John J Barton
  • Field Summary

    Fields inherited from class org.apache.pdfbox.text.PDFTextStripper

    charactersByArticle, document, LINE_SEPARATOR, output
  • Constructor Summary

    Constructors
    Constructor
    Description
    PDFText2HTML()
    Constructor.
  • Method Summary

    Modifier and Type
    Method
    Description
    protected float
    computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0)
     
    protected void
    endArticle()
    Write out the article separator.
    void
    endDocument(org.apache.pdfbox.pdmodel.PDDocument document)
    protected String
    getTitle()
    This method will attempt to guess the title of the document using either the document properties or the first lines of text.
    protected void
    showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4)
     
    protected void
    startArticle(boolean isLTR)
    Write out the article separator (div tag) with proper text direction information.
    protected void
    startDocument(org.apache.pdfbox.pdmodel.PDDocument document)
     
    protected void
    writeHeader()
    Deprecated.
    protected void
    writeParagraphEnd()
    Writes the paragraph end "</p>" to the output.
    protected void
    writeString(String chars)
    Write a string to the output stream and escape some HTML characters.
    protected void
    writeString(String text, List<org.apache.pdfbox.text.TextPosition> textPositions)
    Write a string to the output stream, maintain font state, and escape some HTML characters.

    Methods inherited from class org.apache.pdfbox.text.PDFTextStripper

    endPage, getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getCharactersByArticle, getCurrentPageNo, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getListItemPatterns, getOutput, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getStartPage, getSuppressDuplicateOverlappingText, getText, getWordSeparator, matchPattern, processPage, processPages, processTextPosition, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndBookmark, setEndPage, setIndentThreshold, setLineSeparator, setListItemPatterns, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setStartBookmark, setStartPage, setSuppressDuplicateOverlappingText, setWordSeparator, startArticle, startPage, writeCharacters, writeLineSeparator, writePage, writePageEnd, writePageStart, writeParagraphSeparator, writeParagraphStart, writeText, writeWordSeparator

    Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine

    addOperator, applyTextAdjustment, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, operatorException, processAnnotation, processChildStream, processOperator, processOperator, processSoftMask, processTilingPattern, processTilingPattern, processTransparencyGroup, processType3Stream, registerOperatorProcessor, restoreGraphicsStack, restoreGraphicsState, saveGraphicsStack, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showFontGlyph, showFontGlyph, showForm, showGlyph, showText, showTextString, showTextStrings, showTransparencyGroup, showType3Glyph, showType3Glyph, transformedPoint, transformWidth, unsupportedOperator

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • PDFText2HTML

      public PDFText2HTML() throws IOException
      Constructor.
      Throws:
      IOException - If there is an error during initialization.
  • Method Details

    • writeHeader

      @Deprecated protected void writeHeader() throws IOException
      Deprecated.
      Write the header to the output document. Now also writes the tag defining the character encoding.
      Throws:
      IOException - If there is a problem writing out the header to the document.
    • startDocument

      protected void startDocument(org.apache.pdfbox.pdmodel.PDDocument document) throws IOException
      Overrides:
      startDocument in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • endDocument

      public void endDocument(org.apache.pdfbox.pdmodel.PDDocument document) throws IOException
      Overrides:
      endDocument in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • getTitle

      protected String getTitle()
      This method will attempt to guess the title of the document using either the document properties or the first lines of text.
      Returns:
      The guessed title of the document.
    • startArticle

      protected void startArticle(boolean isLTR) throws IOException
      Write out the article separator (div tag) with proper text direction information.
      Overrides:
      startArticle in class org.apache.pdfbox.text.PDFTextStripper
      Parameters:
      isLTR - true if direction of text is left to right
      Throws:
      IOException - If there is an error writing to the stream.
    • endArticle

      protected void endArticle() throws IOException
      Write out the article separator.
      Overrides:
      endArticle in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException - If there is an error writing to the stream.
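
      The pairing of startArticle and endArticle can be pictured with the following minimal sketch. It is not the library's code: the class name ArticleDivSketch, the helper names, and the exact markup (a dir attribute with the values "ltr" and "rtl") are assumptions made for illustration. The documentation above only states that a div tag carrying text direction information is written.

```java
// Hypothetical sketch; the markup produced by PDFText2HTML itself may differ.
class ArticleDivSketch {

    // Opening tag for an article, carrying the text direction (assumed markup).
    static String articleStart(boolean isLTR) {
        return "<div dir=\"" + (isLTR ? "ltr" : "rtl") + "\">";
    }

    // Closing tag that ends the article started above.
    static String articleEnd() {
        return "</div>";
    }

    public static void main(String[] args) {
        // Wrap one right-to-left article around a placeholder paragraph.
        StringBuilder html = new StringBuilder();
        html.append(articleStart(false));
        html.append("<p>...</p>");
        html.append(articleEnd());
        System.out.println(html);
    }
}
```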
    • writeString

      protected void writeString(String text, List<org.apache.pdfbox.text.TextPosition> textPositions) throws IOException
      Write a string to the output stream, maintain font state, and escape some HTML characters. The font state is only preserved per word.
      Overrides:
      writeString in class org.apache.pdfbox.text.PDFTextStripper
      Parameters:
      text - The text to write to the stream.
      textPositions - the corresponding text positions
      Throws:
      IOException - If there is an error writing to the stream.
    • writeString

      protected void writeString(String chars) throws IOException
      Write a string to the output stream and escape some HTML characters.
      Overrides:
      writeString in class org.apache.pdfbox.text.PDFTextStripper
      Parameters:
      chars - String to be written to the stream
      Throws:
      IOException - If there is an error writing to the stream.
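
      To make "escape some HTML characters" concrete, here is a minimal sketch of the general technique using only the Java standard library. The class HtmlEscapeSketch and its methods are hypothetical, and the set of escaped characters (&, <, >, ") is an assumption. The documentation above does not list which characters PDFText2HTML actually escapes.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical sketch of HTML escaping; not the PDFText2HTML implementation.
class HtmlEscapeSketch {

    // Replace characters that have special meaning in HTML with entities.
    static String escape(String chars) {
        StringBuilder sb = new StringBuilder(chars.length());
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Write the escaped form of the string to the given writer.
    static void writeEscaped(Writer out, String chars) throws IOException {
        out.write(escape(chars));
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        writeEscaped(out, "a < b && \"c\" > d");
        System.out.println(out); // prints: a &lt; b &amp;&amp; &quot;c&quot; &gt; d
    }
}
```

      Escaping the ampersand in the same pass as the other characters avoids double-escaping the entities that the method itself emits.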
    • writeParagraphEnd

      protected void writeParagraphEnd() throws IOException
      Writes the paragraph end "</p>" to the output. Furthermore, it will also clear the font state.
      Overrides:
      writeParagraphEnd in class org.apache.pdfbox.text.PDFTextStripper
      Throws:
      IOException
    • showGlyph

      protected void showGlyph(org.apache.pdfbox.util.Matrix arg0, org.apache.pdfbox.pdmodel.font.PDFont arg1, int arg2, String arg3, org.apache.pdfbox.util.Vector arg4) throws IOException
      Overrides:
      showGlyph in class org.apache.pdfbox.contentstream.PDFStreamEngine
      Throws:
      IOException
    • computeFontHeight

      protected float computeFontHeight(org.apache.pdfbox.pdmodel.font.PDFont arg0) throws IOException
      Throws:
      IOException