Class TextPosition

java.lang.Object
org.apache.pdfbox.text.TextPosition

public final class TextPosition extends Object
Represents a string of characters and its position on the screen.
Author:
Ben Litchfield
  • Constructor Summary

    Constructors
    Constructor
    Description
    TextPosition(int pageRotation, float pageWidth, float pageHeight, Matrix textMatrix, float endX, float endY, float maxHeight, float individualWidth, float spaceWidth, String unicode, int[] charCodes, PDFont font, float fontSize, int fontSizeInPt)
    Constructor.
  • Method Summary

    Modifier and Type
    Method
    Description
    boolean
    contains(TextPosition tp2)
    Determine if this TextPosition logically contains another (i.e. they overlap and should be rendered on top of each other).
    boolean
    equals(Object o)
    int[]
    getCharacterCodes()
    Return the internal PDF character codes of the glyphs in this text.
    float
    getDir()
    Return the direction/orientation of the string in this object based on its text matrix.
    float
    getEndX()
    This will get the x coordinate of the end position.
    float
    getEndY()
    This will get the y coordinate of the end position.
    PDFont
    getFont()
    This will get the font for the text being drawn.
    float
    getFontSize()
    This will get the font size that has been set with the "Tf" operator (Set text font and size).
    float
    getFontSizeInPt()
    This will get the font size in pt.
    float
    getHeight()
    This will get the maximum height of all characters in this string.
    float
    getHeightDir()
    This will get the maximum height of all characters in this string.
    float[]
    getIndividualWidths()
    Get the widths of each individual character.
    float
    getPageHeight()
    This will get the height of the page that the text is located in.
    float
    getPageWidth()
    This will get the width of the page that the text is located in.
    int
    getRotation()
    This will get the rotation of the page that the text is located in.
    Matrix
    getTextMatrix()
    The matrix containing the starting text position and scaling.
    String
    getUnicode()
    Return the string of characters stored in this object.
    String
    getVisuallyOrderedUnicode()
    Same as getUnicode() except that returned text is ensured to be visually ordered (i.e. same order you would see them displayed on screen when looking from left to right).
    float
    getWidth()
    This will get the width of the string when page rotation adjusted coordinates are used.
    float
    getWidthDirAdj()
    This will get the width of the string when text direction adjusted coordinates are used.
    float
    getWidthOfSpace()
    This will get the width of a space character.
    float
    getX()
    This will get the page rotation adjusted x position of the character.
    float
    getXDirAdj()
    This will get the text direction adjusted x position of the character.
    float
    getXScale()
    This will get the X scaling factor.
    float
    getY()
    This will get the page rotation adjusted y position of the character.
    float
    getYDirAdj()
    This will get the y position of the text, adjusted so that 0,0 is upper left and it is adjusted based on the text direction.
    float
    getYScale()
    This will get the Y scaling factor.
    int
    hashCode()
    boolean
    isDiacritic()
    void
    mergeDiacritic(TextPosition diacritic)
    Merge a single character TextPosition into the current object.
    String
    toString()
    Show the string data for this text position.

    Methods inherited from class java.lang.Object

    clone, finalize, getClass, notify, notifyAll, wait, wait, wait
  • Constructor Details

    • TextPosition

      public TextPosition(int pageRotation, float pageWidth, float pageHeight, Matrix textMatrix, float endX, float endY, float maxHeight, float individualWidth, float spaceWidth, String unicode, int[] charCodes, PDFont font, float fontSize, int fontSizeInPt)
      Constructor.
      Parameters:
      pageRotation - rotation of the page that the text is located in
      pageWidth - width of the page that the text is located in
      pageHeight - height of the page that the text is located in
      textMatrix - text rendering matrix for start of text (in display units)
      endX - x coordinate of the end position
      endY - y coordinate of the end position
      maxHeight - Maximum height of text (in display units)
      individualWidth - The width of the given character/string (in text units).
      spaceWidth - The width of the space character (in display units).
      unicode - The string of Unicode characters to be displayed.
      charCodes - An array of the internal PDF character codes for the glyphs in this text.
      font - The current font for this text position.
      fontSize - The new font size.
      fontSizeInPt - The font size in pt units (see getFontSizeInPt() for details).
  • Method Details

    • getUnicode

      public String getUnicode()
      Return the string of characters stored in this object. Its length can differ from the length of the character codes array (see getCharacterCodes()), e.g. when ligatures are used ("fi", "fl", "ffl") and one glyph represents several Unicode characters.
      Returns:
      The string on the screen.
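
      The length mismatch described above can be illustrated without PDFBox. This is only a sketch: the class LigatureLengths and the code value 12 are made up for the example and are not part of the library.

```java
// Hypothetical illustration; LigatureLengths is not a PDFBox class.
final class LigatureLengths {
    // True when the Unicode string and the character code array differ in length,
    // as happens when a single glyph code stands for several characters.
    static boolean lengthsDiffer(String unicode, int[] charCodes) {
        return unicode.length() != charCodes.length;
    }

    public static void main(String[] args) {
        int[] charCodes = {12};   // one glyph code (value invented for this example)
        String unicode = "fi";    // the two characters that the ligature glyph represents
        System.out.println(lengthsDiffer(unicode, charCodes)); // true
    }
}
```

      Code that indexes getUnicode() and getCharacterCodes() in parallel should therefore not assume a one-to-one mapping.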
    • getVisuallyOrderedUnicode

      public String getVisuallyOrderedUnicode()
      Same as getUnicode(), except that the returned text is guaranteed to be in visual order (i.e. the order in which the characters appear on screen when read from left to right). This matters for Arabic and Hebrew, where several Unicode characters can be composed into one glyph and are kept in logical order (the order in which they would normally be typed, from right to left).
      Returns:
      The string on the screen in visual order.
    • getCharacterCodes

      public int[] getCharacterCodes()
      Return the internal PDF character codes of the glyphs in this text.
      Returns:
      an array of internal PDF character codes
    • getTextMatrix

      public Matrix getTextMatrix()
      The matrix containing the starting text position and scaling. Despite the name, this is not the text matrix set by the "Tm" operator but the effective text rendering matrix, which depends on the current transformation matrix (set by the "cm" operator), the text matrix (set by the "Tm" operator), the font size (set by the "Tf" operator) and the page cropbox.
      Returns:
      The Matrix containing the starting text position
    • getDir

      public float getDir()
      Return the direction/orientation of the string in this object based on its text matrix. Only angles of 0, 90, 180, or 270 are supported. To get other angles, use this code:
       TextPosition text = ...
       Matrix m = text.getTextMatrix().clone();
       m.concatenate(text.getFont().getFontMatrix());
       int angle = (int) Math.round(Math.toDegrees(Math.atan2(m.getShearY(), m.getScaleY())));
       
      Returns:
      The direction of the text (0, 90, 180, or 270).
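
      The arithmetic in the snippet above can be tried in isolation. The following sketch assumes that, for a pure rotation, the shearY and scaleY components of the matrix hold the sine and cosine of the angle. The class TextAngle is hypothetical and only mirrors the atan2 step; it does not call PDFBox.

```java
// Hypothetical helper; mirrors only the atan2 computation from the snippet above.
final class TextAngle {
    // shearY and scaleY are assumed to be sin(angle) and cos(angle) of a pure rotation.
    static int angleDegrees(double shearY, double scaleY) {
        return (int) Math.round(Math.toDegrees(Math.atan2(shearY, scaleY)));
    }

    public static void main(String[] args) {
        double theta = Math.toRadians(30);
        System.out.println(angleDegrees(Math.sin(theta), Math.cos(theta))); // 30
        System.out.println(angleDegrees(1.0, 0.0));                         // 90
    }
}
```

      Because atan2 yields values between -180 and 180 degrees, an angle that getDir() would report as 270 comes out as -90 with this formula.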
    • getX

      public float getX()
      This will get the page rotation adjusted x position of the character. The value is adjusted for page rotation so that 0,0 is the upper left, unlike PDF coordinates, whose origin is the bottom left. See also this answer by Michael Klink for further details and PDFBOX-4597 for a sample file.
      Returns:
      The x coordinate of the character.
    • getXDirAdj

      public float getXDirAdj()
      This will get the text direction adjusted x position of the character. The value is adjusted for text direction so that the first character in that direction is in the upper left at 0,0. This method ignores the page rotation but takes the text rotation into account (see getDir()) and converts the coordinates to AWT conventions. This is useful in text extraction for comparing glyph positions as if the text were horizontal. See also this answer by Michael Klink for further details and PDFBOX-4597 for a sample file.
      Returns:
      The x coordinate of the text.
    • getY

      public float getY()
      This will get the page rotation adjusted y position of the character. The value is adjusted for page rotation so that 0,0 is the upper left, unlike PDF coordinates, whose origin is the bottom left. See also this answer by Michael Klink for further details and PDFBOX-4597 for a sample file.
      Returns:
      The adjusted y coordinate of the character.
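
      To make the top-left convention of getX() and getY() concrete, here is a minimal sketch for the simplest case only: an unrotated page with no cropbox offset. The class TopLeftCoordinates is hypothetical, and the sketch is not a description of how PDFBox computes these values for rotated pages.

```java
// Hypothetical helper; covers only an unrotated page without a cropbox offset.
final class TopLeftCoordinates {
    // x is unchanged; y is mirrored about the page height so that 0 is the top edge.
    static float toTopLeftY(float pdfY, float pageHeight) {
        return pageHeight - pdfY;
    }

    public static void main(String[] args) {
        float pageHeight = 792f; // US Letter height in points
        // A baseline 700 pt above the bottom edge lies 92 pt below the top edge.
        System.out.println(toTopLeftY(700f, pageHeight)); // 92.0
    }
}
```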
    • getYDirAdj

      public float getYDirAdj()
      This will get the y position of the text, adjusted for text direction so that 0,0 is the upper left. This method ignores the page rotation but takes the text rotation into account (see getDir()) and converts the coordinates to AWT conventions. This is useful in text extraction for comparing glyph positions as if the text were horizontal. See also this answer by Michael Klink for further details and PDFBOX-4597 for a sample file.
      Returns:
      The adjusted y coordinate of the character.
    • getWidth

      public float getWidth()
      This will get the width of the string when page rotation adjusted coordinates are used.
      Returns:
      The width of the text in display units.
    • getWidthDirAdj

      public float getWidthDirAdj()
      This will get the width of the string when text direction adjusted coordinates are used.
      Returns:
      The width of the text in display units.
    • getHeight

      public float getHeight()
      This will get the maximum height of all characters in this string.
      Returns:
      The maximum height of all characters in this string.
    • getHeightDir

      public float getHeightDir()
      This will get the maximum height of all characters in this string.
      Returns:
      The maximum height of all characters in this string.
    • getFontSize

      public float getFontSize()
      This will get the font size that has been set with the "Tf" operator (Set text font and size). When the text is rendered, it may appear bigger or smaller depending on the current transformation matrix (set by the "cm" operator) and the text matrix (set by the "Tm" operator).
      Returns:
      The font size.
    • getFontSizeInPt

      public float getFontSizeInPt()
      This will get the font size in pt. It is obtained by multiplying the font size from getFontSize() by the horizontal scaling factor of the text matrix (set by the "Tm" operator) and truncating the result to an integer. The rendered text may appear bigger or smaller depending on the current transformation matrix (set by the "cm" operator). To get the size in rendering, use getXScale().
      Returns:
      The font size in pt.
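
      The calculation described above amounts to a multiplication followed by truncation. A minimal sketch, where the class FontSizeInPt and its parameter names are assumptions made for the example:

```java
// Hypothetical helper following the description above; not a PDFBox class.
final class FontSizeInPt {
    // fontSize: the "Tf" operand; tmHorizontalScale: horizontal scaling of the "Tm" matrix.
    static int fontSizeInPt(float fontSize, float tmHorizontalScale) {
        return (int) (fontSize * tmHorizontalScale); // the cast truncates toward zero
    }

    public static void main(String[] args) {
        // A document that sets "Tf" to 1 and scales the text via "Tm" by 10.5.
        System.out.println(fontSizeInPt(1f, 10.5f)); // 10
        System.out.println(fontSizeInPt(12f, 1f));   // 12
    }
}
```

      The first case shows why getFontSize() alone can be misleading: the "Tf" size is 1, but the text is set at roughly 10 pt.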
    • getFont

      public PDFont getFont()
      This will get the font for the text being drawn.
      Returns:
      The font.
    • getWidthOfSpace

      public float getWidthOfSpace()
      This will get the width of a space character. This is useful for some algorithms such as the text stripper, that need to know the width of a space character.
      Returns:
      The width of a space character.
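
      One way an extraction algorithm might use the space width is to decide whether a horizontal gap between two glyphs is a word break. The sketch below is a simplified heuristic, not the logic of the PDFBox text stripper. The class SpaceGap and the threshold of half a space width are assumptions.

```java
// Hypothetical heuristic; the 0.5 threshold is an assumed tuning value.
final class SpaceGap {
    // prevEndX: right edge of the previous glyph; nextStartX: left edge of the next glyph.
    static boolean needsSpace(float prevEndX, float nextStartX, float widthOfSpace) {
        float gap = nextStartX - prevEndX;
        return gap > 0.5f * widthOfSpace;
    }

    public static void main(String[] args) {
        System.out.println(needsSpace(100f, 103f, 4f)); // true: gap of 3 exceeds 2
        System.out.println(needsSpace(100f, 101f, 4f)); // false: gap of 1 does not
    }
}
```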
    • getXScale

      public float getXScale()
      This will get the X scaling factor. This is dependent on the current transformation matrix (set by the "cm" operator), the text matrix (set by the "Tm" operator) and the font size (set by the "Tf" operator).
      Returns:
      The X scaling factor.
    • getYScale

      public float getYScale()
      This will get the Y scaling factor. This is dependent on the current transformation matrix (set by the "cm" operator), the text matrix (set by the "Tm" operator) and the font size (set by the "Tf" operator).
      Returns:
      The Y scaling factor.
    • getIndividualWidths

      public float[] getIndividualWidths()
      Get the widths of each individual character.
      Returns:
      An array that has the same length as the CharacterCodes array.
    • contains

      public boolean contains(TextPosition tp2)
      Determine if this TextPosition logically contains another (i.e. they overlap and should be rendered on top of each other).
      Parameters:
      tp2 - The other TextPosition to compare against
      Returns:
      True if tp2 is contained in the bounding box of this text.
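
      The idea of one bounding box containing another can be sketched with plain rectangles. This is an assumption-laden illustration: the class BoxContainment and its Box type are hypothetical, and the real method may use different tolerances or overlap rules.

```java
// Hypothetical illustration of strict bounding-box containment; not PDFBox's rule.
final class BoxContainment {
    static final class Box {
        final float x, y, width, height;

        Box(float x, float y, float width, float height) {
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }

        // True when every edge of other lies within this box.
        boolean contains(Box other) {
            return other.x >= x
                    && other.y >= y
                    && other.x + other.width <= x + width
                    && other.y + other.height <= y + height;
        }
    }

    public static void main(String[] args) {
        Box base = new Box(10f, 10f, 8f, 12f);   // e.g. a base letter
        Box accent = new Box(12f, 10f, 4f, 3f);  // e.g. a mark drawn over it
        System.out.println(base.contains(accent)); // true
        System.out.println(accent.contains(base)); // false
    }
}
```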
    • mergeDiacritic

      public void mergeDiacritic(TextPosition diacritic)
      Merge a single character TextPosition into the current object. This is to be used only for cases where we have a diacritic that overlaps an existing TextPosition. In a graphical display, we could overlay them, but for text extraction we need to merge them. Use the contains() method to test if two objects overlap.
      Parameters:
      diacritic - TextPosition to merge into the current TextPosition.
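
      For the text side of such a merge, a base character followed by a combining mark can be composed into a single character with the standard java.text.Normalizer. This sketch shows only that string-level step. The class DiacriticMerge is hypothetical, and the mark is assumed to already be a combining character (a spacing accent would first need to be mapped to its combining form).

```java
import java.text.Normalizer;

// Hypothetical helper; shows Unicode composition only, not position handling.
final class DiacriticMerge {
    // Appends the combining mark to the base text and composes the pair (NFC).
    static String merge(String base, String combiningMark) {
        return Normalizer.normalize(base + combiningMark, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String merged = merge("e", "\u0301"); // U+0301 is the combining acute accent
        System.out.println(merged.length());           // 1
        System.out.println(merged.equals("\u00e9"));   // true
    }
}
```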
    • isDiacritic

      public boolean isDiacritic()
      Returns:
      True if the current character is a diacritic char.
    • toString

      public String toString()
      Show the string data for this text position.
      Overrides:
      toString in class Object
      Returns:
      A human readable form of this object.
    • getEndX

      public float getEndX()
      This will get the x coordinate of the end position. This is the unadjusted value passed into the constructor.
      Returns:
      The unadjusted x coordinate of the end position
    • getEndY

      public float getEndY()
      This will get the y coordinate of the end position. This is the unadjusted value passed into the constructor.
      Returns:
      The unadjusted y coordinate of the end position
    • getRotation

      public int getRotation()
      This will get the rotation of the page that the text is located in. This is the unadjusted value passed into the constructor.
      Returns:
      The unadjusted rotation of the page that the text is located in
    • getPageHeight

      public float getPageHeight()
      This will get the height of the page that the text is located in. This is the unadjusted value passed into the constructor.
      Returns:
      The unadjusted height of the page that the text is located in
    • getPageWidth

      public float getPageWidth()
      This will get the width of the page that the text is located in. This is the unadjusted value passed into the constructor.
      Returns:
      The unadjusted width of the page that the text is located in
    • equals

      public boolean equals(Object o)
      Overrides:
      equals in class Object
    • hashCode

      public int hashCode()
      Overrides:
      hashCode in class Object