jsvine/pdfplumber


0.7.1

May 31, 2022

Fixed

  • Fix bug when calling PageImage.debug_tablefinder() (i.e., with no parameters). (#659 + 063e2ed) [h/t @rneumann7]

Development Changes

  • Add Makefile target for examples, as well as dev requirements to support re-running the example notebooks automatically. (ef065a7)

0.7.0

May 27, 2022

Added

  • Add "matrix" property to char objects, representing the current transformation matrix. (ae6f99e)
  • Add pdfplumber.ctm submodule with class CTM, to calculate scale, skew, and translation of a current transformation matrix obtained from a char's "matrix" property. (ae6f99e)
  • Add page.search(...), an experimental feature that allows you to search a page's text via regular expressions and non-regex strings, returning the text, any regex matches, the bounding box coordinates, and the char objects themselves. (#201 + 58b1ab1)
  • Add --include-attrs/--exclude-attrs to CLI (and corresponding params to .to_json(...), .to_csv(...), and Serializer). (4deac25)
  • Add py.typed for PEP561 compatibility and detection of typing hints by mypy. (ca795d1) [h/t @jhonatan-lopes]
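
To illustrate what scale, skew, and translation mean for a transformation matrix, here is a minimal sketch. It assumes the matrix is the usual six-number PDF form (a, b, c, d, e, f). The helper decompose_ctm and the Decomposed tuple are hypothetical names used only for illustration; they are not pdfplumber's CTM class, and the skew convention shown is one common choice rather than a confirmed description of the library's implementation.

```python
import math
from typing import NamedTuple, Tuple


class Decomposed(NamedTuple):
    scale_x: float
    scale_y: float
    skew_x: float  # degrees
    skew_y: float  # degrees
    translation: Tuple[float, float]


def decompose_ctm(matrix):
    """Hypothetical helper (not part of pdfplumber).

    Splits a six-number affine matrix (a, b, c, d, e, f) into
    scale, skew, and translation components.
    """
    a, b, c, d, e, f = matrix
    return Decomposed(
        # Length of the transformed x and y unit vectors.
        scale_x=math.hypot(a, b),
        scale_y=math.hypot(c, d),
        # Assumed convention: angle of each transformed axis,
        # measured against its untransformed direction.
        skew_x=math.degrees(math.atan2(d, c)) - 90,
        skew_y=math.degrees(math.atan2(b, a)),
        # The last two entries are the x/y offsets.
        translation=(e, f),
    )


result = decompose_ctm((2, 0, 0, 3, 10, 20))
```

For a matrix that only scales and translates, such as (2, 0, 0, 3, 10, 20), both skew values come out as zero.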

Changed

  • Bump pinned pdfminer.six version to 20220524. (486cea8)

Removed

  • Remove utils.collate_chars(...), the old name (and then alias) for utils.extract_text(...). (24f3532)

Fixed

  • Fix IndexError bug for .extract_text(layout=True) on pages without text. (#658 + ad3df11) [h/t @ethanscorey]

0.6.2

May 6, 2022

Added

  • Add type annotations, and refactor parts of the library accordingly. (9587cc7)
  • Add enforcement of type annotations via mypy --strict. (cdfdb87)
  • Add final bits of test coverage. (feb9d08)
  • Add TableSettings class, a behind-the-scenes handler for managing and validating table-extraction settings. (9587cc7)

Changed

  • Rename the positional argument to .to_csv(...) and .to_json(...) from types to object_types. (9587cc7)
  • Tweak the output of .to_json(...) so that, if an object type is not present for a given page, it has no key in the page's object representation. (9587cc7)

Removed

  • Remove utils.filter_objects(...) and move the functionality to within the FilteredPage.objects property calculation, the only part of the library that used it. (9587cc7)
  • Remove code that sets pdfminer.pdftypes.STRICT = True and pdfminer.pdfinterp.STRICT = True, since that has now been the default for a while. (9587cc7)

0.6.1

April 23, 2022

Changed

  • Bump pinned pdfminer.six version to 20220319. (e434ed0)
  • Bump minimum Pillow version to >=9.1. (d88eff1)
  • Drop support for Python 3.6 (EOL Dec. 2021) (a32473e)

Fixed

  • If pdfplumber.open(...) opens a file but a pdfminer.pdfparser.PSException is raised during the process, pdfplumber now makes sure to close that file. (#581 + #578) [h/t @johnhuge]
  • Fix incompatibility with Pillow>=9.1. (#637)

0.6.0

December 21, 2021

Added

  • Add .extract_text(layout=True), an experimental feature which attempts to mimic the structural layout of the text on the page. (#10)
  • Add utils.merge_bboxes(bboxes), which returns the smallest bounding box that contains all bounding boxes in the bboxes argument. (f8d5e70)
  • Add --precision argument to CLI (#520)
  • Add snap_x_tolerance and snap_y_tolerance to table extraction settings. (#51 + #475) [h/t @dustindall]
  • Add join_x_tolerance and join_y_tolerance to table extraction settings. (cbb34ce)
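
As a rough illustration of the "smallest bounding box that contains all bounding boxes" behavior described for utils.merge_bboxes(bboxes), the sketch below assumes each bounding box is an (x0, top, x1, bottom) tuple. The function name merge_bboxes_sketch is hypothetical; this is not the library's source, only one way the described result could be computed.

```python
def merge_bboxes_sketch(bboxes):
    """Hypothetical stand-in for the behavior described above.

    Each bbox is assumed to be (x0, top, x1, bottom). The result is the
    smallest box that contains every input box.
    """
    bboxes = list(bboxes)
    if not bboxes:
        raise ValueError("at least one bounding box is required")
    x0s, tops, x1s, bottoms = zip(*bboxes)
    # Take the outermost edge on each side.
    return (min(x0s), min(tops), max(x1s), max(bottoms))


merged = merge_bboxes_sketch([(0, 0, 10, 10), (5, 5, 20, 8)])
```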

Changed

  • Upgrade pdfminer.six from 20200517 to 20211012; see that library's changelog for details, but a key difference is an improvement in how it assigns line, rect, and curve objects. (Diagonal two-point lines, for instance, are now line objects instead of curve objects.) (#515)
  • Remove Decimal-ization of parsed object attributes, which are now represented with as much precision as is returned by pdfminer.six (#346 + #520)
  • .extract_text(...) returns "" instead of None when character list is empty. (#482 + cb9900b) [h/t @tungph]
  • .extract_words(...) now includes doctop among the attributes it returns for each word. (66fef89)
  • Change behavior of horizontal text_strategy, so that it uses the top and bottom of every word, not just the top of every word and the bottom of the last. (#467 + #466 + #265) [h/t @bobluda + @samkit-jain]
  • Change table.merge_edges(...) behavior when join_tolerance (and x/y variants) <= 0, so that joining is attempted regardless, to handle cases of overlapping lines. (cbb34ce)
  • Raise error if certain table-extraction settings are negative. (aa2d594)

Fixed

  • Fix slowdown in .extract_words(...)/WordExtractor.iter_chars_to_words(...) on very long words, caused by repeatedly re-calculating bounding box. (#483)
  • Handle UnicodeDecodeError when trying to decode utf-16-encoded annotations (#463) [h/t @tungph]
  • Fix crash when extracting tables with null values in (text|intersection)_(x|y)_tolerance settings. (#539) [h/t @yoavxyoav]

Removed

  • Remove pdfplumber.load(...) method, which has been deprecated since 0.5.23 (54cbbc5)

Development Changes

  • Add CONTRIBUTING.md (#428)
  • Enforce import order via isort (d72b879)
  • Update Pillow and Wand versions in requirements.txt (cae6924)
  • Update all dependency versions in requirements-dev.txt (2f7e7ee)

0.5.28

May 8, 2021

Added

  • Add --laparams flag to CLI. (#407)

Changed

  • Change .convert_csv(...) to order objects first by page number, rather than object type. (#407)
  • Change .convert_csv(...), .convert_json(...), and CLI so that, by default, they return all available object types, rather than those in a predefined default list. (#407)

Fixed

  • Fix .extract_text(...) so that it can accept generator objects as its main parameter. (#385) [h/t @alexreg]
  • Fix page-parsing so that LTAnno objects (which have no bounding-box coordinates) are not extracted. (Was only an issue when setting laparams.) (#388)
  • Fix Page.extract_table(...) so that it honors text tolerance settings (#415) [h/t @trifling]

0.5.27

February 28, 2021

Fixed

  • Fix regression (introduced in 0.5.26/b1849f4) in closing files opened by PDF.open
  • Reinstate access to higher-level layout objects (such as textboxhorizontal) when laparams is passed to pdfplumber.open(...). Had been removed in 0.5.24 via 1f87898. (#359 + #364)

Development Changes

  • Add a python setup.py build sdist test to main GitHub action. (#365)

0.5.26

February 10, 2021

Added

  • Add Page.close/__enter__/__exit__ methods, by generalizing that behavior through the Container class (b1849f4)

Changed

  • Change handling of floating point numbers; no longer convert them to Decimal objects and do not round them
  • Change TableFinder to return tables in order of topmost-and-then-leftmost, rather than leftmost-and-then-topmost (#336)
  • Change Page.to_image()'s handling of alpha layer, to remove aliasing artifacts (#340) [h/t @arlyon]
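
The new table ordering can be pictured as a two-key sort. The sketch below assumes each table is represented only by its (x0, top, x1, bottom) bounding box; order_tables is a hypothetical helper for illustration, not TableFinder's actual code.

```python
def order_tables(bboxes):
    """Hypothetical helper: sort table bounding boxes top-first.

    Each bbox is assumed to be (x0, top, x1, bottom). Sorting on
    (top, x0) gives topmost-and-then-leftmost order; the earlier
    behavior would correspond to sorting on (x0, top).
    """
    return sorted(bboxes, key=lambda bbox: (bbox[1], bbox[0]))


ordered = order_tables([
    (50, 100, 90, 150),   # second row, right
    (10, 100, 40, 150),   # second row, left
    (200, 20, 300, 60),   # first row, far right
])
```

With this ordering, a table near the top of the page comes first even when it sits far to the right.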

Development Changes

  • Enforce psf/black and flake8 on tests/ (#327)

0.5.25

December 9, 2020

Added

  • Add new boolean argument strict_metadata (default False) to pdfplumber.open(...) method for handling metadata resolution failures (f2c510d)

Fixed

  • Fix metadata extraction to handle integer/floating-point values (cb32478) (#297)
  • Fix metadata extraction to handle nested metadata values (2d9415) (#316)
  • Explicitly load text as utf-8 in setup.py (7854328) (#304)
  • Fix pdfplumber.open(...) so that it does not close file objects passed to it (408605f) (#312)

0.5.24

October 20, 2020

Added

Changed

  • Change character attribute upright from int to bool (per original pdfminer.six representation) (1f87898)
  • Remove access and reference to Container.figures, given that they are not fundamental objects (8e74cb9)

Fixed

  • Decimalize "simple" explicit_horizontal_lines/explicit_vertical_lines descs passed to TableFinder methods (bc40779) (#290)

Development Changes

  • Refactor/simplify Page.process_objects (1f87898), utils.extract_words (c8b200e), and convert.serialize (a74d3bc)
  • Remove test_issues.py:test_pr_77 (917467a) and narrow test_ca_warn_report:test_objects (6233bbd) to speed up tests

0.5.23

August 15, 2020

Added

  • Add utils.resolve (non-recursive .resolve_all) (7a90630)
  • Add page.annots and page.hyperlinks, replacing non-functional page.annos, and mirroring pdfminer's language ("annot" vs. "anno"). (aa03961)
  • Add page/pdf.to_json and page/pdf.to_csv (cbc91c6)
  • Add relative=True/False parameter to .crop and .within_bbox; those methods also now raise exceptions for invalid and out-of-page bounding boxes. (047ad34) [h/t @samkit-jain]

Changed

  • Mark pdfplumber.from_path and pdfplumber.load as deprecated; pdfplumber.open is now the canonical way to load a PDF. (00e789b)
  • Simplify the logic in "text" table-finding strategies; in edge cases, may result in changes to results. (d224202)
  • Drop support for Python 3.5 (baf1033)

Fixed

  • Fix .extract_words, which had been returning incorrect results when horizontal_ltr = False (d16aa13)
  • Fix utils.resize_object, which had been failing in various permutations (d16aa13)
  • Fix lines_strict table-finding strategy, which a typo had prevented from being usable (f0c9b85)
  • Fix utils.resolve_all to guard against two known sources of infinite recursion (cbc91c6)

Development Changes

  • Rename default branch to "stable," to clarify its purpose
  • Reformat code with psf/black (1258e09)
  • Add code linting via psf/black and flake8 (1258e09)
  • Switch from nosetests to pytest (1ac16dd)
  • Switch from pipenv to standard requirements.txt + python -m venv (48eaa51)
  • Add GitHub action for tests + codecov (b148fd1)
  • Add Makefile for building development virtual environment and running tests (4c69c58)
  • Add badges to README.md (9e42dc3)
  • Add Trove classifiers for Python versions to setup.py (6946e8d)
  • Add MANIFEST.in (eafc15c)
  • Add GitHub issue templates (c4156d6)
  • Remove pandas from dev requirements and tests (a5e7d7f)

0.5.22

July 18, 2020

Changed

  • Upgraded pdfminer.six requirement to ==20200517 (cddbff7) [h/t @youngquan]

Added

  • Add support for non_stroking_color attribute on char objects (0254da3) [h/t @idan-david]

0.5.21

May 27, 2020

Fixed

  • Fix Page.extract_table(...) to return None instead of crashing when no table is found (d64afa8) [h/t @stucka]

0.5.20

April 29, 2020

Fixed

  • Fix .get_page_image to prefer paths over streams, when possible (ab957de) [h/t @ubmarco]
  • Local-fix pdfminer.six's .resolve_all to handle tuples and simplify (85f422d)

Changed

  • Remove support for Python 2 and Python <3.3

0.5.19

April 16, 2020

Changed

  • Add utils.decimalize performance improvement (830d117) [h/t @ubmarco]

Fixed

  • Fix un-referenced method when using "text" table-finding strategy (2a0c4a2)
  • Add missing object type rect_edge to obj_to_edges() (0edc6bf)

0.5.18

April 1, 2020

Changed

  • Allow rect and curve objects also to be passed to "explicit_..._lines" setting when table-finding. (And disallow other types of dicts to be passed.)

Fixed

  • Fix utils.extract_text bug introduced in prior version

0.5.17

April 1, 2020

Fixed

  • Fix and simplify obj-in-bbox logic (see commit 25672961)
  • Improve/fix the way utils.extract_text handles vertical text (see commit 8a5d858b) [h/t @dwalton76]
  • Have Page.to_image use bytes stream instead of file path (Issue #124 / PR #179) [h/t @cheungpat]
  • Fix issue #176, in which Page.extract_tables did not pass kwargs to Table.extract [h/t @jsfenfen]

0.5.16

January 12, 2020

Fixed

  • Prevent custom LAParams from raising exception (Issue #168 / PR #169) [h/t @frascuchon]
  • Add six as explicit dependency (for now)

0.5.15

January 5, 2020

Changed

  • Upgrade pdfminer.six requirement to ==20200104
  • Upgrade pillow requirement >=7.0.0
  • Remove Python 2.7 and 3.4 from tox tests

0.5.14

October 6, 2019

Fixed

  • Fix sorting bug in page.extract_table()
  • Fix support for password-protected PDFs (PR #138)

0.5.13

August 29, 2019

Fixed

  • Fixed PDF object resolution for rotation (PR #136)

0.5.12

April 14, 2019

Added

  • cdecimal support for Python 2
  • Support for password-protected PDFs

0.5.11

November 13, 2018

Added

  • Caching for .decimalize() method

Changed

  • Upgrade to pdfminer.six==20181108
  • Make whitespace checking more robust (PR #88)

Fixed

  • Fix issue #75 (.to_image() custom arguments)
  • Fix issue raised in PR #77 (PDFObjRef resolution), and general class of problems
  • Fix issue #90, and general class of problems, by explicitly typecasting each kind of PDF Object

0.5.10

August 3, 2018

Fixed

  • Fix bug in which, when calling get_page_image(...), the alpha channel could make the whole page black out.

0.5.9

July 10, 2018

Fixed

  • Fix issue #67, in which bool-type metadata were handled incorrectly

0.5.8

March 6, 2018

Fixed

  • Fix issue #53, in which non-decimalize-able (non_)stroking_color properties were raising errors.

0.5.7

January 20, 2018

Added

  • .travis.yml, but failing on .to_image()

Changed

  • Move from defunct pycrypto to pycryptodome
  • Update pdfminer.six to 20170720

0.5.6

November 21, 2017

Fixed

  • Fix issue #41, in which PDF-object-referenced cropboxes/mediaboxes weren't being fully resolved.

0.5.5

May 10, 2017

Added

  • Access to __version__ from main namespace

Fixed

  • Fix issue #33, by checking decode_text's argument type

0.5.4

April 27, 2017

Fixed

  • Pin pdfminer.six to version 20151013 (for now), fixing incompatibility

0.5.3

February 27, 2017

Fixed

  • Allow import pdfplumber even if ImageMagick not installed.

0.5.2

February 27, 2017

Added

  • Access to curve points. (E.g., page.curves[0]["points"].)
  • Ability for .draw_line to draw curve points.

Changed

  • Disaggregated "min_words_vertical" (default: 3) and "min_words_horizontal" (default: 1), removing "text_word_threshold".
  • Internally, made utils.decimalize a bit more robust; now throws errors on non-decimalizable items.
  • Now explicitly ignoring some (obscure) pdfminer object attributes.
  • Changed raw input for .draw_line from a bounding box to ((x, y), (x, y)), for consistency with curve["points"] and with Pillow's underlying method.

Fixed

  • Fixed typo bug when .rect_edges is called before .edges

0.5.1

February 26, 2017

Added

  • Quick-draw PageImage methods: .draw_vline, .draw_vlines, .draw_hline, and .draw_hlines.
  • Boolean parameter keep_blank_chars for .extract_words(...) and TableFinder settings.

Changed

  • Increased default text_tolerance and intersection_tolerance TableFinder values from 1 to 3.

Fixed

  • Properly handle conversion of PDFs with transparency to pillow images.
  • Properly handle pandas DataFrames as inputs to multi-draw commands (e.g., PageImage.draw_rects(...)).

0.5.0

February 25, 2017

Added

  • Visual debugging features, via Page.to_image(...) and PageImage. (Introduces wand and pillow as package requirements.)
  • More powerful options for extracting data from tables. See changes below.

Changed

  • Entirely overhaul the table-extraction methods. Now based on Anssi Nurminen's master's thesis.
  • Disentangle .crop from .intersects_bbox and .within_bbox.
  • Change default x_tolerance and y_tolerance for word extraction from 5 to 3

Fixed

  • Fix bug stemming from non-decimalized page heights. [h/t @jsfenfen]

0.4.6

January 26, 2017

Added

  • Provide access to Page.page_number

Changed

  • Use .page_number instead of .page_id as primary identifier. [h/t @jsfenfen]
  • Change default x_tolerance and y_tolerance for word extraction from 0 to 5

Fixed

  • Provide proper support for rotated pages

0.4.5

December 9, 2016

Fixed

  • Fix bug stemming from when metadata includes a PostScript literal. [h/t @boblannon]

0.4.4

Whoops.

0.4.3

April 12, 2016

Changed

  • When extracting table cells, use chars' midpoints instead of top-points.

Fixed

  • Fix find_gutters — should ignore " " chars