"matrix" property to char objects, representing the current transformation matrix. (ae6f99e)
pdfplumber.ctm submodule with class CTM, to calculate scale, skew, and translation of a current transformation matrix obtained from a char's "matrix" property. (ae6f99e)
page.search(...), an experimental feature that allows you to search a page's text via regular expressions and non-regex strings, returning the text, any regex matches, the bounding box coordinates, and the char objects themselves. (#201 + 58b1ab1)
--include-attrs/--exclude-attrs to CLI (and corresponding params to .to_json(...), .to_csv(...), and Serializer). (4deac25)
py.typed for PEP561 compatibility and detection of typing hints by mypy. (ca795d1) [h/t @jhonatan-lopes]
pdfminer.six version to 20220524. (486cea8)
utils.collate_chars(...), the old name (and then alias) for utils.extract_text(...). (24f3532)
mypy --strict. (cdfdb87)
TableSettings class, a behind-the-scenes handler for managing and validating table-extraction settings. (9587cc7)
.to_csv(...) and .to_json(...) from types to object_types. (9587cc7)
.to_json(...) so that, if an object type is not present for a given page, it has no key in the page's object representation. (9587cc7)
utils.filter_objects(...) and move the functionality to within the FilteredPage.objects property calculation, the only part of the library that used it. (9587cc7)
pdfminer.pdftypes.STRICT = True and pdfminer.pdfinterp.STRICT = True, since that has now been the default for a while. (9587cc7)
.extract_text(layout=True), an experimental feature which attempts to mimic the structural layout of the text on the page. (#10)
utils.merge_bboxes(bboxes), which returns the smallest bounding box that contains all bounding boxes in the bboxes argument. (f8d5e70)
--precision argument to CLI (#520)
snap_x_tolerance and snap_y_tolerance to table extraction settings. (#51 + #475) [h/t @dustindall]
join_x_tolerance and join_y_tolerance to table extraction settings. (cbb34ce)
pdfminer.six from 20200517 to 20211012; see that library's changelog for details, but a key difference is an improvement in how it assigns line, rect, and curve objects. (Diagonal two-point lines, for instance, are now line objects instead of curve objects.) (#515)
pdfminer.six (#346 + #520)
.extract_text(...) returns "" instead of None when character list is empty. (#482 + cb9900b) [h/t @tungph]
.extract_words(...) now includes doctop among the attributes it returns for each word. (66fef89)
text_strategy, so that it uses the top and bottom of every word, not just the top of every word and the bottom of the last. (#467 + #466 + #265) [h/t @bobluda + @samkit-jain]
table.merge_edges(...) behavior when join_tolerance (and x/y variants) <= 0, so that joining is attempted regardless, to handle cases of overlapping lines. (cbb34ce)
.extract_words(...)/WordExtractor.iter_chars_to_words(...) on very long words, caused by repeatedly re-calculating bounding box. (#483)
UnicodeDecodeError when trying to decode utf-16-encoded annotations (#463) [h/t @tungph]
(text|intersection)_(x|y)_tolerance settings. (#539) [h/t @yoavxyoav]
pdfplumber.load(...) method, which has been deprecated since 0.5.23 (54cbbc5)
--laparams flag to CLI. (#407)
.convert_csv(...) to order objects first by page number, rather than object type. (#407)
.convert_csv(...), .convert_json(...), and CLI so that, by default, they return all available object types, rather than those in a predefined default list. (#407)
.extract_text(...) so that it can accept generator objects as its main parameter. (#385) [h/t @alexreg]
LTAnno objects (which have no bounding-box coordinates) are not extracted. (Was only an issue when setting laparams.) (#388)
Page.extract_table(...) so that it honors text tolerance settings (#415) [h/t @trifling]
0.5.26/b1849f4) in closing files opened by PDF.open
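The utils.merge_bboxes(bboxes) entry above describes returning the smallest bounding box that contains every box passed in. A minimal stand-alone sketch of that idea follows. It is not the library's implementation: the function name merge_bboxes_sketch is hypothetical, and it assumes each box is an (x0, top, x1, bottom) tuple.

```python
from typing import Iterable, Tuple

# Assumption: a bounding box is an (x0, top, x1, bottom) tuple.
BBox = Tuple[float, float, float, float]


def merge_bboxes_sketch(bboxes: Iterable[BBox]) -> BBox:
    """Return the smallest box that contains every box in `bboxes`.

    Hypothetical helper for illustration only.
    """
    boxes = list(bboxes)
    if not boxes:
        raise ValueError("bboxes must contain at least one box")
    # Transpose the boxes into four columns, one per coordinate.
    x0s, tops, x1s, bottoms = zip(*boxes)
    # The enclosing box takes the extreme value of each coordinate.
    return (min(x0s), min(tops), max(x1s), max(bottoms))


merged = merge_bboxes_sketch([(10, 20, 50, 40), (30, 5, 80, 25)])
print(merged)
```

The sketch raises on an empty input because an empty collection has no enclosing box. Whether the library does the same is not stated in the entry.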
textboxhorizontal) when laparams is passed to pdfplumber.open(...). Had been removed in 0.5.24 via 1f87898. (#359 + #364)
python setup.py build sdist test to main GitHub action. (#365)
Page.close/__enter__/__exit__ methods, by generalizing that behavior through the Container class (b1849f4)
Decimal objects and do not round them
TableFinder to return tables in order of topmost-and-then-leftmost, rather than leftmost-and-then-topmost (#336)
Page.to_image()'s handling of alpha layer, to remove aliasing artifacts (#340) [h/t @arlyon]
psf/black and flake8 on tests/ (#327)
strict_metadata (default False) to pdfplumber.open(...) method for handling metadata resolution failures (f2c510d)
setup.py (7854328) (#304)
pdfplumber.open(...) so that it does not close file objects passed to it (408605f) (#312)
extra_attrs=[...] parameter to .extract_text(...) (c8b200e) (#28)
utils/page.dedupe_chars(...) (04fd56a + b132d45) (#71)
upright from int to bool (per original pdfminer.six representation) (1f87898)
Container.figures, given that they are not fundamental objects (8e74cb9)
explicit_horizontal_lines/explicit_vertical_lines descs passed to TableFinder methods (bc40779) (#290)
utils.resolve (non-recursive .resolve_all) (7a90630)
page.annots and page.hyperlinks, replacing non-functional page.annos, and mirroring pdfminer's language ("annot" vs. "anno"). (aa03961)
page/pdf.to_json and page/pdf.to_csv (cbc91c6)
relative=True/False parameter to .crop and .within_bbox; those methods also now raise exceptions for invalid and out-of-page bounding boxes. (047ad34) [h/t @samkit-jain]
pdfminer.from_path and pdfminer.load as deprecated; now pdfminer.open is the canonical way to load a PDF. (00e789b)
.extract_words, which had been returning incorrect results when horizontal_ltr = False (d16aa13)
utils.resize_object, which had been failing in various permutations (d16aa13)
lines_strict table-finding strategy, which a typo had prevented from being usable (f0c9b85)
utils.resolve_all to guard against two known sources of infinite recursion (cbc91c6)
pandas from dev requirements and tests (a5e7d7f)
Page.extract_table(...) to return None instead of crashing when no table is found (d64afa8) [h/t @stucka]
rect and curve objects also to be passed to "explicit_..._lines" setting when table-finding. (And disallow other types of dicts to be passed.)
utils.extract_text bug introduced in prior version
utils.extract_text handles vertical text (see commit 8a5d858b) [h/t @dwalton76]
Page.to_image use bytes stream instead of file path (Issue #124 / PR #179) [h/t @cheungpat]
Page.extract_tables did not pass kwargs to Table.extract [h/t @jsfenfen]
pdfminer.six requirement to ==20200104
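The TableFinder ordering entry above (#336) contrasts topmost-and-then-leftmost with leftmost-and-then-topmost. The difference comes down to which coordinate leads the sort key, as the rough sketch below shows. It assumes tables are represented only by (x0, top, x1, bottom) tuples, and the function names are hypothetical, not taken from the library.

```python
from typing import List, Tuple

# Assumption: each table is reduced to its (x0, top, x1, bottom) bounding box.
BBox = Tuple[float, float, float, float]


def order_topmost_then_leftmost(bboxes: List[BBox]) -> List[BBox]:
    """Sort by top edge first, breaking ties with the left edge."""
    return sorted(bboxes, key=lambda b: (b[1], b[0]))


def order_leftmost_then_topmost(bboxes: List[BBox]) -> List[BBox]:
    """Sort by left edge first, breaking ties with the top edge."""
    return sorted(bboxes, key=lambda b: (b[0], b[1]))


upper_right = (300, 50, 500, 150)
upper_left = (40, 60, 250, 160)
lower_left = (40, 400, 250, 500)
tables = [lower_left, upper_right, upper_left]

# Reading order: both upper tables come before the lower one.
print(order_topmost_then_leftmost(tables))
# Column order: everything in the left column comes first.
print(order_leftmost_then_topmost(tables))
```

With topmost-first ordering, a table in the right-hand column that starts slightly higher on the page is returned before its left-hand neighbor, which is usually closer to reading order.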
pillow requirement >=7.0.0
tox tests
page.extract_table()
cdecimal support for Python 2
.decimalize() method
pdfminer.six==20181108
.travis.yml, but failing on .to_image()
pycrypto to pycryptodome
pdfminer.six to 20170720
__version__ from main namespace
decode_text's argument type
pdfminer.six to version 20151013 (for now), fixing incompatibility
import pdfplumber even if ImageMagick not installed.
curve points. (E.g., page.curves[0]["points"].)
.draw_line to draw curve points.
utils.decimalize a bit more robust; now throws errors on non-decimalizable items.
pdfminer object attributes.
.draw_line from a bounding box to ((x, y), (x, y)), for consistency with curve["points"] and with Pillow's underlying method.
.rect_edges is called before .edges
PageImage methods: .draw_vline, .draw_vlines, .draw_hline, and .draw_hlines.
keep_blank_chars for .extract_words(...) and TableFinder settings.
text_tolerance and intersection_tolerance TableFinder values from 1 to 3.
pillow images.
pandas DataFrames as inputs to multi-draw commands (e.g., PageImage.draw_rects(...)).
Page.to_image(...) and PageImage. (Introduces wand and pillow as package requirements.)
.crop from .intersects_bbox and .within_bbox.
x_tolerance and y_tolerance for word extraction from 5 to 3
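To illustrate why the x_tolerance value matters for word extraction, here is a deliberately simplified sketch. It assumes a single line of text, ignores y_tolerance entirely, and starts a new word whenever the horizontal gap between consecutive characters exceeds x_tolerance. The function name and the char dictionaries are hypothetical, and the library's actual grouping logic handles more cases than this.

```python
from typing import Any, Dict, List

# Assumption: each char is a dict with "text", "x0", and "x1" keys.
Char = Dict[str, Any]


def group_chars_sketch(chars: List[Char], x_tolerance: float = 3) -> List[str]:
    """Group chars on one line into words using a horizontal gap threshold."""
    words: List[str] = []
    current: List[Char] = []
    for char in sorted(chars, key=lambda c: c["x0"]):
        # A gap wider than x_tolerance closes the current word.
        if current and char["x0"] > current[-1]["x1"] + x_tolerance:
            words.append("".join(c["text"] for c in current))
            current = []
        current.append(char)
    if current:
        words.append("".join(c["text"] for c in current))
    return words


chars = [
    {"text": "a", "x0": 0, "x1": 5},
    {"text": "b", "x0": 7, "x1": 12},   # gap of 2 after "a"
    {"text": "c", "x0": 16, "x1": 21},  # gap of 4 after "b"
]

for tolerance in (0, 3, 5):
    print(tolerance, group_chars_sketch(chars, x_tolerance=tolerance))
```

In this toy input, a tolerance of 0 splits every character apart, 5 merges everything into one word, and 3 sits in between, which is the kind of trade-off the default changes in this changelog appear to be tuning.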
Page.page_number
.page_number instead of .page_id as primary identifier. [h/t @jsfenfen]
x_tolerance and y_tolerance for word extraction from 0 to 5
Whoops.
" " chars