A Text Analytics Toolkit (TextAnalyticsLab/TextLab) for Python
Project description
Current release: TextLab [v0.1.5]
TextAnalyticsLab (TextLab) - a collection of Text Analytics tools for Python.
Introduction
'TextAnalyticsLab'/'TextLab' is a Python package providing a set of text analytics tools for data mining and machine learning projects and for end-to-end text analytics application development. It is compatible and interoperates with the data analysis and manipulation library Pandas, the natural language processing library nltk, the Machine Learning ToolKit (pymltoolkit|mltk), and many other AI and machine learning platforms.
Installation
If the installation fails with dependency issues, rerun the pip install command with --no-dependencies.
Functions
- Text Similarity
- OCR (A wrapper to convert image documents to text using Tesseract-OCR and Ghostscript)
- Text Mining and Information Extraction (in v0.2.0)
- Cleaning Text content
- Web Scraping (in v0.1.6)
- Email Data Extraction
- Classification of Text Content (in v0.2.0)
Usage
Warning: Python Variable, Function or Class names
The Python interpreter has a number of built-in functions (https://docs.python.org/3/library/functions.html). It is possible to overwrite their definitions in your code without the interpreter raising any warning. Therefore, AVOID THESE NAMES as your variable, function, or class names.
abs | all | any | ascii | bin | bool | bytearray | bytes |
callable | chr | classmethod | compile | complex | delattr | dict | dir |
divmod | enumerate | eval | exec | filter | float | format | frozenset |
getattr | globals | hasattr | hash | help | hex | id | input |
int | isinstance | issubclass | iter | len | list | locals | map |
max | memoryview | min | next | object | oct | open | ord |
pow | property | range | repr | reversed | round | set | |
setattr | slice | sorted | staticmethod | str | sum | super | tuple |
type | vars | zip | __import__ |
If you accidentally overwrite a built-in function (e.g. list), delete the overriding name (for example, del list) to restore the built-in definition.
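To illustrate how an overwritten built-in can be restored, here is a minimal sketch. The helper name restore_builtin and the simulated namespace dictionary are assumptions made for this example; they are not part of TextLab.

```python
import builtins


def restore_builtin(name, namespace):
    """Drop a binding that shadows a built-in and return the original built-in.

    `namespace` is any dict-like scope, e.g. globals() in an interactive session.
    """
    if not hasattr(builtins, name):
        raise ValueError(f"{name!r} is not a built-in name")
    # Removing the shadowing entry lets name lookup fall through to builtins again.
    namespace.pop(name, None)
    return getattr(builtins, name)


# Simulate a module namespace in which `list` was accidentally overwritten.
namespace = {"list": [1, 2, 3]}
original = restore_builtin("list", namespace)
```

In an interactive session the same effect is usually achieved with restore_builtin("list", globals()), or simply with del list at module level.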
Text Analytics Example
Text Similarity
text1
text2
Output:
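As an illustration of what a similarity score between two texts measures, here is a minimal sketch of cosine similarity over word counts, using only the Python standard library. It is not TextLab's own API: the function name cosine_similarity, the tokenization rule, and the sample strings are assumptions chosen for this example.

```python
import math
import re
from collections import Counter


def cosine_similarity(text1, text2):
    """Cosine similarity between the word-count vectors of two strings (0.0 to 1.0)."""
    counts1 = Counter(re.findall(r"\w+", text1.lower()))
    counts2 = Counter(re.findall(r"\w+", text2.lower()))
    # Dot product over the words the two texts share.
    dot = sum(counts1[w] * counts2[w] for w in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm1 * norm2)


# Hypothetical sample inputs.
text1 = "The quick brown fox jumps over the lazy dog"
text2 = "A quick brown dog jumps over the lazy fox"
score = cosine_similarity(text1, text2)
```

A bag-of-words score like this ignores word order, so the two sample sentences score close to 1.0 even though their meanings differ.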
Processing Documents
OCR Test
PDF To Image
Process PDF file and store results in Pandas DataFrame
Convert PDF or Image file to text
Convert PDF to Image DataFrame
Apply OCR on Images DataFrame
Email Data Extraction
EML file
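For a sense of what extracting data from an EML file involves, here is a minimal sketch built on Python's standard email package. It does not show TextLab's own functions: the helper name extract_email_fields and the inlined sample message are assumptions for illustration.

```python
from email import message_from_string
from email.policy import default

# Hypothetical EML content, inlined so the example is self-contained.
RAW_EML = (
    "From: Alice <alice@example.com>\n"
    "To: Bob <bob@example.com>\n"
    "Subject: Quarterly report\n"
    "Date: Sat, 21 Dec 2019 10:00:00 +0000\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Please find the figures attached.\n"
)


def extract_email_fields(raw):
    """Parse raw EML text and return the main headers and the plain-text body."""
    msg = message_from_string(raw, policy=default)
    # get_body() picks the best plain-text part, also for multipart messages.
    body = msg.get_body(preferencelist=("plain",))
    return {
        "from": str(msg["From"]),
        "to": str(msg["To"]),
        "subject": str(msg["Subject"]),
        "date": str(msg["Date"]),
        "body": body.get_content().strip() if body is not None else "",
    }


record = extract_email_fields(RAW_EML)
```

A list of such dictionaries can be passed to pandas.DataFrame to get one row per message.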
Read from Exchange Web Services (using exchangelib)
License
Text Analytics Project Timeline
- 2018-07-10 [v0.0.1]: Initial set of functions for text data analysis published to GitHub (https://github.com/sptennak/TextAnalytics).
- 2019-01-03 [v0.0.2]: More functions for data exploration, including web scraping and geospatial data analysis, created for the IBM Coursera Data Science Capstone Project and published to GitHub (https://github.com/sptennak/Coursera_Capstone).
- 2019-07-20 [v0.1.2]: First release of the 'TextLab' Text Analytics Python package to PyPI.
- 2019-11-10 [v0.1.3]: Enhancements and bug fixes. Integrated a wrapper to convert image documents to text using Tesseract-OCR and Ghostscript. This module was initially developed as part of the IBM Coursera Advanced Data Science Professional Certificate Capstone Project (https://github.com/sptennak/IBM-Coursera-Advanced-Data-Science-Capstone), but was not used in the final version because text analytics was omitted from the final deliverable.
- 2019-11-16 [v0.1.4]: Bug fixes and enhanced document processing functions. Integrated the Document Server API with the OCR function.
- 2019-12-21 [v0.1.5]: Integrated email data extraction functions and cleaning text content.
Future Release Plan
- TBD [v0.1.6]: Integrate web scraping functions. Comprehensive documentation. Major bug-fix version of the initial release with some enhancements.
- TBD [v0.1.6]: Enhance information extraction functionality. Add support for more of the available open-source tools (OCR, image converters, etc.).
- TBD [v0.2.0]: Integrate Text Mining, Information Extraction, and Classification.
- TBD [v0.3.0]: End-to-end Text Analytics Application Development
References
Other helpful Text Analytics and Natural Language Processing Python libraries
Release history
0.1.5
0.1.4
0.1.3
0.1.2
Download files
Filename, size | File type | Python version |
---|---|---|
TextLab-0.1.5-py3-none-any.whl (36.6 kB) | Wheel | py3 |
TextLab-0.1.5.tar.gz (33.2 kB) | Source | None |
Hashes for TextLab-0.1.5-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | 4f9c9ea09ac82f45b605b5e2dc20a6ec11dfc84da7248ae0bfebedd27bac1890 |
MD5 | 701dd2ff00bec37a2e4a7f88f8beb534 |
BLAKE2-256 | d9a88be0b72897f798942d3621fbebad1403de51a9d47e030f702936afa8ced6 |
Hashes for TextLab-0.1.5.tar.gz
Algorithm | Hash digest |
---|---|
SHA256 | 7fb17f325a0e2bcbe91b9fbf3b51253490561ebf8e10859bf6f87a6e82fe41d6 |
MD5 | 65574e64708e54aca77a6122a2add8ca |
BLAKE2-256 | 1bf26c09be280148baa4ea7738b96ec33a9ffba6e5ec88f5b2d4a8becf79c30b |