Getting Started:

This repo is intended to work with the Dataturks NER annotation service.

Requirements:

Set up the environment by running the following command:

pip install -r requirements.txt

Setting up a Dataturks NER Project

You can create a Dataturks project by doing the following:

Go to https://dataturks.com/
Create an account by clicking “sign up” in the top right of the page
Sign in and navigate to https://dataturks.com/projects
Click “Create Dataset” on the left side of the https://dataturks.com/projects page
Select “NER Tagging”
Fill in your project-specific info and submit

Uploading Unlabelled Data

Now that you have created a project you can upload unlabelled data by doing the following:

Navigate to https://dataturks.com/projects
Click “Home” on the left side of the page
Select the NER project you created
Click “Options” at the top right corner of the page and select “Add Data”
Select Upload Raw Data
Choose your formatted .txt file and click submit.
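
The README does not spell out what a "formatted" .txt file looks like. The sketch below assumes one document per line, which is an assumption to check against the Dataturks upload dialog. The helper name write_upload_file is hypothetical and is not part of this repo.

```python
from pathlib import Path


def write_upload_file(documents, path):
    """Write one document per line (assumed upload layout).

    Internal newlines and repeated whitespace are collapsed so that each
    document occupies exactly one line. Empty documents are skipped.
    Returns the number of lines written.
    """
    lines = [" ".join(doc.split()) for doc in documents]
    lines = [line for line in lines if line]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

Calling write_upload_file(my_documents, "upload.txt") would then produce a file that can be chosen in the upload step, if the one-document-per-line assumption holds for your project.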

Downloading Labelled Data from Dataturks

Once you’ve labelled enough data you can download it by:

Navigating to your Dataturks project
Clicking “Options” in the top right corner of the page and selecting “Download”
Selecting “Completed Items” and “Standard NER Format”, then clicking “Download”

Working with this Repo

The code for working with the Dataturks service is split up into three main files:

1. formatting.py

This file is called by the others to format unlabelled and labelled data before passing it to a CRF model.

You can use format_unlabelled_data() for .txt files formatted for upload to Dataturks.
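
The signature and return value of format_unlabelled_data() are not shown in this README, so the following is only an illustration of the kind of step involved: turning lines of raw text into token lists that a CRF can consume. The function tokenize_lines and the regex tokenizer are assumptions, not the repo's implementation.

```python
import re

# Words become one token each; every punctuation mark becomes its own token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize_lines(lines):
    """Turn an iterable of text lines into a list of token lists.

    Blank lines are skipped, so each returned list corresponds to one
    non-empty line of the input.
    """
    return [TOKEN_PATTERN.findall(line) for line in lines if line.strip()]
```

With a file on disk, this could be used as tokenize_lines(open("upload.txt", encoding="utf-8")), where "upload.txt" is a placeholder file name.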

And you can use format_labelled_data() for data that has been annotated and downloaded from Dataturks as a .tsv file.
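
As a rough picture of what reading the labelled download involves, the sketch below parses a token-per-line .tsv into sentences of (token, label) pairs. It assumes each line holds a token and a label separated by a tab, with blank lines between sentences. That layout is an assumption about the "Standard NER Format" export, and read_ner_tsv is a hypothetical helper rather than the repo's format_labelled_data().

```python
def read_ner_tsv(lines):
    """Group token<TAB>label lines into sentences.

    A blank line ends the current sentence. A line with no label is
    given the label "O" (outside any entity).
    """
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        token, _, label = line.partition("\t")
        current.append((token, label or "O"))
    if current:
        sentences.append(current)
    return sentences
```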

2. training.py

You can use training.py to train and evaluate CRF models with labelled data from Dataturks. Call the train_crf() function and pass it the name of the labelled data file to get a baseline CRF model along with information on its performance.
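
The README does not say which features or CRF library train_crf() uses. For orientation, here is a minimal sketch of per-token feature extraction in the dict-per-token form that CRF libraries such as sklearn-crfsuite accept. The feature names and the functions token_features and sentence_features are illustrative assumptions.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position i of a sentence."""
    word = tokens[i]
    features = {
        "bias": 1.0,
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
        "word.isdigit": word.isdigit(),
        "suffix3": word[-3:],
    }
    # Context features: neighbouring words, or sentence boundary markers.
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True
    return features


def sentence_features(tokens):
    """Feature dicts for every token in one sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

A list of such per-sentence feature lists, paired with the matching label sequences, is the usual input for fitting a CRF tagger.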

3. pre_annotate.py

This will load a saved CRF model (from train_crf()) and use it to make predictions for unlabelled data. Pass the saved CRF model, the unlabelled file, and the name of the new save file to the function pre_annotate_unlabelled(). The script will pre-annotate your data and format it so that it can be uploaded to Dataturks immediately.
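
To illustrate the formatting half of pre-annotation, the sketch below converts predicted token labels into character-offset entity spans and wraps them in a JSON record. The record layout ("content", "annotation", "points" with an inclusive "end" offset) is an assumption about the Dataturks pre-annotated upload format and should be checked against the Dataturks documentation. The function names are hypothetical and are not the repo's pre_annotate_unlabelled().

```python
import json


def to_preannotated_record(text, tagged_tokens):
    """Build one record from text and a list of (token, label) pairs.

    Tokens must appear in the text in order. The label "O" means "no
    entity". Adjacent tokens with the same label are merged into one span.
    """
    spans = []
    cursor = 0
    prev_label = "O"
    for token, label in tagged_tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found in text")
        end = start + len(token)  # exclusive end offset
        if label != "O":
            if label == prev_label and spans:
                spans[-1]["end"] = end  # extend the running entity
            else:
                spans.append({"label": label, "start": start, "end": end})
        prev_label = label
        cursor = end
    annotation = [
        {
            "label": [s["label"]],
            # Assumed convention: "end" is the index of the last character.
            "points": [{
                "start": s["start"],
                "end": s["end"] - 1,
                "text": text[s["start"]:s["end"]],
            }],
        }
        for s in spans
    ]
    return {"content": text, "annotation": annotation}


def to_json_line(text, tagged_tokens):
    """Serialize one record as a single JSON line."""
    return json.dumps(to_preannotated_record(text, tagged_tokens))
```

Writing one such JSON line per document gives a file in which each line carries both the text and its predicted entities.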

For a notebook version of this walkthrough see "Sample Workflow.ipynb"

LiamWoodRoberts/Dataturks-NER-Tools