Collaborative data collection & cleaning tools?

asked 2015-01-26 10:48:38 -0500

Texas_Bethany

Hi everyone,

I'm working with a group of folks from NGOs and local health departments interested in the health and safety of low-wage workers in the US. We're trying to figure out the best way to collaboratively collect data about workplace deaths.

The goal: We're interested in documenting the deaths of workers throughout the US from media reports, individuals, and government data. All of us have been collecting these sources individually for a while, but we'd like to set up a collaborative system for data collection & sharing. (Note: This data is collected by the Bureau of Labor Statistics, but they will not release much of it to the public. They will in no way release information about the businesses responsible for worker deaths.)

The problem: We have a real issue with duplicate records, and identifying duplicates is not simple since our data has no unique identifiers. We've managed to eliminate duplicates using OpenRefine & Microsoft Access, but it has been a time-consuming process.

The question for you: Does anyone know of any good, cloud-based tools for collaborative data collection, cleaning, and sharing?

Many thanks! Bethany Boggess

4 Answers

answered 2015-03-10 03:55:16 -0500

gunnargr

updated 2015-03-16 03:03:23 -0500

As you know, OpenRefine is an excellent tool for cleaning data (with extensions for RDF). Refine originated from the creative and highly skilled milieu around Simile and Freebase (now Wikidata). There is currently a collaborative effort in public beta for RefinePro. Maybe RefinePro could work for your needs?
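For anyone wondering how records can be matched when there is no unique identifier, here is a minimal Python sketch of key-collision matching, loosely modeled on the "fingerprint" idea used in OpenRefine's clustering. It is a simplified approximation, not OpenRefine's actual implementation, and the field names (`name`, `date`, `city`) and sample records are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(text):
    """Simplified fingerprint key: ASCII-fold, lowercase,
    drop punctuation, then sort the unique tokens."""
    folded = (
        unicodedata.normalize("NFKD", text)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    cleaned = re.sub(r"[^\w\s]", "", folded.lower())
    return " ".join(sorted(set(cleaned.split())))


def find_duplicate_groups(records, fields):
    """Group records whose fingerprints agree on every listed field."""
    groups = defaultdict(list)
    for record in records:
        key = "|".join(fingerprint(record[f]) for f in fields)
        groups[key].append(record)
    # Only keys shared by two or more records are duplicate candidates.
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical sample records, for illustration only.
records = [
    {"name": "José Martinez", "date": "2014-11-03", "city": "Houston"},
    {"name": "Martinez, Jose", "date": "2014-11-03", "city": "houston"},
    {"name": "Ann Lee", "date": "2014-12-01", "city": "Austin"},
]

dupes = find_duplicate_groups(records, ["name", "date", "city"])
for group in dupes:
    print("Possible duplicates:", [r["name"] for r in group])
```

Exact key matching like this only catches differences in case, accents, punctuation, and word order. Misspelled names or conflicting dates would still need fuzzier matching and a manual review step.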

answered 2015-01-27 04:04:17 -0500

You could consider Semantic MediaWiki. Take a look at their Semantic MediaWiki of the Month list for possible similar examples.

answered 2015-01-27 04:09:28 -0500

You could consider using the Tabular Data Manager service offered by the iMarine EU project.

Tabular Data Manager supports managing the entire life cycle (creation, curation, manipulation, and publication) of Tabular Resources such as datasets, code lists, or generic tables, i.e. tabular data representing observations of a given event or phenomenon at different time intervals. Tabular Resources are used in many domains, ranging from statistics to signal processing and econometrics. Its facilities range from assessing data correctness and verifying that data complies with given code lists to aggregating and filtering data.

It offers customizable, fine-grained duplicate detection and management. It is also designed to support collaborative management of a dataset through sharing.

answered 2015-03-26 09:34:04 -0500

gunnargr

Also check out DAT, an open source project for building automated, reproducible data pipelines that sync. A beta release is under active development now.

