How can one share data for open science?

Question

How can one share data for open science?


10 Answers

Tom Hardwicke · Answer 1 · 2015-08-04T17:05:52+0000

There are many options for sharing general scientific data and code, such as GitHub, Figshare, and Dataverse. There are also some discipline-specific services, such as the OpenfMRI project for neuroimaging data.

If you want to share both data and code, and keep them together, then from personal experience I can recommend The Open Science Framework. You create a 'project' for your study with a series of 'components' that can represent code, data, manuscripts, protocols, and pretty much anything else you can think of.

Among other features, the site has built-in version control, and you can plug in various external services, including those mentioned above. Another nice feature is the ability to keep your project private initially and share it via view-only links (e.g., with collaborators or reviewers prior to publication). It is also entirely free and run by a non-profit (COS). There is a preservation fund to ensure that your data, code, etc. will survive even if the organization does not.

This post has been migrated from the Open Science private beta at StackExchange (A51.SE)

jojo · Answer 2 · 2015-08-04T16:22:26+0000

If you don't want to rely on companies or institutions (such as DataHub, figshare, Dryad, or many others), one option is simply to include a torrent file in the Git (or other version control system) project.

To ensure data integrity, a checksum file can be added.
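As a rough illustration of the checksum idea, here is a minimal Python sketch that writes a SHA-256 manifest for a data directory and later checks files against it. The function names and the `SHA256SUMS` file name are assumptions made for this example, not something the answer prescribes; the "digest, two spaces, relative path" layout is meant to resemble the output of common checksum tools.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir: Path, manifest_name: str = "SHA256SUMS") -> Path:
    """Write one '<digest>  <relative path>' line per file under data_dir."""
    manifest = data_dir / manifest_name
    lines = []
    for path in sorted(data_dir.rglob("*")):
        # Skip directories and the manifest itself.
        if path.is_file() and path != manifest:
            relative = path.relative_to(data_dir).as_posix()
            lines.append(f"{sha256_of(path)}  {relative}\n")
    manifest.write_text("".join(lines), encoding="utf-8")
    return manifest


def verify_manifest(manifest: Path) -> List[str]:
    """Return the relative paths that are missing or whose digest has changed."""
    mismatches = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        expected, relative = line.split("  ", 1)
        target = manifest.parent / relative
        if not target.is_file() or sha256_of(target) != expected:
            mismatches.append(relative)
    return mismatches
```

Committing the manifest next to the torrent file lets anyone who downloads the data call `verify_manifest` and see which files, if any, differ from what was published.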


HDE 226868 · Answer 3 · 2015-08-04T16:02:22+0000

The Royal Society has an open science journal, part of which is about sharing open data. They suggest:

Datasets should be deposited in an appropriate, recognized, publicly available repository.

DataCite has a list of repositories for different fields. It is, however, simply a list.

An example of an open science data repository is GenBank, operated by the NIH to store publicly available DNA sequences. Government repositories like GenBank are generally well maintained because the data they contain are scientifically quite valuable. They are certainly good choices, provided that you can find one that specializes in the topic you are working on. The NIH also runs repositories for other subjects, so it is a good place to start looking.


Daniel Standage · Answer 4 · 2015-08-04T16:38:02+0000

Aside from domain-specific databases like GenBank, generic data repositories like figshare and Dryad are a great choice for persistent storage of open science research outputs. Both allow anyone to create accounts and upload/manage large data files.


m0nhawk · Answer 5 · 2015-08-04T16:40:40+0000

The DataHub project is powered by CKAN and can be used to share and publish data online.


Karsten 7. · Answer 6 · 2015-08-04T17:47:06+0000

For large datasets, the Open Science Data Cloud (OSDC) provides resources for storing, sharing, and analyzing scientific datasets. One has to fill out a short proposal to get an OSDC resource allocation for one's project. Allocations start at 16 dedicated cores and 1 TB of storage.


Franck Dernoncourt · Answer 7 · 2015-08-04T18:49:51+0000

This question has been addressed on OpenData SE, which might give interesting pointers:

Excerpt from the links (cc by-sa 3.0 with attribution required, user http://opendata.stackexchange.com/users/881/badroit):

Suppose that I have some sort of specialized data, perhaps that I've collected myself or been a part of the collection. And suppose that nothing prevents me from handing this data out to people. In what method should I go about distributing/storing this data so that others will be able to find it and use it, whenever this time may be?

Targeting specialised repositories as per @Joe's answer is indeed an excellent way to go about disseminating data, but what if no such specialised repository exists or you do not wish to target only one specific community in particular?

A methodology to expose Open Data using generic principles is the 5-star Open Data scheme originally proposed by Tim Berners-Lee here.

The core rationale of 5-star Open Data is that you make your data more easily accessible, processable and interoperable with each successive star:

★ Put your data on the Web in some format with an Open Licence. People can access it through their browsers and spend some time to figure out how they can download/access/process/use it. (Avoid problems for your client like this.)

★★ Put your data in a machine-processable format. For example, having a table in Excel is better than having a snapshot printed in PDFs or images because people can download it and start running experiments over it. (Avoid problems like this.)

★★★ Use non-proprietary formats. For example, providing data as a CSV is often better than as an Excel file because CSV can be directly processed by a wider range of (free/open source) tools and programming languages. (Can't find anyone complaining about Excel on here yet but, e.g., this is a similar problem.)

★★★★ Use URIs to denote things. For example, let's say you provide a bunch of pollution measures for cities and somebody would like to specifically reference the pollution measure for London. Assigning a URI for London in your local data provides a global unique identifier for that city that people can reference and point to. There are, for example, related proposals for embedding URI fragment identifiers in CSV files. (Avoid problems like this or this.)

★★★★★ Link your data to other data to provide context. So you have created a URI for London in your data and people can point to it. However, which London are you referring to? London, England or London, Ontario? If you link your local URI for London to the Wikipedia page about the London to which you refer (or, even better, to the DBpedia URI for the specific place to which you refer), this provides context as to what you mean. (Avoid problems like this.)

The shift from ★★★ to ★★★★(★) is quite an ambitious one and technical proposals are still being made on how best to achieve this, but five star Open Data is great because now your data are available on the Web under open licences with open structured formats where everything of importance is given a URI that can be referenced and linked across the Web, allowing for future discovery and re-use. A common methodology to create five star Open Data (again proposed by Tim Berners-Lee) is Linked Data, which assumes RDF as a common interoperable data format. But if that all sounds too much, getting as far as ★★★ data is still great.
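To make the step from ★★★ towards ★★★★★ a little more concrete, here is a small Python sketch that writes pollution measures as CSV, gives each city its own URI, and links that URI to a DBpedia resource so the two Londons are not confused. The `https://example.org/city/` base, the column names, and the numbers are all assumptions invented for illustration; they are placeholders, not real measurements or an established vocabulary.

```python
import csv
import io

# Hypothetical base for locally minted URIs; replace with a domain you control.
LOCAL_BASE = "https://example.org/city/"

# Placeholder rows for illustration only; the pm10 values are not real data.
ROWS = [
    {"city": "london-england", "label": "London", "pm10": 20,
     "same_as": "http://dbpedia.org/resource/London"},
    {"city": "london-ontario", "label": "London", "pm10": 15,
     "same_as": "http://dbpedia.org/resource/London,_Ontario"},
]


def to_linked_csv(rows) -> str:
    """Serialise rows as CSV with a local URI per city and an external link."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["uri", "label", "pm10", "same_as"],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "uri": LOCAL_BASE + row["city"],  # four stars: a URI for the thing
            "label": row["label"],
            "pm10": row["pm10"],
            "same_as": row["same_as"],        # five stars: a link for context
        })
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_linked_csv(ROWS), end="")
```

Both rows share the label "London", but their `uri` and `same_as` values differ, which is the disambiguation the five-star scheme is after. A full Linked Data publication would normally go further and use RDF, as the excerpt notes.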

Again, you can check out this description of 5 Star Open Data for more information and a related question here.

A useful resource for the generic cataloguing of Open Datasets is the CKAN project, where the related DataHub repository is a great place to list and publicise your dataset. You can check out a bunch of 5-star Open Datasets here.


kenorb · Answer 8 · 2015-08-07T11:52:15+0000

CKAN

If you want to make your data open and available, consider CKAN, an open source data portal software that makes your data discoverable and presentable. Each dataset is given its own page with a rich collection of metadata, making it a valuable and easily searchable resource. Check the demo.

This solution is already used by private and government organisations and entities (see the case studies), such as:

data.gov.uk - UK Government’s official open data portal,
publicData.eu - a research prototype for EU data catalogue and federation mechanism,
Helsinki Region Infoshare online service - aims to make regional information quickly and easily accessible to all.
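Part of what makes CKAN datasets "easily searchable" is that a portal also exposes its catalogue through the CKAN Action API. The following Python sketch builds a `package_search` request and pulls dataset titles out of the JSON reply. The portal address `https://ckan.example.org` is a hypothetical placeholder, and the helper names are invented for this example; individual portals may restrict or customise their API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def package_search_url(portal: str, query: str, rows: int = 5) -> str:
    """Build the Action API URL for a free-text dataset search."""
    params = urlencode({"q": query, "rows": rows})
    return f"{portal.rstrip('/')}/api/3/action/package_search?{params}"


def dataset_titles(response: dict) -> list:
    """Extract dataset titles (falling back to names) from a search reply."""
    if not response.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [d.get("title") or d["name"] for d in response["result"]["results"]]


def search(portal: str, query: str, rows: int = 5) -> list:
    """Query a live portal; requires network access to a real CKAN instance."""
    with urlopen(package_search_url(portal, query, rows)) as reply:
        return dataset_titles(json.load(reply))


if __name__ == "__main__":
    # Hypothetical portal address, shown only to illustrate the URL shape.
    print(package_search_url("https://ckan.example.org", "air quality"))
```

Keeping URL construction and response parsing separate from the network call makes it easy to try the parsing against a saved reply before pointing `search` at a real portal.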

GitHub

If you are used to code repositories and want to keep both open code and open data in the same place, consider GitHub and its Git extension for versioning large files, Git Large File Storage (LFS). In this way you can version large files (even those as large as a couple of GB) with Git.

On OS X you can easily install it via: brew install git-lfs.
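Once Git LFS is installed, the file patterns it manages are recorded in the repository's `.gitattributes` file, which is what `git lfs track` updates. As a hedged sketch, the Python helper below scans a working directory for files above a size threshold and proposes the corresponding patterns and attribute lines. The function names and the threshold are assumptions for illustration; in practice you would more likely run `git lfs track` yourself and review the result.

```python
from pathlib import Path
from typing import List

# Attributes Git LFS associates with a tracked pattern in .gitattributes.
LFS_ATTRIBUTES = "filter=lfs diff=lfs merge=lfs -text"


def large_file_patterns(root: Path, threshold_bytes: int) -> List[str]:
    """Return '*.ext' patterns for extensions of files at or above the threshold."""
    suffixes = set()
    for path in root.rglob("*"):
        # Ignore Git's own object store.
        if ".git" in path.relative_to(root).parts:
            continue
        if path.is_file() and path.suffix and path.stat().st_size >= threshold_bytes:
            suffixes.add(path.suffix)
    return [f"*{suffix}" for suffix in sorted(suffixes)]


def gitattributes_lines(patterns: List[str]) -> List[str]:
    """Format patterns as .gitattributes entries for Git LFS."""
    return [f"{pattern} {LFS_ATTRIBUTES}" for pattern in patterns]


if __name__ == "__main__":
    # Propose LFS patterns for files of 50 MB or more in the current directory.
    for line in gitattributes_lines(large_file_patterns(Path("."), 50 * 1024 * 1024)):
        print(line)
```

Tracking by extension is only one possible policy; a pattern can also name a single file or a directory, which may suit repositories where only a few data files are large.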


fsolt · Answer 9 · 2015-08-08T14:21:53+0000

@Tom Hardwicke has already mentioned Dataverse in passing, but it deserves a real plug. I'd guess it especially deserves your consideration if you're working in the social sciences--it definitely led the way in open data in political science (my discipline) a decade or so ago--but plenty of others use it too (it's among the repositories recommended by PLOS, for example). It provides persistent identifiers (Handle, DOI) for each dataset, archives old versions, and provides an easy way to track which files have changed across versions.

I've had a data-sharing project on Dataverse since 2008, and I've found it to be a very good platform for getting my work, both data and code, into the hands of other researchers.


Benteh · Answer 10 · 2015-08-11T07:41:42+0000

I would like to add a Mellon-funded project being developed as we speak: vega publishing

It is in early development, and they are open to suggestions. I think you/we should engage them here and/or send them suggestions for how it can and should be made possible.

