
This is based on this Definition question.

The Panton Principles state:

Many widely recognized licenses are not intended for, and are not appropriate for, data or collections of data. A variety of waivers and licenses that are designed for and appropriate for the treatment of data are described here. Creative Commons licenses (apart from CCZero), GFDL, GPL, BSD, etc are NOT appropriate for data and their use is STRONGLY discouraged.

Why is this? What is wrong with using Creative Commons licenses for data?


This post has been migrated from the Open Science private beta at StackExchange (A51.SE)


Here are two blog posts from several years back that address this question at a useful level of detail:
- "CC BY and data: Not always a good fit" at http://osc.universityofcalifornia.edu/2016/09/cc-by-and-data-not-always-a-good-fit/
- "Why we should publish our data under Creative Commons Zero (CC0)" at http://www.canadensys.net/2012/why-we-should-publish-our-data-under-cc0 .

3 Answers

Best answer

Some of the answers to this question were made obsolete by version 4 of the CC license construction kit. Before that version appeared, the following caveat applied (source):

CC licenses can and should be used for data and databases — with the important caveat that CC 3.0 license conditions do not apply to uses of data and databases that do not implicate copyright.

Since CC 4.0, CC0 is generally recommended for releasing scientific data and other open data to the public domain (to the extent that the laws of the various jurisdictions across the globe permit): https://wiki.creativecommons.org/wiki/CC0_use_for_data

9 like 1 dislike
by (1.2k points)

Well, this gets complicated and legal. (Caveat: I am not a lawyer.) According to Creative Commons, their licenses:

give everyone from individual creators to large companies and institutions a simple, standardized way to grant copyright permissions to their creative work.

In short, CC licenses apply to creative works and are meant to relax or waive the copyright protections automatically guaranteed to authors (e.g., by common law tradition in Commonwealth countries).

The applicability of CC licenses to data depends on whether data can be copyrighted. If data cannot be copyrighted, then there is no point to putting a CC license on them because those licenses waive rights that the data creators do not have.

So what kinds of works are protected by copyright? Though laws vary across jurisdictions (and thus make this question difficult to answer), two important principles are the "Idea-Expression divide" and "the threshold of originality". In the former, only expressions of ideas can be copyrighted, while ideas themselves cannot be. In the latter, among expressions, only those that are original are protected (thus reproductions of works do not earn copyright protection de novo).

Thus data have copyright protection only if they are an expression of an idea rather than the idea itself, and only if they are not simply "facts" (i.e., they are something sufficiently original).

  • In the United States, this almost universally means that data cannot be copyrighted. A classic legal case here is Feist Publications, Inc., v. Rural Telephone Service Co., which ruled that telephone number listings in a phonebook are not protected by copyright. Importantly, nothing produced by the federal government has copyright protection (all federal government works are in the public domain, but this does not necessarily apply to other levels of government).
  • In Europe, however, databases do have copyright-like protection. Not all databases are protected; protection comes from "qualitatively and/or quantitatively a substantial investment in either the obtaining, verification or presentation of the contents". Such rights extend for 15 years.

Thus, one has to determine whether the "data" being discussed merit copyright on their own. It may be that "data" refer to works that are themselves copyrighted (e.g., original written works, such as newspaper articles). Those "data" are protected but not because they are data, rather because they are creative works. Someone who has compiled those works into a database in the United States has no copyright protection for the works (unless they have obtained those rights for each "data point" from the original author(s)). In Europe, however, the compilation of those data into a database may entitle the compiler to a limited database right.

In conclusion, CC licenses make sense if one has copyright protections to give away. If not, then CC licenses make no sense because the data are probably free to use anyway. If the data do merit protection (by satisfying a European-style threshold of investment, an American-style threshold of originality, or some other national standard), then I believe the argument made in the linked webpage rests purely on the opinion that CC0 (or the public domain, where that principle exists) is preferable to more restrictive waivers of rights.

1 like 1 dislike
by (165 points)

Figshare has a FAQ that covers this, sort of. Whether the Panton Principles were formulated using the same reasoning I do not know, but I have heard the Figshare reasoning on several occasions. Dryad has a stronger and lengthier explanation (click on "Why does Dryad use Creative Commons Zero" in their FAQs).

The fear is that since some data cannot be copyrighted, if you use something other than CC0, people may not realize that some data can be used without attribution (as it is not subject to copyright). This reduces use of the data, which is contrary to the goals of openness.

The claim has also been made that it might make various sorts of data aggregation difficult. Dryad, for instance, imagines that you can competently aggregate data from 50,000 sources and asks you to envision lawsuits from not attributing the sources.

Both of these strike me as tantamount to saying, "We are bad programmers, and careless scientists."

Scientifically, you want to know where your data comes from so you can update it if the upstream source fixes it. You don't want a messy aggregate data set with no idea where its contents came from. Yes, you have to be a good enough programmer to keep the tiny bit of metadata about where each item came from (which can also be used for attribution) associated with the data itself. For instance, if you can grab data from 50,000 sources and can't even manage to say who it is from, what confidence should we have in the quality of your work, analysis, conclusions, etc.? Having licenses enforce this kind of basic good practice seems like an advantage, not a detriment, to me.
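
To make the "tiny bit of metadata" point concrete, here is a minimal Python sketch of one way to keep provenance attached to each data point while aggregating. It is only an illustration: the `Source` and `Record` classes, the lab names, and the example.org URLs are all hypothetical, and a real pipeline would probably store this in whatever format the data already uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical provenance record for one upstream data set."""
    name: str     # who to credit
    url: str      # where the data was obtained (made-up URLs below)
    license: str  # license identifier, e.g. "CC0-1.0" or "CC-BY-4.0"


@dataclass(frozen=True)
class Record:
    """One data point; its source travels with it."""
    value: float
    source: Source


def aggregate(datasets):
    """Flatten (source, values) pairs into records that keep their source."""
    return [Record(value, source) for source, values in datasets for value in values]


def attribution_lines(records):
    """Return one credit line per distinct source, sorted by source name."""
    sources = {record.source for record in records}
    return [
        f"{s.name} ({s.license}), {s.url}"
        for s in sorted(sources, key=lambda s: s.name)
    ]


# Made-up example data from two made-up sources.
lab_a = Source("Lab A", "https://example.org/lab-a", "CC-BY-4.0")
lab_b = Source("Lab B", "https://example.org/lab-b", "CC0-1.0")

records = aggregate([(lab_a, [1.0, 2.5]), (lab_b, [3.0])])
credits = attribution_lines(records)
```

Because every `Record` carries its `Source`, the list of credits can be regenerated at any time, and a correction published by one upstream source can be traced to exactly the records it affects. The same bookkeeping works for 2 sources or 50,000.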

That doesn't mean that every license is appropriate for data. Viral non-commercial licenses really limit how data can be used, as once they get mixed into a data set, companies basically have to stop touching the data. Contrary to the ideal of sharing, this actually poisons sharing by legally enforcing a forbidden class. But the idea that CC0 is the only thing that's appropriate for data is not well-founded, even if it is a common view. CC-BY is really not problematic; the requirements are minimal. (In fact, Figshare sensibly allows CC-BY for figures and so on; last I checked, Dryad insists on CC0 for everything. And both say you should provide attribution anyway as good custom.)
