Open Data & Open Software

Data & Software Licensing

The attention given to open data goes up and down – but very little of it is on the licence that makes the data ‘open’. The recent conversations about ‘open’ and ‘closed’ AI models are giving a little more attention to the licences those AI models are distributed under. It’s a good opportunity to talk about how we license things to be ‘open’, or even ‘free’.

To understand things properly, we need to cover the history of two distinct concepts: free software, and IP rights. You can see the foreshadowing here: the fact that these have separate histories is what has created some of today’s tensions.

Free software history

There are various principles of freedom and openness in software that were created back in the 80s and 90s. The ‘Free Software Definition’, first published in 1986, sets out four essential freedoms:

The freedom to run the program as you wish, for any purpose (freedom 0).
The freedom to study how the program works, and change it so it does your computing as you wish (freedom 1). Access to the source code is a precondition for this.
The freedom to redistribute copies so you can help your neighbor (freedom 2).
The freedom to distribute copies of your modified versions to others (freedom 3). By doing this you can give the whole community a chance to benefit from your changes. Access to the source code is a precondition for this.

Eleven years later, the Debian Free Software Guidelines were published, setting out nine guidelines to determine whether a software licence is free:

Free redistribution.
Inclusion of source code.
Allowing for modifications and derived works.
Integrity of the author’s source code (as a compromise).
No discrimination against persons or groups.
No discrimination against fields of endeavor, like commercial use.
The license needs to apply to all to whom the program is redistributed.
License must not be specific to a product.
License must not restrict other software.

This list was used to create the ‘Open Source Definition’ (OSD) published by the ‘Open Source Initiative’ (but with technology neutrality added at the end).

Key here is that this was focussed on software, and that for software to truly be free, you needed to have access to the [human readable / editable] ‘source code’, not just the [machine-readable / not practically editable] ‘object code’ – along with rights to modify such code, etc.

Many licences were created on the back of these definitions, such as GPL, MIT, BSD, Apache 2, etc. and groups like the Open Source Initiative or SPDX keep lists of which licences they consider meet the ‘Open Source Definition’. These are the opinions of engineers and a few lawyers and academics – however there was very little litigation of these issues over the years to provide judicial clarity.

In the early 2000s, legal hero Lawrence Lessig brought the case Eldred vs Ashcroft, arguing that the repeated extension of copyright periods under US law did not meet the constitutional requirement that the period of copyright be for “limited Times”.

In the end, Eldred lost, and the copyright extensions under the Sonny Bono Act stood. As a result, works first published in 1928, for example, only entered the US public domain in 2024 (rather than 2004). However, the case led to the establishment of ‘Creative Commons’ – a structure and licensing approach to allow people to license their creative works (not just software) for public purposes (with the ‘some rights reserved’ tagline). There are many varieties of Creative Commons licence depending on whether the works can be re-mixed, attribution is needed, etc. – but the idea is to give a ‘free’ licensing framework for creative works that are not software.

On Free and Open Source Software

When you first encounter the phrase ‘free and open-source software’ you might wonder if it means the union of the two categories (all free software, and all open-source software), or the intersection (all software which is both free and open-source). This is a good question if you are thinking of ‘free’ as in ‘no-charge’. However, as we can see above, the ‘free’ is akin to freedom, and so ‘open-source’ is necessary, but not sufficient, for ‘freedom’. It is part of it, and so the intersection or union question is the wrong one.

Getting ahead of ourselves, the way people create and release AI models is complicating this. For example, the Open Source Initiative is proposing a definition of ‘Open Source AI’, in which parts of the training material might be subject to a commercial licence.

Intellectual Property Rights

For simplicity, let’s consider ‘intellectual property’ rights as a state-granted monopoly over [some activity] that relates to [some kind of idea]. The policy reason for this, at a very high level, is that the monopoly allows creators to earn revenue from their idea for a period of time. Without it, they would have little economic incentive to create, because copy-cats could do the same. There are many nuances to this, of course – trade marks don’t expire, patents involve the trade-off that work-arounds are sort of encouraged (the patent brings the invention into the light, when it might have otherwise remained a trade secret), copyright protects the expression of an idea, not the idea itself, etc. – but the broad theme is right.

Copyright and patents are hundreds of years old (15th century or so). Trade marks are very old (Roman times), but state monopolies over them came about in the 19th century.

‘Data’ can be protected in different ways under different IP rights, but it wasn’t always so, and indeed isn’t always so today. Under English law, the Copyright, Designs and Patents Act 1988 didn’t contemplate databases until 1998 when we implemented the Database Directive. This allowed for copyright protection of a database (e.g. the new section 3A to the CDPA), and also a separate database right (directly in the Database Regulations).

There is a lot of caselaw here on what exactly is protected, and what constitutes an infringement. In broad strokes, however, a database gets:

copyright protection if the selection and arrangement of the data into the database was an original intellectual creation (it doesn’t matter what the data is – but how it was selected or arranged)

the database right if there has been a substantial investment in obtaining, verifying or presenting the contents of the database

The IP rights apply to the database – not the data. Each protection does (in a different way) restrict the use/extraction of data from the database, but the state granted monopoly is because of the database (very broadly interpreted).

If the database doesn’t meet the criteria for those IP rights, then there is no IP right in the data itself, and therefore no IP based restriction on someone copying, publishing, or commercialising the data or database.
Me
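As a rough way to picture the two tests above, here is a minimal Python sketch. The class, field and function names are hypothetical (they are not terms from the CDPA or the Database Regulations), and the two booleans stand in for what are, in reality, fact-heavy legal questions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Database:
    # Both flags are simplifications of fact-heavy legal tests.
    original_selection_or_arrangement: bool  # the copyright test
    substantial_investment: bool  # the database right test


def database_protections(db: Database) -> list[str]:
    """Return the IP rights the database (not the data in it) might attract."""
    rights = []
    if db.original_selection_or_arrangement:
        rights.append("database copyright")
    if db.substantial_investment:
        rights.append("database right")
    return rights


# A hypothetical list of football scores in date order: no creative selection
# and (we assume) no substantial investment, so no IP-based restriction.
scores = Database(original_selection_or_arrangement=False, substantial_investment=False)
print(database_protections(scores))  # prints []
```

An empty list here corresponds to the position in the quote above: with neither right in play, there is no IP-based restriction on copying or commercialising the data.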

The reason the law came at it this way is that the individual data items in a database are typically not the original expression of an idea (normal text copyright) or a trading brand (trade mark). They are typically bare facts about the world (e.g. the FA Cup final score, the Bank of England base rate, how much rain fell this summer, etc.), and restricting their use would not meet any policy goal, and would be neither appropriate nor enforceable.

Note however that this was all a creature of EU law, and these sorts of database rights are very much not the case in lots of other parts of the world – importantly including the US.

Bringing it together

You can perhaps see where this is going: ‘openness’ isn’t typically needed over data or a database – as it is directly readable – you don’t need the ‘source code’ version in addition to the data/database.

‘Freedom’ isn’t needed over something that doesn’t have IP rights, as you don’t need a licence to use ideas that don’t have intellectual property protections (for example data in the US, or a very small, unoriginal portion of a work) – once you have a copy of it you can use it how you want.

The third piece of the puzzle after ‘openness’ and ‘freedom’, is ‘copyleft’ – the mechanic that makes ‘open source’ code ‘go viral’. The catch is that if you want to redistribute derivative works, you have to make them available under the very same licence (and typically include a notice). It’s not enough to just give the first recipient permission to use the works: the author wants to bind the first recipient to use the same process themselves if they go on to distribute derivative works. Some popular free software licences are copyleft (e.g. the GPL) and some are not (e.g. MIT).
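The copyleft catch can be sketched as a small rule. This is a deliberately crude illustration: the function name is hypothetical, the two sets only contain the examples named above, and real-world licence compatibility (and the notice requirements of non-copyleft licences) is far more nuanced than a lookup.

```python
# Illustrative classifications only, taken from the examples in the text.
COPYLEFT = {"GPL"}
PERMISSIVE = {"MIT"}


def licences_allowed_for_derivative(upstream: str, candidates: set[str]) -> set[str]:
    """Which candidate licences could a derivative work be distributed under?"""
    if upstream in COPYLEFT:
        # Copyleft: the derivative must go out under the very same licence.
        return candidates & {upstream}
    if upstream in PERMISSIVE:
        # Non-copyleft: any candidate is acceptable (notice requirements aside).
        return set(candidates)
    raise ValueError(f"unclassified licence: {upstream}")


options = {"GPL", "MIT", "proprietary"}
print(licences_allowed_for_derivative("GPL", options))  # prints {'GPL'}
print(sorted(licences_allowed_for_derivative("MIT", options)))
```

The asymmetry is the point: a derivative of the GPL work has one option, while a derivative of the MIT work keeps all of them.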

Licence to contract

And this brings us to one of the key uncertainties surrounding ‘free licences’ – are they licences, or are they contracts?

There has been a conversation for many years as to whether free software licences (such as the GPL) are copyright licences, or contracts. A licence is a permission to do a thing that would otherwise be unlawful (in this case, copy, distribute, etc. the works). A contract is a two-way bargain to do/not do things, which can relate to works whether they have IP rights or not.

Many groups, like the FSF, have historically put forward the view that the GPL was only a copyright licence. The idea in the GPL is that the licence is conditional on complying with its requirements – if you don’t meet the conditions, the licence doesn’t apply and you are in breach of copyright.

Alice and Bob might contract in respect of Alice sharing some information with Bob, where Bob agrees not to use it for a particular purpose – and that restriction would be enforceable whether or not the information benefits from IP rights. Contracts do, however, have a few more formalities to them than licences – such as ‘consideration’, meaning you have to give something up to enter into one. A contract also only binds its parties – so Alice and Bob might have a contract, but Charlie isn’t bound by it if they got hold of a copy through other means.

More modern open source licences state that they are both licences and contracts – attempting to handle this situation.

For example, the Creative Commons Attribution licence places an attribution obligation on recipients of the works. In version 3 they added a line that:

“to the extent this Public License may be interpreted as a contract, the Licensor grants You the Rights contained here in consideration of Your acceptance of these terms and conditions…”.
CC v3

The Blue Oak licence states:

“In order to receive this license, you must agree to its rules. The rules of this license are both obligations under that agreement and conditions to your license. You must not do anything with this software that triggers a rule that you cannot or will not follow.”.
Blue Oak

I am not aware of any English judgment setting out whether the GPL (or any other ambiguous open source licence) is a contract or a licence. The French case Entr’ouvert vs Orange set out the GPL as a contract (seemingly so plainly that it doesn’t get any analysis in the judgment, and it is just described throughout as a “contrat de licence”). GPL-violations.org has brought some actions in Germany, e.g. in 2004 a preliminary injunction was granted, including analysis that the GPL was incorporated into a German law contract (and the terms were not in breach of the German Civil Code).

Chris Simkins gave me a fascinating backdrop to this – that philosophical differences between civil law and common law may have driven differences in the importance of consideration to form a contract. The idea is that common law systems are grounded on enforcing ‘promises’, which makes consideration critical (though recent English caselaw would seem to have deviated from this). We can contrast this with civil law systems where contracts take their authority more from the civil code, which allows more unilateral contracts without consideration.

GenAI licences

We now know there are two questions: is content under a licence open source, and is it free?

The ‘open source-ness’ is on a spectrum, and depends on your perspective. At one end, we have GenAIaaS offerings like those from OpenAI or Anthropic – very much not open source. At the other end are projects like the Open Language Model (OLMo), which make available the training data and the weights in the model – so with enough compute you could make some changes and create something a bit different. In between are ‘open-weight’ models where you can download the weights (and read them, and in theory check them and edit them), but not the materials used to create the model (e.g. Llama, Grok and Mistral).
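To make the three points on that spectrum concrete, here is a minimal Python sketch. The class, the field names and the category labels are hypothetical shorthand for the positions described above, not an established taxonomy, and it looks only at what is made available, not at the licence terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRelease:
    weights_downloadable: bool
    training_data_available: bool


def openness(release: ModelRelease) -> str:
    """Place a model release on the rough spectrum described in the text."""
    if not release.weights_downloadable:
        # Only reachable through the provider's own service.
        return "hosted service only"
    if release.training_data_available:
        # Weights plus the materials used to create them.
        return "weights and training data"
    return "open-weight"


print(openness(ModelRelease(weights_downloadable=True, training_data_available=False)))
# prints open-weight
```

Two booleans are obviously too few to capture every release (code, evaluation data and documentation all matter too), but they are enough to show why ‘open-weight’ sits in the middle.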

If we consider the classic ‘source code’, ‘object code’ and ‘SaaS’ framework, it becomes clearer why people end up talking past each other when they talk about AI model licences. An AI model released for use without charge has some elements of being ‘source code’ (you can run it yourself, inspect it, run it on any infrastructure you want, create derivative works by fine-tuning it, etc.). It also has some elements of being ‘object code’ (it is the output of a previous process which the software user doesn’t have access to, or a copy of, and so you are limited in how you can modify and make derivative works).

From one perspective, these ‘open weight’ models are like ‘closed source’ software, where you can operate it, but can’t tinker with it. There’s no conclusive answer as to whether ‘open weight’ AI models count as ‘open source’ or not – views differ (e.g. the Open Source Initiative seem to be adopting a half way house approach in their Open Source AI definition). The most important thing is clarity, I think, more than one approach necessarily being superior.

Similarly, how free the licences for GenAI models are also varies. Closed source models are not provided under free licences. Fully open source models tend to be provided under the Apache 2.0 free / open source licence. In the middle, it varies. Some, like Mistral and Grok, are licensed under the Apache 2.0 licence. Llama 3, however, is not provided under a free licence – the licence has a volume cap of 700 million monthly active users (admittedly that’s a lot), and you have to comply with the Meta AUP (that prohibits for example warfare, planning activities that risk harm to individuals, impersonating people, representing the outputs are human generated, etc.), and also applicable laws and regulations.

Meta and others are very careful to require users to agree to their terms before downloading their ‘open’ models. As much GenAI training data is owned by third parties, and a GenAI model can output verbatim portions of that training material, the original data owner could well bring a claim against the model user, who would then in turn look to recover their losses from the model developer. Those terms are there to place exclusions and limits on the model developer’s liability.

What happens if you use an ‘open source software’ licence for data?

Lots of open source licences have provisions that don’t quite make sense for data. For example, LGPL 2.0 famously addresses dynamic linking of libraries. AGPL addresses the operation of the software over a network. GPL requires you to make the source code available if you convey the work (to a third party). These don’t really make sense in respect of data – sometimes this is harmless, sometimes it might undermine the contract more seriously.

Aside from bad grammar and requirements that don’t make sense, at a more fundamental level, the main problem you might face is a licence stated as a licence only, and not also as a contract. In the absence of IP rights in the underlying content, you may struggle to enforce a bare licence against a user that declines to license it from you, but still gets and uses a copy. Apache 2.0 does not say it is a contract – just a licence.

Because data: (a) doesn’t come in open and closed forms like software; and (b) doesn’t attract IP rights like software, it seemed to some that the definitions of ‘open’ needed some adjustments to properly consider the openness of a data licence. The Open Knowledge Foundation felt this way, and created an ‘open’ definition to use in relation to data and content. This is a derivation of the FSD and DFSG principles set out above – but with lots of bells and whistles added.

The first is that there are a few non-licence based requirements (somewhat similar to the integrity and availability pillars in data protection), of access being given, and the data being machine readable and in an open format.

In terms of the licence, it starts with the use, redistribution and modification, like in the four essential freedoms and in the OSD. Later on, we have ‘propagation’, ‘no charge’, ‘non discrimination against persons’ and ‘application to any purpose’ much like in the OSD. The ‘separation’ and ‘compilation’ requirements are worth noting. Arguably ‘separation’ (the licence allowing any part of the work to be used and modified separately to the rest) is a subset of ‘modification’. ‘Compilation’ however (that the licensed work must be allowed to be distributed along with other distinct works without placing restrictions on those works) is quite anti-copyleft. Some GPL type licences are notorious for requiring linked works to be licensed under the same GPL licence.

In an exercise of pragmatism, the open definition then lists seven permitted derogations from those requirements (such as allowing a requirement to attribute certain people, or allowing a requirement to change the name). In the context of the ‘compilation’ requirement above, the derogation that mandatory copyleft provisions are permitted is interesting. An ‘open definition’ licence must allow distinct bundled works to be under any licence, but can require distributions of the work (and presumably, but it doesn’t say so, derivative works?) to be under the original licence.

What about data-specialist free licences?

Of course there are a number of free data licences out there. Most famous of these is Creative Commons Zero (CC0) ‘no rights reserved’ which first tries to waive all rights in the works, and as a fallback grants a licence to use the works in any way. In both cases, trade marks and patent rights are carved out.

The UK Government’s Open Government Licence (OGL) is another free licence, used by many UK government departments. The OGL does not mention ‘contract’ at all, and its name and webpage give the impression it is a licence. Some parts of the website do however refer to the licence as a set of ‘terms and conditions’ – implying perhaps it should be taken to be a contract. The OGL licenses the use of copyright and database right material – so it doesn’t give any rights over content or data which does not attract those IP rights (none are generally needed – see above), nor does it grant licences to use any trade mark, patent, or any other IP rights – so you should still consider if you would need those. It seems that, if you didn’t want to agree to the OGL, you could do things with the relevant government data that don’t require a licence (e.g. fair dealing and fair use, like the OGL says, or insubstantial extraction from a database, which is not a restricted activity in respect of the database right).

Summary

For years, a key message in the open-source movement has been that it’s mainly about ‘free as in freedom’, not just about software being ‘open’. The increased importance of data, and the huge commercial uses that GenAI is unlocking, have placed a new emphasis on things – the ‘openness’ of the data is taking a back seat, with the freedom to use it getting a closer look. About time.

Credits

Kyle Mitchell’s writing helped me get my thinking straight on some of this, and was also really useful on the latest developments.

Chris Simkins – who contributed, in particular on common law vs civil law history, and the contrast of a promises grounded common law system versus a legislation grounded civil code system.

Comments

Please comment away – it’s a big topic, and I am sure I am missing lots of history and latest developments here.