Copyright © 2016 the Contributors to the ML Schema CG Specification, published by the W3C Machine Learning Schema Community Group under the W3C Community Contributor License Agreement (CLA). A human-readable summary is available.
The ML Schema is a simple shared schema that provides a set of classes, properties, and restrictions that can be used to represent and interchange information on data mining and machine learning algorithms, datasets, and experiments. It can be specialized to create new classes and properties. It can be mapped to more complex, specific ontologies on data mining and machine learning, and also used as a basis for markup languages and data exchange standards.
The namespace for ML Schema terms is http://www.w3.org/ns/mls#.
The OWL encoding of the ML Schema is available here.
This specification was published by the Machine Learning Schema Community Group. It is not a W3C Standard nor is it on the W3C Standards Track. Please note that under the W3C Community Contributor License Agreement (CLA) there is a limited opt-out and other conditions apply. Learn more about W3C Community and Business Groups.
The core vocabulary of ML Schema deals with machine learning (ML) algorithms. The schema can be used to represent the algorithms, the machine learning tasks they address, their implementations and executions, as well as inputs (e.g., data) and outputs (e.g., models) they specify.
This lightweight schema may be used as a basis for ontology development projects, markup languages and data exchange standards. In particular, it aims to align existing machine learning ontologies and to support the development of more specific ontologies for particular purposes and applications. The main purpose is to increase interoperability by preventing a proliferation of incompatible machine learning ontologies, as well as to provide a high-level standard to represent machine learning data. The intended result is a more representative and comprehensive ontology derived from existing state-of-the-art ML schemas.
The schema also defines a relationship between machine learning algorithms, their single executions, and the experiments and studies encompassing them. It aims at stimulating the development of standards in order to achieve a high level of interoperability among scientific experiments concerning machine learning. By facilitating the metadata interchange process, the ML Schema may foster reproducible research. Another goal of ML Schema related to interoperability and reproducible research is to facilitate turning machine learning algorithms and results into linked open data.
MLSchema.Schema data = mlschema.convert('myfile.ttl', MLSchema.Ontology.OntoDM, MLSchema.Ontology.MEX)
Besides a higher level of interoperability, it directly benefits ML ecosystems (e.g., OpenML) and ML metadata repositories (e.g., WASOTA), which can rely on a more representative standard in their architectures.
In OpenML, MLSchema is used to export all machine learning datasets, tasks, workflows and runs as linked open data (in RDF). This allows scientists to connect the results of their machine learning experiments to other knowledge sources, or to build novel knowledge bases for machine learning research.
This section is non-normative.
The following namespace prefixes are used throughout this document.
prefix | namespace IRI | definition |
rdf | http://www.w3.org/1999/02/22-rdf-syntax-ns# | The RDF namespace [RDF-CONCEPTS] |
xsd | http://www.w3.org/2001/XMLSchema# | XML Schema Namespace [XMLSCHEMA11-2] |
owl | http://www.w3.org/2002/07/owl# | The OWL namespace [OWL2-OVERVIEW] |
mls | http://www.w3.org/ns/mls# | The ML Schema namespace |
frapo | http://www.sparontologies.net/ontologies/frapo; http://purl.org/cerif/frapo | Funding, Research Administration and Projects Ontology |
dc | http://purl.org/dc/elements/1.1/ | Dublin Core [DUBLIN-CORE] |
foaf | http://xmlns.com/foaf/0.1/ | FOAF Vocabulary Specification [FOAF] |
(others) | (various) | Other namespace prefixes appearing only in examples. |
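As a minimal sketch of how these prefixes are applied, the following Python snippet expands a prefixed name into a full IRI. The `PREFIXES` dictionary and the `expand` helper are illustrative names introduced here, not part of ML Schema or of any library; the `xsd` IRI is the one declared in the document's Turtle examples.

```python
# Prefix-to-namespace table for the single-IRI rows listed in this document.
# The xsd IRI is the one declared in the document's Turtle examples.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "mls": "http://www.w3.org/ns/mls#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'mls:Run' into a full IRI (hypothetical helper)."""
    prefix, separator, local_name = prefixed_name.partition(":")
    if not separator or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {prefixed_name!r}")
    return PREFIXES[prefix] + local_name


print(expand("mls:Run"))  # http://www.w3.org/ns/mls#Run
```

The lookup is deliberately strict: a name whose prefix is not in the table raises an error rather than being passed through unchanged.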
We will illustrate the ML schema by means of two examples.
Firstly, we illustrate the ML Schema with an example derived from the OpenML portal (see Fig. 3). The example describes the entities involved in modeling a single run of the implementation of a logistic regression algorithm from the Weka machine learning environment. The referenced individuals can easily be looked up online. For instance, run 100241 can be found at http://www.openml.org/r/100241.
@prefix : <http://example.org#> .
@prefix mls: <http://www.w3.org/ns/mls#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
# The run and the entities it executes, consumes and produces
:run100241 rdf:type owl:NamedIndividual , mls:Run ;
    mls:executes :wekaLogistic ;
    mls:hasInput :credit-a , :wekaLogisticMSetting29 , :wekaLogisticRSetting29 ;
    mls:hasOutput :modelEvaluation100241 , :wekaLogisticModel100241 ;
    mls:realizes :logisticRegression ;
    mls:achieves :task29 .
# The implementation, its software and its algorithm
:wekaLogistic rdf:type owl:NamedIndividual , mls:Implementation ;
    mls:hasHyperParameter :wekaLogisticC , :wekaLogisticDoNotCheckCapabilities , :wekaLogisticM , :wekaLogisticOutputDebugInfo , :wekaLogisticR ;
    mls:implements :logisticRegression .
:weka rdf:type mls:Software ;
    mls:hasPart :wekaLogistic .
:logisticRegression rdf:type owl:NamedIndividual , mls:Algorithm .
# Hyperparameters and the settings used in this run
:wekaLogisticC rdf:type owl:NamedIndividual , mls:HyperParameter .
:wekaLogisticDoNotCheckCapabilities rdf:type owl:NamedIndividual , mls:HyperParameter .
:wekaLogisticM rdf:type owl:NamedIndividual , mls:HyperParameter .
:wekaLogisticOutputDebugInfo rdf:type owl:NamedIndividual , mls:HyperParameter .
:wekaLogisticR rdf:type owl:NamedIndividual , mls:HyperParameter .
:wekaLogisticMSetting29 rdf:type owl:NamedIndividual , mls:HyperParameterSetting ;
    mls:specifiedBy :wekaLogisticM ;
    mls:hasValue -1 .
:wekaLogisticRSetting29 rdf:type owl:NamedIndividual , mls:HyperParameterSetting ;
    mls:specifiedBy :wekaLogisticR ;
    mls:hasValue "1.0E-8"^^xsd:float .
# The dataset and its characteristics
:credit-a rdf:type owl:NamedIndividual , mls:Dataset ;
    mls:hasQuality :defaultAccuracy , :numberOfFeatures , :numberOfInstances .
:defaultAccuracy rdf:type owl:NamedIndividual , mls:DatasetCharacteristic ;
    mls:hasValue "0.56"^^xsd:float .
:numberOfFeatures rdf:type owl:NamedIndividual , mls:DatasetCharacteristic ;
    mls:hasValue "16"^^xsd:long .
:numberOfInstances rdf:type owl:NamedIndividual , mls:DatasetCharacteristic ;
    mls:hasValue "690"^^xsd:long .
# The model, its evaluation, and the task with its evaluation specification
:wekaLogisticModel100241 rdf:type owl:NamedIndividual , mls:Model .
:modelEvaluation100241 rdf:type owl:NamedIndividual , mls:ModelEvaluation ;
    mls:specifiedBy :predictiveAccuracy ;
    mls:hasValue 0.8478 .
:predictiveAccuracy rdf:type owl:NamedIndividual , mls:EvaluationMeasure .
:task29 rdf:type owl:NamedIndividual , mls:Task ;
    mls:definedOn :credit-a .
:evaluationSpecification1 rdf:type owl:NamedIndividual , mls:EvaluationSpecification ;
    mls:defines :task29 ;
    mls:hasPart :TenFoldCrossValidation , :predictiveAccuracy .
:TenFoldCrossValidation rdf:type owl:NamedIndividual , mls:EvaluationProcedure .
In the example, the run :run100241 executes the implementation :wekaLogistic of the algorithm :logisticRegression, which this execution realizes. The run takes the :credit-a dataset as input, and its outputs are the model :wekaLogisticModel100241 and the model evaluation :modelEvaluation100241. The run achieves the task :task29.
The implementation :wekaLogistic implements the algorithm :logisticRegression. The implementation has five hyperparameters: :wekaLogisticC, :wekaLogisticDoNotCheckCapabilities, :wekaLogisticM, :wekaLogisticOutputDebugInfo, and :wekaLogisticR. Two of these hyperparameters are set for the run :run100241. The hyperparameter :wekaLogisticM has its value set to -1 (expressed via the hyperparameter setting :wekaLogisticMSetting29), and the hyperparameter :wekaLogisticR has its value set to "1.0E-8"^^xsd:float (expressed via the hyperparameter setting :wekaLogisticRSetting29).
The dataset :credit-a has several characteristics, such as :decisionStumpAUC, :defaultAccuracy, :numberOfInstances, and :numberOfMissingValues.
The predictive model :wekaLogisticModel100241 is evaluated (:modelEvaluation100241) based on the specified evaluation measure :predictiveAccuracy. The value of the evaluation measure, modeled via the model evaluation :modelEvaluation100241, is 0.8478.
The task :task29 is defined on the dataset :credit-a and is defined by the evaluation specification :evaluationSpecification1. The evaluation specification :evaluationSpecification1 has two parts: the evaluation procedure :TenFoldCrossValidation and the evaluation measure :predictiveAccuracy.
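To make the relationships in this walk-through concrete, here is a minimal Python sketch that stores a subset of the example's statements as plain (subject, predicate, object) tuples and recovers the hyperparameter settings of the run. The tuple representation and the helper names `objects` and `hyperparameter_settings` are assumptions made for illustration; a real application would more likely load the Turtle with an RDF library.

```python
# A subset of the run example as (subject, predicate, object) tuples.
# Literal values are simplified to plain Python numbers for this sketch.
TRIPLES = [
    (":run100241", "rdf:type", "mls:Run"),
    (":run100241", "mls:executes", ":wekaLogistic"),
    (":run100241", "mls:hasInput", ":credit-a"),
    (":run100241", "mls:hasInput", ":wekaLogisticMSetting29"),
    (":run100241", "mls:hasInput", ":wekaLogisticRSetting29"),
    (":run100241", "mls:hasOutput", ":modelEvaluation100241"),
    (":run100241", "mls:hasOutput", ":wekaLogisticModel100241"),
    (":credit-a", "rdf:type", "mls:Dataset"),
    (":wekaLogisticMSetting29", "rdf:type", "mls:HyperParameterSetting"),
    (":wekaLogisticMSetting29", "mls:specifiedBy", ":wekaLogisticM"),
    (":wekaLogisticMSetting29", "mls:hasValue", -1),
    (":wekaLogisticRSetting29", "rdf:type", "mls:HyperParameterSetting"),
    (":wekaLogisticRSetting29", "mls:specifiedBy", ":wekaLogisticR"),
    (":wekaLogisticRSetting29", "mls:hasValue", 1.0e-8),
]


def objects(subject, predicate):
    """Return every object of the statements matching subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def hyperparameter_settings(run):
    """Map each hyperparameter set for a run to its value (hypothetical helper)."""
    settings = {}
    for node in objects(run, "mls:hasInput"):
        # Inputs include datasets as well as settings; keep only the settings.
        if "mls:HyperParameterSetting" in objects(node, "rdf:type"):
            parameter = objects(node, "mls:specifiedBy")[0]
            settings[parameter] = objects(node, "mls:hasValue")[0]
    return settings


print(hyperparameter_settings(":run100241"))
# {':wekaLogisticM': -1, ':wekaLogisticR': 1e-08}
```

The type check matters because mls:hasInput links a run to both its dataset and its hyperparameter settings, so the inputs have to be filtered by class before their values are read.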
Secondly, we illustrate the ML Schema with an example describing an ML study (:study1) and the corresponding dataset :mtl_dataset, providing a reference to a publication (:article1) and acknowledging the funding body (see Fig. 3).
This example refers to the article “Multi-Task Learning with a Natural Metric for Quantitative Structure Activity Relationship Learning” by Sadawi et al., which reports on the ML study carried out within the Meta-QSAR project (:meta-qsar_project) funded by :EPSRC (:grant1 with number EP/K030582/1).
The referred dataset is freely available in OpenML.
Exposing such metadata may be of use to possible collaborators who wish to analyse research networks and assess the trustworthiness of what is published in the literature. Knowing that a study was carried out within a funded project may increase their level of trust in the published results.
@prefix : <http://example.org#> .
@prefix mls: <http://www.w3.org/ns/mls#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix frapo: <http://purl.org/cerif/frapo/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
dc:BibliographicResource rdf:type owl:Class .
mls:Study rdfs:subClassOf frapo:Investigation .
# Funding agency, grant and project
:EPSRC rdf:type frapo:FundingAgency , owl:NamedIndividual ;
    frapo:awards :grant1 .
:grant1 rdf:type frapo:Grant , owl:NamedIndividual ;
    frapo:hasGrantNumber "EP/K030582/1" ;
    frapo:funds :meta-qsar_project .
:meta-qsar_project rdf:type owl:NamedIndividual , foaf:Project .
# Publication, dataset and study
:article1 rdf:type dc:BibliographicResource , owl:NamedIndividual ;
    rdfs:label "Article: Multi Task Learning with a Natural Metric for Quantitative Structure Activity Relationship Learning" .
:mtl_dataset rdf:type owl:NamedIndividual , mls:Dataset ;
    dc:licence :CC_by_3.0 ;
    dc:dateSubmitted "06/09/15" ;
    dc:source "http://www.openml.org/s/3" .
:CC_by_3.0 rdf:type owl:NamedIndividual , dc:LicenceDocument .
:study1 rdf:type owl:NamedIndividual , mls:Study ;
    frapo:enables :meta-qsar_project ;
    frapo:hasOutput :mtl_dataset .
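The funding chain in this example (study, project, grant, agency) can be followed mechanically. The Python sketch below does so over a handful of statements copied from the listing, stored as plain tuples; the helper names `objects`, `subjects` and `funding_of` are hypothetical and introduced only for illustration.

```python
# Statements from the study example as (subject, predicate, object) tuples.
TRIPLES = [
    (":EPSRC", "frapo:awards", ":grant1"),
    (":grant1", "frapo:hasGrantNumber", "EP/K030582/1"),
    (":grant1", "frapo:funds", ":meta-qsar_project"),
    (":study1", "frapo:enables", ":meta-qsar_project"),
    (":study1", "frapo:hasOutput", ":mtl_dataset"),
]


def objects(subject, predicate):
    """Return every object of the statements matching subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def subjects(predicate, obj):
    """Return every subject of the statements matching predicate and object."""
    return [s for s, p, o in TRIPLES if p == predicate and o == obj]


def funding_of(study):
    """List (agency, grant, grant number) for the projects linked to a study."""
    results = []
    for project in objects(study, "frapo:enables"):
        for grant in subjects("frapo:funds", project):
            numbers = objects(grant, "frapo:hasGrantNumber")
            for agency in subjects("frapo:awards", grant):
                results.append((agency, grant, numbers[0] if numbers else None))
    return results


print(funding_of(":study1"))
# [(':EPSRC', ':grant1', 'EP/K030582/1')]
```

This is the kind of lookup a prospective collaborator could run to check whether a study is tied to a funded project.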
IRI: http://www.w3.org/ns/mls#Algorithm
@prefix : <http://example.org#> .
@prefix mls: <http://www.w3.org/ns/mls#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:logisticRegression rdf:type owl:NamedIndividual , mls:Algorithm .
IRI: http://www.w3.org/ns/mls#Data
IRI: http://www.w3.org/ns/mls#DataCharacteristic
@prefix : <http://example.org#> .
@prefix mls: <http://www.w3.org/ns/mls#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:credit-a rdf:type owl:NamedIndividual , mls:Dataset ;
    mls:hasQuality :decisionStumpAUC , :defaultAccuracy , :numberOfInstances , :numberOfMissingValues .
IRI: http://www.w3.org/ns/mls#DatasetCharacteristic
@prefix : <http://example.org#> .
@prefix mls: <http://www.w3.org/ns/mls#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:numberOfInstances rdf:type owl:NamedIndividual , mls:DatasetCharacteristic .
:numberOfMissingValues rdf:type owl:NamedIndividual , mls:DatasetCharacteristic .
:decisionStumpAUC rdf:type owl:NamedIndividual , mls:DatasetCharacteristic .
:defaultAccuracy rdf:type owl:NamedIndividual , mls:DatasetCharacteristic .
IRI: http://www.w3.org/ns/mls#EvaluationMeasure
IRI: http://www.w3.org/ns/mls#EvaluationProcedure
@prefix : <http://example.org#> .
@prefix mls: <http://www.w3.org/ns/mls#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:TenFoldCrossValidation rdf:type owl:NamedIndividual , mls:EvaluationProcedure .
IRI: http://www.w3.org/ns/mls#EvaluationSpecification
@prefix : <http://example.org#> .
@prefix mls: <http://www.w3.org/ns/mls#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:evaluationSpecification1 rdf:type owl:NamedIndividual , mls:EvaluationSpecification ;
    mls:defines :task29 ;
    mls:hasPart :TenFoldCrossValidation , :predictiveAccuracy .
IRI: http://www.w3.org/ns/mls#Experiment
IRI: http://www.w3.org/ns/mls#FeatureCharacteristic
IRI: http://www.w3.org/ns/mls#HyperParameter
@prefix : <http://example.org#> .
@prefix mls: <http://www.w3.org/ns/mls#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:wekaLogisticC rdf:type owl:NamedIndividual , mls:HyperParameter .
:wekaLogisticDoNotCheckCapabilities rdf:type owl:NamedIndividual , mls:HyperParameter .
:wekaLogisticM rdf:type owl:NamedIndividual , mls:HyperParameter .
:wekaLogisticOutputDebugInfo rdf:type owl:NamedIndividual , mls:HyperParameter .
:wekaLogisticR rdf:type owl:NamedIndividual , mls:HyperParameter .
IRI: http://www.w3.org/ns/mls#HyperParameterSetting
@prefix : <http://example.org#> .
@prefix mls: <http://www.w3.org/ns/mls#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
:wekaLogisticMSetting29 rdf:type owl:NamedIndividual , mls:HyperParameterSetting ;
    mls:specifiedBy :wekaLogisticM ;
    mls:hasValue -1 .
:wekaLogisticRSetting29 rdf:type owl:NamedIndividual , mls:HyperParameterSetting ;
    mls:specifiedBy :wekaLogisticR ;
    mls:hasValue "1.0E-8"^^xsd:float .
IRI: http://www.w3.org/ns/mls#Implementation
@prefix : <http://example.org#> .
@prefix mls: <http://www.w3.org/ns/mls#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:wekaLogistic rdf:type owl:NamedIndividual , mls:Implementation ;
    mls:hasHyperParameter :wekaLogisticC , :wekaLogisticDoNotCheckCapabilities , :wekaLogisticM , :wekaLogisticOutputDebugInfo , :wekaLogisticR ;
    mls:implements :logisticRegression .
IRI: http://www.w3.org/ns/mls#ImplementationCharacteristic
IRI: http://www.w3.org/ns/mls#Model
@prefix : <http://example.org#> .
@prefix mls: <http://www.w3.org/ns/mls#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:wekaLogisticModel100241 rdf:type owl:NamedIndividual , mls:Model .
IRI: http://www.w3.org/ns/mls#InformationEntity
IRI: http://www.w3.org/ns/mls#ModelCharacteristic
IRI: http://www.w3.org/ns/mls#ModelEvaluation
@prefix : <http://example.org#> .
@prefix mls: <http://www.w3.org/ns/mls#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:modelEvaluation100241 rdf:type owl:NamedIndividual , mls:ModelEvaluation ;
    mls:specifiedBy :predictiveAccuracy ;
    mls:hasValue 0.8478 .
IRI: http://www.w3.org/ns/mls#Process
IRI: http://www.w3.org/ns/mls#Quality
IRI: http://www.w3.org/ns/mls#Run
@prefix : <http://example.org#> .
@prefix mls: <http://www.w3.org/ns/mls#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:run100241 rdf:type owl:NamedIndividual , mls:Run ;
    mls:executes :wekaLogistic ;
    mls:hasInput :credit-a , :wekaLogisticMSetting29 , :wekaLogisticRSetting29 ;
    mls:hasOutput :modelEvaluation100241 , :wekaLogisticModel100241 ;
    mls:realizes :logisticRegression ;
    mls:achieves :task29 .
IRI: http://www.w3.org/ns/mls#Software
@prefix : <http://example.org#> .
@prefix mls: <http://www.w3.org/ns/mls#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:weka rdf:type mls:Software ;
    mls:hasPart :wekaLogistic .
IRI: http://www.w3.org/ns/mls#Study
IRI: http://www.w3.org/ns/mls#Task
@prefix : <http://example.org#> .
@prefix mls: <http://www.w3.org/ns/mls#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
:task29 rdf:type owl:NamedIndividual , mls:Task ;
    mls:definedOn :credit-a .
IRI: http://www.w3.org/ns/mls#achieves
IRI: http://www.w3.org/ns/mls#definedOn
IRI: http://www.w3.org/ns/mls#defines
IRI: http://www.w3.org/ns/mls#executes
IRI: http://www.w3.org/ns/mls#hasHyperParameter
IRI: http://www.w3.org/ns/mls#hasInput
IRI: http://www.w3.org/ns/mls#hasOutput
IRI: http://www.w3.org/ns/mls#hasPart
IRI: http://www.w3.org/ns/mls#hasQuality
IRI: http://www.w3.org/ns/mls#implements
IRI: http://www.w3.org/ns/mls#realizes
IRI: http://www.w3.org/ns/mls#specifiedBy
Term from ML Schema | OntoDM-core [OntoDM-core] | DMOP [DMOP] | Expose [Expose] | MEX Vocabulary [MEX] |
Task | Data mining task | DM-Task | Task | mexcore:ExperimentConfiguration |
Algorithm | Data mining algorithm | DM-Algorithm | Algorithm | mexalgo:Algorithm |
Software | Data mining software | DM-Software | N/A | mexalgo:Tool |
Implementation | Data mining algorithm implementation | DM-Operator | Algorithm implementation | N/A |
HyperParameter | Parameter | Parameter | Parameter | mexalgo:HyperParameter |
HyperParameterSetting | Parameter setting | OpParameterSetting | Parameter setting | N/A |
Study | Investigation | N/A | N/A | mexcore:Experiment |
Experiment | N/A | DM-Experiment | Experiment | N/A |
Run | Data mining algorithm execution | DM-Operation | Algorithm execution | mexcore:Execution |
Data | Data item | DM-Data | N/A | mexcore:Example |
Dataset | DM dataset | DataSet | Dataset | mexcore:Dataset |
Feature | N/A | Feature | N/A | mexcore:Feature |
DataCharacteristic | Data specification | DataCharacteristic | Dataset specification | N/A |
DatasetCharacteristic | Dataset specification | DataSetCharacteristic | Data quality | N/A |
FeatureCharacteristic | Feature specification | FeatureCharacteristic | N/A | N/A |
Model | Generalization | DM-Hypothesis (DM-Model / DM-PatternSet) | Model | mexcore:Model |
ModelCharacteristic | Generalization quality | HypothesisCharacteristic | Model Structure, Parameter, ... | N/A |
ModelEvaluation | Generalization evaluation | ModelPerformance | Evaluation | N/A |
EvaluationMeasure | Evaluation datum | ModelEvaluationMeasure | Evaluation measure | mexperf:PerformanceMeasure |
EvaluationSpecification | N/A | N/A | N/A | N/A |
EvaluationProcedure | Evaluation algorithm | ModelEvaluationAlgorithm | Performance Estimation | N/A |
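The mapping table lends itself to simple term translation between vocabularies. The Python sketch below copies a few rows from the table and looks up the corresponding term in another column. The `translate` function and the data layout are assumptions made for illustration; they are not part of ML Schema or of any published tooling, and `None` stands for the table's "N/A".

```python
# Column order follows the mapping table; the first column is the ML Schema term.
COLUMNS = ("ML Schema", "OntoDM-core", "DMOP", "Expose", "MEX")

# Selected rows copied from the mapping table; None stands for "N/A".
ROWS = [
    ("Task", "Data mining task", "DM-Task", "Task",
     "mexcore:ExperimentConfiguration"),
    ("Algorithm", "Data mining algorithm", "DM-Algorithm", "Algorithm",
     "mexalgo:Algorithm"),
    ("Implementation", "Data mining algorithm implementation", "DM-Operator",
     "Algorithm implementation", None),
    ("Run", "Data mining algorithm execution", "DM-Operation",
     "Algorithm execution", "mexcore:Execution"),
    ("Dataset", "DM dataset", "DataSet", "Dataset", "mexcore:Dataset"),
    ("Model", "Generalization", "DM-Hypothesis (DM-Model / DM-PatternSet)",
     "Model", "mexcore:Model"),
]


def translate(term, source, target):
    """Return the target-column term aligned with a source-column term."""
    source_index = COLUMNS.index(source)
    target_index = COLUMNS.index(target)
    for row in ROWS:
        if row[source_index] == term:
            return row[target_index]
    raise KeyError(f"{term!r} not found in column {source!r}")


print(translate("DM-Operation", "DMOP", "MEX"))                 # mexcore:Execution
print(translate("Generalization", "OntoDM-core", "ML Schema"))  # Model
print(translate("Implementation", "ML Schema", "MEX"))          # None
```

Returning `None` for an "N/A" cell, rather than raising, lets a caller distinguish "no counterpart in the target vocabulary" from "term not in the table at all".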
Several ontologies have been developed for the domain of data mining, with the aim of providing formal descriptions of domain entities. One of the proposed ontologies is the OntoDM-core ontology [OntoDM-core]. In one of the preliminary versions of the ontology [OntoDM-core-init], the authors decided to align the proposed ontology with the Ontology of Biomedical Investigations (OBI) [OBI] and consequently with the Basic Formal Ontology (BFO) [BFO], in terms of top-level classes and the set of relations. This was beneficial for structuring the domain more elegantly and for the basic differentiation of information entities, implementation entities and processual entities. In this context, the authors proposed a horizontal description structure that includes three layers: a specification layer, an implementation layer, and an application layer. The specification layer in general contains information entities. In the domain of data mining, example classes are data mining task and data mining algorithm. The implementation layer in general contains qualities and entities that are realized in a process, such as parameters and implementations of algorithms. The application layer contains processual classes, such as the execution of the data mining algorithm.
The main goal of the Expose ontology [Expose] is to describe (and reason about) machine learning experiments. It is built on top of OntoDM, as well as top-level ontologies from bioinformatics. It is currently used in OpenML as a way to structure data (e.g., database design) and share data (APIs). ML Schema will be used to export all OpenML data as linked open data (in RDF).
The Data Mining OPtimization Ontology (DMOP) [DMOP] has been developed with a primary use case in meta-mining, that is, meta-learning extended to an analysis of full DM processes. At the level of both single algorithms and more complex workflows, it follows a modeling pattern very similar to the one described in ML Schema. To support meta-mining, DMOP contains a taxonomy of algorithms used in DM processes, which are described in detail in terms of their underlying assumptions, cost functions, optimization strategies, generated models or pattern sets, and other properties. Such a "glass box" approach, which makes internal algorithm characteristics explicit, allows meta-learners using DMOP to generalize over algorithms and their properties, including algorithms that were not used for training the meta-learners.
The MEX vocabulary has been designed to reuse existing ontologies (i.e., [PROV-O], [DUBLIN-CORE], and [DOAP]) for representing basic machine learning information. The aim is not to describe a complete data-mining process, which can be modeled by more complex and semantically refined structures. Instead, MEX is designed to provide a simple and lightweight vocabulary for exchanging machine learning metadata in order to achieve a high level of interoperability as well as supporting data management for ML outcomes.