cdm_schema

Schema for KBase CDM

URI: http://kbase.github.io/cdm-schema/cdm_schema

Name: cdm_schema

Classes

Class Description
Any Used as a range for slots that have more than one possible type.
Table root class for all schema entities
        Association An association between an object--typically an entity such as a protein or a feature--and a classification system or ontology, such as the Gene Ontology, the Enzyme Classification, or TIGRFAMS domains.
        AssociationXPublication Links associations to supporting literature.
        AssociationXSupportingObject Links associations to entities to capture supporting objects in an association. The supporting object may be a biological entity, such as a protein or feature, or a URL to a resource (e.g. a publication) that supports the association. Where possible, CDM identifiers should be used.
        AttributeValue A generic class for capturing tag-value information in a structured form.
                Geolocation A normalized value for a location on the earth's surface
                QuantityValue A simple quantity, e.g. 2cm. May be used to describe a range using the minimum_value and maximum_value fields.
                        Measurement A qualitative or quantitative observation of an attribute of an object or event against a standardized scale, to enable it to be compared with other objects or events.
                                ProcessedMeasurement A measurement that requires additional processing to generate a result.
                TextValue A basic string value
        AttributeValueEntity Represents the link between an entity and its attribute values.
        Cluster Represents an individual execution of a clustering protocol. See the ClusterMember class for clustering results.
        ClusterMember Relationship representing membership of a cluster. An optional score can be assigned to each cluster member.
        Contig A contig (derived from the word "contiguous") is a set of DNA segments or sequences that overlap in a way that provides a contiguous representation of a genomic region. A contig should not contain any gaps.
        ContigXContigCollection Captures the relationship between a contig and a contig collection; equivalent to contig part-of contig collection.
        ContigXEncodedFeature Captures the relationship between a contig and an encoded feature.
        ContigXFeature Captures the relationship between a contig and a feature; equivalent to feature part-of contig.
        ContigXProtein Captures the relationship between a contig and a protein; equivalent to protein is ribosomal translation of (http://purl.obolibrary.org/obo/RO_0002512) contig.
        ContigCollection A set of individual, overlapping contigs that represent the complete sequenced genome of an organism.
        ContigCollectionXEncodedFeature Captures the relationship between a contig collection and an encoded feature.
        ContigCollectionXFeature Captures the relationship between a contig collection and a feature; equivalent to feature part-of contig collection.
        ContigCollectionXProtein Captures the relationship between a contig collection and a protein; equivalent to protein is ribosomal translation of (http://purl.obolibrary.org/obo/RO_0002512) contig collection.
        Contributor Represents a contributor to the resource.

Contributors must have a 'contributor_type', either 'Person' or 'Organization', and
one of the 'name' fields: either 'given_name' and 'family_name' (for a person), or 'name' (for an organization or a person).

The 'contributor_role' field takes values from the DataCite and CRediT contributor
roles vocabularies. For more information on these resources and choosing
appropriate roles, please see the following links:

DataCite contributor roles: https://support.datacite.org/docs/datacite-metadata-schema-v44-recommended-and-optional-properties#7a-contributortype

CRediT contributor role taxonomy: https://credit.niso.org
        ContributorXRoleXExperiment
        ContributorXRoleXProject
        DataSource The source dataset from which data within the CDM was extracted. This might be an API query; a set of files downloaded from a website or uploaded by a user; a database dump; etc. A given data source should have either version information (e.g. UniProt's release number) or an access date to allow the original raw data dump to be recapitulated.
        EncodedFeature An entity generated from a feature, such as a transcript.
        EncodedFeatureXFeature Captures the relationship between a feature and its transcription product.
        EntailedEdge A relation graph edge that is inferred
        Entity A database entity.
        EntityXMeasurement Captures a measurement made on an entity.
        Event Something that happened.
        Experiment A discrete scientific procedure undertaken to make a discovery, test a hypothesis, or demonstrate a known fact.
        ExperimentXProject Captures the relationship between an experiment and the project that it is a part of.
        ExperimentXSample Represents the participation of a sample in an experiment.
        Feature A feature localized to an interval along a contig.
        FeatureXProtein Captures the relationship between a feature and a protein; equivalent to feature encodes protein.
        GoldEnvironmentalContext Environmental context, described using JGI's five level system.
        IdentifiedEntity Represents the link between an entity and its identifiers.
        Identifier A string used as a resolvable (external) identifier for an entity. This should be a URI or CURIE. If the string cannot be resolved to a URL, it should be added as a 'name' instead.

This table is used for capturing external IDs. The internal CDM identifier should be used in the *_id field (e.g. feature_id, protein_id, contig_collection_id).
        MixsEnvironmentalContext Environmental context, described using the MiXS convention of broad and local environment, plus the medium.
        Name A string used as the name or label for an entity. This may be a primary name, alternative name, synonym, acronym, or any other label used to refer to an entity.

Identifiers that look like CURIEs or database references, but which cannot be resolved using bioregistry or identifiers.org, should be added as names.
        NamedEntity Represents the link between an entity and its names.
        Prefix Maps CURIEs to URIs
        Project Administrative unit for collecting data related to a certain topic, location, data type, grant funding, and so on.
        Protein Proteins are large, complex molecules made up of one or more long, folded chains of amino acids, whose sequences are determined by the DNA sequence of the protein-encoding gene.
        Protocol Defined method or set of methods.
        ProtocolXProtocolParticipant
        ProtocolParticipant Either an input or an output of a protocol.
        Publication A publication (e.g. journal article).
        Sample A material entity that can be characterised by an experiment.
        Sequence
        Statements Represents an RDF triple
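
The Contributor rules in the class table above (a 'contributor_type' of 'Person' or 'Organization', plus either 'name' or both 'given_name' and 'family_name') can be sketched as a small check. This is only an illustration under stated assumptions: the `Contributor` dataclass and the `contributor_errors` helper are hypothetical names, not part of the CDM schema or its tooling, and the field names are taken from the slot list.

```python
from dataclasses import dataclass
from typing import List, Optional

ALLOWED_CONTRIBUTOR_TYPES = {"Person", "Organization"}


@dataclass
class Contributor:
    """Hypothetical in-memory record using the slot names from the schema."""
    contributor_type: str
    name: Optional[str] = None
    given_name: Optional[str] = None
    family_name: Optional[str] = None


def contributor_errors(contributor: Contributor) -> List[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    if contributor.contributor_type not in ALLOWED_CONTRIBUTOR_TYPES:
        errors.append("contributor_type must be 'Person' or 'Organization'")

    has_name = bool(contributor.name)
    has_split_name = bool(contributor.given_name) and bool(contributor.family_name)

    if contributor.contributor_type == "Organization":
        # An organization is identified by 'name' only.
        if not has_name:
            errors.append("an organization needs 'name'")
    elif not (has_name or has_split_name):
        # A person may use 'name' or the given/family pair.
        errors.append("a person needs 'name' or both 'given_name' and 'family_name'")
    return errors
```

A record such as `Contributor("Person", given_name="Ada", family_name="Lovelace")` passes, while an organization supplied with only given and family names is reported.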

Slots

Slot Description
aggregator_knowledge_source The knowledge source that aggregated the association. Should be a UUID from the DataSource table.
annotation_date The date when the annotation was made.
asm_score A composite score for comparing contig collection quality
association_id Internal (CDM) unique identifier for an association.
attribute_cv_term_id If the attribute is a term from a controlled vocabulary, the ID of the term.
attribute_name The attribute being captured in this annotation.
base The base URI a prefix will expand to
cds_phase For features of type CDS, the phase indicates where the next codon begins relative to the 5' end (where the 5' end of the CDS is relative to the strand of the CDS feature) of the current CDS feature. cds_phase is required if the feature type is CDS.
checkm2_completeness Estimate of the completeness of a contig collection (MAG or genome), as estimated by the CheckM2 tool
checkm2_contamination Estimate of the contamination of a contig collection (MAG or genome), as estimated by the CheckM2 tool
checksum The checksum of the sequence, used to verify its integrity.
cluster_id Internal (CDM) unique identifier for a cluster.
comments Any comments about the association.
contig_bp Total size in bp of all contigs
contig_collection_id Internal (CDM) unique identifier for a contig collection.
contig_collection_type The type of contig collection.
contig_id Internal (CDM) unique identifier for a contig.
contributor_id Internal (CDM) unique identifier for a contributor.
contributor_role Role(s) played by the contributor when working on the experiment. If more than one role was played, additional rows should be added to represent each role.
contributor_type Must be either 'Person' or 'Organization'.
created Date/timestamp for when the entity was created or added to the CDM.
created_at The time at which the event started or was created.
ctg_L50 Given a set of contigs, the L50 is defined as the sequence length of the shortest contig at 50% of the total contig collection length
ctg_L90 The L90 statistic is less than or equal to the L50 statistic; it is the length for which the collection of all contigs of that length or longer contains at least 90% of the sum of the lengths of all contigs
ctg_logsum The sum of the (length*log(length)) of all contigs, times some constant.
ctg_max Maximum contig length
ctg_N50 Given a set of contigs, each with its own length, the N50 count is defined as the smallest number of contigs whose length sum makes up half of contig collection size
ctg_N90 Given a set of contigs, each with its own length, the N90 count is defined as the smallest number of contigs whose length sum makes up 90% of contig collection size
ctg_powsum Powersum of all contigs is the same as logsum except that it uses the sum of (length*(length^P)) for some power P (default P=0.25)
data_source_created Date/timestamp for when the entity was created or added to the data source.
data_source_entity_id The primary ID of the entity at the data source.
data_source_id Internal (CDM) unique identifier for a data source.
data_source_updated Date/timestamp for when the entity was updated in the data source.
datatype the rdf datatype of the value, for example, xsd:string
date_accessed The date when the data was downloaded from the data source.
description Brief textual definition or description.
doi The DOI for a protocol.
e_value The 'score' of the feature. The semantics of this field are ill-defined. E-values should be used for sequence similarity features.
ecosystem JGI GOLD descriptor representing the top level ecosystem categorization.
ecosystem_category JGI GOLD descriptor representing the ecosystem category.
ecosystem_subtype JGI GOLD descriptor representing the subtype of ecosystem. May be "Unclassified".
ecosystem_type JGI GOLD descriptor representing the ecosystem type. May be "Unclassified".
encoded_feature_id Internal (CDM) unique identifier for an encoded feature.
end The start and end coordinates of the feature are given in positive 1-based int coordinates, relative to the landmark given in column one. Start is always less than or equal to end. For features that cross the origin of a circular feature (e.g. most bacterial genomes, plasmids, and some viral genomes), the requirement for start to be less than or equal to end is satisfied by making end = the position of the end + the length of the landmark feature. For zero-length features, such as insertion sites, start equals end and the implied site is to the right of the indicated base in the direction of the landmark.
entity_id Internal (CDM) unique identifier for an entity.
entity_type Type of entity being clustered.
env_broad_scale Report the major environmental system the sample or specimen came from. The system(s) identified should have a coarse spatial grain, to provide the general environmental context of where the sampling was done (e.g. in the desert or a rainforest). We recommend using subclasses of EnvO's biome class: http://purl.obolibrary.org/obo/ENVO_00000428. EnvO documentation about how to use the field: https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS
env_local_scale Report the entity or entities which are in the sample or specimen's local vicinity and which you believe have significant causal influences on your sample or specimen. We recommend using EnvO terms which are of smaller spatial grain than your entry for env_broad_scale. Terms, such as anatomical sites, from other OBO Library ontologies which interoperate with EnvO (e.g. UBERON) are accepted in this field. EnvO documentation about how to use the field: https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS.
env_medium Report the environmental material(s) immediately surrounding the sample or specimen at the time of sampling. We recommend using subclasses of 'environmental material' (http://purl.obolibrary.org/obo/ENVO_00010483). EnvO documentation about how to use the field: https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS . Terms from other OBO ontologies are permissible as long as they reference mass/volume nouns (e.g. air, water, blood) and not discrete, countable entities (e.g. a tree, a leaf, a table top).
event_id Internal (CDM) unique identifier for an event.
evidence_for_existence The evidence that this protein exists. For example, the protein may have been isolated from a cell, or it may be predicted based on sequence features.
evidence_type The type of evidence supporting the association. Should be a term from the Evidence and Conclusion Ontology (ECO).
experiment_id Internal (CDM) unique identifier for an experiment.
family_name The family name(s) of the contributor.
feature_id Internal (CDM) unique identifier for a feature.
gap_pct The gap size percentage of all scaffolds
gc_avg The average GC content of the contig collection, expressed as a percentage
gc_content GC content of the contig, expressed as a percentage.
gc_std The standard deviation of GC content across the contig collection
given_name The given name(s) of the contributor.
gold_environmental_context_id Internal (CDM) unique identifier for a GOLD environmental context.
has_stop_codon Captures whether or not the sequence includes stop coordinates.
hash A hash value generated from one or more object attributes that serves to ensure the entity is unique.
id An identifier for an element. Note blank node ids are not unique across databases
identifier Fully-qualified URL or CURIE used as an identifier for an entity.
is_representative Whether or not this member is the representative for the cluster. If 'is_representative' is false, it is assumed that this is a cluster member.
is_seed Whether or not this is the seed for this cluster.
language the human language in which the value is encoded, e.g. 'en'
latitude
length Length of the contig in bp.
location The location for this event. May be described in terms of coordinates.
longitude
maximum_value If the quantity describes a range, represents the upper bound of the range.
measurement_id Internal (CDM) unique identifier for a measurement.
minimum_value If the quantity describes a range, represents the lower bound of the range.
mixs_environmental_context_id Internal (CDM) unique identifier for a mixs environmental context.
n_contigs Total number of contigs
n_scaffolds Total number of scaffolds
name A string used as a name or title.
negated If true, the relationship between the subject and object is negated. For example, consider an association where the subject is a protein ID, the object is the GO term for "glucose biosynthesis", and the predicate is "involved in". With the "negated" field set to false, the association is interpreted as "the protein is involved in glucose biosynthesis". With the "negated" field set to true, the association is interpreted as "the protein is not involved in glucose biosynthesis".
object Note the range of this slot is always a node. If the triple represents a literal, instead value will be populated
p_value The 'score' of the feature. The semantics of this field are ill-defined. P-values should be used for ab initio gene prediction features.
participant_type The type of participant in the protocol.
predicate The predicate of the statement
prefix A standardized prefix such as 'GO' or 'rdf' or 'FlyBase'
primary_knowledge_source The knowledge source that created the association. Should be a UUID from the DataSource table.
project_id Internal (CDM) unique identifier for a project.
protein_id Internal (CDM) unique identifier for a protein.
protocol_id Internal (CDM) unique identifier for a protocol.
protocol_participant_id The unique identifier for the protocol participant.
publication_id Unique identifier for a publication - e.g. PMID, DOI, URL, etc.
quality The quality of the measurement, indicating the confidence that one can have in its correctness.
raw_value Raw value from the source data. May or may not include units or other unstructured information.
relationship Relationship between this identifier and the entity in the entity_id field.
sample_id Internal (CDM) unique identifier for a sample.
scaf_bp Total size in bp of all scaffolds
scaf_L50 Given a set of scaffolds, the L50 is defined as the sequence length of the shortest scaffold at 50% of the total contig collection length
scaf_L90 The L90 statistic is less than or equal to the L50 statistic; it is the length for which the collection of all scaffolds of that length or longer contains at least 90% of the sum of the lengths of all scaffolds.
scaf_l_gt50k The total length of scaffolds longer than 50,000 base pairs
scaf_logsum The sum of the (length*log(length)) of all scaffolds, times some constant. The score increases as contiguity increases.
scaf_max Maximum scaffold length
scaf_N50 Given a set of scaffolds, each with its own length, the N50 count is defined as the smallest number of scaffolds whose length sum makes up half of contig collection size
scaf_N90 Given a set of scaffolds, each with its own length, the N90 count is defined as the smallest number of scaffolds whose length sum makes up 90% of contig collection size
scaf_n_gt50K The number of scaffolds longer than 50,000 base pairs.
scaf_pct_gt50K The percentage of the total assembly length represented by scaffolds longer than 50,000 base pairs
scaf_powsum Powersum of all scaffolds is the same as logsum except that it uses the sum of (length*(length^P)) for some power P (default P=0.25).
score Output from the clustering protocol indicating how closely a member matches the representative.
sequence The protein amino acid sequence.
sequence_id Internal (CDM) unique identifier for a sequence.
source The source for a specific piece of information; should be a CDM internal ID of a source in the DataSource table.
source_database ID of the data source from which this entity came.
specific_ecosystem JGI GOLD descriptor representing the most specific level of ecosystem categorization. May be "Unclassified".
start The start and end coordinates of the feature are given in positive 1-based int coordinates, relative to the landmark given in column one. Start is always less than or equal to end. For features that cross the origin of a circular feature (e.g. most bacterial genomes, plasmids, and some viral genomes), the requirement for start to be less than or equal to end is satisfied by making end = the position of the end + the length of the landmark feature. For zero-length features, such as insertion sites, start equals end and the implied site is to the right of the indicated base in the direction of the landmark.
strand The strand of the feature.
subject The subject of the statement
type The type of the entity. Should be a term from the sequence ontology.
unit The unit of the quantity. Should be a term from UCUM.
updated Date/timestamp for when the entity was updated in the CDM.
url The URL from which the data was loaded.
value Note the range of this slot is always a string. Only used if the triple represents a literal assertion
value_cv_term_id If the term comes from the controlled vocabulary, the CURIE for the term. This will always be null if the text string is not from a controlled vocabulary.
version For versioned data sources, the version of the dataset.
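
The ctg_L50/ctg_N50 and ctg_L90/ctg_N90 slots above (and their scaf_ counterparts) can be illustrated with a short calculation. The sketch follows the definitions as written in this table, where the L statistic is a length and the N statistic is a count; `length_and_count_at` is a hypothetical helper, not the tool that actually populates these slots, and it assumes contig lengths are supplied as positive integers.

```python
from typing import List, Tuple


def length_and_count_at(lengths: List[int], percent: int) -> Tuple[int, int]:
    """Return (L, N) at the given percentage of total length.

    Contigs are taken from longest to shortest until their summed length
    reaches `percent` of the total. L is the length of the last (shortest)
    contig taken; N is how many contigs were taken.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1-100")

    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length, count
    raise AssertionError("unreachable when percent <= 100")


contig_lengths = [100, 80, 60, 40, 20]  # made-up lengths, total 300 bp
ctg_L50, ctg_N50 = length_and_count_at(contig_lengths, 50)
ctg_L90, ctg_N90 = length_and_count_at(contig_lengths, 90)
print(ctg_L50, ctg_N50, ctg_L90, ctg_N90)  # 80 2 40 4
```

With these made-up lengths, the two longest contigs (100 + 80 = 180 bp) cover half of the 300 bp total, and four contigs are needed to reach 90%, so the L90 value (40) is less than or equal to the L50 value (80), as the slot description states.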

Enumerations

Enumeration Description
CdsPhaseType For features of type CDS (coding sequence), the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon.
ClusterType The type of the entities in a cluster. Must be represented by a table in the CDM schema.
ContigCollectionType The type of the contig set; the type of the 'omics data set. Terms are taken from the Genomics Standards Consortium where possible. See the GSC checklists at https://genomicsstandardsconsortium.github.io/mixs/ for the controlled vocabularies used.
ContributorRole The role of a contributor to a resource.
ContributorType The type of contributor being represented.
EntityType The type of an entity. Must be represented by a table in the CDM schema.
ProteinEvidenceForExistence The evidence for the existence of a biological entity. See https://www.uniprot.org/help/protein_existence and https://www.ncbi.nlm.nih.gov/genbank/evidence/.
RefSeqStatusType RefSeq status codes, taken from https://www.ncbi.nlm.nih.gov/genbank/evidence/.
SequenceType The type of sequence being represented.
StrandType The strand that a feature appears on relative to a landmark. Also encompasses unknown or irrelevant strandedness.
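
The CdsPhaseType description above says the phase is the number of bases to remove from the beginning of a CDS feature to reach the first base of the next codon. A minimal sketch of that rule follows; `codons_from_cds` is a hypothetical helper, and it assumes the sequence passed in is already oriented 5' to 3' on the strand of the feature.

```python
from typing import List


def codons_from_cds(sequence: str, phase: int) -> List[str]:
    """Split a CDS segment into complete codons, honouring its phase.

    `phase` bases are dropped from the start of the segment; any trailing
    bases that do not fill a complete codon are left out of the result.
    """
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1, or 2")
    in_frame = sequence[phase:]
    usable = len(in_frame) - len(in_frame) % 3
    return [in_frame[i:i + 3] for i in range(0, usable, 3)]


print(codons_from_cds("GATGGCCTAA", 1))  # ['ATG', 'GCC', 'TAA']
print(codons_from_cds("GATGGCCTAA", 0))  # ['GAT', 'GGC', 'CTA']
```

The same ten bases yield different codons depending on the phase, which is why cds_phase is required for features of type CDS.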

Types

Type Description
Boolean A binary (true or false) value
Curie a compact URI
DataSourceUuid A UUID that identifies a data source in the CDM.
Date a date (year, month and day) in an idealized calendar
DateOrDatetime Either a date or a datetime
Datetime The combination of a date and time
Decimal A real number with arbitrary precision that conforms to the xsd:decimal specification
Double A real number that conforms to the xsd:double specification
Float A real number that conforms to the xsd:float specification
Integer An integer
Iso8601 A date in ISO 8601 format, e.g. 2024-04-05T12:34:56Z. "Z" indicates UTC time.
Jsonpath A string encoding a JSON Path. The value of the string MUST conform to JSON Path syntax and SHOULD dereference to zero or more valid objects within the current instance document when encoded in tree form.
Jsonpointer A string encoding a JSON Pointer. The value of the string MUST conform to JSON Pointer syntax and SHOULD dereference to a valid object within the current instance document when encoded in tree form.
LiteralAsStringType
LocalCurie A CURIE that exists as a subject in the statements table (i.e. Statements.subject). Should not be used for external identifiers.
Ncname Prefix part of CURIE
NodeIdType IDs are either CURIEs, IRIs, or blank nodes. IRIs are wrapped in <>s to distinguish them from CURIEs, but in general it is good practice to populate the prefixes table such that they are shortened to CURIEs. Blank nodes are ids starting with _:.
Nodeidentifier A URI, CURIE or BNODE that represents a node in a model.
Objectidentifier A URI or CURIE that represents an object in the model.
Sparqlpath A string encoding a SPARQL Property Path. The value of the string MUST conform to SPARQL syntax and SHOULD dereference to zero or more valid objects within the current instance document when encoded as RDF.
String A character string
Time A time object represents a (local) time of day, independent of any particular day
Uri a complete URI
Uriorcurie a URI or a CURIE
UUID A universally unique ID, generated using uuid4, with the prefix "CDM:".
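
The UUID type above (a uuid4 value carrying the prefix "CDM:") can be sketched with Python's standard `uuid` module. The helper names `new_cdm_id` and `is_cdm_id` are hypothetical, and the exact string layout (prefix followed directly by the hyphenated UUID) is an assumption based on the one-line description rather than a documented format.

```python
import uuid

CDM_PREFIX = "CDM:"


def new_cdm_id() -> str:
    """Generate an identifier of the assumed form 'CDM:<uuid4>'."""
    return f"{CDM_PREFIX}{uuid.uuid4()}"


def is_cdm_id(value: str) -> bool:
    """Check that a string carries the prefix and parses as a version 4 UUID.

    Note: uuid.UUID() also accepts some alternative spellings (for example,
    without hyphens), so this check is more lenient than a strict pattern match.
    """
    if not value.startswith(CDM_PREFIX):
        return False
    try:
        parsed = uuid.UUID(value[len(CDM_PREFIX):])
    except ValueError:
        return False
    return parsed.version == 4


example_id = new_cdm_id()
print(example_id, is_cdm_id(example_id))
```

No expected output is shown for the generated identifier because uuid4 values are random.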

Subsets

Subset Description