cdm_schema
Schema for KBase CDM
URI: http://kbase.github.io/cdm-schema/cdm_schema
Name: cdm_schema
Classes
Class | Description |
---|---|
Any | Used as a range for slots that have more than one possible type. |
Table | root class for all schema entities |
Association | An association between an object--typically an entity such as a protein or a feature--and a classification system or ontology, such as the Gene Ontology, the Enzyme Classification, or TIGRFAMS domains. |
AssociationXPublication | Links associations to supporting literature. |
AssociationXSupportingObject | Links associations to entities to capture supporting objects in an association. May be a biological entity, such as a protein or feature, or a URL to a resource (e.g. a publication) that supports the association. Where possible, CDM identifiers should be used. |
AttributeValue | A generic class for capturing tag-value information in a structured form. |
Geolocation | A normalized value for a location on the earth's surface |
QuantityValue | A simple quantity, e.g. 2cm. May be used to describe a range using the minimum_value and maximum_value fields. |
Measurement | A qualitative or quantitative observation of an attribute of an object or event against a standardized scale, to enable it to be compared with other objects or events. |
ProcessedMeasurement | A measurement that requires additional processing to generate a result. |
TextValue | A basic string value |
AttributeValueEntity | Represents the link between an entity and its attribute values. |
Cluster | Represents an individual execution of a clustering protocol. See the ClusterMember class for clustering results. |
ClusterMember | Relationship representing membership of a cluster. An optional score can be assigned to each cluster member. |
Contig | A contig (derived from the word "contiguous") is a set of DNA segments or sequences that overlap in a way that provides a contiguous representation of a genomic region. A contig should not contain any gaps. |
ContigXContigCollection | Captures the relationship between a contig and a contig collection; equivalent to contig part-of contig collection. |
ContigXEncodedFeature | Captures the relationship between a contig and an encoded feature. |
ContigXFeature | Captures the relationship between a contig and a feature; equivalent to feature part-of contig. |
ContigXProtein | Captures the relationship between a contig and a protein; equivalent to protein is ribosomal translation of (http://purl.obolibrary.org/obo/RO_0002512) contig. |
ContigCollection | A set of individual, overlapping contigs that represent the complete sequenced genome of an organism. |
ContigCollectionXEncodedFeature | Captures the relationship between a contig collection and an encoded feature. |
ContigCollectionXFeature | Captures the relationship between a contig collection and a feature; equivalent to feature part-of contig collection. |
ContigCollectionXProtein | Captures the relationship between a contig collection and a protein; equivalent to protein is ribosomal translation of (http://purl.obolibrary.org/obo/RO_0002512) contig collection. |
Contributor | Represents a contributor to the resource. Contributors must have a 'contributor_type', either 'Person' or 'Organization', and one of the 'name' fields: either 'given_name' and 'family_name' (for a person), or 'name' (for an organization or a person). The 'contributor_role' field takes values from the DataCite and CRediT contributor roles vocabularies. For more information on these resources and choosing appropriate roles, please see the following links: DataCite contributor roles: https://support.datacite.org/docs/datacite-metadata-schema-v44-recommended-and-optional-properties#7a-contributortype CRediT contributor role taxonomy: https://credit.niso.org |
ContributorXRoleXExperiment | |
ContributorXRoleXProject | |
DataSource | The source dataset from which data within the CDM was extracted. This might be an API query; a set of files downloaded from a website or uploaded by a user; a database dump; etc. A given data source should have either version information (e.g. UniProt's release number) or an access date to allow the original raw data dump to be recapitulated. |
EncodedFeature | An entity generated from a feature, such as a transcript. |
EncodedFeatureXFeature | Captures the relationship between a feature and its transcription product. |
EntailedEdge | A relation graph edge that is inferred |
Entity | A database entity. |
EntityXMeasurement | Captures a measurement made on an entity. |
Event | Something that happened. |
Experiment | A discrete scientific procedure undertaken to make a discovery, test a hypothesis, or demonstrate a known fact. |
ExperimentXProject | Captures the relationship between an experiment and the project that it is a part of. |
ExperimentXSample | Represents the participation of a sample in an experiment. |
Feature | A feature localized to an interval along a contig. |
FeatureXProtein | Captures the relationship between a feature and a protein; equivalent to feature encodes protein. |
GoldEnvironmentalContext | Environmental context, described using JGI's five level system. |
IdentifiedEntity | Represents the link between an entity and its identifiers. |
Identifier | A string used as a resolvable (external) identifier for an entity. This should be a URI or CURIE. If the string cannot be resolved to a URL, it should be added as a 'name' instead. This table is used for capturing external IDs. The internal CDM identifier should be used in the *_id field (e.g. feature_id, protein_id, contig_collection_id). |
MixsEnvironmentalContext | Environmental context, described using the MIxS convention of broad and local environment, plus the medium. |
Name | A string used as the name or label for an entity. This may be a primary name, alternative name, synonym, acronym, or any other label used to refer to an entity. Identifiers that look like CURIEs or database references, but which cannot be resolved using bioregistry or identifiers.org should be added as names. |
NamedEntity | Represents the link between an entity and its names. |
Prefix | Maps CURIEs to URIs |
Project | Administrative unit for collecting data related to a certain topic, location, data type, grant funding, and so on. |
Protein | Proteins are large, complex molecules made up of one or more long, folded chains of amino acids, whose sequences are determined by the DNA sequence of the protein-encoding gene. |
Protocol | Defined method or set of methods. |
ProtocolXProtocolParticipant | |
ProtocolParticipant | Either an input or an output of a protocol. |
Publication | A publication (e.g. journal article). |
Sample | A material entity that can be characterised by an experiment. |
Sequence | |
Statements | Represents an RDF triple |
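
The naming rule in the Contributor description can be sketched as a small check. This is only an illustration: the function name `contributor_errors` and the use of a plain dict for a record are assumptions, not part of the schema or its tooling.

```python
# Hypothetical helper: checks a contributor record against the rule stated in
# the Contributor class description. Field names come from the schema; the
# function itself is an illustrative assumption.
VALID_CONTRIBUTOR_TYPES = {"Person", "Organization"}


def contributor_errors(record: dict) -> list:
    """Return a list of rule violations for a contributor record (empty if valid)."""
    errors = []
    ctype = record.get("contributor_type")
    if ctype not in VALID_CONTRIBUTOR_TYPES:
        errors.append("contributor_type must be 'Person' or 'Organization'")

    has_name = bool(record.get("name"))
    has_split_name = bool(record.get("given_name")) and bool(record.get("family_name"))

    if ctype == "Organization" and not has_name:
        # Organizations can only be named through the 'name' field.
        errors.append("an organization needs 'name'")
    elif ctype == "Person" and not (has_name or has_split_name):
        # People may use 'name', or 'given_name' together with 'family_name'.
        errors.append("a person needs 'name' or both 'given_name' and 'family_name'")
    return errors
```

For example, `contributor_errors({"contributor_type": "Person", "given_name": "Ada"})` reports one violation, because a given name alone does not satisfy either naming option.
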
Slots
Slot | Description |
---|---|
aggregator_knowledge_source | The knowledge source that aggregated the association. Should be a UUID from the DataSource table. |
annotation_date | The date when the annotation was made. |
asm_score | A composite score for comparing contig collection quality |
association_id | Internal (CDM) unique identifier for an association. |
attribute_cv_term_id | If the attribute is a term from a controlled vocabulary, the ID of the term. |
attribute_name | The attribute being captured in this annotation. |
base | The base URI a prefix will expand to |
cds_phase | For features of type CDS, the phase indicates where the next codon begins relative to the 5' end (where the 5' end of the CDS is relative to the strand of the CDS feature) of the current CDS feature. cds_phase is required if the feature type is CDS. |
checkm2_completeness | Estimate of the completeness of a contig collection (MAG or genome), as computed by the CheckM2 tool |
checkm2_contamination | Estimate of the contamination of a contig collection (MAG or genome), as computed by the CheckM2 tool |
checksum | The checksum of the sequence, used to verify its integrity. |
cluster_id | Internal (CDM) unique identifier for a cluster. |
comments | Any comments about the association. |
contig_bp | Total size in bp of all contigs |
contig_collection_id | Internal (CDM) unique identifier for a contig collection. |
contig_collection_type | The type of contig collection. |
contig_id | Internal (CDM) unique identifier for a contig. |
contributor_id | Internal (CDM) unique identifier for a contributor. |
contributor_role | Role(s) played by the contributor when working on the experiment. If more than one role was played, additional rows should be added to represent each role. |
contributor_type | Must be either 'Person' or 'Organization'. |
created | Date/timestamp for when the entity was created or added to the CDM. |
created_at | The time at which the event started or was created. |
ctg_L50 | Given a set of contigs, the L50 is defined as the sequence length of the shortest contig at 50% of the total contig collection length |
ctg_L90 | The L90 statistic is less than or equal to the L50 statistic; it is the length for which the collection of all contigs of that length or longer contains at least 90% of the sum of the lengths of all contigs |
ctg_logsum | The sum of the (length*log(length)) of all contigs, times some constant. |
ctg_max | Maximum contig length |
ctg_N50 | Given a set of contigs, each with its own length, the N50 count is defined as the smallest number of contigs whose length sum makes up half of contig collection size |
ctg_N90 | Given a set of contigs, each with its own length, the N90 count is defined as the smallest number of contigs whose length sum makes up 90% of contig collection size |
ctg_powsum | Powersum of all contigs is the same as logsum except that it uses the sum of (length*(length^P)) for some power P (default P=0.25) |
data_source_created | Date/timestamp for when the entity was created or added to the data source. |
data_source_entity_id | The primary ID of the entity at the data source. |
data_source_id | Internal (CDM) unique identifier for a data source. |
data_source_updated | Date/timestamp for when the entity was updated in the data source. |
datatype | the rdf datatype of the value, for example, xsd:string |
date_accessed | The date when the data was downloaded from the data source. |
description | Brief textual definition or description. |
doi | The DOI for a protocol. |
e_value | The 'score' of the feature. The semantics of this field are ill-defined. E-values should be used for sequence similarity features. |
ecosystem | JGI GOLD descriptor representing the top level ecosystem categorization. |
ecosystem_category | JGI GOLD descriptor representing the ecosystem category. |
ecosystem_subtype | JGI GOLD descriptor representing the subtype of ecosystem. May be "Unclassified". |
ecosystem_type | JGI GOLD descriptor representing the ecosystem type. May be "Unclassified". |
encoded_feature_id | Internal (CDM) unique identifier for an encoded feature. |
end | The start and end coordinates of the feature are given in positive 1-based int coordinates, relative to the landmark given in column one. Start is always less than or equal to end. For features that cross the origin of a circular feature (e.g. most bacterial genomes, plasmids, and some viral genomes), the requirement for start to be less than or equal to end is satisfied by making end = the position of the end + the length of the landmark feature. For zero-length features, such as insertion sites, start equals end and the implied site is to the right of the indicated base in the direction of the landmark. |
entity_id | Internal (CDM) unique identifier for an entity. |
entity_type | Type of entity being clustered. |
env_broad_scale | Report the major environmental system the sample or specimen came from. The system(s) identified should have a coarse spatial grain, to provide the general environmental context of where the sampling was done (e.g. in the desert or a rainforest). We recommend using subclasses of EnvO's biome class: http://purl.obolibrary.org/obo/ENVO_00000428. EnvO documentation about how to use the field: https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS |
env_local_scale | Report the entity or entities which are in the sample or specimen's local vicinity and which you believe have significant causal influences on your sample or specimen. We recommend using EnvO terms which are of smaller spatial grain than your entry for env_broad_scale. Terms, such as anatomical sites, from other OBO Library ontologies which interoperate with EnvO (e.g. UBERON) are accepted in this field. EnvO documentation about how to use the field: https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS. |
env_medium | Report the environmental material(s) immediately surrounding the sample or specimen at the time of sampling. We recommend using subclasses of 'environmental material' (http://purl.obolibrary.org/obo/ENVO_00010483). EnvO documentation about how to use the field: https://github.com/EnvironmentOntology/envo/wiki/Using-ENVO-with-MIxS. Terms from other OBO ontologies are permissible as long as they reference mass/volume nouns (e.g. air, water, blood) and not discrete, countable entities (e.g. a tree, a leaf, a table top). |
event_id | Internal (CDM) unique identifier for an event. |
evidence_for_existence | The evidence that this protein exists. For example, the protein may have been isolated from a cell, or it may be predicted based on sequence features. |
evidence_type | The type of evidence supporting the association. Should be a term from the Evidence and Conclusion Ontology (ECO). |
experiment_id | Internal (CDM) unique identifier for an experiment. |
family_name | The family name(s) of the contributor. |
feature_id | Internal (CDM) unique identifier for a feature. |
gap_pct | The gap size percentage of all scaffolds |
gc_avg | The average GC content of the contig collection, expressed as a percentage |
gc_content | GC content of the contig, expressed as a percentage. |
gc_std | The standard deviation of GC content across the contig collection |
given_name | The given name(s) of the contributor. |
gold_environmental_context_id | Internal (CDM) unique identifier for a GOLD environmental context. |
has_stop_codon | Captures whether or not the sequence includes stop coordinates. |
hash | A hash value generated from one or more object attributes that serves to ensure the entity is unique. |
id | An identifier for an element. Note blank node ids are not unique across databases |
identifier | Fully-qualified URL or CURIE used as an identifier for an entity. |
is_representative | Whether or not this member is the representative for the cluster. If 'is_representative' is false, it is assumed that this is a cluster member. |
is_seed | Whether or not this is the seed for this cluster. |
language | the human language in which the value is encoded, e.g. 'en' |
latitude | |
length | Length of the contig in bp. |
location | The location for this event. May be described in terms of coordinates. |
longitude | |
maximum_value | If the quantity describes a range, represents the upper bound of the range. |
measurement_id | Internal (CDM) unique identifier for a measurement. |
minimum_value | If the quantity describes a range, represents the lower bound of the range. |
mixs_environmental_context_id | Internal (CDM) unique identifier for a MIxS environmental context. |
n_contigs | Total number of contigs |
n_scaffolds | Total number of scaffolds |
name | A string used as a name or title. |
negated | If true, the relationship between the subject and object is negated. For example, consider an association where the subject is a protein ID, the object is the GO term for "glucose biosynthesis", and the predicate is "involved in". With the "negated" field set to false, the association is interpreted as " |
object | Note the range of this slot is always a node. If the triple represents a literal, instead value will be populated |
p_value | The 'score' of the feature. The semantics of this field are ill-defined. P-values should be used for ab initio gene prediction features. |
participant_type | The type of participant in the protocol. |
predicate | The predicate of the statement |
prefix | A standardized prefix such as 'GO' or 'rdf' or 'FlyBase' |
primary_knowledge_source | The knowledge source that created the association. Should be a UUID from the DataSource table. |
project_id | Internal (CDM) unique identifier for a project. |
protein_id | Internal (CDM) unique identifier for a protein. |
protocol_id | Internal (CDM) unique identifier for a protocol. |
protocol_participant_id | The unique identifier for the protocol participant. |
publication_id | Unique identifier for a publication - e.g. PMID, DOI, URL, etc. |
quality | The quality of the measurement, indicating the confidence that one can have in its correctness. |
raw_value | Raw value from the source data. May or may not include units or other unstructured information. |
relationship | Relationship between this identifier and the entity in the entity_id field. |
sample_id | Internal (CDM) unique identifier for a sample. |
scaf_bp | Total size in bp of all scaffolds |
scaf_L50 | Given a set of scaffolds, the L50 is defined as the sequence length of the shortest scaffold at 50% of the total contig collection length |
scaf_L90 | The L90 statistic is less than or equal to the L50 statistic; it is the length for which the collection of all scaffolds of that length or longer contains at least 90% of the sum of the lengths of all scaffolds. |
scaf_l_gt50k | The total length of scaffolds longer than 50,000 base pairs |
scaf_logsum | The sum of the (length*log(length)) of all scaffolds, times some constant. The score increases as contiguity increases. |
scaf_max | Maximum scaffold length |
scaf_N50 | Given a set of scaffolds, each with its own length, the N50 count is defined as the smallest number of scaffolds whose length sum makes up half of contig collection size |
scaf_N90 | Given a set of scaffolds, each with its own length, the N90 count is defined as the smallest number of scaffolds whose length sum makes up 90% of contig collection size |
scaf_n_gt50K | The number of scaffolds longer than 50,000 base pairs. |
scaf_pct_gt50K | The percentage of the total assembly length represented by scaffolds longer than 50,000 base pairs |
scaf_powsum | Powersum of all scaffolds is the same as logsum except that it uses the sum of (length*(length^P)) for some power P (default P=0.25). |
score | Output from the clustering protocol indicating how closely a member matches the representative. |
sequence | The protein amino acid sequence. |
sequence_id | Internal (CDM) unique identifier for a sequence. |
source | The source for a specific piece of information; should be a CDM internal ID of a source in the DataSource table. |
source_database | ID of the data source from which this entity came. |
specific_ecosystem | JGI GOLD descriptor representing the most specific level of ecosystem categorization. May be "Unclassified". |
start | The start and end coordinates of the feature are given in positive 1-based int coordinates, relative to the landmark given in column one. Start is always less than or equal to end. For features that cross the origin of a circular feature (e.g. most bacterial genomes, plasmids, and some viral genomes), the requirement for start to be less than or equal to end is satisfied by making end = the position of the end + the length of the landmark feature. For zero-length features, such as insertion sites, start equals end and the implied site is to the right of the indicated base in the direction of the landmark. |
strand | The strand of the feature. |
subject | The subject of the statement |
type | The type of the entity. Should be a term from the sequence ontology. |
unit | The unit of the quantity. Should be a term from UCUM. |
updated | Date/timestamp for when the entity was updated in the CDM. |
url | The URL from which the data was loaded. |
value | Note the range of this slot is always a string. Only used if the triple represents a literal assertion |
value_cv_term_id | If the term comes from a controlled vocabulary, the CURIE for the term. This will always be null if the text string is not from a controlled vocabulary. |
version | For versioned data sources, the version of the dataset. |
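
The contig statistics above can be sketched from a plain list of contig lengths. The sketch follows the definitions as written in this table, where the `N` slots are contig counts and the `L` slots are lengths; many other tools use the two letters the other way round. The function names are assumptions made for illustration, not the code used to populate the CDM.

```python
# Illustrative sketch only: helper names are hypothetical.
def count_and_length_at(lengths, percent):
    """Walk contigs from longest to shortest and return (count, length):
    the smallest number of contigs whose summed length reaches `percent`
    of the total, and the length of the shortest contig in that set."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return count, length
    return len(ordered), ordered[-1]


def contig_stats(lengths):
    """Compute a few of the contig collection slots from contig lengths in bp."""
    n50, l50 = count_and_length_at(lengths, 50)
    n90, l90 = count_and_length_at(lengths, 90)
    return {
        "n_contigs": len(lengths),
        "contig_bp": sum(lengths),
        "ctg_max": max(lengths),
        "ctg_N50": n50,
        "ctg_L50": l50,
        "ctg_N90": n90,
        "ctg_L90": l90,
    }
```

With lengths `[100, 80, 60, 40, 20]` (300 bp in total), the two longest contigs cover half of the collection, so `ctg_N50` is 2 and `ctg_L50` is 80; four contigs are needed to reach 90%, so `ctg_N90` is 4 and `ctg_L90` is 40.
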
Enumerations
Enumeration | Description |
---|---|
CdsPhaseType | For features of type CDS (coding sequence), the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. |
ClusterType | The type of the entities in a cluster. Must be represented by a table in the CDM schema. |
ContigCollectionType | The type of the contig set; the type of the 'omics data set. Terms are taken from the Genomic Standards Consortium where possible. See the GSC checklists at https://genomicsstandardsconsortium.github.io/mixs/ for the controlled vocabularies used. |
ContributorRole | The role of a contributor to a resource. |
ContributorType | The type of contributor being represented. |
EntityType | The type of an entity. Must be represented by a table in the CDM schema. |
ProteinEvidenceForExistence | The evidence for the existence of a biological entity. See https://www.uniprot.org/help/protein_existence and https://www.ncbi.nlm.nih.gov/genbank/evidence/. |
RefSeqStatusType | RefSeq status codes, taken from https://www.ncbi.nlm.nih.gov/genbank/evidence/. |
SequenceType | The type of sequence being represented. |
StrandType | The strand that a feature appears on relative to a landmark. Also encompasses unknown or irrelevant strandedness. |
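
The CdsPhaseType definition can be made concrete with a short calculation. The sketch below assumes a coding sequence split into segments listed in 5' to 3' order relative to the strand of the CDS, and derives each segment's phase from the lengths of the segments before it. The function name `cds_phases` is hypothetical.

```python
# Illustrative sketch only: `cds_phases` is a hypothetical helper.
def cds_phases(segment_lengths, first_phase=0):
    """Return the phase (0, 1, or 2) of each CDS segment.

    The phase is the number of bases to skip at the start of a segment to
    reach the first base of the next codon. Segments are assumed to be in
    5' to 3' order and to have positive lengths.
    """
    if first_phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1, or 2")
    phases = []
    phase = first_phase
    for length in segment_lengths:
        phases.append(phase)
        # Bases left over after the last complete codon in this segment
        # start a codon that must be finished in the next segment.
        leftover = (length - phase) % 3
        phase = (3 - leftover) % 3
    return phases
```

For segments of 10, 5, and 9 bp, the first segment ends one base into a codon, so the second segment has phase 2; its remaining three bases form a full codon, so the third segment starts in frame with phase 0.
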
Types
Type | Description |
---|---|
Boolean | A binary (true or false) value |
Curie | a compact URI |
DataSourceUuid | A UUID that identifies a data source in the CDM. |
Date | a date (year, month and day) in an idealized calendar |
DateOrDatetime | Either a date or a datetime |
Datetime | The combination of a date and time |
Decimal | A real number with arbitrary precision that conforms to the xsd:decimal specification |
Double | A real number that conforms to the xsd:double specification |
Float | A real number that conforms to the xsd:float specification |
Integer | An integer |
Iso8601 | A date in ISO 8601 format, e.g. 2024-04-05T12:34:56Z. "Z" indicates UTC time. |
Jsonpath | A string encoding a JSON Path. The value of the string MUST conform to JSON Path syntax and SHOULD dereference to zero or more valid objects within the current instance document when encoded in tree form. |
Jsonpointer | A string encoding a JSON Pointer. The value of the string MUST conform to JSON Pointer syntax and SHOULD dereference to a valid object within the current instance document when encoded in tree form. |
LiteralAsStringType | |
LocalCurie | A CURIE that exists as a subject in the statements table (i.e. Statements.subject). Should not be used for external identifiers. |
Ncname | Prefix part of CURIE |
NodeIdType | IDs are either CURIEs, IRIs, or blank nodes. IRIs are wrapped in <>s to distinguish them from CURIEs, but in general it is good practice to populate the prefixes table such that they are shortened to CURIEs. Blank nodes are IDs starting with `_:`. |
Nodeidentifier | A URI, CURIE or BNODE that represents a node in a model. |
Objectidentifier | A URI or CURIE that represents an object in the model. |
Sparqlpath | A string encoding a SPARQL Property Path. The value of the string MUST conform to SPARQL syntax and SHOULD dereference to zero or more valid objects within the current instance document when encoded as RDF. |
String | A character string |
Time | A time object represents a (local) time of day, independent of any particular day |
Uri | a complete URI |
Uriorcurie | a URI or a CURIE |
UUID | A universally unique ID, generated using uuid4, with the prefix "CDM:". |
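
As a minimal sketch of the UUID type described above, the snippet below generates and checks identifiers of the form `CDM:` followed by a version 4 UUID. The helper names `new_cdm_id` and `is_cdm_id` are assumptions for illustration; the schema itself only states the prefix and the use of uuid4.

```python
import uuid

CDM_PREFIX = "CDM:"


def new_cdm_id() -> str:
    """Generate an identifier such as 'CDM:<uuid4>' (hypothetical helper)."""
    return f"{CDM_PREFIX}{uuid.uuid4()}"


def is_cdm_id(value: str) -> bool:
    """Check that a string is the CDM prefix followed by a canonical uuid4."""
    if not value.startswith(CDM_PREFIX):
        return False
    suffix = value[len(CDM_PREFIX):]
    try:
        parsed = uuid.UUID(suffix)
    except ValueError:
        return False
    # uuid.UUID accepts several spellings (braces, no hyphens), so also
    # require the canonical hyphenated form.
    return parsed.version == 4 and str(parsed) == suffix.lower()
```

A check like `is_cdm_id` could be applied to the `*_id` slots before loading data, since those slots are described as internal CDM identifiers; whether the real loaders do so is not stated here.
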
Subsets
Subset | Description |
---|---|