Gobierto Data / CSV Schema
When importing a dataset using the Gobierto Data API, the schema of the data being imported is informed with a schema file.
The schema file is a JSON containing information of the type of each column in the CSV.
If no schema is provided, all columns are considered text.
Example
Suppose that the CSV has this form:
DATE,TIMESTAMP,WEEK DAY, DISTRICT NUMBER,AGE,PRICE,STOCK
31/12/2018,2018-12-31 16:30,MONDAY,3,34,27.56,YES
1/12/2018,2018-12-1 07:15,SATURDAY,5,27,200.4,NO
If this file is sent to create a dataset automatically the following schema is generated:
{
"date": {
"original_name": "DATE",
"type": "text"
},
"timestamp": {
"original_name": "TIMESTAMP",
"type": "text"
},
"week_day": {
"original_name": "WEEK DAY",
"type": "text"
},
"district_number": {
"original_name": "DISTRICT NUMBER",
"type": "text"
},
"age": {
"original_name": "AGE",
"type": "text"
},
"price": {
"original_name": "PRICE",
"type": "text"
},
"stock": {
"original_name": "STOCK",
"type": "text"
}
}
Each column of the csv has an entry in the schema:
- The key represent the column name in the destination database table
- The value is an object with different attributes:
original_name
: Mandatory - The name of the column in the original CSV filetype
: Mandatory - The type of the column in the CSV file. A type has associated a data type in the database and possibly a set of transformation functions that are performed on the load process. Currently there are 6 types defined:integer
: Destination data type is integer (from -2147483648 to +2147483647).bigint
: Destination data type is bigint (from -9223372036854775808 to 9223372036854775807).numeric
: Destination data type is numeric.numeric_with_custom_decimal_separator
: Destination type is numeric.text
: Destination type is text.date
: Destination type is date.time
: Destination type is time.timestamp
: Destination type is timestamp without timezone. The default expected format is "YYY-MM-DD HH24:MI:SS"timestamptz
: Destination type is timestamp with timezone. The timezone used is the timezone configured for database. The default expected format is "YYY-MM-DD HH24:MI:SS"boolean
: Destination type is boolean.
optional_params
: Each type can have a list of settings. If omitted, default values are used. All types share anull_value
option which is "" by default. The value of this attribute will be transformed to NULL on load. There are other specific optional params on some types:numeric_with_custom_decimal_separator
:decimal_separator
: String. Default: ",". Character which represents the decimal separator.
date
,timestamp
andtimestamptz
date_format
: String. Default: "YYY-MM-DD HH24:MI:SS". String which represents the date format, according to theto_date
function documentation
time
time_format
: String. Default: "HH24:MI". String which represents the time format, according to theto_timestamp
function documentation
boolean
:false_value
: String. Default: "0". String which has to be interpreted as false.true_value
: String. Default: "1". String which has to be interpreted as true.
For our example this could be a good schema JSON:
{
"date": {
"original_name": "DATE",
"type": "date",
"optional_params": { "date_format": "DD/MM/YYYY" }
},
"timestamp": {
"original_name": "TIMESTAMP",
"type": "timestamptz",
"optional_params": { "date_format": "YYY-MM-DD HH24:MM" }
},
"week_day": {
"original_name": "WEEK DAY",
"type": "text"
},
"district_number": {
"original_name": "DISTRICT NUMBER",
"type": "integer"
},
"age": {
"original_name": "AGE",
"type": "integer"
},
"price": {
"original_name": "PRICE",
"type": "numeric"
},
"stock": {
"original_name": "STOCK",
"type": "boolean",
"optional_params": { "false_value": "NO", "true_value": "YES" }
}
}
Updated over 1 year ago