Input Format vs Output Format

Line Protocol
Annotated CSV
From Series to Tables on Disk
1. Adding Fields
2. Adding Tags
Real World Data
1. Exploring the Real Word Data Schema with Flux
2. Exercises with Real World Data
Further Reading

The InfluxDB input format is line protocol. The InfluxDB output format is Annotated CSV. InfluxDB users are sometimes initially confused because using line protocol can be similar to writing a single table, but then when they query the same data, the query returns multiple tables in CSV. This is because as the data formatted in line protocol is ingested by InfluxDB it is seperated into series. Understanding this subtle relationship between the ingestion format and time series is important for good schema design and for using InfluxDB optimally.

Line Protocol

The smallest allowed line of line protocol includes a measurement, a single field, and a value for that field. A measurement is the highest grouping of data inside a bucket. A field is a part of that measurement and defines a single point. Fields are separated by measurements by a space. So, the smallest readable line looks like:

measurement1 field1=1i

Adding multiple fields

You need to supply at least one field, but you can supply as many as you need in a single line, separated by a comma:

measurement1 field1=1i,field2=1,field3="a"

Line protocol is very compact, but each of the fields will be in their own series when stored in the database. More on that in From Series to Tables on Disk.

Types explained

The types of a field value can be an integer, a float, or a string. By default a number will be interpreted as a float. The addition of the “i” after each number for my_field tells InfluxDB that I wanted those to be integers. Using quotes ensures that InfluxDB knows I want a string type.

Type conflicts

Once you have created a series, you cannot change the type field type of the series. InfluxDB will reject the the following write:

measurement1 field1=1,field2="1",field3="a"

After having written this line:

measurement1 field1=1i,field2=1,field3="a"

Notice how the field2 field value has been changed from a string to a float.

Adding tags

Tags are another kind of data that you can add to a line in line protocol. Tags are useful because they are automatically indexed by InfluxDB. Using and querying for tags allows you to significantly improve your query performance. Additionally, tags are useful for categorizing your queries.

Remember, a series is defined by a measurement(s), tag sets(s), and field key(s). Tags are defined after the measurement name, and separated from it by a comma. I can add a tag to the previous line protocol as such:

measurement1,tag1=tagvalue1 field1=1i,field2=1,field3="a" 1626118680000000000
measurement1,tag1=tagvalue2 field1=2i,field2=2,field3="b" 1626118740000000000

The introduction of the tag key, “tag1”, with 2 different tag values, “tagvalue1” and “tagvalue2” produces 6 series–3 series come from the different fields for each of the 2 tag values.

Adding timestamps

In cases where a timestamp is omitted from line protocol, InfluxDB will add a timestamp based on the current server time at the time of the write.

Letting InfluxDB automatically supply a timestamp is very convenient, but is likely too imprecise for your application, so you will typically supply a timestamp as well. InfluxDB expects timestamps to be unix timestamps. InfluxDB also expects the timestamps to be in nanosecond resolution, but the write API allows you to define lower resolution precision if needed.

The timestamp comes at the end of a line of line protocol and is separated from the last field value by a space. Supplying a timestamp to the above line protocol would look like this:

measurement1 field1=1i,field2=1,field3="a" 1626118680000000000
measurement1 field1=2i,field2=2,field3="b" 1626118740000000000

Note on Timestamp Precision

The native resolution of time stamps in InfluxDB is nanoseconds. In InfluxDB, the unit of resolution (for example nanosecond, microsecond, millisecond) is called the “precision” of the time stamp. For context:

There are 1,000 nanoseconds in a microsecond
There are 1,000,000 nanoseconds in a millisecond
There are 1,000,000,000 nanoseconds in a second

This is important to keep in mind while constructing your application, because many systems do not handle nanosecond resolution, so it is necessary to convert between them.

Note that InfluxDB tools do allow you to define the precision of your timestamps, so when writing, you can allow InfluxDB to handle the conversion for you. This will be covered in Part 3.

Overwriting points

You can overwrite points in InfluxDB when you write data with the same series and same timestamp. For example, if you wrote this line of line protocol to InfluxDB:

measurement1 field1=2i,field2=2,field3="b" 1626118740000000000

You would then overwrite the 3 points by writing this line next.

measurement1 field1=10i,field2=10,field3="overwritten" 1626118740000000000
<more examples that include partial overwrites, subsets of tags>

Annotated CSV

Annotated CSV is the output format for InfluxDB. The annotated CSV output matches the InfluxDB persistence format with simple queries or when you don’t apply additional transformations to your data. Annotated CSV result is a stream of tables returned by Flux where each table represents a series for simple queries only. For example if you wrote this line protocol line to InfluxDB:

measurement1,tag1=tagvalue1 field1=1i

You would return the following full annotated CSV output when querying for all fields and tags (of which there are none) within the measurement:

#group,false,false,true,true,false,false,true,true,true
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,long,string,string,string
#default,_result,,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement,tag1
,,0,rfc3339time1,rfc3339time2,2021-08-17T21:23:39.000000000Z,1,field1,Measurement1,tagvalue1

Remember, that line of line protocol produces 1 series which is why one table is included in the annotated CSV output.

Raw CSV vs Annotated CSV

The first thing to notice about the output format of annotated CSV is that it resembles the CSV format that you’re familiar with. To easily distinguish between the two we’ll refer to CSV as Raw CSV. Unlike raw CSV, annotated CSV contains the following rows:

Header Row
Records Rows
Annotation Rows

Header Row

The header row is similar to any header row in a CSV. It describes the column names for your time series data.The header row is found below the 3 Annotation rows. Our header row is:

,result,table,_start,_stop,_time,_value,_field,_measurement,tag1

Some of these headers are an intuitive translation from line protocol while others are not. The headers are:

result. This column includes the name of the result as specified by the query. If no name is provided it will be empty.
table. This column contains a unique ID for each table in the annotated CSV output result. stream. In the example above we are only writing and querying for 1 series so the value of that column is set to 0 for that table.
_start.This column includes the start time for the time range over which you queried your data for. Specifying a range is necessary for any Flux query.
_stop.This column includes the stop time for the time range over which you queried your data for. Specifying a range is necessary for any Flux query.
_time.This column is the timestamp for your time series. The value of this column is either added upon write or included in the line protocol explicitly.
_value. This column contains the field values for the corresponding field keys in the same row under the _field column.
_field. This column contains the field keys.
_measurement. This column contains the name of the measurement.
tag1. This column contains the tag value for our tag1 tag key.

For the rest of this section you can ignore the values for the _start, _stop, and result columns. Those values are assigned by the user during query execution. For now, just focus on understanding the similarities between your line protocol input format and the resulting annotated CSV.

Record Rows

The records row(s) is directly below the header row. These rows contain our time series data. Each row contains one record. A record is a tuple of named values. Each table contains at least one record. When querying for your time series without adding any additional Flux transformations, a table

Records vs Points

Remember, a point is a datapoint from a series at a specific timestamp. A record can be a point but not always. For example, you could use Flux to drop all tag, field, and measurement columns. At this point the records in the annotated CSV wouldn’t reflect a point but rather time series data instead.

Annotation Rows

The annotation rows are found in the first three lines of the full annotated CSV output. Annotations include metadata about your annotated CSV. When querying InfluxDB with the API, you can specify which headers you want to include in your annotated CSV output. All 3 headers are included by default when you export the results of your query through the InfluxDB UI.

The 3 annotations are:

#group: A boolean that indicates the column is part of the group key. A group key is a list of columns for which every row in the table has the same value. A column is part of the group key if its #group annotation is set to true.
Important Note: The exception is for the table column. The group key for the table is set to false because users can’t directly change the table number. The table record will always be the same across rows even though the group key is set to false.
#datatype: Describes the type of data or which line protocol element the column represents.
#default: The value to use for rows with an empty value.

If annotations confuse you, don’t worry. The importance of annotations will become apparent in subsequent sections of this book, specifically around understanding Flux. Additionally it’s worth mentioning that the #group annotation is the most important annotation for using Flux successfully. For now, be aware that InfluxDB applies a default group key to your data so that the tables in your annotated CSV output will each represent a single series for simple queries. This default application of group keys is the result of the way that series are stored on disk, described in the following section.

From Series to Tables on Disk

Each series is stored as a table on disk. Data gets written to InfluxDB into buckets. When data is written to a bucket it is added to an appropriate table, or a new table is created if needed. Each table has exactly one measurement, one field, and a unique set of tag values. Indexes are created or amended on write to enable quickly finding these tables when queried. Additionally, all rows in all tables are indexed by time.

Some people consider InfluxDB to be a “schemaless” database. However, that is not really accurate. More accurately, InfluxDB is a “schema on write.” That is to say, InfluxDB does not require a schema to be defined beforehand nor enforce a schema on writes. Instead, Influxdb builds a schema implicitly based on the writes you do make.

While schema on write is a true convenience for developers, you should be cognizant of the schema that you are implicitly creating as you write data. A poorly designed schema can have a negative impact on things like ease of querying, performance, and cardinality.

In this section we’ll learn about how line protocol produces series which get stored as tables on disk.

Adding Fields

In this section we’ll break down how the line protocol is converted to series and written as tables in InfluxDB. Let’s review the line protocol examples above:

measurement1 field1=1i,field2=1,field3="a"
measurement1 field1=1i,field2=2,field3="b"

When written, this will form 3 different time series with 2 points in each series, assuming each line was written one minute apart. The data will then be persisted in three separate tables:

_measurement	_field	_value	_time
measurement1	field1	1i	2021-07-12T19:38:00.000000000Z
measurement1	field1	1i	2021-07-12T19:39:00.000000000Z

_measurement	_field	_value	_time
measurement1	field2	1	2021-07-12T19:38:00.000000000Z
measurement1	field2	2	2021-07-12T19:39:00.000000000Z

_measurement	_field	_value	_time
measurement1	field3	a	2021-07-12T19:38:00.000000000Z
measurement1	field3	b	2021-07-12T19:39:00.000000000Z

This example provides the first insights into how the input format and the persistence/output format differ. Each series in InfluxDB is persisted with exactly one field, though line protocol allows writing with multiple fields. This is a critically important concept to understand, so it is worth repeating. The InfluxDB input format is different from the InfluxDB persistence and output format.

These tables represent the minimum data roles that can exist in a series. A series must have at a minimum:

A measurement name
A field key
A value for the field
A time

These are represented in the database with a leading “_”. The leading underscore conveys that these are “slots” in the series that are enforced by the storage engine, as well as helps to avoid naming conflicts with potential tag names.

When simply querying for this data, without adding any additional Flux transformations, the annotated CSV output looks like:

#group,false,false,true,true,false,false,true,true
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,long,string,string
#default,_result,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement
,,0,rfc3339time1,rfc3339time2,2021-07-12T19:39:00.000000000Z,1,field1,measurement1
,,0,rfc3339time1,rfc3339time2,2021-07-12T19:38:00.00000000Z,1,field1,measurement1

#group,false,false,true,true,false,false,true,true
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,string,string,string
#default,_result,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement
,,1,rfc3339time1,rfc3339time2,2021-07-12T19:39:00.000000000Z,a,field3,measurement1
,,1,rfc3339time1,rfc3339time2,2021-07-12T19:38:00.00000000Z,b,field3,measurement1

#group,false,false,true,true,false,false,true,true
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,double,string,string
#default,_result,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement
,,2,rfc3339time1,rfc3339time2,2021-07-12T19:39:00.000000000Z,1,field2,measurement1
,,2,rfc3339time1,rfc3339time2,2021-07-12T19:38:00.000000000Z,2,field2,measurement1

Notice how the resulting annotated CSV contains 3 tables in the output. This is evident by the row separation and also by the value of the table column in the last stream of the table which is equal to 2 (remember annotated CSV counts the table results from 0). Group keys have been added to the data to produce these tables so that each table represents a series by default. Remember a column is part of a group key if all of the values in that column are identical within a single table. For example, the time and value columns are assigned a #group annotation of false. Setting the #group annotation to false allows the different timestamps and field values of points across a single series to be included in the same table.

Adding Tags

Let’s review the line protocol example above with an added tag:

measurement1,tag1="tagvalue1" field1=1i,field2=1,field3="a" 1626118680000000000
measurement1,tag1="tagvalue2" field1=2i,field2=2,field3="b" 1626118740000000000

The introduction of tag1 produces the following 6 series:

_measurement	tag1	_field	_value	_time
measurement1	tagvalue1	field1	1i	2021-07-12T19:38:00.000000000Z

_measurement	tag1	_field	_value	_time
measurement1	tagvalue1	field2	1	2021-07-12T19:38:00.000000000Z

_measurement	tag1	_field	_value	_time
measurement1	tagvalue1	field3	a	2021-07-12T19:38:00.000000000Z

_measurement	tag1	_field	_value	_time
measurement1	tagvalue2	field1	2i	2021-07-12T19:39:00.000000000Z

_measurement	tag1	_field	_value	_time
measurement1	tagvalue2	field2	2	2021-07-12T19:39:00.000000000Z

_measurement	tag1	_field	_value	_time
measurement1	tagvalue2	field3	b	2021-07-12T19:39:00.000000000Z

Ensure that you take the time to study and understand the relationship between the line protocol and the resulting tables as they are represented on disk by the storage engine. This relationship is critically important to achieving the most effective schema and querying for your application.

When simply querying for this data, without adding any additional Flux transformations the annotated CSV output looks like:

#group,false,false,true,true,false,false,true,true,true
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,string,string,string,string
#default,_result,,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement,tag1
,,0,rfc3339time1,rfc3339time2,2021-07-12T19:39:000000000Z,b,field3,measurement1,tagvalue2

#group,false,false,true,true,false,false,true,true,true
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,double,string,string,string
#default,_result,,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement,tag1
,,1,rfc3339time1,rfc3339time2,2021-07-12T19:38:000000000Z,1,field2,measurement1,tagvalue1

#group,false,false,true,true,false,false,true,true,true
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,long,string,string,string
#default,_result,,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement,tag1
,,2,2021-01-18T20:59:37Z,2021-08-17T19:59:37.097Z,2021-07-12T19:38:000000000Z,1,field1,measurement1,tagvalue1

#group,false,false,true,true,false,false,true,true,true
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,string,string,string,string
#default,_result,,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement,tag1
,,3,rfc3339time1,rfc3339time2,2021-07-12T19:38:000000000Z,a,field3,measurement1,tagvalue1

#group,false,false,true,true,false,false,true,true,true
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,long,string,string,string
#default,_result,,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement,tag1
,,4,rfc3339time1,rfc3339time2,2021-07-12T19:39:000000000Z,2,field1,measurement1,tagvalue2

#group,false,false,true,true,false,false,true,true,true
#datatype,string,long,dateTime:RFC3339,dateTime:RFC3339,dateTime:RFC3339,double,string,string,string
#default,_result,,,,,,,,
,result,table,_start,_stop,_time,_value,_field,_measurement,tag1
,,5,rfc3339time1,rfc3339time2,2021-07-12T19:39:000000000Z,2,field2,measurement1,tagvalue2

Notice how the resulting annotated CSV contains 6 tables in the output. This is evident by the row separation and also by the value of the table column in the last stream of the table which is equal to 5 (remember annotated CSV counts the table results from 0). Group keys have been added to the data to produce these tables so that each table represents a series by default. Remember a column is part of a group key if all of the values in that column are identical within a single table. For example, the time column is assigned a #group annotation of false. Setting the #group annotation to false allows the different timestamps of points across a single series to be included in the same table. Conversely, the _measurement column is assigned a #group annotation of true. Setting the #group annotation to true enforces that all of the records in that table have the same measurement value. Remember, a series is identified by a unique combination of measurements, tag sets, and fields. If a table is to represent a single series, the table must contain records with the same measurement, tag sets, and fields across all of the rows.

Important note: You can use Flux to manipulate the group keys and the resulting number of tables in the the output Annotated CSV table stream. We’ll learn about how to do this in later chapters.

For now, let’s focus on understanding how line protocol results in different series. We can extend the example by adding an additional tag, but in this case, note that there is only a single tag value for tag2:

measurement1,tag1="tagvalue1",tag2="tagvalue3" field1=1i,field2=1,field3="a" 1626118680000000000
measurement1,tag1="tagvalue2",tag2="tagvalue3" field1=2i,field2=2,field3="b" 1626118740000000000

Again, each series is identified by their unique tag keys, tag values, and field key combinations. Because a series is defined in part by a unique set of tag values, in this case, the introduction of tag2 does not change the table count in the underlying data model. When the introduction of a tag does not change the table count in the underlying data model, the tag is referred to as a dependent tag:

_measurement	tag1	tag2	_field	_value	_time
measurement1	tagvalue1	tagvalue3	field1	1i	2021-07-12T19:38:00.000Z

_measurement	tag1	tag2	_field	_value	_time
measurement1	tagvalue1	tagvalue3	field2	1	2021-07-12T19:38:00.000Z

_measurement	tag1	tag2	_field	_value	_time
measurement1	tagvalue1	tagvalue3	field3	a	2021-07-12T19:38:00.000Z

_measurement	tag1	tag2	_field	_value	_time
measurement1	tagvalue2	tagvalue3	field1	2i	2021-07-12T19:39:00.000Z

_measurement	tag1	tag2	_field	_value	_time
measurement1	tagvalue2	tagvalue3	field2	2	2021-07-12T19:39:00.000Z

_measurement	tag1	tag2	_field	_value	_time
measurement1	tagvalue2	tagvalue3	field3	b	2021-07-12T19:39:00.000Z

To demonstrate the impact of combinations of tag values on the creation of time series, here are three lines of line protocol, but with only one field:

measurement1,tag1="tagvalue1",tag2="tagvalue4" field1=1i 1626118620000000000
measurement1,tag1="tagvalue2",tag2="tagvalue5" field1=2i 1626118680000000000
measurement1,tag1="tagvalue3",tag2="tagvalue6" field1=3i 1626118740000000000

In those 3 lines there are 3 unique combinations of tag values and the single field, so, despite the presence of six total tag values, there are only 3 series created:

_measurement	tag1	tag2	_field	_value	_time
measurement1	tagvalue1	tagvalue4	field1	1i	2021-07-12T19:37:00.000Z

_measurement	tag1	tag2	_field	_value	_time
measurement1	tagvalue2	tagvalue5	field1	2i	2021-07-12T19:38:00.000Z

_measurement	tag1	tag2	_field	_value	_time
measurement1	tagvalue3	tagvalue6	field1	3i	2021-07-12T19:39:00.000Z

In this example, there are only 2 unique combinations of tag values:

measurement1,tag1="tagvalue1",tag2="tagvalue4" field1=1i 1626118620000000000
measurement1,tag1="tagvalue1",tag2="tagvalue4" field1=2i 1626118680000000000
measurement1,tag1="tagvalue2",tag2="tagvalue4" field1=3i 1626118740000000000

As a result, the first series contains two points because those two points have the same combination of field name and tag values, whereas the third point has a different set of tag values.

_measurement	tag1	tag2	_field	_value	_time
measurement1	tagvalue1	tagvalue4	field1	1i	2021-07-12T19:37:00.000Z
measurement1	tagvalue1	tagvalue4	field1	2i	2021-07-12T19:38:00.000Z

_measurement	tag1	tag2	_field	_value	_time
measurement1	tagvalue3	tagvalue4	field1	3i	2021-07-12T19:39:00.000Z

Real World Data

There is an excellent repository of semi-live line protocol data maintained by InfluxData. This is generally intended as sample data to help you get started exploring InfluxDB. Currently, 4 datasets are kept up to date there:

Air Sensor Data: This dataset includes a single tag, which is an id for the specific air quality sensor that is reporting 3 fields, temperature, humidity, and carbon monoxide levels.
Bird Migration Data: This is geo-spatial dataset represents migratory movements of birds. It is tagged to aid geo-spacial querying.
NOAA National Buoy Center Data: This dataset provides the latest observations from the NOAA NDBC network of buoys. It contains a large number of tags and fields.
USGS Earthquake Data. The United States Geological Survey (USGS) earthquake dataset contains seismic activity data. This is a very large dataset, and contains even more tags and fields.

While you can simply copy the Flux from any of the real world sample datasets into the Script Editor in the Data Explorer and visualize the data. I recommend creating a bucket and using the to() function to write the data to that bucket, as described in Write and Query Sample Data in Part 1. Writing the data to a bucket in InfluxDB allows you to use Flux to explore the schema of your dataset.

Important Note: You might hit your series cardinality limit for Free Tier accounts if you write the larger datasets to InfluxDB. I recommend just writing the Air Sensor Data if you’re using the Free Tier account.

Important Note: This section recommends that you use the InfluxDB UI to write data to InfluxDB only because writing data with other tools hasn’t been covered yet. We recommend writing data with the CLI or VS Code. If you prefer developing in with those tools look at the

Exploring the Real Word Data Schema with Flux

Let us turn our attention to 2 real world data sets: Air Sensor Data and the NOAA National Buoy Center Data. We’ll use Flux to get an understanding of our schema. Then we’ll run through some exercises to ensure our understanding of the relationship between the line protocol input format and the InfluxDB data model as persisted on disk by the storage engine.

We’ll start by focusing on the Air sensor dataset, as it is the simplest dataset. The a Air sensor dataset contains:

1 measurement: airSensors
3 fields:
- co
- humidity
- temperature
1 tag: sensor_id
- 8 sensor_id tag values

As you can see, the fields are the actual data, in this case all of type float, where the tag is metadata, defining which sensor produced the data.

Explore your data schema with Flux to obtain the number of measurements, tag keys, tag values, and field keys in your data by using the schema package.

To get the number of fields in the airSensors measurement from the Air sensor sample dataset, run the following Flux query in your preferred tool (CLI, VS Code, InfluxDB UI):

import "influxdata/influxdb/schema"

schema.measurementFieldKeys(
  bucket: "Air sensor sample dataset",
  measurement: "airSensors"
)
|> count()

To get the number of tag keys in the airSensors measurement from the Air sensor sample dataset, run the following Flux query in your preferred tool (CLI, VS Code, InfluxDB UI):

import "influxdata/influxdb/schema"

schema.measurementTagKeys(
  bucket: "Air sensor sample dataset",
  measurement: "airSensors"
)
|> count()

To get the number of tag values for the sensor_id tag key in the airSensors measurement from the Air sensor sample dataset, use the following Flux query (CLI, VS Code, InfluxDB UI):

import "influxdata/influxdb/schema"

schema.measurementTagValues(
  bucket: "Air sensor sample dataset",
  tag: "sensor_id",
  measurement: "example-measurement"
)
|> count()

We can repeat the same approach for the NOAA National Buoy Center Data. We find that the NOAA National Buoy Center Data has the following schema:

1 measurement: ndbc
21 fields:
- air_temp_degc
- avg _wave_period_sec
- dewpoint_temp_degc
- dominate_wave _period_sec
- gust_speed_mps
- lat
- lon
- pressure_temdancy_hpa
- sea_level_pressure_hpa
- sea_surface_temp_degc
- significant_Wave_height_m
- station_currents
- station_dart
- station_elev
- sation_met
- station_visibility_mei
- station_waterquality
- water_level_ft
- wave_dir_degt
- wind_dir_degt
- wind_spead_mps
5 tag keys:
- station_id
  - 113 station_id tag values
- station_name
  - 828
- station_owner
  - 57 station_owner tag values
- station_pgm
  - 6 station_pgm tag values
- station_type
  - 4 station_type tag values

Again, note that the 21 fields are all the different kinds of data that might be collected by a weather station, whereas the 5 tags contain metadata about which stations collected it.

Exercises with Real World Data

Now that we understand the schema of the 2 real world data sets, Air Sensor Data and the NOAA National Buoy Center Data, try to answer the following questions to test your understanding of schema design, line protocol, and

Question 1: How many series will be created by the Air Sensor Data given the schema above?

Answer 1: (1 sensor_id tag x 8 unique tag values) x (3 fields) = 4 x 3 = 24

Question 2: How many series will be created by the NOAA National Buoy Center Data given the schema above (assuming no tags are dependent tags)?

Answer 2: (1 station_id tag x 113 unique tag values) x (1 station_name tag x 828 unique tag values) x (1 station_owner tag x 57 unique tag values) x (1 station_pgm tag x 6 unique tag values) x (1 station_type tag x 47 unique tag values) x (21 fields) = 113 x 828 x 56 x 6 x 47 x 21 = 31028816448

Question 3: How would the following line protocol form the Air sensor sample dataset be organized into tables on disk? And how many points are in each series?

airSensors,sensor_id=TLM0100 temperature=71.17615703642676,humidity=35.12940716174776,co=0.5024058630839136 1626537623000000000
airSensors,sensor_id=TLM0101 temperature=71.80350992863588,humidity=34.864121891949736,co=0.4925449578765155 1626537623000000000
airSensors,sensor_id=TLM0102 temperature=72.02673296407973,humidity=34.91147650009415,co=0.4941631223400505 1626537623000000000
airSensors,sensor_id=TLM0103 temperature=71.34822444566278,humidity=35.19576623496297,co=0.4046734235304059 1626537623000000000
airSensors,sensor_id=TLM0200 temperature=73.57230556533555,humidity=35.77102288427073,co=0.5317633226995193 1626537623000000000
airSensors,sensor_id=TLM0201 temperature=72.57230556533555,humidity=35.17327249047271,co=0.5000439017037601 1626537623000000000
airSensors,sensor_id=TLM0202 temperature=75.28582430811852,humidity=35.668729783597556,co=0.48071553398947864 1626537623000000000
airSensors,sensor_id=TLM0203 temperature=74.75927935923579,humidity=35.89268792033798,co=0.4089308476612381 1626537623000000000
airSensors,sensor_id=TLM0100 temperature=71.2194835668512,humidity=35.12891266051405,co=0.4958773037139102 1626537633000000000
airSensors,sensor_id=TLM0101 temperature=71.78232293801005,humidity=34.88621453634278,co=0.5074032895942003 1626537633000000000
airSensors,sensor_id=TLM0102 temperature=72.07101160147653,humidity=34.938529830668536,co=0.5102716855442547 1626537633000000000
airSensors,sensor_id=TLM0103 temperature=71.32889101333731,humidity=35.21581883021604,co=0.4245915521103036 1626537633000000000
airSensors,sensor_id=TLM0200 temperature=73.55081075397399,humidity=35.74330537831752,co=0.5435288991742965 1626537633000000000
airSensors,sensor_id=TLM0201 temperature=74.06284877512215,humidity=35.17611147751894,co=0.4813785832360323 1626537633000000000
airSensors,sensor_id=TLM0202 temperature=75.29425020175684,humidity=35.64366062740866,co=0.4911462705616819 1626537633000000000
airSensors,sensor_id=TLM0203 temperature=74.77142594525142,humidity=35.941017361190255,co=0.42797647488504065 1626537633000000000

Answer 3:

The line protocol would result in the following series and tables on disk. Each series contains two points from the line protocol above.

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0100	temperature	73.57230556533555	rfc3339time1
airSensors	TLM0100	temperature	71.2194835668512	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0101	temperature	72.02673296407973	rfc3339time1
airSensors	TLM0101	temperature	71.78232293801005	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0102	temperature	73.57230556533555	rfc3339time1
airSensors	TLM0102	temperature	72.07101160147653	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0103	temperature	71.34822444566278	rfc3339time1
airSensors	TLM0103	temperature	71.32889101333731	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0200	temperature	73.57230556533555	rfc3339time1
airSensors	TLM0200	temperature	73.55081075397399	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0201	temperature	73.57230556521233	rfc3339time1
airSensors	TLM0201	temperature	74.06284877512215	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0202	temperature	75.28582430811852	rfc3339time1
airSensors	TLM0202	temperature	75.29425020175684	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0203	temperature	74.75927935923579	rfc3339time1
airSensors	TLM0203	temperature	74.77142594525142	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0100	humidity	35.12940716174776	rfc3339time1
airSensors	TLM0100	humidity	35.12891266051405	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0101	humidity	34.864121891949736	rfc3339time1
airSensors	TLM0101	humidity	34.88621453634278	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0102	humidity	34.91147650009415	rfc3339time1
airSensors	TLM0102	humidity	34.938529830668536	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0103	humidity	35.19576623496297	rfc3339time1
airSensors	TLM0103	humidity	35.21581883021604	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0200	humidity	35.77102288427073	rfc3339time1
airSensors	TLM0200	humidity	35.74330537831752	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0201	humidity	35.17327249047271	rfc3339time1
airSensors	TLM0201	humidity	35.17611147751894	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0202	humidity	35.668729783597556	rfc3339time1
airSensors	TLM0202	humidity	35.64366062740866	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0203	humidity	35.89268792033798	rfc3339time1
airSensors	TLM0203	humidity	35.941017361190255	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0100	co	0.4925449578765155	rfc3339time1
airSensors	TLM0100	co	0.495877303713910	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0101	co	0.4925449578765155	rfc3339time1
airSensors	TLM0101	co	0.5074032895942003	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0102	co	0.4941631223400505	rfc3339time1
airSensors	TLM0102	co	0.5102716855442547	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0103	co	0.4046734235304059	rfc3339time1
airSensors	TLM0103	co	0.4245915521103036	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0200	co	0.5317633226995193	rfc3339time1
airSensors	TLM0200	co	0.5435288991742965	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0201	co	0.5000439017037601	rfc3339time1
airSensors	TLM0201	co	0.4813785832360323	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0202	co	0.48071553398947864	rfc3339time1
airSensors	TLM0202	co	0.4911462705616819	rfc3339time2

_measurement	sensor_id	_field	_value	_time
airSensors	TLM0203	co	0.4089308476612381	rfc3339time1
airSensors	TLM0203	co	0.42797647488504065	rfc3339time2

Next Section