InfluxDB Python integration

Author: Eva Černčič, Data Analyst / Programmer

With a number of deployed IoT devices and amount of collected data on the constant rise, so are increasing database solutions, analytics, monitoring and visualization tools to store and make sense of the overwhelming amount of new data. And rightly so – often the investment in new smart technology and its integration does not deliver to customers expectation, mainly due to overwhelming amount of collected data and no meaningful result. It is safe to say that not many smart devices, even if working properly, actually empower users with meaningful data that are easy to understand and, ultimately, monetize.

Since we specialize in the development of specialized IoT devices, we aim to utilize data analysis and visualization tools, during the development process for our internal purposes. This is conducted separately from the client implementation of the IoT management that is outside our scope.

We find that with combining two open-source tools, InfluxDB – time-series database and Grafana – analytics, monitoring and visualization tool, one can achieve effective and very adaptive data visualization, log monitoring and alerting form your devices and sensors. While multiple plug-ins allow for endless visualization and data analysis possibilities, we still see a high demand for customization and optimization of these steps so the outcomes can easily be fed-back to the development process.

Examples of effective debugging and testing

Implementation

For that purpose, InfluxDB offers libraries for multiple languages support, including influxdb-python library, that is hosted by InfluxDB’s GitHub account. Combined with other data-analysis packets Python has to offer, such as PANDAS for large data structures handling, NumPy for numerical features and plotly for fast user-interactive visualization, one can easily achieve custom data processing that is needed for a certain application. The most important reason behind choosing Python-based data analysis during the development process is high-level library support, the flexibility of fast implementation and major development time saving over other language options.

To start working with InfluxDB data in python environment, install python package and import it into your work file. Next, we want to create a new instance of the InluxDBClient

from influxdb import InfluxDBClient, DataFrameClient

# Server data
myhost = 'myhost_name'
myuser = 'myuser_name'
mypass = 'mypass_name'
mydatabase = 'mydatabase_name'

client = InfluxDBClient(host=myhost, port=8086, username=myuser, password=mypass, database=mydatabase)

If unsure of the desired database to use, parameter default database=None can be left out. To see more parameter options on all functions see documentation. To explore available database options use

client.get_list_database()

which returns database dictionary. If you want to switch between databases proceed with

client.switch_database(mydatabase)

Once a new client is created, data can be then retrieved using query forms with SELECT, FROM and WHERE statements and complementary clauses. If we have time-series database, with database_name, which includes measurements defined by measurement_name with different fields, labelled by field_keys we can retrieve data using SELECT statement, always specifying complementary clause

SELECT <field_key>, – specify single or multiple keys

SELECT * – select all

with FROM statement we specify the measurements of interest

FROM <measurement_name>, – specify single or multiple measurements to select

FROM <database_name>.<measurement_name> – specify single or multiple measurements from specific database

FROM * – from all

To further filter the selected data WHERE statement can be utilized to add conditional expression with AND OR. Conditional expression supports standard logical operators: =, !=, <, <=, >=, >. Final query form should follow

SELECT select_clause FROM from_clause WHERE <conditional_expression> AND\OR <conditional_expression> ...

WHERE statement can be especially useful to limit the time frame of selected data. Keep in mind that the whole query statement is passed to the InfluxDB query() function as a string. For example:

results = client.query('SELECT * FROM datapackets WHERE time > 2020-02-07T15:00:00 AND time < 2020-02-14T00:00:00')

where the start and end time are provided in the form of timestamps in the standard UTC form '2020-02-07T15:00:00'. Since WHERE statements can easily get complicated, employing multiple conditions, bind_parms parameter, in a form of a dictionary, can be specified in python query. Above query example can be rewritten in the form:

date_parms = {'start_time': '2020-02-07T15:00:00', 'end_time': '2020-02-14T00:00:00'}

results = client.query('SELECT * FROM datapackets WHERE time > $start_time AND time < $end_time", bind_params=date_parms)

Such an approach is very useful in an exploratory analysis, as it allows for fast and, even more importantly, transparent change in query conditions. A useful feature here is the epoch parameter, particularly in applications where precise measurement timestamp is important. This can be set to 'h''m''s''ms''u', or 'ns' precision. To find more about query constructions, look at the InfluxQL data exploration.

Once raw data results are obtained, you can explore it in the raw JSON format

results.raw

or parse it to a “python-friendlier” version by converting it to points, with the  option of selecting a specific key, that can be iterated and used in further analysis.

points = results.get_points()
for point in points:
    print('Time: ', point['time'])

If you are dealing with a large dataset, which is often the case with IoT applications, and fast visualization and exploratory analysis are in place, InfluxDB offers another option, where direct data integration with Python PANDAS library is possible. Instead of defining InfluxDBClient you want to create an instance of DataFrameClient client, following the same principles as before.

from influxdb import InfluxDBClient, DataFrameClient

# Server data
myhost = 'myhost_name'
myuser = 'myuser_name'
mypass = 'mypass_name'
mydatabase = 'mydatabase_name'

client_df = DataFrameClient(host=myhost, port=8086, username=myuser, password=mypass, database=mydatabase)

Data can be retrieved using the same query form and functions as in the previous case with the difference that obtained data is in the form of PANDAS DataFrame, convenient for direct analysis and visualization.

results_df = client_df.query('SELECT * FROM datapackets WHERE time > 2020-02-07T15:00:00 AND time < 2020-02-14T00:00:00')

From there fast data manipulation and analysis is possible due to multiple integrated DataFrame functionality, including handling missing data, straightforward visualization, correlation calculation, statistical functionality and much more.

About the author

Eva Černčič

Eva Černčič

Translates numbers into plain English. She manipulates and analyses collected data sets from prototyped systems to evaluate their performance and develop market-oriented solutions.
Partnering with the software development team, she aids in algorithm design and implementation.

Share this post

Share on facebook
Share on linkedin
Share on twitter
Share on google
Share on print
Share on email

Stay Updated!

Subscribe to receive News on interesting projects we’re involved in and open hardware projects we create.