The sample rate is the frequency at which a single data stream is recorded. It is typically expressed in samples per second, or in the somewhat confusing unit of Hz (Hertz, or cycles per second). The distinction that it is the recorded rate is important, because it implies the sample rate has been chosen carefully based on the smallest interval that needs to be measured and the time constant of the sensor, and then increased appropriately to avoid any aliasing effects.
The recording of data may be a one-time (isolated) event, repeated every interval (e.g. daily), or continuous. The continuous streaming of data has practical limits, depending on the path the data will take between the data recording device and the final storage location, as shown in the table below.
| Data Transmission Technology | Sample Rate Limit |
|---|---|
| Fiber Cable (< 4000 mi) | 42 Hz |
| Ethernet (< 100 m) | 42 Hz |
| 5G Cellular (< 4000 mi) | 50 Hz |
| 4G Cellular (< 28000 mi) | 7 Hz |
The term data packet is used here to define the size, in bytes, of the digital data that needs to be stored in a time series database. It may include a unique identifier for the source of the data, a date/time stamp, and then a floating point value for each of the (channel / sensor) data values.
Ideally the packet will contain an RFC3339 date/time stamp (e.g. "20200216T150644Z" or "20200216T150644.000000000Z") along with the data values, rather than relying on the recipient to date/time stamp the data on arrival, because the arrival time could differ considerably from the recording time. Note that the date/time stamp could have a resolution of one second or one nanosecond, requiring 16 or 26 bytes respectively.
A single unencrypted signed floating point value with six digits after the decimal point (the text representation of an IEEE-754 64-bit floating point number) consumes 13 bytes (e.g. "+6.534755E+06"). A generalized formula for the unencrypted storage requirement of a packet with a nanosecond precision RFC3339 date/time stamp would be:
packet size [bytes] = (len(identifier)+1) + (# channels) * 14 + 26
For example, the packet for three channels of data and a nanosecond precision RFC3339 date/time stamp would look like: "myMeasurements;20200216T150644.000000000Z;+6.534755E+06;+6.534755E+06;+6.534755E+06"
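The formula and the example packet above can be checked with a short Python sketch (the identifier and values are the ones from the example; the field widths follow the text: a 26-byte nanosecond timestamp, and 13 bytes per value plus a 1-byte separator):

```python
def packet_size(identifier: str, channels: int) -> int:
    """Unencrypted packet size in bytes, per the formula in the text:
    identifier plus separator, 14 bytes per channel value, 26-byte
    nanosecond precision date/time stamp."""
    return (len(identifier) + 1) + channels * 14 + 26

# The three-channel example packet from the text:
packet = ("myMeasurements;20200216T150644.000000000Z;"
          "+6.534755E+06;+6.534755E+06;+6.534755E+06")

assert packet_size("myMeasurements", 3) == len(packet)  # both 83 bytes
```

Note that the formula yields 83 bytes for this example, which matches the length of the example string character for character.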
Encrypting a data packet may or may not increase the size of the packet. It depends on the encryption employed, and the capabilities of the device / application preparing and sending the data packet.
The amount of data generated for every collection interval in bytes is calculated by multiplying the packet size in bytes by the data collection interval (in seconds), and by the sample rate in Hz (or samples/sec). If the collection interval is continuous / streaming, then you can calculate the amount of data collected per day, for example, by multiplying the packet size in bytes by 86400 sec/day, and by the sample rate in Hz or samples/sec.
For example, the packet size of 82 bytes streaming continuously at 42 Hz would generate 298 MB of data per day:
data generated in bytes per day = (42 samples/sec) * (86400 sec/day) * (82 bytes/sample) = 297,561,600 bytes/day
If the same packet size of 82 bytes was recorded for 12 hours per day at 42 Hz, then 149 MB/day would be generated.
data generated in bytes per day = (42 samples/sec) * (3600*12 sec/day) * (82 bytes/sample) = 148,780,800 bytes/day
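The two examples above can be expressed as a small helper function (the 82-byte packet and 42 Hz rate are the example values from the text):

```python
def bytes_per_day(packet_bytes: int, rate_hz: float, hours: float = 24) -> int:
    """Daily data volume: packet size * recording seconds per day * sample rate."""
    return int(rate_hz * hours * 3600 * packet_bytes)

# Continuous streaming at 42 Hz with an 82-byte packet:
assert bytes_per_day(82, 42) == 297_561_600           # ~298 MB/day

# The same packet recorded only 12 hours per day:
assert bytes_per_day(82, 42, hours=12) == 148_780_800  # ~149 MB/day
```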
Under the constraint of a maximum sample rate of 42 Hz for long distance data transmission, you might say, "I can watch YouTube videos on my smartphone over a 4G cellular connection without an issue, so how is that different from continuously streaming data files?" Well, consider these facts about YouTube videos:
A 4G cellular connection can sustain roughly 224 MB/min of continuous streaming, which is nearly enough to watch 2160p videos on YouTube continuously (although at that rate a 10 GB/month cellular data plan would be exhausted in about 45 minutes). Within the constraint of 224 MB/min of continuous streaming on a cellular 4G connection, you could sample roughly 6300 channels of data at 42 Hz, or 264 channels at 1000 Hz.
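The channel counts quoted above can be reproduced from the packet format described earlier; ignoring the identifier overhead is a simplifying assumption of this sketch:

```python
def max_channels(rate_hz: float, budget_bytes_per_min: float = 224e6) -> int:
    """Channels that fit in a per-minute byte budget, assuming the earlier
    packet format: 26-byte timestamp plus 14 bytes per channel value
    (identifier overhead ignored for simplicity)."""
    per_packet = budget_bytes_per_min / 60 / rate_hz  # byte budget per sample
    return int((per_packet - 26) / 14)

print(max_channels(42))    # roughly 6,300 channels
print(max_channels(1000))  # roughly 264 channels
```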
The functional requirements for the time series database solution should include consideration for both ingestion and analysis of the data.
Having all of your Big Data in a single location for long term storage, and making it available globally to all of the engineers in your organization, initially sounds like a fantastic way to manage data and put it into the hands of the engineers who need it. BUT IT IS NOT THE BEST CHOICE. If any of the engineers trying to access the data at the single central data center are geographically thousands of miles away, the performance of uploading raw data files and analyzing them later will be very poor. The only scenario where a central location can reasonably service remote users globally is a system with the following configuration:
A distributed data center configuration can provide the best performance when it is properly configured and supported. If you have groups of activities where the primary use of the data is localized, and you put in place a means for all of the data collected by each location to be shared with every distributed data center, then you can achieve the best overall performance. You will need global standards for processing the local data, and a good method for updating each location with new data from all of the other locations. Redundancy of the data, put in place to achieve good data analysis access performance, has obvious additional benefits. It is essential that each location is configured with a very high speed network connection between each engineer and the server where the data resides.
A frequent scenario is the engineer who records the data and keeps it stored locally, on a group shared drive or on the engineer's laptop / PC. Assuming the laptop / PC and any local network connection involved are all high performance, the data processing experience will be very good. The downside to this arrangement is that the data is not open for access by other engineers, and it is up to that engineer to properly categorize and organize the data.
When you think about the storage of Big Data from test data acquisition activities, the paradigm is similar to working with many large video files, WITH THESE IMPORTANT EXCEPTIONS:
Can your relational or NoSQL database service all of those needs? These needs are more easily serviced by a binary file based approach to the storage of each data acquisition file, where the unique needs of this type of data are accommodated. One such structured file format is the NI TDM Data Model.
Transferring Big Data across geo-distributed locations thousands of miles apart is constrained primarily by the transport protocol, whether over the internet or a dedicated network, and ultimately by latency. The time it takes to send data between the source and destination points, plus the time it takes for delivery acknowledgment, is the round-trip time (RTT). Moving a 10 GB file or data set across the US will take 10-20 hours on a typical 100 Mbps line using standard TCP-based file transfer tools (source). Specialized services are available that can reduce that time to as little as 14 minutes, but beyond that, latency will limit what can be done. Only breaking the laws of physics will overcome the latency constraint. Latency will also prevent you from continuously streaming a single channel of data from a sensor to a distant wired/wireless location at anything better than a sample rate of 42 Hz.
If your budget is small, or your data sizes are even larger, and you can afford a few days of delay, then consider physically shipping the portable storage device, or a copy of the data on a portable flash drive, by mail. Do it yourself, or consider Seagate's Lyve Mobile high capacity edge storage solution. NI's Edge Storage and Data Transfer Service (DTaaS) claims a storage capacity of 200+ TB and a data throughput of greater than 6 GB/s.
Once your raw data makes it to its final destination for long term storage and data analysis, then look at how the TSDMS Bulk Operations can accelerate your extraction of information from that data. Bulk Import will help you decode and import thousands of raw data files. Bulk Analysis will perform statistical calculations on the entire set of data files. And Bulk Report will generate reports from the analyzed files, providing you with insight on your data in minutes.
The continuous streaming of a single channel of data is limited by latency. The table below provides a realistic expectation for the continuous transmission of a single channel of data by various available methods.
| Type | Latency | Sample Rate | Maximum Measurable Event Frequency |
|---|---|---|---|
| Satellite | 800 ms | 1.3 Hz | 0.13 Hz |
| 4G Cellular (< 28000 mi) | 150 ms | 7 Hz | 0.7 Hz |
| 5G Cellular (< 4000 mi) | 20 ms | 50 Hz | 5.0 Hz |
| Copper Wire (Ethernet, limited to 100 m) | 24 ms | 42 Hz | 4.2 Hz |
| Fiber Cable (across USA) | 24 ms | 42 Hz | 4.2 Hz |
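The sample rate column appears to follow directly from the latency (one sample per round trip), with the maximum measurable event frequency taken as one tenth of the sample rate (the common rule of thumb of ten samples per cycle). A quick sketch of that assumed relationship:

```python
# Assumed relationships behind the table values: max sample rate ~ 1/latency,
# and measurable event frequency ~ sample rate / 10. These are this sketch's
# assumptions, not a published specification.

def max_sample_rate_hz(latency_s: float) -> float:
    return 1.0 / latency_s

def max_event_freq_hz(latency_s: float) -> float:
    return max_sample_rate_hz(latency_s) / 10.0

for name, latency in [("Satellite", 0.800), ("4G Cellular", 0.150),
                      ("5G Cellular", 0.020), ("Fiber Cable", 0.024)]:
    print(f"{name}: {max_sample_rate_hz(latency):.1f} Hz sample rate, "
          f"{max_event_freq_hz(latency):.2f} Hz max event frequency")
```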
If your sampling requirements exceed what can be done with continuous data streaming, then consider using a data logger that can record the data directly to portable media, then swapping out that portable media and mailing it to your long term storage location. The TSDMS Bulk Operations will have you quickly extracting value from that data.
The software you choose to manage text based test measurement data can have a significant influence on the time required to import a medium size (115 MiB) data file for analysis. If you need to analyze groups of files for trends, then the application data processing requirements can be even more of a challenge.
Let's look at the case of reading medium size (115 MiB) and large (1 GiB) text files. The table below compares the time required to read text files with comma separated value (CSV) data, ranging in size from 1.1 MiB to 1.1 GiB. Excel does a fairly good job of reading the typical text file imported by a user for general business purposes. However, medium size test measurement files (115 MiB) and larger begin to take considerably longer to import, and anything with more than 1 million rows of data will exceed the limits of the application. Look at the performance of a DIAdem DataPlugin created using the text file DataPlugin wizard. This is the performance you need when processing many new text based measurement data files.
| # Rows | File Size | DIAdem DataPlugin (~38-56 GiB/hr) | Excel |
|---|---|---|---|
| 10000 | 1.1 MiB | 0.2 sec | 5.3 sec |
| 100000 | 11.5 MiB | 0.8 sec | 16 sec |
| 1000000 | 115 MiB | 7.2 sec | 168 sec |
| 10000000 | 1126 MiB (1.1 GiB) | 104 sec | (1) 387 sec |
(1) 387 sec (6.5 minutes) for 1,048,577 rows (10% of the data). Excel cannot store more than 1,048,576 rows of data in a worksheet.
Now consider the case where you need to analyze multiple files for trends. If you save the 1 million row text file you read into Excel to an Excel binary .xlsb file format, it will still cost you nearly ten seconds to read that file (again) with Excel. With DIAdem, you use a DataPlugin to import the data once, save it to the TDM/TDMS file format, and then you can read a file that is 10x bigger than what you can store in Excel in less than 0.04 seconds! If you need to analyze many data files, perhaps looking for trends over time, then consider the performance advantage of working with them in the DIAdem TDM/TDMS binary file format.
Analyzing many large data files as a group for trends is commonly approached by concatenating those files sequentially in a specific order, and then analyzing the final assembled file as if it were one data file. The problem with that approach is that when you go to analyze the file, you probably need to read it completely into memory, and that causes you to quickly reach the limits of what you can do in Excel. In DIAdem, a channel (or measurement) can have up to 2 billion (2^31) values.
Processing / analyzing many large files in DIAdem is where this application really sets itself apart from the rest. To begin with, DIAdem has a file oriented storage methodology. The TDMS file format is designed to provide easy and fast access to the metadata associated with the data file (so you can search and find the files of interest). DIAdem provides several options for loading the data. For example, you can load only the structure of the data (not the data values), and determine what channels you ultimately want to load. The file format was also designed to allow you to read only the specific channels of interest from a file. The combination of very fast file reading, and flexibility in what content is read from each file, results in an application that is superior for processing and analyzing test measurement data.
Collecting field test data and putting it into the hands of engineers who can use it to make informed product / process decisions requires careful planning from the start of the process to the end. Any interruption in the timely delivery of the data to the engineers at the end of the process, or a compromise in the quality of the data, will affect the outcome.
Start with the end in mind. Identify from the end users of the data what their needs are, what they would really like, and when they need it. Before creating a data collection plan, carefully consider what analysis and data processing will be required to turn that raw data into useful information, and who is going to do it. You also need to perform a risk analysis on the data that has been identified as critical, and assess the potential that a sensor or other equipment could fail and compromise the quality or complete delivery of the data. Redundant sensors / hardware, and even duplicate recorded data channels, may be a prudent choice to save the test activity.
For some projects, it may be sufficient to collect the data and begin the analysis some time later, after all of the recording is complete. However, this is very risky, and should be avoided unless a complete pilot run is performed to ensure that a sample of the data (perhaps collected in the laboratory where the test specimen is being instrumented) is recorded and then fully processed exactly as intended for the actual final data set. A much better approach is to do the pilot test, and then during the actual testing release small batches of data as frequently as possible and fully analyze them. This gives you the best opportunity to catch any mistakes or omissions as early as possible.
Think carefully about the size of the raw data you will be transferring from a remote field test location to the final long term storage destination. Continuous streaming of data over thousands of miles limits you to a sample rate of about 42 Hz. Daily collected (< 24 hours/day) raw data files with a total size in the gigabyte range will not transfer quickly over WiFi, cellular, or any network with thousands of miles between the two points unless a specialized service is employed. If you can afford a few days of delay between pulling a batch of files from the field test unit and final analysis at the final destination, consider physically mailing the portable storage device.
Most data loggers and data acquisition equipment can be configured to control how often a new file is generated while streaming data to a flash drive. Whenever possible, try to keep the recorded file size to 100 MB or less. This may sound small, but smaller files will dramatically reduce data transfer and conversion time later.
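As a rough sizing sketch, you can estimate how often a new file should be started to stay under a target file size. The 82-byte packet and 42 Hz rate below reuse the earlier example values and are assumptions; substitute your own:

```python
def seconds_per_file(target_bytes: float, packet_bytes: int, rate_hz: float) -> float:
    """Recording time before a file reaches the target size."""
    return target_bytes / (packet_bytes * rate_hz)

# 100 MB target, 82-byte packet, 42 Hz (example values from earlier):
hours = seconds_per_file(100e6, 82, 42) / 3600
print(f"Start a new file about every {hours:.1f} hours")
```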
If the data is CAN/LIN bus log data, then it will expand 10x or more in size when decoded with the bus log databases. If you have a 3.3 GB raw CAN/LIN bus log data file and decode it, it easily becomes a 30 GB file!
It is always easier to analyze and concatenate small files than it is to split large files. A DIAdem channel can have up to 2 billion (2^31) values, so it has outstanding capability for working with large amounts of file data, but the performance will be significantly better if you keep the analyzed data file sizes smaller. Ideally, one or more files should together cover an event of interest. You don't want to have to manually split large files later because the test specimen was performing a different task, or operating in a different environment.
Sometimes a sensor is not able to record a value at the time a sample is taken, or the value exceeds the maximum measurable limit of the sensor. These are just some of the conditions where the value will be stored as NoValue when imported. NoValues are an important DIAdem feature that allow you to clearly identify when the value of a channel is not known at the time a sample was taken. It would be highly undesirable if the value were instead set to zero, the minimum allowed measured value, the sensor full scale value, or an interpolated value.
Some DIAdem functions work well with NoValues, and others require the user to resolve them to something else in order to perform the intended data manipulation or analysis. You have several options for the management of NoValues, including interpolation, using the last value, or deleting them. DIAdem includes tools to make it easy to manage NoValues in channel data.
If you configure your data logger / data acquisition equipment in one location while it is installed in the test specimen, then ship it to a test destination one time zone away, and later ship the recorded data to yet another time zone for long term storage and data analysis, will you really know when the data was recorded? The best practice is to have a procedure where the DAQ / data logger equipment is set to Coordinated Universal Time (UTC), and the time zone in effect at the time of the recording is also documented. If you cannot document the time zone, then an alternative is to record GPS location data, so that later the date/time relative to UTC can be determined, and the local time zone can be inferred from the GPS coordinates.
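A minimal Python sketch of that practice: record the timestamp in UTC and store the documented local offset separately as metadata (the UTC-6 offset here is only an example value, not something from the text):

```python
from datetime import datetime, timezone, timedelta

# Record the sample time in UTC, never in naive local time.
sample_time = datetime(2020, 2, 16, 15, 6, 44, tzinfo=timezone.utc)

# Document the local time zone of the recording as metadata
# (UTC-6 is just an example value here).
recording_metadata = {
    "timestamp_utc": sample_time.isoformat(),
    "local_utc_offset_hours": -6,
}

# Local wall-clock time can always be reconstructed later:
local = sample_time.astimezone(timezone(timedelta(hours=-6)))
print(local.isoformat())  # 2020-02-16T09:06:44-06:00
```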
It is amazing what information can be determined about past environmental conditions (weather) and topography at a particular location and date/time using online APIs. DIAdem has Shuttle Radar Topography Mission (SRTM) data tools that fetch high-resolution digital topographic data covering nearly all of the Earth at a resolution of 30 m. Online API services such as https://timezonedb.com can tell you the country code, time zone name, and time zone offset for a pair of GPS coordinates. DIAdem includes tools that calculate the linear distance and elevation range between two GPS coordinates.
GPS signals include a date/time reference relative to Coordinated Universal Time (UTC). The resolution of this value could be as good as 1.5 seconds, but because GPS time is not adjusted for leap seconds, it has drifted ahead of UTC by a considerable, but known, amount since the initial synchronization on 6 January 1980. As of 2021, the GPS reported time is 18 seconds ahead of the actual UTC value. The correction to the GPS reported time is calculated from the number of leap seconds that have occurred since January 1980; each leap second adds one second to the difference between the GPS reported time and the actual UTC time. When properly adjusted, the GPS date/time signal provides a valuable reference identifying when data was recorded.
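A minimal sketch of that correction, assuming the 18-second GPS-to-UTC offset in effect as of 2021 (for real use, look up the current leap second count, since it changes whenever a new leap second is announced):

```python
from datetime import datetime, timedelta, timezone

# Leap seconds accumulated since the GPS epoch (6 Jan 1980);
# 18 as of 2021. GPS time runs ahead of UTC by this amount.
GPS_UTC_LEAP_SECONDS = 18

def gps_to_utc(gps_time: datetime) -> datetime:
    """Convert a GPS-reported time to actual UTC by removing leap seconds."""
    return gps_time - timedelta(seconds=GPS_UTC_LEAP_SECONDS)

gps = datetime(2021, 6, 1, 12, 0, 18, tzinfo=timezone.utc)
print(gps_to_utc(gps).isoformat())  # 2021-06-01T12:00:00+00:00
```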