The sample rate is the frequency at which a single data stream is recorded. It is typically expressed in samples per second, or in the somewhat confusing unit of Hz (Hertz, or cycles per second). The distinction that it is the recorded rate is important, because it implies the sample rate has been chosen carefully based on the smallest interval that needs to be measured and the time constant of the sensor, and then increased appropriately to avoid any aliasing effects.
The recording of data may be a one-time (isolated) event, repeated every interval (e.g. daily), or continuous. The continuous streaming of data has practical limits, depending on the path the data will take between the data recording device and the final storage location, as shown in the table below.
| Data Transmission Technology | Sample Rate Limit |
|---|---|
| Fiber Cable (< 4000 mi) | 42 Hz |
| Ethernet (< 100 m) | 42 Hz |
| 5G Cellular (< 4000 mi) | 50 Hz |
| 4G Cellular (< 28000 mi) | 7 Hz |
The term data packet is used here to define the size, in bytes, of the digital data that needs to be stored in a time series database. It may include a unique identifier for the source of the data, a date/time stamp, and then a floating point value for each of the (channel / sensor) data values.
Ideally the packet will contain an RFC3339 date/time stamp (e.g. "20200216T150644Z" or "20200216T150644.000000000Z") along with the data values, rather than relying on the recipient to date/time stamp the data on arrival, because the arrival time could differ considerably from the recording time. Note that the date/time stamp could have a resolution of one second or one nanosecond, requiring 16 or 26 bytes respectively.
A single unencrypted signed floating point value with six digits after the decimal point (the text representation of an IEEE-754 64-bit floating point number) consumes 13 bytes (e.g. "+6.534755E+06"). A generalized formula for the unencrypted storage requirement of a packet with a nanosecond precision RFC3339 date/time stamp would be:
packet size [bytes] = (len(identifier)+1) + (# channels) * 14 + 26
For example, the packet for three channels of data and a nanosecond precision RFC3339 date/time stamp would look like: "myMeasurements;20200216T150644.000000000Z;+6.534755E+06;+6.534755E+06;+6.534755E+06"
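The formula and the example packet above can be checked with a short Python sketch (the identifier and values are the ones from the example; the field widths follow the text: a 26-byte nanosecond timestamp, and 13 bytes per value plus a 1-byte separator):

```python
def packet_size(identifier: str, channels: int) -> int:
    """Unencrypted packet size in bytes, per the formula in the text:
    identifier plus separator, 14 bytes per channel value, 26-byte
    nanosecond precision date/time stamp."""
    return (len(identifier) + 1) + channels * 14 + 26

# The three-channel example packet from the text:
packet = ("myMeasurements;20200216T150644.000000000Z;"
          "+6.534755E+06;+6.534755E+06;+6.534755E+06")

assert packet_size("myMeasurements", 3) == len(packet)  # both 83 bytes
```

Note that the formula yields 83 bytes for this example, which matches the length of the example string character for character.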
Encrypting a data packet may or may not increase the size of the packet. It depends on the encryption employed, and the capabilities of the device / application preparing and sending the data packet.
The amount of data generated for every collection interval in bytes is calculated by multiplying the packet size in bytes by the data collection interval (in seconds), and by the sample rate in Hz (or samples/sec). If the collection interval is continuous / streaming, then you can calculate the amount of data collected per day, for example, by multiplying the packet size in bytes by 86400 sec/day, and by the sample rate in Hz or samples/sec.
For example, the packet size of 82 bytes streaming continuously at 42 Hz would generate 298 MB of data per day:
data generated in bytes per day = (42 samples/sec) * (86400 sec/day) * (82 bytes/sample) = 297,561,600 bytes/day
If the same packet size of 82 bytes was recorded for 12 hours per day at 42 Hz, then 149 MB/day would be generated.
data generated in bytes per day = (42 samples/sec) * (3600*12 sec/day) * (82 bytes/sample) = 148,780,800 bytes/day
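The two examples above can be expressed as a small helper function (the 82-byte packet and 42 Hz rate are the example values from the text):

```python
def bytes_per_day(packet_bytes: int, rate_hz: float, hours: float = 24) -> int:
    """Daily data volume: packet size * recording seconds per day * sample rate."""
    return int(rate_hz * hours * 3600 * packet_bytes)

# Continuous streaming at 42 Hz with an 82-byte packet:
assert bytes_per_day(82, 42) == 297_561_600           # ~298 MB/day

# The same packet recorded only 12 hours per day:
assert bytes_per_day(82, 42, hours=12) == 148_780_800  # ~149 MB/day
```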
Under the constraint of a maximum sample rate of 42 Hz for long distance data transmission, you might say, "I can watch YouTube videos on my smartphone over a 4G cellular connection without an issue, so how is that different from continuously streaming data files?" Well, consider these facts about YouTube videos:
A 4G cellular connection can sustain roughly 224 MB/min of continuous streaming, which is nearly enough to watch 2160p videos on YouTube continuously (although at that rate a 10 GB/month cellular data plan would be exhausted in about 45 minutes). Within the constraint of 224 MB/min of continuous streaming on a cellular 4G connection, you could sample roughly 6300 channels of data at 42 Hz, or 264 channels at 1000 Hz.
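The channel counts quoted above can be reproduced from the packet format described earlier; ignoring the identifier overhead is a simplifying assumption of this sketch:

```python
def max_channels(rate_hz: float, budget_bytes_per_min: float = 224e6) -> int:
    """Channels that fit in a per-minute byte budget, assuming the earlier
    packet format: 26-byte timestamp plus 14 bytes per channel value
    (identifier overhead ignored for simplicity)."""
    per_packet = budget_bytes_per_min / 60 / rate_hz  # byte budget per sample
    return int((per_packet - 26) / 14)

print(max_channels(42))    # roughly 6,300 channels
print(max_channels(1000))  # roughly 264 channels
```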
The functional requirements for the time series database solution should include consideration for both ingestion and analysis of the data.
Having all of your Big Data in a single location for long term storage, and making it available globally to all of the engineers in your organization, initially sounds like a fantastic way to manage data and put it into the hands of the engineers who need it. BUT IT IS NOT THE BEST CHOICE. If any of the engineers trying to access the data at the single central data center are geographically thousands of miles away, the performance of uploading raw data files and analyzing them later will be very poor. The only scenario where a central location can reasonably service remote users globally is a system with the following configuration:
A distributed data center configuration can provide the best performance when it is properly configured and supported. If you have groups of activities where the primary use of the data is localized, and you put in place a means for all of the data collected by each location to be shared with every distributed data center, then you can achieve the best overall performance. You will need global standards for processing the local data, and a good method for updating each location with new data from all of the other locations. Redundancy of the data, put in place to achieve good data analysis access performance, has obvious additional benefits. It is essential that each location is configured with a very high speed network connection between each engineer and the server where the data resides.
A frequent scenario is the engineer who records the data and keeps it stored locally, on a group shared drive or on the engineer's laptop / PC. Assuming the laptop / PC and any local network connection involved are all high performance, the data processing experience will be very good. The downside to this arrangement is that the data is not open for access by other engineers, and it is up to that engineer to properly categorize and organize the data.
When you think about the storage of Big Data from test data acquisition activities, the paradigm is similar to working with many large video files, WITH THESE IMPORTANT EXCEPTIONS:
Can your relational or NoSQL database service all of those needs? These needs are more easily serviced by a binary file based approach to the storage of each data acquisition file, where the unique needs of this type of data are accommodated. One such structured file format is the NI TDM Data Model.
Transferring Big Data across geo-distributed locations thousands of miles apart is constrained primarily by the transport protocol, whether over the internet or a dedicated network, and ultimately by latency. The time it takes to send data between the source and destination points, plus the time it takes for delivery acknowledgment, is the round-trip time (RTT). Moving a 10 GB file or data set across the US will take 10-20 hours on a typical 100 Mbps line using standard TCP-based file transfer tools (source). Specialized services are available that can reduce that time to as little as 14 minutes, but beyond that, latency will limit what can be done. Only breaking the laws of physics will overcome the latency constraint. Latency will also prevent you from continuously streaming a single channel of data from a sensor to a distant wired/wireless location at anything better than a sample rate of 42 Hz.
If your budget is small, or your data sizes are even larger, and you can afford a few days of delay, then consider physically shipping the portable storage device, or a copy of the data on a portable flash drive, by mail. Do it yourself, or consider Seagate's Lyve Mobile high capacity edge storage solution. NI's Edge Storage and Data Transfer Service (DTaaS) claims a storage capacity of 200+ TB and a data throughput of greater than 6 GB/s.
Once your raw data makes it to its final destination for long term storage and data analysis, then look at how the TSDMS Bulk Operations can accelerate your extraction of information from that data. Bulk Import will help you decode and import thousands of raw data files. Bulk Analysis will perform statistical calculations on the entire set of data files. And Bulk Report will generate reports from the analyzed files, providing you with insight on your data in minutes.
The continuous streaming of a single channel of data is limited by latency. The table below provides a realistic expectation for the continuous transmission of a single channel of data by various available methods.
| Type | Latency | Sample Rate | Maximum Measurable Event Frequency |
|---|---|---|---|
| Satellite | 800 ms | 1.3 Hz | 0.13 Hz |
| 4G Cellular (< 28000 mi) | 150 ms | 7 Hz | 0.7 Hz |
| 5G Cellular (< 4000 mi) | 20 ms | 50 Hz | 5.0 Hz |
| Copper Wire (Ethernet, limited to 100 m) | 24 ms | 42 Hz | 4.2 Hz |
| Fiber Cable (across USA) | 24 ms | 42 Hz | 4.2 Hz |
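The sample rate column appears to follow directly from the latency (one sample per round trip), with the maximum measurable event frequency taken as one tenth of the sample rate (the common rule of thumb of ten samples per cycle). A quick sketch of that assumed relationship:

```python
# Assumed relationships behind the table values: max sample rate ~ 1/latency,
# and measurable event frequency ~ sample rate / 10. These are this sketch's
# assumptions, not a published specification.

def max_sample_rate_hz(latency_s: float) -> float:
    return 1.0 / latency_s

def max_event_freq_hz(latency_s: float) -> float:
    return max_sample_rate_hz(latency_s) / 10.0

for name, latency in [("Satellite", 0.800), ("4G Cellular", 0.150),
                      ("5G Cellular", 0.020), ("Fiber Cable", 0.024)]:
    print(f"{name}: {max_sample_rate_hz(latency):.1f} Hz sample rate, "
          f"{max_event_freq_hz(latency):.2f} Hz max event frequency")
```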
If your sampling requirements exceed what can be done with continuous data streaming, then consider using a data logger that can record the data directly to portable media, then swapping out that portable media and mailing it to your long term storage location. The TSDMS Bulk Operations will have you quickly extracting value from that data.
The software you choose to manage text based test measurement data can have a significant influence on the time required to import a medium size (115 MiB) data file for analysis. If you need to analyze groups of files for trends, then the application data processing requirements can be even more of a challenge.
Let's look at the case of reading medium size (115 MiB) and large (1 GiB) text files. The table below compares the time required to read text files with comma separated value (CSV) data, ranging in size from 1.1 MiB to 1.1 GiB. Excel does a fairly good job of reading the typical text file imported by a user for general business purposes. However, medium size test measurement files (115 MiB) and larger begin to take considerably longer to import, and anything with more than 1 million rows of data will exceed the limits of the application. Look at the performance of a DIAdem DataPlugin created using the text file DataPlugin wizard. This is the performance you need when processing many new text based measurement data files.
| # Rows | File Size | DIAdem DataPlugin (~38-56 GiB/hr) | Excel |
|---|---|---|---|
| 10000 | 1.1 MiB | 0.2 sec | 5.3 sec |
| 100000 | 11.5 MiB | 0.8 sec | 16 sec |
| 1000000 | 115 MiB | 7.2 sec | 168 sec |
| 10000000 | 1126 MiB (1.1 GiB) | 104 sec | (1) 387 sec |
(1) 387 sec (6.5 minutes) for 1,048,577 rows (10% of the data). Excel cannot store more than 1,048,576 rows of data in a worksheet.
Now consider the case where you need to analyze multiple files for trends. If you save the 1 million row text file you read into Excel to an Excel binary .xlsb file format, it will still cost you nearly ten seconds to read that file (again) with Excel. With DIAdem, you use a DataPlugin to import the data once, save it to the TDM/TDMS file format, and then you can read a file that is 10x bigger than what you can store in Excel in less than 0.04 seconds! If you need to analyze many data files, perhaps looking for trends over time, then consider the performance advantage of working with them in the DIAdem TDM/TDMS binary file format.
Analyzing many large data files as a group for trends is commonly approached by concatenating those files sequentially in a specific order, and then analyzing the final assembled file as if it were one data file. The problem with that approach is that when you go to analyze the file, you probably need to read it completely into memory, and that causes you to quickly reach the limits of what you can do in Excel. In DIAdem, a channel (or measurement) can have up to 2 billion (2^31) values.
Processing / analyzing many large files in DIAdem is where this application really sets itself apart from the rest. To begin with, DIAdem has a file oriented storage methodology. The TDMS file format is designed to provide easy and fast access to the metadata associated with the data file (so you can search and find the files of interest). DIAdem provides several options for loading the data. For example, you can load only the structure of the data (not the data values), and determine what channels you ultimately want to load. The file format was also designed to allow you to read only the specific channels of interest from a file. The combination of very fast file reading, and flexibility in what content is read from each file, results in an application that is superior for processing and analyzing test measurement data.
Collecting field test data and putting it into the hands of engineers who can use it to make informed product / process decisions requires careful planning from the start of the process to the end. Any interruption in the timely delivery of the data to the engineers at the end of the process, or a compromise in the quality of the data, will affect the outcome.
Start with the end in mind. Identify from the end users of the data what their needs are, what they would really like, and when they need it. Before creating a data collection plan, carefully consider what analysis and data processing will be required to turn that raw data into useful information, and who is going to do it. You also need to perform a risk analysis on the data that has been identified as critical, and assess the potential that a sensor or other equipment could fail and compromise the quality or complete delivery of the data. Redundant sensors / hardware, and even duplicate recorded data channels, may be a prudent choice to save the test activity.
For some projects, it may be sufficient to collect the data and begin the analysis some time later, after all of the recording is complete. However, this is very risky, and should be avoided unless a complete pilot run is performed to ensure that a sample of the data (perhaps collected in the laboratory where the test specimen is being instrumented) is recorded and then fully processed exactly as intended for the actual final data set. A much better approach is to do the pilot test, and then during the actual testing release small batches of data as frequently as possible and fully analyze them. This gives you the best opportunity to catch any mistakes or omissions as early as possible.
Think carefully about the size of the raw data you will be transferring from a remote field test location to the final long term storage destination. Continuous streaming of data over thousands of miles limits you to a sample rate of about 42 Hz. Daily collected (< 24 hours/day) raw data files with a total size in the gigabyte range will not transfer quickly over WiFi, cellular, or any network with thousands of miles between the two points unless a specialized service is employed. If you can afford a few days of delay between pulling a batch of files from the field test unit and final analysis at the final destination, consider physically mailing the portable storage device.
Most data loggers and data acquisition equipment can be configured to control how often a new file is generated while streaming data to a flash drive. Whenever possible, try to keep the recorded file size to 100 MB or less. This may sound small, but smaller files will dramatically reduce data transfer and conversion time later.
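As a rough sizing sketch, you can estimate how often a new file should be started to stay under a target file size. The 82-byte packet and 42 Hz rate below reuse the earlier example values and are assumptions; substitute your own:

```python
def seconds_per_file(target_bytes: float, packet_bytes: int, rate_hz: float) -> float:
    """Recording time before a file reaches the target size."""
    return target_bytes / (packet_bytes * rate_hz)

# 100 MB target, 82-byte packet, 42 Hz (example values from earlier):
hours = seconds_per_file(100e6, 82, 42) / 3600
print(f"Start a new file about every {hours:.1f} hours")
```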
If the data is CAN/LIN bus log data, then it will expand 10x or more in size when decoded with the bus log databases. If you have a 3.3 GB raw CAN/LIN bus log data file and decode it, it easily becomes a 30 GB file!
It is always easier to analyze and concatenate small files than it is to split large files. A DIAdem channel can have up to 2 billion (2^31) values, so it has outstanding capability for working with large amounts of file data, but the performance will be significantly better if you keep the analyzed data file sizes smaller. Ideally, one or more files should together cover an event of interest. You don't want to have to manually split large files later because the test specimen was performing a different task, or operating in a different environment.
Sometimes a sensor is not able to record a value at the time a sample is taken, or the value exceeds the maximum measurable limit of the sensor. These are just some of the conditions where the value will be stored as NoValue when imported. NoValues are an important DIAdem feature that allow you to clearly identify when the value of a channel is not known at the time a sample was taken. It would be highly undesirable if the value were instead set to zero, the minimum allowed measured value, the sensor full scale value, or an interpolated value.
Some DIAdem functions work well with NoValues, and others require the user to resolve them to something else in order to perform the intended data manipulation or analysis. You have several options for the management of NoValues, including interpolation, using the last value, or deleting them. DIAdem includes tools to make it easy to manage NoValues in channel data.
If you configure your data logger / data acquisition equipment in one location while it is installed in the test specimen, then ship it to a test destination one time zone away, and later ship the recorded data to yet another time zone for long term storage and data analysis, will you really know when the data was recorded? The best practice is to have a procedure where the DAQ / data logger equipment is set to Coordinated Universal Time (UTC), and the time zone in effect at the time of the recording is also documented. If you cannot document the time zone, then an alternative is to record GPS location data, so that later the date/time relative to UTC can be determined, and the local time zone can be inferred from the GPS coordinates.
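A minimal Python sketch of that practice: record the timestamp in UTC and store the documented local offset separately as metadata (the UTC-6 offset here is only an example value, not something from the text):

```python
from datetime import datetime, timezone, timedelta

# Record the sample time in UTC, never in naive local time.
sample_time = datetime(2020, 2, 16, 15, 6, 44, tzinfo=timezone.utc)

# Document the local time zone of the recording as metadata
# (UTC-6 is just an example value here).
recording_metadata = {
    "timestamp_utc": sample_time.isoformat(),
    "local_utc_offset_hours": -6,
}

# Local wall-clock time can always be reconstructed later:
local = sample_time.astimezone(timezone(timedelta(hours=-6)))
print(local.isoformat())  # 2020-02-16T09:06:44-06:00
```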
It is amazing what information can be determined about past environmental conditions (weather) and topography at a particular location and date/time using online APIs. DIAdem has Shuttle Radar Topography Mission (SRTM) data tools that fetch high-resolution digital topographic data covering nearly all of the Earth at a resolution of 30 m. Online API services such as https://timezonedb.com can tell you the country code, time zone name, and time zone offset for a pair of GPS coordinates. DIAdem includes tools that calculate the linear distance and elevation range between two GPS coordinates.
GPS signals include a date/time reference relative to Coordinated Universal Time (UTC). The resolution of this value could be as good as 1.5 seconds, but because GPS time is not adjusted for leap seconds, it has drifted ahead of UTC by a considerable, but known, amount since the initial synchronization on 6 January 1980. As of 2021, the GPS reported time is 18 seconds ahead of the actual UTC value. The correction to the GPS reported time is calculated from the number of leap seconds that have occurred since January 1980; each leap second adds one second to the difference between the GPS reported time and the actual UTC time. When properly adjusted, the GPS date/time signal provides a valuable reference identifying when data was recorded.
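A minimal sketch of that correction, assuming the 18-second GPS-to-UTC offset in effect as of 2021 (for real use, look up the current leap second count, since it changes whenever a new leap second is announced):

```python
from datetime import datetime, timedelta, timezone

# Leap seconds accumulated since the GPS epoch (6 Jan 1980);
# 18 as of 2021. GPS time runs ahead of UTC by this amount.
GPS_UTC_LEAP_SECONDS = 18

def gps_to_utc(gps_time: datetime) -> datetime:
    """Convert a GPS-reported time to actual UTC by removing leap seconds."""
    return gps_time - timedelta(seconds=GPS_UTC_LEAP_SECONDS)

gps = datetime(2021, 6, 1, 12, 0, 18, tzinfo=timezone.utc)
print(gps_to_utc(gps).isoformat())  # 2021-06-01T12:00:00+00:00
```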