Best database and table design for billions of rows of data [closed]


Closed 6 years ago .

I am writing an application that needs to store and analyze large amounts of electrical and temperature data. Basically I need to store large amounts of hourly electricity usage measurements for the past several years and for many years to come, for tens of thousands of locations, and then analyze the data in a not very complex manner. The information that I need to store (for now) is Location ID, Timestamp (Date and Time), Temperature and Electricity Usage. As for the amount of data that needs to be stored, this is an approximation, but something along these lines:
20 000+ locations, 720 records per month (hourly measurements, approximately 720 hours per month), 120 months (for 10 years back) and many years into the future. Simple calculations yield the following results:

20 000 locations x 720 records x 120 months (10 years back) = 1 728 000 000 records.

The data would need to be retrieved and analyzed roughly as follows (a rough SQL sketch of these access patterns is given after the list):
  1. Retrieve the data for a certain date AND time period: all records for a certain Location ID between the dates 01.01.2013 and 01.01.2017 and between 07:00 and 13:00.
  2. Simple mathematical operations for a certain date AND time range, e.g. MIN, MAX and AVG temperature and electricity usage for a certain Location ID for 5 years between 07:00 and 13:00.
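For concreteness, the two access patterns above might look roughly like this in SQL. This is only a sketch; the measurements table and its columns (location_id, ts, temp, usage) are hypothetical names, not a finished design.

-- 1. All records for one location between 01.01.2013 and 01.01.2017, 07:00-13:00 only.
SELECT *
FROM measurements
WHERE location_id = 12345
  AND ts >= '2013-01-01' AND ts < '2017-01-01'
  AND ts::time >= '07:00' AND ts::time < '13:00';

-- 2. MIN, MAX and AVG of temperature and usage over a similar date and time window.
SELECT min(temp), max(temp), avg(temp),
       min(usage), max(usage), avg(usage)
FROM measurements
WHERE location_id = 12345
  AND ts >= '2013-01-01' AND ts < '2018-01-01'   -- a 5 year span
  AND ts::time >= '07:00' AND ts::time < '13:00';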

The data will be written monthly, but will be read by hundreds of users (at least) constantly, so the read speed is of significantly more importance.

I have no experience with NoSQL databases, but from what I've gathered, they are the best solution to use here. I've read about the most popular NoSQL databases, but since they are quite different and also allow for very different table architectures, I have not been able to decide which database would be best to use.

My main choices were Cassandra and MongoDB, but since I have very limited knowledge and no real experience when it comes to large data and NoSQL, I am not very certain. I have also read that PostgreSQL handles such amounts of data well.

My questions are the following:

  1. Should I use a NoSQL database for such large amounts of data? If not, can I stick to MySQL?
  2. What database should I use?
  3. Should I keep the date and time in separate, indexed (if possible) columns to retrieve and process the data quickly for certain time and date periods, or can this be done by keeping the timestamp in a single column?
  4. Is a time series data modeling approach appropriate here, and if not could you give me pointers for a good table design?
asked Oct 17, 2017 at 14:13

It's 2017. While not small, this is not a particularly LARGE amount of data for proper hardware. And I hate to tell you, but so far what you have there sounds like relational data.

Commented Oct 17, 2017 at 14:16

I read that MySQL (which I am currently using) is not the best choice for 1 000 000 000+ records and that NoSQL is usually the solution, so I posted my questions here, since I lack experience with NoSQL databases and with working with billions of records of data. I also wrote that I may be wrong, thus asking for advice. Thanks.

Commented Oct 17, 2017 at 14:20

I've stored multi-TB tables with tens of billions of rows in MS SQL Server 2008-2014 by using a good key (epoch date), compression, partitioning, and ensuring my queries/indexes are partition-aligned. I had to move to NoSQL (Hadoop) when I started getting petabytes of data to analyze and index differently. NoSQL has other considerations, and in this case it doesn't seem to fit.

Commented Oct 17, 2017 at 16:42

@AliRazeghi Hadoop has nothing to do with SQL or NoSQL -- it's just a storage engine. There are plenty of SQL interfaces backed by Hadoop out there.

Commented Oct 17, 2017 at 16:54

What are your constraints re: money to spend on software/licenses?

Commented Oct 17, 2017 at 16:57

5 Answers

This is exactly what I do every day, except instead of using the hourly data, I use the 5-minute data. I download about 200 million records every day, so the amount you talk about here is not a problem. The 5-minute data is about 2 TB in size and I have weather data going back 50 years at an hourly level by location. So let me answer your questions based on my experience:

  1. Don't use NoSQL for this. The data is highly structured and fits a relational database perfectly.
  2. I personally use SQL Server 2016 and I have no problems applying computations across that volume of data. It was originally on a PostgreSQL instance when I started my job and it couldn't handle the volume of data as it was on a small AWS instance.
  3. I would highly recommend extracting the hour portion of the date and storing it separate from the date itself. Believe me, learn from my mistakes!
  4. I store the majority of the data list-wise (DATE, TIME, DATAPOINT_ID, VALUE), but that is not how people will want to interpret the data. Be prepared for some horrendous queries against the data and vast amounts of pivoting. Don't be afraid to create a de-normalized table for result sets that are just too large to compute on the fly (a rough sketch of this layout follows below).
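To illustrate point 4, a minimal sketch of such a list-wise table plus a de-normalized companion table could look like the following. The names (hourly_readings, reading_date, reading_hour, datapoint_id) are illustrative only, not the answerer's actual schema, and the hour sits in its own column as recommended in point 3.

-- Narrow, normalized time series table: one row per datapoint per hour.
-- In this sketch datapoint_id identifies the location plus the metric.
CREATE TABLE hourly_readings (
    reading_date  date          NOT NULL,
    reading_hour  smallint      NOT NULL,  -- stored separately from the date (see point 3)
    datapoint_id  int           NOT NULL,  -- e.g. 1 = electricity usage, 2 = temperature
    value         numeric(10,2) NOT NULL,
    PRIMARY KEY (datapoint_id, reading_date, reading_hour)
);

-- De-normalized table for result sets too large to compute on the fly,
-- refreshed by a batch job after each monthly load.
CREATE TABLE daily_summary (
    reading_date date NOT NULL,
    datapoint_id int  NOT NULL,
    min_value    numeric(10,2),
    max_value    numeric(10,2),
    avg_value    numeric(12,4),
    PRIMARY KEY (datapoint_id, reading_date)
);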

General tip: I store most of the data between two databases, the first is straight-up time series data and is normalized. My second database is very de-normalized and contains pre-aggregated data. As fast as my system is, I am not blind to the fact that users don't even want to wait 30 seconds for a report to load – even if I personally think 30 seconds to crunch 2 TB of data is extremely fast.

To elaborate on why I recommend storing the hour separate from the date, here are a few reasons why I do it that way:

  1. The way that the electrical data is presented is by Hour Ending – therefore, 01:00 is actually the average of the electrical power for the previous hour and 00:00 is Hour Ending 24. (This is important because you actually have to search for two dates to include the 24 hour value – the day you are looking for plus the first mark of the following day.) However, the weather data is actually presented in a forward manner (actual and forecast for the next hour). In my experience with this data, consumers wish to analyze the effect that the weather has on the power price/demand. If you were to use a straight-up date comparison, you would actually be comparing the average price for the previous hour versus the average temperature for the following hour, even though the time stamps are the same. Storing the hour separate from the date allows you to apply transformations to the time with less performance impact than you would see applying a calculation to a DATETIME column.
  2. Performance. I would say at least 90% of the reports that I generate are graphs, normally plotting the price against the hour, either for a single date or for a range of dates. Having to split the time out of the date can bog down the speed of the query used to generate the report, depending on the date range that you want to see. It is not uncommon for consumers to want to see a single date, Year-on-Year, for the past 30 years (in fact for weather this is required to generate the 30-year normals) – this can be slow (a sketch of such a query follows this list). Of course you can optimize your query and add indexes, and trust me I have some insane indexes that I would rather not have, but they make the system run fast.
  3. Productivity. I hate having to write the same piece of code more than once. I used to store the date and time in the same column, until I had to write the same query over and over again to extract the time portion. After a while I just got sick of having to do this and extracted it to its own column. The less code you have to write the less chance there is of an error in it. Also, having to write less code means that you can get your reports out faster, nobody wants to be waiting all day for reports.
  4. End users. Not all end users are power users (i.e. know how to write SQL). Having the data already stored in a format that they can bring into Excel (or other similar tool) with minimal effort will make you a hero in the office. If the users cannot access or manipulate the data easily, they will not use your system. Believe me, I designed the perfect system a couple of years ago and nobody used it because of this reason. Database design is not just about adhering to a predefined set of rules/guidelines, it is about making the system usable.
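To make the performance point concrete, here is a hedged sketch of the "single date, Year-on-Year" report from point 2, reusing the illustrative hourly_readings layout from above (not the answerer's real schema). With the hour in its own column there is nothing to extract from a DATETIME at query time:

-- Same calendar date across the last 30 years, one row per year and hour.
SELECT EXTRACT(YEAR FROM reading_date) AS yr,
       reading_hour,
       avg(value) AS avg_value
FROM hourly_readings
WHERE datapoint_id = 2                              -- e.g. temperature
  AND EXTRACT(MONTH FROM reading_date) = 7
  AND EXTRACT(DAY   FROM reading_date) = 1          -- every 1st of July
  AND reading_date >= CURRENT_DATE - INTERVAL '30 years'
GROUP BY EXTRACT(YEAR FROM reading_date), reading_hour
ORDER BY yr, reading_hour;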

As I said above, this is all based on my personal experience, and let me tell you, it has been a hard few years and a lot of redesigns to get to where I am now. Don't do what I did; learn from my mistakes and make sure you involve the end users of your system (or developers, report authors, etc.) when making decisions about your database.

answered Oct 17, 2017 at 14:28 by Mr.Brownstone

I disagree with a lot of this. None of this is a real concern with a modern database, as demonstrated with actual numbers here. If users of the data are too stupid to use SQL, then you need to create them an interface -- you don't munge the schema. Extracting the hour is a bad idea.

Commented Oct 19, 2017 at 15:17

@EvanCarrollQWERHJKL I'm sorry you feel that way and a few years ago I would have agreed with you. Your answer is a good answer but the queries contained within it don't come close to some of the ones I have to run on a daily basis and my answer is based upon the work I do everyday with the exact data set specified by the OP.

Commented Oct 19, 2017 at 15:31

What is your hardware like?

Commented Oct 15, 2018 at 21:51

@kennes Physical, 16 cores, 256 GB RAM, 100 GB OS drive, 500 GB local SSD with TempDB data on it, hybrid SAN with 8 TB SSD cache and 40 TB of spindle disks capable of 100,000 IOPS. The database implementation uses ColumnStore, compression, in-memory tables, partitioning and a tabular SSAS instance.

Commented Oct 16, 2018 at 19:51

That is incredible hardware depending on how many users you serve. Since this is a pseudo-optimization response, I think including your technology is useful. I was in complete shock to hear you can crunch 2TB in 30 seconds -- that is incredibly fast. My own personal judgement aside, I think it would be useful for future people looking to optimize time-series data!

Commented Oct 21, 2018 at 15:15

PostgreSQL and BRIN indexes

Test it for yourself. This isn't a problem on a 5-year-old laptop with an SSD.

EXPLAIN ANALYZE
CREATE TABLE electrothingy
AS
  SELECT
    x::int AS id,
    (x::int % 20000)::int AS locid,  -- fake location ids in the range of 1-20000
    now() AS tsin,                   -- static timestamp
    97.5::numeric(5,2) AS temp,      -- static temp
    x::int AS usage                  -- usage the same as id; not sure what we want here
  FROM generate_series(1,1728000000) -- for 1.7 billion rows
    AS gs(x);

                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series gs  (cost=0.00..15.00 rows=1000 width=4) (actual time=173119.796..750391.668 rows=1728000000 loops=1)
 Planning time: 0.099 ms
 Execution time: 1343954.446 ms
(3 rows)

So it took 22 minutes to create the table, largely because the table is a modest 97 GB. Next we create the indexes:

CREATE INDEX ON electrothingy USING brin (tsin);
CREATE INDEX ON electrothingy USING brin (id);
VACUUM ANALYZE electrothingy;

It took a good long while to create the indexes too. Though because they're BRIN, they're only 2-3 MB and they store easily in RAM. Reading 96 GB isn't instantaneous, but it's not a real problem for my laptop at your workload.

Now we query it.

explain analyze
SELECT max(temp)
FROM electrothingy
WHERE id BETWEEN 1000000 AND 1001000;

                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=5245.22..5245.23 rows=1 width=7) (actual time=42.317..42.317 rows=1 loops=1)
   ->  Bitmap Heap Scan on electrothingy  (cost=1282.17..5242.73 rows=993 width=7) (actual time=40.619..42.158 rows=1001 loops=1)
         Recheck Cond: ((id >= 1000000) AND (id <= 1001000))
         ->  Bitmap Index Scan on electrothingy_id_idx  (cost=0.00..1281.93 rows=993 width=0) (actual time=39.769..39.769 rows=1280 loops=1)
               Index Cond: ((id >= 1000000) AND (id <= 1001000))

Update with timestamps

Here we generate a table with different timestamps in order to satisfy the request to index and search on a timestamp column. Creation takes a bit longer because to_timestamp(int) is substantially slower than now() (which is cached for the transaction).

EXPLAIN ANALYZE
CREATE TABLE electrothingy
AS
  SELECT
    x::int AS id,
    (x::int % 20000)::int AS locid,
    -- here we use to_timestamp rather than now(); this
    -- calculates seconds since epoch using gs(x) as the offset
    to_timestamp(x::int) AS tsin,
    97.5::numeric(5,2) AS temp,
    x::int AS usage
  FROM generate_series(1,1728000000)
    AS gs(x);

                                                                QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series gs  (cost=0.00..17.50 rows=1000 width=4) (actual time=176163.107..5891430.759 rows=1728000000 loops=1)
 Planning time: 0.607 ms
 Execution time: 7147449.908 ms
(3 rows)

Now we can run a query on a timestamp value instead.

explain analyze
SELECT count(*), min(temp), max(temp)
FROM electrothingy
WHERE tsin BETWEEN '1974-01-01' AND '1974-01-02';

                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=296073.83..296073.84 rows=1 width=7) (actual time=83.243..83.243 rows=1 loops=1)
   ->  Bitmap Heap Scan on electrothingy  (cost=2460.86..295490.76 rows=77743 width=7) (actual time=41.466..59.442 rows=86401 loops=1)
         Recheck Cond: ((tsin >= '1974-01-01 00:00:00-06'::timestamp with time zone) AND (tsin <= '1974-01-02 00:00:00-06'::timestamp with time zone))
         ->  Bitmap Index Scan on electrothingy_tsin_idx  (cost=0.00..2441.43 rows=77743 width=0) (actual time=40.217..40.217 rows=7680 loops=1)
               Index Cond: ((tsin >= '1974-01-01 00:00:00-06'::timestamp with time zone) AND (tsin <= '1974-01-02 00:00:00-06'::timestamp with time zone))
 count |  min  |  max
-------+-------+-------
 86401 | 97.50 | 97.50
(1 row)

So in 83.321 ms we can aggregate 86,401 records in a table with 1.7 Billion rows. That should be reasonable.

Hour ending

Calculating the hour ending is pretty easy too: truncate the timestamps down and then simply add an hour.

SELECT date_trunc('hour', tsin) + '1 hour' AS tsin,
       count(*),
       min(temp),
       max(temp)
FROM electrothingy
WHERE tsin >= '1974-01-01'
  AND tsin < '1974-01-02'
GROUP BY date_trunc('hour', tsin)
ORDER BY 1;

          tsin          | count |  min  |  max
------------------------+-------+-------+-------
 1974-01-01 01:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 02:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 03:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 04:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 05:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 06:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 07:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 08:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 09:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 10:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 11:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 12:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 13:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 14:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 15:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 16:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 17:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 18:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 19:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 20:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 21:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 22:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 23:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-02 00:00:00-06 |  3600 | 97.50 | 97.50
(24 rows)

Time: 116.695 ms

It's important to note that it's not using an index for the aggregation, though it could. If that's your typical query, you probably want a BRIN on date_trunc('hour', tsin), but therein lies a small problem: date_trunc is not immutable, so you'd first have to wrap it to make it so.
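A sketch of that wrapping, under the assumption that the server's time zone setting never changes (date_trunc on timestamptz is only STABLE precisely because it depends on the time zone, so marking the wrapper IMMUTABLE is a deliberate shortcut); hour_trunc is a hypothetical helper name:

-- Hypothetical IMMUTABLE wrapper so the expression can be used in an index.
-- Only safe if the time zone the server runs with never changes.
CREATE FUNCTION hour_trunc(timestamptz) RETURNS timestamptz
    LANGUAGE sql IMMUTABLE
    AS $$ SELECT date_trunc('hour', $1) $$;

CREATE INDEX ON electrothingy USING brin (hour_trunc(tsin));

-- The report query then needs to GROUP BY hour_trunc(tsin) to match the indexed expression.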

Partitioning

Another important point about PostgreSQL is that PG 10 brings partitioning DDL. So you can, for instance, easily create partitions for every year, breaking your modest table down into many small ones. In doing so, you should be able to use and maintain btree indexes rather than BRIN, which would be even faster.

CREATE TABLE electrothingy_y2016 PARTITION OF electrothingy
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');
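For that DDL to work, the parent table has to have been declared as partitioned in the first place; a minimal sketch, with the column list borrowed from the generated table above and PostgreSQL 10+ assumed:

-- Parent table declared with a partition key; yearly partitions then attach to it.
CREATE TABLE electrothingy (
    id    int,
    locid int,
    tsin  timestamptz,
    temp  numeric(5,2),
    usage int
) PARTITION BY RANGE (tsin);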
answered Oct 17, 2017 at 16:37 by Evan Carroll

It amazes me that nobody here had mentioned benchmarking - that is, until @EvanCarroll came along with his excellent contribution!

If I were you, I would spend some time (and yes, I know it's a precious commodity!) setting up systems and running what you think will be, say, your 10 most common queries (get end-user input here!).

My own thoughts:

NoSQL solutions can work very well for particular use cases but are frequently inflexible for ad-hoc queries. For an amusing take on NoSQL by Brian Aker - former chief architect of MySQL, see here!

I agree with @Mr.Brownstone that your data is eminently suited to a relational solution (and this opinion has been confirmed by Evan Carroll)!

If I were to commit to any expenditure, it would be on my disk technology! I would spend any money I had at my disposal on NAS or SAN or maybe some SSD disks to hold my rarely written aggregate data!

First I would look at what I have available now. Run some tests and show the results to the decision makers. You already have a proxy in the form of EC's work! But, a quick test or two whipped together on your own hardware would be more convincing!

Then think about spending money! If you are going to spend money, look at hardware first rather than software. AFAIK, you can hire out disk technology for a trial period, or better yet, spin up a couple of proofs-of-concept on the cloud.

My own personal first port of call for a project like this would be PostgreSQL. That is not to say that I would rule out a proprietary solution, but the laws of physics and disks are the same for everyone! "Yae cannae beet the laws o' physics Jim" :-)

answered Oct 18, 2017 at 7:55

If you have not already, take a look at a time series DBMS, since it is optimized for storing and querying data where the primary focus is the date/time type. Typically time series databases are used for recording data in the minute/second/sub-second ranges, so I'm not sure if it is still appropriate for hourly increments. That said, this type of DBMS seems to be worth looking into. Currently InfluxDB seems to be the most established and widely used time series database.

answered Oct 17, 2017 at 15:31 by FloorDivision

What is an example of a time series DBMS?

Commented Oct 18, 2017 at 0:25

Have a look here.

Commented Oct 18, 2017 at 8:11

@Vérace link no longer works, any one else here might want to try this: ondataengineering.net/tech-categories/time-series-databases

Commented Jul 30, 2020 at 14:11

Clearly this is not a NoSQL problem. While an RDBMS solution would work, I think an OLAP approach will fit much better, and given the very limited data ranges involved I would strongly suggest investigating the use of a column-based DB rather than a row-based one. Think about it this way: you may have 1.7 billion pieces of data, but you still only need 5 bits to index every possible value of hour or day of month.

I have experience with a similar problem domain where Sybase IQ (now SAP IQ) is used to store up to 300 million counters an hour of telecoms equipment performance management data, but I doubt if you have the budget for that sort of solution. In the open source arena, MariaDB ColumnStore is a very promising candidate, but I would recommend also investigating MonetDB.

Since query performance is a major driver for you, give consideration to how queries will be phrased. This is where OLAP and RDBMS show their greatest differences: with OLAP you design the schema for query performance, not to reduce repetition, reduce storage or even to enforce consistency. So, in addition to the original timestamp (you did remember to capture its timezone, I hope?), have a separate field for the UTC timestamp, others for the date and time, and yet more for the year, month, day, hour, minute and UTC offset. If you have additional information about locations, feel free to keep that in a separate location table that can be looked up on demand, and feel free to keep the key to that table in your main record, but keep the full location name in your main table as well. After all, your 20 000 locations still only take about 15 bits to index, and every reference you do not have to follow to get the data to be reported is time saved on your query.

As a final suggestion, use separate tables for popular aggregated data and use batch jobs to populate them. That way you don't have to repeat the exercise for each and every report that uses an aggregated value, and it makes queries that compare current to historical, or historical to historical, much easier and much, much faster.
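To tie these suggestions together, here is a hedged sketch in generic SQL of a de-normalized fact table with the redundant date/time parts described above, plus a batch-populated aggregate table. Every name here is illustrative rather than taken from any particular product or from the answer itself.

-- Wide, de-normalized fact table: timestamp parts stored redundantly for fast filtering.
CREATE TABLE readings (
    location_id   int,
    location_name text,            -- kept inline to avoid a join at report time
    ts_local      timestamptz,     -- original timestamp with its time zone
    ts_utc        timestamp,       -- same instant in UTC
    read_date     date,
    read_time     time,
    read_year     smallint,
    read_month    smallint,
    read_day      smallint,
    read_hour     smallint,
    read_minute   smallint,
    utc_offset    smallint,        -- minutes east of UTC
    temp          numeric(5,2),
    usage         numeric(10,2)
);

-- Popular aggregate, refreshed by the monthly batch load rather than per report.
CREATE TABLE readings_hourly_summary (
    location_id int,
    read_year   smallint,
    read_month  smallint,
    read_hour   smallint,
    min_temp    numeric(5,2),
    max_temp    numeric(5,2),
    avg_temp    numeric(7,4),
    avg_usage   numeric(12,4)
);

INSERT INTO readings_hourly_summary
SELECT location_id, read_year, read_month, read_hour,
       min(temp), max(temp), avg(temp), avg(usage)
FROM readings
GROUP BY location_id, read_year, read_month, read_hour;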