I am writing an application that needs to store and analyze large amounts of electrical and temperature data. Basically I need to store large amounts of hourly electricity usage measurements for the past several years and for many years to come, for tens of thousands of locations, and then analyze the data in a not very complex manner. The information that I need to store (for now) is Location ID, Timestamp (Date and Time), Temperature and Electricity Usage. As for the amount of data that needs to be stored, this is an approximation, but something along these lines:
20 000+ locations, 720 records per month (hourly measurements, approximately 720 hours per month), 120 months (for 10 years back) and many years into the future. Simple calculations yield the following results:
20 000 locations x 720 records x 120 months (10 years back) = 1 728 000 000 records.
The data will be written monthly, but will be read by hundreds of users (at least) constantly, so read speed is significantly more important.
I have no experience with NoSQL databases, but from what I've gathered, they are the best solution to use here. I've read about the most popular NoSQL databases, but since they are quite different and also allow for very different table architectures, I have not been able to decide which database is best to use.
My main choices were Cassandra and MongoDB, but since I have very limited knowledge and no real experience when it comes to large data and NoSQL, I am not very certain. I also read that PostgreSQL handles such amounts of data well.
My questions are the following:
While not small, this is not particularly a LARGE amount of data for proper hardware. And I hate to tell you, but so far what you have there sounds like relational data. – Commented Oct 17, 2017 at 14:16

Since I read that MySQL (which I am currently using) is not the best choice for 1 000 000 000+ records and NoSQL is usually the solution, I posted my questions here, since I lack experience with NoSQL databases or working with billions of records of data. I also wrote that I may be wrong, thus asking for advice. Thanks. – Commented Oct 17, 2017 at 14:20

I've stored multi-TB tables with tens of billions of rows in MS SQL Server 2008-2014 by using a good key (epoch date), compression, partitioning, and ensuring my queries/indexes are partition-aligned. I had to move to NoSQL (Hadoop) when I started getting petabytes of data to analyze and index differently. NoSQL should be driven by other considerations, and in this case it doesn't seem to fit. – Commented Oct 17, 2017 at 16:42

@AliRazeghi Hadoop has nothing to do with SQL or NoSQL -- it's just a storage engine. There are plenty of SQL interfaces backed by Hadoop out there. – Commented Oct 17, 2017 at 16:54

What are your constraints re: money to spend on software/licenses? – Commented Oct 17, 2017 at 16:57

This is exactly what I do every day, except instead of using the hourly data, I use the 5-minute data. I download about 200 million records every day, so the amount you talk about here is not a problem. The 5-minute data is about 2 TB in size and I have weather data going back 50 years at an hourly level by location. So let me answer your questions based on my experience:
General tip: I store most of the data between two databases, the first is straight-up time series data and is normalized. My second database is very de-normalized and contains pre-aggregated data. As fast as my system is, I am not blind to the fact that users don't even want to wait 30 seconds for a report to load – even if I personally think 30 seconds to crunch 2 TB of data is extremely fast.
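The answer doesn't show a schema, but a rough sketch of what such a two-layer layout might look like (table and column names here are purely illustrative, not from the original answer):

-- layer 1: normalized time-series data, one row per location per hour
CREATE TABLE meter_reading (
    location_id int      NOT NULL,
    read_date   date     NOT NULL,
    read_hour   smallint NOT NULL,  -- hour kept separate from the date (see below)
    temp        numeric(5,2),
    usage       numeric(10,2),
    PRIMARY KEY (location_id, read_date, read_hour)
);

-- layer 2: de-normalized, pre-aggregated data that reports are actually run against
CREATE TABLE meter_reading_monthly (
    location_id int      NOT NULL,
    year        smallint NOT NULL,
    month       smallint NOT NULL,
    total_usage numeric(14,2),
    avg_temp    numeric(5,2),
    PRIMARY KEY (location_id, year, month)
);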
To elaborate on why I recommend storing the hour separately from the date, there are a few reasons why I do it that way.
As I said above, this is all based on my personal experience, and let me tell you, it has been a hard few years and a lot of redesigns to get to where I am now. Don't do what I did; learn from my mistakes, and make sure you involve the end users of your system (developers, report authors, etc.) when making decisions about your database.
answered Oct 17, 2017 at 14:28 by Mr.Brownstone

I disagree with a lot of this. None of this is a real concern with a modern database, as demonstrated with actual numbers here. If the users of the data are too stupid to use SQL, then you need to create them an interface -- you don't munge the schema. Extracting the hour is a bad idea. – Commented Oct 19, 2017 at 15:17

@EvanCarrollQWERHJKL I'm sorry you feel that way, and a few years ago I would have agreed with you. Your answer is a good answer, but the queries contained within it don't come close to some of the ones I have to run on a daily basis, and my answer is based upon the work I do every day with the exact data set specified by the OP. – Commented Oct 19, 2017 at 15:31

What is your hardware like? – Commented Oct 15, 2018 at 21:51

@kennes Physical: 16 cores, 256 GB RAM, 100 GB OS drive, 500 GB local SSD with TempDB data on it, and a hybrid SAN with an 8 TB SSD cache and 40 TB of spindle disks capable of 100,000 IOPS. The database implementation uses ColumnStore, compression, in-memory tables, partitioning and a tabular SSAS instance. – Commented Oct 16, 2018 at 19:51

That is incredible hardware, depending on how many users you serve. Since this is a pseudo-optimization response, I think including your technology is useful. I was in complete shock to hear you can crunch 2 TB in 30 seconds -- that is incredibly fast. My own personal judgement aside, I think it would be useful for future people looking to optimize time-series data! – Commented Oct 21, 2018 at 15:15

Test it for yourself. This isn't a problem on a 5-year-old laptop with an SSD.
EXPLAIN ANALYZE
CREATE TABLE electrothingy
AS
  SELECT
    x::int AS id,
    (x::int % 20000)::int AS locid,  -- fake location ids in the range of 1-20000
    now() AS tsin,                   -- static timestamp
    97.5::numeric(5,2) AS temp,      -- static temp
    x::int AS usage                  -- usage the same as id; not sure what we want here
  FROM generate_series(1,1728000000) -- for 1.7 billion rows
    AS gs(x);

                                                                QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series gs  (cost=0.00..15.00 rows=1000 width=4) (actual time=173119.796..750391.668 rows=1728000000 loops=1)
 Planning time: 0.099 ms
 Execution time: 1343954.446 ms
(3 rows)
So it took 22 min to create the table, largely because the table is a modest 97 GB. Next we create the indexes,
CREATE INDEX ON electrothingy USING brin (tsin);
CREATE INDEX ON electrothingy USING brin (id);
VACUUM ANALYZE electrothingy;
It took a good long while to create the indexes too. Though because they're BRIN, they're only 2-3 MB and they fit easily in RAM. Reading 96 GB isn't instantaneous, but it's not a real problem for my laptop at your workload.
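If you want to verify those sizes yourself, the built-in size functions make it a one-line check (a sketch; the index names are the defaults PostgreSQL generates for the statements above):

SELECT pg_size_pretty(pg_relation_size('electrothingy'))          AS table_size,
       pg_size_pretty(pg_relation_size('electrothingy_tsin_idx')) AS tsin_index_size,
       pg_size_pretty(pg_relation_size('electrothingy_id_idx'))   AS id_index_size;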
Now we query it.
explain analyze
SELECT max(temp)
FROM electrothingy
WHERE id BETWEEN 1000000 AND 1001000;

                                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=5245.22..5245.23 rows=1 width=7) (actual time=42.317..42.317 rows=1 loops=1)
   ->  Bitmap Heap Scan on electrothingy  (cost=1282.17..5242.73 rows=993 width=7) (actual time=40.619..42.158 rows=1001 loops=1)
         Recheck Cond: ((id >= 1000000) AND (id <= 1001000))
         ->  Bitmap Index Scan on electrothingy_id_idx  (cost=0.00..1281.93 rows=993 width=0) (actual time=39.769..39.769 rows=1280 loops=1)
               Index Cond: ((id >= 1000000) AND (id <= 1001000))
Here we generate a table with different timestamps in order to satisfy the request to index and search on a timestamp column. Creation takes a bit longer because to_timestamp(int) is substantially slower than now() (which is cached for the transaction).
EXPLAIN ANALYZE
CREATE TABLE electrothingy
AS
  SELECT
    x::int AS id,
    (x::int % 20000)::int AS locid,
    -- here we use to_timestamp rather than now();
    -- this treats gs(x) as an offset in seconds since the epoch
    to_timestamp(x::int) AS tsin,
    97.5::numeric(5,2) AS temp,
    x::int AS usage
  FROM generate_series(1,1728000000)
    AS gs(x);

                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Function Scan on generate_series gs  (cost=0.00..17.50 rows=1000 width=4) (actual time=176163.107..5891430.759 rows=1728000000 loops=1)
 Planning time: 0.607 ms
 Execution time: 7147449.908 ms
(3 rows)
Now we can run a query on a timestamp value instead.
explain analyze
SELECT count(*), min(temp), max(temp)
FROM electrothingy
WHERE tsin BETWEEN '1974-01-01' AND '1974-01-02';

                                                                    QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=296073.83..296073.84 rows=1 width=7) (actual time=83.243..83.243 rows=1 loops=1)
   ->  Bitmap Heap Scan on electrothingy  (cost=2460.86..295490.76 rows=77743 width=7) (actual time=41.466..59.442 rows=86401 loops=1)
         Recheck Cond: ((tsin >= '1974-01-01 00:00:00-06'::timestamp with time zone) AND (tsin <= '1974-01-02 00:00:00-06'::timestamp with time zone))
         ->  Bitmap Index Scan on electrothingy_tsin_idx  (cost=0.00..2441.43 rows=77743 width=0) (actual time=40.217..40.217 rows=7680 loops=1)
               Index Cond: ((tsin >= '1974-01-01 00:00:00-06'::timestamp with time zone) AND (tsin <= '1974-01-02 00:00:00-06'::timestamp with time zone))
 count |  min  |  max
-------+-------+-------
 86401 | 97.50 | 97.50
(1 row)
So in 83.321 ms we can aggregate 86,401 records in a table with 1.7 Billion rows. That should be reasonable.
Calculating the hour ending is pretty easy too: truncate the timestamps down and then simply add an hour.
SELECT date_trunc('hour', tsin) + '1 hour' AS tsin,
       count(*),
       min(temp),
       max(temp)
FROM electrothingy
WHERE tsin >= '1974-01-01'
  AND tsin <  '1974-01-02'
GROUP BY date_trunc('hour', tsin)
ORDER BY 1;

          tsin          | count |  min  |  max
------------------------+-------+-------+-------
 1974-01-01 01:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 02:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 03:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 04:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 05:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 06:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 07:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 08:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 09:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 10:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 11:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 12:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 13:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 14:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 15:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 16:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 17:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 18:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 19:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 20:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 21:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 22:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-01 23:00:00-06 |  3600 | 97.50 | 97.50
 1974-01-02 00:00:00-06 |  3600 | 97.50 | 97.50
(24 rows)

Time: 116.695 ms
It's important to note that it's not using an index for the aggregation, though it could. If that's your typical query, you probably want a BRIN index on date_trunc('hour', tsin); therein lies a small problem in that date_trunc is not immutable, so you'd first have to wrap it to make it so.
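A sketch of that wrapper (the function name is made up for this example). Note that date_trunc on a timestamptz is only stable because its result depends on the session timezone, so marking the wrapper IMMUTABLE is only safe if the timezone setting stays fixed:

-- immutable wrapper around date_trunc('hour', ...) so it can be used in an index expression
CREATE FUNCTION hour_trunc_imm(timestamptz)
RETURNS timestamptz
LANGUAGE sql IMMUTABLE AS
$$ SELECT date_trunc('hour', $1) $$;

-- BRIN on the truncated hour; queries must group/filter on hour_trunc_imm(tsin) to benefit
CREATE INDEX ON electrothingy USING brin (hour_trunc_imm(tsin));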
Another important point about PostgreSQL is that PG 10 brings declarative partitioning DDL. So you can, for instance, easily create partitions for every year, breaking your modest database down into minor ones that are tiny. In doing so, you should be able to use and maintain btree indexes rather than BRIN, which would be even faster.
CREATE TABLE electrothingy_y2016 PARTITION OF electrothingy FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');
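For that statement to work, electrothingy itself has to be declared as a partitioned table first; roughly, with the column list taken from the generate_series example above and range partitioning on tsin assumed:

CREATE TABLE electrothingy (
    id    int,
    locid int,
    tsin  timestamptz,
    temp  numeric(5,2),
    usage int
) PARTITION BY RANGE (tsin);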
answered Oct 17, 2017 at 16:37 by Evan Carroll
It amazes me that nobody here has mentioned benchmarking - that is, until @EvanCarroll came along with his excellent contribution!
If I were you, I would spend some time (and yes, I know it's a precious commodity!) setting up systems and running what you think will be, say, your 10 most common queries (get end-user input here!).
My own thoughts:
NoSQL solutions can work very well for particular use cases but are frequently inflexible for ad-hoc queries. For an amusing take on NoSQL by Brian Aker (former chief architect of MySQL), see here!
I agree with @Mr.Brownstone that your data is eminently suited to a relational solution (and this opinion has been confirmed by Evan Carroll)!
If I were to commit to any expenditure, it would be to my disk technology! I would be spending any money I had at my disposal on NAS or SAN or maybe some SSD disks to hold my rarely written aggregate data!
First I would look at what I have available now. Run some tests and show the results to the decision makers. You already have a proxy in the form of EC's work! But, a quick test or two whipped together on your own hardware would be more convincing!
Then think about spending money! If you are going to spend money, look at hardware first rather than software. AFAIK, you can hire out disk technology for a trial period, or better yet, spin up a couple of proofs-of-concept on the cloud.
My own personal first port of call for a project like this would be PostgreSQL. That is not to say that I would rule out a proprietary solution, but the laws of physics and disks are the same for everyone! "Yae cannae beet the laws o' physics Jim" :-)
answered Oct 18, 2017 at 7:55

If you have not already, take a look at a time series DBMS, since it is optimized for storing and querying data where the primary focus is the date/time type. Typically time series databases are used for recording data in the minute/second/sub-second ranges, so I'm not sure if it is still appropriate for hourly increments. That said, this type of DBMS seems to be worth looking into. Currently InfluxDB seems to be the most established and widely used time series database.
answered Oct 17, 2017 at 15:31 by FloorDivision

What is an example of a time series DBMS? – Commented Oct 18, 2017 at 0:25

Have a look here. – Commented Oct 18, 2017 at 8:11

@Vérace link no longer works; anyone else here might want to try this: ondataengineering.net/tech-categories/time-series-databases – Commented Jul 30, 2020 at 14:11

Clearly this is not a NoSQL problem, but I would suggest that while an RDBMS solution would work, an OLAP approach will fit much better, and given the very limited data ranges involved, I would strongly suggest investigating the use of a column-based DB rather than a row-based one. Think about it this way: you may have 1.7 billion pieces of data, but you still only need 5 bits to index every possible value of hour or day of month.
I have experience with a similar problem domain where Sybase IQ (now SAP IQ) is used to store up to 300 million counters an hour of telecoms equipment performance management data, but I doubt if you have the budget for that sort of solution. In the open source arena, MariaDB ColumnStore is a very promising candidate, but I would recommend also investigating MonetDB.
Since query performance is a major driver for you, give consideration to how queries will be phrased. This is where OLAP and RDBMS show their greatest differences: with OLAP you de-normalize for query performance, not to reduce repetition, reduce storage, or even to enforce consistency. So in addition to the original timestamp (you did remember to capture its timezone, I hope?), have a separate field for the UTC timestamp, others for the date and time, and yet more for the year, month, day, hour, minute and UTC offset. If you have additional information about locations, feel free to keep that in a separate location table that can be looked up on demand, and feel free to keep the key to that table in your main record, but keep the full location name in your main table as well. After all, all possible locations still only take about 15 bits to index, and every reference you do not have to follow to get the data to be reported is time saved on your query.
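A sketch of that kind of wide, de-normalized record, written in PostgreSQL-style SQL to stay consistent with the examples above (the table and column names are illustrative; minute is included even though it is constant for hourly data):

CREATE TABLE usage_fact (
    location_id   int          NOT NULL,
    location_name varchar(100) NOT NULL,  -- repeated on purpose so reports need no join
    ts_original   timestamptz  NOT NULL,  -- original timestamp, timezone captured
    ts_utc        timestamp    NOT NULL,
    utc_offset    smallint     NOT NULL,  -- minutes east of UTC
    reading_date  date         NOT NULL,
    reading_time  time         NOT NULL,
    year          smallint     NOT NULL,
    month         smallint     NOT NULL,
    day           smallint     NOT NULL,
    hour          smallint     NOT NULL,
    minute        smallint     NOT NULL,
    temp          numeric(5,2),
    usage         numeric(10,2)
);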
As a final suggestion, use separate tables for popular aggregated data and use batch jobs to populate them. That way you don't have to repeat the exercise for each and every report that uses an aggregated value, and it makes queries that compare current to historical, or historical to historical, much easier and much, much faster.
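Continuing with the hypothetical usage_fact table above, a sketch of that pattern: a small aggregate table plus a batch job that re-derives one month at a time after the monthly load.

CREATE TABLE usage_monthly_agg (
    location_id int      NOT NULL,
    year        smallint NOT NULL,
    month       smallint NOT NULL,
    total_usage numeric(14,2),
    min_temp    numeric(5,2),
    max_temp    numeric(5,2),
    PRIMARY KEY (location_id, year, month)
);

-- batch job, run once per monthly load; the month values are placeholders
DELETE FROM usage_monthly_agg WHERE year = 2017 AND month = 10;

INSERT INTO usage_monthly_agg (location_id, year, month, total_usage, min_temp, max_temp)
SELECT location_id, year, month, sum(usage), min(temp), max(temp)
FROM usage_fact
WHERE year = 2017 AND month = 10
GROUP BY location_id, year, month;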