Hello,
Thanks for a great project. I have a question about the internal storage implementation. As I understand it, you haven't removed structures like the B+ tree; you've moved the pages into something like a KV store. So how do you deal with the problem that a key lookup has to read O(H) pages sequentially, where H is the height of the B+ tree? You get O(H) round trips (RTTs) because you don't know which page to ask for next until you have read the previous one. Of course, for a sequential scan you can pay O(H) once and then just follow the links to the next page in the leaf layer. But what about the latency of a simple lookup? Does it grow as described above, or do you have some optimizations? (As I understand it, this may be a real problem for big databases: imagine a tree of height 10, where 10 RTTs could push the latency of simple queries past 100-200 ms, which does not look great to me.)
Best wishes, Alexey
Hey!
Right, you described the situation correctly.
- With sequential scans we are fetching pages ahead of time (see Epic: performance parity with vanilla on medium-sized databases (100-300GB) · Issue #2221 · neondatabase/neon · GitHub).
- We haven't done anything special for indexes yet, but we could do a similar look-ahead for index range scans (at least for the page pointers within one index page).
- As for point queries, we can't do much. The tree root has a high chance of being cached, so that should help to some extent. Also, query latency within one AZ is usually on the order of 0.3-1.5 ms, so your example of 10 RTTs is more like 3-15 ms of added latency. Another thing is that, at least for a bigint primary key, the b-tree has a branching factor of ~350, so the typical depth is 2-4.
So, at least on paper, we can handle all these cases without adding significant latency (rough numbers sketched below). In practice, we have only just started working on the first item in this list.
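For a feel of the numbers, here is the back-of-envelope arithmetic (illustrative values only, not measurements from Neon):

```python
# Back-of-envelope: depth of a b-tree with a given fanout, and the latency
# added if every level below the cached ones costs one round trip to the
# pageserver. Numbers are illustrative, not measured.
import math

def btree_depth(total_keys: int, fanout: int) -> int:
    """Number of levels needed to index `total_keys` with the given fanout."""
    return max(1, math.ceil(math.log(total_keys, fanout)))

def added_latency_ms(depth: int, rtt_ms: float, cached_levels: int = 1) -> float:
    """Extra latency for a point lookup when only the top levels are cached."""
    return max(0, depth - cached_levels) * rtt_ms

# bigint primary key: fanout ~350, one billion rows, in-AZ RTT up to ~1.5 ms
depth = btree_depth(1_000_000_000, 350)      # -> 4
print(added_latency_ms(depth, 1.5))          # -> 4.5 ms in the worst case
```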
Hello,
Thank you for your response. With a sequential scan there will not be many problems. Also, when all our instances are in one AZ, it is not a problem either. But if you want to add more AZs (even within one region), typical latency would be something like 5-10 ms. Now imagine you want to build a complex index, so your key is much bigger than a single bigint and the branching factor is more like ~35-50. That could push the height to somewhere near 10, and adding 50-100 ms of latency to a simple lookup may really be a big problem.
Do you have plans to make Neon work fast across multiple AZs?
Which solutions have you looked at? (I really don't know a lot of solutions, so I wanted to know if you have discovered something new. One option is to split the tree into subtrees of height 2-3 (to keep them not too big, something like 10-20 GB), so the tree becomes a tree of subtrees. Each subtree could be a replication unit hosted on the same replica, so with one RTT you can descend 2-3 levels and reduce the number of round trips. Will you implement something like this in the future?)
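To illustrate the idea (just a sketch of the proposal, not anything that exists in Neon today):

```python
# Sketch of the "tree of subtrees" idea: if a few consecutive b-tree levels
# are colocated on one replica, a single round trip descends the whole group
# instead of one level. Hypothetical, for illustration only.
import math

def round_trips(tree_height: int, levels_per_subtree: int) -> int:
    """Round trips for a point lookup when levels are grouped into colocated subtrees."""
    return math.ceil(tree_height / levels_per_subtree)

print(round_trips(10, 1))   # 10 RTTs: every level is a separate network hop
print(round_trips(10, 3))   # 4 RTTs: each hop descends a colocated subtree of height 3
```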
Best wishes, Alexey Kuldoshin
> Also, when all our instances are in one AZ, it is not a problem either.
That is the plan so far. We need several AZs for high availability, and we deploy safekeepers across AZs to be sure that fresh data is safe. But there is no specific reason not to collocate the pageserver and compute in the same AZ: besides lower latency, it also reduces the cost of cross-AZ traffic. That way all read requests stay within one AZ and write requests are replicated across AZs.
To further expand on this:
We think we can eventually get IO performance comparable to AWS offerings: RDS stores the data on networked (EBS) storage, and Aurora requests the stored data across the network, too. As all three of these solutions utilize the same underlying network, it should be possible for us to get a network latency figure that is comparable to the managed offerings by AWS. We are probably never going to be competitive in random IO performance with a server that stores all data on direct-attached SSDs (such as self-managed instances on EC2), but that is not the immediate target audience.
The only significant difference between Neon’s storage and AWS RDS or Aurora that limits our theoretical performance is that we do not necessarily have the full block immediately available, so we take more time processing the block at our storage servers, which adds some latency. However, internal experiments show that this latency is not too significant, and latency-hiding techniques (like prefetching and caching) have proven quite effective in improving the overall performance of some workloads.
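As a toy illustration of why prefetching hides latency on a scan (simulated numbers, nothing to do with Neon's actual implementation):

```python
# Toy model: issuing the next page requests while the current ones are still
# in flight overlaps the round trips instead of paying them one by one.
import asyncio
import time

RTT = 0.005          # assumed 5 ms round trip to the pageserver
PAGES = 32           # pages touched by the scan
PREFETCH_DEPTH = 8   # hypothetical number of requests kept in flight

async def fetch_page(page_no: int) -> int:
    await asyncio.sleep(RTT)               # stand-in for a network read
    return page_no

async def scan_sequential() -> float:
    start = time.monotonic()
    for p in range(PAGES):
        await fetch_page(p)                # one full RTT per page
    return time.monotonic() - start

async def scan_with_prefetch() -> float:
    start = time.monotonic()
    in_flight: list[asyncio.Task] = []
    for p in range(PAGES):
        in_flight.append(asyncio.create_task(fetch_page(p)))
        if len(in_flight) >= PREFETCH_DEPTH:
            await in_flight.pop(0)         # consume the oldest request
    for t in in_flight:
        await t
    return time.monotonic() - start

async def main() -> None:
    print(f"sequential: {await scan_sequential() * 1000:.0f} ms")    # ~160 ms
    print(f"prefetched: {await scan_with_prefetch() * 1000:.0f} ms") # ~20-25 ms

asyncio.run(main())
```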
Speaking of Aurora: you cannot stay in one AZ for read requests, @Stas_Kelvich, because to get a read quorum you have to ask at least 2 AZs. You could instead require the write quorum to include ALL servers, but then you get the problem that if one AZ is unavailable you cannot process write requests. How do you deal with this problem?
I was talking about Neon. Some details are here: Architecture decisions in Neon - Neon. A write requires acks from a majority; a read can be done from one location.
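Roughly, that rule looks like this (a simplified sketch with made-up names, not Neon's actual code):

```python
# Simplified model of the read/write paths described above: WAL writes are
# acknowledged by a majority of safekeepers spread across AZs, while reads go
# to a single pageserver, preferably in the same AZ as the compute node.
from dataclasses import dataclass

@dataclass
class Safekeeper:
    az: str
    healthy: bool = True

def write_committed(safekeepers: list[Safekeeper]) -> bool:
    """A WAL record is durable once a majority of safekeepers ack it."""
    acks = sum(1 for s in safekeepers if s.healthy)
    return acks >= len(safekeepers) // 2 + 1

def pick_pageserver(pageservers: dict[str, str], compute_az: str) -> str:
    """Reads are served by one pageserver, preferably in the compute's AZ."""
    return pageservers.get(compute_az, next(iter(pageservers.values())))

safekeepers = [Safekeeper("az-a"), Safekeeper("az-b"), Safekeeper("az-c")]
safekeepers[2].healthy = False                  # one AZ is down
print(write_committed(safekeepers))             # True: 2 of 3 acks is still a majority
print(pick_pageserver({"az-a": "ps-1", "az-b": "ps-2"}, "az-a"))  # ps-1, same AZ
```

So losing one AZ still leaves a majority of safekeepers to accept writes, and reads keep hitting the pageserver colocated with compute.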