I was looking at the docs and I couldn’t find how to configure read replicas (i.e., hot_standby, as Postgres calls it: replicas that follow the primary and apply its WAL). Is that not available currently?
Read replicas are currently not supported on the hosted version of Neon.
We plan to support read replicas inside our platform sometime after we release branching, but we can’t put a date on when it will be available. If you ask me, we’ll probably support read replicas inside our platform starting sometime in the first half of next year. Please note that I’m not responsible for planning feature development, so this might turn out differently from my expectations.
External read replicas probably won’t be available for a long time, because our current changes to PostgreSQL include WAL format changes that are incompatible with standard PostgreSQL’s WAL format. We are working on removing that incompatibility, but that will only take effect in or after the release of PostgreSQL 16, assuming we can get the changes into upstream PostgreSQL.
I was able to set up a hot standby. If my procedure is correct, then I would like to report an issue I encountered while setting it up: despite following the procedure outlined below, I ran into visibility problems where the hot standby failed to return the same number of tuples as the main/primary node, regardless of how long I waited.
Below are the steps I followed to set up the hot standby and reproduce the visibility issues. The setup was performed on Ubuntu 22.04.2 LTS:
- Clone Neon from GitHub
git clone --recursive https://github.com/neondatabase/neon.git
cd neon
- Switch to the release branch and then build
git checkout remotes/origin/releases/2023-05-02 -b releases-2023-05-02
make -j`nproc` -s
- Start the main compute node
./target/debug/neon_local init
./target/debug/neon_local start
./target/debug/neon_local tenant create --set-default
./target/debug/neon_local endpoint start main
./target/debug/neon_local endpoint list
Note: record the tenant ID and timeline ID for the hot standby spec. Example: “Created an initial timeline ‘546eb99a5ca79532dfa839b83ce227ee’ at Lsn 0/169AF68 for tenant: 2c49810ed75ee7828823fc5f0d5b3adf”
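If you lose track of these IDs later, they should also be listable from the CLI. A hedged aside: I am assuming the tenant list and timeline list subcommands exist in this release; the names may differ.
# Assumption: these neon_local subcommands exist in this release
./target/debug/neon_local tenant list
./target/debug/neon_local timeline list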
- Prepare the spec.json for a hot standby compute node
cp -pr ./docker-compose/compute_wrapper/var/db/postgres/specs/spec.json spec-hotsb.json
vim spec-hotsb.json
diff ./docker-compose/compute_wrapper/var/db/postgres/specs/spec.json spec-hotsb.json
43c43
< "value": "55433",
---
> "value": "55434",
103c103
< "value": "safekeeper1:5454,safekeeper2:5454,safekeeper3:5454",
---
> "value": "127.0.0.1:5454",
108c108
< "value": "TIMELINE_ID",
---
> "value": "546eb99a5ca79532dfa839b83ce227ee",
113c113
< "value": "TENANT_ID",
---
> "value": "2c49810ed75ee7828823fc5f0d5b3adf",
118c118,128
< "value": "host=pageserver port=6400",
---
> "value": "host=127.0.0.1 port=64000",
> "vartype": "string"
> },
> {
> "name": "hot_standby",
> "value": "on",
> "vartype": "bool"
> },
> {
> "name": "primary_conninfo",
> "value": "dbname=postgres user=cloud_admin host=127.0.0.1 port=55432 sslmode=disable gssencmode=disable target_session_attrs=any",
- Start hot standby
mkdir -p .neon/endpoints/hotsb/pgdata
./target/debug/compute_ctl \
--pgdata .neon/endpoints/hotsb/pgdata \
-C "postgresql://cloud_admin@localhost:55434/postgres" \
-b ./pg_install/v14/bin/postgres \
-S spec-hotsb.json
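As a quick sanity check (my addition, not part of the original steps), the standard pg_is_in_recovery() function confirms the new node actually came up as a standby:
# Should return 't' on a hot standby
psql -p55434 -h 127.0.0.1 -U cloud_admin postgres -c "select pg_is_in_recovery();"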
- Check if both main/primary and hot standby compute nodes are running
./target/debug/neon_local endpoint list
ENDPOINT ADDRESS TIMELINE BRANCH NAME LSN STATUS
hotsb 127.0.0.1:55434 546eb99a5ca79532dfa839b83ce227ee main 0/169AFA0 running
main 127.0.0.1:55432 546eb99a5ca79532dfa839b83ce227ee main 0/169AFA0 running
- Run a simple test
On the main/primary compute node, execute the following commands:
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c "create table t(a int, b int); insert into t values(generate_series(1, 10), 1);"
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c "select count(*) from t;"
On the hot standby compute node, run:
psql -p55434 -h 127.0.0.1 -U cloud_admin postgres -c "select count(*) from t;"
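To rule out ordinary replication lag when comparing counts, you can also compare LSNs using standard PostgreSQL functions (an aside I am adding, not part of the original report):
# On the primary: the last WAL position flushed
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c "select pg_current_wal_flush_lsn();"
# On the standby: the last WAL position replayed; once it reaches the
# primary's flush LSN, both nodes should return the same counts
psql -p55434 -h 127.0.0.1 -U cloud_admin postgres -c "select pg_last_wal_replay_lsn();"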
Generally, at this point, both the hot standby and main/primary nodes should have the same number of tuples. However, the visibility issue arises when inserting more data. For example, if the following commands are executed and the results are checked on both sides, the discrepancy becomes evident:
On the main/primary node:
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c "insert into t values(generate_series(1, 10000), 1);"
INSERT 0 10000
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c "insert into t values(generate_series(1, 1000000), 1);"
INSERT 0 1000000
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c "select count(*) from t;"
count
---------
1010010
(1 row)
On the hot standby node:
psql -p55434 -h 127.0.0.1 -U cloud_admin postgres -c "select count(*) from t;"
count
---------
2001189
(1 row)
In some cases, the relation may even appear to be gone. When this happens, I have to stop the hot standby and start it again to restore its functionality:
psql -p55434 -h 127.0.0.1 -U cloud_admin postgres -c "select count(*) from t;"
ERROR: could not open relation with OID 16384
LINE 1: select count(*) from t;
^
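For reference, a hedged version of the stop/start cycle I use to recover: stop the manually started standby with the pg_ctl shipped in the build, then rerun the compute_ctl command from the setup step.
# Stop the standby's postgres, then restart it with the same compute_ctl
# invocation used in the setup step above
./pg_install/v14/bin/pg_ctl -D .neon/endpoints/hotsb/pgdata stop -m fast
./target/debug/compute_ctl \
  --pgdata .neon/endpoints/hotsb/pgdata \
  -C "postgresql://cloud_admin@localhost:55434/postgres" \
  -b ./pg_install/v14/bin/postgres \
  -S spec-hotsb.json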
Additionally, it appears that the hot standby retrieves WAL from the main/primary compute node, rather than from the safekeeper or pageserver. I’m unsure if this behavior is by design or if it’s a configuration issue on my part.
Thank you for your attention and assistance.
David
Have you tried creating a hot standby using the CLI instead, like this:
neon_local endpoint create \
--hot-standby=true \
--branch-name=my-branch-name \
replica
This should configure the primary_conninfo correctly by pointing it to the safekeeper, and add several other parameters we use to configure both PostgreSQL and the Neon extension to work correctly as a hot standby.
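If you want to see exactly what the CLI generated, you can grep the standby’s config; the path below assumes the endpoint layout used elsewhere in this thread:
# Inspect the generated standby settings
grep -E "primary_conninfo|hot_standby|neon\." .neon/endpoints/replica/pgdata/postgresql.conf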
If it’s still broken after that, could you share the log outputs of the primary and secondary instances with us? These should be in the postgres data directory of each endpoint.
Additionally, it appears that the hot standby retrieves WAL from the main/primary compute node, rather than from the safekeeper or pageserver. I’m unsure if this behavior is by design or if it’s a configuration issue on my part.
That is because you configured primary_conninfo to [...] host=127.0.0.1 port=55432 [...], which is the address and port of your primary instance, not the safekeeper.
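For contrast, a hedged illustration of a safekeeper-pointing conninfo, reusing your connection string with only the port swapped to the safekeeper’s 5454 from the spec diff above (illustrative only, not necessarily what neon_local generates):
# Hypothetical illustration: stream from the safekeeper (port 5454 in the
# spec diff), not from the primary at 55432
primary_conninfo = 'dbname=postgres user=cloud_admin host=127.0.0.1 port=5454 sslmode=disable gssencmode=disable target_session_attrs=any'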
Thank you for pointing me to the right way to start a hot standby replica that replicates WAL from the safekeeper. I changed my setup to the steps below, but I still see the same visibility issue on the hot standby as before.
Step-1: Initialize and start the storage nodes, the main/primary node, and the replica/hot-standby node with the commands below, in order.
./target/debug/neon_local init
./target/debug/neon_local start
./target/debug/neon_local tenant create --set-default
./target/debug/neon_local endpoint start main
./target/debug/neon_local endpoint list
./target/debug/neon_local endpoint create \
--hot-standby=true \
--branch-name=main \
replica
./target/debug/neon_local endpoint start replica
Step-2: Make sure both main and replica are running
$ ./target/debug/neon_local endpoint list
ENDPOINT ADDRESS TIMELINE BRANCH NAME LSN STATUS
main 127.0.0.1:55432 a95a178b03b714a20223dea0a3641ff3 main 0/169AFA0 running
replica 127.0.0.1:55433 a95a178b03b714a20223dea0a3641ff3 main 0/169AFA0 running
Step-3: Run visibility tests
Execute insert from main node:
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c 'create table t(a int, b int);'
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c "insert into t values(generate_series(1, 100000), 1);"
.... after running a simple count query on the replica, add more data:
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c "insert into t values(generate_series(1, 100000), 1);"
Check tuples from replica node:
psql -p55433 -h 127.0.0.1 -U cloud_admin postgres -c "select count(*) from t;"
count
-------
99994
.... query result after the second insert from main:
psql -p55433 -h 127.0.0.1 -U cloud_admin postgres -c "select count(*) from t;"
count
--------
199254
(1 row)
My understanding is that, since each insert statement is a single transaction, the hot standby replica should only ever return 0, 100000, or 200000 tuples; it should never return a number in between.
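Here is a minimal polling sketch that should make the intermediate counts easy to catch (my addition, not part of the original steps):
# Poll the replica while the primary runs the inserts above; with
# transactional visibility only 0, 100000, or 200000 should ever appear
while true; do
  psql -p55433 -h 127.0.0.1 -U cloud_admin postgres -Atc "select count(*) from t;"
  sleep 0.2
done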
Here are some logs from the hot standby replica after adding ‘log_min_messages = debug2’ to .neon/endpoints/replica/pgdata/postgresql.conf:
tail -f .neon/endpoints/replica/pgdata/pg.log | grep -vE "GetXLogReplayRecPtr|sendtime|sending write"
2023-06-14 21:10:30.011 GMT [2164554] CONTEXT: WAL redo at 0/1DD4270 for Heap/INSERT: off 109 flags 0x01; blkref #0: rel 1663/13008/16384, blk 442
2023-06-14 21:10:30.059 GMT [2164554] DEBUG: updated min recovery point to 0/1FAA6F8 on timeline 1
2023-06-14 21:10:30.059 GMT [2164554] CONTEXT: writing block 0 of relation base/13008/16384_vm
WAL redo at 0/1FAA6B0 for Heap/INSERT+INIT: off 1 flags 0x00; blkref #0: rel 1663/13008/16384, blk 560
2023-06-14 21:10:30.138 GMT [2164558] DEBUG: creating and filling new WAL file
2023-06-14 21:10:30.144 GMT [2164558] DEBUG: done creating and filling new WAL file
2023-06-14 21:10:30.179 GMT [2164554] DEBUG: updated min recovery point to 0/21A6780 on timeline 1
2023-06-14 21:10:30.179 GMT [2164554] CONTEXT: writing block 560 of relation base/13008/16384
WAL redo at 0/21A6738 for Heap/INSERT+INIT: off 1 flags 0x00; blkref #0: rel 1663/13008/16384, blk 687
2023-06-14 21:10:30.220 GMT [2164554] DEBUG: updated min recovery point to 0/23A07A8 on timeline 1
2023-06-14 21:10:30.220 GMT [2164554] CONTEXT: writing block 687 of relation base/13008/16384
WAL redo at 0/23A0760 for Heap/INSERT+INIT: off 1 flags 0x00; blkref #0: rel 1663/13008/16384, blk 814
2023-06-14 21:10:41.204 GMT [2164553] DEBUG: forked new backend, pid=2164615 socket=8
2023-06-14 21:10:41.205 GMT [2164615] LOG: [NEON_SMGR] libpagestore: connected to 'postgresql://no_user@127.0.0.1:64000'
2023-06-14 21:10:41.211 GMT [2164615] DEBUG: updated min recovery point to 0/24BB318 on timeline 1
2023-06-14 21:10:41.211 GMT [2164615] CONTEXT: writing block 814 of relation base/13008/16384
2023-06-14 21:10:41.435 GMT [2164553] DEBUG: server process (PID 2164615) exited with exit code 0
It is easy to reproduce in my Ubuntu 22.04 environment; if any further information is required, please let me know.
Since I cannot upload a .txt file, I put the full log files with debug2 enabled at this link:
https://drive.google.com/drive/folders/1fvIklf9-JEPAYFcAb7X0KzQaZO1juuQg?usp=sharing
Thank you!
Any updates on when read replicas might be available in Neon? We’re almost at the end of “the first half of next year” from the original reply, so @Matthias, I’d love to know if this is coming. I am particularly keen on multi-region read replicas! Thanks!
The fix has landed. Thanks for submitting this issue.
In-region replication will be available soon; we’re at the UI stage. Cross-region replication didn’t make it into Q3; most probably it will be delivered in Q4 or Q1.
I verified the fix in the same way as posted above; the issues can no longer be reproduced. Thanks a lot for the fix.
Meanwhile, we have released support for read replicas.