I was looking at the docs and I couldn’t find how to configure read replicas (i.e., hot_standby, as Postgres calls it: replicas that follow the primary and apply its WAL). Is that not available currently?
Read replicas are currently not supported on the hosted version of Neon.
We plan to support read replicas inside our platform sometime after we release branching, but we can’t put a date on when it will be available. If you ask me, we’ll probably support read replicas inside our platform starting sometime in the first half of next year. Please note that I’m not responsible for planning feature development, so this might turn out differently from my expectations.
External read replicas probably won’t be available for a long time, because our current changes to PostgreSQL include WAL format changes that are incompatible with standard PostgreSQL’s WAL format. We are working on removing that incompatibility, but that will only take effect in or after the release of PostgreSQL 16, assuming we can get the changes into upstream PostgreSQL.
I was able to set up a hot standby. If my procedure is correct, then I would like to report an issue I encountered while setting it up: despite following the procedure outlined below, I ran into visibility problems where the hot standby failed to return the same number of tuples as the main/primary node, regardless of how long I waited.
Below are the steps I followed to set up the hot standby and reproduce the visibility issues. The setup was performed on Ubuntu 22.04.2 LTS:
- Clone Neon from GitHub
git clone --recursive https://github.com/neondatabase/neon.git
cd neon
- Switch to the release branch and then build
git checkout remotes/origin/releases/2023-05-02 -b releases-2023-05-02
make -j`nproc` -s
- Start the main compute node
./target/debug/neon_local init
./target/debug/neon_local start
./target/debug/neon_local tenant create --set-default
./target/debug/neon_local endpoint start main
./target/debug/neon_local endpoint list
Note: record the tenant ID and timeline ID for the hot standby spec. Example: “Created an initial timeline ‘546eb99a5ca79532dfa839b83ce227ee’ at Lsn 0/169AF68 for tenant: 2c49810ed75ee7828823fc5f0d5b3adf”
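If you lose track of these IDs later, they should also be listable from the CLI. A hedged aside: I am assuming the tenant list and timeline list subcommands exist in this release; the names may differ.
# Assumption: these neon_local subcommands exist in this release
./target/debug/neon_local tenant list
./target/debug/neon_local timeline list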
- Prepare the spec.json for a hot standby compute node
cp -pr ./docker-compose/compute_wrapper/var/db/postgres/specs/spec.json spec-hotsb.json
vim spec-hotsb.json
diff ./docker-compose/compute_wrapper/var/db/postgres/specs/spec.json spec-hotsb.json
43c43
< "value": "55433",
---
> "value": "55434",
103c103
< "value": "safekeeper1:5454,safekeeper2:5454,safekeeper3:5454",
---
> "value": "127.0.0.1:5454",
108c108
< "value": "TIMELINE_ID",
---
> "value": "546eb99a5ca79532dfa839b83ce227ee",
113c113
< "value": "TENANT_ID",
---
> "value": "2c49810ed75ee7828823fc5f0d5b3adf",
118c118,128
< "value": "host=pageserver port=6400",
---
> "value": "host=127.0.0.1 port=64000",
> "vartype": "string"
> },
> {
> "name": "hot_standby",
> "value": "on",
> "vartype": "bool"
> },
> {
> "name": "primary_conninfo",
> "value": "dbname=postgres user=cloud_admin host=127.0.0.1 port=55432 sslmode=disable gssencmode=disable target_session_attrs=any",
- Start hot standby
mkdir -p .neon/endpoints/hotsb/pgdata
./target/debug/compute_ctl \
--pgdata .neon/endpoints/hotsb/pgdata \
-C "postgresql://cloud_admin@localhost:55434/postgres" \
-b ./pg_install/v14/bin/postgres \
-S spec-hotsb.json
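As a quick sanity check (my addition, not part of the original steps), the standard pg_is_in_recovery() function confirms the new node actually came up as a standby:
# Should return 't' on a hot standby
psql -p55434 -h 127.0.0.1 -U cloud_admin postgres -c "select pg_is_in_recovery();"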
- Check if both main/primary and hot standby compute nodes are running
./target/debug/neon_local endpoint list
ENDPOINT ADDRESS TIMELINE BRANCH NAME LSN STATUS
hotsb 127.0.0.1:55434 546eb99a5ca79532dfa839b83ce227ee main 0/169AFA0 running
main 127.0.0.1:55432 546eb99a5ca79532dfa839b83ce227ee main 0/169AFA0 running
- Run a simple test
On the main/primary compute node, execute the following commands:
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c "create table t(a int, b int); insert into t values(generate_series(1, 10), 1);"
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c "select count(*) from t;"
On the hot standby compute node, run:
psql -p55434 -h 127.0.0.1 -U cloud_admin postgres -c "select count(*) from t;"
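To rule out ordinary replication lag when comparing counts, you can also compare LSNs using standard PostgreSQL functions (an aside I am adding, not part of the original report):
# On the primary: the last WAL position flushed
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c "select pg_current_wal_flush_lsn();"
# On the standby: the last WAL position replayed; once it reaches the
# primary's flush LSN, both nodes should return the same counts
psql -p55434 -h 127.0.0.1 -U cloud_admin postgres -c "select pg_last_wal_replay_lsn();"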
Generally, at this point, both the hot standby and main/primary nodes should have the same number of tuples. However, the visibility issue arises when inserting more data. For example, if the following commands are executed and the results are checked on both sides, the discrepancy becomes evident:
On the main/primary node:
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c "insert into t values(generate_series(1, 10000), 1);"
INSERT 0 10000
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c "insert into t values(generate_series(1, 1000000), 1);"
INSERT 0 1000000
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c "select count(*) from t;"
count
---------
1010010
(1 row)
On the hot standby node:
psql -p55434 -h 127.0.0.1 -U cloud_admin postgres -c "select count(*) from t;"
count
---------
2001189
(1 row)
In some cases, the relation may even appear to be gone. When this happens, I have to stop the hot standby and start it again to restore its functionality:
psql -p55434 -h 127.0.0.1 -U cloud_admin postgres -c "select count(*) from t;"
ERROR: could not open relation with OID 16384
LINE 1: select count(*) from t;
^
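For reference, a hedged version of the stop/start cycle I use to recover: stop the manually started standby with the pg_ctl shipped in the build, then rerun the compute_ctl command from the setup step.
# Stop the standby's postgres, then restart it with the same compute_ctl
# invocation used in the setup step above
./pg_install/v14/bin/pg_ctl -D .neon/endpoints/hotsb/pgdata stop -m fast
./target/debug/compute_ctl \
  --pgdata .neon/endpoints/hotsb/pgdata \
  -C "postgresql://cloud_admin@localhost:55434/postgres" \
  -b ./pg_install/v14/bin/postgres \
  -S spec-hotsb.json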
Additionally, it appears that the hot standby retrieves WAL from the main/primary compute node, rather than from the safekeeper or pageserver. I’m unsure if this behavior is by design or if it’s a configuration issue on my part.
Thank you for your attention and assistance.
David
Have you tried creating a hot standby using the CLI instead, like this:
neon_local endpoint create \
--hot-standby=true \
--branch-name=my-branch-name \
replica
This should configure the primary_conninfo correctly by pointing it to the safekeeper, and add several other parameters we use to configure both PostgreSQL and the Neon extension to work correctly as a hot standby.
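If you want to see exactly what the CLI generated, you can grep the standby’s config; the path below assumes the endpoint layout used elsewhere in this thread:
# Inspect the generated standby settings
grep -E "primary_conninfo|hot_standby|neon\." .neon/endpoints/replica/pgdata/postgresql.conf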
If it’s still broken after that, could you share the log outputs of the primary and secondary instances with us? These should be in the postgres data directory of each endpoint.
Additionally, it appears that the hot standby retrieves WAL from the main/primary compute node, rather than from the safekeeper or pageserver. I’m unsure if this behavior is by design or if it’s a configuration issue on my part.
That is because you configured primary_conninfo to [...] host=127.0.0.1 port=55432 [...], which is the address and port of your primary instance, not the safekeeper.
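For contrast, a hedged illustration of a safekeeper-pointing conninfo, reusing your connection string with only the port swapped to the safekeeper’s 5454 from the spec diff above (illustrative only, not necessarily what neon_local generates):
# Hypothetical illustration: stream from the safekeeper (port 5454 in the
# spec diff), not from the primary at 55432
primary_conninfo = 'dbname=postgres user=cloud_admin host=127.0.0.1 port=5454 sslmode=disable gssencmode=disable target_session_attrs=any'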
Thank you for pointing me to the right way to start a hot standby replica that replicates WAL from the safekeeper. I changed my setup to the steps below, but I still see the same visibility issue on the hot standby as before.
Step-1: Initialize and start the storage nodes, the main/primary node, and the replica/hot-standby node with the commands below, in order.
./target/debug/neon_local init
./target/debug/neon_local start
./target/debug/neon_local tenant create --set-default
./target/debug/neon_local endpoint start main
./target/debug/neon_local endpoint list
./target/debug/neon_local endpoint create \
--hot-standby=true \
--branch-name=main \
replica
./target/debug/neon_local endpoint start replica
Step-2: Make sure both main and replica are running
$ ./target/debug/neon_local endpoint list
ENDPOINT ADDRESS TIMELINE BRANCH NAME LSN STATUS
main 127.0.0.1:55432 a95a178b03b714a20223dea0a3641ff3 main 0/169AFA0 running
replica 127.0.0.1:55433 a95a178b03b714a20223dea0a3641ff3 main 0/169AFA0 running
Step-3: Run visibility tests
Execute insert from main node:
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c 'create table t(a int, b int);'
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c "insert into t values(generate_series(1, 100000), 1);"
.... after running a simple count query on the replica, add more data:
psql -p55432 -h 127.0.0.1 -U cloud_admin postgres -c "insert into t values(generate_series(1, 100000), 1);"
Check tuples from replica node:
psql -p55433 -h 127.0.0.1 -U cloud_admin postgres -c "select count(*) from t;"
count
-------
99994
.... query result after the second insert from main:
psql -p55433 -h 127.0.0.1 -U cloud_admin postgres -c "select count(*) from t;"
count
--------
199254
(1 row)
My understanding is that, since each insert statement is a single transaction, the hot standby replica should only ever return 0, 100000, or 200000 tuples; it should never return a number in between.
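Here is a minimal polling sketch that should make the intermediate counts easy to catch (my addition, not part of the original steps):
# Poll the replica while the primary runs the inserts above; with
# transactional visibility only 0, 100000, or 200000 should ever appear
while true; do
  psql -p55433 -h 127.0.0.1 -U cloud_admin postgres -Atc "select count(*) from t;"
  sleep 0.2
done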
Here are some logs from the hot standby replica after adding ‘log_min_messages = debug2’ to .neon/endpoints/replica/pgdata/postgresql.conf:
tail -f .neon/endpoints/replica/pgdata/pg.log | grep -vE "GetXLogReplayRecPtr|sendtime|sending write"
2023-06-14 21:10:30.011 GMT [2164554] CONTEXT: WAL redo at 0/1DD4270 for Heap/INSERT: off 109 flags 0x01; blkref #0: rel 1663/13008/16384, blk 442
2023-06-14 21:10:30.059 GMT [2164554] DEBUG: updated min recovery point to 0/1FAA6F8 on timeline 1
2023-06-14 21:10:30.059 GMT [2164554] CONTEXT: writing block 0 of relation base/13008/16384_vm
WAL redo at 0/1FAA6B0 for Heap/INSERT+INIT: off 1 flags 0x00; blkref #0: rel 1663/13008/16384, blk 560
2023-06-14 21:10:30.138 GMT [2164558] DEBUG: creating and filling new WAL file
2023-06-14 21:10:30.144 GMT [2164558] DEBUG: done creating and filling new WAL file
2023-06-14 21:10:30.179 GMT [2164554] DEBUG: updated min recovery point to 0/21A6780 on timeline 1
2023-06-14 21:10:30.179 GMT [2164554] CONTEXT: writing block 560 of relation base/13008/16384
WAL redo at 0/21A6738 for Heap/INSERT+INIT: off 1 flags 0x00; blkref #0: rel 1663/13008/16384, blk 687
2023-06-14 21:10:30.220 GMT [2164554] DEBUG: updated min recovery point to 0/23A07A8 on timeline 1
2023-06-14 21:10:30.220 GMT [2164554] CONTEXT: writing block 687 of relation base/13008/16384
WAL redo at 0/23A0760 for Heap/INSERT+INIT: off 1 flags 0x00; blkref #0: rel 1663/13008/16384, blk 814
2023-06-14 21:10:41.204 GMT [2164553] DEBUG: forked new backend, pid=2164615 socket=8
2023-06-14 21:10:41.205 GMT [2164615] LOG: [NEON_SMGR] libpagestore: connected to 'postgresql://no_user@127.0.0.1:64000'
2023-06-14 21:10:41.211 GMT [2164615] DEBUG: updated min recovery point to 0/24BB318 on timeline 1
2023-06-14 21:10:41.211 GMT [2164615] CONTEXT: writing block 814 of relation base/13008/16384
2023-06-14 21:10:41.435 GMT [2164553] DEBUG: server process (PID 2164615) exited with exit code 0
It is easy to reproduce in my Ubuntu 22.04 environment; if any further information is required, please let me know.
Since I cannot upload a .txt file, I put the full log files with debug2 enabled at this link:
https://drive.google.com/drive/folders/1fvIklf9-JEPAYFcAb7X0KzQaZO1juuQg?usp=sharing
Thank you!
Any updates on when read replicas might be available in Neon? We’re almost at the end of “the first half of next year” from the original reply, so @Matthias, I’d love to know if this is coming. I am particularly keen on multi-region read replicas! Thanks!
The fix has landed. Thanks for submitting this issue.
In-region replication will be available soon; we’re at the UI stage. Cross-region replication didn’t make it into Q3; most probably it will be delivered in Q4 or Q1.
I verified the fix in the same way as posted above; the issues can no longer be reproduced. Thanks a lot for the fix.
Meanwhile, we have released support for read replicas.