When problems occur in a RisingWave cluster, such as node restarts or network anomalies, the cluster enters a recovery process to ensure data consistency.
However, some issues can cause recovery itself to fail, leaving the cluster stuck in a recovering state and unavailable. This topic provides information about recovery failures in a RisingWave cluster.
How to identify: The cluster is waiting for new nodes to join during the recovery process, and the meta node keeps printing logs like: waiting for new workers to join, elapsed xxxs.
Check whether the available parallel units of the compute nodes in the cluster have decreased (see the query sketch after these checks).
Check whether some compute nodes in the cluster have been taken offline intentionally.
Check whether the CPU resources allocated to some compute nodes have been reduced and those nodes have been offline for more than 5 minutes.
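As a quick way to inspect the current compute nodes and their parallelism, you can query the system catalog from any SQL client connected to the cluster. This is only a minimal sketch; the table and column names (`rw_catalog.rw_worker_nodes`, `parallelism`) are assumptions and may differ between RisingWave versions.

```sql
-- Sketch: list the compute nodes and their parallelism from the system catalog.
-- Table and column names here are assumptions and may differ across versions.
SELECT id, host, port, state, parallelism
FROM rw_catalog.rw_worker_nodes;
```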
Solution:
Manually specify the parallelism of the compute nodes with the --parallelism parameter and restart them, or temporarily launch additional compute nodes to meet the parallelism requirement (see the sketch after these steps).
After that, to avoid OOM issues, perform some scaling: either decrease the parallelism of all running streaming jobs or scale them out onto the temporary nodes.
Finally, change back to the new parallelism.
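As a rough sketch, assuming the all-in-one `risingwave` binary is used to launch compute nodes, restarting a node with an explicit parallelism might look like the following. The flag names, listen address, and the value 4 are illustrative and depend on your version and deployment method.

```shell
# Relaunch a compute node with an explicitly specified parallelism.
# The address and the value 4 are illustrative; match the parallelism to
# the CPU resources actually allocated to the node.
risingwave compute-node \
    --listen-addr 0.0.0.0:5688 \
    --parallelism 4
```

Depending on your RisingWave version, the parallelism of an individual running streaming job can then be lowered from a SQL client with a statement along the lines of `ALTER MATERIALIZED VIEW my_mv SET PARALLELISM = 4;` (the job name `my_mv` is a placeholder, and the exact syntax may vary).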
If you have specified the parallelism of the compute nodes but do not want to specify it manually every time you create streaming jobs, you can follow these steps (a combined sketch is shown after them):
Stop the compute nodes.
Unregister them from the cluster: risectl meta unregister-workers --workers <worker_id or worker_host, ...>.
Remove the --parallelism parameter and restart the compute nodes.
Take the temporary nodes offline.
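Put together, a sketch of the procedure above might look like this. The worker hosts below are placeholders; substitute the worker IDs or hosts reported for your own compute nodes.

```shell
# 1. Stop the compute nodes (how depends on your deployment, e.g. kubectl or systemctl).

# 2. Unregister the stopped workers from the meta node.
#    The worker hosts are placeholders; use your own worker IDs or hosts.
risectl meta unregister-workers --workers 192.168.1.10:5688,192.168.1.11:5688

# 3. Restart the compute nodes without the --parallelism flag,
#    then take the temporary nodes offline.
```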
Note: all of these cases will be covered by the auto-scaling feature that is on the way.
How to identify:
When there are network connection problems between the meta node and the compute nodes, or between the meta node and the etcd nodes, cluster recovery will also keep failing. You may find logs like connection refused or error trying to connect: dns error: failed to lookup address information: Name or service not known.
Solution:
Check the network configuration of the deployment environment and whether the Kubernetes operators are working properly (a connectivity-check sketch is shown below).
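To narrow down where connectivity breaks, it can help to test DNS resolution and TCP reachability from inside the meta node's pod. The following is a minimal sketch assuming a Kubernetes deployment; the pod and service names (risingwave-meta-0, risingwave-etcd, risingwave-compute) and ports are placeholders for your own resources, and the tools must be available in the image.

```shell
# Test DNS resolution and TCP reachability from the meta pod.
# Pod/service names and ports are placeholders for your own deployment.
kubectl exec -it risingwave-meta-0 -- nslookup risingwave-etcd
kubectl exec -it risingwave-meta-0 -- nc -zv risingwave-compute 5688
kubectl exec -it risingwave-meta-0 -- nc -zv risingwave-etcd 2379
```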