Known issues
This page documents significant known issues in Weaviate, their symptoms, and recommended resolutions. Use the table below to find issues that may be affecting you and to check their status.
Quick reference
Issue | Affected Versions | Resolution | Fixed In |
---|---|---|---|
Empty collections panic | 1.28 - 1.31 | In Progress | - |
Database restoration blocked | 1.27.23-26, 1.28.14-15, 1.29.5-7, 1.30.3 | Fixed | 1.27.27, 1.28.16, 1.29.8, 1.30.4 |
RAFT snapshot compatibility on downgrade | 1.28.13+, 1.29.5+, 1.30.2+ (when downgrading to 1.27.25 or earlier) | Fixed | 1.27.26 |
RAFT bootstrap timeout | 1.25 - 1.28 | Workaround | - |
RAFT timeouts under heavy load | 1.25 and later | Configuration | - |
Invalid port 99999999 error | 1.25 - 1.28 | By Design | - |
Context deadline on tenant deletion | 1.26 - 1.28 | Fixed | 1.26.14, 1.27.11, 1.28.5 |
Memory pressure: Shard init failure | All | Configuration | - |
RAFT snapshot cannot be created | 1.25 - 1.28 | Workaround | - |
Failed to decode incoming command | 1.25 - 1.29 | Configuration | - |
If you can't find the issue you are experiencing, you can always open a new one.
Known issues in detail
Database restoration blocked
- Affected versions: 1.27.23-26, 1.28.14-15, 1.29.5-7, 1.30.3
- Resolution: Fixed in 1.27.27, 1.28.16, 1.29.8, 1.30.4
Symptoms
- Nodes fail to start and remain stuck in initialization indefinitely
- Repeated log messages:
waiting for database to be restored
- May show schema update errors during RAFT command replay:
cmd_class":"Content_par_test","cmd_type":2,"cmd_type_name":"TYPE_UPDATE_CLASS","error":"updating schema: TYPE_UPDATE_CLASS: bad request :parse class update: property \"content\": property fields other than description cannot be updated through updating the class. Use the add property feature (e.g. \"POST /v1/schema/{className}/properties\")
Root cause
A regression introduced in schema catch-up handling caused invalid RAFT commands to block database initialization. The system continuously retried these invalid commands, preventing the database from being marked as ready.
Resolution
Upgrade to the fixed version:
- If on 1.27.x → upgrade to 1.27.27 or higher
- If on 1.28.x → upgrade to 1.28.16 or higher
- If on 1.29.x → upgrade to 1.29.8 or higher
- If on 1.30.x → upgrade to 1.30.4 or higher
RAFT snapshot compatibility on downgrade
- Affected versions: 1.28.13+, 1.29.5+, 1.30.2+ (when downgrading to 1.27.25 or earlier)
- Resolution: Fixed in 1.27.26
Symptoms
After downgrading from 1.28+ to older 1.27.x versions:
- RAFT snapshots fail to load
- Cluster cannot reach Ready state
- Node initialization failures
Root cause
RAFT snapshot format changes introduced in 1.28.13, 1.29.5, and 1.30.2 are not backward compatible with 1.27 releases prior to 1.27.26.
Resolution
When downgrading from 1.28.13+, 1.29.5+, or 1.30.2+:
- Ensure you downgrade to version 1.27.26 or higher
- Do not downgrade to 1.27.25 or earlier
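The downgrade floor above can be enforced with a small version check before rolling back. This is an illustrative sketch only: the check_downgrade helper and its output format are hypothetical, not part of any Weaviate tooling.

```shell
# Illustrative sketch: guard a rollback target against the 1.27.26 floor.
# check_downgrade is a hypothetical helper, not part of Weaviate tooling.
min="1.27.26"
check_downgrade() {
  target="$1"
  # sort -V orders version strings numerically; the first line is the older one
  lowest=$(printf '%s\n%s\n' "$min" "$target" | sort -V | head -n1)
  if [ "$lowest" = "$min" ]; then
    echo "OK: $target is a safe downgrade target (>= $min)"
  else
    echo "BLOCKED: $target is older than $min"
  fi
}
check_downgrade 1.27.27   # -> OK
check_downgrade 1.27.25   # -> BLOCKED
```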
RAFT bootstrap timeout
- Affected versions: 1.25, 1.26, 1.27, 1.28
- Resolution: Workaround available
Symptoms
- Nodes stuck in a crash loop with consistent restart intervals
- Logs show schema catch-up in progress but never completing:
Schema catching up: applying log entry: [X/Y]
- Startup probe failures in Kubernetes
Root cause
During cluster initialization or node recovery, applying accumulated RAFT log entries (especially with large schemas or many collections) may exceed the default 600-second bootstrap timeout.
Resolution
Increase bootstrap timeout and startup probe:
env:
  - name: RAFT_BOOTSTRAP_TIMEOUT
    value: "900" # 15 minutes
startupProbe:
  failureThreshold: 90 # 90 * 10 = 900 seconds
  periodSeconds: 10
  httpGet:
    path: /v1/.well-known/ready
    port: 8080
Calculate appropriate timeout:
- Small clusters (< 10 collections): 600s default usually sufficient
- Medium clusters (10-100 collections): 900-1800s
- Large clusters (100+ collections): 1800-3600s
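The sizing guidance above can be sketched as a small helper. The thresholds and the choice of upper bounds come from this table, not from a formula Weaviate itself uses:

```shell
# Rough helper encoding the sizing guidance above (upper bounds chosen);
# the thresholds are documentation guidance, not a Weaviate formula.
suggest_bootstrap_timeout() {
  collections="$1"
  if [ "$collections" -lt 10 ]; then
    echo 600    # small: default usually sufficient
  elif [ "$collections" -le 100 ]; then
    echo 1800   # medium: upper end of 900-1800s
  else
    echo 3600   # large: upper end of 1800-3600s
  fi
}
suggest_bootstrap_timeout 50   # -> 1800
```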
Note: Versions 1.27.4+, 1.26.11+, and 1.25.26+ include optimizations that reduce schema rebuild time during catch-up.
Prevention
- Monitor cluster scale and adjust timeouts proactively
- Use WCS tool:
wcs startup-count <failure-threshold> --apply; wcs sync
RAFT timeouts under heavy load
- Affected versions: 1.25 and later
- Resolution: Configuration available (default improved in 1.31+)
Symptoms
- Frequent leader elections and cluster instability
- Logs showing timeout errors:
heartbeat timeout reached, starting election
Election timeout reached, restarting election
memberlist: Failed fallback TCP ping: timeout 1s: read tcp [...]: i/o timeout
- High CPU usage, memory pressure, or goroutine counts in monitoring
- Performance degradation during normal operations
Root cause
Under heavy load or network latency, nodes cannot respond to RAFT heartbeats and memberlist pings within default timeout windows. This causes false failure detection, unnecessary leader elections, and cascading performance issues.
The issue is typically a symptom of underlying resource pressure rather than a RAFT problem itself.
Resolution
1. Adjust RAFT timeout multiplier:
# Production (default in 1.31+)
RAFT_TIMEOUTS_MULTIPLIER=5
# High-latency networks
RAFT_TIMEOUTS_MULTIPLIER=10
# Heavily loaded or unstable environments
RAFT_TIMEOUTS_MULTIPLIER=15
This multiplies all timeout values:
- Heartbeat timeout: 1s → 5s (with multiplier of 5)
- Election timeout: 1s → 5s
- Leader lease timeout: 0.5s → 2.5s
- Memberlist TCP timeout: 10s → 50s
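The arithmetic behind the list above can be checked for any multiplier value; this sketch just multiplies the base timeouts listed here:

```shell
# Sketch: compute the effective timeouts for a given multiplier value,
# using the base values listed above.
multiplier=5
awk -v m="$multiplier" 'BEGIN {
  printf "heartbeat timeout:    %gs\n", 1   * m
  printf "election timeout:     %gs\n", 1   * m
  printf "leader lease timeout: %gs\n", 0.5 * m
  printf "memberlist TCP:       %gs\n", 10  * m
}'
```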
2. Investigate root cause:
Check for underlying issues:
- Memory pressure or OOM events
- CPU saturation
- Network latency or packet loss
- Too many collections causing Go scheduler pressure
3. If too many collections:
Reduce Go scheduler load:
GOMAXPROCS=<value less than available CPUs>
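One way to derive a value below the available CPU count is sketched here. Reserving two cores is an assumed heuristic for illustration, not an official recommendation:

```shell
# Sketch: pick a GOMAXPROCS value below the CPU count. Reserving two
# cores is one possible heuristic, not an official recommendation.
cpus=$(nproc)
if [ "$cpus" -gt 2 ]; then
  gomaxprocs=$((cpus - 2))
else
  gomaxprocs=1
fi
echo "GOMAXPROCS=$gomaxprocs (of $cpus CPUs)"
```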
Best practices
- Start with default multiplier (5) for most environments
- Increase gradually if seeing frequent elections
- Monitor cluster stability after changes
- Address underlying resource issues rather than only masking with higher timeouts
Invalid port 99999999
- Affected versions: 1.25, 1.26, 1.27, 1.28
- Resolution: By design (not a bug)
Symptoms
Error message in logs:
dial tcp: address 99999999: invalid port
Often accompanied by memberlist instability messages:
memberlist: Suspect weaviate-0 has failed, no acks received
memberlist: Marking weaviate-0 as failed, suspect timeout reached
Root cause
This is not a RAFT problem but a symptom of memberlist instability. The invalid port 99999999 is intentionally returned to prevent RAFT from communicating with nodes that are not part of the memberlist, which prevents cross-talk issues where RAFT might contact old IP addresses from previous cluster configurations.
The underlying cause is typically:
- Too many collections causing Go scheduler slowdown and network I/O delays
- Network connectivity issues preventing memberlist health checks
Resolution
Address underlying causes:
- If too many collections:
GOMAXPROCS=<value less than available CPUs>
- If network issues:
- Check connectivity between all cluster nodes
- Review network policies and firewall rules
- Verify DNS resolution
Context deadline on tenant deletion
- Affected versions: 1.26, 1.27, 1.28
- Resolution: Fixed in 1.26.14, 1.27.11, 1.28.5
Symptoms
Tenant deletion fails with timeout errors:
context deadline exceeded
session: fetching region failed: RequestCanceled: request context canceled
caused by: context deadline exceeded
Occurs only when the tenant offloading module (offload-s3) is enabled.
Root cause
When tenant offloading is enabled and AWS credentials are misconfigured, the deletion process attempts to delete cloud resources but times out waiting for AWS responses.
Resolution
Temporary workaround (if upgrade not immediately possible):
Option 1: Disable tenant offloading
# Remove or disable tenant offloading module configuration
Option 2: Correct AWS credentials
# Provide valid AWS credentials for tenant offloading
AWS_ACCESS_KEY_ID=<valid_key>
AWS_SECRET_ACCESS_KEY=<valid_secret>
Permanent fix: Upgrade to fixed version:
- 1.26.x → 1.26.14 or higher
- 1.27.x → 1.27.11 or higher
- 1.28.x → 1.28.5 or higher
Memory pressure: Shard init failure
- Affected versions: All versions
- Resolution: Configuration required
Symptoms
- Shard initialization failures during tenant activation
- Errors in logs:
memory pressure: cannot init shard: not enough memory mappings
broadcast: cannot reach enough replicas
- Tenant activation failures
Root cause
The system has reached the operating system limit for memory-mapped files (vm.max_map_count). Each shard requires multiple memory mappings, and the default OS limit may be insufficient for large multi-tenant deployments.
Resolution
Increase the system memory mapping limit:
# Check current value
sysctl vm.max_map_count
# Increase to 3-4x current value
# Example: 2097152 → 8388608
sysctl -w vm.max_map_count=8388608
Make the change persistent:
# Add to /etc/sysctl.conf
echo "vm.max_map_count=8388608" >> /etc/sysctl.conf
Restart affected pods to apply the new configuration.
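To confirm the new limit is in effect on the host, the value can be read back directly from procfs. This is a sketch; the 8388608 target mirrors the example above and should be replaced with your own value:

```shell
# Sketch: confirm the limit is active on the host. Reads procfs directly,
# so it works even where the sysctl binary is not installed.
target=8388608   # same example value as above; substitute your own
current=$(cat /proc/sys/vm/max_map_count)
if [ "$current" -ge "$target" ]; then
  echo "OK: vm.max_map_count=$current (>= $target)"
else
  echo "LOW: vm.max_map_count=$current (< $target)"
fi
```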
Empty collections panic
- Affected versions: 1.28, 1.29, 1.30, 1.31
- Resolution: In progress
Symptoms
Single-node clusters panic on startup with:
Recovered from panic: assignment to entry in nil map
[...stack trace...]
github.com/weaviate/weaviate/cluster/schema.(*schema).addClass
Root cause
RAFT snapshots with no collections (previously called classes) cause a nil map assignment during restoration due to JSON unmarshaler omitempty behavior. This edge case occurs when snapshots are created before any collections are added.
Resolution
Option 1: Upgrade (when available)
Upgrade to a patched version containing the fix once it is released.
Option 2: Remove empty snapshot
Identify and remove the problematic snapshot:
# Navigate to RAFT directory
cd raft/snapshots/
# Find snapshot with empty classes
# Look for state.bin containing: {"node_id":"...","snapshot_id":"...","classes":{}}
# Remove the empty snapshot directory
rm -rf <snapshot-directory>
Example structure:
raft/
├── db_users/
├── raft.db
└── snapshots/
├── 4-4-1727456146194/ # Valid snapshot
└── 5-6-1728681332462/ # Empty snapshot - remove this
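The manual search above can be automated with a grep over each snapshot's state.bin. This sketch builds a throwaway fixture under a temp directory so the pattern can be demonstrated end to end; point SNAP_DIR at your real raft/snapshots/ directory instead:

```shell
# Sketch: find snapshot directories whose state.bin holds an empty
# "classes" map. The fixture below is demo data only; in practice set
# SNAP_DIR to your real raft/snapshots/ directory.
SNAP_DIR=$(mktemp -d)
mkdir -p "$SNAP_DIR/4-4-1727456146194" "$SNAP_DIR/5-6-1728681332462"
echo '{"node_id":"n1","snapshot_id":"4-4","classes":{"Article":{}}}' \
  > "$SNAP_DIR/4-4-1727456146194/state.bin"
echo '{"node_id":"n1","snapshot_id":"5-6","classes":{}}' \
  > "$SNAP_DIR/5-6-1728681332462/state.bin"
# -l prints only the names of matching files
grep -l '"classes":{}' "$SNAP_DIR"/*/state.bin
```

Only the empty snapshot's state.bin is listed; its parent directory is the one to remove.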
Prevention
This issue should not occur in normal operations. It typically happens only if a snapshot is created before any schema is defined.
RAFT snapshot cannot be created
- Affected versions: 1.25, 1.26, 1.27, 1.28
- Resolution: Workaround available
Symptoms
Node stuck during bootstrap with error messages indicating that the snapshot threshold has been reached but a snapshot cannot be created. This typically occurs only during initial cluster setup with rapid configuration changes.
Root cause
During bootstrap, many configuration changes in short succession increase RAFT log size and trigger snapshot threshold before the node has fully initialized. The cluster becomes stuck because:
- It cannot create a snapshot (requires bootstrap completion)
- It cannot apply new configurations (requires snapshot first)
This should be rare in normal operations.
Resolution
Temporarily increase snapshot thresholds:
RAFT_SNAPSHOT_INTERVAL=600 # seconds (default: 120)
RAFT_SNAPSHOT_THRESHOLD=24576 # entries (default: 8192)
This allows the node to apply all RAFT log entries before triggering snapshot creation.
After node reports healthy:
- Remove the custom configuration
- Restart the node to return to defaults
Prevention
- Avoid making many rapid configuration changes during initial cluster bootstrap
- Stage large schema deployments rather than applying all at once
Failed to decode incoming command
- Affected versions: 1.25, 1.26, 1.27, 1.28, 1.29
- Resolution: Configuration
Symptoms
Log entries showing:
failed to decode incoming command
error: unknown rpc type 71
remote-address: 10.0.104.114:42128
Note: 71 is the ASCII code for 'G' (the first byte of GET), and 80 is 'P' (the first byte of POST)
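The byte-to-character mapping can be checked directly, which makes it easy to identify other stray rpc type values as HTTP verbs:

```shell
# Sketch: map the reported rpc type bytes back to their ASCII characters.
awk 'BEGIN { printf "71 -> %c, 80 -> %c\n", 71, 80 }'   # -> 71 -> G, 80 -> P
```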
Root cause
HTTP requests being sent to RAFT's internal TCP endpoint (port 8300). This commonly occurs when Prometheus or other monitoring tools auto-discover and attempt to scrape all open ports, including internal RAFT ports.
Resolution
Configure monitoring to exclude RAFT ports:
Update Prometheus scrape configuration to skip internal cluster ports:
- Port 7000: Memberlist
- Port 7100-7103: Memberlist gossip
- Port 8300: RAFT
For Prometheus Operator:
additionalScrapeConfigs:
  - job_name: "weaviate"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "(7000|7100|7101|7102|7103|8300)"
        action: drop
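The relabel pattern can be sanity-checked locally before deploying. Prometheus anchors relabel regexes implicitly, so the explicit ^...$ in this sketch reproduces that behavior:

```shell
# Sketch: check which container ports the relabel regex would drop.
# Prometheus anchors relabel regexes implicitly; ^...$ reproduces that.
pattern='^(7000|7100|7101|7102|7103|8300)$'
for port in 8080 8300 7000 50051; do
  if echo "$port" | grep -Eq "$pattern"; then
    echo "$port: drop"
  else
    echo "$port: keep"
  fi
done
```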
The log message itself is informational only; the rejected requests do not impact cluster functionality.
Getting help
If you encounter an issue not listed here:
- Search GitHub Issues
- Ask in the Weaviate Community Forum
If you can't find an existing issue, please open a new one. Try to include the following information:
- Weaviate version
- Deployment environment (cloud, on-prem, Kubernetes, etc.)
- Relevant log excerpts
- Steps to reproduce
- Impact on your workload