Last Updated: October 2025

Known issues

This page documents significant known issues in Weaviate, their symptoms, and recommended resolutions. Use the quick-reference table below to find issues that may affect your deployment and to check their status.

Quick reference

| Issue | Affected versions | Resolution | Fixed in |
|---|---|---|---|
| Empty collections panic | 1.28 - 1.31 | In progress | - |
| Database restoration blocked | 1.27.23-26, 1.28.14-15, 1.29.5-7, 1.30.3 | Fixed | 1.27.27, 1.28.16, 1.29.8, 1.30.4 |
| RAFT snapshot compatibility on downgrade | 1.28.13+, 1.29.5+, 1.30.2+ (when downgrading to 1.27.25 or earlier) | Fixed | 1.27.26 |
| RAFT bootstrap timeout | 1.25 - 1.28 | Workaround | - |
| RAFT timeouts under heavy load | 1.25 and later | Configuration | - |
| Invalid port 99999999 error | 1.25 - 1.28 | By design | - |
| Context deadline on tenant deletion | 1.26 - 1.28 | Fixed | 1.26.14, 1.27.11, 1.28.5 |
| Memory pressure: shard init failure | All | Configuration | - |
| RAFT snapshot cannot be created | 1.25 - 1.28 | Workaround | - |
| Failed to decode incoming command | 1.25 - 1.29 | Configuration | - |

Don't see your issue in the table?

If you can't find the issue you are experiencing, you can always open a new one.

Known issues in detail

Database restoration blocked

Impact summary
  • Affected versions: 1.27.23-26, 1.28.14-15, 1.29.5-7, 1.30.3
  • Resolution: Fixed in 1.27.27, 1.28.16, 1.29.8, 1.30.4

Symptoms

  • Nodes fail to start and remain indefinitely in initialization
  • Repeated log messages: waiting for database to be restored
  • May show schema update errors during RAFT command replay:
  cmd_class":"Content_par_test","cmd_type":2,"cmd_type_name":"TYPE_UPDATE_CLASS","error":"updating schema: TYPE_UPDATE_CLASS: bad request :parse class update: property \"content\": property fields other than description cannot be updated through updating the class. Use the add property feature (e.g. \"POST /v1/schema/{className}/properties\")

Root cause

A regression introduced in schema catch-up handling caused invalid RAFT commands to block database initialization. The system continuously retried these invalid commands, preventing the database from being marked as ready.

Resolution

Upgrade to the fixed version:

  • If on 1.27.x → upgrade to 1.27.27 or higher
  • If on 1.28.x → upgrade to 1.28.16 or higher
  • If on 1.29.x → upgrade to 1.29.8 or higher
  • If on 1.30.x → upgrade to 1.30.4 or higher
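If you manage Weaviate with the official Helm chart, the upgrade can be as simple as bumping the image tag. A minimal sketch, assuming a Helm-based deployment; the release name, namespace, and values depend on your setup:

# Example: rolling a 1.30.x cluster forward to the patched release
helm repo update
helm upgrade weaviate weaviate/weaviate \
  --namespace weaviate \
  --reuse-values \
  --set image.tag=1.30.4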

RAFT snapshot compatibility on downgrade

Impact summary
  • Affected versions: 1.28.13+, 1.29.5+, 1.30.2+ (when downgrading to 1.27.25 or earlier)
  • Resolution: Fixed in 1.27.26

Symptoms

After downgrading from 1.28+ to older 1.27.x versions:

  • RAFT snapshots fail to load
  • Cluster cannot reach Ready state
  • Node initialization failures

Root cause

RAFT snapshot format changes introduced in 1.28.13, 1.29.5, and 1.30.2 are not backward compatible with 1.27 releases prior to 1.27.26.

Resolution

When downgrading from 1.28.13+, 1.29.5+, or 1.30.2+:

  • Ensure you downgrade to version 1.27.26 or higher
  • Do not downgrade to 1.27.25 or earlier

RAFT bootstrap timeout

Impact summary
  • Affected versions: 1.25, 1.26, 1.27, 1.28
  • Resolution: Workaround available

Symptoms

  • Nodes stuck in crash loop with consistent restart intervals
  • Logs show schema catch-up in progress that never completes:
  Schema catching up: applying log entry: [X/Y]
  • Startup probe failures in Kubernetes

Root cause

During cluster initialization or node recovery, applying accumulated RAFT log entries (especially with large schemas or many collections) may exceed the default 600-second bootstrap timeout.

Resolution

Increase bootstrap timeout and startup probe:

env:
  - name: RAFT_BOOTSTRAP_TIMEOUT
    value: "900"  # 15 minutes

startupProbe:
  httpGet:
    path: /v1/.well-known/ready
    port: 8080
  failureThreshold: 90  # 90 * 10 = 900 seconds
  periodSeconds: 10

Calculate appropriate timeout:

  • Small clusters (< 10 collections): 600s default usually sufficient
  • Medium clusters (10-100 collections): 900-1800s
  • Large clusters (100+ collections): 1800-3600s

Note: Versions 1.27.4+, 1.26.11+, and 1.25.26+ include optimizations that reduce schema rebuild time during catch-up.
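For example, a cluster at the top of that range might pair a one-hour bootstrap timeout with a matching startup probe, mirroring the snippet above (the values are illustrative; tune them to your schema size):

env:
  - name: RAFT_BOOTSTRAP_TIMEOUT
    value: "3600"  # 1 hour for very large schemas

startupProbe:
  httpGet:
    path: /v1/.well-known/ready
    port: 8080
  failureThreshold: 360  # 360 * 10 = 3600 seconds
  periodSeconds: 10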

Prevention

  • Monitor cluster scale and adjust timeouts proactively
  • Use WCS tool: wcs startup-count <failure-threshold> --apply; wcs sync

RAFT timeouts under heavy load

Impact summary
  • Affected versions: 1.25 and later
  • Resolution: Configuration available (default improved in 1.31+)

Symptoms

  • Frequent leader elections and cluster instability
  • Logs showing timeout errors:
  heartbeat timeout reached, starting election
Election timeout reached, restarting election
memberlist: Failed fallback TCP ping: timeout 1s: read tcp [...]: i/o timeout
  • High CPU usage, memory pressure, or goroutine counts in monitoring
  • Performance degradation during normal operations

Root cause

Under heavy load or network latency, nodes cannot respond to RAFT heartbeats and memberlist pings within default timeout windows. This causes false failure detection, unnecessary leader elections, and cascading performance issues.

The issue is typically a symptom of underlying resource pressure rather than a RAFT problem itself.

Resolution

1. Adjust RAFT timeout multiplier:

# Production (default in 1.31+)
RAFT_TIMEOUTS_MULTIPLIER=5

# High-latency networks
RAFT_TIMEOUTS_MULTIPLIER=10

# Heavily loaded or unstable environments
RAFT_TIMEOUTS_MULTIPLIER=15

This multiplies all timeout values:

  • Heartbeat timeout: 1s → 5s (with multiplier of 5)
  • Election timeout: 1s → 5s
  • Leader lease timeout: 0.5s → 2.5s
  • Memberlist TCP timeout: 10s → 50s
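In a Kubernetes deployment, the multiplier is set like any other environment variable. A minimal sketch for a pod spec or Helm values:

env:
  - name: RAFT_TIMEOUTS_MULTIPLIER
    value: "5"  # scales heartbeat, election, lease, and memberlist timeouts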

2. Investigate root cause:

Check for underlying issues:

  • Memory pressure or OOM events
  • CPU saturation
  • Network latency or packet loss
  • Too many collections causing Go scheduler pressure

3. If too many collections:

Reduce Go scheduler load:

GOMAXPROCS=<value less than available CPUs>
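For example, on a pod with 16 CPUs available, capping the Go scheduler slightly below that count might look like this (the value is illustrative; tune for your hardware):

env:
  - name: GOMAXPROCS
    value: "12"  # below the 16 CPUs available to the pod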

Best practices

  • Start with default multiplier (5) for most environments
  • Increase gradually if seeing frequent elections
  • Monitor cluster stability after changes
  • Address underlying resource issues rather than only masking them with higher timeouts

Invalid port 99999999

Impact summary
  • Affected versions: 1.25, 1.26, 1.27, 1.28
  • Resolution: By design (not a bug)

Symptoms

Error message in logs:

dial tcp: address 99999999: invalid port

Often accompanied by memberlist instability messages:

memberlist: Suspect weaviate-0 has failed, no acks received
memberlist: Marking weaviate-0 as failed, suspect timeout reached

Root cause

This is not a RAFT problem but a symptom of memberlist instability. The invalid port 99999999 is intentionally returned to prevent RAFT from communicating with nodes that are not part of the memberlist, which prevents cross-talk issues where RAFT might contact old IP addresses from previous cluster configurations.

The underlying cause is typically:

  1. Too many collections causing Go scheduler slowdown and network I/O delays
  2. Network connectivity issues preventing memberlist health checks

Resolution

Address underlying causes:

  1. If too many collections:
     GOMAXPROCS=<value less than available CPUs>
  2. If network issues:
     • Check connectivity between all cluster nodes
     • Review network policies and firewall rules
     • Verify DNS resolution

Context deadline on tenant deletion

Impact summary
  • Affected versions: 1.26, 1.27, 1.28
  • Resolution: Fixed in 1.26.14, 1.27.11, 1.28.5

Symptoms

Tenant deletion fails with timeout errors:

context deadline exceeded
session: fetching region failed: RequestCanceled: request context canceled
caused by: context deadline exceeded

This occurs only when the tenant offloading module (offload-s3) is enabled.

Root cause

When tenant offloading is enabled and AWS credentials are misconfigured, the deletion process attempts to delete cloud resources but times out waiting for AWS responses.

Resolution

Temporary workaround (if an upgrade is not immediately possible):

Option 1: Disable tenant offloading

# Remove or disable tenant offloading module configuration

Option 2: Correct AWS credentials

# Provide valid AWS credentials for tenant offloading
AWS_ACCESS_KEY_ID=<valid_key>
AWS_SECRET_ACCESS_KEY=<valid_secret>
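In Kubernetes, these credentials are typically injected from a Secret rather than set in plain text. A sketch, where the Secret name offload-s3-credentials is illustrative:

env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: offload-s3-credentials
        key: AWS_ACCESS_KEY_ID
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: offload-s3-credentials
        key: AWS_SECRET_ACCESS_KEY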

Permanent fix: Upgrade to fixed version:

  • 1.26.x → 1.26.14 or higher
  • 1.27.x → 1.27.11 or higher
  • 1.28.x → 1.28.5 or higher

Memory pressure: Shard init failure

Impact summary
  • Affected versions: All versions
  • Resolution: Configuration required

Symptoms

  • Shard initialization failures during tenant activation
  • Errors in logs:
  memory pressure: cannot init shard: not enough memory mappings
broadcast: cannot reach enough replicas
  • Tenant activation failures

Root cause

The system has reached the operating system limit for memory-mapped files (vm.max_map_count). Each shard requires multiple memory mappings, and the default OS limit may be insufficient for large multi-tenant deployments.

Resolution

Increase the system memory mapping limit:

# Check current value
sysctl vm.max_map_count

# Increase to 3-4x current value
# Example: 2097152 → 8388608
sysctl -w vm.max_map_count=8388608

Make the change persistent:

# Add to /etc/sysctl.conf
echo "vm.max_map_count=8388608" >> /etc/sysctl.conf

Restart affected pods to apply the new configuration.
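On Kubernetes, the sysctl must be applied on every node that can schedule a Weaviate pod. One common pattern is a privileged init container that raises the limit before Weaviate starts (a sketch, assuming your cluster policy allows privileged init containers):

initContainers:
  - name: sysctl-max-map-count
    image: busybox:1.36
    securityContext:
      privileged: true
    command: ["sysctl", "-w", "vm.max_map_count=8388608"]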


Empty collections panic

Impact summary
  • Affected versions: 1.28, 1.29, 1.30, 1.31
  • Resolution: In progress

Symptoms

Single-node clusters panic on startup with:

Recovered from panic: assignment to entry in nil map
[...stack trace...]
github.com/weaviate/weaviate/cluster/schema.(*schema).addClass

Root cause

RAFT snapshots with no collections (previously called classes) cause a nil map assignment during restoration, due to the JSON unmarshaler's omitempty behavior. This edge case occurs when a snapshot is created before any collections are added.

Resolution

Option 1: Upgrade (when available)

Upgrade to a patched version containing the fix.

Option 2: Remove empty snapshot

Identify and remove the problematic snapshot:

# Navigate to RAFT directory
cd raft/snapshots/

# Find snapshot with empty classes
# Look for state.bin containing: {"node_id":"...","snapshot_id":"...","classes":{}}

# Remove the empty snapshot directory
rm -rf <snapshot-directory>

Example structure:

raft/
├── db_users/
├── raft.db
└── snapshots/
    ├── 4-4-1727456146194/   # Valid snapshot
    └── 5-6-1728681332462/   # Empty snapshot - remove this
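To locate empty snapshots without opening each directory by hand, a quick search over the metadata files may help (assuming state.bin stores the JSON shown above as plain text):

# Print the paths of snapshot metadata files with an empty classes map
grep -l '"classes":{}' raft/snapshots/*/state.bin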

Prevention

This issue should not occur in normal operations. It typically happens only if a snapshot is created before any schema is defined.


RAFT snapshot cannot be created

Impact summary
  • Affected versions: 1.25, 1.26, 1.27, 1.28
  • Resolution: Workaround available

Symptoms

A node becomes stuck during bootstrap, with error messages indicating that the snapshot threshold has been reached but a snapshot cannot be created. This typically occurs only during initial cluster setup with rapid configuration changes.

Root cause

During bootstrap, many configuration changes in short succession increase RAFT log size and trigger snapshot threshold before the node has fully initialized. The cluster becomes stuck because:

  1. It cannot create a snapshot (requires bootstrap completion)
  2. It cannot apply new configurations (requires snapshot first)

This should be rare in normal operations.

Resolution

Temporarily increase snapshot thresholds:

RAFT_SNAPSHOT_INTERVAL=600  # seconds (default: 120)
RAFT_SNAPSHOT_THRESHOLD=24576 # entries (default: 8192)

This allows the node to apply all RAFT log entries before triggering snapshot creation.
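As with the other RAFT tunables, these can be applied temporarily as environment variables on the affected node. A minimal sketch for a Kubernetes pod spec:

env:
  - name: RAFT_SNAPSHOT_INTERVAL
    value: "600"
  - name: RAFT_SNAPSHOT_THRESHOLD
    value: "24576"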

After node reports healthy:

  1. Remove the custom configuration
  2. Restart the node to return to defaults

Prevention

  • Avoid making many rapid configuration changes during initial cluster bootstrap
  • Stage large schema deployments rather than applying all at once

Failed to decode incoming command

Impact summary
  • Affected versions: 1.25, 1.26, 1.27, 1.28, 1.29
  • Resolution: Configuration

Symptoms

Log entries showing:

failed to decode incoming command
error: unknown rpc type 71
remote-address: 10.0.104.114:42128

Note: 71 is the ASCII code for 'G' (as in GET); 80 is the ASCII code for 'P' (as in POST).

Root cause

HTTP requests are being sent to RAFT's internal TCP endpoint (port 8300). This commonly occurs when Prometheus or other monitoring tools auto-discover all open ports and attempt to scrape them, including internal RAFT ports.

Resolution

Configure monitoring to exclude RAFT ports:

Update Prometheus scrape configuration to skip internal cluster ports:

  • Port 7000: Memberlist
  • Port 7100-7103: Memberlist gossip
  • Port 8300: RAFT

For Prometheus Operator:

additionalScrapeConfigs:
  - job_name: "weaviate"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "(7000|7100|7101|7102|7103|8300)"
        action: drop

The decode error itself is informational only and does not impact cluster functionality.


Getting help

If you encounter an issue not listed here:

  1. Search GitHub Issues
  2. Ask in the Weaviate Community Forum

If you can't find an existing issue, please open a new one. Try to include the following information:

  • Weaviate version
  • Deployment environment (cloud, on-prem, Kubernetes, etc.)
  • Relevant log excerpts
  • Steps to reproduce
  • Impact on your workload