Designing a Serverless Real-Time Chat with AWS WebSockets
I want a clean, feasible plan for building a real-time chat service on AWS. Just an architecture I can deploy, test, and iterate on.
Goals
- Real time. Messages appear for all clients in the same room.
- Serverless scale. Handle spikes without manual intervention.
- Cost aware. Pay only when used.
- Simple to operate. Clear metrics, alarms, and dead-letter handling.
Architecture
Core pieces
Amazon API Gateway WebSocket API
- Routes:
$connect,$disconnect,$default, andsendMessage.
- Routes:
AWS Lambda
onConnectstores a connection record.onDisconnectremoves it.onMessagewrites a message and broadcasts to the room.
Amazon DynamoDB
chat_connectionskeyed byroomIdandconnectionId.chat_messageskeyed byroomIdandtimestampfor history and replay.
IAM and Observability
- Minimal policies for DynamoDB and
execute-api:ManageConnections. - CloudWatch metrics, alarms, and DLQs for resilience.
- Minimal policies for DynamoDB and
Data model
chat_connections- PK
roomId(S), SKconnectionId(S) - attrs:
userId(S),connectedAt(S ISO),ttl(N, optional)
- PK
chat_messages- PK
roomId(S), SKtimestamp(S ISO) - attrs:
userId(S),message(S),metadata(M)
- PK
This lets onMessage query connections for a room with a single Query call, then fan out with API Gateway Management API.
Note: WebSockets originate from the client to AWS over
wss://.
Terraform plan
WebSocket API and stage
resource "aws_apigatewayv2_api" "chat" {
name = "serverless-chat"
protocol_type = "WEBSOCKET"
route_selection_expression = "$request.body.action"
}
resource "aws_apigatewayv2_stage" "default" {
api_id = aws_apigatewayv2_api.chat.id
name = "$default"
auto_deploy = true
}
DynamoDB tables
resource "aws_dynamodb_table" "connections" {
name = "chat_connections"
billing_mode = "PAY_PER_REQUEST"
hash_key = "roomId"
range_key = "connectionId"
attribute { name = "roomId"; type = "S" }
attribute { name = "connectionId"; type = "S" }
ttl {
attribute_name = "ttl"
enabled = true
}
}
resource "aws_dynamodb_table" "messages" {
name = "chat_messages"
billing_mode = "PAY_PER_REQUEST"
hash_key = "roomId"
range_key = "timestamp"
attribute { name = "roomId"; type = "S" }
attribute { name = "timestamp"; type = "S" }
}
Lambda functions, permissions, and routes
# Role for all chat Lambdas
data "aws_caller_identity" "current" {}
resource "aws_iam_role" "chat_lambda_role" {
name = "chat-lambda-role"
assume_role_policy = data.aws_iam_policy_document.chat_assume.json
}
data "aws_iam_policy_document" "chat_assume" {
statement {
actions = ["sts:AssumeRole"]
principals { type = "Service", identifiers = ["lambda.amazonaws.com"] }
}
}
data "aws_iam_policy_document" "chat_policy" {
statement {
actions = [
"dynamodb:PutItem",
"dynamodb:DeleteItem",
"dynamodb:Query",
"dynamodb:GetItem"
]
resources = [
aws_dynamodb_table.connections.arn,
"${aws_dynamodb_table.connections.arn}/index/*",
aws_dynamodb_table.messages.arn
]
}
statement {
actions = ["execute-api:ManageConnections"]
resources = ["arn:aws:execute-api:${var.region}:${data.aws_caller_identity.current.account_id}:${aws_apigatewayv2_api.chat.id}/*"]
}
statement {
actions = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"]
resources = ["*"]
}
}
resource "aws_iam_policy" "chat_inline" {
name = "chat-lambda-policy"
policy = data.aws_iam_policy_document.chat_policy.json
}
resource "aws_iam_role_policy_attachment" "chat_attach" {
role = aws_iam_role.chat_lambda_role.name
policy_arn = aws_iam_policy.chat_inline.arn
}
# Zip your handlers outside this snippet, or use an archive_file
resource "aws_lambda_function" "on_connect" {
function_name = "chat-on-connect"
role = aws_iam_role.chat_lambda_role.arn
handler = "index.onConnect"
runtime = "nodejs20.x"
filename = "build/connect.zip"
environment {
variables = {
CONNECTIONS_TABLE = aws_dynamodb_table.connections.name
}
}
}
resource "aws_lambda_function" "on_disconnect" {
function_name = "chat-on-disconnect"
role = aws_iam_role.chat_lambda_role.arn
handler = "index.onDisconnect"
runtime = "nodejs20.x"
filename = "build/disconnect.zip"
environment {
variables = {
CONNECTIONS_TABLE = aws_dynamodb_table.connections.name
}
}
}
resource "aws_lambda_function" "on_message" {
function_name = "chat-on-message"
role = aws_iam_role.chat_lambda_role.arn
handler = "index.onMessage"
runtime = "nodejs20.x"
filename = "build/message.zip"
environment {
variables = {
CONNECTIONS_TABLE = aws_dynamodb_table.connections.name
MESSAGES_TABLE = aws_dynamodb_table.messages.name
# APIGW domain and stage are read from the event
}
}
}
# Integrations
resource "aws_apigatewayv2_integration" "connect" {
api_id = aws_apigatewayv2_api.chat.id
integration_type = "AWS_PROXY"
integration_uri = aws_lambda_function.on_connect.invoke_arn
integration_method = "POST"
payload_format_version = "2.0"
}
resource "aws_apigatewayv2_integration" "disconnect" {
api_id = aws_apigatewayv2_api.chat.id
integration_type = "AWS_PROXY"
integration_uri = aws_lambda_function.on_disconnect.invoke_arn
integration_method = "POST"
payload_format_version = "2.0"
}
resource "aws_apigatewayv2_integration" "message" {
api_id = aws_apigatewayv2_api.chat.id
integration_type = "AWS_PROXY"
integration_uri = aws_lambda_function.on_message.invoke_arn
integration_method = "POST"
payload_format_version = "2.0"
}
# Routes
resource "aws_apigatewayv2_route" "connect" {
api_id = aws_apigatewayv2_api.chat.id
route_key = "$connect"
target = "integrations/${aws_apigatewayv2_integration.connect.id}"
}
resource "aws_apigatewayv2_route" "disconnect" {
api_id = aws_apigatewayv2_api.chat.id
route_key = "$disconnect"
target = "integrations/${aws_apigatewayv2_integration.disconnect.id}"
}
resource "aws_apigatewayv2_route" "default" {
api_id = aws_apigatewayv2_api.chat.id
route_key = "$default"
target = "integrations/${aws_apigatewayv2_integration.message.id}"
}
resource "aws_apigatewayv2_route" "send" {
api_id = aws_apigatewayv2_api.chat.id
route_key = "sendMessage"
target = "integrations/${aws_apigatewayv2_integration.message.id}"
}
# Invoke permissions
resource "aws_lambda_permission" "allow_connect" {
statement_id = "AllowAPIGatewayInvokeConnect"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.on_connect.function_name
principal = "apigateway.amazonaws.com"
source_arn = "${aws_apigatewayv2_api.chat.execution_arn}/*/*"
}
resource "aws_lambda_permission" "allow_disconnect" {
statement_id = "AllowAPIGatewayInvokeDisconnect"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.on_disconnect.function_name
principal = "apigateway.amazonaws.com"
source_arn = "${aws_apigatewayv2_api.chat.execution_arn}/*/*"
}
resource "aws_lambda_permission" "allow_message" {
statement_id = "AllowAPIGatewayInvokeMessage"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.on_message.function_name
principal = "apigateway.amazonaws.com"
source_arn = "${aws_apigatewayv2_api.chat.execution_arn}/*/*"
}
Lambda handlers in TypeScript
index.ts:
import { DynamoDBClient, PutItemCommand, DeleteItemCommand, QueryCommand } from "@aws-sdk/client-dynamodb";
import { ApiGatewayManagementApiClient, PostToConnectionCommand } from "@aws-sdk/client-apigatewaymanagementapi";
const ddb = new DynamoDBClient({});
const CONNECTIONS_TABLE = process.env.CONNECTIONS_TABLE!;
const MESSAGES_TABLE = process.env.MESSAGES_TABLE!;
type WSHandler = (event: any) => Promise<any>;
export const onConnect: WSHandler = async (event) => {
const roomId = event.queryStringParameters?.roomId || "lobby";
const userId = event.queryStringParameters?.userId || "anonymous";
const connectionId = event.requestContext.connectionId as string;
const connectedAt = new Date().toISOString();
await ddb.send(new PutItemCommand({
TableName: CONNECTIONS_TABLE,
Item: {
roomId: { S: roomId },
connectionId: { S: connectionId },
userId: { S: userId },
connectedAt: { S: connectedAt },
ttl: { N: String(Math.floor(Date.now() / 1000) + 60 * 60 * 24) }
}
}));
return { statusCode: 200 };
};
export const onDisconnect: WSHandler = async (event) => {
const connectionId = event.requestContext.connectionId as string;
const roomId = "lobby";
await ddb.send(new DeleteItemCommand({
TableName: CONNECTIONS_TABLE,
Key: {
roomId: { S: roomId },
connectionId: { S: connectionId }
}
}));
return { statusCode: 200 };
};
export const onMessage: WSHandler = async (event) => {
const { domainName, stage, connectionId } = event.requestContext;
const body = JSON.parse(event.body || "{}");
const roomId = body.roomId || "lobby";
const userId = body.userId || "anonymous";
const text = String(body.message || "").slice(0, 2000);
const timestamp = new Date().toISOString();
await ddb.send(new PutItemCommand({
TableName: MESSAGES_TABLE,
Item: {
roomId: { S: roomId },
timestamp: { S: timestamp },
userId: { S: userId },
message: { S: text }
}
}));
const connections = await ddb.send(new QueryCommand({
TableName: CONNECTIONS_TABLE,
KeyConditionExpression: "roomId = :r",
ExpressionAttributeValues: { ":r": { S: roomId } }
}));
const mgmt = new ApiGatewayManagementApiClient({
endpoint: `https://${domainName}/${stage}`
});
const payload = Buffer.from(JSON.stringify({
roomId, userId, message: text, timestamp
}));
await Promise.all((connections.Items || []).map(async (item) => {
const connId = item.connectionId.S!;
try {
await mgmt.send(new PostToConnectionCommand({ ConnectionId: connId, Data: payload }));
} catch (err: any) {
if (err?.$metadata?.httpStatusCode === 410) {
await ddb.send(new DeleteItemCommand({
TableName: CONNECTIONS_TABLE,
Key: { roomId: { S: roomId }, connectionId: { S: connId } }
}));
}
}
}));
await mgmt.send(new PostToConnectionCommand({
ConnectionId: connectionId,
Data: Buffer.from(JSON.stringify({ ack: true, timestamp }))
}));
return { statusCode: 200 };
};
Build three zip files: connect.zip, disconnect.zip, message.zip.
Optional authentication
Start without auth to validate the flow.
- Add a JWT authorizer in API Gateway v2. Accept tokens from Cognito or your IdP.
- Use claims to set
userIdand allowedroomIdin$context.authorizer. - Enforce per-room access in the Lambdas by checking claims.
Client integration
A minimal browser client with reconnection and heartbeats.
<script>
const endpoint = "wss://YOUR_API_ID.execute-api.YOUR_REGION.amazonaws.com";
let ws, pingTimer, reconnectTimer;
function connect(roomId = "lobby", userId = "web-" + Math.random().toString(36).slice(2)) {
ws = new WebSocket(`${endpoint}/?roomId=${encodeURIComponent(roomId)}&userId=${encodeURIComponent(userId)}`);
ws.onopen = () => {
console.log("connected");
clearInterval(reconnectTimer);
pingTimer = setInterval(() => ws.send(JSON.stringify({ action: "ping" })), 25000);
};
ws.onmessage = (ev) => {
const data = JSON.parse(ev.data);
renderMessage(data);
};
ws.onclose = () => {
clearInterval(pingTimer);
reconnectTimer = setInterval(() => connect(roomId, userId), 3000);
};
ws.onerror = () => ws.close();
}
function sendMessage(message, roomId = "lobby", userId = "web") {
ws.send(JSON.stringify({ action: "sendMessage", roomId, userId, message }));
}
function renderMessage({ userId, message, timestamp }) {
const el = document.getElementById("chat");
const li = document.createElement("li");
li.textContent = `[${timestamp}] ${userId}: ${message}`;
el.appendChild(li);
}
connect();
</script>
<ul id="chat"></ul>
<input id="msg" />
<button onclick="sendMessage(document.getElementById('msg').value)">Send</button>
Operations and reliability
Metrics and alarms
- API Gateway 4XX and 5XX rates
- Lambda errors, duration, and throttles
- DynamoDB throttles
DLQs
- Configure SQS DLQs for all three Lambdas and alarm on non-zero depth.
Stale connection cleanup
- Remove connections on 410 Gone as shown.
- TTL on
chat_connectionsensures eventual cleanup.
Throughput
- Use batches per room and parallelize
PostToConnection. Fan-out is limited by API Gateway TPS and Lambda concurrency. Add backoff and chunking for very large rooms.
- Use batches per room and parallelize
History and retention
- Set DynamoDB TTL on
chat_messagesif you do not need long-term storage. - For longer retention, ship messages to S3 via Firehose from Lambda.
- Set DynamoDB TTL on
Cost awareness
- WebSocket connections cost per million minutes. Idle clients still count. Use heartbeats and close idle clients on the server if needed.
Deployment checklist
-
terraform applyoutputs thewss://endpoint - Basic client connects, sends, and receives messages in the lobby
- Alarms on 5XX and Lambda errors are green
- DLQs are attached and empty
- TTL is enabled on
chat_connectionsand, if desired,chat_messages
Next steps
- Add JWT authorizer and role-based room access.
- Introduce message moderation and rate limits per user.
- Add a simple REST
GET /historyto fetch the last N messages for a room. - Optionally put a CloudFront distribution in front of the WebSocket API for a custom domain and TLS cert management.
This plan gets a functional, production-shaped chat service online without managing servers. From here, iterate on auth, moderation, and UX as needs grow.
