The Push Alerts That Arrived After the Sale Ended: A Mobile Push Notification Reliability Playbook for FCM and APNs

At 9:12 p.m., our commerce app sent a “flash sale ends in 20 minutes” push to 180,000 users. By 9:25, support tickets started landing. Some people got the alert after the sale ended. Some never got it. A few got two copies. The backend logs looked clean, so at first glance everyone blamed “push being flaky.” It was not flaky. It was predictable, and our pipeline design was the real bug.

If your team ships both Android and iOS, this pattern is common: the send endpoint returns success, product assumes delivery, and then analytics shows a weird gap. The fix is not one magical SDK call. The fix is a reliability contract that treats tokens, TTL, and platform behavior as first-class engineering concerns.

In this guide, I will walk through a practical mobile push notification reliability setup using FCM and APNs semantics, including token hygiene, urgency-aware delivery settings, and operational guardrails that survive real production traffic.

If you are modernizing auth flows, pair this with our Android credential work in Implementing Android Passkeys with Credential Manager. If your send path is Node-based, our Node.js systems reliability playbook is a useful companion. For threat modeling notification abuse, read our zero-trust hardening blueprint, and for platform-level architecture tradeoffs see this cloud architecture guide.

The quiet failure modes that break trust

Most delivery incidents cluster around four causes:

  • Stale tokens: devices churn, apps are reinstalled, users switch phones, and old tokens keep sitting in your database.
  • Wrong urgency policy: everything is sent at high priority, then platform power controls throttle low-value traffic anyway.
  • No expiration discipline: a message that is useless after 5 minutes is still eligible for delayed delivery hours later.
  • False success semantics: teams treat “accepted by provider” as “seen by user,” which are very different states.

FCM explicitly recommends managing token freshness, storing timestamps, and pruning invalid tokens. It also documents that Android tokens expire after a long inactivity window (270 days, per the FCM documentation at the time of writing). APNs similarly expects you to set expiration and priority explicitly. None of this is optional if reliability matters.

Pattern 1, Build a token ledger, not a token column

A single users.push_token field is fragile. Use a token ledger keyed by user + device, with platform metadata and last-seen timestamps. That unlocks targeted pruning and clean fallbacks when one device goes stale.

CREATE TABLE device_push_tokens (
  id BIGSERIAL PRIMARY KEY,
  user_id BIGINT NOT NULL,
  platform TEXT NOT NULL CHECK (platform IN ('android','ios')),
  token TEXT NOT NULL,
  app_version TEXT,
  locale TEXT,
  last_seen_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  last_success_at TIMESTAMPTZ,
  last_error_code TEXT,
  is_active BOOLEAN NOT NULL DEFAULT TRUE,
  UNIQUE (platform, token)
);

-- Upsert whenever app starts, logs in, or token refresh callback fires
INSERT INTO device_push_tokens (user_id, platform, token, app_version, locale, last_seen_at)
VALUES ($1, $2, $3, $4, $5, NOW())
ON CONFLICT (platform, token)
DO UPDATE SET
  user_id = EXCLUDED.user_id,
  app_version = EXCLUDED.app_version,
  locale = EXCLUDED.locale,
  last_seen_at = NOW(),
  is_active = TRUE;

Tradeoff: this schema is noisier than a simple profile table, but it gives you auditability and safer cleanup. In practice, it pays for itself the first time you need to answer, “Which devices stopped receiving promos after version 8.3.1?”

Pattern 2, Define an urgency matrix before you send

Do not let every product event use the same delivery settings. Create a small policy table:

  • Critical transactional (OTP fallback, fraud lock): aggressive delivery, short expiry.
  • Time-sensitive engagement (sale ending soon): moderate priority, strict TTL.
  • Informational (weekly digest): relaxed priority, longer TTL.

This is where notification TTL strategy matters. FCM allows Android/Web TTL from 0 seconds to 28 days, and APNs supports expiration headers. If a sale alert loses meaning after 15 minutes, set expiry accordingly. A late notification can be worse than no notification.
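One way to keep expiry tied to business intent is to derive both platform values from a single deadline instead of hard-coding durations. The sketch below is a minimal illustration, not part of any SDK: the helper names `ttlSecondsUntil` and `expiryFor` are hypothetical, and the choice to skip the send once the deadline has passed is an assumption about your policy.

```javascript
// Hypothetical helpers; names and the "skip when expired" policy are assumptions.
const FCM_MAX_TTL_SEC = 28 * 24 * 60 * 60; // 28 days, the upper bound quoted above

// Returns how many seconds the message stays useful, or null when the
// deadline has already passed and the send should be skipped entirely.
function ttlSecondsUntil(deadlineMs, nowMs = Date.now()) {
  const remaining = Math.floor((deadlineMs - nowMs) / 1000);
  if (remaining <= 0) return null;
  return Math.min(remaining, FCM_MAX_TTL_SEC);
}

// Builds both platform expiry values from one business deadline.
function expiryFor(deadlineMs, nowMs = Date.now()) {
  const ttlSec = ttlSecondsUntil(deadlineMs, nowMs);
  if (ttlSec === null) return null;
  return {
    androidTtlSec: ttlSec,
    // apns-expiration is an absolute UNIX timestamp in seconds, not a duration.
    apnsExpiration: String(Math.floor(nowMs / 1000) + ttlSec)
  };
}
```

With this shape, a "sale ends at 9:32 p.m." campaign passes the sale end time as the deadline, and a send attempted after that moment returns null instead of queuing a message nobody wants.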

Pattern 3, Treat invalid-token handling as core business logic

Many teams log provider errors but never mutate state. That turns your sender into a stale-token spammer. Build automated cleanup into the same workflow that sends notifications.

import { getMessaging } from "firebase-admin/messaging";
import db from "./db.js";

const CATEGORY_POLICY = {
  critical: { androidTtlSec: 60, androidPriority: "high", apnsPriority: "10" },
  promo:    { androidTtlSec: 900, androidPriority: "high", apnsPriority: "10" },
  digest:   { androidTtlSec: 21600, androidPriority: "normal", apnsPriority: "5" }
};

export async function sendPushToDevice(device, payload, category = "promo") {
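  // Fall back to the promo policy when the category is unknown.
  const p = CATEGORY_POLICY[category] ?? CATEGORY_POLICY.promo;
  // apns-expiration is an absolute UNIX time in seconds, so convert the
  // shared TTL (stored as androidTtlSec) into "now + TTL".
  const apnsExpiration = Math.floor(Date.now() / 1000) + p.androidTtlSec;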

  const message = {
    token: device.token,
    notification: { title: payload.title, body: payload.body },
    data: payload.data ?? {},
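    android: {
      priority: p.androidPriority,
      // The Admin SDK for Node.js expects ttl as a number of milliseconds,
      // not the "900s" duration string used by the raw HTTP v1 API.
      ttl: p.androidTtlSec * 1000
    },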
    apns: {
      headers: {
        "apns-priority": p.apnsPriority,
        "apns-expiration": String(apnsExpiration)
      },
      payload: { aps: { sound: "default" } }
    }
  };

  try {
    const messageId = await getMessaging().send(message);
    await db.query(
      `UPDATE device_push_tokens
         SET last_success_at = NOW(), last_error_code = NULL
       WHERE id = $1`,
      [device.id]
    );
    return { ok: true, messageId };
  } catch (err) {
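    // firebase-admin reports codes such as "messaging/registration-token-not-registered".
    // Coerce to a string so .includes() is safe for non-Firebase errors too.
    const code = String(err?.errorInfo?.code || err?.code || "unknown");
    // Only these two classes mean the token itself is dead; other errors may be transient.
    const invalid = code.includes("registration-token-not-registered") ||
                    code.includes("invalid-registration-token");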

    await db.query(
      `UPDATE device_push_tokens
          SET last_error_code = $2,
              is_active = CASE WHEN $3 THEN FALSE ELSE is_active END
        WHERE id = $1`,
      [device.id, code, invalid]
    );

    return { ok: false, code };
  }
}

This one block encodes two production truths: FCM token management is ongoing, not one-time, and APNs/FCM expiry controls should map to business intent, not guesswork.

Client behavior you should enforce

  • Upload the latest token at login and after every token refresh callback.
  • Include app version + locale with token updates for debugging segmentation.
  • On iOS, ensure APNs setup and token mapping are explicit, especially if Firebase method swizzling is disabled (in that case the app has to hand the APNs device token to the FCM SDK itself).
  • On Android, test runtime notification permission flows on fresh installs and OS upgrades.

Tradeoff: frequent token refresh checks improve freshness but add network chatter. Monthly verification plus event-driven updates (install, login, token refresh) is usually a good balance for battery, bandwidth, and operational clarity.
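The "monthly plus event-driven" rule can be written as one small decision function. This is a platform-neutral sketch in JavaScript (for example, for a React Native layer); the same logic would carry over to Kotlin or Swift. The function name, the event labels, and the 30-day interval are illustrative assumptions, not values from any SDK.

```javascript
// Assumption: "monthly" is approximated as 30 days.
const MONTH_MS = 30 * 24 * 60 * 60 * 1000;

// state: what the client last uploaded ({ uploadedToken, uploadedAtMs }).
// event: why we are checking now (hypothetical labels).
function shouldUploadToken(state, currentToken, event, nowMs = Date.now()) {
  if (!currentToken) return false;                       // nothing to upload yet
  if (currentToken !== state.uploadedToken) return true; // token rotated
  if (event === "install" || event === "login" || event === "token_refresh") {
    return true;                                         // event-driven update
  }
  return nowMs - state.uploadedAtMs >= MONTH_MS;         // periodic re-verification
}
```

Calling this on every app foreground is cheap, because most calls return false and no network request is made.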

Troubleshooting, when delivery rates drop overnight

1) Symptom, send success high, opens cratered

Likely cause: stale audience inflation. Old tokens are still targeted and often still accepted by the provider, so they inflate the denominator and make open rates look worse than what reachable users actually do. Fix: prune inactive tokens by age and deactivate tokens on invalid-token provider responses.
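To see the inflation effect in numbers, it helps to report provider acceptance and engagement as separate rates, and to compute the open rate against recently seen devices as well as against everything targeted. The sketch below assumes you already count these four figures per campaign; the field names are hypothetical.

```javascript
// Guarded division so empty campaigns report 0 instead of NaN.
const rate = (n, d) => (d > 0 ? n / d : 0);

// targeted: tokens we attempted; accepted: provider returned success;
// opened: notification opens; activeTargeted: targeted tokens seen within
// your stale window (an assumption about how you define "reachable").
function pushFunnel({ targeted, accepted, opened, activeTargeted }) {
  return {
    acceptanceRate: rate(accepted, targeted),
    openRateAllTargets: rate(opened, targeted),
    openRateActiveTargets: rate(opened, activeTargeted)
  };
}
```

A wide gap between the two open rates is a hint that the audience is padded with devices that will never open anything, rather than that the campaign itself underperformed.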

2) Symptom, users receive promo too late

Likely cause: TTL too long for time-sensitive campaigns. Fix: set tighter TTL and platform expiration values for those categories. A “sale ending” push should expire fast.

3) Symptom, iOS mostly okay, Android inconsistent on new installs

Likely cause: permission flow regressions or onboarding path skips token upload until a later screen. Fix: instrument the first-run flow and assert token upload success as part of release checks.

4) Symptom, sudden spike in invalid token errors after app release

Likely cause: reinstall churn and token rotation after rollout. Fix: increase token refresh telemetry and ensure server upsert logic handles rapid token replacement without duplicates.

FAQ

Should we delete a token immediately after one failed send?

If the failure is a confirmed invalid-token class (for example, unregistered/not-registered), yes, deactivate it immediately. For transient errors (timeouts, internal provider errors), keep it active and retry with backoff.
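That split between "deactivate now" and "retry later" can live in one small classifier. The sketch below is an assumption about which firebase-admin error codes your sender sees in practice; check the list against your own logs before relying on it. The function names and the backoff constants are hypothetical.

```javascript
// Substrings of firebase-admin messaging error codes (assumed list; tune to your logs).
const PERMANENT = ["registration-token-not-registered", "invalid-registration-token"];
const TRANSIENT = ["server-unavailable", "internal-error", "message-rate-exceeded"];

// Maps a provider error code to an action for the send path.
function classifySendError(code) {
  const c = String(code ?? "");
  if (PERMANENT.some((s) => c.includes(s))) return "deactivate";
  if (TRANSIENT.some((s) => c.includes(s))) return "retry";
  return "investigate"; // unknown codes: keep the token, alert a human
}

// Capped exponential backoff; attempt is 0-based.
// Deterministic here for clarity; production code would usually add jitter.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 60000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

For time-sensitive categories, stop retrying once the message TTL has elapsed, since a retry that lands after expiry only recreates the late-delivery problem.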

Is priority “high” always better for engagement?

No. Overusing high priority can hurt battery behavior and still does not guarantee immediate user attention. Reserve it for truly urgent flows, and use normal/low-urgency lanes for digests and non-critical nudges.

How often should we clean stale tokens?

Daily pruning jobs are a practical default for active apps. The exact stale window depends on your business cadence, but many teams start with around 30 days and tune from observed reactivation patterns.
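A daily pruning job can be as small as one cutoff calculation and one UPDATE against the device_push_tokens table from Pattern 1. The sketch below is illustrative: the helper names are hypothetical, the 30-day default is just the starting point mentioned above, and the pure `findStaleTokenIds` function exists mainly so the rule can be unit tested without a database.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Timestamp before which a token counts as stale.
function staleCutoff(now, windowDays = 30) {
  return new Date(now.getTime() - windowDays * DAY_MS);
}

// Pure version of the pruning rule, convenient for unit tests.
function findStaleTokenIds(rows, now, windowDays = 30) {
  const cutoff = staleCutoff(now, windowDays);
  return rows
    .filter((r) => r.is_active && new Date(r.last_seen_at) < cutoff)
    .map((r) => r.id);
}

// The same rule as a single statement for the daily job; $1 is the cutoff.
const PRUNE_SQL = `
  UPDATE device_push_tokens
     SET is_active = FALSE
   WHERE is_active = TRUE AND last_seen_at < $1`;

// Usage with the db module from the sender (assumed to expose query()):
// await db.query(PRUNE_SQL, [staleCutoff(new Date())]);
```

Because the upsert in Pattern 1 sets is_active back to TRUE, a pruned device reactivates itself the next time the app uploads its token.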

Actionable takeaways for this week

  • Create a token ledger table with last_seen_at, is_active, and last_error_code.
  • Ship an urgency matrix that maps each notification type to priority + TTL + expiration behavior.
  • Auto-deactivate invalid tokens in the send path, not in a manual dashboard task.
  • Add release checks for Android permission prompts and iOS token mapping correctness.
  • Track provider acceptance separately from downstream engagement so your team stops misreading success.

Push reliability is less about clever messaging copy and more about systems discipline. Once your token lifecycle and expiry rules match user intent, notification trust improves quickly, and the support queue gets a lot quieter.
