Post

AI for Mobile-First Markets: On-Device ML with TensorFlow Lite

AI for Mobile-First Markets: On-Device ML with TensorFlow Lite

The Mobile-First Imperative

If you’re building AI for African users, you’re building for mobile — period. The statistics are unambiguous:

  • 85% of internet connections in sub-Saharan Africa are via mobile
  • 65% of mobile devices in Africa are budget Android phones (median price: $80-150)
  • Average RAM: 3-4 GB (flagship devices globally have 12-16 GB)
  • Average storage: 32-64 GB (with 20+ GB consumed by the OS and pre-installed apps)
  • Connectivity: Many users operate on 2G or 3G with intermittent signal

Cloud-based AI — where you send data to a server, run inference, and get a response — is not practical for most African use cases. Latency is high, data costs eat into limited budgets, and offline operation is mandatory in many areas.

The alternative: on-device machine learning, where the model runs entirely on the user’s phone. No server call. No data charge. No internet dependency. This is TensorFlow Lite’s territory.

TensorFlow Lite: The Essential Toolkit

TensorFlow Lite (TFLite) is Google’s lightweight solution for deploying ML models on mobile, embedded, and IoT devices. It converts standard TensorFlow models into an efficient format (.tflite) that runs on-device.

Why TFLite Dominates in Africa

FeatureBenefit for African Users
Small binary sizeCore TFLite runtime is ~300 KB
No internet neededInference runs 100% offline
Hardware accelerationWorks with GPU, DSP, and NPU on budget chips
Android firstFirst-class support on the dominant African mobile OS
Model optimization toolsQuantization, pruning, clustering built in
60+ hardware delegatesQualcomm, MediaTek, Rockchip — all common in African phones

The Model Conversion Pipeline

Converting a TensorFlow model to TFLite involves three key steps:

1. Train or Load a Model

Start with a model architecture suited for mobile. EfficientNet-Lite, MobileNetV3, and TinyBERT are common choices.

1
2
3
4
5
6
7
8
import tensorflow as tf

# Load a mobile-optimized model
model = tf.keras.applications.MobileNetV3Small(
    input_shape=(224, 224, 3),
    weights="imagenet",
    classes=1000,
)

2. Convert to TFLite

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
converter = tf.lite.TFLiteConverter.from_keras_model(model)

# Apply optimization defaults
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Representative dataset for quantization calibration
def representative_dataset():
    for _ in range(100):
        data = tf.random.normal([1, 224, 224, 3])
        yield [data]

converter.representative_dataset = representative_dataset

# Target integer-only quantization
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8
]

tflite_model = converter.convert()

# Save the model
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)

3. Deploy on Android

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
// On Android, load and run inference
class MLService(private val context: Context) {
    private var interpreter: Interpreter? = null

    fun loadModel() {
        val modelBuffer = loadModelFile(context, "model_int8.tflite")
        val options = Interpreter.Options().apply {
            // Use NNAPI for hardware acceleration if available
            setUseNNAPI(true)
            // Limit to 2 threads on budget devices
            setNumThreads(2)
        }
        interpreter = Interpreter(modelBuffer, options)
    }

    fun runInference(input: FloatArray): FloatArray {
        val output = Array(1) { FloatArray(1000) }
        interpreter?.run(input, output)
        return output[0]
    }
}

Quantization: Making Models Fit

Quantization is the single most important optimization for mobile AI. It reduces model size and speeds up inference by using lower-precision numbers.

TechniqueSize ReductionAccuracy ImpactSpeedup
Float16 quantization50%Negligible (<0.5%)1.5-2x with GPU
Int8 full-integer quantization75%0.5-2% drop2-4x with DSP/NNAPI
Dynamic range quantization75%0.5-1% dropNo latency improvement
Weight clustering80-90%1-5% dropOnly reduces storage

Real-World Example: Maize Disease Classifier

Let’s walk through a real African use case: classifying maize leaf diseases for smallholder farmers in Kenya. The original model:

  • Architecture: EfficientNet-B0
  • Parameters: 5.3M
  • Size (float32): 20.2 MB
  • Inference time on laptop: 45 ms

After TFLite conversion with full int8 quantization:

  • Size: 4.1 MB (80% reduction)
  • Inference time on Tecno Spark 10 (budget phone): 112 ms
  • Inference time on Samsung Galaxy A14: 74 ms
  • Accuracy drop: 0.3% (from 94.1% to 93.8%)

That 4.1 MB model loads instantly on any phone and doesn’t need a network connection to diagnose a diseased leaf. The farmer takes a photo in the field and gets an answer in under 2 seconds.

Beyond TFLite: The On-Device ML Ecosystem

ML Kit (Google)

Pre-built APIs for common tasks: text recognition, face detection, barcode scanning, image labeling. Runs on-device via TFLite underneath. Excellent for rapid prototyping.

ONNX Runtime Mobile

If your model was trained in PyTorch, ONNX Runtime Mobile provides similar on-device inference capabilities. Growing adoption in Africa, especially for NLP models exported from HuggingFace.

ExecuTorch (Meta)

Newer entrant specifically designed for mobile and edge. PyTorch-native, still maturing but promising for custom architectures.

Practical Patterns for African Deployments

Pattern 1: Model Bundling

Bundle the TFLite model in your APK (Android) or app bundle. Users don’t download models — they come pre-installed. This avoids first-time data costs:

1
2
3
4
5
6
7
8
9
// build.gradle
android {
    // ...
    sourceSets {
        main {
            assets.srcDirs += ['src/main/assets']
        }
    }
}

Place maize_disease_model.tflite in src/main/assets/.

Pattern 2: Fallback Inference Chain

On low-end devices, certain operations may not be supported by the hardware delegate. Always implement a fallback chain:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
fun runInferenceWithFallback(input: TensorImage): TensorImage {
    return try {
        // Option 1: With NNAPI GPU acceleration
        interpreterWithNNAPI.run(input)
    } catch (e: Exception) {
        try {
            // Option 2: CPU fallback
            interpreterCPU.run(input)
        } catch (e: Exception) {
            // Option 3: Fall back to a smaller model
            interpreterSmallModel.run(input)
        }
    }
}

Pattern 3: Lazy Model Loading

Don’t load all models at app startup. Load only when needed:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
class MaizeDiagnosisService(context: Context) {
    private var interpreter: Interpreter? = null

    suspend fun diagnose(image: Bitmap): DiagnosisResult {
        val model = getModel()  // Lazy, thread-safe loading
        return runInference(model, image)
    }

    private fun getModel(): Interpreter {
        return interpreter ?: synchronized(this) {
            interpreter ?: Interpreter(
                loadModelFile(context, "maize_model.tflite")
            ).also { interpreter = it }
        }
    }
}

Pattern 4: Sync-When-Available Updates

Newer model versions are downloaded when the user has a Wi-Fi connection:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
class ModelUpdater(context: Context) {
    private val modelManager = RemoteModelManager(context)

    fun checkForUpdate() {
        WorkManager.getInstance(context)
            .enqueueUniqueWork(
                "model_update",
                ExistingWorkPolicy.REPLACE,
                OneTimeWorkRequestBuilder<ModelUpdateWorker>()
                    .setConstraints(
                        Constraints.Builder()
                            .setRequiredNetworkType(NetworkType.UNMETERED)
                            .setRequiresBatteryNotLow(true)
                            .build()
                    )
                    .build()
            )
    }
}

Common Pitfalls

  1. Not testing on real devices: The Google Colab TFLite emulator will NOT replicate the behavior of a Tecno Spark 10 running Android 12 Go Edition. Test on actual African-market phones.

  2. Oversized input preprocessing: Resizing a 12 MP camera image to 224×224 on a budget CPU takes 300-500ms. Use the CameraX ML Kit integration to get a direct feed of 224×224 images from the camera sensor.

  3. Ignoring thermal throttling: Continuous inference on a budget phone without a heat sink causes thermal throttling after 2-3 minutes. Batch predictions and cooldown periods are essential.

  4. Assuming GPU availability: Only 30% of budget Android phones have usable GPU acceleration for ML. Always test CPU-only performance.

  5. APK size bloat: A model >10 MB inside an APK will cause installation failures on phones with limited storage. Consider Google Play Feature Delivery (on-demand model downloads).

The Bottom Line

In African markets, cloud-first AI is a toy for the wealthy. On-device AI is the real thing. TensorFlow Lite, paired with good model optimization practices, can deliver production-quality ML on devices that cost $80. The next billion AI users won’t interact with a cloud API — they’ll open an app, point their camera, and get an answer, with zero bars of signal.

Next in the series: The communities building African AI — Masakhane, Lacuna Fund, and local dataset initiatives.

This post is licensed under CC BY 4.0 by the author.