InkSight DataLabeler
Note: This entire page is currently under construction. The project was developed for research purposes; details will be updated as soon as the formal research paper is released. Stay tuned!
Overview
InkSight DataLabeler is a multimodal dataset and annotation platform built for AI-assisted accessibility in STEM education. The platform enables frame-level annotation of lecture videos, capturing dynamic handwritten content, complex diagrams, and spoken explanations to train AI models that can understand and transcribe multimodal educational material.
Demo coming soon: InkSight DataLabeler
Motivation
STEM lectures present significant challenges for students with auditory or visual processing difficulties. In 2019, 41% of U.S. colleges hired student note-takers to assist students requiring accommodations—a labor-intensive and expensive process prone to quality variability. Existing tools struggle with the complexity of STEM content, including mathematical equations, diagrams, and rapid multimodal delivery.
InkSight addresses this gap by providing precise annotation tools that enable the development of AI models capable of transcribing and understanding dynamic educational content.
Key Features
Dynamic Content Capture
- Live Handwriting: Captures evolving handwritten notes, equations, and annotations in real time
- Complex Diagrams: Annotates step-by-step creation of scientific and mathematical diagrams
- Speech Synchronization: Aligns spoken explanations with corresponding visual elements
- Frame-Level Precision: Provides fine-grained annotations for every video frame (see the record sketch after this list)
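As a rough illustration of the kind of record this produces, the sketch below shows one possible shape for a frame-level annotation that ties a visual region to the speech segment explaining it. The field names are illustrative assumptions, not the platform's actual schema.

```typescript
// Illustrative only: field names are assumptions, not InkSight's actual schema.

/** A rectangular region on a single video frame, in pixel coordinates. */
interface FrameRegion {
  frameNumber: number;
  x: number;
  y: number;
  width: number;
  height: number;
}

/** One frame-level annotation linking visual content to the speech that explains it. */
interface FrameAnnotation {
  region: FrameRegion;
  label: "handwriting" | "math" | "diagram" | "background" | "erasure";
  transcript?: string;      // what was written or drawn, if transcribed
  speechStartSec?: number;  // start of the aligned spoken explanation
  speechEndSec?: number;    // end of the aligned spoken explanation
}
```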
Collaborative Annotation System
- Real-Time Collaboration: WebSocket-powered multi-user annotation sessions (see the broadcast sketch after this list)
- Live Cursors: View other annotators’ work in progress
- Three-Tiered Roles: Annotator, Lead Annotator, and Administrator for quality control
- Validation Pipeline: Structured review process before inclusion in training dataset
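The live-cursor behavior above relies on WebSocket broadcasting. A minimal server-side sketch using the `ws` package is shown below; the message shape and relay logic are assumptions for illustration, not the platform's actual protocol.

```typescript
// Minimal sketch using the "ws" package; the message shape is an assumption.
import { WebSocketServer, WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (socket: WebSocket) => {
  socket.on("message", (data) => {
    // Expect cursor updates such as: { type: "cursor", userId, frameNumber, x, y }.
    // Relay each update to every other connected annotator in the session.
    for (const client of wss.clients) {
      if (client !== socket && client.readyState === WebSocket.OPEN) {
        client.send(data.toString());
      }
    }
  });
});
```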
Annotation Tools
Five primary annotation categories with dedicated tooling (a minimal type sketch follows the list):
- Handwritten Text (amber) - For lecture notes and written content
- Mathematical Notation (emerald) - For equations and formulas
- Diagrams (blue) - For visual representations and figures
- Background Content (gray) - For static slide material
- Erasure Regions (red) - For tracking content removal
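In code, these categories map naturally onto a small color lookup, roughly as sketched below; the tool identifiers are illustrative, not the platform's exact values.

```typescript
// Sketch of the five annotation categories and their UI colors; identifiers are illustrative.
type AnnotationTool =
  | "handwritten_text"
  | "math_notation"
  | "diagram"
  | "background"
  | "erasure";

const TOOL_COLORS: Record<AnnotationTool, string> = {
  handwritten_text: "amber",
  math_notation: "emerald",
  diagram: "blue",
  background: "gray",
  erasure: "red",
};
```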
Technical Architecture
Frontend
- React with Tailwind CSS for responsive, modular UI
- Video playback controls with timeline navigation
- Canvas-based drawing tools for precise annotation (see the drawing sketch after this list)
- Real-time collaboration features
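Assuming an overlay canvas positioned above the video element, drawing an annotation region comes down to standard Canvas 2D calls, roughly as sketched below; the helper itself is illustrative rather than the platform's actual drawing code.

```typescript
// Illustrative helper: outline one annotation region on an overlay canvas above the video.
function drawRegion(
  canvas: HTMLCanvasElement,
  region: { x: number; y: number; width: number; height: number },
  color: string
): void {
  const ctx = canvas.getContext("2d");
  if (!ctx) return;
  ctx.strokeStyle = color;   // e.g. the category color from the tool palette
  ctx.lineWidth = 2;
  ctx.strokeRect(region.x, region.y, region.width, region.height);
}
```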
Backend
- Node.js with Express for API services
- PostgreSQL with JSONB for flexible coordinate storage (see the persistence sketch after this list)
- WebSocket broadcasting for real-time synchronization
- Video processing pipeline for frame extraction
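A hedged sketch of how an annotation might be persisted through the Express API with node-postgres is shown below; the route path, table, and column names are assumptions for illustration.

```typescript
// Sketch only: the route, table, and column names are assumptions, not the real API.
import express from "express";
import { Pool } from "pg";

const app = express();
app.use(express.json());
const pool = new Pool(); // connection settings come from the standard PG* environment variables

app.post("/api/annotations", async (req, res) => {
  const { videoId, frameNumber, toolType, coordinates, label } = req.body;
  // The JSONB column can hold arbitrary coordinate geometry (boxes, polygons, strokes).
  await pool.query(
    `INSERT INTO annotations (video_id, frame_number, tool_type, coordinates, label)
     VALUES ($1, $2, $3, $4, $5)`,
    [videoId, frameNumber, toolType, JSON.stringify(coordinates), label]
  );
  res.status(201).end();
});

app.listen(3000);
```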
Data Storage
- Per-frame annotation schema for consistency across frame rates (see the conversion sketch after this list)
- Structured metadata: frame number, tool type, coordinates, labels, confidence scores
- Supporting tables for user activity, video metadata, and temporal events
- Future integration with AWS S3 for scalable video storage
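One way to keep per-frame annotations consistent regardless of frame rate is to convert between timestamps and frame indices using the source video's fps, roughly as below; this is an assumption about the approach rather than the platform's exact implementation.

```typescript
// Sketch: anchor annotations to frame indices rather than playback-dependent timestamps.
function timeToFrame(timestampSec: number, fps: number): number {
  return Math.round(timestampSec * fps);
}

function frameToTime(frameNumber: number, fps: number): number {
  return frameNumber / fps;
}

// Example: at 30 fps, content spoken about at 12.5 s lands on frame 375.
const frame = timeToFrame(12.5, 30); // 375
```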
Deployment
- Docker containerization for portability
- Cloud deployment on Replit for demo access
- Production-ready AWS architecture (S3, CloudFront CDN, Redis)
Dataset Characteristics
InkSight fills critical gaps identified in existing multimodal lecture datasets:
Advantages Over Existing Datasets
- Dynamic Content: Unlike datasets focused on static slides (e.g., LPM), InkSight captures how live handwriting evolves over time
- Fine-Grained Annotations: Frame-by-frame detail vs. coarse bounding boxes
- Strong Cross-Modal Alignment: Precise links between speech and visual elements
- STEM Focus: Specialized for mathematical equations and scientific diagrams
Comparison with Related Work
- LectureVideoDB (2018): Limited to OCR text recognition, lacks handwriting process
- Google I/O (2014): Static slides only, no dynamic content
- LectureBank (2019): Text-centric PDFs, no video or temporal data
- LPM (2022-2023): Best match but focuses on typed slides, not handwritten content
- SlideSpeech (2023): ASR enhancement, doesn’t analyze visual reasoning
Export Formats
Annotated data can be exported in training-ready formats:
- JSON: Full annotation metadata and relationships
- CSV: Tabular format for statistical analysis
- COCO: Compatible with common ML frameworks
- FCN-Optimized: Tailored for Fully Convolutional Networks
Each export includes video metadata, coordinate geometries, content types, validation status, and quality metrics (inter-annotator agreement, validation coverage).
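As an illustration of the COCO export, the sketch below maps a simplified annotation shape onto COCO-style image and annotation records; the field coverage and category ids are assumptions, not the exact output the platform emits.

```typescript
// Illustrative COCO-style export; exact fields and ids are assumptions.
interface Region {
  frameNumber: number;
  x: number;
  y: number;
  width: number;
  height: number;
}

interface Annotation {
  region: Region;
  categoryId: number; // e.g. 1 = handwritten text, 2 = math notation, ...
}

function toCoco(videoId: string, annotations: Annotation[]) {
  // One COCO "image" per unique frame referenced by at least one annotation.
  const frames = Array.from(new Set(annotations.map((a) => a.region.frameNumber)));
  const imageIdByFrame = new Map<number, number>();
  frames.forEach((f, i) => imageIdByFrame.set(f, i));

  return {
    images: frames.map((f, i) => ({
      id: i,
      file_name: `${videoId}_frame_${f}.png`, // frame width/height would come from video metadata
    })),
    annotations: annotations.map((a, i) => ({
      id: i,
      image_id: imageIdByFrame.get(a.region.frameNumber),
      category_id: a.categoryId,
      bbox: [a.region.x, a.region.y, a.region.width, a.region.height], // COCO uses [x, y, w, h]
      area: a.region.width * a.region.height,
      iscrowd: 0,
    })),
    categories: [
      { id: 1, name: "handwritten_text" },
      { id: 2, name: "math_notation" },
      { id: 3, name: "diagram" },
      { id: 4, name: "background" },
      { id: 5, name: "erasure" },
    ],
  };
}
```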
Applications
Accessibility in STEM Education
- Automated note-taking for students with disabilities
- Real-time transcription of handwritten content
- Diagram-to-text conversion for screen readers
- Multimodal learning support
AI for Science (AI4S)
- Symbolic Reasoning: Multi-step derivation reconstruction
- Diagram Understanding: Visual-to-structured representation translation
- Scientific Retrieval: Equation/diagram-based search in literature
- Workflow Modeling: Problem-solving pattern extraction
- Knowledge Graphs: Entity and relation extraction for prerequisite mapping
Vision-Language Model Training
Designed to support pre-training and fine-tuning of:
- OpenAI CLIP: Joint vision-text representations for concept matching
- OpenAI GPT-4V: Rich reasoning over diagrams and dense text
- DeepMind Flamingo: Few-shot learning for educational adaptability
- Google Gemini: Advanced multimodal integration
Research Paper
InkSight was developed as part of research at Cornell University in collaboration with Jennifer Sun. The work aims to advance multimodal learning systems that foster inclusive learning environments and improve AI model training for educational applications.
