Dashboard for Researchers & Geneticists: Functional Requirements [System Design]


Apr 20, 2025 - 05:33

Overview of Features

The system is designed for researchers and geneticists to analyze genetic/genomic data, with multi-level roles using Role-Based Access Control (RBAC). The system will have the following components:

Genetic Data API: Built using FastAPI, with PostgreSQL for genetic data storage and Redis for caching.
User Management: Using NoSQL database for user data storage, with Redis for caching.
ETL (Extract, Transform, Load): Using Apache Airflow to migrate legacy data to a modern SQL database with normalization, better relationships using ERD, indexing, etc.
Authentication: Single Sign-On (SSO), OAuth, and Decentralized Identifier (DID) using KERI or similar (e.g., AT Protocol). We will also integrate Web3 wallets such as MetaMask.
Emilia AI: Integration with PubMed API, Nature Journal API, etc., for answering questions related to genetic data.
RStudio Server and JupyterHub: Integration for researchers to create and test RShiny apps and Jupyter notebooks, with S3 for blob storage under user RBAC.

Functional Requirements

Genetic Data API Gateway

API Endpoints:
- Create, Read, Update, Delete (CRUD) operations for genetic data.
- API documentation with OpenAPI/Swagger
- API for accessing (read access) genetic data from RStudio environment with limitations.
Database: PostgreSQL for genetic data storage, with sharding and read replicas for scalability and high availability.
Caching: Redis for caching genetic data.
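
The read path and cache invalidation described above can be sketched as a cache-aside pattern. This is a minimal illustration, not the platform's implementation: `TTLCache` is a hypothetical in-memory stand-in for Redis, a plain dict stands in for PostgreSQL, and the `genetic:<sample_id>` key format and 60-second TTL are assumptions.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for a Redis-like cache with expiry (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._store: Dict[str, Tuple[float, dict]] = {}
        self._clock = clock

    def get(self, key: str) -> Optional[dict]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Expired entries behave like a cache miss.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: dict, ttl_seconds: float) -> None:
        self._store[key] = (self._clock() + ttl_seconds, value)

    def delete(self, key: str) -> None:
        self._store.pop(key, None)


class GeneticDataService:
    """Cache-aside reads and invalidate-on-write over a primary store."""

    def __init__(self, db: Dict[str, dict], cache: TTLCache,
                 ttl_seconds: float = 60.0) -> None:
        self.db = db                # hypothetical stand-in for PostgreSQL
        self.cache = cache          # hypothetical stand-in for Redis
        self.ttl_seconds = ttl_seconds
        self.db_reads = 0           # counts how often the primary store is hit

    def read(self, sample_id: str) -> Optional[dict]:
        key = f"genetic:{sample_id}"  # assumed key format
        cached = self.cache.get(key)
        if cached is not None:
            return cached
        self.db_reads += 1
        record = self.db.get(sample_id)
        if record is not None:
            self.cache.set(key, record, self.ttl_seconds)
        return record

    def update(self, sample_id: str, record: dict) -> None:
        self.db[sample_id] = record
        # Drop the cached copy so the next read sees the new value.
        self.cache.delete(f"genetic:{sample_id}")
```

Invalidating on write rather than updating the cache in place keeps the write path simple; the trade-off is one extra database read after each update.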

User Management

Database: NoSQL database for user data storage.
Caching: Redis for caching user data.
RBAC with multi-level permissions:
- Researchers (focus): Full data analysis capabilities, access to computational environments
- Farmers: Limited access to relevant genetic data for their livestock, data entry capabilities, profile management
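
One way to express the two roles above is a static role-to-permission map. The sketch below is an assumption about how the permissions might be named and grouped; the `Permission` members and the `can_read_animal` helper are hypothetical and only mirror the capabilities listed for each role.

```python
from enum import Enum, auto
from typing import Dict, FrozenSet


class Permission(Enum):
    READ_ALL_GENETIC_DATA = auto()
    READ_OWN_LIVESTOCK_DATA = auto()
    ENTER_DATA = auto()
    MANAGE_PROFILE = auto()
    RUN_ANALYSIS = auto()
    USE_COMPUTE_ENVIRONMENT = auto()


class Role(Enum):
    RESEARCHER = "researcher"
    FARMER = "farmer"


# Assumed mapping, derived from the role descriptions in this section.
ROLE_PERMISSIONS: Dict[Role, FrozenSet[Permission]] = {
    Role.RESEARCHER: frozenset({
        Permission.READ_ALL_GENETIC_DATA,
        Permission.RUN_ANALYSIS,
        Permission.USE_COMPUTE_ENVIRONMENT,
    }),
    Role.FARMER: frozenset({
        Permission.READ_OWN_LIVESTOCK_DATA,
        Permission.ENTER_DATA,
        Permission.MANAGE_PROFILE,
    }),
}


def is_allowed(role: Role, permission: Permission) -> bool:
    """Return True if the role carries the given permission."""
    return permission in ROLE_PERMISSIONS.get(role, frozenset())


def can_read_animal(role: Role, user_id: str, owner_id: str) -> bool:
    """Researchers read any record; farmers read only their own livestock."""
    if is_allowed(role, Permission.READ_ALL_GENETIC_DATA):
        return True
    return (is_allowed(role, Permission.READ_OWN_LIVESTOCK_DATA)
            and user_id == owner_id)
```

Keeping the ownership check separate from the permission lookup lets the "limited access to their own livestock" rule stay explicit instead of being buried in a query filter.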

Future Enhancement: Farmers will have a dedicated interface to input livestock data and manage their profiles, with tailored access to the Emilia AI assistant. Their AI interactions will be focused on practical applications such as optimal mating recommendations and basic genetic analysis for their specific herds, without access to the broader research capabilities available to geneticists and researchers.

ETL

Legacy Data Migration: Utilizing Apache Airflow to extract data from legacy systems and load it into a modern PostgreSQL database, enabling efficient data management and analysis.
Data Normalization: Normalizing genetic data to optimize relationships and data integrity using Entity-Relationship Diagrams (ERD), ensuring a robust and scalable database design.
Indexing: Implementing indexing strategies to facilitate efficient data retrieval and querying, enhancing overall system performance.
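
The normalization and indexing steps can be illustrated with a small load routine. This is a sketch under stated assumptions: SQLite (from the Python standard library) stands in for PostgreSQL, the flat legacy row shape `(animal_tag, breed, trait, value)` and the table and index names are hypothetical, and in the design above a step like this would be scheduled by Apache Airflow rather than called directly.

```python
import sqlite3
from typing import Iterable, Tuple

# Assumed shape of a flat legacy record: (animal_tag, breed, trait, value)
LegacyRow = Tuple[str, str, str, float]

SCHEMA = """
CREATE TABLE animals (
    id    INTEGER PRIMARY KEY,
    tag   TEXT NOT NULL UNIQUE,
    breed TEXT NOT NULL
);
CREATE TABLE measurements (
    id        INTEGER PRIMARY KEY,
    animal_id INTEGER NOT NULL REFERENCES animals(id),
    trait     TEXT NOT NULL,
    value     REAL NOT NULL
);
-- Composite index for the common "traits of one animal" lookup.
CREATE INDEX idx_measurements_animal_trait
    ON measurements (animal_id, trait);
"""


def load_normalized(conn: sqlite3.Connection, rows: Iterable[LegacyRow]) -> None:
    """Split flat legacy rows into an animals table and a measurements table."""
    conn.executescript(SCHEMA)
    for tag, breed, trait, value in rows:
        # Transform: trim stray whitespace and unify casing.
        tag = tag.strip()
        breed = breed.strip().title()
        trait = trait.strip().lower()
        # Each animal is stored once; repeated tags are ignored.
        conn.execute(
            "INSERT OR IGNORE INTO animals (tag, breed) VALUES (?, ?)",
            (tag, breed),
        )
        (animal_id,) = conn.execute(
            "SELECT id FROM animals WHERE tag = ?", (tag,)
        ).fetchone()
        conn.execute(
            "INSERT INTO measurements (animal_id, trait, value) VALUES (?, ?, ?)",
            (animal_id, trait, value),
        )
    conn.commit()
```

Storing each animal once and referencing it by key removes the repeated breed values that a flat legacy export would carry on every measurement row.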

Authentication

Single Sign-On (SSO): Implementing SSO for streamlined user authentication and access control.
OAuth: Utilizing OAuth 2.0 for secure delegated authorization (authentication is typically layered on top via OpenID Connect).
Decentralized Identifier (DID): Integrating AT Protocol for decentralized identity management, enabling users to control their digital identity.
Web3 & MetaMask Integration: Incorporating Web3 technologies and MetaMask for seamless blockchain-based authentication and interaction, ensuring a secure and decentralized user experience.

Emilia AI Assistant

Genetic/Genomic Database Integration: Integrating genetic and genomic databases to provide accurate and informed answers.
LLM & RAG: Utilizing Large Language Models (LLM), Natural Language Processing (NLP, e.g., LangChain), and Retrieval-Augmented Generation (RAG, e.g., Amazon Bedrock) to optimize answers based on genetic/genomic data and research publications.
Journal APIs: Leveraging APIs from reputable publication sources (e.g., PubMed API, Nature API) to access the latest research publications and findings.
Downloaded Articles/Publications: In cases where APIs are not available, utilizing downloaded articles and publications as a starting point to ensure comprehensive coverage and insights.
DAG for Heritability: Implementing Directed Acyclic Graphs (DAG) to estimate heritability and identify genetic relationships.
Users can ask Emilia AI questions such as:
- What are the known genetic factors contributing to a certain trait?
- What are the genetic factors influencing milk yield in dairy cattle?
- What is the estimated breeding value (EBV) for a specific trait in a particular breed of cattle?
- Can you suggest a suitable mate for cow #XXX based on its genetic profile and breeding goals?
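
To show where retrieval fits when answering questions like these, the sketch below ranks stored passages against a question and assembles a prompt for the language model. It is a toy stand-in: a real RAG setup would typically use embedding-based search rather than the token-overlap (Jaccard) score used here, and `rank_passages`, `build_prompt`, and the prompt wording are hypothetical.

```python
import re
from typing import List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_passages(question: str, passages: List[str],
                  top_k: int = 2) -> List[Tuple[float, str]]:
    """Score each passage by Jaccard overlap with the question, best first."""
    q_tokens = tokenize(question)
    scored: List[Tuple[float, str]] = []
    for passage in passages:
        p_tokens = tokenize(passage)
        union = q_tokens | p_tokens
        score = len(q_tokens & p_tokens) / len(union) if union else 0.0
        scored.append((score, passage))
    scored.sort(key=lambda item: item[0], reverse=True)
    # Passages with no overlap at all are dropped.
    return [item for item in scored[:top_k] if item[0] > 0]


def build_prompt(question: str, passages: List[str], top_k: int = 2) -> str:
    """Assemble retrieved context and the question into one prompt string."""
    context = "\n".join(
        f"- {passage}" for _, passage in rank_passages(question, passages, top_k)
    )
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}"
    )
```

The same two-step shape (retrieve, then prompt) applies whether the passages come from journal APIs or from downloaded publications.
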
Future Enhancement:
- Custom Models: Developing custom models for advanced analyses, such as:
  - Genome-Wide Association Studies (GWAS)
  - Gene expression analysis
- Re-engineering Analysis Tools: Integrating and re-engineering popular analysis tools, such as:
  - PLINK: Whole genome association analysis toolset
    - User Query → NLP Parser → Intent Classification → PLINK Command Generator → PLINK Execution → Result Parser → Context Enhancement (LLM) → User Response
  - BLUPF90: Software for genetic evaluation in animal breeding
  - GATK (Genome Analysis Toolkit)
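
The first stages of the PLINK flow above (intent classification and command generation) might look like the following sketch. The keyword lists, intent names, and function names are assumptions for illustration, and simple keyword matching stands in for the NLP parser; the flags used (`--bfile`, `--assoc`, `--freq`, `--missing`, `--out`) are standard PLINK 1.9 options. The command is only built here, not executed.

```python
from typing import Dict, List

# Hypothetical keyword lists; a real system would use an NLP model instead.
INTENT_KEYWORDS: Dict[str, List[str]] = {
    "association": ["association", "associated", "gwas"],
    "allele_frequency": ["frequency", "frequencies", "maf"],
    "missingness": ["missing", "call rate"],
}

# PLINK 1.9 analysis flag for each supported intent.
INTENT_FLAGS: Dict[str, List[str]] = {
    "association": ["--assoc"],
    "allele_frequency": ["--freq"],
    "missingness": ["--missing"],
}


def classify_intent(query: str) -> str:
    """Return the first intent whose keywords appear in the query."""
    text = query.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    raise ValueError(f"No supported PLINK intent found in query: {query!r}")


def build_plink_command(query: str, bfile_prefix: str,
                        out_prefix: str) -> List[str]:
    """Translate a user query into a PLINK argument list (not executed here)."""
    intent = classify_intent(query)
    return [
        "plink",
        "--bfile", bfile_prefix,   # prefix of the .bed/.bim/.fam fileset
        *INTENT_FLAGS[intent],
        "--out", out_prefix,       # prefix for PLINK's output files
    ]
```

Returning an argument list instead of a shell string means the command can later be passed to `subprocess.run` without shell interpolation of user-supplied text.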

RStudio Server and JupyterHub

![diagram-export-4-19-2025-11-52-21-PM.png](https://postimg.cc/vx8W0jBM)

Integration: Our platform integrates RStudio Server and JupyterHub, providing researchers with a seamless and secure environment to create, test, and deploy their data analysis workflows. Each user will have access to their own instance of RStudio Server or JupyterHub, allowing them to:
- Analyze their own data: Users can upload and analyze their own datasets, leveraging the power of R and Python programming languages.
- Access platform data: Researchers can also access platform-provided data, including genetic and genomic datasets, to support their research.
Features:
- RShiny App Development: Users can create and test RShiny apps to visualize and interact with their data.
- Jupyter Notebook: Researchers can create and share Jupyter notebooks to document and reproduce their analysis.
- S3 Blob Storage: User data and analysis outputs are stored securely in S3 blob storage, with access controlled by Role-Based Access Control (RBAC).
Scalability and High Availability: Our platform is built on AWS services to provide scalability and high availability for users. Depending on their research requirements, users can:
- Leverage Amazon SageMaker: For machine learning (ML) tasks, users can leverage Amazon SageMaker to analyze large datasets and build predictive models.
- Tier-based access: Users can access different tiers of computing resources, depending on their research needs and budget.