Sharding with SpringBoot

Understanding Database Sharding Imagine your application has grown so popular that a single database server can’t handle all your data efficiently. Sharding is like splitting your database into multiple smaller pieces (called “shards”), where each shard: Lives on a different server Contains the same table structure Holds only a portion of your total data For example, with customer data: Shard 1 might contain customers with IDs 1–1000 Shard 2 might contain customers with IDs 1001–2000 Why We Need Sharding Performance: Queries run faster because each database has less data to search through Scalability: You can keep adding more shards as your data grows Availability: If one shard goes down, the others remain available Common Sharding Strategies The way you divide your data affects how you’ll retrieve it: Hash-Based Sharding A hash function determines which shard contains the data (e.g., customerID % numberOfShards). This evenly distributes data but makes range queries difficult. Range-Based Sharding Data is split based on ranges of a key value (e.g., customers A-M in Shard 1, N-Z in Shard 2). Good for range queries but can lead to uneven distribution. Directory-Based Sharding Uses a lookup table that maps keys to specific shards. More flexible but adds an extra lookup step. Implementing Sharding in Spring Boot Let’s explore three approaches to implement sharding in Spring Boot, from simplest to most advanced. Approach 1: Custom Shard Routing (AbstractRoutingDataSource) This approach gives you direct control over which shard to query. Step 1: Configure Multiple Data Sources @Configuration public class DataSourceConfig { // Configuration properties for each shard @Bean @ConfigurationProperties("app.datasource.shard1") public DataSourceProperties shard1Properties() { return new DataSourceProperties(); } @Bean @ConfigurationProperties("app.datasource.shard2") public DataSourceProperties shard2Properties() { return new DataSourceProperties(); } // Create actual connections to each database shard @Bean public DataSource shard1DataSource() { return shard1Properties().initializeDataSourceBuilder().build(); } @Bean public DataSource shard2DataSource() { return shard2Properties().initializeDataSourceBuilder().build(); } // Create a routing datasource that will switch between shards @Bean public DataSource routingDataSource() { Map dataSources = new HashMap(); dataSources.put("shard1", shard1DataSource()); dataSources.put("shard2", shard2DataSource()); ShardRoutingDataSource routingDataSource = new ShardRoutingDataSource(); routingDataSource.setTargetDataSources(dataSources); routingDataSource.setDefaultTargetDataSource(shard1DataSource()); // Fallback shard return routingDataSource; } // Set up transaction management @Bean public PlatformTransactionManager transactionManager() { return new DataSourceTransactionManager(routingDataSource()); } // Configure the entity manager factory for JPA @Bean public LocalContainerEntityManagerFactoryBean entityManagerFactory(EntityManagerFactoryBuilder builder) { return builder .dataSource(routingDataSource()) .packages("com.example.model") // Your entity package .persistenceUnit("shardedPersistenceUnit") .build(); } // Make this our primary DataSource that the application will use @Primary @Bean public DataSource dataSource() { // LazyConnectionDataSourceProxy improves performance by only // connecting to the database when actually needed return new LazyConnectionDataSourceProxy(routingDataSource()); } } Step 2: Create a Custom Routing DataSource This class extends Spring’s built-in routing capability to decide which shard to use: public class ShardRoutingDataSource extends AbstractRoutingDataSource { // ThreadLocal variable to store current shard for each thread/request private static final ThreadLocal currentShard = new ThreadLocal(); // Set which shard to use public static void setCurrentShard(String shardId) { currentShard.set(shardId); } // Get the current shard to use public static String getCurrentShard() { return currentShard.get(); } // Clean up after we're done - important to prevent memory leaks! public static void clearCurrentShard() { currentShard.remove(); } @Override protected Object determineCurrentLookupKey() { // This method asks our ThreadLocal which shard to use right now return getCurrentShard(); } } Step 3: Create a Shard Selection Strategy This service determines which shard should contain a particular piece of data: @Service public class ShardingService {

May 1, 2025 - 07:34
 0
Sharding with SpringBoot

Understanding Database Sharding

Imagine your application has grown so popular that a single database server can’t handle all your data efficiently. Sharding is like splitting your database into multiple smaller pieces (called “shards”), where each shard:

  • Lives on a different server
  • Contains the same table structure
  • Holds only a portion of your total data

For example, with customer data:

Shard 1 might contain customers with IDs 1–1000
Shard 2 might contain customers with IDs 1001–2000

Why We Need Sharding

Performance: Queries run faster because each database has less data to search through
Scalability: You can keep adding more shards as your data grows
Availability: If one shard goes down, the others remain available

Common Sharding Strategies

The way you divide your data affects how you’ll retrieve it:

  1. Hash-Based Sharding
    A hash function determines which shard contains the data (e.g., customerID % numberOfShards). This evenly distributes data but makes range queries difficult.

  2. Range-Based Sharding
    Data is split based on ranges of a key value (e.g., customers A-M in Shard 1, N-Z in Shard 2). Good for range queries but can lead to uneven distribution.

  3. Directory-Based Sharding
    Uses a lookup table that maps keys to specific shards. More flexible but adds an extra lookup step.

Implementing Sharding in Spring Boot

Let’s explore three approaches to implement sharding in Spring Boot, from simplest to most advanced.

Approach 1: Custom Shard Routing (AbstractRoutingDataSource)
This approach gives you direct control over which shard to query.

Step 1: Configure Multiple Data Sources

@Configuration
public class DataSourceConfig {

    // Configuration properties for each shard
    @Bean
    @ConfigurationProperties("app.datasource.shard1")
    public DataSourceProperties shard1Properties() {
        return new DataSourceProperties();
    }

    @Bean
    @ConfigurationProperties("app.datasource.shard2")
    public DataSourceProperties shard2Properties() {
        return new DataSourceProperties();
    }

    // Create actual connections to each database shard
    @Bean
    public DataSource shard1DataSource() {
        return shard1Properties().initializeDataSourceBuilder().build();
    }

    @Bean
    public DataSource shard2DataSource() {
        return shard2Properties().initializeDataSourceBuilder().build();
    }

    // Create a routing datasource that will switch between shards
    @Bean
    public DataSource routingDataSource() {
        Map dataSources = new HashMap<>();
        dataSources.put("shard1", shard1DataSource());
        dataSources.put("shard2", shard2DataSource());

        ShardRoutingDataSource routingDataSource = new ShardRoutingDataSource();
        routingDataSource.setTargetDataSources(dataSources);
        routingDataSource.setDefaultTargetDataSource(shard1DataSource()); // Fallback shard

        return routingDataSource;
    }

    // Set up transaction management
    @Bean
    public PlatformTransactionManager transactionManager() {
        return new DataSourceTransactionManager(routingDataSource());
    }

    // Configure the entity manager factory for JPA
    @Bean
    public LocalContainerEntityManagerFactoryBean entityManagerFactory(EntityManagerFactoryBuilder builder) {
        return builder
                .dataSource(routingDataSource())
                .packages("com.example.model") // Your entity package
                .persistenceUnit("shardedPersistenceUnit")
                .build();
    }

    // Make this our primary DataSource that the application will use
    @Primary
    @Bean
    public DataSource dataSource() {
        // LazyConnectionDataSourceProxy improves performance by only
        // connecting to the database when actually needed
        return new LazyConnectionDataSourceProxy(routingDataSource());
    }
}

Step 2: Create a Custom Routing DataSource
This class extends Spring’s built-in routing capability to decide which shard to use:

public class ShardRoutingDataSource extends AbstractRoutingDataSource {
    // ThreadLocal variable to store current shard for each thread/request
    private static final ThreadLocal currentShard = new ThreadLocal<>();

    // Set which shard to use
    public static void setCurrentShard(String shardId) {
        currentShard.set(shardId);
    }

    // Get the current shard to use
    public static String getCurrentShard() {
        return currentShard.get();
    }

    // Clean up after we're done - important to prevent memory leaks!
    public static void clearCurrentShard() {
        currentShard.remove();
    }

    @Override
    protected Object determineCurrentLookupKey() {
        // This method asks our ThreadLocal which shard to use right now
        return getCurrentShard();
    }
}

Step 3: Create a Shard Selection Strategy
This service determines which shard should contain a particular piece of data:

@Service
public class ShardingService {

    /**
     * Determines which shard contains data for a given customer ID
     * 
     * @param customerId The customer ID to look up
     * @return The shard key (e.g., "shard1" or "shard2")
     */
    public String determineShardKey(Long customerId) {
        // This is a simple example using the modulo operator
        // If customerId is odd, use shard1; if even, use shard2
        return "shard" + (customerId % 2 + 1);

        // In real applications, you might use more sophisticated strategies:
        // - Hash-based: using a hash function on the ID
        // - Range-based: specific ID ranges go to specific shards
        // - Geography-based: users from different regions go to different shards
    }
}

Step 4: Using the Sharding System in a Repository

@Repository
public class CustomerRepository {

    @Autowired
    private JdbcTemplate jdbcTemplate; // Spring's helper for database queries

    @Autowired
    private ShardingService shardingService;

    /**
     * Find a customer by their ID
     * 
     * @param customerId The customer to look up
     * @return The customer information
     */
    public Customer findById(Long customerId) {
        try {
            // Step 1: Figure out which shard has this customer
            String shardKey = shardingService.determineShardKey(customerId);

            // Step 2: Tell our system to use that shard
            ShardRoutingDataSource.setCurrentShard(shardKey);

            // Step 3: Run the query (it will automatically go to the right shard)
            return jdbcTemplate.queryForObject(
                "SELECT * FROM customers WHERE id = ?", 
                new Object[]{customerId},
                (resultSet, rowNum) -> {
                    Customer customer = new Customer();
                    customer.setId(resultSet.getLong("id"));
                    customer.setName(resultSet.getString("name"));
                    customer.setEmail(resultSet.getString("email"));
                    // Set other fields...
                    return customer;
                }
            );
        } finally {
            // Step 4: Always clean up the shard selection when done
            ShardRoutingDataSource.clearCurrentShard();
        }
    }
}

Step 5: A Simple Controller Example

@RestController
@RequestMapping("/api/customers")
public class CustomerController {
@Autowired
    private CustomerRepository customerRepository;

    @GetMapping("/{id}")
    public ResponseEntity getCustomer(@PathVariable Long id) {
        try {
            Customer customer = customerRepository.findById(id);
            return ResponseEntity.ok(customer);
        } catch (EmptyResultDataAccessException e) {
            return ResponseEntity.notFound().build();
        }
    }
}

Step 6: Configuration in application.properties

# Shard 1 Database Connection Info
app.datasource.shard1.url=jdbc:mysql://shard1-database-server:3306/customers_db
app.datasource.shard1.username=db_user
app.datasource.shard1.password=db_password
app.datasource.shard1.driver-class-name=com.mysql.cj.jdbc.Driver

# Shard 2 Database Connection Info
app.datasource.shard2.url=jdbc:mysql://shard2-database-server:3306/customers_db
app.datasource.shard2.username=db_user
app.datasource.shard2.password=db_password
app.datasource.shard2.driver-class-name=com.mysql.cj.jdbc.Driver

Approach 2: Using Apache ShardingSphere
For more complex sharding scenarios, Apache ShardingSphere provides a comprehensive solution that abstracts much of the complexity.

Step 1: Add Dependencies


    org.apache.shardingsphere
    shardingsphere-jdbc-spring-boot-starter
    5.3.0




    org.apache.shardingsphere
    shardingsphere-transaction-xa-core
    5.3.0

Step 2: Configure ShardingSphere in application.yml
spring:
  shardingsphere:
    mode:
      type: Standalone
      repository:
        type: JDBC
    datasource:
      names: ds0, ds1
      ds0:
        type: com.zaxxer.hikari.HikariDataSource
        driverClassName: com.mysql.cj.jdbc.Driver
        jdbcUrl: jdbc:mysql://localhost:3306/shard0
        username: root
        password: password
        maxPoolSize: 10
      ds1:
        type: com.zaxxer.hikari.HikariDataSource
        driverClassName: com.mysql.cj.jdbc.Driver
        jdbcUrl: jdbc:mysql://localhost:3307/shard1
        username: root
        password: password
        maxPoolSize: 10
    rules:
      sharding:
        default-database-strategy:
          standard:
            sharding-column: customer_id
            sharding-algorithm-name: database-inline
        sharding-algorithms:
          database-inline:
            type: INLINE
            props:
              algorithm-expression: ds${customer_id % 2}
        tables:
          customers:
            actual-data-nodes: ds${0..1}.customers
          orders:
            actual-data-nodes: ds${0..1}.orders
            database-strategy:
              standard:
                sharding-column: customer_id
                sharding-algorithm-name: database-inline
            key-generate-strategy:
              column: order_id
              key-generator-name: snowflake
        key-generators:
          snowflake:
            type: SNOWFLAKE
    props:
      sql-show: true # Log sharded SQL for debugging

Step 3: Use Standard JPA Repositories
With ShardingSphere, you can use standard Spring Data repositories:

@Entity
@Table(name = "customers")
public class Customer {
    @Id
    private Long id;
    private String name;
    private String email;
    // Getters and setters
}
@Repository
public interface CustomerRepository extends JpaRepository {
    // Regular JPA methods - ShardingSphere handles the routing
    List findByNameContaining(String nameFragment);
}

Step 4: Service Layer with Transaction Management
ShardingSphere supports distributed transactions using XA:

@Service
public class CustomerOrderService {

    @Autowired
    private CustomerRepository customerRepository;

    @Autowired
    private OrderRepository orderRepository;

    // This transaction will automatically be distributed across shards
    @Transactional(rollbackFor = Exception.class)
    public void createCustomerWithOrder(Customer customer, Order order) {
        customerRepository.save(customer);
        order.setCustomerId(customer.getId());
        orderRepository.save(order);
    }
}

Step 5: ShardingSphere’s Built-in Query Pagination
ShardingSphere automatically handles cross-shard pagination for you:

@Service
public class CustomerService {

    @Autowired
    private CustomerRepository customerRepository;

    public Page findAllCustomers(int page, int size) {
        // ShardingSphere automatically merges and orders results from all shards
        return customerRepository.findAll(PageRequest.of(page, size, Sort.by("id")));
    }
}

ShardingSphere Architecture Overview

ShardingSphere provides several key components:

ShardingSphere-JDBC: A client-side solution that intercepts JDBC calls and routes them to the appropriate shards
ShardingSphere-Proxy: A server-side database proxy that you can connect to like a regular database
ShardingSphere-Sidecar: Service mesh integration (for cloud-native deployments)
For most Spring Boot applications, ShardingSphere-JDBC is the preferred approach.

Approach 3: Using a Data Virtualization Layer
For complex environments with multiple data source types beyond just SQL databases, a data virtualization platform can provide a unified view.

Data Virtualization Architecture

A typical data virtualization setup includes:

Data Source Connectors: Connect to various data sources (SQL, NoSQL, API, etc.)
Query Engine: Optimizes and routes queries to appropriate data sources
Caching Layer: Improves performance by caching frequently accessed data
Virtual Database Layer: Presents a unified view through standard interfaces (JDBC, ODBC)

Step 1: Set Up a Data Virtualization Platform
Popular options include:

  • Denodo Platform
  • Dremio
  • IBM Data Virtualization Manager
  • Red Hat JBoss Data Virtualization

Step 2: Define Virtual Views
In your virtualization platform, define virtual views that combine data from multiple shards.

For example, in Denodo:

CREATE VIEW global_customers AS
    SELECT * FROM shard1_customers
    UNION ALL
    SELECT * FROM shard2_customers;

Step 3: Connect Spring Boot to the Virtualization Layer

@Configuration
public class DataVirtualizationConfig {

    @Bean
    @ConfigurationProperties("spring.datasource")
    public DataSource virtualDataSource() {
        return DataSourceBuilder.create().build();
    }

    @Bean
    public JdbcTemplate jdbcTemplate() {
        return new JdbcTemplate(virtualDataSource());
    }
}

Step 4: Use Standard JPA or JDBC to Query Virtual Views

@Repository
public class VirtualizedCustomerRepository {

    @Autowired
    private JdbcTemplate jdbcTemplate;

    public List findAllCustomers(int page, int size) {
        return jdbcTemplate.query(
            "SELECT * FROM global_customers ORDER BY id LIMIT ? OFFSET ?",
            new Object[]{size, page * size},
            customerRowMapper
        );
    }
}

Advantages of Data Virtualization

  • Provides a unified view across disparate data sources
  • Abstracts away the complexity of sharding from application code
  • Enables real-time data integration without ETL processes
  • Supports complex data transformations and advanced query capabilities

Transaction Management in Sharded Environments
Transaction management becomes challenging in sharded environments because traditional database transactions only work within a single database.

Single-Shard Transactions
These work just like regular database transactions:

@Transactional
public void updateCustomerDetails(Long customerId, String newName) {
    // ShardingContextHolder.setCurrentShard is handled by an aspect
    customerRepository.updateName(customerId, newName);
    auditRepository.logNameChange(customerId, newName);
    // Both operations occur in the same shard and form a single transaction
}

Cross-Shard Transaction Patterns

  1. Two-Phase Commit (2PC) The traditional approach for distributed transactions:

Prepare Phase: Each shard is asked if it can commit
Commit Phase: If all shards agree, they all commit; otherwise, all rollback
ShardingSphere supports 2PC using the XA protocol:

@Service
public class CrossShardService {

    // ShardingSphere handles the 2PC under the hood
    @ShardingSphereTransactionType(TransactionType.XA)
    @Transactional
    public void transferBetweenCustomers(Long fromCustomerId, Long toCustomerId, BigDecimal amount) {
        // These might be in different shards, but XA handles it
        accountRepository.deductBalance(fromCustomerId, amount);
        accountRepository.addBalance(toCustomerId, amount);
    }
}
  1. Saga Pattern A sequence of local transactions with compensating actions for failures:
@Service
public class SagaTransactionService {

    @Autowired
    private AccountRepository accountRepository;

    @Autowired
    private TransferRepository transferRepository;

    // Not a @Transactional method - we manage the transaction flow manually
    public boolean transferWithSaga(Long fromId, Long toId, BigDecimal amount) {
        // Step 1: Record the transfer intent
        Long transferId = recordTransferIntent(fromId, toId, amount);

        try {
            // Step 2: Deduct from source account (Shard 1)
            if (!deductFromAccount(fromId, amount, transferId)) {
                return false; // Failed at step 2
            }

            // Step 3: Add to destination account (Shard 2)
            if (!addToAccount(toId, amount, transferId)) {
                // Compensating transaction for step 2
                refundAccount(fromId, amount, transferId);
                return false; // Failed at step 3
            }

            // Step 4: Mark transfer as complete
            markTransferComplete(transferId);
            return true;

        } catch (Exception e) {
            // Handle unexpected errors, apply compensating transactions
            handleSagaFailure(transferId, fromId, toId, amount);
            return false;
        }
    }

    // Methods to implement each step and compensating actions...
}
  1. Eventual Consistency with Event-Based Approach Using events for asynchronous processing across shards:
@Service
public class EventBasedTransactionService {

    @Autowired
    private AccountRepository accountRepository;

    @Autowired
    private ApplicationEventPublisher eventPublisher;

    @Transactional // Local transaction in a single shard
    public void initiateTransfer(Long fromId, BigDecimal amount, Long toId) {
        // Deduct from source account
        accountRepository.deductBalance(fromId, amount);

        // Publish event for the second part of the transaction
        eventPublisher.publishEvent(new FundsTransferredEvent(fromId, toId, amount));
    }

    @EventListener
    @Transactional // Will run in the appropriate shard based on toId
    public void handleFundsTransfer(FundsTransferredEvent event) {
        // Add to destination account
        accountRepository.addBalance(event.getToId(), event.getAmount());
    }
}

Transaction Isolation in Sharded Environments
Be aware of limitations:

  • No global read consistency across shards
  • Potential for partial failures in distributed transactions
  • Transaction isolation only applies within a single shard

Approach Comparison: Pros and Cons

Approach: AbstractRoutingDataSource

Pros:

  1. Complete control over routing
  2. No extra dependencies
  3. Easier to understand and debug
  4. Works with any SQL database

Cons:

  1. Manual implementation of cross-shard operations
  2. No built-in support for distributed transactions
  3. Cross-shard pagination is complex
  4. More boilerplate code

Approach: Apache ShardingSphere

Pros:

  1. Handles cross-shard queries automatically
  2. Built-in distributed transaction support
  3. SQL parser with query optimization
  4. Supports various sharding algorithms
  5. Transparent to application code

Cons:

  1. Additional dependency
  2. Learning curve for configuration
  3. Less flexibility for custom routing logic
  4. Performance overhead for query parsing

Approach: Data Virtualization

Pros:

  1. Unified view across multiple data sources
  2. Can integrate non-SQL data sources
  3. Query optimization across sources
  4. Application completely unaware of sharding

Cons:

  1. Usually requires commercial licenses
  2. Higher infrastructure complexity
  3. Potential bottleneck for performance
  4. Limited transaction capabilities across sources

Best Practices:
Start Simple: Begin with the AbstractRoutingDataSource approach before moving to more complex solutions
Add Logging: Log which shard is being accessed for each query during development
Create Unit Tests: Verify your sharding logic with tests against each shard
Choose Smart Shard Keys: Select keys that distribute data evenly and minimize cross-shard queries
Keep Related Data Together: Try to keep related data on the same shard to avoid complex cross-shard operations
Use Transactions Carefully: Be aware of the limitations of transactions across shards
Plan for Growth: Design your sharding strategy to accommodate adding more shards in the future
Handle Errors Gracefully: Always include robust error handling for shard routing failures
Implement Circuit Breakers: Use patterns like Circuit Breaker for resilience against shard failures
Monitor Shard Distribution: Track the data distribution across shards to detect imbalances

Next Steps for Learning

  • Start with a simple two-shard setup as shown here
  • Add logging to see which shard is being used for each query
  • Create unit tests that verify your sharding logic
  • Practice implementing cross-shard queries
  • Experiment with different sharding strategies

How did I write this post:

Prompt 1: You are an experienced software developer. Can you explain me how would implement fetching of data from sharded databases in SpringBoot for SQL database?

And a few more refinements using other LLM Tools.