Untitled Publication

How We Built a Scalable Log Analytics Platform with OpenSearch

Mohammad Arsalan — Sun, 05 Oct 2025 13:16:09 GMT

Introduction

OpenSearch is a powerful, open-source search and analytics suite derived from Elasticsearch 7.10.2 and Kibana 7.10.2. It is developed and maintained by the OpenSearch Project, which was originally started by Amazon Web Services (AWS). It enables developers and organizations to ingest, search, analyze, and visualize large volumes of data in near real time.

Designed to be scalable, extensible, and vendor-neutral, OpenSearch supports a wide range of use cases such as log analytics, application monitoring, and business intelligence. With its hot-warm-cold architecture, it also offers cost-effective storage tiering for managing large data volumes across different performance levels.

Advantages of Using OpenSearch

Multi-AZ Deployment Support. Data is replicated across AZs to ensure resilience during infrastructure failures.
Index State Management (ISM): Time & Size-Based Splitting. Prevents index bloat and improves search performance.
Automatic Snapshots. Supports scheduled snapshots of indices to remote storage (like Amazon S3).
Rich Visualization with OpenSearch Dashboards
Hot-Warm-Cold Index Tiering with Retention Policies

OpenSearch Storage Tiers

Purpose

Master Node: Manages the cluster metadata and state (called the cluster manager node in recent OpenSearch versions).
Data Node: Stores and processes frequently accessed and actively written data (the hot tier).
Warm Node: Stores less frequently accessed but still queryable data.
Cold Node: Stores rarely accessed, long-term archival data.

Responsibilities

*Master Node*	*Data Node*	*Warm Node*	*Cold Node*
Creating/deleting indices	Real-time indexing and querying	Older logs (e.g., 1 week–1 month old)	Historical logs (e.g., several months or years old)
Tracking all nodes in the cluster	Recent logs, current metrics, etc.	Read-only	Cannot be read or written until the data is moved back to a Data or Warm Node
Managing shard allocation	Active writes happen in this Node.	-	-
Electing new master if the current one fails	-	-	-

Characteristics

*Master Node*	*Data Node*	*Warm Node*	*Cold Node*
Does not store data or handle search/index requests directly (though it can if configured)	High-performance storage (e.g., SSDs)	Larger but slower disks (e.g., HDDs or lower-tier SSDs)	Very inexpensive storage (e.g., large HDDs)
Typically kept dedicated and lightweight	More CPU and RAM to handle fast operations	Less CPU-intensive than hot nodes	Minimal compute resources
Usually a small number of them (odd number, like 3, for quorum)	Fast response time for queries and aggregations	Cheaper infrastructure	-
-	Generally expensive but fast hardware	Slower response time is acceptable	-

What Is an Index in OpenSearch

In OpenSearch, an index is a logical namespace where documents are stored. You can think of an index as a database table in a relational system:

Each index contains documents, and each document is a JSON object.
Documents are stored in shards, which are distributed across the cluster’s data nodes.
Indexes support full-text search, filtering, aggregation, and more.

What Are Hot, Warm, and Cold Indices

Hot, warm, and cold indices refer to data lifecycle stages and where data is physically stored based on its age, access pattern, and importance. This approach helps balance cost vs performance.

*Hot Index*	*Warm Index*	*Cold Index*
Actively written to and frequently queried.	No longer written to but still queried occasionally.	Rarely accessed but stored for compliance or archival.
Stored on: Hot data nodes (fast SSDs, high-performance)	Stored on: Warm nodes (cheaper, slower storage)	Stored on: Cold nodes (very cheap, slow storage)

What Is an Alias in OpenSearch

An alias in OpenSearch is like a pointer or virtual name that refers to one or more indices. Instead of interacting with an index directly (e.g., logs-2025.06.16), clients can read from or write to an alias like logs-current.

Key Concepts of Aliases:

Aliases can point to one or more indices.
Aliases can be used for search (read) and indexing (write) operations.
An alias used for writing must resolve to exactly one write index; if it points to several indices, one of them must be flagged with is_write_index.
1. Common pattern: logs-write → logs-000001
2. When you roll over, the alias is updated to point to a new index, e.g. logs-000002.
Read (search) aliases can point to one or many indices.
1. Example: An alias logs-read could point to logs-000001, logs-000002, and so on.
2. This allows you to search across a series of indices seamlessly.

Managing Index Growth with Rollover in OpenSearch

What Is Rollover in OpenSearch

In OpenSearch, a rollover is the process of automatically creating a new index when the current index reaches a certain threshold, such as:

Maximum size (e.g., 50GB)
Maximum number of documents (e.g., 1 million)
Maximum age (e.g., 1 day old)

Why Use Rollover

Avoid performance degradation in large indices
Split data efficiently across smaller, manageable indices
Work with an alias, so applications always write to the current active index

How It Works

You define a write alias, such as logs-write.
The alias points to an index, e.g., logs-000001.
You set rollover conditions (e.g., max_age: 1d, max_size: 50gb).
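When a rollover check runs (through an Index State Management policy or a call to the _rollover API) and any condition is met, a new index is created: logs-000002.
The alias logs-write is automatically updated to point to the new index.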
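The steps above can be sketched in plain Java. This is only an illustration of the decision logic, not OpenSearch code: the class RolloverSketch and its methods are hypothetical, and the thresholds are the example values from the text (1 day, 50 GB).

```java
import java.time.Duration;

// Hypothetical helper that mirrors the rollover steps; not an OpenSearch API.
class RolloverSketch {

    // Assumed thresholds, taken from the examples in the text (max_age: 1d, max_size: 50gb).
    static final Duration MAX_AGE = Duration.ofDays(1);
    static final long MAX_SIZE_BYTES = 50L * 1024 * 1024 * 1024;

    // A rollover is due as soon as any one condition is met.
    static boolean shouldRollOver(Duration indexAge, long indexSizeBytes) {
        return indexAge.compareTo(MAX_AGE) >= 0 || indexSizeBytes >= MAX_SIZE_BYTES;
    }

    // Derives the next index name by incrementing the zero-padded suffix,
    // e.g. logs-000001 -> logs-000002.
    static String nextIndexName(String current) {
        int dash = current.lastIndexOf('-');
        String prefix = current.substring(0, dash + 1);
        String suffix = current.substring(dash + 1);
        int next = Integer.parseInt(suffix) + 1;
        return prefix + String.format("%0" + suffix.length() + "d", next);
    }

    public static void main(String[] args) {
        String writeIndex = "logs-000001";
        // The index is 30 hours old and 10 GB in size: the age condition is met.
        if (shouldRollOver(Duration.ofHours(30), 10L * 1024 * 1024 * 1024)) {
            writeIndex = nextIndexName(writeIndex); // the write alias would now point here
        }
        System.out.println("logs-write -> " + writeIndex);
    }
}
```

In a real cluster OpenSearch performs the check and the alias switch for you; the sketch only shows why applications that write through logs-write never need to know the current index name.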

Understanding Shards in OpenSearch

In OpenSearch, shards are the fundamental units of data storage and processing. Each index is split into shards so that large datasets can be distributed across multiple nodes, enabling parallel operations.

A shard is a low-level, self-contained slice of an index that stores a portion of the data. Shards allow OpenSearch to:

Scale horizontally across multiple servers (nodes)
Handle large volumes of data
Perform parallel queries for faster response times

Primary Shard

Holds the original copy of documents.
Every index operation (create/update/delete) happens on the primary shard first.

Replica Shard

Acts as a copy of a primary shard.
Provides failover protection: if the primary shard or its node fails, a replica can be promoted.

What Does 5:2 Replication Strategy Mean

The 5:2 replication strategy is a common shard configuration in OpenSearch. It refers to:

5 primary shards
2 replica shards per primary
Breakdown:
- Primary Shards: 5
- Replica Shards: 5 × 2 = 10
- Total Shards: 15
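The same arithmetic can be written as a small Java sketch. The class name ShardMath is hypothetical; in an actual index the two inputs correspond to the number_of_shards and number_of_replicas index settings.

```java
// Hypothetical helper for the shard arithmetic in the breakdown above.
class ShardMath {

    // Replica copies across the index = primaries x replicas per primary.
    static int replicaShards(int primaries, int replicasPerPrimary) {
        return primaries * replicasPerPrimary;
    }

    // Total shards = primaries + all replica copies.
    static int totalShards(int primaries, int replicasPerPrimary) {
        return primaries + replicaShards(primaries, replicasPerPrimary);
    }

    public static void main(String[] args) {
        System.out.println("Replica shards: " + replicaShards(5, 2)); // 10
        System.out.println("Total shards:   " + totalShards(5, 2));   // 15
    }
}
```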

Sharding and replication are critical to OpenSearch’s ability to scale and remain highly available. By understanding how primary and replica shards work—and how to use strategies like 5:2 replication—you can design clusters that are both efficient and resilient.

Snapshots in OpenSearch

A snapshot in OpenSearch is a backup of your index or cluster metadata and data. It captures the state of indices at a given point in time and allows you to restore that data if needed. Think of snapshots as point-in-time backups that are incremental (efficient) and stored externally—typically on object storage like Amazon S3.

Disaster Recovery – Restore lost data in case of accidental deletion or system failure.
Cluster Migration – Move data from one cluster to another.
Rolling Back Changes – Revert to a previous state after unwanted modifications.

How Snapshots Work

Snapshot Repository: Before taking snapshots, you need to register a snapshot repository, which is the storage location for backups (for example, an S3 bucket).
Snapshot Process: Snapshots are incremental, meaning only new or changed data is backed up after the first snapshot.
Restore: You can restore a snapshot at any time to a new index, to the original index, or to a new cluster (for migration).

Dissecting Apache Kafka

Mohammad Arsalan — Thu, 24 Apr 2025 10:44:24 GMT

Introduction to Kafka: The Need for a Distributed Messaging System

Kafka is a distributed messaging system that plays a crucial role in modern data pipelines by addressing the need for high-throughput, low-latency communication between different services in a scalable and fault-tolerant manner. Traditional messaging systems, such as RabbitMQ and JMS, while capable, often struggle with scaling to handle large volumes of data or ensuring data consistency across distributed systems.

Kafka overcomes these challenges by enabling real-time streaming and providing a durable, distributed event log that acts as a central data hub for all services. It is particularly well-suited for use cases such as log aggregation, stream processing, and event sourcing, where data must be collected, processed, and consumed in real time. Kafka’s ecosystem is designed to provide a seamless experience for managing large-scale data flows.

This includes Kafka itself, which handles message brokering, ZooKeeper (or KRaft in newer versions) for managing the Kafka cluster, Kafka Connect for integrating external systems, and Kafka Streams for stream processing and real-time analytics. Together, these components create a powerful platform for building scalable, resilient, and fault-tolerant data-driven applications.

Kafka Cluster / Brokers, Topics, and Partitions — The Backbone

Broker / Cluster: A broker is a single Kafka server, and a cluster is a group of brokers working together. Each broker handles part of the data load.
Topic: Logical channel to which producers send messages and consumers read from.
Partition: Unit of parallelism within a topic. Each message goes to one partition.
Replication: For each partition, Kafka can create multiple replicas (one leader + followers) for high availability.

Kafka Data Flow — From Producer to Consumer

Producer sends a record to a topic.
Kafka determines the partition:
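- If a key is provided → hash(key) % number of partitions (the default partitioner uses a murmur2 hash of the key bytes).
- If no key → round-robin in older clients, sticky (batch-by-batch) partitioning since Kafka 2.4, or a custom partitioner.
The record is appended sequentially to the target partition's log.
The leader broker of that partition writes the record; follower brokers then fetch and replicate it.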
Consumers in a consumer group fetch records from their assigned partitions.
Kafka tracks the offset each consumer has read. (Like a bookmark in a book.)
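To make the key-based step concrete, here is a minimal Java sketch of "hash the key, then take it modulo the partition count". It is not Kafka's partitioner: Kafka's default uses murmur2, while this sketch substitutes Arrays.hashCode purely for illustration, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for a key-based partitioner; not Kafka's implementation.
class KeyPartitionerSketch {

    // Maps a key to a partition in the range [0, numPartitions).
    // Assumption: Arrays.hashCode replaces murmur2, so the partition numbers
    // will differ from what a real Kafka producer would pick.
    static int partitionFor(String key, int numPartitions) {
        int hash = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        // Mask the sign bit so the result of the modulo is never negative.
        return (hash & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition, which is what
        // preserves per-key ordering.
        System.out.println(partitionFor("order-17", 3));
        System.out.println(partitionFor("order-17", 3));
    }
}
```

The important property is determinism: as long as the partition count stays the same, every record with a given key lands in the same partition.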

Kafka Partitions — Scaling, Increasing, and Decreasing

A partition in Kafka is a fundamental unit of parallelism and storage. Each topic can be divided into multiple partitions, which helps distribute records across them. Partitions allow Kafka to scale and handle large volumes of data efficiently, enabling high throughput and parallel processing.

When it comes to increasing the number of partitions for an existing topic, Kafka supports this operation after topic creation. This can offer several benefits:

Improved parallelism: More partitions mean more consumers can read from the topic concurrently.
Higher throughput: Producers and consumers can scale independently of each other.

The partition count directly impacts consumer behavior within a consumer group. Here’s how different scenarios play out:

With 3 partitions and 2 consumers, one consumer will handle 2 partitions, and the other will handle 1.
With 3 partitions and 3 consumers, the partitions will be evenly distributed (one-to-one mapping).
With 3 partitions and 4 consumers, one consumer will remain idle since there are more consumers than partitions.
With 3 partitions and 1 consumer, that single consumer will handle all 3 partitions.
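The four scenarios can be reproduced with a simplified assignment routine. This sketch deals partitions to consumers in turn; it is a stand-in for Kafka's real assignors (range, round-robin, sticky), and the class and consumer names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified partition assignment; not Kafka's assignor code.
class AssignmentSketch {

    // Deals partitions 0..partitions-1 to the consumers in turn.
    // Assumes the consumer list is not empty.
    static Map<String, List<Integer>> assign(int partitions, List<String> consumers) {
        Map<String, List<Integer>> result = new LinkedHashMap<>();
        for (String consumer : consumers) {
            result.put(consumer, new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            String owner = consumers.get(p % consumers.size());
            result.get(owner).add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(assign(3, List.of("c1", "c2")));             // c1 gets two, c2 gets one
        System.out.println(assign(3, List.of("c1", "c2", "c3")));       // one partition each
        System.out.println(assign(3, List.of("c1", "c2", "c3", "c4"))); // c4 stays idle
        System.out.println(assign(3, List.of("c1")));                   // c1 gets all three
    }
}
```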

Kafka Consumer Group, Offset, Polling, and Auto-Commit Explained

Consumer Group Concept

A consumer group in Kafka is a logical grouping of consumers that work together to consume messages from a topic. Each consumer in a group is assigned a subset of partitions, so each message is delivered to only one consumer per group.

Each partition is assigned to exactly one consumer in a group at a time.
Multiple groups can independently consume the same topic without interfering with each other.

Offset in Kafka

Kafka tracks the offset, which is the position of a consumer in a partition — essentially a pointer to the message being read.

Offsets are maintained per partition per consumer group.
By default, offsets are stored in an internal Kafka topic: __consumer_offsets.

You can configure how offsets are committed using auto-commit or manual commit modes.

Auto-Commit

Kafka consumers can be configured to commit offsets automatically at regular intervals using the enable.auto.commit and auto.commit.interval.ms settings.

With auto-commit enabled, the consumer commits the latest polled offsets every 5 seconds (by default). While convenient, it may lead to message loss: if an offset is committed and the consumer then fails before it has finished processing those messages, they are not redelivered.
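Auto-commit is controlled by two consumer settings, enable.auto.commit and auto.commit.interval.ms. As a minimal sketch, they can be collected in a java.util.Properties object like the one below. The broker address and group name are assumptions for illustration, and the class name is hypothetical; a real consumer also needs deserializer settings, which are omitted here.

```java
import java.util.Properties;

// Hypothetical holder for auto-commit settings; the keys are Kafka consumer configs.
class AutoCommitConfigSketch {

    static Properties autoCommitProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.setProperty("group.id", "demo-group");               // hypothetical group name
        props.setProperty("enable.auto.commit", "true");           // turn auto-commit on
        props.setProperty("auto.commit.interval.ms", "5000");      // commit every 5 seconds
        return props;
    }

    public static void main(String[] args) {
        autoCommitProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```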

Manual Offset Commit

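A more robust approach is to disable auto-commit (enable.auto.commit=false) and commit the offset manually after a message has been processed. This gives at-least-once delivery: after a failure, a message may be processed again, but it is not skipped.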
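The effect of committing only after processing can be seen in a small simulation. This is not Kafka client code: the class below is a hypothetical in-memory model of one partition, where a "crash" happens after a record is processed but before its offset is committed. In a real consumer the commit step would typically be a call such as commitSync() on the consumer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of process-then-commit; not the Kafka client API.
class ManualCommitSketch {
    private final List<String> log;          // stands in for one partition's records
    private int committedOffset = 0;         // next offset to read after a restart
    final List<String> processed = new ArrayList<>();

    ManualCommitSketch(List<String> log) {
        this.log = log;
    }

    // Processes records starting at the committed offset and commits after each one.
    // crashAtOffset simulates a failure after processing but before the commit
    // (pass -1 for a run without a crash).
    void run(int crashAtOffset) {
        for (int offset = committedOffset; offset < log.size(); offset++) {
            processed.add(log.get(offset));   // 1. process the record
            if (offset == crashAtOffset) {
                return;                       // crash: the commit below never happens
            }
            committedOffset = offset + 1;     // 2. commit only after processing
        }
    }

    int committedOffset() {
        return committedOffset;
    }

    public static void main(String[] args) {
        ManualCommitSketch consumer = new ManualCommitSketch(List.of("a", "b", "c"));
        consumer.run(1);   // crashes after processing "b"
        consumer.run(-1);  // restart: resumes from the last committed offset
        System.out.println(consumer.processed); // "b" appears twice, nothing is lost
    }
}
```

The duplicate is the price of at-least-once delivery, which is why processing logic is usually made idempotent.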

Poll Interval

Kafka consumers use the poll() method to request messages from the broker. You must poll regularly; otherwise, Kafka considers the consumer dead and triggers a rebalance.

If the consumer doesn't call poll() within max.poll.interval.ms (5 minutes by default), it is removed from the group, and its partitions are reassigned.

Rebalancing in Kafka: Why It Happens and How It Affects Consumers

Rebalancing in Kafka is the process of redistributing partitions among consumers within a consumer group. It is triggered when there are changes in the group, such as a consumer joining or leaving, a topic being added or modified, or consumers failing to poll within a set interval.

This process temporarily pauses consumption as Kafka stops message delivery, reassigns partition ownership, and resumes once the new assignments are in place. While necessary for load balancing, rebalancing can cause latency or downtime, especially with stateful or slow-to-rejoin consumers.

Internally, Kafka handles rebalancing through:

Coordinator election to manage the group,
Partition assignment using strategies like range, round-robin, or sticky,
Offset fetching to resume processing,
Consumer resumption from newly assigned partitions.

Example: In a group with 4 partitions and 2 consumers, each consumer handles 2 partitions. If a third joins, Kafka rebalances to spread partitions across all three.

To minimize disruption from rebalancing:

Use sticky assignment to reduce reshuffling.
Adjust session.timeout.ms and heartbeat.interval.ms for better tolerance.
Avoid frequent consumer churn.
Use cooperative rebalancing for smoother transitions (available since Kafka 2.4).

Leader and Replica in Kafka: High Availability Through Replication

In Kafka, every partition of a topic is replicated across multiple brokers to ensure fault tolerance and high availability.

Leader Replica: Handles all read and write requests for the partition.
Follower Replicas: Passive replicas that copy data from the leader.

Only one broker at a time is the leader for a given partition. The remaining replicas are known as followers.

Let’s say you have a topic SendEmailQueue with 3 partitions and a replication factor of 3:

Partition	Leader Broker	Follower Brokers
P0	Broker 1	Broker 2, 3
P1	Broker 2	Broker 1, 3
P2	Broker 3	Broker 1, 2

Each broker is leading one partition and following two others.
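The layout in the table can be generated with a few lines of Java. This is a simplified sketch, assuming brokers are numbered from 1 and the replication factor equals the broker count; actual replica placement is decided by Kafka and may differ. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a leader/follower layout; not Kafka's placement algorithm.
class ReplicaLayoutSketch {

    // Leaders are spread round-robin: P0 -> Broker 1, P1 -> Broker 2, ...
    static int leaderFor(int partition, int brokerCount) {
        return (partition % brokerCount) + 1;
    }

    // With replication factor == broker count, every other broker is a follower.
    static List<Integer> followersFor(int partition, int brokerCount) {
        int leader = leaderFor(partition, brokerCount);
        List<Integer> followers = new ArrayList<>();
        for (int broker = 1; broker <= brokerCount; broker++) {
            if (broker != leader) {
                followers.add(broker);
            }
        }
        return followers;
    }

    public static void main(String[] args) {
        for (int p = 0; p < 3; p++) {
            System.out.println("P" + p + " leader=Broker " + leaderFor(p, 3)
                    + " followers=" + followersFor(p, 3));
        }
    }
}
```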

What Happens If Leader Fails?

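If a leader replica fails, Kafka elects a new leader from the in-sync replicas (ISR), the followers that are fully caught up with the leader. If no replica is in sync, the partition stays unavailable until an in-sync replica comes back online, unless unclean.leader.election.enable is set to true (not recommended in production, because it can lose data).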
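The failover rule can be expressed as a short decision function. The sketch below is a hypothetical simplification, not the controller's actual logic: it prefers a live in-sync replica and falls back to any live replica only when unclean election is allowed.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical simplification of leader election; not Kafka's controller code.
class LeaderElectionSketch {

    // isr: broker ids that were in sync with the failed leader
    // liveReplicas: broker ids of replicas that are currently online
    static Optional<Integer> electLeader(List<Integer> isr,
                                         List<Integer> liveReplicas,
                                         boolean uncleanAllowed) {
        for (Integer broker : isr) {
            if (liveReplicas.contains(broker)) {
                return Optional.of(broker); // clean election: no data loss
            }
        }
        if (uncleanAllowed && !liveReplicas.isEmpty()) {
            return Optional.of(liveReplicas.get(0)); // may lose unreplicated messages
        }
        return Optional.empty(); // partition stays offline
    }

    public static void main(String[] args) {
        System.out.println(electLeader(List.of(2, 3), List.of(2, 3), false)); // a leader from the ISR
        System.out.println(electLeader(List.of(), List.of(3), false));        // no leader
        System.out.println(electLeader(List.of(), List.of(3), true));         // unclean leader
    }
}
```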

Frequently Asked Questions (FAQs) About Kafka

Kafka can be a complex system to understand, especially when you are first diving into its various components and concepts. Here are some of the most frequently asked questions that can help clarify common doubts about Kafka.

What is the difference between Kafka and a traditional messaging queue like RabbitMQ?

Kafka and RabbitMQ are both message brokers, but they have different use cases and design principles:

Kafka is designed for high throughput and distributed data streaming. It stores messages in topics and partitions, and consumers can read messages at their own pace, replaying them if needed.
RabbitMQ is more focused on messaging between services with high reliability and flexible routing patterns. It uses queues for message delivery and is designed for scenarios requiring complex routing and delivery guarantees such as at-most-once or at-least-once.

Kafka is generally more suited for log aggregation, stream processing, and big data use cases, while RabbitMQ is preferred for traditional messaging with complex patterns like RPC or pub/sub.

What happens if a Kafka broker goes down?

Kafka has built-in fault tolerance. When a broker goes down, the partitions it hosted remain available through their replicas on other brokers. Kafka uses replicas and a leader-follower architecture to minimize the risk of data loss:

The leader replica for each partition will handle read and write operations.
The follower replicas replicate the leader’s data.

If a leader replica is lost due to a broker failure, Kafka will automatically elect a new leader from the available followers. However, if there are no available replicas, the partition may become unavailable until the broker recovers.

What is a Kafka Consumer Group?

A Consumer Group is a group of consumers that work together to consume messages from one or more topics. Kafka ensures that each partition in a topic is consumed by only one consumer within a group. Consumer groups provide scalability and fault tolerance by distributing partition consumption across multiple consumers.

If a consumer fails, other consumers in the group can pick up the partitions the failed consumer was consuming.
Consumer groups allow parallel processing of messages, and each message is delivered to only one consumer within the group (it may be redelivered after a failure or rebalance).

What are Kafka Topics and Partitions?

Topics are logical channels to which producers publish messages and from which consumers consume messages. Topics can be thought of as message categories.
Partitions are the physical storage units within a topic. A topic can have multiple partitions, and messages within a partition are ordered. Partitions enable Kafka to scale horizontally by allowing parallel reads and writes.

Each partition can only be consumed by one consumer at a time in a consumer group, and each message in a partition is identified by a sequential offset that consumers can track.

How does Kafka guarantee message order?

Kafka guarantees message order at the partition level, not across the entire topic. Within a single partition, messages are ordered based on the order in which they were produced. The partition key determines how messages are distributed across partitions:

If you want to preserve message order for a specific key, ensure that all messages with the same key are sent to the same partition.

However, Kafka does not guarantee order across different partitions within a topic.

How does Kafka handle message retention?

Kafka has a retention policy that controls how long messages are stored in a topic. There are two main retention mechanisms:

Time-based retention: Messages are retained for a specified period (retention.ms), after which they become eligible for deletion.
Size-based retention: Kafka deletes the oldest log segments once a partition exceeds a specified size limit (retention.bytes).

Messages can be replayed for as long as they remain within the retention limits; once deleted, they are no longer available for consumption.

What is Kafka Consumer Lag?

Consumer lag refers to the difference between the latest offset (the last message produced) and the current offset (the last message consumed) for a consumer group in a partition. Lag occurs when consumers are behind in processing messages.

High lag indicates that consumers are not keeping up with the rate of incoming messages.
Kafka provides monitoring tools to track lag, and it’s important to ensure that lag remains low for timely processing.
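The lag definition above translates directly into arithmetic. The sketch below computes lag per partition and for a whole group from two offset maps; the class name and the sample offsets are hypothetical. In practice these numbers are reported by tooling such as the kafka-consumer-groups command-line tool.

```java
import java.util.Map;

// Hypothetical lag calculator; the offsets would normally come from Kafka tooling.
class ConsumerLagSketch {

    // Lag for one partition: latest produced offset minus last committed offset.
    static long partitionLag(long logEndOffset, long committedOffset) {
        return Math.max(0L, logEndOffset - committedOffset);
    }

    // Sums the lag over all partitions, keyed by partition number.
    // Assumption: a partition with no committed offset is treated as offset 0.
    static long totalLag(Map<Integer, Long> logEndOffsets, Map<Integer, Long> committedOffsets) {
        long total = 0;
        for (Map.Entry<Integer, Long> entry : logEndOffsets.entrySet()) {
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            total += partitionLag(entry.getValue(), committed);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Integer, Long> logEnd = Map.of(0, 120L, 1, 50L);
        Map<Integer, Long> committed = Map.of(0, 100L, 1, 50L);
        System.out.println("Total lag: " + totalLag(logEnd, committed)); // 20
    }
}
```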

How do Kafka Producers ensure data durability?

Kafka producers ensure durability through the acknowledgment mechanism:

acks=0: The producer does not wait for any acknowledgment from the broker. This is faster but less reliable.
acks=1: The producer waits for acknowledgment from the leader broker. This ensures that the message is written to at least one broker.
acks=all: The producer waits for acknowledgment from all in-sync replicas. This provides the highest durability but may impact performance.

What is Kafka's Exactly-Once Semantics (EOS)?

Kafka provides exactly-once semantics (EOS) to ensure that a message is neither lost nor duplicated during processing. EOS is achieved by:

Idempotent Producers: With enable.idempotence=true (the default since Kafka 3.0), the broker de-duplicates producer retries, so a message that is re-sent after a transient failure is written to the partition only once.
Transactional Producers and Consumers: Kafka supports transactions that allow a producer to write messages to multiple partitions as a single atomic operation. Consumers configured with isolation.level=read_committed only see messages from committed transactions, so the results of aborted transactions are not consumed.
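The producer-side settings from the last two answers (acks, idempotence, and transactions) can be collected in one place. This is a minimal sketch using java.util.Properties; the broker address and transactional id are assumptions for illustration, the class name is hypothetical, and serializer settings that a real producer needs are omitted.

```java
import java.util.Properties;

// Hypothetical holder for durability-related producer settings.
class ProducerDurabilityConfigSketch {

    static Properties durableProducerProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.setProperty("acks", "all");                          // wait for all in-sync replicas
        props.setProperty("enable.idempotence", "true");           // de-duplicate producer retries
        props.setProperty("transactional.id", "demo-tx-1");        // hypothetical id, enables transactions
        return props;
    }

    public static void main(String[] args) {
        durableProducerProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```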

Can I change the number of partitions in Kafka?

Yes, you can increase the number of partitions in Kafka, but it is not possible to decrease the number of partitions. Increasing partitions allows Kafka to scale horizontally, distributing the load across more consumers.

However, adding partitions changes the key-to-partition mapping, so new messages for an existing key may land in a different partition than earlier ones, which breaks per-key ordering across the change. It also triggers a consumer group rebalance, so plan the change carefully.

What is the difference between `kafka-console-consumer` and `kafka-console-producer`?

kafka-console-consumer is a command-line tool that allows you to consume messages from a Kafka topic.
kafka-console-producer is a command-line tool that allows you to produce messages to a Kafka topic.

Spring Security in Spring Boot — A Complete Beginner's Guide

Mohammad Arsalan — Sun, 13 Apr 2025 16:20:13 GMT

Introduction

In this guide, we’ll walk through how to set up Spring Security in a Spring Boot application using a custom user model, connect it to a PostgreSQL database, create a signup endpoint, hash passwords with BCrypt, and authenticate users securely.

We’ll cover:

Required dependencies
PostgreSQL configuration in application.properties
Creating user model and repository
Writing controller for signup
Creating a custom user detail service
Configuring Spring Security
Lifecycle of authentication in Spring Security

Required Dependencies in `pom.xml`

Before we start writing code, let's first discuss the dependencies that need to be included in the project. These dependencies will provide us with essential tools to handle database interactions, web services, and Spring Security.

Spring Boot Starter Data JPA


<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>

Purpose: This dependency is required to work with databases using JPA (Java Persistence API). It enables Spring Data JPA support and helps us interact with the PostgreSQL database seamlessly.
Benefit: With Spring Data JPA, we don’t have to write SQL queries manually. It provides powerful query capabilities and automatic entity mapping.

Spring Boot Starter Web


<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-web</artifactId>
</dependency>

Purpose: This dependency adds everything needed to build web applications, including RESTful web services. It integrates Spring MVC, Tomcat as the default container, and Jackson for JSON binding.
Benefit: It allows us to create REST APIs and interact with users using HTTP requests (like GET, POST).

Spring Boot DevTools


<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-devtools</artifactId>
    <scope>runtime</scope>
</dependency>

Purpose: DevTools enhances the development experience by providing features like auto-restart and live reload. It automatically restarts the application when you make changes to the code, which saves time.
Benefit: It allows faster development cycles by reloading your application when you modify code, making it easier to see changes immediately.

Spring Boot Starter Security


<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-security</artifactId>
</dependency>

Purpose: This dependency brings Spring Security into the project. Spring Security is a powerful and customizable authentication and access-control framework.
Benefit: With Spring Security, we can easily configure authentication, authorization, and protect APIs from unauthorized access.

PostgreSQL Driver

<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
</dependency>

Purpose: This dependency is required to connect Spring Boot to PostgreSQL. It provides the necessary JDBC driver to interact with PostgreSQL databases.
Benefit: It allows the Spring Boot application to connect to a PostgreSQL database to store and retrieve data.

Lombok


<dependency>
    <groupId>org.projectlombok</groupId>
    <artifactId>lombok</artifactId>
    <scope>provided</scope>
</dependency>

Purpose: Lombok helps reduce boilerplate code in Java classes. It generates common methods like getters, setters, constructors, toString(), etc., at compile-time.
Benefit: It makes your code cleaner, with less code to write and maintain.

Configuring `application.properties`

In Spring Boot applications, the application.properties (or application.yml) file is where you configure various properties that control the behavior of the application, such as database connections, server port, logging, and more.

Spring Application Name & Server Port

spring.application.name=demo
server.port=8080

spring.application.name=demo: This sets the name of your Spring Boot application to "demo". It's helpful for logging and monitoring purposes.
server.port=8080: This specifies the port on which your application will run. The default is 8080, but you can change it to any port number you prefer.

Database Configuration (PostgreSQL)

spring.datasource.url=jdbc:postgresql://localhost:5432/customer
spring.datasource.username=postgres
spring.datasource.password=12345

spring.datasource.url: This is the JDBC URL for connecting to the PostgreSQL database. Here, localhost represents the database host, 5432 is the default PostgreSQL port, and customer is the database name.
spring.datasource.username: The username to connect to the PostgreSQL database. In this case, it’s postgres.
spring.datasource.password: The password for the postgres user in the database. Change this to match your database credentials.

Hibernate JPA Configuration

spring.jpa.hibernate.ddl-auto=update
spring.datasource.driver-class-name=org.postgresql.Driver
spring.jpa.database-platform=org.hibernate.dialect.PostgreSQLDialect

spring.jpa.hibernate.ddl-auto=update: This setting tells Hibernate how to manage the database schema based on the entities in your application. Common options are:
- none: No schema management.
- update: Updates the schema to match the entities.
- create: Drops and creates the schema every time the application starts.
- create-drop: Like create, but also drops the schema when the application shuts down.
- validate: Validates the schema but does not modify it.
In production, avoid create and update: create drops existing tables, and update can apply unintended schema changes. Prefer validate or none there.
spring.datasource.driver-class-name: Specifies the JDBC driver for PostgreSQL.
spring.jpa.database-platform: This defines the Hibernate dialect for PostgreSQL. It helps Hibernate generate correct SQL queries for PostgreSQL.

Logging Configuration

logging.file.name=application.log
logging.level.org.springframework.security=DEBUG

logging.file.name=application.log: This sets the file name for the log output. In this case, the logs will be saved to application.log.
logging.level.org.springframework.security=DEBUG: This sets the logging level for Spring Security to DEBUG. It helps in logging detailed information about security-related actions, useful for debugging.

Controller Setup

In Spring Boot, controllers are responsible for handling HTTP requests, processing them, and returning responses. In this section, we’ll walk through the code for a simple UserController class that handles user registration and exposes a placeholder /users endpoint.

UserController Code

Here’s the UserController class:

package com.example.demo.controllers;

import com.example.demo.models.UserModel;
import com.example.demo.repositories.UserRepository;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.security.crypto.password.PasswordEncoder;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

import java.util.UUID;

@RestController
public class UserController {

    @Autowired
    private PasswordEncoder passwordEncoder;

    @Autowired
    private UserRepository userRepository;

    // Placeholder endpoint: currently returns a static greeting, not the user list
    @GetMapping("/users")
    public String getAllUsers() {
        return "Hello World";
    }

    // Endpoint for user registration.
    // Note: GET with query parameters exposes the password in URLs and logs.
    // It is kept here for simplicity; prefer POST with a request body in real applications.
    @GetMapping("/signup")
    public String signup(@RequestParam("UserName") String userName,
                         @RequestParam("Password") String password) {
        var user = UserModel.builder()
                .UserID(UUID.randomUUID().toString()) // Generate a unique ID
                .userName(userName)
                .Password(passwordEncoder.encode(password)) // Hash the password with the configured encoder
                .build();
        userRepository.save(user); // Save the user to the database
        return "Registered Successfully"; // Return a success message
    }
}

Explanation of Code

@RestController: This annotation marks the class as a REST controller. It means that each method in this class will return data directly to the HTTP response (i.e., no need to render views like with traditional controllers).
@Autowired: This annotation is used to automatically inject the dependencies into the class. In this case, we inject:
- PasswordEncoder: To hash the user's password before saving it to the database.
- UserRepository: This handles interactions with the database to save and retrieve user data.

Repository and Model Setup

In this section, we will set up two crucial components in our Spring Boot application: UserModel and UserRepository.

UserModel Code

The UserModel class represents the structure of the User entity, which is mapped to a table in the database. Here's the code:

package com.example.demo.models;

import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.Table;
import lombok.*;

@Entity
@Table(name = "Users")
@AllArgsConstructor
@NoArgsConstructor
@Getter
@Setter
@Builder
public class UserModel {
    @Id
    private String UserID; // Primary Key for the User

    private String userName; // Username for the user

    private String Password; // Password for the user (encoded)
}

Explanation of `UserModel` Class

@Entity: This annotation marks the class as a JPA entity. JPA (Java Persistence API) is used to interact with the database. The UserModel class will be mapped to a table in the PostgreSQL database.
@Table(name = "Users"): This annotation specifies the name of the table in the database to which this entity will be mapped. In this case, the table is named Users.
@Id: This annotation marks the UserID field as the primary key for the entity. Each user in the Users table will have a unique UserID.
Lombok Annotations:
- @AllArgsConstructor: Generates a constructor that accepts all fields as arguments.
- @NoArgsConstructor: Generates a no-argument constructor.
- @Getter & @Setter: Automatically generates getter and setter methods for each field.
- @Builder: This generates a builder pattern for the UserModel class, which allows us to easily create instances of UserModel with method chaining.

UserRepository Code

The UserRepository interface is responsible for interacting with the database. It provides CRUD operations for UserModel objects. Here’s the code:

package com.example.demo.repositories;

import com.example.demo.models.UserModel;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.stereotype.Repository;

@Repository
public interface UserRepository extends JpaRepository<UserModel, String> {
    public UserModel findByUserName(String UserName);
}

Explanation of `UserRepository` Interface

@Repository: This annotation marks the interface as a repository bean. It is optional on interfaces that extend JpaRepository, because Spring Data detects them automatically, but it documents the intent.
JpaRepository: The JpaRepository interface provides several methods for working with the UserModel entity, such as saving, finding, deleting, and updating records in the database. The generic parameters specify:
- UserModel: The type of the entity.
- String: The type of the primary key for the UserModel entity (UserID).
findByUserName(String UserName): This is a derived query method. Spring Data JPA parses the method name and generates a query that matches on the entity's userName property. It returns the user associated with the provided username, or null if no such user exists.

Database Mapping in PostgreSQL

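When Spring Boot runs with this configuration, it will map the UserModel class to a Users table in the PostgreSQL database. This table will have columns for UserID, userName, and Password. Depending on the configured naming strategy, the physical table and column names may be normalized, for example to lowercase with underscores. The database will store each user's information, including their hashed password.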

Custom User Details Service

In this section, we will implement a CustomUserDetailService that integrates Spring Security with our UserModel and ensures proper authentication and authorization.

CustomUserDetailService Code

The CustomUserDetailService class implements UserDetailsService and provides a custom way to load a user's details for Spring Security authentication. Here’s the code:

package com.example.demo.services;

import com.example.demo.repositories.UserRepository;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.security.core.userdetails.User;
import org.springframework.security.core.userdetails.UserDetails;
import org.springframework.security.core.userdetails.UserDetailsService;
import org.springframework.security.core.userdetails.UsernameNotFoundException;
import org.springframework.stereotype.Service;

@Service
public class CustomUserDetailService implements UserDetailsService {
    @Autowired
    private UserRepository userRepository;

    @Override
    public UserDetails loadUserByUsername(String username) throws UsernameNotFoundException {
        var user = userRepository.findByUserName(username);  // Fetch user from the database

        if (user == null) {  // If no user is found, throw UsernameNotFoundException
            throw new UsernameNotFoundException("User not found: " + username);
        }

        // Returning a UserDetails object with the user info
        return User.builder()
                .username(user.getUserName())  // Set username
                .password(user.getPassword())  // Set password (already encoded)
                .roles("USER")  // Set the role for this user
                .build();
    }
}

Explanation of `CustomUserDetailService` Class

@Service: This annotation marks the class as a service that will be managed by Spring’s dependency injection container. It allows Spring to inject this service into other components like controllers and security configurations.
UserDetailsService Interface: This is a Spring Security interface that contains a method loadUserByUsername which is used to fetch user details from a database based on the username provided. It is a core interface used by Spring Security for authentication.
loadUserByUsername Method:
- This method takes a username as input and returns a UserDetails object. It is responsible for fetching the user information from the database.
- The userRepository.findByUserName(username) call fetches the UserModel from the database by matching on the userName property.
- If no user is found, a UsernameNotFoundException is thrown.
- The User.builder() creates a User object, which is a Spring Security class that implements UserDetails. This object contains the username, password (which is already encoded), and roles for the user. In this case, the user has a role of "USER".
UserDetails: This is an interface in Spring Security that represents the user's information (like username, password, authorities, etc.) for authentication and authorization purposes.

Role of `CustomUserDetailService` in Spring Security

Authentication: When a user tries to log in, Spring Security will call the loadUserByUsername method to fetch the user's details from the database. The returned UserDetails object is then used to authenticate the user.
Authorization: Based on the roles assigned to the user (in this case, "USER"), Spring Security can authorize or deny access to specific resources within the application.

Configuring Spring Security

In this section, we will configure Spring Security to secure the application and control user access. We will do this by customizing the security filter chain and using Spring's authentication manager.

SecurityConfig Code

Here’s the SecurityConfig class where we configure Spring Security for the application:

package com.example.demo.configs;

import com.example.demo.services.CustomUserDetailService;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.security.authentication.AuthenticationManager;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.authentication.configuration.AuthenticationConfiguration;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.core.userdetails.UserDetailsService;
import org.springframework.security.crypto.bcrypt.BCryptPasswordEncoder;
import org.springframework.security.crypto.password.PasswordEncoder;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
@EnableWebSecurity
public class SecurityConfig {

    @Bean
    public SecurityFilterChain securityFilterChain(HttpSecurity http) throws Exception {
        http
                .csrf(Customizer.withDefaults())  // CSRF protection
                .formLogin(Customizer.withDefaults())  // Enable Form-based Authentication
                .authorizeHttpRequests(authorize -> authorize
                        .requestMatchers("/signup").permitAll()  // Allow access to /signup without authentication
                        .anyRequest().authenticated()  // Secure all other endpoints
                );

        return http.build();
    }

    @Bean
    public AuthenticationManager authenticationManager(AuthenticationConfiguration configuration) throws Exception {
        return configuration.getAuthenticationManager();
    }

    @Bean
    public static PasswordEncoder passwordEncoder() {
        // Use BCryptPasswordEncoder to hash passwords
        return new BCryptPasswordEncoder();
    }
}

Explanation of `SecurityConfig` Class

@Configuration: This annotation marks the class as a source of bean definitions. It is used to configure Spring Beans in the context.
@EnableWebSecurity: This annotation enables Spring Security's web security support in the application and lets this configuration class contribute security beans such as the SecurityFilterChain defined in this class.
SecurityFilterChain Bean:
- This bean is responsible for configuring the HTTP security of the application.
- CSRF Protection: http.csrf(Customizer.withDefaults()) enables Cross-Site Request Forgery (CSRF) protection by default. This is a critical security feature for web applications.
- Form-Based Authentication: http.formLogin(Customizer.withDefaults()) enables form-based authentication, where users will be required to provide their username and password to access secured endpoints.
- Authorization Rules: authorizeHttpRequests(authorize -> authorize...) defines access control rules for HTTP requests:
  - .requestMatchers("/signup").permitAll(): This allows unrestricted access to the /signup endpoint.
  - .anyRequest().authenticated(): This restricts access to all other endpoints and requires authentication.
AuthenticationManager Bean:
- This bean is responsible for authenticating the user. Spring Security uses this to authenticate the user when they log in.
PasswordEncoder Bean:
- BCryptPasswordEncoder: This bean hashes the password before it is stored in the database and verifies submitted passwords against the stored hash during authentication.
- We have used BCryptPasswordEncoder here because BCrypt is a salted, adaptive hashing function: its work factor can be raised over time to keep brute-force attacks expensive.

How Spring Security Filters Requests

Request Flow:
- When a request is made, Spring Security intercepts the request and checks the security configurations (e.g., authentication and authorization).
- If the request is for a public endpoint like /signup, it is permitted without authentication.
- If the request is for any other endpoint, it is secured and requires authentication. Spring Security checks if the user is logged in. If not, it will redirect the user to the login page.
Authentication:
- When a user submits their login credentials (username and password), Spring Security uses the CustomUserDetailService to fetch the user’s details from the database.
- The password is compared using the PasswordEncoder (in our case, BCryptPasswordEncoder) to ensure the credentials are valid.
- If the credentials are valid, the user is authenticated and allowed to access the requested resource.
Authorization:
- Once authenticated, Spring Security reads the roles from the UserDetails object returned by CustomUserDetailService and checks whether they grant access to the requested resource.
- In our case, we have a basic role "USER". You can add more roles if needed, and Spring Security can be configured to allow or deny access to different parts of the application based on roles.

Next Steps

Now that we have configured basic security for the Spring Boot application, here are some potential next steps you can take to enhance it:

Add JWT Authentication: Implement JSON Web Tokens (JWT) for stateless authentication instead of relying on session-based authentication.
Role-Based Access Control (RBAC): Extend the roles and permissions structure, allowing fine-grained access control for various parts of your application (e.g., allowing only users with the ADMIN role to access certain pages).
Two-Factor Authentication (2FA): Implement an additional layer of security by requiring users to verify their identity via a second factor (such as an OTP).
Rate Limiting: Protect your application against brute-force attacks by adding rate-limiting on endpoints such as login.
Logging and Monitoring: Use Spring Security's logging capabilities to monitor login attempts and failed authentication events. You can also set up alerts to notify you of suspicious activities.

Understanding EMR Architecture: Key Components, Configuration Options, and Scaling Strategies

Mohammad Arsalan — Thu, 20 Mar 2025 08:26:21 GMT

Introduction to EMR Architecture

Amazon EMR (Elastic MapReduce) is a cloud-native, fully managed service provided by AWS (Amazon Web Services) for processing large volumes of data quickly and cost-effectively. It enables the distributed processing of vast amounts of data across a scalable cluster of virtual machines (EC2 instances), making it an essential tool for big data processing, data analysis, and machine learning tasks.

Key Purposes and Use Cases of Amazon EMR:

Big Data Processing:
- EMR is designed to run distributed data processing frameworks such as Hadoop and Spark. These frameworks can process petabytes of data in parallel across many EC2 instances, ensuring fast and efficient computations.
- Common use cases include batch processing, data transformation, and machine learning tasks at scale.
Data Storage and Analysis with S3 and HDFS:
- EMR can leverage Amazon S3 as a storage system for input and output data. This integration makes it easy to manage large datasets stored in S3 while processing them using distributed computing on EMR.
- It also supports HDFS (Hadoop Distributed File System) if you want to store data locally within the EMR cluster.
Cost Optimization with On-Demand and Spot Instances:
- EMR allows you to scale your clusters up or down based on demand. You can provision clusters with a mix of On-Demand and Spot Instances, ensuring you get the most cost-effective performance for your processing needs.
- Spot Instances enable you to take advantage of unused EC2 capacity at a lower price, significantly reducing costs for data processing jobs.
Real-time Stream Processing:
- With Apache Kafka and Spark Streaming, EMR can process data in real-time, making it ideal for use cases like log analysis, clickstream analysis, and IoT data processing, where timely insights are critical.
Machine Learning at Scale:
- EMR supports frameworks like Apache Spark MLlib, TensorFlow, and other machine learning libraries to process large datasets and build machine learning models in a distributed environment.
- Using EMR for machine learning allows businesses to handle large volumes of data and perform model training and inference at scale.
Easy Integration with AWS Services:
- EMR is fully integrated with other AWS services, such as AWS Lambda, AWS Glue, Amazon RDS, Amazon Redshift, and more, making it easy to orchestrate end-to-end data processing pipelines.
- It also integrates with AWS CloudWatch for monitoring and AWS IAM for access control and security.
Fault Tolerance and Scalability:
- EMR provides high availability and fault tolerance. If a task or instance fails, EMR can automatically recover from failures by re-running tasks on other instances.
- The service is also scalable, allowing users to increase or decrease the number of instances in the cluster as per their workload requirements.
Simplified Cluster Management:
- EMR eliminates the need for managing infrastructure manually. AWS automatically takes care of cluster provisioning, configuration, and tuning, letting users focus on their data processing and analysis tasks.
- It supports auto-scaling, so clusters can automatically expand or shrink based on workload demands.

Key integration options that EMR offers:

Amazon S3 (Simple Storage Service)
Amazon RDS (Relational Database Service)
Amazon Redshift
Amazon DynamoDB
AWS Lambda
Amazon CloudWatch
AWS Glue
Amazon Kinesis
Amazon OpenSearch Service (formerly Amazon Elasticsearch Service)
Apache HBase (via Amazon EMR)
Amazon SageMaker
Apache Kafka
Apache Hive and Apache HCatalog
AWS IAM (Identity and Access Management)
AWS Data Pipeline
AWS Step Functions
Third-Party Tools (e.g., Jupyter Notebooks, Apache Zeppelin)

The Building Blocks: Primary, Core, and Task Nodes

The building blocks of Amazon EMR (Elastic MapReduce) represent the key components that make up an EMR cluster. These components work together to process large-scale data efficiently and cost-effectively.

Let's dive deeper into each of these building blocks:

Cluster

Definition: An EMR cluster is a collection of Amazon EC2 instances that work together to process and analyze large datasets. A cluster can consist of different types of nodes (primary, core, task nodes) that serve various purposes.
Cluster Setup: When creating an EMR cluster, you can choose the number and type of EC2 instances for each node, configure software (like Hadoop or Spark), and select additional options like storage and scaling methods.
Cluster Lifecycle: EMR clusters can be provisioned on-demand, and you have full control over the cluster lifecycle, including scaling, termination, and instance configurations.

Nodes

The fundamental unit of an EMR cluster is the node. Each node in an EMR cluster runs a portion of the distributed data processing. There are three types of nodes in an EMR cluster:
- Primary Node (Master Node):
  - Role: The master node is responsible for managing the cluster's overall operation. It runs the cluster's coordinating services, such as the YARN ResourceManager, which allocates resources and distributes tasks across the cluster.
  - Responsibilities: It tracks the health of other nodes, schedules jobs, and manages job execution across the cluster. The master node also manages the cluster configuration and keeps track of the logs and results.
- Core Nodes:
  - Role: Core nodes perform the actual data processing. They run the worker tasks (such as mappers and reducers in Hadoop) and store the data in the cluster's HDFS (Hadoop Distributed File System).
  - Responsibilities: These nodes handle the core tasks of computation and data storage, and they are typically required for the cluster to function. Because they hold HDFS data, losing core nodes can cause data loss as well as reduced performance.
- Task Nodes (Optional):
  - Role: Task nodes are optional nodes that provide additional computational resources for performing tasks such as running map-reduce jobs. They don't store data but act as extra compute capacity for specific tasks.
  - Responsibilities: Task nodes only run computations and are transient, meaning they can be added or removed dynamically from the cluster to handle fluctuations in workload or processing capacity.
Executors

When you run a Spark job on Amazon EMR, the Executor runs on the core nodes (or task nodes, if used). Here's how it fits into the EMR ecosystem:
- The Primary node (Master node) coordinates the cluster and assigns tasks to executors.
- The Core nodes run executors, executing the actual Spark jobs.
- If Task nodes are used, they also run executors to provide additional computational capacity when needed.
HDFS (Hadoop Distributed File System)
- Definition: HDFS is the distributed file system that is used by Hadoop (and Spark when running in a Hadoop-compatible mode) for storing large datasets across multiple nodes.
- Role in EMR: EMR leverages HDFS (or optionally Amazon S3 as storage) to distribute data across the cluster so that it can be processed in parallel by the nodes. HDFS is designed for high throughput and fault tolerance, enabling data to be replicated across nodes in the cluster.
Amazon S3 (Simple Storage Service)
- Definition: While HDFS is typically used for data storage within an EMR cluster, Amazon S3 is commonly used for long-term storage of input/output data.
- Integration with EMR: You can use Amazon S3 to store data that is used for batch processing or streaming, as well as for storing the results of data processing jobs. It's a highly scalable and durable storage system, and EMR clusters can be configured to read and write data directly to/from S3.
- Storage Flexibility: Unlike HDFS, S3 is more flexible and cost-efficient for storing large volumes of data without the overhead of managing local storage.
Resource Manager
- Definition: The Resource Manager is the component of the cluster responsible for managing resources across the nodes. It allocates resources to the various tasks that need to run in the cluster.
- Examples:
  - YARN (Yet Another Resource Negotiator): For managing resources in a Hadoop ecosystem, YARN is responsible for resource management and job scheduling.
  - Spark’s Driver Program: For Spark applications, the Spark driver requests executors from the cluster's resource manager (YARN on EMR) and schedules the application's tasks onto them.

Use cases for each node type

Primary Node (Master Node): The Primary Node is the central coordination unit of the EMR cluster. It doesn't handle data storage but is responsible for managing the cluster's lifecycle, running resource management services, and coordinating the execution of tasks. The Primary Node manages job scheduling and distributes tasks across the other nodes.
Core Node: The Core Nodes are responsible for data storage and data processing in the cluster. These nodes are the heart of the EMR cluster because they handle both the computation and store the data in HDFS (Hadoop Distributed File System). Core nodes store actual data that is being processed, utilizing HDFS for distributed storage. Each core node keeps part of the data and ensures redundancy and fault tolerance by replicating data blocks across other core nodes in the cluster.
Task Node: Task Nodes are optional nodes that can be added to an EMR cluster to provide additional computational resources for running tasks but without storing data. These nodes are typically used for scaling the cluster based on computational needs. Task nodes provide additional compute capacity when there is a need to process large datasets, and they are usually added when the workload increases beyond the capability of the core nodes.
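
As a rough illustration of how the three node types map onto a cluster definition, the sketch below builds the instance-group list that boto3's EMR `run_job_flow` call accepts under `Instances["InstanceGroups"]`. The instance types, the counts, and the choice of Spot capacity for task nodes are assumptions made for illustration, not settings taken from this article. Note that the API still uses the older role name `MASTER` for the primary node.

```python
from typing import Any, Dict, List


def build_instance_groups(core_count: int, task_count: int = 0) -> List[Dict[str, Any]]:
    """Return an InstanceGroups list: one primary node, core nodes, optional task nodes."""
    groups: List[Dict[str, Any]] = [
        {
            "Name": "Primary",
            "InstanceRole": "MASTER",      # coordinates the cluster
            "Market": "ON_DEMAND",
            "InstanceType": "m5.xlarge",   # assumed instance type
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",        # runs tasks and stores HDFS data
            "Market": "ON_DEMAND",         # kept On-Demand because these nodes hold data
            "InstanceType": "m5.xlarge",   # assumed instance type
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append(
            {
                "Name": "Task",
                "InstanceRole": "TASK",    # compute only, no HDFS storage
                "Market": "SPOT",          # assumed: Spot suits nodes that can come and go
                "InstanceType": "m5.xlarge",
                "InstanceCount": task_count,
            }
        )
    return groups
```

The result would be passed as `Instances={"InstanceGroups": build_instance_groups(2, 4)}` alongside the cluster name, release label, and IAM roles that `run_job_flow` also requires.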

Node Type	Role	Use Case	Storage	Compute
Primary Node	Master node responsible for coordination, management, and resource allocation	- Manages job scheduling and resource allocation

Configuring AWS EventBridge with Lambda: A Step-by-Step Guide

Mohammad Arsalan — Wed, 09 Oct 2024 15:18:01 GMT

Introduction

AWS EventBridge is a powerful event bus service that allows you to connect different AWS services and applications using events. By integrating EventBridge with AWS Lambda, you can create automated workflows that respond to changes in your cloud environment. This article will guide you through the steps to configure EventBridge and a Lambda function to respond to a recurring schedule.

Steps to Configure EventBridge and Lambda Function

Step 1: Create a Rule in EventBridge

Log in to the AWS Management Console: Navigate to the EventBridge service.
Go to Rules: In the left-hand menu, click on "Rules."
Create a New Rule:
- Click on the "Create rule" button.
- Enter a name and description for your rule.

Step 2: Set Up a Recurring Schedule

Select Schedule: Choose the option for "Schedule" under the rule type.
Choose Cron-based Schedule:
- Select "Cron expression" for a more flexible scheduling option.
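- Enter your desired cron expression (e.g., cron(0 12 * * ? *) for every day at 12:00 noon UTC, the time zone used by EventBridge rules). The expression has six fields: minutes, hours, day of month, month, day of week, and year.
Flexible Time Window: If the console offers a flexible time window (an EventBridge Scheduler setting), choose whether the target may be invoked at any point within a window after the scheduled time, or turn it off to run at the exact scheduled time.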
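
To make the cron format above more concrete, here is a small sketch that splits an EventBridge-style expression into its six named fields. The function name and field labels are my own; the check that exactly one of the two day fields is `?` reflects my understanding of the EventBridge rule and should be confirmed against the AWS documentation.

```python
FIELD_NAMES = ("minutes", "hours", "day_of_month", "month", "day_of_week", "year")


def split_cron(expression: str) -> dict:
    """Split 'cron(<six fields>)' into a dict keyed by field name."""
    expression = expression.strip()
    if not (expression.startswith("cron(") and expression.endswith(")")):
        raise ValueError("expected the form cron(<six fields>)")

    fields = expression[len("cron("):-1].split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected 6 fields, got {len(fields)}")

    parsed = dict(zip(FIELD_NAMES, fields))

    # Assumed rule: one of day_of_month / day_of_week must be '?', but not both.
    if (parsed["day_of_month"] == "?") == (parsed["day_of_week"] == "?"):
        raise ValueError("use '?' in exactly one of day_of_month and day_of_week")
    return parsed
```

For the example expression, `split_cron("cron(0 12 * * ? *)")` yields minutes `0`, hours `12`, and `?` for the day of week.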

Step 3: Create the Lambda Function

Navigate to the Lambda Service: In the AWS Management Console, go to the Lambda service.
Create a New Function:
- Click on the "Create function" button.
- Choose "Author from scratch."
- Enter a name for your Lambda function and select the appropriate runtime (e.g., Python, Node.js).
- Set the necessary execution role for the function.
Write Your Lambda Code: In the function editor, write the code that you want to execute when the event is triggered.
Deploy the Lambda Function: Click on "Deploy" to save your changes.
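
For the "Write Your Lambda Code" step, a minimal handler for the Python runtime might look like the sketch below. It assumes the event is the standard scheduled-event payload that EventBridge sends (with `time` and `resources` keys); the response body is purely illustrative, so replace the logging line with your actual task.

```python
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    """Entry point invoked by the EventBridge schedule."""
    # 'resources' lists the ARN of the rule that fired; 'time' is the scheduled time (UTC).
    rule_arns = event.get("resources", [])
    scheduled_time = event.get("time")

    logger.info("Triggered by %s at %s", rule_arns, scheduled_time)

    # Placeholder for the real work (cleanup job, report generation, and so on).
    return {
        "statusCode": 200,
        "body": json.dumps({"triggered_by": rule_arns, "scheduled_time": scheduled_time}),
    }
```

With the default handler setting (`lambda_function.lambda_handler`), this code goes in `lambda_function.py` in the function editor.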

Step 4: Attach Lambda Function to EventBridge Rule

Return to EventBridge: Go back to the EventBridge rule you created earlier.
Configure Targets:
- In the "Targets" section, click on "Add target."
- From the "Target type" dropdown, select "Lambda function."
- Choose the Lambda function you created earlier from the dropdown menu.
Configure Permissions: Ensure that EventBridge has the necessary permissions to invoke your Lambda function. This is typically handled automatically when you attach the function, but you can check the IAM role permissions if needed.
Create the Rule: Finally, click on "Create rule" to activate the event.
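
The console normally adds the invoke permission for you, but when the rule is wired up from a script the permission has to be granted explicitly. The sketch below only builds the arguments for the Lambda `add_permission` API; the statement ID is an arbitrary name chosen for this example, and the function name and rule ARN are whatever you created in the earlier steps.

```python
def build_invoke_permission(function_name: str, rule_arn: str) -> dict:
    """Arguments that let a specific EventBridge rule invoke a Lambda function."""
    return {
        "FunctionName": function_name,
        "StatementId": "allow-eventbridge-invoke",  # assumed name; must be unique per function policy
        "Action": "lambda:InvokeFunction",
        "Principal": "events.amazonaws.com",        # the EventBridge service principal
        "SourceArn": rule_arn,                      # restricts the grant to this one rule
    }


# With boto3 installed and credentials configured, the call would be:
#   boto3.client("lambda").add_permission(**build_invoke_permission(name, arn))
```

Scoping the grant with `SourceArn` keeps other rules in the account from invoking the function.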

Conclusion

By following these steps, you have successfully configured AWS EventBridge to trigger a Lambda function on a recurring schedule. This integration allows you to automate tasks and respond to events efficiently, making your cloud applications more dynamic and responsive. Experiment with different cron expressions and Lambda functionalities to further enhance your workflows!

AWS Transfer Family 101: Establishing an SFTP Server in the Cloud

Mohammad Arsalan — Sun, 22 Sep 2024 09:43:30 GMT

Overview

Comparing JSCAPE Server and AWS Transfer Family: Pros and Cons

Feature	JSCAPE Server	AWS Transfer Family
Setup Complexity	Requires manual installation and configuration.	Fully managed service, easy to set up via AWS Console.
Cost	Licensing costs can be high; ongoing maintenance.	Pay-as-you-go pricing; no upfront costs.
Scalability	Limited by server resources; requires manual scaling.	Automatically scales with demand, no server management needed.
Security	Offers various security features but requires configuration.	Integrated with AWS security services (IAM, KMS, etc.) for robust security.
Protocol Support	Supports multiple protocols (SFTP, FTP, HTTP, etc.).	Primarily focused on SFTP, FTPS, and FTP.
Monitoring and Reporting	Basic monitoring tools included.	Integrated with AWS CloudWatch for detailed monitoring.
Customizability	Highly customizable; can be tailored for specific needs.	Limited customizability; focused on standard use cases.
Maintenance	Requires ongoing maintenance and updates.	No maintenance; AWS handles updates and availability.
Integration	Can integrate with various systems but may require more setup.	Seamlessly integrates with other AWS services (S3, Lambda, etc.).
User Management	Manual user management; can be cumbersome.	Managed via AWS IAM; easier and more secure user management.

Setting Up Your S3 Bucket

Steps to Configure AWS Transfer Family for SFTP

IAM Policy Creation Steps

Generating Public and Private Keys

Steps to Create Users on AWS Transfer Family Server

Connecting MobaXterm to AWS Transfer Family

Setting Up Your Android Phone as a Web Server: From SSH to Cloudflare Integration

Mohammad Arsalan — Fri, 09 Aug 2024 18:07:53 GMT

Introduction:

Unlock the full potential of your Android phone by transforming it into a web server. This guide walks you through the essential steps to set up and manage your server directly from your mobile device. You’ll start by creating an SSH key pair, sending the public key to your Android device, and installing Termux, a powerful terminal emulator. After updating and linking your storage in Termux, you’ll configure SSH access by adding your public key to the authorized_keys file and starting the SSH daemon. Once connected to your Android phone via SSH, you’ll run your server with a simple command and install Cloudflare to secure and expose your server to the public. Follow these detailed instructions to get your web server up and running, with the added benefit of Cloudflare’s robust DNS and security features.

Create SSH Private/Public Key Pair

Generate an SSH key pair on your primary machine. This consists of a private key, which stays secure on your machine, and a public key, which you’ll share with your Android device. This key pair enables encrypted communication between your devices.

ssh-keygen -t ed25519 -f id_ed25519_android

Running the command generates two files: the private key (id_ed25519_android) and the public key (id_ed25519_android.pub). Send the public key to your phone and keep the private key on your computer in C:\Users\Account\.ssh.

Send SSH Public Key to Android Device

Transfer the public SSH key to your Android device. This step is crucial for establishing a secure connection and allows your machine to access the Android phone without needing to enter a password every time.

Install Termux on Android

Download and install Termux from the Google Play Store or F-Droid. Termux provides a powerful terminal emulator and Linux environment on your Android device, which is essential for running server software and executing commands.

Update/Upgrade Packages in Termux

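Open Termux, update its package list, and upgrade installed packages using the pkg update and pkg upgrade commands. Keeping your software up to date ensures you have the latest features and security patches. The remaining commands install the tools used later in this guide (Git, Node.js, OpenSSH, and networking utilities).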

pkg update
pkg upgrade
apt update
apt upgrade
pkg install git
pkg install nodejs-lts 
pkg install openssh
pkg install iproute2
pkg install nmap

Link Android Storage to Termux

Grant Termux access to your Android device’s storage using the termux-setup-storage command. This step allows you to access files and directories on your device, which can be crucial for project files and data.

Add SSH Public Key to `authorized_keys`

Use the command below to append the transferred public key to authorized_keys (run it from the directory that contains the key file):

cat id_ed25519_android.pub >> ~/.ssh/authorized_keys

chmod 600 ~/.ssh/authorized_keys

Start SSH Daemon

Launch the SSH daemon on your Android device by running the sshd command in Termux. This will enable your phone to accept SSH connections, allowing remote access and management.

Get IP Address of Android Device

Find the IP address of your Android device by executing ip addr or ifconfig in Termux. You’ll need this IP address to connect to your phone from your primary machine. Note: If the commands above do not show the IP address, you can find it under Settings > IP address & Port, which requires Wireless Debugging to be enabled (Developer Options > Wireless Debugging).

Connect to SSH Server from Your Machine

Use an SSH client on your main machine to connect to your Android phone using the IP address obtained earlier. This connection allows you to manage your Android-based web server remotely.

ssh -i ~/.ssh/id_ed25519_android 192.168.11.123 -p 8022

If you get disconnected at any point, run the sshd command in Termux to start the SSH daemon again.

To work efficiently in the SSH client terminal on your laptop, enable copy-paste: right-click the terminal > Properties, then check the copy-paste checkbox.
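
If the SSH connection fails, it can help to confirm first that the phone is reachable and that sshd is listening. The helper below is a small sketch using only Python's standard library; the function name is my own, and the IP address and port in the usage note are the example values from the command above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `is_port_open("192.168.11.123", 8022)` returning False suggests that sshd is not running on the phone or that the two devices are not on the same network.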

Run Your Server

On your Android device, navigate to the directory containing your project and run node index.js (or the appropriate command for your server setup). This will start your server and make it accessible via your phone.
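
If you do not have a Node.js project at hand, a stand-in server is enough to try out the remaining steps. The sketch below uses Python's standard library rather than Node.js, which is a substitution on my part: it assumes Python is installed in Termux, and it uses port 3000 to match the tunnel command later in this guide.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Answers every GET request with a short plain-text message."""

    def do_GET(self):
        body = b"Hello from Termux\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 3000) -> HTTPServer:
    """Create a server bound to localhost on the given port."""
    return HTTPServer(("127.0.0.1", port), HelloHandler)


# To start serving, call: make_server().serve_forever()
```

Binding to 127.0.0.1 keeps the server private to the phone, so it only becomes public once a tunnel points at it.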

Install Cloudflare on Android

Install Cloudflare’s tunneling service by running pkg install cloudflared in Termux. This tool will help you expose your local server to the internet securely.

Create Cloudflare Tunnel

Set up a Cloudflare tunnel to point to your local server. This configuration will make your Android-hosted server accessible via a publicly accessible domain, leveraging Cloudflare’s security and performance features.

cloudflared tunnel --url http://localhost:3000

Use Cases of SQL Server Integration Service

Mohammad Arsalan — Tue, 09 Jul 2024 13:53:07 GMT

Importing Files to SQL Server

Import CSV files to SQL Server: SQL Server Integration Services (SSIS) provides a straightforward way to import CSV files into SQL Server databases. Using SSIS, you can create a package that includes a Flat File Source component to read the CSV file and a SQL Server Destination component to write the data into a database table. SSIS handles the parsing of CSV files, ensuring that data types are correctly inferred and data is loaded efficiently.
Load fixed width files to SQL Server: SSIS supports loading fixed width files into SQL Server databases by defining columns based on their positions within the file. You can use the Flat File Source component in SSIS to specify the column widths and data types, ensuring that each field is correctly extracted from the fixed width file. This process ensures accurate data loading and transformation into SQL Server tables.
Load Excel files into SQL Server: SSIS simplifies the process of loading Excel files into SQL Server databases. With SSIS, you can configure an Excel Source component to read data from Excel spreadsheets, specifying the sheet name, range of cells, and data types. The data can then be transformed and loaded into SQL Server tables using a SQL Server Destination component. SSIS handles Excel data types and formats, ensuring compatibility and reliability in data integration tasks.
Load XML files into SQL Server: SSIS facilitates the loading of XML files into SQL Server databases through its XML Source component. You configure the XML Source with the XML file and an XSD schema that describes it, and the component exposes the elements and attributes as relational outputs. SSIS can then transform this data and load it into SQL Server tables using the SQL Server Destination component. This process enables structured and efficient handling of XML data within SQL Server Integration Services.
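
SSIS packages are built in a designer rather than written as code, so the following is not SSIS itself. It is a small Python sketch of what "defining columns based on their positions" means for the fixed width case described above. The column names and offsets are a hypothetical layout.

```python
from typing import Dict, List, Tuple

# (column name, start offset, end offset) for a hypothetical file layout
LAYOUT: List[Tuple[str, int, int]] = [
    ("id", 0, 5),
    ("name", 5, 25),
    ("amount", 25, 35),
]


def parse_fixed_width_line(line: str, layout: List[Tuple[str, int, int]] = LAYOUT) -> Dict[str, str]:
    """Cut one line into named fields by character position and trim the padding."""
    return {name: line[start:end].strip() for name, start, end in layout}
```

In the Flat File connection manager you supply the same information, a width per column, and SSIS performs the equivalent slicing for every row.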

Exporting Data from SQL Server

Export data to CSV files: SSIS can export data from SQL Server databases to CSV files. In a Data Flow Task, configure a source component such as the OLE DB Source to retrieve data from the database and a Flat File Destination component to write it to a CSV file. The associated Flat File Connection Manager lets you specify the delimiter (such as a comma) and text qualifier, so the exported file is compatible with the applications and systems that consume it.
Export data to fixed width files: SSIS supports exporting data from SQL Server databases to fixed width files, where each column's width is predefined. In a Data Flow Task, configure a source component such as the OLE DB Source to fetch data and a Flat File Destination component to write it to the file. In the Flat File Connection Manager you select the fixed width format and define the width of each column, and in the destination you map the source columns to them, so that the data is exported according to the fixed width specification.
Export data to Excel files: SSIS can export data from SQL Server databases to Excel files using the Excel Destination component. In a Data Flow Task, configure a source component such as the OLE DB Source to retrieve data and an Excel Destination component to write it into an Excel spreadsheet. You specify the target sheet and map the source columns to the sheet's columns. This approach provides flexibility in exporting SQL Server data to Excel for reporting and analysis purposes.

Data Transfer Operations

Copy data between SQL Server instances using SSIS: SQL Server Integration Services (SSIS) transfers data between different SQL Server instances through its Data Flow Task. In this task you configure a source component such as the OLE DB Source to extract data from the source instance and a destination such as the OLE DB Destination or SQL Server Destination to load it into the destination instance. SSIS supports column mapping, transformations, error handling, and performance optimizations to ensure reliable and efficient data migration across SQL Server environments.
Execute SQL tasks in SSIS: SSIS enables the execution of SQL tasks within its Control Flow, providing capabilities to perform various database operations such as executing SQL statements, running stored procedures, and managing transactions. The Execute SQL Task allows you to connect to SQL Server databases, execute commands, capture results, handle errors, and integrate these tasks seamlessly with other SSIS components. This feature enhances automation and flexibility in managing SQL operations as part of larger data integration workflows.

File System Operations

File System Tasks in SSIS: SSIS File System Tasks allow integration workflows to interact with files and directories on the operating system level. These tasks encompass operations such as copying, moving, renaming, deleting files, creating directories, and setting attributes. They are crucial for automating file-related tasks within SSIS packages, offering robust configuration options to manage file operations seamlessly alongside data integration processes.
Zip and Unzip Files in SSIS: Zipping and unzipping files within a package reduces storage space and makes files easier to transmit. SSIS does not ship a built-in Zip or Unzip task, so compression is typically handled with a Script Task (for example, using the .NET System.IO.Compression classes), an Execute Process Task that calls a command-line archiver, or a third-party task. Extracting archives the same way makes compressed data available for further processing within SSIS workflows.

Advanced Data Handling

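Load multiple Excel sheets using SSIS: SSIS can load data from multiple sheets of a workbook. A common approach is a Foreach Loop container that iterates over the sheet names and stores the current name in a variable, which the Excel Source then uses as its table name so that every sheet passes through the same Data Flow. Alternatively, separate Data Flow Tasks can process different sheets in parallel.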

SSIS Transformations

Aggregate transformation: Computes aggregate values such as SUM, AVG, MIN, MAX on groups of rows.
Row count transformation: Counts rows passing through it and stores the count in a variable.
Data conversion transformation: Converts data from one data type to another.
Character map transformation: Performs character-level operations like changing case or replacing characters based on defined mappings.
Copy column transformation: Copies data from one column to another within the data flow.
Derived column transformation: Creates new columns using expressions based on existing column values.
Multicast transformation: Copies data to multiple outputs for parallel processing.
Conditional split transformation: Routes rows to different outputs based on conditions.
Union all transformation: Combines multiple data flows into a single output without any transformation.
Merge transformation: Combines two sorted datasets into one dataset.
Merge join transformation: Joins two datasets based on matching keys.
Script component transformation: Allows custom transformations using C# or VB.NET scripts.
OLE DB command transformation: Executes SQL commands for each row in the data flow.
Lookup transformation: Performs lookups to retrieve related data from a reference dataset.
Fuzzy lookup transformation: Matches data based on similarity rather than exact matches.
Fuzzy grouping transformation: Groups data based on similarity rather than exact matches for aggregation purposes.

Getting started with SQL Server Integration Services

Mohammad Arsalan — Tue, 09 Jul 2024 12:56:02 GMT

Introduction to SSIS (SQL Server Integration Services)

SQL Server Integration Services (SSIS) is a powerful data integration and transformation tool provided by Microsoft as part of the SQL Server suite. It is used for building data integration and workflow solutions.

Here are several reasons why SSIS is valuable and why we need it:

Data Integration: SSIS allows organizations to integrate data from various sources such as databases, flat files, Excel spreadsheets, and more. This is crucial for businesses that need to consolidate data from multiple systems into a central data warehouse or data lake.
Data Transformation: SSIS provides a wide range of transformations that can be applied to data as it moves from source to destination. These transformations include cleaning, aggregating, merging, and validating data to ensure it meets the business requirements.
Workflow Orchestration: SSIS enables the creation of complex workflows or data pipelines. These workflows can automate the execution of tasks such as data extraction, transformation, and loading (ETL), making data integration processes more efficient and reliable.
Scalability: SSIS is designed to handle large volumes of data efficiently. It supports parallel processing, which improves performance when dealing with large datasets.
Extensibility: SSIS provides a rich set of tools and APIs that allow developers to extend its capabilities. Custom components can be created to address specific business requirements or integrate with other systems.
Maintenance and Monitoring: SSIS includes features for monitoring package execution, logging events, and handling errors. This helps administrators and developers identify issues quickly and ensure data integrity.
Integration with SQL Server and Microsoft Ecosystem: SSIS integrates seamlessly with SQL Server databases and other Microsoft products such as Azure Data Services, Excel, SharePoint, and Dynamics. This makes it easier to leverage existing investments in Microsoft technologies.
Compliance and Security: SSIS provides features for managing access control, encrypting sensitive data, and ensuring compliance with regulatory requirements such as GDPR or HIPAA.

Core Components of SSIS

In SQL Server Integration Services (SSIS), several key components and features play crucial roles in designing and executing data integration workflows. Let's break down each of these elements:

Control Flow Task

The Control Flow in SSIS defines the workflow or logical structure of tasks that execute in a specified order. Control Flow tasks include operations such as executing SQL commands, running scripts, sending emails, or executing other packages.

Examples of Control Flow tasks:

Execute SQL Task: Executes SQL statements or stored procedures.
Script Task: Runs custom code written in languages like C# or VB.NET.
Data Flow Task: Executes a Data Flow, which moves and transforms data between sources and destinations.
Execute Package Task: Runs another SSIS package as part of the workflow.
Send Mail Task: Sends email notifications during package execution.
File System Task: Performs operations on files and directories, like copying, moving, or deleting.

Data Flow Task

The Data Flow in SSIS is where data transformations occur. It enables the movement, manipulation, and transformation of data between sources and destinations. It consists of sources, transformations, and destinations.

Components of a Data Flow task:

Source: Retrieves data from a source system (e.g., database table, flat file).
Transformations: Modify, clean, aggregate, or join data as it moves through the pipeline.
Destination: Loads transformed data into a target system (e.g., database table, flat file).

Parameters

Parameters in SSIS allow you to pass values at runtime to packages or tasks. They provide flexibility and make packages easier to configure and reuse.

Types of Parameters:

Package Parameters: Defined at the package level and can be used by all tasks within the package.
Project Parameters: Defined at the project level and can be used across packages within the same project.
Environment Variables: Stored in environments in SSISDB (the SSIS catalog) and mapped to project or package parameters, so that deployed packages can be configured for different environments (development, test, production).

Event Handlers

Event Handlers in SSIS are workflows that respond to specific events raised during package execution. They allow you to handle errors, perform additional logging, or execute specific tasks based on the outcome of package events.

Types of Event Handlers:

OnError: Executes when an error occurs during package execution.
OnTaskFailed: Executes when a specific task fails.
OnWarning: Executes when a warning is generated during package execution.
OnPreExecute: Executes just before a task begins execution.
OnPostExecute: Executes immediately after a task finishes running, whether it succeeded or failed.
OnProgress: Executes when a task reports measurable progress during its execution.

Installing SSIS in Visual Studio

To install SSIS in Visual Studio, begin by downloading the Microsoft Data Tools - Integration Services extension from the Visual Studio Marketplace.

Install SSIS Package in Visual Studio:
- Visit Microsoft Data Tools - Integration Services and install the extension.
Create SSIS Project:
- Open Visual Studio and select Create a new project.
- Choose Integration Services Project from the available project templates.
Detailed Installation Guide:
- For comprehensive installation instructions, refer to this instructional video: Installation Guide.

Installing SSIS in SQL Server

To install SSIS in SQL Server, follow these steps:

Download SQL Server:
- Download the SQL Server installer from Microsoft's official website.
  - To create an ISO file, run the installer and select the Download Media option.
Installation Process:
- Open the ISO file and run the setup.
- Select "New SQL Server standalone installation" during setup.
Component Selection:
- In the installation wizard, select the following checkboxes:
  - Database Engine Services
  - Integration Services
  - Scale Out Master
  - Scale Out Worker
Detailed Installation Guide:
- For a detailed walkthrough, watch this instructional video: Installation Guide.
Post-Installation Tips:
- After installation, run SQL Server Management Studio (SSMS) as Administrator to avoid errors when accessing Integration Services features.

Efficient Strategies for Populating Large Datasets in SQL Databases

Mohammad Arsalan — Mon, 15 Apr 2024 09:11:59 GMT

Introduction

Populating large datasets in SQL databases efficiently is a critical task for many applications, ranging from data warehousing to analytics platforms. However, inserting a large amount of data can be challenging and may impact database performance if not done properly. In this article, we'll explore strategies for efficiently populating large datasets in SQL databases, focusing on best practices and optimizations. We'll also provide examples, including the use of SQL queries for bulk data insertion.

Data Preparation: Before populating large datasets, it's essential to prepare the data and ensure that it's in the right format. This includes cleaning the data, transforming it if necessary, and organizing it into batches for efficient insertion.
Batch Insertion: One of the most efficient ways to insert large amounts of data into a SQL database is through batch insertion. Instead of inserting one row at a time, batch insertion allows multiple rows to be inserted in a single transaction, reducing overhead and improving performance.
Using SQL Bulk Insert: SQL databases often provide mechanisms for bulk data insertion, such as SQL Server's BULK INSERT statement or PostgreSQL's COPY command. These methods are optimized for inserting large volumes of data quickly and efficiently.

Example Query: Let's consider an example of populating a large dataset using a SQL query:

CREATE TABLE YourTableName (
    id INT PRIMARY KEY,
    name VARCHAR(MAX),
    description VARCHAR(MAX),
    notes VARCHAR(MAX)
);

DECLARE @Counter INT = 0;
DECLARE @Name VARCHAR(MAX);
DECLARE @Description VARCHAR(MAX);
DECLARE @Notes VARCHAR(MAX);

-- Begin transaction
BEGIN TRANSACTION;

-- Loop to insert data
WHILE @Counter < 1000000  -- Inserting 1 million rows
BEGIN
    SET @Name = 'Name_' + CAST(@Counter AS VARCHAR(10));
    SET @Description = 'Description_' + CAST(@Counter AS VARCHAR(10));
    SET @Notes = 'Notes_' + CAST(@Counter AS VARCHAR(10));

    INSERT INTO YourTableName (id, name, description, notes)
    VALUES (@Counter, @Name, @Description, @Notes);

    SET @Counter = @Counter + 1;
END;

-- Commit transaction
COMMIT TRANSACTION;

In this query:

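We declare variables for the column values to be inserted.
We start a transaction so that all the inserts succeed or fail together.
We use a loop to generate the data and insert one row per iteration; because every insert runs inside the same transaction, the rows are committed as a single batch rather than with one commit per row.
Finally, we commit the transaction to make the changes permanent.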

Conclusion

Efficiently populating large datasets in SQL databases requires careful planning and optimization. By following best practices such as batch insertion and using database-specific bulk insertion methods, you can improve performance and minimize the impact on database resources. Additionally, leveraging the power of SQL queries for data population can streamline the process and make it easier to manage large-scale data operations.

Harnessing the Potential of AWS API Gateway for REST APIs

Mohammad Arsalan — Thu, 04 Apr 2024 04:54:13 GMT

In the dynamic realm of modern software development, AWS API Gateway emerges as a pivotal service, offering a comprehensive solution for building, managing, and securing REST APIs on the Amazon Web Services (AWS) platform. With its rich set of features and seamless integration with other AWS services, API Gateway empowers developers to unleash the full potential of their REST APIs. Let's delve into three key functionalities that AWS API Gateway provides for REST APIs:

Query String Validation: AWS API Gateway simplifies validating query strings by letting developers mark query string parameters as required or optional for each method. With a request validator enabled in the API Gateway console, API Gateway checks that required parameters are present and not blank, and rejects non-conforming requests before they reach the backend. This ensures that incoming requests conform to the expected structure, enhancing the reliability and security of the API endpoints.
Request Body Validation with Custom Models: Building upon query string validation, AWS API Gateway supports validating request bodies against custom models. Models are written in JSON Schema, so developers can define intricate data structures, including complex nested objects, and enforce strict validation rules for incoming payloads.
API Key and Usage Plan Management: Security and scalability are paramount concerns in API management, and AWS API Gateway addresses them with built-in support for API keys and usage plans. Developers can generate API keys and associate them with usage plans. By configuring quotas and throttling (rate and burst limits) in a usage plan, they can safeguard their APIs against abuse while optimizing resource utilization. API Gateway also integrates with AWS Identity and Access Management (IAM), enabling developers to manage access permissions and authentication mechanisms.

In conclusion, AWS API Gateway serves as a cornerstone for building resilient, scalable, and secure REST APIs on the AWS cloud. Query string validation, request body validation with custom models, and API key and usage plan management together streamline API development and management workflows while maintaining high standards of security and reliability.

Complete Guide: Triggering AWS Lambda Functions via S3 Bucket Events

Mohammad Arsalan — Sun, 24 Mar 2024 17:40:38 GMT

Setup Lambda Function

Click on the "Create Function" button.
Give the desired name and click on "Create Function" button.
After creation, you will get a dashboard like this.

Setup S3 Bucket

Click on the "Create Bucket" button.
Now give the desired name and click on the create button.

Create an IAM Role with Access to S3

Click on the "Create Role" button to create a new IAM role.
Select the use case as Lambda.
Now attach the S3 full access policy (AmazonS3FullAccess) to the role.
Finally, give a name to your IAM role and click on the create button.

Attach the IAM Role to Your Lambda

Navigate to your Lambda, select the "Permissions" tab from the configuration, and click on the "Edit" button.
Select the name of the IAM role you have created and click on the save button.

Add the Trigger

Now click on "Add Trigger" button to add the trigger.
Select S3 and your bucket name from the dropdown and click on "Add" button.

Write Below Code and Push to AWS Lambda

Write the code below in index.js and install the aws-sdk package via npm.
Create a .zip file containing index.js and the node_modules folder.

Click on the "Upload From" button to upload this .zip file.

  const AWS = require('aws-sdk');
  const s3 = new AWS.S3();

  exports.handler = async (event) => {
      try {
          for (const record of event.Records) {
              const bucketName = record.s3.bucket.name;
              const objectKey = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
              console.log('Bucket Name:', bucketName);
              console.log('Object Key:', objectKey);

              const getObjectParams = {
                  Bucket: bucketName,
                  Key: objectKey
              };
              const fileContent = await s3.getObject(getObjectParams).promise();
              console.log('File Content:', fileContent.Body.toString('utf-8'));
          }

          return {
              statusCode: 200,
              body: JSON.stringify({
                  message: 'Successfully processed all S3 file uploads'
              })
          };
      } catch (error) {
          console.error('Error:', error);
          return {
              statusCode: 500,
              body: JSON.stringify({
                  message: 'Error processing S3 file uploads',
                  error: error.message
              })
          };
      }
  };

Whenever You Upload Anything on S3 Lambda Will Get Triggered

Inside the bucket add the file by clicking on "Upload" button.
Now navigate to your Lambda function and open its CloudWatch log group to check the logs.

Conclusion

From the logs we can see that the Lambda function was triggered when the file was uploaded to S3.

It was able to read the bucket name, the file name, and the file's content.

How to Set Up and Configure API Gateway on AWS: A Comprehensive Guide

Mohammad Arsalan — Thu, 21 Mar 2024 13:26:28 GMT

Navigate to API Gateway:

Open your web browser, sign in to the AWS Management Console, and open the API Gateway service.

Click on Create API button:

Look for the "Create API" button on the API Gateway dashboard and click on it.
In the API creation wizard, select the type of API you want to create. For this example, select "HTTP API". Then click on the "Build" button to proceed.

Add Integration:

Now, you need to point this route to an already configured backend. Click on the "Add Integration" button.

In the integration settings, add the server URL where your backend service is hosted. Also, give your API a name to identify it later.

Define Resource Path:

After setting up the integration, define the resource path. This is the URL path where your API will be accessible.

Add Stage Name:

Next, add a stage name for your API. The stage represents a specific deployment of your API (e.g., "development", "production", etc.).

Click on Create Button:

Once you've configured all the necessary settings, click on the "Create" button. This will create your API with the specified configurations.

Deploy your API:

After creating the API, it needs to be deployed to make it accessible. Click on the "Deploy" button.
Select the stage that you've created earlier (e.g., "development", "production", etc.) from the dropdown menu.
Confirm the deployment by clicking on the "Deploy" button.

Access the Dashboard:
- After deployment, you will be directed to a dashboard where you can manage your API.
- Here, you can monitor usage, set up custom domains, configure authorizers, etc.

Use the API:

To use your API, simply copy the provided endpoint URL (usually displayed on the dashboard or deployment confirmation page) along with your route.
Use this URL to make requests to your API. You should receive responses from your backend service accordingly.

Infinite Scrolling: Intersection Observer

Mohammad Arsalan — Sun, 10 Dec 2023 09:41:19 GMT

The Intersection Observer is an API used to detect when a specific element enters the viewport. With this functionality, we can create logic that depends on the visibility of that particular element. For example, fetching data when a specific element is reached, implementing infinite scroll, collecting user interactions with a particular section, and more.

In the code below, there is a 'colors' array and a 'moreColors' array. When we reach the end of the page at "this is a div with a black color," more colors are fetched, thereby implementing infinite scroll.

import { useEffect, useRef, useState } from "react";
import "./App.css"

const App = () => {
  const moreColors = ['#FF00FF', '#00FFFF', '#FFD700', '#8A2BE2', 
  '#00FF00', '#FF4500', '#7FFF00', '#FF1493', '#32CD32', '#FF8C00'];
  const lastElement = useRef();

  const [colors, setColors] = useState(['#FF5733', '#33FF57', 
  '#5733FF', '#FF33A6', '#33A6FF', '#A6FF33', '#FF3366', 
  '#3366FF', '#66FF33', '#FF6633']);
  const [isVisible, setIsVisible] = useState(false);

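  useEffect(() => {
    // Watch the last element and record whether it is currently in the viewport.
    const observer = new IntersectionObserver((entries) => {
      const entry = entries[0];
      setIsVisible(entry.isIntersecting);
    });
    observer.observe(lastElement.current);
    // Stop observing when the component unmounts.
    return () => observer.disconnect();
  }, [])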

  useEffect(() => {
    if (isVisible === true) {
      console.log("Fetch more colors...");
      setColors((preColors) => {
        return [...preColors, ...moreColors];
      });
      console.log(colors);
    }
  }, [isVisible])

  return (
    <div>
      {colors && colors.map((color, index) => {
        return (
          <div key={index} style={{ backgroundColor: color }}
           className="divClass">
            this is div with {color} color.
          </div>
        )
      })}
      <div style={{ backgroundColor: "black" }}
       className="divClass" ref={lastElement}>
        this is div with black color.
      </div>
    </div>
  );
}

export default App;

.divClass {
  color: white;
  height: 5rem;
  border: 2px solid black;
  border-radius: 1rem;
  display: flex;
  justify-content: center;
  align-items: center;
  font-size: 20px;
  font-weight: 700;
  margin: 5px;
}

How to Use CDC Tables to Capture Change Data

Mohammad Arsalan — Tue, 11 Jul 2023 14:06:32 GMT

Introduction to CDC tables

Change Data Capture (CDC) tables are database tables that record the changes made to other tables. They are typically used to track data changes in real time or near real time, so you can react to changes as they happen. CDC also scales to large volumes of data, which makes it a good choice for applications that need to track changes to large datasets.

Comparison of CDC vs Triggers

Why do we need a CDC table when we can do the same thing using triggers? There are a few reasons to prefer a CDC table.

Performance: CDC tables are typically less performance-intensive than triggers because no trigger code has to fire as part of every individual DML operation. In SQL Server, changes are instead captured asynchronously from the transaction log. This can be especially important for high-volume databases.
Scalability: CDC tables are more scalable than triggers because they can be used to capture changes from multiple tables at the same time. This can be helpful for large databases with a lot of data.
Flexibility: CDC tables are more flexible than triggers because they can be used to capture changes in a variety of ways. This can be helpful for applications that need to capture specific types of changes or changes in a specific order.

Feature	CDC Tables	Triggers
Performance	Less performance-intensive	More performance-intensive
Scalability	More scalable	Less scalable
Flexibility	More flexible	Less flexible
Complexity	More complex to set up and manage	Less complex to set up and manage
Capabilities	Can capture changes from multiple tables	Can only capture changes from the source table

Setting up CDC tables

Run the below script in the Student database to enable CDC

USE Student;
GO
EXECUTE sys.sp_cdc_enable_db;
GO

Enable SQL Server Agent
1. Open SQL Server Management Studio.
2. In the Object Explorer, locate the SQL Server Agent node under the connected server.
3. Right-click SQL Server Agent and select Start.
Run the below command in the Student database to enable CDC for the Details table

EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name = N'Details',
     @role_name = NULL
GO

Terminologies of CDC Table

Data captured in the CDC table from the source table will look something like this:

Source table for corresponding CDC table:

The __$operation column records which kind of change each row represents:

Operation	Value
Delete	1
Insert	2
Update (Old Value)	3
Update (New Value)	4

"Unlocking the Power of Multithreading: Enhancing Performance and Concurrency in Your Applications"

Mohammad Arsalan — Sun, 02 Jul 2023 11:12:00 GMT

Boost performance using Multi-threading

The task is to print the count of even and odd numbers from 1 to 1e9. Below you will see a straightforward single-threaded approach with its run time, a multi-threaded version using two threads, and a version that uses all the hardware threads available on your PC. Run times will differ depending on your PC's performance, but you should see a drastic reduction in run time.

Normal Way:

#include <chrono>
#include <cstdio>
#include <iostream>
using namespace std;

// Counts the even numbers in [1, num].
void findEven(long long num) {
    long long cnt = 0;
    for (long long i = 1; i <= num; i++)
    {
        if (i % 2 == 0) cnt++;
    }
    cout << "Even count: " << cnt << endl;
}

// Counts the odd numbers in [1, num].
void findOdd(long long num) {
    long long cnt = 0;
    for (long long i = 1; i <= num; i++)
    {
        if (i % 2 != 0) cnt++;
    }
    cout << "Odd count: " << cnt << endl;
}

int main()
{
    freopen("output.txt", "w", stdout);
    long long num = 1e9;

    auto start = std::chrono::high_resolution_clock::now();

    // Both counts run one after the other on the main thread.
    findEven(num);
    findOdd(num);

    auto end = std::chrono::high_resolution_clock::now();
    double duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();

    duration = (duration / 1000);
    cout << duration << " seconds";
    return 0;
}

Involving two threads:

#include <chrono>
#include <cstdio>
#include <iostream>
#include <thread>
using namespace std;

// Counts the even numbers in [1, num].
void findEven(long long num) {
    long long cnt = 0;
    for (long long i = 1; i <= num; i++)
    {
        if (i % 2 == 0) cnt++;
    }
    cout << "Even count: " << cnt << endl;
}

// Counts the odd numbers in [1, num].
void findOdd(long long num) {
    long long cnt = 0;
    for (long long i = 1; i <= num; i++)
    {
        if (i % 2 != 0) cnt++;
    }
    cout << "Odd count: " << cnt << endl;
}

int main()
{
    freopen("output.txt", "w", stdout);
    long long num = 1e9;

    auto start = std::chrono::high_resolution_clock::now();

    // Run the two counts concurrently, one thread each.
    thread t1(findEven, num);
    thread t2(findOdd, num);

    // Wait for both threads before stopping the timer.
    t1.join();
    t2.join();

    auto end = std::chrono::high_resolution_clock::now();
    double duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();

    duration = (duration / 1000);
    cout << duration << " seconds";
    return 0;
}

Involving multiple threads:

#include <chrono>
#include <cstdio>
#include <iostream>
#include <thread>
#include <vector>
using namespace std;

// Counts the even numbers in [start, end].
int countEvenNumbers(int start, int end)
{
    int count = 0;
    for (int i = start; i <= end; ++i)
    {
        if (i % 2 == 0)
        {
            count++;
        }
    }
    return count;
}

// Counts the odd numbers in [start, end].
int countOddNumbers(int start, int end)
{
    int count = 0;
    for (int i = start; i <= end; ++i)
    {
        if (i % 2 != 0)
        {
            count++;
        }
    }
    return count;
}

// Runs in each worker thread and reports the counts for its chunk.
void calculateCount(int chunkStart, int chunkEnd)
{
    int evenCount = countEvenNumbers(chunkStart, chunkEnd);
    int oddCount = countOddNumbers(chunkStart, chunkEnd);
    cout << "Thread " << this_thread::get_id() << ": " << evenCount << " even numbers, " << oddCount << " odd numbers." << endl;
}

int main()
{
    freopen("output.txt", "w", stdout);

    int numThreads = thread::hardware_concurrency();
    if (numThreads == 0) numThreads = 1; // hardware_concurrency() may return 0 if the value is unknown
    cout << "Number of threads available: " << numThreads << endl;

    int startNumber = 1;
    int endNumber = 1e9;

    // Split [startNumber, endNumber] into one contiguous chunk per thread.
    int chunkSize = (endNumber - startNumber + 1) / numThreads;
    vector<thread> threads;

    auto start = std::chrono::high_resolution_clock::now();

    for (int i = 0; i < numThreads; ++i)
    {
        int chunkStart = startNumber + i * chunkSize;
        int chunkEnd = chunkStart + chunkSize - 1;

        // The last chunk absorbs any remainder from the integer division.
        if (i == numThreads - 1)
        {
            chunkEnd = endNumber;
        }

        threads.push_back(thread(calculateCount, chunkStart, chunkEnd));
    }

    for (auto& t : threads)
    {
        t.join();
    }

    auto end = std::chrono::high_resolution_clock::now();
    double duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();

    duration = (duration / 1000);
    cout << duration << " seconds";
    return 0;
}

Run time Comparison:

Normal Way	Involving two Threads	Involving multiple Threads
`7.066 seconds`	`3.483 seconds (50.70%)`	`0.945 seconds (86.62%)`

From the above table, you can see that using two threads reduced the run time by about 50.7%, and using all available hardware threads reduced it by about 86.6%.
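The percentages follow directly from the measured times in the table. As a small sketch of the arithmetic (the helper names `improvementPercent` and `speedupFactor` are made up for illustration and are not part of the programs above):

```cpp
#include <cassert>

// Percentage reduction in run time relative to the baseline run.
double improvementPercent(double baselineSeconds, double newSeconds) {
    assert(baselineSeconds > 0.0);
    return (baselineSeconds - newSeconds) / baselineSeconds * 100.0;
}

// Speedup factor: how many times faster the new run is than the baseline.
double speedupFactor(double baselineSeconds, double newSeconds) {
    assert(newSeconds > 0.0);
    return baselineSeconds / newSeconds;
}
```

Calling `improvementPercent(7.066, 3.483)` gives roughly 50.7, and `improvementPercent(7.066, 0.945)` gives roughly 86.6. Expressed as a speedup factor, the same measurements correspond to about 2x for two threads and about 7.5x for all hardware threads.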

Terminologies in Multi-threading

join(): The join() function is a member function of the std::thread class in C++. It is used to wait for a thread to complete its execution. When you call join() on a thread object, the calling thread will block until the thread being joined finishes its execution.
joinable(): The joinable() function is a member function of the std::thread class. It is used to check if a thread is joinable. A thread is joinable if it represents a thread of execution that has not been joined or detached. If a thread is joinable, it means that it can be joined by calling the join() function.
detach(): The detach() function is a member function of the std::thread class. It separates the thread of execution from the std::thread object. When a thread is detached, it continues its execution independently, and the calling thread does not wait for its completion. The detached thread's resources are released automatically when it finishes execution. The detach() function is useful when you have a thread that performs a background task or provides a service independently from the main thread, and you do not need to wait for its completion or retrieve any result from it. Note that when main() returns, the process ends, and any detached threads that are still running are terminated with it.
mutex: A mutex (short for mutual exclusion) is a synchronization primitive used to protect shared resources from simultaneous access by multiple threads. It ensures that only one thread can access a critical section of code (protected by the mutex) at a time. In C++, the std::mutex class is provided as a standard mutex implementation.
race condition: A race condition occurs when two or more threads access shared data concurrently, and the outcome of the program depends on the relative timing of their execution. Race conditions can lead to unpredictable and erroneous behaviour in a multi-threaded program. They typically occur when at least one thread is modifying shared data, and there is no synchronization mechanism (such as a mutex) in place to enforce exclusive access.
timed_mutex: std::timed_mutex is a separate mutex type that offers everything std::mutex does plus timed locking. Its member functions try_lock_for() and try_lock_until() block until the lock is acquired, or until the given duration has elapsed or the given point in time has been reached, respectively. They return true if the lock was acquired and false otherwise, allowing the caller to handle the case when a lock cannot be acquired within the given time frame.

Timed mutexes are useful when a thread should wait only a bounded amount of time for a lock and then carry on with other work if the lock is still unavailable.
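
To make the timed locking described above concrete, here is a minimal sketch. The names resourceMutex, tryWork and tryWorkWhileHeld are hypothetical and exist only for this illustration. A helper thread holds the lock for a while, and the caller gives up if it cannot get the lock within its timeout.

```cpp
#include <chrono>
#include <future>
#include <mutex>
#include <thread>

std::timed_mutex resourceMutex; // hypothetical lock guarding some shared resource

// Tries to take the lock, waiting at most `timeout`.
// Returns true if the critical section ran, false if the wait timed out.
bool tryWork(std::chrono::milliseconds timeout)
{
    if (!resourceMutex.try_lock_for(timeout))
    {
        return false; // lock not acquired within the timeout
    }
    // ... critical section would go here ...
    resourceMutex.unlock();
    return true;
}

// Holds the lock on a helper thread for `holdTime` while the caller
// attempts to acquire it with the given timeout.
bool tryWorkWhileHeld(std::chrono::milliseconds holdTime, std::chrono::milliseconds timeout)
{
    std::promise<void> lockTaken;
    std::thread holder([&lockTaken, holdTime] {
        std::lock_guard<std::timed_mutex> guard(resourceMutex);
        lockTaken.set_value(); // tell the caller the lock is now held
        std::this_thread::sleep_for(holdTime);
    });
    lockTaken.get_future().wait(); // make sure the helper owns the lock first
    bool acquired = tryWork(timeout);
    holder.join();
    return acquired;
}
```

Calling tryWorkWhileHeld with a hold time of 300 ms and a timeout of 20 ms should report a failed attempt, because the caller stops waiting long before the helper releases the lock.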

It's important to note that using mutexes and synchronization mechanisms correctly is crucial to prevent race conditions and ensure thread safety in multi-threaded programs.
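
As a small illustration of how a mutex prevents a race condition, the sketch below increments one shared counter from several threads. The function name countWithMutex and its parameters are made up for this example. Without the lock_guard, increments from different threads could overlap and some updates would be lost.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Increments a shared counter from several threads, guarding each
// increment with a mutex so that no update is lost.
long countWithMutex(int threadCount, int incrementsPerThread)
{
    long counter = 0;
    std::mutex counterMutex;
    std::vector<std::thread> workers;

    for (int t = 0; t < threadCount; ++t)
    {
        workers.emplace_back([&counter, &counterMutex, incrementsPerThread] {
            for (int i = 0; i < incrementsPerThread; ++i)
            {
                // Only one thread at a time can be inside this scope.
                std::lock_guard<std::mutex> guard(counterMutex);
                ++counter;
            }
        });
    }

    for (auto& worker : workers)
    {
        worker.join();
    }
    return counter;
}
```

With 4 threads and 10000 increments each, the function returns 40000 every time, because each increment happens while the mutex is held.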

Difference between unique_lock and lock_guard

std::unique_lock is a more versatile class that provides additional features and flexibility compared to std::lock_guard.
Unlike std::lock_guard, std::unique_lock allows manual control over the locking and unlocking of the associated mutex. It can be locked and unlocked multiple times within its lifetime.
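std::unique_lock is movable, which means ownership of a held lock can be transferred, for example by returning it from a function or moving it into another object. The mutex itself must still be unlocked by the thread that locked it.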
It supports more advanced synchronization scenarios, such as condition variable waiting (cv.wait()) and timed locking.
std::unique_lock is typically used when you need more fine-grained control over locking and unlocking, or when you want to use additional synchronization features like condition variables.

#include <mutex>

std::mutex mtx;

void someFunction()
{
    std::unique_lock<std::mutex> lock(mtx);
    // Critical section: Mutex is locked here
    // Perform operations on shared resources
    lock.unlock(); // Manually unlock the mutex
    // Perform some non-critical operations without holding the lock
    lock.lock(); // Manually lock the mutex again
    // Continue with critical section
    // Mutex is automatically released when 'lock' goes out of scope
}
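
For comparison, the following sketch puts std::lock_guard next to two std::unique_lock features that lock_guard does not have: deferred locking and moving the lock out of a function. All names here (dataMutex, sharedValue, acquireDeferred, demo) are hypothetical.

```cpp
#include <mutex>

std::mutex dataMutex;
int sharedValue = 0; // protected by dataMutex

// lock_guard: locks on construction and unlocks at scope exit, nothing else.
void incrementWithLockGuard()
{
    std::lock_guard<std::mutex> guard(dataMutex);
    ++sharedValue;
}

// unique_lock: can be created unlocked, locked later, and moved to the caller.
std::unique_lock<std::mutex> acquireDeferred()
{
    std::unique_lock<std::mutex> lock(dataMutex, std::defer_lock); // not locked yet
    lock.lock();                                                   // lock explicitly
    ++sharedValue;
    return lock; // ownership of the held lock moves to the caller
}

// Returns true if the lock behaved as described in the comments above.
bool demo()
{
    incrementWithLockGuard();
    std::unique_lock<std::mutex> held = acquireDeferred();
    bool ownedAfterMove = held.owns_lock(); // still locked after the move
    int valueSeen = sharedValue;            // safe to read: we hold the lock
    held.unlock();                          // release before scope exit
    return ownedAfterMove && !held.owns_lock() && valueSeen == 2;
}
```

If none of this flexibility is needed, std::lock_guard is the simpler choice.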

Condition Variable

The example below has three functions: withdrawMoney(), addMoney(), and main().

The withdrawMoney() function represents a thread that tries to withdraw a certain amount of money from the account. It first acquires the lock on the mutex using std::unique_lock. Then, it waits on the condition variable cv until there are sufficient funds in the account. The lambda function [amount] { return accountBalance >= amount; } specifies the condition for the waiting thread to continue. Once the condition is satisfied, the withdrawal is performed, and the account balance is updated.
The addMoney() function represents a thread that adds a certain amount of money to the account. It also acquires the lock on the mutex using std::unique_lock and adds the money to the account balance. After updating the balance, it calls cv.notify_all() so that every waiting withdrawal wakes up and re-checks its condition. With two withdrawals possibly waiting at once, cv.notify_one() would wake only one of them, and the other could stay blocked even though funds are available.
In the main() function, three threads are created: two for withdrawal (100 and 150 units) and one that adds 350 units. The threads run concurrently, so the order in which they acquire the mutex is not fixed.
This makes sure that a withdrawal is only executed when there are sufficient funds. If the deposit were smaller than the total being withdrawn, say 200 instead of 350, one withdrawal could never be satisfied and its thread would wait forever.
A condition variable lets a thread sleep until a particular condition is met.

By using std::mutex, std::condition_variable, and std::unique_lock, we ensure that withdrawals are blocked until sufficient funds are available in the account. The condition variable allows threads to wait efficiently without busy waiting, and the unique lock ensures exclusive access to shared resources.

#include <condition_variable>
#include <cstdio>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex mtx;
std::condition_variable cv;
int accountBalance = 0;

void withdrawMoney(int amount)
{
    std::unique_lock<std::mutex> lock(mtx);
    // Wait until there are sufficient funds in the account
    cv.wait(lock, [amount] { return accountBalance >= amount; });

    // Perform the withdrawal
    accountBalance -= amount;
    std::cout << "Withdrawn " << amount << " units. Account balance: " << accountBalance << std::endl;
}

void addMoney(int amount)
{
    std::unique_lock<std::mutex> lock(mtx);

    // Add money to the account
    accountBalance += amount;
    std::cout << "Added " << amount << " units. Account balance: " << accountBalance << std::endl;

    // Notify all waiting threads; each one re-checks its own condition
    cv.notify_all();
}

int main()
{
    freopen("output.txt", "w", stdout);
    std::thread t1(withdrawMoney, 100);
    std::thread t2(addMoney, 350);
    std::thread t3(withdrawMoney, 150);

    t1.join();
    t2.join();
    t3.join();

    return 0;
}
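
A withdrawal that can never be satisfied blocks its thread forever. One way to avoid that, sketched here under the assumption that giving up after a timeout is acceptable, is to use wait_for instead of wait. The names balanceMutex, fundsAvailable, balance, withdrawWithTimeout and deposit are hypothetical and separate from the example above.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex balanceMutex;
std::condition_variable fundsAvailable;
int balance = 0; // protected by balanceMutex

// Waits up to `timeout` for enough funds; returns false if they never arrive.
bool withdrawWithTimeout(int amount, std::chrono::milliseconds timeout)
{
    std::unique_lock<std::mutex> lock(balanceMutex);
    // With a predicate, wait_for returns the predicate's value when it stops waiting.
    if (!fundsAvailable.wait_for(lock, timeout, [amount] { return balance >= amount; }))
    {
        return false; // timed out with insufficient funds
    }
    balance -= amount;
    return true;
}

void deposit(int amount)
{
    {
        std::lock_guard<std::mutex> lock(balanceMutex);
        balance += amount;
    }
    // Wake every waiter so that each one re-checks its own condition.
    fundsAvailable.notify_all();
}
```

The caller can then decide what to do on a false return, for example report the failure or retry later, instead of hanging.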

Deadlock

A deadlock occurs in a multi-threaded program when two or more threads are blocked indefinitely, waiting for each other to release resources. This situation leads to a state where none of the threads can make progress, resulting in a deadlock. Deadlocks can occur when there is a circular dependency among threads and resources they are trying to acquire.

Let's consider an example scenario with two threads (Thread A and Thread B) and two resources (Resource 1 and Resource 2). The threads need to acquire both resources to perform their tasks. The following code snippet demonstrates a potential deadlock situation:

#include <chrono>
#include <mutex>
#include <thread>

std::mutex mutex1;
std::mutex mutex2;

void threadA()
{
    // Acquire lock on mutex1
    std::unique_lock<std::mutex> lock1(mutex1);

    // Sleep to simulate some processing
    std::this_thread::sleep_for(std::chrono::seconds(1));

    // Attempt to acquire lock on mutex2
    std::unique_lock<std::mutex> lock2(mutex2);

    // Perform Thread A's task using Resource 1 and Resource 2
}

void threadB()
{
    // Acquire lock on mutex2
    std::unique_lock<std::mutex> lock2(mutex2);

    // Sleep to simulate some processing
    std::this_thread::sleep_for(std::chrono::seconds(1));

    // Attempt to acquire lock on mutex1
    std::unique_lock<std::mutex> lock1(mutex1);

    // Perform Thread B's task using Resource 1 and Resource 2
}

int main()
{
    std::thread t1(threadA);
    std::thread t2(threadB);

    t1.join();
    t2.join();

    return 0;
}

In this example, Thread A and Thread B both acquire a lock on one mutex and then attempt to acquire a lock on the other mutex. However, if the timing is unfavourable, a deadlock can occur. Here's how the deadlock scenario unfolds:

Thread A acquires mutex1 and Thread B acquires mutex2.
Thread A tries to acquire mutex2 but gets blocked because Thread B holds the lock on mutex2.
Thread B tries to acquire mutex1 but gets blocked because Thread A holds the lock on mutex1.
Both threads are now waiting for the other thread to release the lock, resulting in a deadlock. None of them can proceed.

Here's a diagram illustrating the deadlock scenario:

Thread A                        Thread B
----------------------------------------------
Acquire lock on mutex1
                                    Acquire lock on mutex2
Attempt to acquire lock on mutex2
                                    Attempt to acquire lock on mutex1
      [Deadlock occurs]

To avoid deadlocks, it's important to carefully analyze and synchronize access to shared resources, ensuring that potential circular dependencies are avoided. Techniques such as using a fixed ordering of locks, resource allocation hierarchies, or employing deadlock detection and recovery algorithms can help prevent and handle deadlocks in multi-threaded programs.
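
As one possible fix for the example above, the sketch below acquires both mutexes in a single step with std::lock, which uses a deadlock-avoidance algorithm regardless of the order in which the locks are listed. The names resource1, resource2, completedTasks, taskA, taskB and runBoth are hypothetical.

```cpp
#include <mutex>
#include <thread>

std::mutex resource1;
std::mutex resource2;
int completedTasks = 0; // only modified while both mutexes are held

void taskA()
{
    // Create both locks without locking, then lock them together.
    std::unique_lock<std::mutex> lock1(resource1, std::defer_lock);
    std::unique_lock<std::mutex> lock2(resource2, std::defer_lock);
    std::lock(lock1, lock2); // acquires both without risking deadlock
    ++completedTasks;
}

void taskB()
{
    // Listed in the opposite order on purpose; std::lock still avoids deadlock.
    std::unique_lock<std::mutex> lock2(resource2, std::defer_lock);
    std::unique_lock<std::mutex> lock1(resource1, std::defer_lock);
    std::lock(lock2, lock1);
    ++completedTasks;
}

// Runs both tasks concurrently and returns how many of them finished.
int runBoth()
{
    completedTasks = 0;
    std::thread t1(taskA);
    std::thread t2(taskB);
    t1.join();
    t2.join();
    return completedTasks;
}
```

A fixed lock ordering (always mutex1 before mutex2) would work as well. From C++17 onward, std::scoped_lock offers the same behaviour as std::lock in a single declaration.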

Producer Consumer Problem

The Producer-Consumer problem is a classic synchronization problem in concurrent programming. It involves two entities, the producer and the consumer, which share a common buffer or queue. The producer generates data items and adds them to the buffer, while the consumer consumes the data items by removing them from the buffer.

Here's an example implementation in C++ using threads and a shared queue:

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

std::queue<int> buffer;
const int bufferSize = 10;
std::mutex mtx;
std::condition_variable cv;

void producer()
{
    for (int i = 1; i <= 20; ++i) {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [] { return buffer.size() < bufferSize; });

        buffer.push(i);
        std::cout << "Produced: " << i << std::endl;

        lock.unlock();
        cv.notify_one();
    }
}

void consumer()
{
    for (int i = 1; i <= 20; ++i) {
        std::unique_lock<std::mutex> lock(mtx);
        cv.wait(lock, [] { return !buffer.empty(); });

        int item = buffer.front();
        buffer.pop();
        std::cout << "Consumed: " << item << std::endl;

        lock.unlock();
        cv.notify_one();
    }
}

int main()
{
    std::thread producerThread(producer);
    std::thread consumerThread(consumer);

    producerThread.join();
    consumerThread.join();

    return 0;
}

In this example, the producer function produces values from 1 to 20 and adds them to the shared buffer as long as the buffer holds fewer than bufferSize (10) items. If the buffer is full, the producer waits for the consumer to consume items before adding more.

The consumer function consumes items from the buffer. If the buffer is empty, the consumer waits for the producer to produce items before consuming them. Once an item is consumed, it is removed from the buffer.

The std::mutex (mtx) is used to protect the shared buffer and ensure that only one thread can access it at a time. The std::condition_variable (cv) is used for signalling and synchronization between the producer and consumer threads. The cv.wait function is used to wait until a certain condition is met, and cv.notify_one is used to notify the waiting threads.

By coordinating the producer and consumer using mutexes and condition variables, we ensure that the producer doesn't produce items when the buffer is full, and the consumer doesn't consume items when the buffer is empty, thus avoiding synchronization issues.
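
The same idea can be packaged into a small reusable class. The sketch below is one possible design, not the only one: it uses two condition variables (one for "not full", one for "not empty") so that producers and consumers are woken only by the event they care about. BoundedQueue and transfer are hypothetical names.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class BoundedQueue
{
public:
    explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

    // Blocks while the queue is full.
    void push(int value)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        notFull_.wait(lock, [this] { return items_.size() < capacity_; });
        items_.push(value);
        lock.unlock();
        notEmpty_.notify_one(); // a consumer may now proceed
    }

    // Blocks while the queue is empty.
    int pop()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        notEmpty_.wait(lock, [this] { return !items_.empty(); });
        int value = items_.front();
        items_.pop();
        lock.unlock();
        notFull_.notify_one(); // a producer may now proceed
        return value;
    }

private:
    std::size_t capacity_;
    std::queue<int> items_;
    std::mutex mutex_;
    std::condition_variable notFull_;
    std::condition_variable notEmpty_;
};

// Runs one producer and one consumer and returns what the consumer received.
std::vector<int> transfer(int itemCount, std::size_t capacity)
{
    BoundedQueue queue(capacity);
    std::vector<int> received;
    std::thread producer([&] { for (int i = 1; i <= itemCount; ++i) queue.push(i); });
    std::thread consumer([&] { for (int i = 0; i < itemCount; ++i) received.push_back(queue.pop()); });
    producer.join();
    consumer.join();
    return received;
}
```

With a single producer and a single consumer, the items arrive in the order they were produced, even when the capacity is much smaller than the number of items.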

Sleep vs Wait

Multi-threading vs async-await

Multi-threading uses several threads to execute work in parallel, whereas async-await is about using a thread efficiently while it would otherwise sit idle. For example, after making an API call, the thread does not block waiting for the response; it performs other tasks in the meantime and handles the response when it arrives. This is what asynchronous means: the call does not block the execution of the program.
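
C++ has no async/await keywords before C++20 coroutines, but std::async and std::future can illustrate the "do not block until the result is needed" idea. Treat the sketch below as an approximation: with std::launch::async the slow call runs on another thread rather than on an event loop. fetchValue is a made-up stand-in for an API call.

```cpp
#include <chrono>
#include <future>
#include <thread>

// Stand-in for a slow API call (hypothetical; it only sleeps and returns a value).
int fetchValue()
{
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    return 42;
}

int doOtherWorkWhileWaiting()
{
    // Start the slow call without blocking the caller.
    std::future<int> response = std::async(std::launch::async, fetchValue);

    // The calling thread is free to do unrelated work in the meantime.
    int localWork = 0;
    for (int i = 1; i <= 10; ++i)
    {
        localWork += i; // sums to 55
    }

    // Block only at the point where the result is actually needed.
    return localWork + response.get();
}
```

The call to response.get() is the only place where the caller waits, which mirrors what an await expression does in languages that have one.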

(18) Array: Number of Sub-arrays With Odd Sum

Mohammad Arsalan — Thu, 20 Jan 2022 17:49:39 GMT

Question Link and Solution Link

Difficulty: Medium

Problem Statement: Given an array, we need to return the number of subarrays with an odd sum.

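Approach: Calculate the prefix sums of the array, then count how many of them are even and how many are odd. A subarray sum is the difference of two prefix sums, so it is odd exactly when the two prefix sums have different parity. That gives even*odd subarrays that start after index 0, plus odd subarrays that start at index 0 (an odd prefix sum on its own). Finally return ((even*odd)%mod + odd%mod)%mod, where mod = 10^9 + 7.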

int numOfSubarrays(vector<int> &arr)
{
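    long long odd = 0, even = 0, n = arr.size(), mod = 1e9 + 7;
    if (n == 0) return 0;      // nothing to count in an empty array
    vector<long long> prex(n); // prefix sums; a vector avoids the non-standard variable-length array
    prex[0] = arr[0];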
    for (int i = 1; i < n; i++) prex[i] = arr[i] + prex[i - 1];
    for (int i = 0; i < n; i++)
    {
        if (prex[i] % 2 == 0) even++;
        else odd++;
    }
    return (int)((even * odd) % mod + odd % mod) % mod;
}
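
Since only the parity of each prefix sum matters, the prefix array can be dropped. The following sketch is a variant of the approach above, not the original solution: it keeps running counts and adds to the answer as it goes. The function name countOddSumSubarrays is hypothetical.

```cpp
#include <vector>

int countOddSumSubarrays(const std::vector<int>& arr)
{
    const long long mod = 1000000007LL;
    long long evenPrefixes = 1; // the empty prefix has sum 0, which is even
    long long oddPrefixes = 0;
    long long result = 0;
    int parity = 0; // parity of the running prefix sum

    for (int value : arr)
    {
        parity ^= (value & 1);
        if (parity == 1)
        {
            // Odd prefix minus any earlier even prefix gives an odd subarray sum.
            result += evenPrefixes;
            ++oddPrefixes;
        }
        else
        {
            // Even prefix minus any earlier odd prefix gives an odd subarray sum.
            result += oddPrefixes;
            ++evenPrefixes;
        }
        result %= mod;
    }
    return static_cast<int>(result);
}
```

For the array {1, 3, 5} this returns 4: the subarrays {1}, {3}, {5} and {1, 3, 5} have odd sums.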

Subscribe to the newsletter so that you never miss any post or update just like this one.

You can follow me on Hashnode for:

Daily Data Structure and Algorithm based questions
Getting knowledge of various development-related tools, concepts, and practices

Twitter , GitHub , LinkedIn and Hashnode

(17) Array: Special Reverse

Mohammad Arsalan — Wed, 19 Jan 2022 13:23:03 GMT

Difficulty: Easy

Problem Statement: Given a string, reverse it such that the positions of the special characters do not change.

Input: intell#ect, Output: tcelle#tni

Input: h@ello, Output: o@lleh

Input: a#b@c, Output: c#b@a

Approach: We build a list res that keeps each special character at its original position and holds a space wherever the string has a letter.

In newS we store the letters of the string in reverse order, skipping the special characters. We then walk through res: wherever it holds a special character we copy it to the answer, otherwise we take the next letter from newS. At the end we print the answer.

s = input()
res = []
for i in s:
    if i >= 'a' and i <= 'z': res.append(' ')
    else: res.append(i)
s = s[::-1]
pos = 0
newS = ''
for i in s:
    if i >= 'a' and i <= 'z':
        newS += i

(x, y) = (0, 0)
(n, m) = (len(newS), len(res))
ans = ''
while x < n and y < m:
    if res[y] != ' ':
        ans += res[y]
        y += 1
    else:
        ans += newS[x]
        x += 1
        y += 1
while x < n:
    ans += newS[x]
    x += 1
while y < m:
    if res[y] == ' ':
        y += 1
        continue
    ans += res[y]
    y += 1
print(ans)
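
An alternative that needs no extra lists is the two-pointer technique, sketched here in C++. This is a different method from the one above, and it assumes that "special" means any character that is not a letter. The function name reverseKeepingSpecials is hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

std::string reverseKeepingSpecials(std::string text)
{
    if (text.empty())
    {
        return text;
    }
    std::size_t left = 0;
    std::size_t right = text.size() - 1;
    while (left < right)
    {
        if (!std::isalpha(static_cast<unsigned char>(text[left])))
        {
            ++left; // special character on the left stays where it is
        }
        else if (!std::isalpha(static_cast<unsigned char>(text[right])))
        {
            --right; // special character on the right stays where it is
        }
        else
        {
            std::swap(text[left], text[right]); // both are letters: swap them
            ++left;
            --right;
        }
    }
    return text;
}
```

For the sample inputs above, "intell#ect" becomes "tcelle#tni" and "a#b@c" becomes "c#b@a".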

(16) Array: Minimum swaps and K together

Mohammad Arsalan — Tue, 18 Jan 2022 17:24:17 GMT

Question Link

Difficulty: Medium

Problem Statement: Given an array and a number k, find the minimum number of swaps needed to bring all numbers less than or equal to k together.

Approach: We use a sliding window. First we count the numbers that are less than or equal to k and store the result in count. This is the size of the window, because all such numbers must end up in one contiguous block of that length. For the first window we compute bad, the number of elements in the window that are greater than k. Each bad element needs one swap to be replaced by a good element from outside the window.

Then we slide the window one position at a time: index i leaves the window and index j enters it. If arr[i] is greater than k we decrement bad, and if arr[j] is greater than k we increment it. The answer is the smallest value of bad over all windows.

int minSwap(int arr[], int n, int k)
{
    int bad = 0, count = 0;
    for (int i = 0; i < n; i++)
    {
        if (arr[i] <= k) count++;
    }
    for (int i = 0; i < count; i++)
    {
        if (arr[i] > k) bad++;
    }
    int ans = bad;
    for (int i = 0, j = count; j < n; i++, j++)
    {
        if (arr[i] > k) bad--;
        if (arr[j] > k) bad++;
        ans = min(ans, bad);
    }
    return ans;
}
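
To see the window at work, the sketch below records the value of bad for every window position instead of only the minimum. It is an illustrative variant of the solution above, with the hypothetical name badCountPerWindow.

```cpp
#include <vector>

// Returns the number of elements greater than k in each window of size
// "count of elements <= k", from the leftmost window to the rightmost.
std::vector<int> badCountPerWindow(const std::vector<int>& values, int k)
{
    const int n = static_cast<int>(values.size());
    int windowSize = 0;
    for (int value : values)
    {
        if (value <= k) ++windowSize;
    }

    std::vector<int> counts;
    if (windowSize == 0)
    {
        return counts; // nothing to group
    }

    int bad = 0;
    for (int i = 0; i < windowSize; ++i)
    {
        if (values[i] > k) ++bad;
    }
    counts.push_back(bad);

    for (int left = 0, right = windowSize; right < n; ++left, ++right)
    {
        if (values[left] > k) --bad;  // element leaving the window
        if (values[right] > k) ++bad; // element entering the window
        counts.push_back(bad);
    }
    return counts;
}
```

For the array {2, 1, 5, 6, 3} with k = 3, the window size is 3 and the recorded counts are 1, 2, 2, so the minimum number of swaps is 1.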

(15) Recursion: Combination Sum

Mohammad Arsalan — Mon, 17 Jan 2022 09:56:28 GMT

Question Link and Solution Link

Difficulty: Medium

Problem Statement: Given an array of elements and a target sum, select elements from the array such that their sum equals the target. An element may be picked any number of times.

Approach: At each step we check whether the current element is less than or equal to the remaining target. If it is, we include it in a temporary vector, decrease the target by that number, and keep the index unchanged, because we may want to include the same element again.

The other option is to skip the element, in which case we increment the index. The most important part is to call pop_back() on the temporary vector before exploring this branch, so that the skipped element is not part of the combination. When the index reaches the size of the array and tar == 0, we have reached the target sum, so we add the temporary vector to the result.

void solve(int idx, const vector<int> &nums, int tar, vector<vector<int>> &res, vector<int> &tmp) {
    if(idx==nums.size())
    {
        if(tar==0) res.push_back(tmp);
        return;
    }
    if(nums[idx] <= tar)
    {
        tmp.push_back(nums[idx]);
        solve(idx, nums, tar-nums[idx], res, tmp);
        tmp.pop_back();
    }
    solve(idx+1, nums, tar, res, tmp);
}
vector<vector<int>> combinationSum(vector<int>& candidates, int target) {
    vector<vector<int>> res;
    vector<int> tmp;
    solve(0, candidates, target, res, tmp);
    return res;
}
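
The include-or-skip recursion can be tried out with the self-contained sketch below. It follows the same idea as the solution above but stops as soon as the remaining target reaches zero, and it assumes all candidates are positive. The names collect and findCombinations are hypothetical.

```cpp
#include <cstddef>
#include <vector>

void collect(std::size_t index, const std::vector<int>& candidates, int remaining,
             std::vector<int>& chosen, std::vector<std::vector<int>>& results)
{
    if (remaining == 0)
    {
        results.push_back(chosen); // target reached: record this combination
        return;
    }
    if (index == candidates.size())
    {
        return; // no candidates left and target not reached
    }
    if (candidates[index] <= remaining)
    {
        // Include the current candidate and stay on the same index,
        // because it may be used again.
        chosen.push_back(candidates[index]);
        collect(index, candidates, remaining - candidates[index], chosen, results);
        chosen.pop_back(); // undo the choice before trying the skip branch
    }
    // Skip the current candidate and move on to the next one.
    collect(index + 1, candidates, remaining, chosen, results);
}

std::vector<std::vector<int>> findCombinations(const std::vector<int>& candidates, int target)
{
    std::vector<std::vector<int>> results;
    std::vector<int> chosen;
    collect(0, candidates, target, chosen, results);
    return results;
}
```

For candidates {2, 3, 6, 7} and target 7, this finds two combinations: {2, 2, 3} and {7}.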

Untitled Publication

How We Built a Scalable Log Analytics Platform with OpenSearch

Introduction

Advantages of Using OpenSearch

OpenSearch Storage Tiers

Purpose

Responsibilities

Characteristics

What Is an Index in OpenSearch

What Are Hot, Warm, and Cold Indices

What Is an Alias in OpenSearch

Managing Index Growth with Rollover in OpenSearch

What Is Rollover in OpenSearch

Why Use Rollover

How It Works

Understanding Shards in OpenSearch

Primary Shard

Replica Shard

What Does 5:2 Replication Strategy Mean

Snapshots in OpenSearch

How Snapshots Work

Dissecting Apache Kafka

Introduction to Kafka: The Need for a Distributed Messaging System

Kafka Cluster / Brokers, Topics, and Partitions — The Backbone

Kafka Data Flow — From Producer to Consumer

Kafka Partitions — Scaling, Increasing, and Decreasing

Kafka Consumer Group, Offset, Polling, and Auto-Commit Explained

Consumer Group Concept

Offset in Kafka

Auto-Commit

Manual Offset Commit

Poll Interval

Rebalancing in Kafka: Why It Happens and How It Affects Consumers

Leader and Replica in Kafka: High Availability Through Replication

What Happens If Leader Fails?

Frequently Asked Questions (FAQs) About Kafka

What is the difference between Kafka and a traditional messaging queue like RabbitMQ?

What happens if a Kafka broker goes down?

What is a Kafka Consumer Group?

What are Kafka Topics and Partitions?

How does Kafka guarantee message order?

How does Kafka handle message retention?

What is Kafka Consumer Lag?

How do Kafka Producers ensure data durability?

What is Kafka's Exactly-Once Semantics (EOS)?

Can I change the number of partitions in Kafka?

What is the difference between kafka-console-consumer and kafka-console-producer?

Spring Security in Spring Boot — A Complete Beginner's Guide

Introduction

Required Dependencies in pom.xml

Spring Boot Starter Data JPA

Spring Boot Starter Web

Spring Boot DevTools

Spring Boot Starter Security

PostgreSQL Driver

Lombok

Configuring application.properties

Spring Application Name & Server Port

Database Configuration (PostgreSQL)

Hibernate JPA Configuration

Logging Configuration

Controller Setup

UserController Code

Explanation of Code

Repository and Model Setup

UserModel Code

Explanation of UserModel Class

UserRepository Code

Explanation of UserRepository Interface

Database Mapping in PostgreSQL

Custom User Details Service

CustomUserDetailService Code

Explanation of CustomUserDetailService Class

Role of CustomUserDetailService in Spring Security

Configuring Spring Security

SecurityConfig Code

7.2 Explanation of SecurityConfig Class

How Spring Security Filters Requests

Next Steps

Understanding EMR Architecture: Key Components, Configuration Options, and Scaling Strategies
