Serverless analytics on AWS

Business need

Healthcare is one of General Electric's business verticals. The healthcare system needed to bring together different types of source data from hospital equipment, including sensor data, PHI data, and equipment-generated data, and preserve it in the AWS cloud for analytical use. The source data arrives in several file formats, and the historical data must be available in a centralized location for processing, enrichment, and visualization.

Approach & solution

The previous infrastructure could no longer handle demand, so the healthcare system moved to AWS, combining multiple services and products to deliver fast, reliable, and secure data storage and processing power.

The solution also solved S3 data replication between a highly restricted region and other regions without exposing the restricted data.

AWS Services Used:

  • Amazon S3: Stores the source data, processed data, and enriched data.
  • Amazon EMR: Handles the periodic workloads that process, aggregate, and enrich the data, using the applications available on the cluster: Spark for processing, Sqoop to import data from multiple source systems into S3, Hive to store the processed data, Oozie to schedule the workloads, and Zeppelin to query the data stored in S3.
  • Amazon Redshift: Serves as the warehouse that preserves the historical data.
  • AWS IoT: Collects the sensor data and stores it in S3 as raw data for further processing on EMR.
  • Amazon EC2: Serves as the end-user portal for communicating with the EMR cluster, and also hosts custom software for data visualization, monitoring, and auditing.
  • AWS Lambda: Schedules and triggers workloads as EMR steps; the triggers are both time-based and file-based.
  • Amazon SQS: Used for file movement into the restricted region.

The approach brings in data from different source systems in the China region, in several formats (flat files, JSON, sensor data, and structured data). The China region is restricted, and none of the data stored there may be exposed outside it. Data processing in China, however, relies on data from other regions for its business logic.

To provide cross-region replication, SQS and Lambda functions were used to get and put objects between S3 buckets in different regions.
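
A minimal sketch of this SQS-driven replication is shown below. It rests on several assumptions that are not stated in the case study: each SQS message body is taken to be an S3 event notification, and the region names and destination bucket are hypothetical placeholders. Clients for China regions normally need their own credentials because those regions sit in a separate AWS partition; that setup is omitted here.

```python
import json
from urllib.parse import unquote_plus

# Hypothetical placeholders; the real regions and bucket are not given in the source.
SOURCE_REGION = "us-east-1"
DEST_REGION = "cn-north-1"
DEST_BUCKET = "example-restricted-bucket"


def objects_from_sqs_event(event):
    """Yield (bucket, key) pairs from an SQS-triggered Lambda event.

    Assumes each SQS message body is an S3 event notification in JSON.
    """
    for message in event.get("Records", []):
        body = json.loads(message["body"])
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])  # keys are URL-encoded
            yield bucket, key


def replicate(event, source_s3, dest_s3, dest_bucket):
    """Get each referenced object from the source and put it in the destination."""
    copied = []
    for bucket, key in objects_from_sqs_event(event):
        obj = source_s3.get_object(Bucket=bucket, Key=key)
        # Reads the whole object into memory; large files would need streaming.
        dest_s3.put_object(Bucket=dest_bucket, Key=key, Body=obj["Body"].read())
        copied.append(key)
    return copied


def lambda_handler(event, context):
    """Lambda entry point for the SQS trigger."""
    import boto3  # imported here so the helpers above can be tested without AWS

    # Credentials for the restricted region would have to be supplied separately.
    source_s3 = boto3.client("s3", region_name=SOURCE_REGION)
    dest_s3 = boto3.client("s3", region_name=DEST_REGION)
    return {"copied": replicate(event, source_s3, dest_s3, DEST_BUCKET)}
```

Using a separate get and put with two clients, rather than a single server-side copy, keeps the data path under the function's control, which suits a setup where the two sides cannot share credentials.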

S3 serves as the centralized store for raw, enriched, and cleansed data. A Lambda function triggers a step on EMR to submit workloads to the Hadoop ecosystem. EMR is the big data analytics and application platform, processing data from S3 with the applications it provides. Amazon Redshift is the data warehouse that stores all processed historical loads. SQS messages trigger a Lambda function that gets objects from S3 and puts them into the destination bucket.

Posted by wissenadmin | 11 August 2022