Healthcare is one of General Electric's business verticals. The healthcare system collects several types of source data from hospital equipment, including sensor data, protected health information (PHI), and other equipment-generated data, and stores it in an AWS cloud environment for analytical use. Before analytics can be performed, the data has to be validated, so OwlDQ is used to validate data between source and destination.
OwlDQ is a web application that connects to source and destination data stores and runs Spark-based jobs to compare and score the data. The tool helps the business visualize data quality across different data stores.
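To make "compare and score" concrete, here is a minimal sketch of a source-versus-destination check in plain Python. It is not OwlDQ's algorithm or API: the function `compare_datasets`, the `ComparisonReport` fields, the `device_id` key, and the scoring formula (percentage of source rows that arrive intact) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ComparisonReport:
    missing_in_destination: list     # keys present in source only
    unexpected_in_destination: list  # keys present in destination only
    mismatched: list                 # keys present in both, with different values
    score: float                     # percentage of source rows that arrived intact


def compare_datasets(source_rows, destination_rows, key):
    """Compare two lists of dict rows by a key column (hypothetical helper)."""
    source = {row[key]: row for row in source_rows}
    destination = {row[key]: row for row in destination_rows}

    missing = sorted(k for k in source if k not in destination)
    unexpected = sorted(k for k in destination if k not in source)
    mismatched = sorted(
        k for k in source if k in destination and source[k] != destination[k]
    )

    # Assumed scoring rule: share of source rows found unchanged in the destination.
    intact = len(source) - len(missing) - len(mismatched)
    score = 100.0 * intact / len(source) if source else 100.0
    return ComparisonReport(missing, unexpected, mismatched, score)


# Made-up sample rows standing in for equipment sensor readings.
source_rows = [
    {"device_id": "A1", "reading": 98.6},
    {"device_id": "A2", "reading": 72.0},
    {"device_id": "A3", "reading": 120.5},
    {"device_id": "A4", "reading": 64.0},
]
destination_rows = [
    {"device_id": "A1", "reading": 98.6},
    {"device_id": "A2", "reading": 27.0},  # value changed in transit
    {"device_id": "A4", "reading": 64.0},  # A3 never arrived
]

report = compare_datasets(source_rows, destination_rows, key="device_id")
print(report)
```

In this sample, one row is missing and one is altered, so two of the four source rows arrive intact. A Spark-based job would apply the same idea to distributed data sets rather than in-memory lists.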
The approach is to host a web application that automates data quality checks without the need for hand-written rules. Owl applies recent advancements in data science and machine learning to the problem of data quality. OwlDQ creates and submits Spark workloads to an EMR cluster to run the analytical jobs and publish reports on data quality between different data stores. The reports can be viewed from a web browser connected to OwlDQ. An ELB routes traffic to the web application based on load. An EC2 instance hosts the web application. RDS (PostgreSQL) stores the metadata written by the workloads. OwlDQ can connect to multiple AWS services, such as S3, RDS, Redshift, and EBS, among others.
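The source does not describe how OwlDQ submits its Spark workloads internally, so the following is only a sketch of what a Spark step on an EMR cluster generally looks like. The helper `build_spark_step`, the step name, and every `s3://example-bucket/...` path are hypothetical. The dictionary layout (`Name`, `ActionOnFailure`, `HadoopJarStep` with `command-runner.jar` and `spark-submit` arguments) follows the standard EMR step format.

```python
def build_spark_step(name, script_uri, job_args):
    """Build an EMR step definition that runs spark-submit (hypothetical helper)."""
    return {
        "Name": name,
        # Keep the cluster running if this step fails.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets an EMR step run a command such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri, *job_args],
        },
    }


# Placeholder job script and data locations, for illustration only.
step = build_spark_step(
    name="dq-compare-sensor-data",
    script_uri="s3://example-bucket/jobs/compare.py",
    job_args=[
        "--source", "s3://example-bucket/raw/",
        "--destination", "s3://example-bucket/curated/",
    ],
)
print(step)
```

With boto3 installed and AWS credentials configured, a step like this could be submitted with `boto3.client("emr").add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])`, where `cluster_id` is the ID of a running EMR cluster.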