Laurence Wingo

Preprocessing Healthcare Data for Predictive Analytics

Updated: May 20, 2019

This week, working on the heart condition project that predicts mortality based on previous patient visits to their physician, I was able to do SQL joins (left, right, inner, and outer) while also using SQL statements such as AS, WHERE, FROM, SELECT, SET, UPDATE, CREATE, and ALTER. The reason for this was to build a new table, drawn from the doctor office's data, of the patients who had passed away from heart conditions. I used an SQLite database, which is so small that it can also be used on mobile devices. SQLite has its own dialect and core functionality. For example, I used SQLite's julianday() function, which returns the number of days that have elapsed since noon Greenwich time on November 24, 4714 BC in the proleptic Gregorian calendar.

While building this new table from the existing data, some values were missing from the doctor office's database, which meant the new table needed to be cleaned. Following some simple cleaning, I needed to aggregate a few new columns to be used as the machine learning features. As I continued, I became curious whether cleaning could be done in Python as well as in SQL. I did learn that data wrangling and munging in Python means concatenating, merging, and creating new datasets of similar data types. To create the features for machine learning, I made SQL joins and binned the data into binary values. To establish a risk score, I added the binned values together.
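To make that workflow concrete, here is a minimal sketch using Python's built-in sqlite3 module together with Pandas. The table and column names (patients, visits, admit_date, discharge_date, deceased) are hypothetical stand-ins for the doctor office's schema, and the 7-day cutoff used for the binary bin is just an illustrative threshold.

```python
import sqlite3
import pandas as pd

# Hypothetical database file standing in for the doctor office's data.
conn = sqlite3.connect("clinic.db")

# LEFT JOIN patients to their visits and use SQLite's julianday() to get
# the number of days between two recorded dates.
query = """
SELECT p.patient_id,
       p.deceased,
       julianday(v.discharge_date) - julianday(v.admit_date) AS days_in_care
FROM patients AS p
LEFT JOIN visits AS v ON v.patient_id = p.patient_id
WHERE p.deceased = 1;
"""
df = pd.read_sql_query(query, conn)

# Simple cleaning: drop rows where the dates were missing in the source data.
df = df.dropna(subset=["days_in_care"])

# Bin a numeric column into a binary flag, then sum the flags into a risk score.
df["long_stay"] = (df["days_in_care"] > 7).astype(int)
df["risk_score"] = df[["long_stay", "deceased"]].sum(axis=1)
print(df.head())
```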


In order to begin using machine learning modules in Python, I utilized Pandas and its data frame type, which let me inspect the data in a Jupyter notebook for indexing arrays, aggregating, filtering, grouping, and transforming data such as timestamps. With Pandas data frames, I'm able to operate on the data using mathematical operations such as max, min, standard deviation, and mean, as well as the describe function, which performs all of those operations in one statement. Pandas also has a data type called a Series, a structure similar to dictionaries in Swift in that it holds labels alongside data. A data frame can combine multiple Series sharing a common index into a tabular object, similar to an Excel spreadsheet with rows and columns. A concept I found unique was the Index object, which keeps track of the items within another data structure. Aside from Pandas, there is NumPy for matrix computing: NumPy arrays support multi-dimensional computations from mathematics, such as standard deviations, eigenvalues from linear algebra, vector transformations that give arrays context, and the covariance between two variables, which describes how far two values are expected to vary together around their means. SciPy is built on top of NumPy for more scientific math operations, and scikit-learn is the area that will require knowledge of how statistics works in order to apply its ideas to mathematical theory.
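As a rough sketch of how those Pandas pieces fit together (the patient IDs and measurements below are made-up values purely for illustration):

```python
import pandas as pd

# A Series pairs labels with values, much like a dictionary in Swift.
ages = pd.Series({"p001": 54, "p002": 61, "p003": 47})

# A DataFrame combines multiple Series that share a common index
# into a tabular object with rows and columns.
df = pd.DataFrame({
    "age": ages,
    "systolic_bp": pd.Series({"p001": 128, "p002": 141, "p003": 119}),
})

# Individual statistics...
print(df["age"].max(), df["age"].min(), df["age"].std(), df["age"].mean())

# ...or describe(), which reports count, mean, std, min, quartiles, and max at once.
print(df.describe())
```

And a similar sketch of the NumPy side, showing a standard deviation, the covariance of two variables, and the eigenvalues of the resulting matrix (again with made-up numbers):

```python
import numpy as np

# Two toy measurement vectors, just to illustrate the calls.
heart_rate = np.array([72.0, 85.0, 90.0, 78.0, 96.0])
resp_rate = np.array([14.0, 18.0, 19.0, 15.0, 21.0])

# Standard deviation of a single vector.
print(np.std(heart_rate))

# Covariance matrix: the off-diagonal entries show how the two measures vary together.
cov = np.cov(heart_rate, resp_rate)
print(cov)

# Eigenvalues and eigenvectors of the covariance matrix, via numpy.linalg.
eigenvalues, eigenvectors = np.linalg.eig(cov)
print(eigenvalues)
```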



In pursuit of data analysis, I found a video series by Julian Keil where he shows the use of statistical operations in MATLAB in conjunction with a platform called "Field Trip" (link: https://vimeo.com/43116694 ). I became even more excited on learning that it was recently used by the Neuroscience program at UC Davis. UC Davis is where I enjoyed attending a hackathon to build an app called Life's Thesis (link to GitHub repo: https://github.com/cosmicarrows/Lifes-Thesis ). I'm starting to see the key areas where I'll need to apply myself in order to persist with this passion project: following through with the healthcare data analytics examples, practicing with OpenCV on iOS for more fun with machine learning, studying statistical material, and finding correlations between the frequency and time-series data from EEGs.







