Let’s face it: many business users would prefer to use a nice GUI when analyzing their data, rather than having to write code. KNIME is an open-source data analytics platform where end users can do just that – that is, build an analytic workflow using a GUI. In this article, we will look at an example of how to use KNIME to analyze open-source heart disease data.
The open-source data used in the example that follows was obtained from the University of California Irvine Machine Learning Repository and can be downloaded here: http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/ . (We will use the processed Switzerland, Cleveland, and VA data sets and the reprocessed Hungarian data set).
NOTE: The authors of the databases have requested that any publications resulting from the use of the data include the names of the principal investigator responsible for the collection of the data at each institution. They would be:
- Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
- University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
- University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
- V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.
This data was collected from 920 cardiac patients; 294 were from the Hungarian Institute of Cardiology, 123 were from the University Hospitals in Zurich and Basel, Switzerland, 200 were from the V.A. Medical Center in Long Beach, California, and 303 were from the Cleveland Clinic in Ohio. The combined data set includes 725 men and 193 women; the ages of the patients ranged from 28-77 years. Each patient was classified according to severity of heart disease:
- 0 = no disease
- 1 = mild disease
- 2 = moderate disease
- 3 = acute disease
- 4 = severe disease
For reference, here is a table of the different variables in our data set:
age of patient
gender of patient (1 = male, 0 = female)
type of chest pain experienced by patient (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)
resting blood pressure (mmHg)
serum cholesterol (mg/dl)
fasting blood sugar > 120 mg/dl? 1 = true, 0 = false
resting ECG results (0 = normal, 1 = ST-T wave abnormality, 2 = probable or definite left ventricular hypertrophy according to Estes’ Criteria)
maximum heart rate
exercise-induced angina (1 = yes, 0 = no)
ST depression (a measurement on the patient's ECG readout) induced by exercise relative to rest
slope of peak exercise ST segment (a measurement on the patient's ECG readout; 1 = positive slope, 2 = flat, 3 = negative slope)
number of major vessels colored by fluoroscopy (0-3)
type of heart defect present (3 = normal, 6 = fixed defect, 7 = reversible defect)
severity of heart disease (0 = no disease, 1 = mild disease, 2 = moderate disease, 3 = acute disease, 4 = severe disease)
Some of these data sets contain the “?” symbol (without quotes) wherever there are missing values. For now, let's replace those with the value -9. Since these are small data sets, we can do this by opening our favorite text editor and running a "Find and Replace" on each of the input data sets.
Also, with the reprocessed Hungarian data set, it is recommended that you replace all the spaces between the values with commas.
It also helps if you get rid of the ".0"s (without quotes) in the Cleveland data set. (This is so that our input data is formatted consistently).
We’ll assume that, once we have preprocessed our data, we can hand it over to our DBA, who then loads it into a Teradata Aster table called [schema].heartdx_prelim. (To see an example of how to create a Teradata Aster table and load .csv data into it, please see Sentiment Analysis with Teradata Aster ).
Data Loading/Prep Work:
In this example, we will be using KNIME 3.2.1, which is licensed under the GNU General Public License, Version 3.
To begin, we open KNIME and create a new workflow. Here is what the KNIME screen looks like:
To create a new workflow, go to File > New…
In the dialog box that appears, select New KNIME Workflow > Next > Give your workflow a title in the “Name of the workflow to create” field > Click “Finish”. Our KNIME screen should now look like this:
We will now connect to the database where our data is stored, which, in this case, is a Teradata Aster database. In the Node Repository section on the bottom left side of the screen, select Database > Connector > Database Connector.
NOTE: Nodes can be selected either by double-clicking their name in the Node Repository panel or by dragging and dropping them to the grid panel in the center of the screen. Also, you can search for any node by using the Node Repository search window in the upper right corner of the Node Repository panel.
Now, right click the new Database Connector node and click “Configure,” as shown below:
In the dialog box that appears, fill out the following fields with the credentials of the database that we are connecting to:
NOTE: In the Database URL field, the value for <protocol> when connecting to a Teradata Aster database is jdbc:ncluster . Also, in order to connect KNIME to Aster, you will need to have the noarch-aster-jdbc-driver.jar file saved on your local system. Once you have downloaded this file, go to KNIME and select File > Preferences > KNIME > Databases > Add File > navigate to the location where you saved the noarch-aster-jdbc-driver.jar file. Select the .jar file, then click Open > OK.
When our node is configured, it will look like this (in the rectangle under the node, the middle circle is yellow, in case you have trouble seeing the color):
To execute the node and establish our database connection, either right click the node and select “Execute”, or left-click the node and press F7.
Now that we are connected to our database, we need KNIME to look at our specific table of interest, [schema].heartdx_prelim. For this, we can use a Database Reader node:
- In the Node Repository, select Database > Read/Write > Database Reader .
- Connect the two nodes:
- Click the red box on the right side of the Database Connector node.
- Drag a line to the white box on the left side of the Database Reader node.
Our nodes should look like this:
Right click the Database Reader node and click “Configure.” Ok, so for this node, we are going to need to write some code, but don’t worry; this is the only code writing that we will do in this workflow.
In the dialog box that appears, write the following code in the SQL Statement panel:
DROP TABLE IF EXISTS [schema].knime_pract;
CREATE TABLE [schema].knime_pract
DISTRIBUTE BY HASH(age)
COMPRESS LOW AS (
(ROW_NUMBER() OVER (ORDER BY age)) AS rownum,
nullif (age, -9) AS age,
nullif (***, '-9') AS gender,
nullif (chestpaintype, '-9') AS chestpaintype,
WHEN restbps IN (-9, 0) THEN null
END AS restbps,
WHEN chol IN (-9, 0) THEN null
END AS chol,
nullif (fastbloodsug, '-9') AS fastbloodsug,
nullif (restecg, '-9') AS restecg,
nullif (maxheartrate, -9) AS maxheartrate,
nullif (exindang, '-9') AS exindang,
nullif (oldpeak, -9) AS oldpeak,
nullif (slope, '-9') AS slope,
nullif (numvessels, -9) as numvessels,
nullif (defect, '-9') AS defect,
nullif (dxlevel, '-9') AS dxlevel,
WHEN dxlevel IN ('1', '2', '3', '4') THEN '1'
END AS dxpresent
SELECT * FROM [schema].knime_pract;
This takes our [schema].heartdx_prelim table and replaces the null values (denoted by -9) with true null values (which KNIME denotes with ?s). It then takes the result and creates a new table called [schema].knime_pract, from which it then selects everything. Notice also that we have created an ID variable called "rownum" (without quotes) and a new response variable called "dxpresent" (also without quotes). This dxpresent variable indicates whether or not a particular patient has heart disease.
Click “OK” and then run the node. Now, if we right click the Database Reader node and select “Data from Database”, we can see our data in KNIME:
Now that KNIME can both see our data and recognize the missing values as “missing,” we need to tell KNIME how to handle the missing values. For now, we will focus on getting rid of the columns with over 50% missing values. To do this, we will use a Missing Value Column Filter node.
- In the Node Repository, select Manipulation > Column > Filter > Missing Value Column Filter. Drag this node to our workflow and connect it to the Database Reader node.
- Right click the node and select “Configure.”
- In the dialog box that appears, remove rownum from the “Include” list; this is the ID variable, so it should not be missing, anyway.
- In the Missing Value Threshold (in %) box, enter the desired missing value threshold. In other words, if a column is over x% missing, we want KNIME to drop it from our dataset. Here, we’ll set this to 50%.
- Click OK and run the node.
- Now, if we right click the node and select “Filtered table,” we can see that the variables numvessels and defect were dropped from our dataset.
At this point, we will do a little exploration of our remaining data. We will do this using Pie Chart, Histogram, and Correlation nodes (there are many other nodes that can be used for data exploration, but for now, we’ll focus on these three).
To make a Pie chart, do the following:
- In the node repository, select Views > Pie Chart (interactive). Drag this node to the workflow and connect it to the Missing Value Column Filter node.
- Right click the node and select “Configure.”
- In the dialog box that appears, in the drop-down menu for Pie Column, select the variable that you want to see displayed in the pie chart.
- In the drop-down menu for Aggregation Column, select an aggregation column, if desired.
- Click OK and run the node.
- To see the pie chart, right click the node and select View: Pie Chart. Here is what our pie chart looks like with the chestpaintype variable selected and no aggregation:
Notice that our pie chart is in black and white. To add color to our pie chart, do the following:
- Disconnect the Pie Chart (interactive) node from the Missing Value Column Filter node.
- Drag and drop a Color Manager node into the workflow.
- Connect the Missing Value Column Filter node to the Color Manager node. Connect the Color Manager node to the Pie Chart (interactive) node.
- Configure the Color Manager node. Select the desired variable from the drop-down menu at the top of the dialog box. Select the desired colors from the swatches down below if you want to change them from the defaults.
- Click OK. Run the Color Manager and Pie Chart (interactive) nodes.
- Our workflow and pie chart should now look like this:
As we can see, a little over half of our patients have a chest pain type of 4, which is asymptomatic (in other words, they do not have chest pain).
To make a Histogram, do the following:
- Type “histogram” into the Node Repository search window.
- Drag and drop a Histogram (interactive) node into the workflow. Connect it to the Missing Value Column Filter node (unless you want to add color first; in that case, add a Color Manager node to the workflow, first, as described above).
- Configure the histogram node. Select the desired variable from the Binning Column drop-down menu. If desired, aggregation variables can be added or removed in the panels below.
- Click OK. Run the node.
- To see the histogram, right click the node and select View: Interactive Histogram View.
If you chose the age variable and used a Color Manager node, your histogram should look something like this:
As you can see, the ages of our patients appear to be approximately normally distributed.
A good thing to do when we are working on building a predictive model is to check our predictor variables to see if any of them are highly correlated with each other. Let’s do this now with a Rank Correlation node.
NOTE: The Rank Correlation node calculates correlations between all predictor pairs; the Linear Correlation node will calculate correlations for numeric-numeric pairs and nominal-nominal predictor pairs, but not for numeric-nominal pairs.
- Drag and drop a Rank Correlation node into the workflow. Connect the node to the Missing Value Column Filter node.
- Configure the Rank Correlation node. Remove the rownum variable from the “Include” list (remember that this is the ID variable). You can also remove the dxlevel and dxpresent variables, since these are response variables.
- Click OK and run the node. You will get a warning about how rows containing missing values are filtered and that this issue needs to be resolved with a Missing Value node. In this case, I would say to ignore this warning because we don’t really want to calculate correlations based on values that may be quite different from what the patients’ true values were.
- To see the correlation matrix, right click the node and select View: Correlation Matrix.
It should look something like this:
If we hover our mouse over each of the squares, we can see what the rank correlation is for that pair of variables. We see that chestpaintype is moderately correlated with maxheartrate and exindang (that is, the presence/absence of exercise-induced angina), which intuitively makes sense. Also, maxheartrate, exindang, oldpeak, and slope are all moderately correlated with each other, which makes sense, since all of these measures have to do with the state of a patient’s heart while they are exercising. If moderate correlation doesn’t bother us, then we’re fine to leave all the predictors in for model building.
We’re almost ready to move on to model building, but first, we need to take care of the missing values in our data set. (We can’t just ignore them at this point because the Logistic Regression Learner node that we are about to use won’t work unless we tell KNIME what to do about any missing values…trust me).
- Drag and drop a Missing Value node into the workflow. Connect it to the Missing Value Column Filter node.
- Configure the node. Let’s set the Number drop-down fields to “Median” and the String drop-down field to “Most Frequent Value.”
- Click OK and run the node. For numeric fields, any missing values will be replaced with the median value; for categorical fields, missing values will be replaced with the most frequently occurring value.
Now, let’s partition our main data set into training and test data sets. For this example, we’ll put 70% of our data into the training data set and 30% of our data into the test data set. Also, we’ll focus use dxpresent (1 = disease present, 0 = no disease) as our response variable.
- Drag and drop a Partitioning node into the workflow. Connect it to the Missing Value node.
- Configure the node. Click the Relative[%] radio button and enter 70 into the field at the right. Also, to ensure that our response variable has the same proportions of 1s and 0s in both our training and test data sets, let’s use stratified sampling. Click the Stratified Sampling radio button and make sure that dxpresent is selected in the drop-down menu to the right.
- Click OK and run the node.
- The top arrow on the right side of the node corresponds to our training data set, while the bottom arrow corresponds to the test data set.
Now, we are ready to build some statistical models. In this example, we will build and test a logistic regression model.
- Drag and drop a Logistic Regression Learner node into the workflow. Connect the training data set from the Partitioning node to the Logistic Regression Learner node (use the top arrow from the Partitioning node).
- Configure the Logistic Regression Learner node. Make sure that dxpresent is selected in the Target column drop-down menu. Reference category can be left as 1. In the Include/Exclude panels, make sure that rownum and dxlevel are excluded and that everything else is included.
- Click OK and run the node.
If you right click on the node and select View: Logistic Regression Result View, you can see your model-building results, as shown below:
If we use a typical α = 0.05 significance threshold, we see that our model considers age, gender, chest pain type, cholesterol, fasting blood sugar, maximum heart rate, exercise-induced angina, and oldpeak to be significant predictors of whether or not a person has heart disease.
How reliable are these model results, though? To answer this question, we’ll run our model on our test data set and score the results.
- Drag and drop a Regression Predictor node into the workflow. Connect the blue box on the Logistic Regression Learner node to the blue box on the Regression Predictor node. Connect the bottom arrow from the Partitioning node to the Regression Predictor node.
- The blue box connection hooks up the logistic model to the Regression Predictor node; the connection to the Partitioning node tells the Regression Predictor node to run the test data set through the logistic model.
- This portion of the workflow should look like this:
There isn’t really any configuration that needs to be done with this node for this example, so go ahead and run the node.
If we right click on the node and select Predicted Data, we can see the predictions generated by the node:
To summarize our test results, let's use a Scorer node:
- Drag and drop a Scorer node into the workflow. Connect it to the Regression Predictor node.
- Configure the node. Make sure that dxpresent is selected as “First Column” and that Prediction (dxpresent) is selected as “Second Column.” Also, in the Sorting Strategy drop-down, choose “Lexical.”
- Click OK and run the node.
- If we right click the node and select View: Confusion Matrix, we see the following:
Here, we see that our logistic model’s accuracy at correctly classifying patients as having heart disease/not having heart disease is about 79%. If we right click the Scorer node and select Accuracy Statistics, we can see more measures of accuracy, such as sensitivity, specificity, recall and precision.
Our current workflow looks like this:
At this point, we have several options to continue our work. Our current model does pretty well at predicting whether or not a patient has heart disease; however, we could try to tune our logistic model to improve our accuracy (or whichever accuracy statistic we care about most). We could also try different models, such as Naïve Bayes and Random Forest, to see if a different type of statistical model gives us better results. If going this route, we can hook up multiple model-building nodes to the same Partitioning node. Make sure to use a different Regression Predictor node and Scorer node with each model, though.
In this post, we have seen how to build an analytic workflow using KNIME connected to an Aster database. Of course, this only scratches the surface of what KNIME can do. For more information, you may either contact me at email@example.com or visit www.knime.org.