
Aster Works


This was initially built as a Jupyter notebook and so I am going to format the commentary so that it generally precedes the related code block.  Plus, the entire notebook will be available at the end of this post to make it easy for anyone who would like to grab it and follow along.


This is the first in a series of posts that explore and analyze a data set found on Kaggle. The data set can be found here and contains a record of questions asked on the popular data science forum Cross Validated, along with their answers and tags.  It was contributed by David Robinson and Julia Silge (Stack Overflow | Kaggle) and is shareable under the CC-BY-SA 3.0 license.  For the unfamiliar, Cross Validated lets users pose questions to the community, which other members then answer. All of the content is voted up or down by the community, which makes it easy to find quality content.

The vision for this series is to use the power and flexibility of D3 to build a unique visual tool that allows us to interact with the data in a way that illustrates the connections and gleans insights from the conversations that ultimately underpin this data set.

We will model the interactions of users on this website as a network and apply some graph and text analysis to understand the structure and dynamics of this network. I would like to answer three key questions: 1. What does this network of connections look like? 2. How can we measure the strength of these connections? 3. What can we learn about the users based on the content of their questions and responses?

What I would like to do in this first post is explore the first question by analyzing some of the basic attributes of the network. To do our analysis, we will use the Python packages Pandas and Networkx to quantify attributes of the users which will be the 'nodes' of our graph. When users ask and respond to questions, we model this as an interaction. These interactions form the 'edges' of our network. We will use this network to build an interactive visualization using D3 that allows us to explore both qualitative and quantitative aspects of this network.


Want to jump straight to the result? Check it out here.


import os
import pandas as pd
import networkx as nx
import numpy as np
import json
from IPython.display import Javascript



The data set contains three files: Questions.csv, Answers.csv, and Tags.csv. We want to load each file as a Pandas dataframe. This will allow us to do some filtering and preprocessing of the data using convenient pandas syntax.


answers = pd.read_csv('Answers.csv')
questions = pd.read_csv('Questions.csv')
tags = pd.read_csv('Tags.csv')


For whatever reason, the OwnerUserId field is imported as a float by default. Let's convert it to an integer so things will look nicer down the road.


answers.OwnerUserId = answers.OwnerUserId.astype(int)
questions.OwnerUserId = questions.OwnerUserId.astype(int)


To reduce the scale of this analysis, we'll limit our data set to high scoring questions only. To that end, we will apply a filter to our pandas dataframe.


questions_sample = questions[questions['Score'] >= 75]
tags_sample = tags[tags['Id'].isin(questions_sample['Id'])]


Since we have chosen to restrict ourselves to small scale analytics, our Python workflow works just fine for now. However, imagine trying to reproduce this type of analysis over millions or billions of records. You quickly reach a point where you will run into problems.


In situations like this, you have to turn to Big Data technologies, and there are many options to choose from. Teradata Aster offers best-in-class solutions across a wide range of data science applications, particularly its text and graph capabilities. Later on, when we dig deeper into this data set, we will showcase these capabilities by using the SQL-MR and SQL-GR analysis frameworks to quickly extract insights from this data set.


Now that we have a subset of questions, we will join them to their answers using the convenient merge() function from the pandas library.


result = pd.merge(questions_sample, answers, how = 'inner', left_on = 'Id', right_on = 'ParentId')


So, let's go ahead and build the graph. Networkx makes this pretty easy. There is even a convenient function to build it right from our pandas dataframe. All we have to do is tell it what to use as the source and target, and the function will set up the necessary data structure.

To keep it simple, we are going to model a user's response to another user's question as a connection between these two users. We could also add edge weights based on the number of connections or the comment score. For now, we will leave these details out. We'll come back to them in a later installment where we consider the content of each contribution in greater detail.


# NetworkX 2.x replaced from_pandas_dataframe with from_pandas_edgelist
G = nx.from_pandas_edgelist(result, 'OwnerUserId_x', 'OwnerUserId_y')
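
To make the edge model concrete, here is a minimal plain-Python sketch of what the join-then-build step produces. The sample records and the helper name `build_edges` are made up for illustration; the sketch mirrors the idea (one undirected edge per asker/answerer pair) and is not how Networkx implements it.

```python
def build_edges(questions, answers):
    """Return the set of undirected (asker, answerer) pairs.

    questions: list of dicts with 'Id' and 'OwnerUserId'
    answers:   list of dicts with 'ParentId' and 'OwnerUserId'
    """
    asker_by_question = {q["Id"]: q["OwnerUserId"] for q in questions}
    edges = set()
    for a in answers:
        asker = asker_by_question.get(a["ParentId"])
        if asker is None:
            continue  # inner join: skip answers whose question was filtered out
        # a frozenset treats (u, v) and (v, u) as the same undirected edge
        edges.add(frozenset((asker, a["OwnerUserId"])))
    return edges


# Hypothetical sample data, not taken from the Kaggle files
questions = [{"Id": 1, "OwnerUserId": 10}, {"Id": 2, "OwnerUserId": 20}]
answers = [
    {"ParentId": 1, "OwnerUserId": 20},
    {"ParentId": 1, "OwnerUserId": 30},
    {"ParentId": 2, "OwnerUserId": 10},   # same pair as the first answer
    {"ParentId": 2, "OwnerUserId": 20},   # self-answer, becomes a self-loop
    {"ParentId": 99, "OwnerUserId": 40},  # question not in the sample
]
edges = build_edges(questions, answers)
print(len(edges))
```

Repeated exchanges between the same two users collapse into a single edge here, which matches the unweighted model described above.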


A few of the graph algorithms that we want to use are undefined for graphs with self-loops. These occur when users submit an answer to their own question. It's not uncommon to see users answer their own questions, but since we want to model the relationships simply, we will purge these records from the data. Similarly, we will discard all but the largest connected component from our analysis, since that's all we want to visualize later on.


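# nx.selfloop_edges(G) is the function form of the older G.selfloop_edges() method
self_loops = list(nx.selfloop_edges(G))
print(len(self_loops))
G.remove_edges_from(self_loops)
print(nx.number_of_selfloops(G))
largest_cc = max(nx.connected_components(G), key=len)
# copy() gives a standalone graph rather than a read-only view
G = G.subgraph(largest_cc).copy()
> 10
> 0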
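
For readers who want to see what those two clean-up steps amount to, here is a small sketch in plain Python. The toy edge list and the helper `largest_component` are assumptions for illustration; it uses a breadth-first search and is not the Networkx implementation.

```python
from collections import deque


def largest_component(adj):
    """adj maps node -> set of neighbors (undirected). Returns the biggest component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


edge_list = [(1, 2), (2, 3), (3, 3), (4, 5)]      # (3, 3) is a self-loop
clean = [(u, v) for u, v in edge_list if u != v]  # drop self-loops

adj = {}
for u, v in clean:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

component = largest_component(adj)
print(sorted(component))
```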


First, let's look at some basic information on our network. 659 users either asked or responded to questions that passed the score >= 75 threshold we set earlier. On average, each user has 2.92 connections to other users in the network.


> Name:
> Type: Graph
> Number of nodes: 659
> Number of edges: 961
> Average degree: 2.9165
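
The average degree can be checked by hand from the node and edge counts: every edge contributes to the degree of two nodes, so the average is 2E/N. A quick sketch (the function name is just for illustration):

```python
def average_degree(num_nodes, num_edges):
    # each undirected edge adds one to the degree of both of its endpoints
    return 2 * num_edges / num_nodes


avg = average_degree(659, 961)
print(round(avg, 4))
```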


Next, we'll look at some basic attributes of the graph in an attempt to characterize how well connected it is. Let's start with the average clustering coefficient. This is a measure of how tightly each node's neighbors are connected to one another. One way to estimate it is to pick a node in the network and sample two of its neighbors at random. If those two neighbors are also connected to each other, the three nodes form a triangle. The average clustering coefficient is then the number of triangles found divided by the number of trials.


> 0.023916867875055498
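
As a rough illustration, the exact (non-sampled) version of this quantity averages, over all nodes, the fraction of each node's neighbor pairs that are themselves connected. The sketch below is plain Python on a made-up four-node graph; the function names are assumptions, not Networkx calls.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of pairs of `node`'s neighbors that are directly connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # fewer than two neighbors: no triangle is possible
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))


def average_clustering(adj):
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


# Triangle 1-2-3 with an extra node 4 hanging off node 3
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
avg_cc = average_clustering(adj)
print(avg_cc)
```

Nodes 1 and 2 score 1.0, node 3 scores 1/3 (only one of its three neighbor pairs is linked), and node 4 scores 0, so the average is 7/12.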


The density tells us how well connected our network is. Specifically, it is the ratio of the actual connections in the network to the number of possible connections. For example, consider 50 people at a family reunion vs. 50 people on a bus. The network of people at the family reunion is likely to have a high density since everyone probably has some level of relationship with everyone else. By contrast, the network of people on the bus likely has low density since most of these people probably do not even know one another.

Applying this interpretation to our data set, we find that in fact the network is not particularly dense. This makes a certain amount of sense. We aren't dealing with a purely social forum, so you would not necessarily expect to see a lot of frivolous conversation. Also, since we are only looking at high scoring questions, there are likely some additional connections between users from lower scoring comments that are not seen here.

However, there may still be some interesting relationships if we dig deeper, so let's do that.


> 0.004432431933804096
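
The density figure can be reproduced from the counts reported earlier. For an undirected graph with N nodes there are N(N-1)/2 possible edges, so a minimal sketch (function name chosen for illustration) looks like this:

```python
def density(num_nodes, num_edges):
    possible_edges = num_nodes * (num_nodes - 1) / 2
    return num_edges / possible_edges


d = density(659, 961)
print(d)
```

With 659 nodes and 961 edges this works out to roughly 0.0044, in line with the value above.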


In order to add color to the graph, we'll apply some graph metrics to add attributes to the nodes. These measures give us insight into the importance and connectedness of each user based on their relationship to the rest of the network.

The first two node attributes we will look at are measures of centrality. The simplest of these is called degree centrality, which is the fraction of nodes in the network that a given node shares a connection with. In the context of this network, we can interpret this score as the propensity for an individual to exert significant influence on the network, since their remarks would have high exposure and visibility.
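
A small sketch of degree centrality in plain Python, assuming the graph is stored as a dict of neighbor sets (the toy star graph is made up): each node's degree is divided by the N-1 other nodes it could be connected to.

```python
def degree_centrality(adj):
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}


# Star graph: "a" is connected to everyone else
adj = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
dc = degree_centrality(adj)
print(dc["a"], dc["b"])

# In a 659-node graph, a user with a single connection scores 1/658
single_link_score = 1 / (659 - 1)
print(single_link_score)
```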

The other centrality measure we want to add is called the betweenness centrality. This metric is interesting because it allows us to find nodes of the network that act as connections between other nodes. It is calculated by counting the number of times the node appears as part of the shortest path between a pair of nodes. The more times a node appears in the shortest path, the higher its score. Nodes with high betweenness can also have a high influence factor if they also connect to nodes with high degree.
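
To show the shortest-path counting in action, here is a compact sketch of Brandes' algorithm for an unweighted, undirected graph, written in plain Python. It is an illustration under those assumptions rather than the Networkx source, and the normalization (dividing by (N-1)(N-2)) is chosen so that scores fall between 0 and 1.

```python
from collections import deque


def betweenness_centrality(adj):
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma)
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies
        delta = dict.fromkeys(adj, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    n = len(adj)
    # every pair is visited from both ends, so this scale maps scores into [0, 1]
    scale = 1 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {v: b * scale for v, b in bc.items()}


# Path graph 1 - 2 - 3 - 4: the two middle nodes sit on most shortest paths
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
bc = betweenness_centrality(adj)
print(bc)
```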

The next two node attributes are measures of clustering. Clusters are computed by looking at first- and second-order neighbors and finding triangles (two nodes connected to the root and to each other) and squares (two nodes connected to the root node that also share another common neighbor). The default clustering algorithm computes the ratio of actual to potential triangles for a given node. In a similar manner, the square clustering coefficient is the fraction of actual squares for a given node relative to potential squares. These clusters can point to members of the network that form cliques or neighborhoods based on the association they have with common neighbors.

The final node attribute we will look at is called eccentricity. Eccentricity is the distance from a node to the node farthest from it in the network. In this example, the eccentricity is inversely correlated to the degree centrality, so it doesn't add much new information. We'll keep it in our visualization, since it gives us a different way to look at our network.
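
Eccentricity is easy to sketch with a breadth-first search: measure the hop distance from one node to every other node and keep the largest. The helper below is a plain-Python illustration on a made-up path graph and assumes the graph is connected, which is why we kept only the largest connected component earlier.

```python
from collections import deque


def eccentricity(adj, source):
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) != len(adj):
        raise ValueError("eccentricity is undefined on a disconnected graph")
    return max(dist.values())


# Path graph 1 - 2 - 3 - 4
adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
ecc_end = eccentricity(adj, 1)
ecc_mid = eccentricity(adj, 2)
print(ecc_end, ecc_mid)
```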


centrality = nx.degree_centrality(G)
betweenness_centrality = nx.betweenness_centrality(G)
tr_clstr = nx.clustering(G)
sq_clstr = nx.square_clustering(G)
eccentricity = nx.eccentricity(G)

node = []
deg_cent = []
bet_cent = []
tr_cl = []
sq_cl = []
ecc = []
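for v in G.nodes():
    node.append(v)
    deg_cent.append(centrality[v])
    bet_cent.append(betweenness_centrality[v])
    tr_cl.append(tr_clstr[v])
    sq_cl.append(sq_clstr[v])
    ecc.append(eccentricity[v])
# materialize the zip so it can be printed here and reused when building the json
node_attr = list(zip(node, deg_cent, bet_cent, tr_cl, sq_cl, ecc))
print("(node, deg_cent, bet_cent, tr_cl, sq_cl, ecc)")
print(node_attr[0])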
> (node, deg_cent, bet_cent, tr_cl, sq_cl, ecc)
> (3205, 0.001519756838905775, 0.0, 0.0, 0.0, 8)


Now that we have some metrics to score our network, let's build a visualization to let us see what this thing actually looks like.


Both Python and Networkx have a number of methods that visualize network data, and they all do a decent job. However, in order to enable interactions with the data and build a framework that we can use to incorporate new types of analysis, we are going to build the visualization using the JavaScript charting library, D3. This library has a steep learning curve, but the trade-off is that it allows you to build almost anything you can imagine.


The first thing to do is pass the data to our visualization window. We will do that by attaching it to the global 'window' object which is a base object in any web-based application. We'll also go ahead and convert our data to .json format which is the standard for Javascript and D3 based visualizations. You could also convert to json later in Javascript but it's so easy to do in Python, so why bother?


The syntax here is pretty straightforward. We'll use json.dumps as well as some standard Python functions like zip to build a json object that our force simulation will like. For the visualization, we won't use the tags just yet, but we have them set up, since they will be important in part 2 of this series.



edges = G.edges()
edgesJson = json.dumps([{'source': source, 'target': target} for source, target in edges], default = str, indent=2, sort_keys=True) # called a 'list comprehension'
nodesJson = json.dumps([{'id': node, 'degree_cent': centrl, 'betweenness_cent': btwn, 'tr_clstr':tr_cl, 'sq_clstr': sq_cl, 'eccentricity': ecc} for node, centrl, btwn, tr_cl, sq_cl, ecc in node_attr], indent=4)
tagsJson = json.dumps([{'id': id, 'tag': tag} for id, tag in zip(tags_sample['Id'], tags_sample['Tag'])], default = str, indent=4, sort_keys=True)
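
To see the shape of what the force simulation receives, here is a stripped-down sketch using only the standard library. The ids and scores are invented, and only two of the node attributes are included to keep it short.

```python
import json

# Hypothetical sample: three users and two question/answer connections
edges = [(3205, 919), (919, 805)]
node_attr = [(3205, 0.5, 0.0), (919, 1.0, 1.0), (805, 0.5, 0.0)]  # (id, degree_cent, betweenness_cent)

edges_json = json.dumps([{"source": s, "target": t} for s, t in edges])
nodes_json = json.dumps(
    [{"id": n, "degree_cent": dc, "betweenness_cent": bc} for n, dc, bc in node_attr]
)

print(edges_json)
print(nodes_json)
```

Each edge's `source` and `target` must match a node `id`, since that is how the link force looks up its endpoints. Note that `default=str` in the real cell is only called for values that json cannot serialize natively, so ordinary ints and floats pass through unchanged.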


We'll also go ahead and add another json here, which will later be used to drive the interactions with the visualization. Each object in this json will correspond to a function that alters properties of the visualization, such as the node attribute selection, the gravity, and the highlighted sub-graph.


controlsJson = '[ { "control": "clear", "abbrev": "CLR", "index": 0 }, { "control": "gravity_up", "abbrev": "H G", "index": 1 }, { "control": "gravity_down", "abbrev": "L G", "index": 2 }, { "control": "degree_cent", "abbrev": "DC", "index": 3 }, { "control": "betweenness_cent", "abbrev": "BC", "index": 4 }, {"control":"tr_clstr", "abbrev": "TC", "index" : 5},{ "control": "sq_clstr", "abbrev": "SQC", "index": 6 }, { "control": "eccentricity", "abbrev": "ECC", "index": 7 } ] '


It's a bit hacky, but this is how we get the data downstream to the visualization.
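
A minimal sketch of that hand-off, assuming the json strings are assigned to `window.nodes`, `window.edges`, and `window.controls` (the names the D3 script reads in its main section). The helper `build_window_script` is hypothetical and the payloads are toy values; in the notebook, the resulting string would be passed to the `Javascript` display class imported at the top.

```python
import json


def build_window_script(**payloads):
    """Return JavaScript source that assigns each json string to window.<name>."""
    statements = []
    for name, payload in payloads.items():
        json.loads(payload)  # fail early if a payload is not valid json
        statements.append("window.{} = {};".format(name, payload))
    return "\n".join(statements)


script = build_window_script(
    nodes='[{"id": 1}]',
    edges='[{"source": 1, "target": 2}]',
    controls='[{"control": "clear", "abbrev": "CLR", "index": 0}]',
)
print(script)
# In a notebook cell: Javascript(script)
```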










Now that we have the data prepared, it's time to build our D3 visualization!

There are lots of ways to visualize graph data, but few libraries have both the ability to build in user interactions and the power to control every detail of the graphical representation. D3 does both of these things exceptionally well.

To add D3 to our workbook, we use a %%javascript tag in the cell and call require.config to define our path to D3.


%%javascript
require.config({
    paths: {
        d3: '//'
    }
});


The last step is to load our D3 code. This is accomplished by copying it into the callback of the require function below. We also wrap it in a try/catch block so that any subsequent error messages are clearly attributed to this visualization in the log.

One of the trade-offs of using D3 is that while you are afforded complete control over the layout of the result, getting there can be kind of tedious.

Let's go through what's happening below at a high level. The script loads the data that was previously attached to the window element, defines a force simulation on the nodes and edges, and renders the result.

We start with a bit of jQuery to remove any previous DOM element with the id chart1. Then we append a new div with id=chart1. Next, we define some parameters and add our svg element which will contain all of the objects associated with our visualization.

Following that, we set up our force simulation. In D3, force simulation is a simple physics engine that automatically positions nodes based on attributes associated with the nodes and edges. By default, there are no forces initialized by the simulation.

Let's add a few forces and talk about what they do.  d3.forceLink() defines a spring force that connects the nodes together.   d3.forceManyBody() sets up a charge force which repels the nodes.  d3.forceCenter() modifies the node positions so that the center of mass is in the center of the viewport.

The next things we set up are the control switches. To avoid writing external HTML controls, I use svg rectangles that we can bind to interactions in our graph. The first, labeled CLR, can be used to clear the currently selected sub-graph. The second and third switches toggle between high and low gravity (well, really charge).  This acts sort of like a zoom functionality. The last five switches toggle the node coloring between the graph metrics that were calculated in Networkx.

The real meat and potatoes of the visualization starts with the drawSimulation() function. This function takes two arguments, nodes and edges. Next, we assign the svg shape objects that will represent the nodes and edges. What is neat about this formulation in D3 is the simulation doesn't restrict us to any particular visual representation of our network. We could use rectangles, triangles, pictures of cats, whatever. Since we are all reasonable Data Scientists, we'll let our nodes be circles and the edges be lines.

Following our shape object definitions, we tell the simulation how to handle the nodes and edges. For every 'tick' of the simulation, we update the positions of the nodes and edges. We also tell it that the link object relates to the forceLink() function, initialized earlier.

Since we want the graph to be interactive, there is also a bit of event handling to tell the simulation what to do when we manually reposition the nodes. To accomplish this, there are three functions: dragstarted, dragged, and dragended. These tell the simulation to set the position of a dragged node to the mouse position, until dropped by the user.

That's basically it. If you play around with the graph, you will notice there are some additional features. Click and hold a node to highlight a sub-graph. When released, the selection persists so you can examine other connected nodes. In this way, you can traverse the graph and follow the paths between groups of nodes. Check it out here.



require(['d3'], function(d3){
try {
//create canvas
$("#chart1").remove();
element.append("<div id='chart1'></div>");

var margin = {top: 50, right: 50, bottom: 50, left: 50};
var width = 1000 - margin.left - margin.right;
var height = 900 - margin.top - margin.bottom;
var forceCenterOffset = {x: 50, y: 50};
var svg = d3.select("#chart1").append("svg")
    .style("position", "relative")
    .attr("width", width + "px")
    .attr("height", (height) + "px")
    .attr("transform", "translate(" + margin.left + "," + margin.top + ")");
var simulation = d3.forceSimulation()
.force("link", d3.forceLink().id(function(d) { return d.id; }).strength(0.5))
.force("charge", d3.forceManyBody().strength(-5))
.force("center", d3.forceCenter(width / 2 - forceCenterOffset.x, height / 2- forceCenterOffset.y));
var colorScale = d3.scaleLinear().range(["#6b24a5", "#ffffff"]);
var strokeWidth = 1.0, cSize = 4;
var controlBoxes = svg.append("g")
    .attr("class", "control-boxes");
svg.append("text")
    .style("font","14px sans-serif")
    .attr("transform", "translate(175,0)")
    .text("Click and hold to highlight connections. Boxes on the right adjust graph settings.");
var legendSize = {width: 20, height: 200};
var colorLegendYScale = d3.scaleLinear().range([0, legendSize.height]);
var div = d3.select("#chart1").append("div")
.attr("class", "graph-tooltip")
.style("opacity", 0)
.style("z-index", 1);
var colorLegend = svg.append("g")
    .attr("class", "legend");
var linearGradient = colorLegend.append("defs")
  .append("linearGradient")
    .attr("id", "linear-gradient");
// vertical gradient, top to bottom
linearGradient
    .attr("x1", "0%")
    .attr("y1", "0%")
    .attr("x2", "0%")
    .attr("y2", "100%");
// one gradient stop per color in the color scale
linearGradient.selectAll("stop")
    .data(colorScale.range())
  .enter().append("stop")
    .attr("offset", function(d,i) { return i/(colorScale.range().length-1); })
    .attr("stop-color", function(d) { return d; })
    .attr('stop-opacity', 1);
colorLegend.append("rect")
    .attr("width", legendSize.width)
    .attr("height", legendSize.height)
    .attr("transform", "translate(0,0)")
    .style("fill", "url(#linear-gradient)")
    .style("stroke", "black");

function drawSimulation(nodes, edges){
  var nodeAttrSelection = "degree_cent";
  setColorScale(nodes, nodeAttrSelection);
  // one svg line per edge
  var link = svg.append("g")
      .attr("class", "links")
    .selectAll("line")
    .data(edges)
    .enter().append("line")
      .attr("class", "edge")
      .style("stroke-width", 1.5)
      .style("stroke", "#bbb");
  // one svg circle per node
  var node = svg.append("g")
      .attr("class", "nodes")
    .selectAll("circle")
    .data(nodes)
    .enter().append("circle")
      .attr("class", "node")
      .attr("r", cSize)
      .style("stroke", "black")
      .style("stroke-width", strokeWidth)
      .style("fill", function(d){return colorScale(nodeAttrAccessor(d, nodeAttrSelection)); })
      .call(d3.drag()
          .on("start", dragstarted)
          .on("drag", dragged)
          .on("end", dragended));
  nodeTooltip(node, nodeAttrSelection);
  simulation
      .nodes(nodes)
      .on("tick", ticked);
  simulation.force("link")
      .links(edges);

  // update positions on every tick of the simulation
  function ticked() {
    link
        .attr("x1", function(d) { return d.source.x; })
        .attr("y1", function(d) { return d.source.y; })
        .attr("x2", function(d) { return d.target.x; })
        .attr("y2", function(d) { return d.target.y; });
    node
        .attr("cx", function(d) { return d.x; })
        .attr("cy", function(d) { return d.y; });
  }

  function dragstarted(d) {
    if (!d3.event.active) simulation.alphaTarget(0.3).restart();
    d.fx = d.x;
    d.fy = d.y;
    hideOtherNodes(d); // assumed hook for the click-and-hold highlight
  }
  function dragged(d) {
    d.fx = d3.event.x;
    d.fy = d3.event.y;
  }
  function dragended(d) {
    if (!d3.event.active) simulation.alphaTarget(0);
    d.fx = null;
    d.fy = null;
  }
}

function castNodeData(nodeData){
  // coerce every metric to a number
  nodeData.forEach(function(d) {
    d.degree_cent = +d.degree_cent;
    d.betweenness_cent = +d.betweenness_cent;
    d.tr_clstr = +d.tr_clstr;
    d.sq_clstr = +d.sq_clstr;
    d.eccentricity = +d.eccentricity;
  });
}

function hideOtherNodes(d){
  var g_nodes = svg.selectAll(".node");
  var g_edges = svg.selectAll(".edge");
  var shownNodes = [];
  g_edges.filter(function (x) {
      if (d.id != x.source.id && d.id != x.target.id) {
        return true;
      } else {
        // push ids for nodes connected to dragged node
        shownNodes.push(x.source.id);
        shownNodes.push(x.target.id);
        return false;
      }
    })
      // fade out everything not connected to dragged node
      .style("stroke", "#bbb")
      .style("stroke-opacity", 0.1);
  g_edges.filter(function(x){ return d.id === x.source.id || d.id === x.target.id; })
      .style("stroke", "#000000");
  g_nodes.filter(function (x) { return (shownNodes.indexOf(x.id) === -1); })
      .style("fill-opacity", 0.1)
      .style("stroke-opacity", 0.1)
      .style("stroke", "#000000")
      .style("stroke-width", strokeWidth);
  g_nodes.filter(function (x) { return (shownNodes.indexOf(x.id) != -1); })
      .style("stroke", "#3039e8")
      .style("stroke-width", 2*strokeWidth)
      .style("r", 2*cSize);
}
function showOtherNodes(d){
  var g_nodes = svg.selectAll(".node");
  var g_edges = svg.selectAll(".edge");
  g_nodes
      .style("fill-opacity", 1)
      .style("stroke-opacity", 1)
      .style("r", cSize);
  g_edges
      .style("stroke-opacity", 0.6);
}

function nodeAttrAccessor(d, valueType) {
  if (valueType === "degree_cent") {
    return d.degree_cent;
  } else if (valueType === "betweenness_cent") {
    return d.betweenness_cent;
  } else if (valueType === "tr_clstr") {
    return d.tr_clstr;
  } else if (valueType === "sq_clstr") {
    return d.sq_clstr;
  } else if (valueType === "eccentricity") {
    return d.eccentricity;
  }
}

function drawControls(controls, nodes){
  var transitionDuration = 75;
  var controlBoxSize = 30;
  var controlBoxScaleUp = 1.33;
  // one group per control, stacked vertically
  var g_box = controlBoxes.selectAll("g")
      .data(controls)
    .enter().append("g")
      .attr("transform", function (d,i){
        return "translate("+(width - 150)+","+(i*(controlBoxSize+ 5))+")";
      })
      .attr("class", "controls");
  g_box.append("rect")
      .attr("class", "control")
      .attr("width", controlBoxSize)
      .attr("height", controlBoxSize)
      .style("stroke", function(d){
        if (d.control === "clear") {
          return "#3039e8";
        } else {
          return "black";
        }
      })
      .style("fill", function(d){
        if (d.control === "clear") {
          return "#ffffff";
        } else if (d.control === "gravity_up" || d.control === "gravity_down") {
          return "#b8b9bc";
        } else {
          return "#b592d2";
        }
      });
  g_box.append("text")
      .attr("x", 0.08*controlBoxSize)
      .attr("y", 0.6*controlBoxSize)
      .text(function(d){ return d.abbrev; });
  g_box
      .on("click", function(d){
        if (d.control === "clear") {
          showOtherNodes();
          resetNodeBorder();
        } else if (d.control === "gravity_up") {
          changeGravity("up");
        } else if (d.control === "gravity_down") {
          changeGravity("down");
        } else {
          setNodeAttribute(d.control, nodes);
        }
      })
      .on("mouseover", function(d, i){
        // enlarge the hovered box and push the boxes below it down
        d3.select(this).select("rect")
          .transition().duration(transitionDuration)
            .attr("width", controlBoxSize*controlBoxScaleUp)
            .attr("height", controlBoxSize*controlBoxScaleUp)
            .style("stroke-width", 2);
        var index = d.index, additionalOffset = (controlBoxScaleUp-1)*controlBoxSize;
        g_box.transition().duration(transitionDuration)
            .attr("transform", function (d,i){
              if ( i > index) {
                return "translate("+(width - 150)+","+(i*(controlBoxSize+5)+additionalOffset)+")";
              } else {
                return "translate("+(width - 150)+","+(i*(controlBoxSize+5))+")";
              }
            });
        controlTooltip(g_box, index);
      })
      .on("mouseout", function(d){
        d3.select(this).select("rect")
          .transition().duration(transitionDuration)
            .attr("width", controlBoxSize)
            .attr("height", controlBoxSize)
            .style("stroke-width", 1);
        g_box.transition().duration(transitionDuration)
            .attr("transform", function (d,i){
              return "translate("+(width - 150)+","+(i*(controlBoxSize+ 5))+")";
            });
      });
}
function resetNodeBorder(){
  var g_nodes = svg.selectAll(".node");
  var g_edges = svg.selectAll(".edge");
  g_nodes
      .style("stroke", "#000000")
      .style("stroke-width", strokeWidth);
  g_edges
      .style("stroke", "#bbb");
}

function changeGravity(direction){
  if (direction === "down") {
    simulation.force("charge", d3.forceManyBody().strength(-25));
    simulation.alphaTarget(0.3).restart(); // reheat so the new charge takes effect
    setTimeout(function() { simulation.alphaTarget(0); }, 2500);
  } else if (direction === "up") {
    simulation.force("charge", d3.forceManyBody().strength(-5));
    simulation.alphaTarget(0.3).restart(); // reheat so the new charge takes effect
    setTimeout(function() { simulation.alphaTarget(0); }, 2500);
  } else {
    simulation
        .force("charge", d3.forceManyBody().strength(-5));
  }
}

function setNodeAttribute(attributeType, nodes){
  setColorScale(nodes, attributeType);
  var node = svg.selectAll("circle.node")
      .style("fill", function(d){return colorScale(nodeAttrAccessor(d, attributeType)); });
  nodeTooltip(node, attributeType);
}

function setColorScale(nodes, attributeType){
  var nodeAttrMax = d3.max(nodes, function(d){ return nodeAttrAccessor(d, attributeType);});
  //var nodeAttrMin = 0;
  var nodeAttrMin = d3.min(nodes, function(d){ return nodeAttrAccessor(d, attributeType);});
  var nodeAttrScaleAdj = (nodeAttrMax - nodeAttrMin);
  var nodeAttrExtent = [(nodeAttrMax-nodeAttrScaleAdj*0.25), (nodeAttrMin)];
  colorScale.domain(nodeAttrExtent);
  setLegendScale(nodes, attributeType, nodeAttrExtent);
}

function nodeTooltip(node, nodeAttrSelection){
  node
      .on("mouseover", function(d) {
        // fade the tooltip in next to the pointer
        div.transition().duration(200)
            .style("opacity", .9);
        div.html("id:" + d.id + "<br/>" + nodeAttrSelection + ": " + nodeAttrAccessor(d, nodeAttrSelection).toFixed(5))
            .style("left", (d3.event.pageX) + "px")
            .style("top", (d3.event.pageY) + "px");
        console.log("x: "+d3.event.pageX+"; y: "+d3.event.pageY);
      })
      .on("mouseout", function(d) {
        div.transition().duration(500)
            .style("opacity", 0);
      });
}

function controlTooltip(cBox, index){
var tooltipHTML = "";
// each case needs its own break, otherwise every control falls through to the last label
switch(index) {
  case 0:
    tooltipHTML = "Clear Selection";
    break;
  case 1:
    tooltipHTML = "High Gravity";
    break;
  case 2:
    tooltipHTML = "Low Gravity";
    break;
  case 3:
    tooltipHTML = "Show Degree Centrality";
    break;
  case 4:
    tooltipHTML = "Show Betweenness Centrality";
    break;
  case 5:
    tooltipHTML = "Show Triangle Clustering";
    break;
  case 6:
    tooltipHTML = "Show Square Clustering";
    break;
  case 7:
    tooltipHTML = "Show Eccentricity";
    break;
}
.on("mouseover", function(d) {
.style("opacity", .9);
.style("left", (d3.event.pageX) + "px")
.style("top", (d3.event.pageY) + "px");
console.log("x: "+d3.event.pageX+"; y: "+d3.event.pageY);
.on("mouseout", function(d) {
.style("opacity", 0);

function setLegendScale(data, nodeAttrSelection, colorDomain){
var colorLegendYAxis = d3.axisRight(colorLegendYScale);
.attr("class","y axis")
.attr("transform", "translate(25,0)");
.attr("class", "tick")
.attr("transform", "rotate(-90)")
.attr("y", 6)
.attr("dy", ".71em")
.style("text-anchor", "end");
.attr("class", "label")
.attr("transform", "translate(-5,200)")
.style("font","14px sans-serif")
.attr("transform", "rotate(-90)")

function setCSS(){"#chart1")
.style("font", "10px sans-serif");

.style("pointer-events", "none");"div.graph-tooltip")
.style("position", "fixed")
.style("font","12px sans-serif")


/* ************************************************************** */
// MAIN:
/* ************************************************************** */
var nodesData = window.nodes;
var edgesData = window.edges;
var controlsData = window.controls;

drawSimulation(nodesData, edgesData);
drawControls(controlsData, nodesData);
} catch(err) {
  console.log("Viz Error: ");
  console.log(err);
}
});
