Step 3: Discovering Emergent Themes
In the next stage of coding, often called “pattern coding” (Miles, Huberman, and Saldaña 2020), we group the descriptive codes made in the previous phase into a smaller number of categories or themes. Themes or categories “are broad units of information that consist of several codes aggregated to form a common idea” (Creswell and Poth 2018, 194). These categories can be thought of as somewhat of a meta-code.*
Categories should span multiple codes that were previously identified. These categories “capture some recurring pattern that cuts across your data” (Merriam and Tisdell 2016, 207). Merriam and Tisdell (2016) suggest this process of discovering themes from codes feels somewhat like constantly transitioning one’s perspective of a forest, from looking at the “trees” (codes) to the “forest” (themes) and back to the trees.
Discoving Themes
As I looked over my descriptive codes, I asked myself what these codes tell me about the nature of the data science skills students used in their projects. Some themes immediately jumped out at me, whereas others took a bit of time to mull over. I’ll walk you through my process below.
“Obvious” Themes
There were two themes I expected to see due to the nature of the project and the requirements stipulated by the professor. For their project, students were expected to (1) use an analysis strategy learned in the course and (2) create a visualization to accompany any analysis and resulting discussion. Thus, I expected themes of “Data Model” and “Data Visualization” to emerge from the data.
From my own experiences, I also expected that students would need to perform some aspect of data wrangling to prepare their data for analysis. The data students used for their project were from their own research, so, although I knew data wrangling would play some role, I was unsure what type of tasks might appear in the codes.
Emergent Themes
While I was looking over the data wrangling tasks students performed in their projects, I noticed the techniques called upon specific attributes of different data structures (e.g., dataframe, vector, matrix). The implementation of some tasks was fairly uniform (select variable from dataframe using $
operator), whereas other tasks were highly variable. Data filtering was sometimes done with the subset()
function, which requires little explicit knowledge of data structures. However, other times this filtering was carried out using the []
/ extraction operator, a technique which requires an understanding of how extraction differs across different data structures.
I also noticed while looking at the R code for the “Data Model” and “Data Visualization” themes that certain statements of code included some knowledge (or lack thereof) regarding the R Environment. The most obvious statement that evoked this theme used with()
to temporarily attach a dataframe. There were, however, other statements that also fit into this theme, such as function arguments being bypassed, sourcing in an external R script, loading in datasets, and loading in packages.
Within the themes of “Data Model” and “Data Wrangling,” I uncovered an additional theme which speaks to the efficiency of a statement of code. The notion of efficiency came to me from the “don’t repeat yourself” principle (Wilson et al. 2014), which recommends scientists modularize their code rather than copying and pasting and re-use their code instead of rewriting it (p. 2). Thus, I considered code which adhered to these practices “efficient” and code which did not adhere to these practices “inefficient.”
The final theme I discovered were statements of code whose purpose was more for a student’s workflow than anything else. Code comments were my first indication of this theme, where students used code comments to create sections of code or flag what was happening in a particular line / lines of code. I expanded this theme to include statements of code which inspect some characteristic of an object (e.g., structure of a datafame, names of a dataframe, summary of a linear model).
Assigning Descriptive Codes to Themes
For each of the themes outlined above, the associated “atoms” / statements of code are listed. Keep in mind one statement can apply to two themes! For example, the code
linearAnterior <- lm(PADataNoOutlier$Lipid ~ PADataNoOutlier$PSUA)
applies to three themes. First and foremost, this code uses lm()
to fit a linear regression model to the data (data model). Second, in order to fit the data model, the student uses data wrangling to select the variables of interest(PADataNoOutlier$Lipid
, PADataNoOutlier$PSUA
). Finally, this code does not make use of the data =
argument built in to lm()
, which implies a lack of understanding of the function and thus the R environment.
Data Model
Definition: Statements of code whose purpose is to create a statistical model from data.
Data Visualization
Definition: Statements of code whose purpose is to visualize relationships between variables
Sub-themes
- scatterplot
- adding lines to plot
- differentiated colors
- including a legend
- changing plotting environment
- modifying axis labels / plot titles
Data Wrangling
Definition: Statements of code whose purpose is to prepare a dataset for analysis and / or visualization
Sub-themes
- selecting variables
- filtering observations
- mutating variables
Data Structures
Definition: An statement of code which explicitly calls upon attributes of a data structure (e.g., dataframe, matrix, vector)
R Environment
Definition: A statement of code which calls on explicit aspects of the R environment
Efficiency / Inefficiency
Definition: A statement of code which adheres to the “don’t repeat yourself” principle
Workflow
Definition: A statement of code which facilitates a smooth execution of a working process