A Guide to Qualitatively Analyzing Computing Code

Step 3: Discovering Emergent Themes

In the next stage of coding, often called “pattern coding” (Miles, Huberman, and Saldaña 2020), we group the descriptive codes made in the previous phase into a smaller number of categories or themes. Themes or categories “are broad units of information that consist of several codes aggregated to form a common idea” (Creswell and Poth 2018, 194). These categories can be thought of as somewhat of a meta-code.*

*For quantitative researchers, this process can be thought of as an analog to cluster-oriented or factor-oriented approaches in statistical analysis.
Miles, M. B., A. M. Huberman, and J. Saldaña. 2020. Qualitative Data Analysis. Thousand Oaks, CA: Sage.
Creswell, J. W., and C. N. Poth. 2018. Qualitative Inquiry & Research Design. Thousand Oaks, CA: Sage.
Merriam, S. B., and E. J. Tisdell. 2016. Qualitative Research. San Francisco, CA: John Wiley & Sons.

Categories should span multiple codes that were previously identified. These categories “capture some recurring pattern that cuts across your data” (Merriam and Tisdell 2016, 207). Merriam and Tisdell (2016) suggest that discovering themes from codes feels like repeatedly shifting one’s view of a forest: from the “trees” (codes) to the “forest” (themes) and back to the trees.

Discovering Themes

As I looked over my descriptive codes, I asked myself what these codes tell me about the nature of the data science skills students used in their projects. Some themes immediately jumped out at me, whereas others took a bit of time to mull over. I’ll walk you through my process below.

“Obvious” Themes

There were two themes I expected to see due to the nature of the project and the requirements stipulated by the professor. For their project, students were expected to (1) use an analysis strategy learned in the course and (2) create a visualization to accompany any analysis and resulting discussion. Thus, I expected themes of “Data Model” and “Data Visualization” to emerge from the data.

From my own experiences, I also expected that students would need to perform some aspect of data wrangling to prepare their data for analysis. The data students used for their project were from their own research, so, although I knew data wrangling would play some role, I was unsure what type of tasks might appear in the codes.

Emergent Themes

While I was looking over the data wrangling tasks students performed in their projects, I noticed the techniques called upon specific attributes of different data structures (e.g., dataframe, vector, matrix). The implementation of some tasks was fairly uniform (select variable from dataframe using $ operator), whereas other tasks were highly variable. Data filtering was sometimes done with the subset() function, which requires little explicit knowledge of data structures. However, other times this filtering was carried out using the [] / extraction operator, a technique which requires an understanding of how extraction differs across different data structures.
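To make the contrast concrete, here is a minimal sketch using a small made-up dataframe (`fish`, with hypothetical columns `Age` and `Weight`; this is not the students' data). It shows the same filter written with subset() and with [], and how [] is written differently for a dataframe than for a vector.

```r
# Hypothetical toy data; object and column names are illustrative only
fish <- data.frame(Age = c(1, 1, 2, 3), Weight = c(10, 12, 20, NA))

# subset(): the condition is evaluated inside the dataframe,
# so no knowledge of row / column indexing is needed
age1_subset <- subset(fish, Age == 1)

# []: on a dataframe, rows go before the comma and columns after it
age1_bracket <- fish[fish$Age == 1, ]

# []: a vector has only one dimension, so there is no comma
age1_weights <- fish$Weight[fish$Age == 1]

# Summary of the extracted elements
mean_age1 <- mean(age1_weights, na.rm = TRUE)
```

The bracket versions require the writer to know whether the object being indexed has one dimension or two, which is the kind of knowledge this theme is meant to capture.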

I also noticed while looking at the R code for the “Data Model” and “Data Visualization” themes that certain statements of code included some knowledge (or lack thereof) regarding the R Environment. The most obvious statement that evoked this theme used with() to temporarily attach a dataframe. There were, however, other statements that also fit into this theme, such as function arguments being bypassed, sourcing in an external R script, loading in datasets, and loading in packages.
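As a rough illustration of what "knowledge of the R Environment" can look like in code, the sketch below uses a made-up dataframe named `gas` with hypothetical values (the students' actual data are not reproduced). It shows with() evaluating an expression against a dataframe's columns, and subset() finding a column with or without the dataframe prefix.

```r
# Hypothetical toy data standing in for a student's dataset
gas <- data.frame(carboy = c("D", "D", "E"), days = c(0, 5, 0))

# with() evaluates the expression using the dataframe's columns,
# without permanently attaching the dataframe to the search path
mean_days <- with(gas, mean(days))

# subset() also looks up column names inside the dataframe,
# so the gas$ prefix in the condition is optional
days_prefixed <- subset(gas, gas$carboy == "D")$days
days_bare     <- subset(gas, carboy == "D")$days
```

Both subset() calls return the same values; the prefixed form simply repeats information the function already has.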

Within the themes of “Data Model” and “Data Wrangling,” I uncovered an additional theme which speaks to the efficiency of a statement of code. The notion of efficiency came to me from the “don’t repeat yourself” principle (Wilson et al. 2014), which recommends scientists modularize their code rather than copying and pasting and re-use their code instead of rewriting it (p. 2). Thus, I considered code which adhered to these practices “efficient” and code which did not adhere to these practices “inefficient.”

Wilson, Greg, D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, et al. 2014. “Best Practices for Scientific Computing.” PLOS Biology 12 (1): e1001745.
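The "don't repeat yourself" idea can be sketched with a small made-up dataframe (`growth`, with hypothetical values). The first version repeats one statement per age, in the style seen in the students' code; the second is one possible alternative using base R's tapply(). It is an illustration of the principle, not a claim about how the students should have written their analysis.

```r
# Hypothetical toy data; names mirror the style of the students' code
growth <- data.frame(Age = c(1, 1, 2, 2, 3),
                     Weight = c(10, 12, 20, NA, 31))

# Repetitive version: one near-identical statement per age
Weight1 <- mean(growth$Weight[growth$Age == 1], na.rm = TRUE)
Weight2 <- mean(growth$Weight[growth$Age == 2], na.rm = TRUE)
Weight3 <- mean(growth$Weight[growth$Age == 3], na.rm = TRUE)

# One possible alternative: compute every group mean in a single call
mean_by_age <- tapply(growth$Weight, growth$Age, mean, na.rm = TRUE)
```

Under the definition used here, the first three statements would be coded as "inefficient" and the tapply() call as "efficient."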

The final theme I discovered comprised statements of code whose purpose was more for a student’s workflow than anything else. Code comments were my first indication of this theme, where students used code comments to create sections of code or flag what was happening in a particular line or lines of code. I expanded this theme to include statements of code which inspect some characteristic of an object (e.g., structure of a dataframe, names of a dataframe, summary of a linear model).

Assigning Descriptive Codes to Themes

For each of the themes outlined above, the associated “atoms” / statements of code are listed below. Keep in mind one statement can apply to more than one theme! For example, the code

linearAnterior <- lm(PADataNoOutlier$Lipid ~ PADataNoOutlier$PSUA)

applies to three themes. First and foremost, this code uses lm() to fit a linear regression model to the data (data model). Second, in order to fit the data model, the student uses data wrangling to select the variables of interest (PADataNoOutlier$Lipid, PADataNoOutlier$PSUA). Finally, this code does not make use of the data = argument built into lm(), which suggests a lack of understanding of the function and thus the R environment.
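For comparison, here is a minimal sketch of the same kind of model fit both ways. The dataframe `pa_data` and its values are made up for illustration (the real PADataNoOutlier is not reproduced here); only the column names Lipid and PSUA are taken from the student's code.

```r
# Hypothetical toy data; values are invented for illustration
pa_data <- data.frame(PSUA  = c(1, 2, 3, 4, 5),
                      Lipid = c(2.1, 3.9, 6.2, 7.8, 10.1))

# Style used in the student's code: columns pulled out with $
fit_dollar <- lm(pa_data$Lipid ~ pa_data$PSUA)

# Alternative using the data = argument of lm()
fit_data <- lm(Lipid ~ PSUA, data = pa_data)

# Both fits estimate the same intercept and slope
coef_dollar <- unname(coef(fit_dollar))
coef_data   <- unname(coef(fit_data))
```

The estimates agree, so the difference is one of style and downstream convenience: the data = form keeps variable names short and generally works more smoothly with functions such as predict() when new data are supplied.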

Data Model

Definition: Statements of code whose purpose is to create a statistical model from data.

| # | R Code | Descriptive Code | Additional Notes |
|---|--------|------------------|------------------|
| 1 | linearAnterior <- lm(PADataNoOutlier$Lipid ~ PADataNoOutlier$PSUA) | fit a linear model, uses $ to select columns from dataframe | lm() |
| 2 | plot(expAnterior) | visualizes model diagnostics for linear model | diagnostic plots |
| 3 | EarlyWeightAge <- ddply(Early, ~Age, summarise, meanWE=mean(Weight, na.rm = T)) | data summary (mean) of variable by groups of another variable, create a new dataframe | grouped summary statistics |
| 4 | Weight1 <- mean(RPMA2GrowthSub$Weight[RPMA2GrowthSub$Age == 1], na.rm = TRUE) | select column with $, filter rows using [] and relational statement (==), calculate data summary | calculating the mean |
| 5 | nmle <- function(P, t, y, N15_NO3_O){ yhat <- N15_NO3_O * (1 - exp(-P[1]*t)) - sum(dnorm(y, yhat, exp(P[2]), log = T)) } | user-defined function which takes multiple inputs, filters a vector using brackets, calculates values based on variables and function output | create function to estimate parameter |
| 6 | ktotEst <- mle.outD$estimate[1] | pulling off named function output using $, filtering elements using brackets | pull out MLE estimate |
| 7 | sigmaEst <- exp(mle.outD$estimate[2]) | create new variable, pull off named function output using $, filter elements using brackets | obtain point estimate |
| 8 | predictionD <- fracDenD*N15_NO3_O_D*(1-exp(-ktotEst*predictionTimesD)) | creating new object, using values of previously defined variables | obtain predictions |
| 9 | likelihoods <- apply(X = pMat, MARGIN = 1, FUN = nmle, t = timeD, y = obsD, N15_NO3_O = fracDenD*(N15_NO3_O_D)) | creating a new dataframe, applying function over rows of existing dataframe, declare additional function arguments, name all arguments | obtaining likelihood estimates |
| 10 | mlle <- -min(likelihoods) | create new variable, data summary (min) | obtaining minimum of likelihood estimates |

Showing 1 to 10 of 16 entries

Data Visualization

Definition: Statements of code whose purpose is to visualize relationships between variables

Sub-themes

  • scatterplot
  • adding lines to plot
  • differentiated colors
  • including a legend
  • changing plotting environment
  • modifying axis labels / plot titles

| # | R Code | Descriptive Code | Additional Notes |
|---|--------|------------------|------------------|
| 1 | with(PADataNoOutlier, plot(Lipid ~ PSUA, las = 1, col = ifelse(PADataNoOutlier$`Fork Length` < 280, "red", "black"))) | uses with() to attach dataframe, creates scatterplot, colors points based on conditional statement, rotates y-axis ticks | scatterplot, trend line, differing colors |
| 2 | abline(linearAnterior) | adds linear trendline | linear smoother |
| 3 | with(PADataNoOutlier, plot(Lipid ~ log(PSUA), las = 1, col = ifelse(PADataNoOutlier$`Fork Length` < 260, "red", "black"))) | uses with() to attach dataframe, creates scatterplot, colors points based on conditional statement, rotates y-axis ticks | scatterplot, trend line, differing colors |
| 4 | abline(expAnterior) | adds linear trendline | linear smoother |
| 5 | plot(EarlyLengthAge$meanLE ~ EarlyLengthAge$Age, las = 1, ylab = "Fork Length (mm)", xlab = "Age") | creates scatterplot, selects variables using $, specifies x- and y-axis labels, rotates y-axis ticks | scatterplot, axis labels, tick mark orientation |
| 6 | lines(EarlyLengthAge$meanLE ~ EarlyLengthAge$Age) | adds line segments between points, selects variables using $ | line segments |
| 7 | points(MidLengthAge$meanLM ~ MidLengthAge$Age, col = "red") | adds additional points to plot, selects variables using $, colors points red | points, colors |
| 8 | lines(MidLengthAge$meanLM ~ MidLengthAge$Age, col = "red") | adds line segments between points, selects variables using $, colors segments red | line segments, colors |
| 9 | legend(15, 600, legend = c("1998-2003", "2006-2017"), col = c("black", "red"), lty = 1:1, cex = 0.8) | specify position for legend, declare text to display, colors to display with text, types of lines to display, size of legend text (cex) | legend |
| 10 | plot(WeightAge$meanW ~ WeightAge$Age) | creates scatterplot, selects variables using $ | scatterplot |

Showing 1 to 10 of 11 entries

Data Wrangling

Definition: Statements of code whose purpose is to prepare a dataset for analysis and / or visualization

Sub-themes

  • selecting variables
  • filtering observations
  • mutating variables

| # | R Code | Descriptive Code | Additional Notes |
|---|--------|------------------|------------------|
| 1 | linearAnterior <- lm(PADataNoOutlier$Lipid ~ PADataNoOutlier$PSUA) | fit a linear model, uses $ to select columns from dataframe | selecting columns using $ |
| 2 | expAnterior <- lm(PADataNoOutlier$Lipid ~ log(PADataNoOutlier$PSUA)) | fit a linear model, uses $ to select columns from dataframe, uses function to mutate variable | selecting columns using $ |
| 3 | early <- subset(RPMA2Growth, StockYear < 2006) | filter data, relational statement, create new dataframe | filter based on values of variable |
| 4 | mid <- subset(RPMA2Growth, StockYear < 2014 & StockYear > 2003) | filter data, relational statement, joined by logical (&), create new dataframe | filter based on values of variable and logical statement (&) |
| 5 | RPMA2GrowthSub <- transform(RPMA2Growth, Age = as.integer(Age)) | mutate existing variable to integer, create new dataframe | mutate existing column, change data type |
| 6 | Early <- subset(RPMA2GrowthSub, StockYear < 2004) | filter data, relational statement, create new dataframe | filter based on values of variable |
| 7 | Mid <- subset(RPMA2GrowthSub, StockYear < 2018 & StockYear > 2005) | filter data, relational statement, joined by logical (&), create new dataframe | filter based on values of variable and logical statement (&) |
| 8 | plot(WeightAge$meanW ~ WeightAge$Age) | creates scatterplot, selects variables using $ | selecting columns using $ |
| 9 | plot(LengthAge$mean ~ LengthAge$Age) | creates scatterplot, selects variables using $ | selecting columns using $ |
| 10 | Weight1 <- mean(RPMA2GrowthSub$Weight[RPMA2GrowthSub$Age == 1], na.rm = TRUE) | select column with $, filter rows using [] and relational statement (==), calculate data summary | selecting columns using $, filtering rows with relational operator (==) |

Showing 1 to 10 of 41 entries

Data Structures

Definition: A statement of code which explicitly calls upon attributes of a data structure (e.g., dataframe, matrix, vector)

| # | R Code | Descriptive Code | Additional Notes |
|---|--------|------------------|------------------|
| 1 | WeightChange <- rep(NA, 9) | create variable | vector |
| 2 | legend(15, 600, legend = c("1998-2003", "2006-2017"), col = c("black", "red"), lty = 1:1, cex = 0.8) | specify position for legend, declare text to display, colors to display with text, types of lines to display, size of legend text (cex) | use of c() for legend text and colors |
| 3 | WeightChange | inspect dataframe | vector |
| 4 | Weight1 <- mean(RPMA2GrowthSub$Weight[RPMA2GrowthSub$Age == 1], na.rm = TRUE) | select column with $, filter rows using [] and relational statement (==), calculate data summary | vector, [] used to extract elements |
| 5 | Weight1 | inspect object | vector |
| 6 | Length1 <- mean(RPMA2GrowthSub$ForkLength[RPMA2GrowthSub$Age == 1], na.rm = TRUE) | select column with $, filter rows using [] and relational statement (==), calculate data summary | vector, [] used to extract elements |
| 7 | Weight2 <- mean(RPMA2GrowthSub$Weight[RPMA2GrowthSub$Age == 2], na.rm = TRUE) | select column with $, filter rows using [] and relational statement (==), calculate data summary | vector, [] used to extract elements |
| 8 | Length2 <- mean(RPMA2GrowthSub$ForkLength[RPMA2GrowthSub$Age == 2], na.rm = TRUE) | select column with $, filter rows using [] and relational statement (==), calculate data summary | vector, [] used to extract elements |
| 9 | Weight3 <- mean(RPMA2GrowthSub$Weight[RPMA2GrowthSub$Age == 3], na.rm = TRUE) | select column with $, filter rows using [] and relational statement (==), calculate data summary | vector, [] used to extract elements |
| 10 | Length3 <- mean(RPMA2GrowthSub$ForkLength[RPMA2GrowthSub$Age == 3], na.rm = TRUE) | select column with $, filter rows using [] and relational statement (==), calculate data summary | vector, [] used to extract elements |

Showing 1 to 10 of 56 entries

R Environment

Definition: A statement of code which calls on explicit aspects of the R environment

| # | R Code | Descriptive Code | Additional Notes |
|---|--------|------------------|------------------|
| 1 | linearAnterior <- lm(PADataNoOutlier$Lipid ~ PADataNoOutlier$PSUA) | fit a linear model, uses $ to select columns from dataframe | doesn't use data = argument |
| 2 | with(PADataNoOutlier, plot(Lipid ~ PSUA, las = 1, col = ifelse(PADataNoOutlier$`Fork Length` < 280, "red", "black"))) | uses with() to attach dataframe, creates scatterplot, colors points based on conditional statement, rotates y-axis ticks | with() |
| 3 | expAnterior <- lm(PADataNoOutlier$Lipid ~ log(PADataNoOutlier$PSUA)) | fit a linear model, uses $ to select columns from dataframe, uses function to mutate variable | doesn't use data = argument |
| 4 | with(PADataNoOutlier, plot(Lipid ~ log(PSUA), las = 1, col = ifelse(PADataNoOutlier$`Fork Length` < 260, "red", "black"))) | uses with() to attach dataframe, creates scatterplot, colors points based on conditional statement, rotates y-axis ticks | with() |
| 5 | library(plyr) | load package | load package |
| 6 | load("***REDACTED***/gas") | loading data, specifying full path to access data | full path to data |
| 7 | load("***REDACTED***/carboys") | loading data, specifying full path to access data | full path to data |
| 8 | timeD <- (subset(gas, gas$carboy == "D"))$days | create new object, filter rows, using subset(), relational statement (==), select column using $ | carboy variable can be called on without referencing gas dataframe |
| 9 | obsD <- subset(gas, gas$carboy == "D")$N15_N2_Ar | create a new object, filter rows, using subset(), relational statement (==), select column using $ | carboy variable can be called on without referencing gas dataframe |
| 10 | N15_NO3_O_D <- 40*((carboys[carboys$CarboyID == "D",]$EstN15NO3) + (0.7*RstN/(1 +RstN)))/(subset(gas, gas$carboy == "D")$Ar[1]) | filters rows (using brackets) and a relational statement (==), selects column (with $), then filters rows using subset() and a relational statement (==), selects column from filtered data and pulls out index of column with [] | carboy variable can be called on without referencing gas dataframe |

Showing 1 to 10 of 10 entries

Efficiency / Inefficiency

Definition: A statement of code which adheres to (efficient) or departs from (inefficient) the “don’t repeat yourself” principle

| # | R Code | Descriptive Code | Additional Notes |
|---|--------|------------------|------------------|
| 1 | EarlyWeightAge <- ddply(Early, ~Age, summarise, meanWE=mean(Weight, na.rm = T)) | data summary (mean) of variable by groups of another variable, create a new dataframe | repeated operations on same groups |
| 2 | EarlyLengthAge <- ddply(Early, ~Age, summarise, meanLE = mean(ForkLength, na.rm = T)) | data summary (mean) of variable by groups of another variable, create a new dataframe | repeated operations on same groups |
| 3 | MidLengthAge <- ddply(Mid, ~Age, summarise, meanLM = mean(ForkLength, na.rm = T)) | data summary (mean) of variable by groups of another variable, create a new dataframe | repeated operations on same groups |
| 4 | WeightChange <- rep(NA, 9) | create variable | packages loaded after use |
| 5 | library(plyr) | load package | package already loaded previously |
| 6 | Weight1 <- mean(RPMA2GrowthSub$Weight[RPMA2GrowthSub$Age == 1], na.rm = TRUE) | select column with $, filter rows using [] and relational statement (==), calculate data summary | repeated operation |
| 7 | Length1 <- mean(RPMA2GrowthSub$ForkLength[RPMA2GrowthSub$Age == 1], na.rm = TRUE) | select column with $, filter rows using [] and relational statement (==), calculate data summary | repeated operation |
| 8 | Weight2 <- mean(RPMA2GrowthSub$Weight[RPMA2GrowthSub$Age == 2], na.rm = TRUE) | select column with $, filter rows using [] and relational statement (==), calculate data summary | repeated operation |
| 9 | Length2 <- mean(RPMA2GrowthSub$ForkLength[RPMA2GrowthSub$Age == 2], na.rm = TRUE) | select column with $, filter rows using [] and relational statement (==), calculate data summary | repeated operation |
| 10 | Weight3 <- mean(RPMA2GrowthSub$Weight[RPMA2GrowthSub$Age == 3], na.rm = TRUE) | select column with $, filter rows using [] and relational statement (==), calculate data summary | repeated operation |

Showing 1 to 10 of 25 entries

Workflow

Definition: A statement of code which facilitates a smooth execution of a working process

| # | R Code | Descriptive Code | Additional Notes |
|---|--------|------------------|------------------|
| 1 | str(PADataNoOutlier) | inspect data | inspect object |
| 2 | str(PADataNoOutlierMultMeasure) | inspect data | inspect object |
| 3 | names(RPMA2Growth) | inspect data | inspect object |
| 4 | #upper anterior measurement Linear model | code comment | comment on actions taken in code below |
| 5 | summary(linearAnterior) | views summary of lm object | inspect object |
| 6 | linearAnterior | inspects lm object | inspect object |
| 7 | #Exponential function | code comment | code comment on change in model |
| 8 | summary (expAnterior) | views summary of lm object | inspect object |
| 9 | expAnterior | inspects lm object | inspect object |
| 10 | #Tanner's code/help | comment on where the code was acquired | code comment on where code came from |

Showing 1 to 10 of 28 entries
© Copyright 2023, Allison Theobold