Step 3: Discovering Emergent Themes

In the next stage of coding, often called “pattern coding” (Miles, Huberman, and Saldaña 2020), we group the descriptive codes created in the previous phase into a smaller number of categories or themes. Themes or categories “are broad units of information that consist of several codes aggregated to form a common idea” (Creswell and Poth 2018, 194). A category can be thought of as a kind of meta-code.*

*For quantitative researchers, this process can be thought of as an analog to cluster-oriented or factor-oriented approaches in statistical analysis.

Miles, M. B., A. M. Huberman, and J. Saldaña. 2020. Qualitative Data Analysis. Thousand Oaks, CA: Sage.

Creswell, J. W., and C. N. Poth. 2018. Qualitative Inquiry & Research Design. Thousand Oaks, CA: Sage.

Merriam, S. B., and E. J. Tisdell. 2016. Qualitative Research. San Francisco, CA: John Wiley & Sons.

Categories should span multiple previously identified codes. These categories “capture some recurring pattern that cuts across your data” (Merriam and Tisdell 2016, 207). Merriam and Tisdell (2016) suggest that discovering themes from codes feels like repeatedly shifting one’s view of a forest: from the “trees” (codes) to the “forest” (themes) and back to the trees.

Discovering Themes

As I looked over my descriptive codes, I asked myself what these codes tell me about the nature of the data science skills students used in their projects. Some themes immediately jumped out at me, whereas others took a bit of time to mull over. I’ll walk you through my process below.

“Obvious” Themes

There were two themes I expected to see due to the nature of the project and the requirements stipulated by the professor. For their project, students were expected to (1) use an analysis strategy learned in the course and (2) create a visualization to accompany any analysis and resulting discussion. Thus, I expected themes of “Data Model” and “Data Visualization” to emerge from the data.

From my own experiences, I also expected that students would need to perform some aspect of data wrangling to prepare their data for analysis. The data students used for their project were from their own research, so, although I knew data wrangling would play some role, I was unsure what type of tasks might appear in the codes.

Emergent Themes

While I was looking over the data wrangling tasks students performed in their projects, I noticed the techniques called upon specific attributes of different data structures (e.g., dataframe, vector, matrix). The implementation of some tasks was fairly uniform (e.g., selecting a variable from a dataframe with the $ operator), whereas other tasks were highly variable. Data filtering was sometimes done with the subset() function, which requires little explicit knowledge of data structures. At other times, however, filtering was carried out with the [] extraction operator, a technique which requires an understanding of how extraction differs across data structures.
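To make the contrast between the two filtering styles concrete, here is a minimal sketch. The dataframe `growth` and its columns `Age` and `Weight` are made up for illustration and are not the students' data. Both approaches keep the same rows, but the bracket version needs a logical row index, the trailing comma, and an awareness that extraction from a vector works differently from extraction from a dataframe.

```r
# Hypothetical toy data, invented for illustration only
growth <- data.frame(
  Age    = c(1, 1, 2, 3),
  Weight = c(10, 12, 20, NA)
)

# subset(): columns are referred to by name inside the call
young_subset <- subset(growth, Age < 3)

# [] on a dataframe: logical row index, then a comma, then an empty column index
young_bracket <- growth[growth$Age < 3, ]

# [] on a vector: a single index and no comma
young_weights <- growth$Weight[growth$Age < 3]
```

The last two statements differ only in whether a comma is needed, which is exactly the kind of detail that depends on knowing the structure being indexed.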

I also noticed while looking at the R code for the “Data Model” and “Data Visualization” themes that certain statements of code included some knowledge (or lack thereof) regarding the R Environment. The most obvious statement that evoked this theme used with() to temporarily attach a dataframe. There were, however, other statements that also fit into this theme, such as function arguments being bypassed, sourcing in an external R script, loading in datasets, and loading in packages.
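As a small sketch of what with() does, consider the made-up dataframe below (`fish` and its columns `Lipid` and `Length` are hypothetical, not the students' data). Inside with(), column names can be used directly, so a $ reference back to the same dataframe is not needed, although it still works.

```r
# Hypothetical toy data, invented for illustration only
fish <- data.frame(
  Lipid  = c(2.1, 3.4, 5.0),
  Length = c(250, 270, 300)
)

# with() evaluates the expression using the dataframe's columns as variables
mean_direct <- with(fish, mean(Lipid[Length < 280]))

# Same result, but with a redundant $ reference inside with()
mean_redundant <- with(fish, mean(fish$Lipid[fish$Length < 280]))
```

Both statements return the same value; the second simply does not take advantage of what with() provides.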

Within the themes of “Data Model” and “Data Wrangling,” I uncovered an additional theme which speaks to the efficiency of a statement of code. The notion of efficiency came to me from the “don’t repeat yourself” principle (Wilson et al. 2014), which recommends that scientists modularize their code rather than copying and pasting it, and re-use code instead of rewriting it (p. 2). Thus, I considered code which adhered to these practices “efficient” and code which did not adhere to these practices “inefficient.”
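The sketch below illustrates the distinction with made-up data (`growth`, `Age`, and `Weight` are hypothetical). The first three statements repeat the same operation once per age; the last statement computes every group mean in one call. Using tapply() here is my own choice for illustration, not something prescribed by Wilson et al. or used by the students.

```r
# Hypothetical toy data, invented for illustration only
growth <- data.frame(
  Age    = c(1, 1, 2, 2, 3),
  Weight = c(10, 12, 20, NA, 31)
)

# Repetitive version: one nearly identical line per age
weight1 <- mean(growth$Weight[growth$Age == 1], na.rm = TRUE)
weight2 <- mean(growth$Weight[growth$Age == 2], na.rm = TRUE)
weight3 <- mean(growth$Weight[growth$Age == 3], na.rm = TRUE)

# One way to avoid the repetition: all group means in a single call
weight_by_age <- tapply(growth$Weight, growth$Age, mean, na.rm = TRUE)
```

The repetitive version grows by one line for every additional age, whereas the single call does not change as groups are added.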

Wilson, Greg, D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, et al. 2014. “Best Practices for Scientific Computing.” PLOS Biology 12 (1): e1001745.

The final theme I discovered consists of statements of code whose purpose is more to support a student’s workflow than anything else. Code comments were my first indication of this theme: students used them to create sections of code or to flag what was happening in a particular line or lines of code. I expanded this theme to include statements of code which inspect some characteristic of an object (e.g., structure of a dataframe, names of a dataframe, summary of a linear model).

Assigning Descriptive Codes to Themes

For each of the themes outlined above, the associated “atoms” (statements of code) are listed below. Keep in mind that one statement can apply to more than one theme! For example, the code

linearAnterior <- lm(PADataNoOutlier$Lipid ~ PADataNoOutlier$PSUA)

applies to three themes. First and foremost, this code uses lm() to fit a linear regression model to the data (data model). Second, in order to fit the data model, the student uses data wrangling to select the variables of interest (PADataNoOutlier$Lipid, PADataNoOutlier$PSUA). Finally, this code does not make use of the data = argument built into lm(), which implies a lack of understanding of the function and thus the R environment.
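For comparison, the sketch below fits the same kind of model both ways on a made-up dataframe (`fish`, with hypothetical columns `PSUA` and `Lipid`; the values are invented). It is meant only to show what using the data = argument looks like, not to reproduce the student's analysis.

```r
# Hypothetical toy data, invented for illustration only
fish <- data.frame(
  PSUA  = c(1, 2, 3, 4, 5),
  Lipid = c(2.0, 4.1, 5.9, 8.2, 9.8)
)

# Style of the statement above: columns selected with $ inside the formula
fit_dollar <- lm(fish$Lipid ~ fish$PSUA)

# Alternative using the data argument: the formula names the columns only
fit_data <- lm(Lipid ~ PSUA, data = fish)
```

The two fits give the same estimates. One visible difference is the labeling: the slope is named `PSUA` in `fit_data` and `fish$PSUA` in `fit_dollar`.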

Data Model

Definition: Statements of code whose purpose is to create a statistical model from data.

	R Code	Descriptive Code	Additional Notes
1	linearAnterior <- lm(PADataNoOutlier$Lipid ~ PADataNoOutlier$PSUA)	fit a linear model, uses $ to select columns from dataframe	lm()
2	plot(expAnterior)	visualizes model diagnostics for linear model	diagnostic plots
3	EarlyWeightAge <- ddply(Early, ~Age, summarise, meanWE=mean(Weight, na.rm = T))	data summary (mean) of a variable by groups of another variable, create a new dataframe	grouped summary statistics
4	Weight1 <- mean(RPMA2GrowthSub$Weight[RPMA2GrowthSub$Age == 1], na.rm = TRUE)	select column with $, filter rows using [] and relational statement (==), calculate data summary	calculating the mean
5	nmle <- function(P, t, y, N15_NO3_O){ yhat <- N15_NO3_O * (1 - exp(-P[1]*t)); -sum(dnorm(y, yhat, exp(P[2]), log = T)) }	user-defined function which takes multiple inputs, filters a vector using brackets, calculates values based on variables and function output	create function to estimate parameter
6	ktotEst <- mle.outD$estimate[1]	pulling off named function output using $, filtering elements using brackets	pull out MLE estimate
7	sigmaEst <- exp(mle.outD$estimate[2])	create new variable, pull off named function output using $, filter elements using brackets	obtain point estimate
8	predictionD <- fracDenD*N15_NO3_O_D*(1-exp(-ktotEst*predictionTimesD))	creating new object, using values of previously defined variables	obtain predictions
9	likelihoods <- apply(X = pMat, MARGIN = 1, FUN = nmle, t = timeD, y = obsD, N15_NO3_O = fracDenD*(N15_NO3_O_D))	creating a new dataframe, applying function over rows of existing dataframe, declare additional function arguments, name all arguments	obtaining likelihood estimates
10	mlle <- -min(likelihoods)	create new variable, data summary (min)	obtaining minimum of likelihood estimates

Showing 1 to 10 of 16 entries

Data Visualization

Definition: Statements of code whose purpose is to visualize relationships between variables

Sub-themes

scatterplot
adding lines to plot
differentiated colors
including a legend
changing plotting environment
modifying axis labels / plot titles

	R Code	Descriptive Code	Additional Notes
1	with(PADataNoOutlier, plot(Lipid ~ PSUA, las = 1, col = ifelse(PADataNoOutlier$`Fork Length` < 280, "red", "black")))	uses with() to attach dataframe, creates scatterplot, colors points based on conditional statement, rotates y-axis ticks	scatterplot, trend line, differing colors
2	abline(linearAnterior)	adds linear trendline	linear smoother
3	with(PADataNoOutlier, plot(Lipid ~ log(PSUA), las = 1, col = ifelse(PADataNoOutlier$`Fork Length` < 260, "red", "black")))	uses with() to attach dataframe, creates scatterplot, colors points based on conditional statement, rotates y-axis ticks	scatterplot, trend line, differing colors
4	abline(expAnterior)	adds linear trendline	linear smoother
5	plot(EarlyLengthAge$meanLE ~ EarlyLengthAge$Age, las = 1, ylab = "Fork Length (mm)", xlab = "Age")	creates scatterplot, selects variables using $, specifies x- and y-axis labels, rotates y-axis ticks	scatterplot, axis labels, tick mark orientation
6	lines(EarlyLengthAge$meanLE ~ EarlyLengthAge$Age)	adds line segments between points, selects variables using $	line segments
7	points(MidLengthAge$meanLM ~ MidLengthAge$Age, col = "red")	adds additional points to plot, selects variables using $, colors points red	points, colors
8	lines(MidLengthAge$meanLM ~ MidLengthAge$Age, col = "red")	adds line segments between points, selects variables using $, colors segments red	line segments, colors
9	legend(15, 600, legend = c("1998-2003", "2006-2017"), col = c("black", "red"), lty = 1:1, cex = 0.8)	specify position for legend, declare text to display, colors to display with text, types of lines to display, thickness of lines displayed	legend
10	plot(WeightAge$meanW ~ WeightAge$Age)	creates scatterplot, selects variables using $	scatterplot

Showing 1 to 10 of 11 entries

Data Wrangling

Definition: Statements of code whose purpose is to prepare a dataset for analysis and / or visualization

Sub-themes

selecting variables
filtering observations
mutating variables

	R Code	Descriptive Code	Additional Notes
1	linearAnterior <- lm(PADataNoOutlier$Lipid ~ PADataNoOutlier$PSUA)	fit a linear model, uses $ to select columns from dataframe	selecting columns using $
2	expAnterior <- lm(PADataNoOutlier$Lipid ~ log(PADataNoOutlier$PSUA))	fit a linear model, uses $ to select columns from dataframe, uses function to mutate variable	selecting columns using $
3	early <- subset(RPMA2Growth, StockYear < 2006)	filter data, relational statement, create new dataframe	filter based on values of variable
4	mid <- subset(RPMA2Growth, StockYear < 2014 & StockYear > 2003)	filter data, relational statement, joined by logical (&), create new dataframe	filter based on values of variable and logical statement (&)
5	RPMA2GrowthSub <- transform(RPMA2Growth, Age = as.integer(Age))	mutate existing variable to integer, create new dataframe	mutate existing column, change data type
6	Early <- subset(RPMA2GrowthSub, StockYear < 2004)	filter data, relational statement, create new dataframe	filter based on values of variable
7	Mid <- subset(RPMA2GrowthSub, StockYear < 2018 & StockYear > 2005)	filter data, relational statement, joined by logical (&), create new dataframe	filter based on values of variable and logical statement (&)
8	plot(WeightAge$meanW ~ WeightAge$Age)	creates scatterplot, selects variables using $	selecting columns using $
9	plot(LengthAge$mean ~ LengthAge$Age)	creates scatterplot, selects variables using $	selecting columns using $
10	Weight1 <- mean(RPMA2GrowthSub$Weight[RPMA2GrowthSub$Age == 1], na.rm = TRUE)	select column with $, filter rows using [] and relational statement (==), calculate data summary	selecting columns using $, filtering rows with relational operator (==)

Showing 1 to 10 of 41 entries

Data Structures

Definition: A statement of code which explicitly calls upon attributes of a data structure (e.g., dataframe, matrix, vector)

	R Code	Descriptive Code	Additional Notes
1	WeightChange <- rep(NA, 9)	create variable	vector
2	legend(15, 600, legend = c("1998-2003", "2006-2017"), col = c("black", "red"), lty = 1:1, cex = 0.8)	specify position for legend, declare text to display, colors to display with text, types of lines to display, thickness of lines displayed	use of c() for legend text and colors
3	WeightChange	inspect dataframe	vector
4	Weight1 <- mean(RPMA2GrowthSub$Weight[RPMA2GrowthSub$Age == 1], na.rm = TRUE)	select column with $, filter rows using [] and relational statement (==), calculate data summary	vector, [] used to extract elements
5	Weight1	inspect object	vector
6	Length1 <- mean(RPMA2GrowthSub$ForkLength[RPMA2GrowthSub$Age == 1], na.rm = TRUE)	select column with $, filter rows using [] and relational statement (==), calculate data summary	vector, [] used to extract elements
7	Weight2 <- mean(RPMA2GrowthSub$Weight[RPMA2GrowthSub$Age == 2], na.rm = TRUE)	select column with $, filter rows using [] and relational statement (==), calculate data summary	vector, [] used to extract elements
8	Length2 <- mean(RPMA2GrowthSub$ForkLength[RPMA2GrowthSub$Age == 2], na.rm = TRUE)	select column with $, filter rows using [] and relational statement (==), calculate data summary	vector, [] used to extract elements
9	Weight3 <- mean(RPMA2GrowthSub$Weight[RPMA2GrowthSub$Age == 3], na.rm = TRUE)	select column with $, filter rows using [] and relational statement (==), calculate data summary	vector, [] used to extract elements
10	Length3 <- mean(RPMA2GrowthSub$ForkLength[RPMA2GrowthSub$Age == 3], na.rm = TRUE)	select column with $, filter rows using [] and relational statement (==), calculate data summary	vector, [] used to extract elements

Showing 1 to 10 of 56 entries

R Environment

Definition: A statement of code which calls on explicit aspects of the R environment

	R Code	Descriptive Code	Additional Notes
1	linearAnterior <- lm(PADataNoOutlier$Lipid ~ PADataNoOutlier$PSUA)	fit a linear model, uses $ to select columns from dataframe	doesn't use data = argument
2	with(PADataNoOutlier, plot(Lipid ~ PSUA, las = 1, col = ifelse(PADataNoOutlier$`Fork Length` < 280, "red", "black")))	uses with() to attach dataframe, creates scatterplot, colors points based on conditional statement, rotates y-axis ticks	with()
3	expAnterior <- lm(PADataNoOutlier$Lipid ~ log(PADataNoOutlier$PSUA))	fit a linear model, uses $ to select columns from dataframe, uses function to mutate variable	doesn't use data = argument
4	with(PADataNoOutlier, plot(Lipid ~ log(PSUA), las = 1, col = ifelse(PADataNoOutlier$`Fork Length` < 260, "red", "black")))	uses with() to attach dataframe, creates scatterplot, colors points based on conditional statement, rotates y-axis ticks	with()
5	library(plyr)	load package	load package
6	load("*REDACTED*/gas")	loading data, specifying full path to access data	full path to data
7	load("*REDACTED*/carboys")	loading data, specifying full path to access data	full path to data
8	timeD <- (subset(gas, gas$carboy == "D"))$days	create new object, filter rows, using subset(), relational statement (==), select column using $	carboy variable can be called on without referencing gas dataframe
9	obsD <- subset(gas, gas$carboy == "D")$N15_N2_Ar	create a new object, filter rows, using subset(), relational statement (==), select column using $	carboy variable can be called on without referencing gas dataframe
10	N15_NO3_O_D <- 40*((carboys[carboys$CarboyID == "D",]$EstN15NO3) + (0.7*RstN/(1 +RstN)))/(subset(gas, gas$carboy == "D")$Ar[1])	filters rows (using brackets) and a relational statement (==), selects column (with $), then filters rows using subset() and a relational statement (==), selects column from filtered data and pulls out index of column with []	carboy variable can be called on without referencing gas dataframe

Showing 1 to 10 of 10 entries

Efficiency / Inefficiency

Definition: A statement of code which adheres to (efficient) or violates (inefficient) the “don’t repeat yourself” principle

	R Code	Descriptive Code	Additional Notes
1	EarlyWeightAge <- ddply(Early, ~Age, summarise, meanWE=mean(Weight, na.rm = T))	data summary (mean) of a variable by groups of another variable, create a new dataframe	repeated operations on same groups
2	EarlyLengthAge <- ddply(Early, ~Age, summarise, meanLE = mean(ForkLength, na.rm = T))	data summary (mean) of a variable by groups of another variable, create a new dataframe	repeated operations on same groups
3	MidLengthAge <- ddply(Mid, ~Age, summarise, meanLM = mean(ForkLength, na.rm = T))	data summary (mean) of a variable by groups of another variable, create a new dataframe	repeated operations on same groups
4	WeightChange <- rep(NA, 9)	create variable	packages loaded after use
5	library(plyr)	load package	package already loaded previously
6	Weight1 <- mean(RPMA2GrowthSub$Weight[RPMA2GrowthSub$Age == 1], na.rm = TRUE)	select column with $, filter rows using [] and relational statement (==), calculate data summary	repeated operation
7	Length1 <- mean(RPMA2GrowthSub$ForkLength[RPMA2GrowthSub$Age == 1], na.rm = TRUE)	select column with $, filter rows using [] and relational statement (==), calculate data summary	repeated operation
8	Weight2 <- mean(RPMA2GrowthSub$Weight[RPMA2GrowthSub$Age == 2], na.rm = TRUE)	select column with $, filter rows using [] and relational statement (==), calculate data summary	repeated operation
9	Length2 <- mean(RPMA2GrowthSub$ForkLength[RPMA2GrowthSub$Age == 2], na.rm = TRUE)	select column with $, filter rows using [] and relational statement (==), calculate data summary	repeated operation
10	Weight3 <- mean(RPMA2GrowthSub$Weight[RPMA2GrowthSub$Age == 3], na.rm = TRUE)	select column with $, filter rows using [] and relational statement (==), calculate data summary	repeated operation

Showing 1 to 10 of 25 entries

Workflow

Definition: A statement of code which facilitates a smooth execution of a working process

	R Code	Descriptive Code	Additional Notes
1	str(PADataNoOutlier)	inspect data	inspect object
2	str(PADataNoOutlierMultMeasure)	inspect data	inspect object
3	names(RPMA2Growth)	inspect data	inspect object
4	#upper anterior measurement Linear model	code comment	comment on actions taken in code below
5	summary(linearAnterior)	views summary of lm object	inspect object
6	linearAnterior	inspects lm object	inspect object
7	#Exponential function	code comment	code comment on change in model
8	summary (expAnterior)	views summary of lm object	inspect object
9	expAnterior	inspects lm object	inspect object
10	#Tanner's code/help	comment on where the code was acquired	code comment on where code came from

Showing 1 to 10 of 28 entries
