Many researchers, students, and consumers of empirical research have a poor understanding of probability distributions, calculus, and other key concepts necessary for mathematical statistics. At the same time, even researchers with PhDs in quantitative fields can have difficulty understanding and interpreting concepts like p-values and confidence intervals. Over time, I’ve found that the best way to help people think through their quantitative problems and understand the logic of statistical inference is to focus on the data-generating process as the concept of interest.
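To make that concrete, here is a small illustrative simulation (mine, not from any particular course; the parameter values are arbitrary) in which the data-generating process is written down explicitly. The familiar 95% coverage of a confidence interval then shows up as a property of repeated draws from that process, not of any single sample:

```r
# Sketch: simulate a simple data-generating process and check how often
# a 95% confidence interval for the mean covers the true parameter.
set.seed(42)

true_mean <- 2        # parameter of the assumed DGP
n_obs     <- 50       # sample size per simulated study
n_sims    <- 5000     # number of simulated studies

covered <- replicate(n_sims, {
  y  <- rnorm(n_obs, mean = true_mean, sd = 1)  # draw data from the DGP
  ci <- t.test(y)$conf.int                      # 95% CI for the mean
  ci[1] <= true_mean && true_mean <= ci[2]
})

mean(covered)  # proportion of intervals covering true_mean; close to 0.95
```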
Social scientists rarely provide explicit justification for choices that directly affect how well their research designs can provide evidence for or against their hypotheses. While recent developments, such as pre-registration plans, encourage researchers to think more carefully about whether their studies can precisely identify the sign and magnitude of relationships between theoretical constructs, it remains the case that few researchers justify the statistical power of their designs.
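Such a justification need not be elaborate. As a hedged sketch (the effect size and target power below are illustrative, not recommendations), base R’s power.t.test() gives the kind of calculation a design section could report:

```r
# Sample size needed to detect a standardized difference of 0.3
# with 80% power at the conventional 5% significance level
power.t.test(delta = 0.3, sd = 1, sig.level = 0.05, power = 0.80)

# Conversely, the power achieved by a design fixed at 100 per group
power.t.test(n = 100, delta = 0.3, sd = 1, sig.level = 0.05)$power
```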
As a Fellow for the Program for Advanced Research in the Social Sciences, I have the opportunity to teach students, faculty, and staff at Duke how to develop research designs, choose quantitative methods, and implement those methods with statistical software.
Recently, a student asked me for help calculating average score scales from multiple survey items. This was a good opportunity to show that there are multiple approaches to any programming problem, each with different trade-offs in computational cost, verbosity, generality, and the opportunity for mistakes, so I put together a short gist I thought I’d share.
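A sketch in the spirit of that gist (the data and column names here are made up) compares a few of those approaches:

```r
# Toy survey data with one missing response
dat <- data.frame(item1 = c(4, 2, NA),
                  item2 = c(5, 3, 4),
                  item3 = c(3, 3, 5))
items <- c("item1", "item2", "item3")

# 1. rowMeans(): fast and concise, but you must remember na.rm yourself
dat$scale_a <- rowMeans(dat[, items], na.rm = TRUE)

# 2. apply(): more general (any row-wise function), a bit slower and wordier
dat$scale_b <- apply(dat[, items], 1, mean, na.rm = TRUE)

# 3. Arithmetic by hand: transparent, but easy to mistype an item name,
#    and it returns NA whenever any item is missing
dat$scale_c <- (dat$item1 + dat$item2 + dat$item3) / length(items)
```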
RStudio is a popular, well-supported IDE for R programmers. While a number of text editors with steeper learning curves and more direct interaction with the command line may offer greater power and flexibility, RStudio lets users complete common tasks with minimal up-front investment.
One reason to use RStudio is the ease with which researchers can embrace literate programming to create dynamic documents. Dynamic documents are attractive because they promise reductions in human error and time costs for researchers.
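As a rough illustration (the file contents are mine, not from any particular post), an R Markdown source file mixes prose with R code chunks, and the reported numbers are recomputed every time the document is knit rather than copied in by hand:

````markdown
---
title: "A dynamic document"
output: html_document
---

```{r summary-stats}
dat     <- mtcars              # stand-in data set
avg_mpg <- mean(dat$mpg)       # recomputed each time the file is knit
```

The average fuel economy in the sample is `r round(avg_mpg, 1)` miles per gallon.
````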
Data analysts often wish to examine subsets of data or otherwise manipulate data using indicators of missingness. Luckily, R features a number of different ways of designating a value as missing. Unluckily, the way missing values interact with popular functions is not always intuitive, which can produce unintended results.
I wrote a demonstration of this a while back. It showcases behaviors of missing values that many R programmers likely expect, along with some surprising results.
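A few lines in the same spirit (a sketch, not the original gist) give the flavor:

```r
x <- c(1, 2, NA, 4)

# Expected: NA propagates through arithmetic and comparisons
mean(x)                 # NA
mean(x, na.rm = TRUE)   # 2.333...
NA == NA                # NA, not TRUE -- use is.na() instead
is.na(x)                # FALSE FALSE  TRUE FALSE

# Surprising: NA in matching and subsetting
NA %in% x               # TRUE -- %in% never returns NA
x[x > 1]                # 2 NA 4 -- the NA comparison keeps a (missing) element
subset(data.frame(x), x > 1)  # silently drops the NA row, unlike [ ]
```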