Chapter 4 Working with Data

“When you have mastered numbers, you will no longer be reading numbers, any more than you read words when reading books. You will be reading meanings.”
— W.E.B. Du Bois

4.1 Overview

You have learned about developing a research question based on existing data by using a code book to guide you. You refined your research question by conducting a literature review based on primary-source journal articles. Now you are ready to use statistical software to work with the data. Exploratory data analysis is the process of converting raw data into a more useful form so that we can begin to discover important features and patterns in the data.

4.2 Lesson

Learn to examine frequency distributions for each of the variables you have selected. Determine what values a variable takes and how often it takes those values. Write the code or take the steps required to generate frequency distributions using your statistical software program. As you engage with your data, consider whether you want to create a subset of the larger sample in order to answer your question. Click on a video lesson below.


SAS, R, Python, Stata, SPSS
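
As a rough illustration of the two ideas in this lesson, the Python sketch below counts how often each value of a variable occurs and then builds a subset of the sample. The data frame, the variable names AGE and SMOKER, their coding, and the age range are all made up for illustration; substitute the variables from your own code book.

```python
import pandas

# Hypothetical toy data standing in for a real data set
myData = pandas.DataFrame({
    'AGE':    [19, 23, 31, 24, 45, 22, 18, 25],
    'SMOKER': [1, 2, 1, 1, 2, 1, 2, 1],   # assumed coding: 1 = yes, 2 = no
})

# Frequency distribution: each value of SMOKER and how often it occurs
counts = myData['SMOKER'].value_counts(sort=False, dropna=False)
print(counts)

# Subset: keep only respondents aged 18 to 25 (an arbitrary example range)
sub1 = myData[(myData['AGE'] >= 18) & (myData['AGE'] <= 25)].copy()
print(len(sub1))
```

The .copy() call makes the subset an independent data frame, so later changes to it do not touch the full sample.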


4.3 Syntax

4.3.1 loading a data set

SAS

libname mydata "C:/foldername-including-path";

data new;
    set mydata.filename;
run;

R

load("filename-including-path.Rdata")
myData <- name-of-object-loaded-in-your-workspace

Python

import pandas
import numpy
myData = pandas.read_csv('nesarc_pds.csv')
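
After loading, it can help to confirm that the data arrived as expected. The sketch below reads a tiny made-up CSV from an in-memory string, which stands in for a file on disk such as nesarc_pds.csv, and prints the number of observations and variables. The column names and values are hypothetical.

```python
import io
import pandas

# Made-up CSV text standing in for a real file on disk
csv_text = "unique_id,VAR1,VAR2\n3,1,10\n1,2,20\n2,1,30\n"

# read_csv accepts a file path or a file-like object such as StringIO
myData = pandas.read_csv(io.StringIO(csv_text))

print(len(myData))          # number of observations (rows)
print(len(myData.columns))  # number of variables (columns)
print(myData.head())        # first few rows
```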

Stata

use "C:\path-and-folder-name\filename", clear

SPSS

GET FILE='C:\path-and-folder-name\filename.sav'.

4.3.2 sorting data

SAS

proc sort;
    by unique_id;
run;

R

myData <- myData[order(myData$unique_id, decreasing = FALSE),]

Python

myData = myData.sort_values(by='unique_id')
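
A small sketch of how sort_values behaves, using a made-up three-row data frame: the method returns a new, sorted data frame rather than changing the original, which is why the line above assigns the result back to myData. The names sortedData and descData are illustrative only.

```python
import pandas

# Hypothetical toy data, deliberately out of order
myData = pandas.DataFrame({'unique_id': [3, 1, 2], 'VAR1': [1, 2, 1]})

# sort_values returns a new, sorted data frame; myData itself is unchanged
sortedData = myData.sort_values(by='unique_id')

# Descending order, analogous to decreasing = TRUE in the R example
descData = myData.sort_values(by='unique_id', ascending=False)

print(sortedData)
print(descData)
```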

Stata

sort unique_id

SPSS

SORT CASES BY unique_id. 

4.3.3 displaying frequency tables

SAS

proc freq;
    tables VAR1 VAR2 VAR3;
run;

R

library(descr)
freq(as.ordered(myData$VAR1))
freq(as.ordered(myData$VAR2))
freq(as.ordered(myData$VAR3))

Python

c1 = myData['VAR1'].value_counts(sort=False, dropna=False)
print(c1)
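
The Python example above tabulates a single variable. One possible way to cover several variables, and to report percentages alongside counts, is sketched below. The toy data and its values are made up; normalize=True is a standard value_counts option that returns proportions instead of counts, and dropna=False keeps missing values in the table.

```python
import pandas

# Hypothetical toy data; None marks a missing response
myData = pandas.DataFrame({
    'VAR1': [1, 2, 2, 1, 2, None],
    'VAR2': [0, 0, 1, 1, 1, 1],
})

for name in ['VAR1', 'VAR2']:
    # Counts of each value, including missing values
    counts = myData[name].value_counts(sort=False, dropna=False)
    # Proportions of each value (these sum to 1)
    percents = myData[name].value_counts(sort=False, dropna=False, normalize=True)
    print(name)
    print(counts)
    print(percents)
```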

Stata

tab1 VAR1 VAR2 VAR3

SPSS

FREQUENCIES VARIABLES=VAR1 VAR2 VAR3
  /ORDER=ANALYSIS.

4.4 Assignment

Submit your program and the corresponding results that display at least three of your variables as frequency tables. Write a few sentences that describe what you see in each frequency table.