Ryan Mcvay/The Image Bank/Getty Images
14
Quantitative Data Analysis
What You’ll Learn in This Chapter
Often, social data are converted to numerical form for statistical analyses.
In this chapter, we’ll begin with the process of quantifying data, then turn
to analysis. Quantitative analysis may be descriptive or explanatory; it
may involve one, two, or several variables. We begin our examination of
quantitative analyses with some simple but powerful ways of manipulating
data in order to attain research conclusions.
9781111222697, The Basics of Social Research, Earl Babbie – © Cengage Learning
What do you think?
In Chapter 13, we saw several inherent shortcomings in quantitative data.
These shortcomings centered primarily on standardization and
superficiality in the face of a social reality that is varied and deep.
Can anything meaningful be learned from data that sacrifice meaningful
detail in order to permit numerical manipulations?
See the “What do you think? Revisited” box toward the end of the chapter.
Earl Babbie
In this chapter . . .
Introduction
Quantification of Data
Developing Code Categories
Codebook Construction
Data Entry
Univariate Analysis
Distributions
Central Tendency
Dispersion
Continuous and Discrete Variables
Detail versus Manageability
Subgroup Comparisons
“Collapsing” Response Categories
Handling “Don’t Knows”
Numerical Descriptions in Qualitative Research
Bivariate Analysis
Percentaging a Table
Constructing and Reading Bivariate Tables
Introduction to Multivariate Analysis
Sociological Diagnostics
Ethics and Quantitative Data Analysis
INTRODUCTION
In Chapter 13, we saw some of the logic and techniques by which social
researchers analyze the qualitative data they’ve collected. This chapter
will examine quantitative analysis, or the techniques by which
researchers convert data to a numerical form and subject it to
statistical analyses.
To begin, we’ll look at quantification, the process of converting data to
a numerical format. This involves converting social science data into a
machine-readable form, one that can be read and manipulated by computers
and similar machines used in quantitative analysis.
The rest of the chapter will present the logic and some of the techniques
of quantitative data analysis, starting with the simplest case,
univariate analysis, which involves one variable, then discussing
bivariate analysis, which involves two variables. We’ll move on to a
brief introduction to multivariate analysis, or the examination of
several variables simultaneously, such as age, education, and prejudice,
and then we’ll move to a discussion of sociological diagnostics. Finally,
we’ll look at the ethics of quantitative data analysis.
Before we can do any sort of analysis, we need to quantify our data.
Let’s turn now to the basic steps involved in converting data into
machine-readable forms amenable to computer processing and analysis.
quantitative analysis The numerical representation and
manipulation of observations for the purpose of describing
and explaining the phenomena that those observations
reflect.
Aaron Babbie
Some students take to statistics more readily than do others.
QUANTIFICATION OF DATA
Today, quantitative analysis is almost always done
by computer programs such as SPSS and MicroCase. For those programs to work their magic,
they must be able to read the data you’ve collected
in your research. If you’ve conducted a survey, for
example, some of your data are inherently numerical: age or income, for instance. Whereas the
writing and check marks on a questionnaire are
qualitative in nature, a scribbled age is easily converted to quantitative data.
Other data are also easily quantified: Transforming male and female into “1” and “2” is hardly
rocket science. Researchers can also easily assign numerical representations to such variables
as religious affiliation, political party, and region
of the country.
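As a minimal sketch of this kind of quantification, the following Python
fragment maps attribute labels to numeric codes. Only the assignment of
“1” to male and “2” to female comes from the text; the function name and
the sample responses are hypothetical and used for illustration only.

```python
# Code assignments for the variable "sex"; the 1/2 coding follows the text.
SEX_CODES = {"male": 1, "female": 2}


def encode(responses, codes):
    """Translate attribute labels into their numeric codes.

    Labels are lowercased and stripped so that "Female " and "female"
    receive the same code.
    """
    return [codes[r.strip().lower()] for r in responses]


# Invented sample responses, for illustration only.
raw = ["Female", "male", "female "]
coded = encode(raw, SEX_CODES)
print(coded)  # [2, 1, 2]
```

The same pattern would work for religious affiliation, political party,
or region, each with its own table of codes.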
Some data are more challenging, however. If a
survey respondent tells you that he or she thinks
the biggest problem facing Woodbury, Vermont,
is “the disintegrating ozone layer,” the computer
can’t process that response numerically. You
must translate by coding the responses. We’ve
already discussed coding in connection with
content analysis (Chapter 11) and again in connection with qualitative data analysis (Chapter 13). Now we look at coding specifically for
quantitative analysis, which differs from the
other two primarily in its goal of converting raw
data into numbers.
As with content analysis, the task of quantitative coding is to reduce a wide variety of idiosyncratic items of information to a more limited
set of attributes composing a variable. Suppose, for example, that a survey researcher asks
respondents, “What is your occupation?” The
responses to such a question will vary considerably. Although it will be possible to assign each
reported occupation a separate numerical code,
this procedure will not facilitate analysis, which
typically depends on several subjects having the
same attribute.
The variable occupation has many preestablished coding schemes. One such scheme
distinguishes professional and managerial occupations, clerical occupations, semiskilled occupations, and so forth. Another scheme distinguishes
different sectors of the economy: manufacturing,
health, education, commerce, and so forth. Still
others combine both of these schemes. Using
an established coding scheme gives you the advantage of being able to compare your research
results with those of other studies.
To learn more about preestablished coding schemes, see the Bureau of
Labor Statistics’ Standard Occupational Classification:
stats.bls.gov/soc/soc_majo.htm.
The occupational coding scheme you choose
should be appropriate for the theoretical concepts being examined in your study. For some
studies, coding all occupations as either white-collar or blue-collar might suffice. For others,
self-employed and not self-employed might do.
Or a peace researcher might wish to know only
whether the occupation depended on the defense establishment or not.
Although the coding scheme should be tailored to meet particular requirements of the
analysis, you should keep one general guideline
in mind. If the data are coded to maintain a
great deal of detail, code categories can always
be combined during an analysis that does not require such detail. If the data are coded into relatively few, gross categories, however, you’ll have
no way during analysis to recreate the original
detail. To keep your options open, it’s a good idea
to code your data in greater detail than you plan
to use in the analysis.
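The guideline above can be sketched in a few lines of Python. The
detailed occupation codes and the collapse table are hypothetical (they
are not taken from any official classification); the point is only that
detailed codes can be combined later, whereas gross codes cannot be
expanded.

```python
# Hypothetical detailed occupation codes (not an official classification).
DETAILED = {
    1: "physician",
    2: "accountant",
    3: "office clerk",
    4: "machinist",
    5: "truck driver",
}

# Hypothetical rule for collapsing the detailed codes into two categories.
COLLAPSE = {
    1: "white-collar",
    2: "white-collar",
    3: "white-collar",
    4: "blue-collar",
    5: "blue-collar",
}


def collapse(detailed_codes, mapping):
    """Combine detailed codes into the grosser categories of `mapping`."""
    return [mapping[code] for code in detailed_codes]


sample = [1, 4, 3, 5, 2]  # invented cases, coded in full detail
gross = collapse(sample, COLLAPSE)
print(gross)
```

Going the other way is impossible: a case stored only as “white-collar”
could have been a physician, an accountant, or an office clerk.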
Developing Code Categories
There are two basic approaches to the coding
process. First, you may begin with a relatively
well-developed coding scheme, derived from
your research purpose. Thus, as suggested previously, the peace researcher might code occupations in terms of their relationship to the defense
establishment. You might also use an existing
coding scheme so that you can compare your
findings with those of previous research.
The alternative method is to generate codes
from your data, as discussed in Chapter 13. Let’s
say we’ve asked students in a self-administered
campus survey to say what they believe is the biggest problem facing their college today. Here are
a few of the answers they might have written in.
Tuition is too high
Not enough parking spaces
Faculty don’t know what they are doing
Advisors are never available
Not enough classes offered
Cockroaches in the dorms
Too many requirements
Cafeteria food is infected
Books cost too much
Not enough financial aid
Take a minute to review these responses and
see whether you can identify some categories
represented. Realize that there is no right answer; several coding schemes might be generated from these answers.
Let’s start with the first response: “Tuition is
too high.” What general areas of concern does that
response reflect? One obvious possibility is “Financial Concerns.” Are there other responses that
would fit into that category? Table 14-1 shows
which of the questionnaire responses could fit.
In more general terms, the first answer can
also be seen as reflecting nonacademic concerns.
This categorization would be relevant if your research interest included the distinction between
academic and nonacademic concerns. If that
were the case, the responses might be coded as
shown in Table 14-2.
TABLE 14-1 Student Responses That Can Be Coded “Financial Concerns”
Response                                  Financial Concerns
Tuition is too high                       X
Not enough parking spaces
Faculty don’t know what they are doing
Advisors are never available
Not enough classes offered
Cockroaches in the dorms
Too many requirements
Cafeteria food is infected
Books cost too much                       X
Not enough financial aid                  X
TABLE 14-2 Student Concerns Coded as “Academic” and “Nonacademic”
Response                                  Academic   Nonacademic
Tuition is too high                                  X
Not enough parking spaces                            X
Faculty don’t know what they are doing    X
Advisors are never available              X
Not enough classes offered                X
Cockroaches in the dorms                             X
Too many requirements                     X
Cafeteria food is infected                           X
Books cost too much
Not enough financial aid                             X
Notice that I didn’t code the response “Books cost too much” in
Table 14-2, because this concern could be seen as representing both of
the categories. Books are part of the academic program, but their cost is
not. This signals the need to refine the coding scheme we’re developing.
Depending on our research purpose, we might be especially interested in
identifying any problems that had an academic element; hence we’d code
this one “Academic.” Just as reasonably, however, we might be more
interested in identifying nonacademic problems and would code the
response accordingly. Or, as another alternative, we might create a
separate category for responses that involved both academic and
nonacademic matters.
As yet another alternative, we might want to
separate nonacademic concerns into those involving administrative matters and those dealing with campus facilities. Table 14-3 shows how
the first ten responses would be coded in that
event.
As these few examples illustrate, there are
many possible schemes for coding a set of
data. Your choices should match your research
purposes and reflect the logic that emerges from
the data themselves. Often, you’ll find yourself
modifying the code categories as the coding process proceeds. Whenever you change the list of
categories, however, you must review the data already coded to see whether changes are in order.
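To make the coding step concrete, here is a minimal Python sketch of the
“Financial Concerns” coding shown in Table 14-1. The choice of 1 and 0 as
numeric codes and the function name are assumptions made for
illustration; the three responses marked as financial are the ones marked
in the table.

```python
# The ten write-in answers from the campus survey example.
RESPONSES = [
    "Tuition is too high",
    "Not enough parking spaces",
    "Faculty don't know what they are doing",
    "Advisors are never available",
    "Not enough classes offered",
    "Cockroaches in the dorms",
    "Too many requirements",
    "Cafeteria food is infected",
    "Books cost too much",
    "Not enough financial aid",
]

# Responses marked "Financial Concerns" in Table 14-1.
FINANCIAL = {
    "Tuition is too high",
    "Books cost too much",
    "Not enough financial aid",
}


def code_financial(response):
    """Return 1 for a financial concern and 0 otherwise (assumed codes)."""
    return 1 if response in FINANCIAL else 0


coded = [(response, code_financial(response)) for response in RESPONSES]
financial_count = sum(code for _, code in coded)
print(financial_count)  # 3
```

If the category list changes later, rerunning the coding over all of
`RESPONSES` is the programmatic counterpart of reviewing the data already
coded.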
TABLE 14-3 Nonacademic Concerns Coded as “Administrative” or “Facilities”
Response                                  Academic   Administrative   Facilities
Tuition is too high                                  X
Not enough parking spaces                                             X
Faculty don’t know what they are doing    X
Advisors are never available              X
Not enough classes offered                X
Cockroaches in the dorms                                              X
Too many requirements                     X
Cafeteria food is infected                                            X
Books cost too much                                  X
Not enough financial aid                             X
Like the set of attributes composing a variable, and like the response categories in a closed-ended questionnaire item, code categories should
be both exhaustive and mutually exclusive. Every
piece of information being coded should fit into
one and only one category. Problems arise whenever a given response appears to fit equally into
more than one code category or whenever it fits
into no category: Both signal a mismatch between
your data and your coding scheme.
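A coding scheme can be checked mechanically for both problems. The sketch
below is illustrative only: the scheme is deliberately incomplete and
hypothetical, with “Books cost too much” placed in both categories (as
the earlier discussion suggests it could be) and one response left
without any category.

```python
# Hypothetical, deliberately flawed scheme: each category lists the
# responses judged to fit it.
SCHEME = {
    "Academic": {"Too many requirements", "Books cost too much"},
    "Nonacademic": {"Tuition is too high", "Books cost too much"},
}


def check_scheme(responses, scheme):
    """Return (uncoded, multiply_coded) lists of problem responses.

    A scheme is exhaustive if `uncoded` is empty and mutually exclusive
    if `multiply_coded` is empty.
    """
    uncoded, multiply_coded = [], []
    for response in responses:
        hits = [cat for cat, members in scheme.items() if response in members]
        if not hits:
            uncoded.append(response)
        elif len(hits) > 1:
            multiply_coded.append(response)
    return uncoded, multiply_coded


responses = [
    "Tuition is too high",
    "Too many requirements",
    "Books cost too much",
    "Cockroaches in the dorms",
]
uncoded, multiply_coded = check_scheme(responses, SCHEME)
print(uncoded)         # responses that fit no category
print(multiply_coded)  # responses that fit more than one category
```

Either list being nonempty signals the mismatch between data and coding
scheme described above.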
If you’re fortunate enough to have assistance
in the coding process, you’ll need to train your
coders in the definitions of code categories and
show them how to use those categories properly.
To do so, explain the meaning of the code categories and give several examples of each. To make
sure your coders fully understand what you have in
mind, code several cases ahead of time. Then ask
your coders to code the same cases without knowing how you coded them. Finally, compare your
coders’ work with your own. Any discrepancies
will indicate an imperfect communication of your
coding scheme to your coders. Even with perfect
agreement between you and your coders, however,
it’s best to check the coding of at least a portion of
the cases throughout the coding process.
If you’re not fortunate enough to have assistance in coding, you should still obtain some
verification of your own reliability as a coder. Nobody’s perfect, especially a researcher hot on the
trail of a finding. Suppose that you’re studying an
emerging cult and that you have the impression
that people who do not have a regular family will
be the most likely to regard the new cult as a family substitute. The danger is that whenever you
discover a subject who reports no family, you’ll
unconsciously try to find some evidence in the
subject’s comments that the cult is a substitute for
family. If at all possible, then, get someone else to
code some of your cases to see whether that person makes the same assignments you made.
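One simple way to compare two coders’ work is the percentage of cases on
which they agree. The sketch below assumes that approach; the function
names and the two lists of codes are invented for illustration. Simple
percent agreement does not correct for agreement that would occur by
chance, so treat it as a first check rather than a full reliability
measure.

```python
def percent_agreement(codes_a, codes_b):
    """Percentage of cases that two coders assigned to the same category."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("Both coders must code the same nonempty set of cases.")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)


def disagreements(codes_a, codes_b):
    """Positions (0-based) of the cases on which the coders disagree."""
    return [i for i, (a, b) in enumerate(zip(codes_a, codes_b)) if a != b]


# Invented codes for eight cases, coded independently by two people.
mine = [1, 2, 2, 3, 1, 3, 2, 1]
coder = [1, 2, 3, 3, 1, 3, 2, 2]

print(percent_agreement(mine, coder))  # 75.0
print(disagreements(mine, coder))      # [2, 7]
```

The cases listed by `disagreements` are the ones to discuss with your
coders, since they point to where the coding scheme was communicated
imperfectly.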
Codebook Construction
The end product of the coding process in quantitative analysis is the conversion of data items
into numerical codes. These codes represent
attributes composing variables, which, in turn,
are assigned locations within a data file. A codebook is a document that describes the locations
of variables and lists the assignments of codes to
the attributes composing those variables.
A codebook serves two essential functions.
First, it is the primary guide used in the coding
process. Second, it is your guide for locating
variables and interpreting codes in your data file
during analysis. If you decide to correlate two
variables as a part of your analysis of your data,
the codebook tells you where to find the variables and what the codes represent.
Figure 14-1 is a partial codebook created from
two variables from the General Social Survey.
Though there is no one right format for a codebook, this example presents some of the common elements.
Notice first that each variable is identified
by an abbreviated variable name: POLVIEWS,
ATTEND. We can determine the religious service attendance of respondents, for example, by
referencing ATTEND. This example uses the format established by the General Social Survey,
which has been carried over into SPSS. Other
data sets and/or analysis programs might format variables differently. Some use numerical
codes in place of abbreviated names, for example. You must, however, have some identifier
that will allow you to locate and use the variable in question.
Next, every codebook should contain the full
definition of the variable. In the case of a questionnaire, the definition consists of the exact
wordings of the questions asked, because, as
we’ve seen, the wording of questions strongly
influences the answers returned. In the case of
POLVIEWS, you know that respondents were
given the several political categories and asked
to pick the one that best fit them.
codebook The document used in data processing and
analysis that tells the location of different data items in a
data file. Typically, the codebook identifies the locations of
data items and the meaning of the codes used to represent
different attributes of variables.
The codebook also indicates the attributes
composing each variable. In POLVIEWS, for example, the political categories just mentioned
serve as these attributes: “Extremely liberal,”
“Liberal,” “Slightly liberal,” and so forth.
Finally, notice that each attribute also has a
numeric label. Thus, in POLVIEWS, “Extremely
liberal” is code category 1. These numeric codes
are used in various manipulations of the data. For
example, you might decide to combine categories
1 through 3 (all the “liberal” responses). It’s easier
to do this with code numbers than with lengthy
names.
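A codebook entry can also be represented as a small data structure. The
sketch below is one possible layout, not the format used by SPSS or the
General Social Survey: the dictionary keys, the function name, and the
combined label are assumptions, and the question wording is abbreviated.
The codes and attribute labels are those listed for POLVIEWS in
Figure 14-1.

```python
# One hypothetical way to store a codebook entry.
CODEBOOK = {
    "POLVIEWS": {
        # Abbreviated; a real codebook would hold the exact full wording.
        "question": "Where would you place yourself on this scale?",
        "codes": {
            1: "Extremely liberal",
            2: "Liberal",
            3: "Slightly liberal",
            4: "Moderate, middle of the road",
            5: "Slightly conservative",
            6: "Conservative",
            7: "Extremely conservative",
            8: "Don't know",
            9: "No answer",
        },
    },
}


def recode_polviews(code):
    """Combine codes 1 through 3 into a single group; keep the rest as is."""
    if code in (1, 2, 3):
        return "Liberal (combined)"  # hypothetical label for the new group
    return CODEBOOK["POLVIEWS"]["codes"][code]


print(recode_polviews(2))  # Liberal (combined)
print(recode_polviews(4))  # Moderate, middle of the road
```

Testing `code in (1, 2, 3)` is shorter and less error-prone than matching
three lengthy labels, which is the practical advantage of numeric codes.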
Data Entry
In addition to transforming data into quantitative form, researchers interested in quantitative
analysis also need to convert data into a machine-readable format, so that computers can
read and manipulate the data. There are many
ways of accomplishing this step, depending on
the original form of your data and also the computer program you’ll use for analyzing the data.
I’ll simply introduce you to the process here.
If you find yourself undertaking this task, you
should be able to tailor your work to the particular data source and program you’re using.
If your data have been collected by questionnaire, you might do your coding on the questionnaire itself. Then, data-entry specialists
(including yourself) could enter the data into,
say, an SPSS data matrix or into an Excel spreadsheet that would later be imported into SPSS.
Sometimes, social researchers use optical scan
sheets for data collection. These sheets can be fed
into machines that will convert the black marks
into data, which can be imported into the analysis
program. This procedure only works with subjects
who are comfortable using such sheets, and it’s
usually limited to closed-ended questions.
Sometimes, data entry occurs in the process of data collection. In
computer-assisted telephone interviewing (CATI), for example, the
interviewer keys responses directly into the computer, where the data are
compiled for analysis (see Chapter 9). Even more effortlessly, online
surveys can be constructed so that the respondents enter their own
answers directly into the accumulating database, without the need for an
intervening interviewer or data-entry person.
POLVIEWS
We hear a lot of talk these days about liberals and conservatives. I’m
going to show you a seven-point scale on which the political views that
people might hold are arranged from extremely liberal--point 1--to
extremely conservative--point 7. Where would you place yourself on this
scale?
1. Extremely liberal
2. Liberal
3. Slightly liberal
4. Moderate, middle of the road
5. Slightly conservative
6. Conservative
7. Extremely conservative
8. Don’t know
9. No answer
ATTEND
How often do you attend religious services?
0. Never
1. Less than once a year
2. About once or twice a year
3. Several times a year
4. About once a month
5. 2–3 times a month
6. Nearly every week
7. Every week
8. Several times a week
9. Don’t know, No answer
FIGURE 14-1 Partial Codebook.
Once data have been fully quantified and entered into the computer, researchers can begin
quantitative analysis. Let’s look at the three cases
mentioned at the start of this chapter: univariate, bivariate, and multivariate analyses.
UNIVARIATE ANALYSIS
The simplest form of quantitative analysis, univariate analysis, involves describing a case in
terms of a single variable—specifically, the distribution of attributes that compose it. For example, if sex were measured, we would look at how
many of the subjects were men and how many
were women.
Distributions
The most basic format for presenting univariate data is to report all individual cases, that
is, to list the attribute for each case under
study in terms of the variable in question. Let’s
take as an example the General Social Survey
(GSS) data on attendance at religious services,
ATTEND.
Figure 14-2 shows how you could request
these data, using the Berkeley SDA online analysis program introduced earlier in the book. You
can access this program at sda.berkeley.edu/cgi-bin32/hsda?harcsda+gss06.
In the figure you’ll see that ATTEND has been
entered as the Row variable, and I have specified
a Selection Filter to limit the analysis to the data
collected in the 2006 GSS. Notice, also, that I’ve
selected Bar Chart as the Type of Chart, have
asked for 3-D effects and have asked to see the
percentages. The consequence of this will be
apparent shortly.
univariate analysis The analysis of a single variable, for
purposes of description. Frequency distributions, averages,
and measures of dispersion are examples of univariate
analysis, as distinguished from bivariate and multivariate
analysis.
FIGURE 14-2 Requesting a Univariate Analysis of ATTEND.
FIGURE 14-3 Bar Chart of GSS ATTEND, 2006.
Table 14-4 represents the tabular response to
our request. We see, for example, that 1,009 of
the 4,492 respondents, or 22.5 percent, say they
never attend worship services. As we move down
the table, we see that 19 percent say they attend
every week. To simplify the results, we might
want to combine the last three categories and
say that 31.1 percent attend “About weekly.”
A description of the number of times that the
various attributes of a variable are observed in
a sample is called a frequency distribution.
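A frequency distribution is straightforward to compute. The following
Python sketch counts how often each code occurs in a small, invented list
of ATTEND codes and then combines the three highest categories using the
percentages reported in Table 14-4 (5.0, 19.0, and 7.1). The function
name and the sample are hypothetical.

```python
from collections import Counter


def frequency_distribution(values):
    """Map each observed attribute to (count, percent of all cases)."""
    counts = Counter(values)
    n = len(values)
    return {
        attr: (count, round(100.0 * count / n, 1))
        for attr, count in sorted(counts.items())
    }


# Invented ATTEND codes for ten respondents, for illustration only.
sample = [0, 7, 7, 3, 0, 8, 2, 7, 0, 4]
dist = frequency_distribution(sample)
print(dist[0])  # (3, 30.0): three of ten respondents gave code 0

# Percentages reported in Table 14-4 for codes 6, 7, and 8.
REPORTED = {6: 5.0, 7: 19.0, 8: 7.1}
about_weekly = round(sum(REPORTED.values()), 1)
print(about_weekly)  # 31.1
```

Summing the three reported percentages reproduces the 31.1 percent
“About weekly” figure mentioned above.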
Sometimes it’s easiest to see a frequency distribution in a graph. Figure 14-3 was created
by SDA based on the specifications in the chart
options section of Figure 14-2. The vertical scale
on the left side of the graph indicates the percentages selecting each of the answers that are
displayed along the horizontal axis of the graph.
Take a minute to notice how the percentages in
Table 14-4 correspond to the heights of the bars
in Figure 14-3.
frequency distribution A description of the number of
times the various attributes of a variable are observed in a
sample. The report that 53 percent of a sample were men
and 47 percent were women would be a simple example of
a frequency distribution.
average An ambiguous term generally suggesting typical
or normal: a central tendency. The mean, median, and
mode are specific examples of mathematical averages.
This program also offers other graphical possibilities. In Figure 14-2, you could have specified
“Pie Chart” instead of “Bar Chart” as the type of
chart desired. Figure 14-4 shows the way the data
would have been presented in that case.
Central Tendency
Beyond simply reporting the overall distribution
of values, sometimes called the marginal frequencies or just the marginals, you may choose
to present your data in the form of an average,
or measure of central tendency. You’re already
familiar with the concept of central tendency
from the many kinds of averages you use in everyday life to express the “typical” value of a variable. For instance, in baseball a batting average
of .300 says that a batter gets a hit three out of
every ten opportunities on average. Over the
course of a season, a hitter might go through extended periods without getting any hits at all and
go through other periods when he or she gets a
bunch of hits all at once. Over time, though, the
central tendency of the batter’s performance
can be expressed as getting three hits in every
ten chances. Similarly, your grade point average
expresses the “typical” value of all your grades
taken together, even though some of them might
be A’s, others B’s, and one or two might be C’s
(I know you never get anything lower than a C).
TABLE 14-4 Attendance at Worship Services, 2006
ATTEND: How Often R Attends Religious Services
Value Label          Value   Frequency   Percent
NEVER                0       1,009       22.5
LT ONCE A YEAR       1       305         6.8
ONCE A YEAR          2       571         12.7
SEVRL TIMES A YR     3       522         11.6
ONCE A MONTH         4       307         6.8
2–3X A MONTH         5       378         8.4
NRLY EVERY WEEK      6       224         5.0
EVERY WEEK           7       856         19.0
MORE THN ONCE WK     8       321         7.1
Total                        4,492       100.0
Averages like these are more properly called
the arithmetic mean (the result of dividing the
sum of the values by the total number of cases).
The mean is only one way to measure central
tendency or “typical” values. Two other options
are the mode (the most frequently occurring attribute) and the median (the middle attribute in
the ranked distribution of observed attributes).
Here’s how the three averages would be calculated from a set of data.
Suppose you’re conducting an experiment
that involves teenagers as subjects. They range in
age from 13 to 19, as indicated in the following
table:
Age    Number
13     3
14     4
15     6
16     8
17     4
18     3
19     3
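As a rough check on hand calculations, the three averages for this age
distribution can be computed with Python’s standard `statistics` module.
The variable names are hypothetical. The sketch treats the median as the
age of the middle case in the ranked list; other conventions for grouped
data can give a slightly different figure.

```python
import statistics

# Number of subjects at each age, from the table above.
AGE_COUNTS = {13: 3, 14: 4, 15: 6, 16: 8, 17: 4, 18: 3, 19: 3}

# Expand the table into one entry per subject (31 subjects in all).
ages = [age for age, count in AGE_COUNTS.items() for _ in range(count)]

mean_age = statistics.mean(ages)      # sum of ages / number of cases = 492 / 31
mode_age = statistics.mode(ages)      # most frequently occurring age
median_age = statistics.median(ages)  # age of the middle (16th) case

print(round(mean_age, 2))  # 15.87
print(mode_age)            # 16
print(median_age)          # 16
```

Here the mode and the middle-case median coincide at 16, while the mean
falls slightly lower.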
mean An average computed by summing the values of
several observations and dividing by the number of observations. If you now have a grade point average of 4.0 based
on 10 courses, and you get an F in this course, your new
grade point (mean) average will be 3.6.
mode An average representing the most frequently
observed value or attribute. If a sample contains 1,000
Protestants, 275 Catholics, and 33 Jews, “Protestant” is the
modal category.
median An average representing the value of the “middle”
case…
FIGURE 14-4 Pie Chart of GSS ATTEND, 2006.