INFERENTIAL STATISTICS
Many, if not most, social scientific research
projects involve the examination of data collected from a sample drawn
from a larger population. A sample of people may be interviewed in
a survey; a sample of divorce records may be coded and analyzed; a
sample of newspapers may be examined through content analysis. Researchers
seldom if ever study samples just to describe the samples per se;
in most instances, their ultimate purpose is to make assertions about
the larger population from which the sample has been selected. Frequently,
then, you'll wish to interpret your univariate and multivariate sample
findings as the basis for inferences about some population.
This section examines inferential statistics, the statistical measures
used for making inferences from findings based on sample observations
to a larger population. We'll begin with univariate data and move
to multivariate.
Univariate Inferences
Your textbook dealt with methods of
presenting univariate data. Each summary measure was intended as a
method of describing the sample studied. Now we'll use such measures
to make broader assertions about a population. This section addresses
two univariate measures: percentages and means.
If 50 percent of a sample of people say they
had colds during the past year, 50 percent is also our best estimate
of the proportion of colds in the total population from which the
sample was drawn. (This estimate assumes a simple random sample, of
course.) It's rather unlikely, however, that precisely 50 percent
of the population had colds during the year. If a rigorous sampling
design for random selection has been followed, however, we'll be able
to estimate the expected range of error when the sample finding is
applied to the population.
Your textbook's discussion of sampling theory covered the procedures
for making such estimates, so I'll only review them here. In the case
of a percentage, the quantity
$$\sqrt{\frac{p \times q}{n}}$$
where p is a proportion, q equals (1 − p), and n is the sample size,
is called the standard error. As noted in your textbook, this quantity
is very important in the estimation of sampling error. We may be 68
percent confident that the population figure falls within plus or
minus one standard error of the sample figure; we may be 95 percent
confident that it falls within plus or minus two standard errors;
and we may be 99.7 percent confident that it falls within plus or
minus three standard errors.
Any statement of sampling error, then, must contain two essential
components: the confidence level (for example, 95 percent) and the
confidence interval (for example, ±2.5 percent). If 50 percent of
a sample of 1,600 people say they had colds during the year, we might
say we're 95 percent confident that the population figure is between
47.5 percent and 52.5 percent.
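The arithmetic behind this example can be sketched in a few lines of Python. The function names are my own illustrations rather than part of any statistical package, and the sketch assumes simple random sampling, as the formula itself does.

```python
import math

def standard_error(p, n):
    """Standard error of a sample proportion p for a sample of size n."""
    q = 1 - p
    return math.sqrt(p * q / n)

def confidence_interval(p, n, num_standard_errors=2):
    """Interval of plus or minus the given number of standard errors."""
    margin = num_standard_errors * standard_error(p, n)
    return (p - margin, p + margin)

# The colds example: 50 percent of a sample of 1,600 people.
se = standard_error(0.50, 1600)
low, high = confidence_interval(0.50, 1600)
print(f"standard error: {se:.4f}")
print(f"95 percent interval: {low:.3f} to {high:.3f}")
```

Two standard errors of 1.25 percentage points each give the plus-or-minus 2.5 points reported above.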
In this example we've moved beyond simply describing the sample
into the realm of making estimates (inferences) about the larger population.
In doing so, we must take care in several ways.
First, the sample must be drawn from the population about which
inferences are being made. A sample taken from a telephone directory
cannot legitimately be the basis for statistical inferences about
the population of a city, but only about the population of telephone
subscribers with listed numbers.
Second, the inferential statistics assume several
things. To begin with, they assume simple random sampling, which is
virtually never the case in sample surveys. The statistics also assume
sampling with replacement, which is almost never done, but this is
probably not a serious problem. Although systematic sampling is used
more frequently than random sampling, it, too, probably presents no
serious problem if done correctly. Stratified sampling, because it
improves representativeness, clearly presents no problem. Cluster
sampling does present a problem, however, because the estimates of
sampling error may be too small. Quite clearly, street-corner sampling
does not warrant the use of inferential statistics. Finally, this
standard error sampling technique assumes a 100-percent completion
rate, that is, that everyone in the sample completed the survey. This
problem increases in seriousness as the completion rate decreases.
Third, inferential statistics are addressed to sampling error only,
not nonsampling error such as coding errors or misunderstandings of
questions by respondents. Thus, although we might state correctly
that between 47.5 and 52.5 percent of the population (95 percent confidence)
would report having colds during the previous year, we couldn't so
confidently guess the percentage who had actually had them. Because
nonsampling errors are probably larger than sampling errors in a respectable
sample design, we need to be especially cautious in generalizing from
our sample findings to the population.
Tests of Statistical Significance
There is no scientific answer to the question
of whether a given association between two variables is significant,
strong, important, interesting, or worth reporting. Perhaps the ultimate
test of significance rests with your ability to persuade your audience
(present and future) of the association's significance. At the same
time, there is a body of inferential statistics to assist you in this
regard called parametric tests of significance. As the name suggests,
parametric statistics are those that make certain assumptions about
the parameters describing the population from which the sample is
selected. They allow us to determine the statistical significance
of associations. "Statistical significance" does not imply "importance"
or "significance" in any general sense. It refers simply to the likelihood
that relationships observed in a sample could be attributed to sampling
error alone.
Although tests of statistical significance are
widely reported in social scientific literature, the logic underlying
them is rather subtle and often misunderstood. Tests of significance
are based on the same sampling logic discussed elsewhere in this book.
To understand that logic, let's return for a moment to the concept
of sampling error in regard to univariate data.
Recall that a sample statistic normally provides the best single
estimate of the corresponding population parameter, but the statistic
and the parameter seldom correspond precisely. Thus, we report the
probability that the parameter falls within a certain range (confidence
interval). The degree of uncertainty within that range is due to normal
sampling error. The corollary of such a statement is, of course, that
it is improbable that the parameter would fall outside the specified
range only as a result of sampling error. Thus, if we estimate that
a parameter (99.9 percent confidence) lies between 45 percent and
55 percent, we say by implication that it is extremely improbable
that the parameter is actually, say, 90 percent if our only error
of estimation is due to normal sampling. This is the basic logic behind
tests of statistical significance.
The Logic of Statistical Significance
I think I can illustrate the logic of statistical
significance best in a series of diagrams representing the selection
of samples from a population. Here are the elements in the logic:
1. Assumptions regarding the independence of
two variables in the population study
2. Assumptions regarding the representativeness of samples selected
through conventional probability sampling procedures
3. The observed joint distribution of sample elements in terms of
the two variables
Figure 17-5 represents a hypothetical population
of 256 people; half are women, half are men. The diagram also indicates
how each person feels about women enjoying equality with men. In the
diagram, those favoring equality have open circles, those opposing
it have their circles filled in.
Figure 17-5: A Hypothetical Population of Men and Women Who Either Favor or Oppose Sexual Equality
The question we'll be investigating is whether
there is any relationship between gender and feelings about equality
for men and women. More specifically, we'll see if women are more
likely to favor equality than are men, since women would presumably
benefit more from it. Take a moment to look at Figure 17-5 and see
what the answer to this question is.
The illustration in the figure indicates no relationship
between gender and attitudes about equality. Exactly half of each
group favors equality and half opposes it. Recall the earlier discussion
of proportionate reduction of error. In this instance, knowing a person's
gender would not reduce the "errors" we'd make in guessing his or
her attitude toward equality. The table at the bottom of Figure 17-5
provides a tabular view of what you can observe in the graphic diagram.
Figure 17-6 represents the selection of a one-fourth
sample from the hypothetical population. In terms of the graphic illustration,
a "square" selection from the center of the population provides a
representative sample. Notice that our sample contains 16 of each
type of person: half are men and half are women; half of each gender
favors equality, and the other half opposes it.
Figure 17-6: A Representative Sample
The sample selected in Figure 17-6 would allow
us to draw accurate conclusions about the relationship between gender
and equality in the larger population. Following the sampling logic
used in the textbook, we'd note there was no relationship between
gender and equality in the sample; thus, we'd conclude there was similarly
no relationship in the larger population, since we've presumably selected
a sample in accord with the conventional rules of sampling.
Of course, real-life samples are seldom such perfect
reflections of the populations from which they are drawn. It would
not be unusual for us to have selected, say, one or two extra men
who opposed equality and a couple of extra women who favored it, even
if there was no relationship between the two variables in the population.
Such minor variations are part and parcel of probability sampling.
Figure 17-7, however, represents a sample that falls far short of
the mark in reflecting the larger population. Notice it includes far
too many supportive women and opposing men. As the table shows, three-fourths
of the women in the sample support equality, but only one-fourth of
the men do so. If we had selected this sample from a population in
which the two variables were unrelated to each other, we'd be sorely
misled by our sample.
Figure 17-7: An Unrepresentative Sample
As you'll recall, it's unlikely that a properly
drawn probability sample would ever be as inaccurate as the one shown
in Figure 17-7. In fact, if we actually selected a sample that gave
us the results this one does, we'd look for a different explanation.
Figure 17-8 illustrates the more likely situation.
Figure 17-8: A Representative Sample from a Population in Which the Variables Are Related
Notice that the sample selected in Figure 17-8
also shows a strong relationship between gender and equality. The
reason is quite different this time. We've selected a perfectly representative
sample, but we see that there is actually a strong relationship between
the two variables in the population at large. In this latest figure,
women are more likely to support equality than are men. That's the
case in the population, and the sample reflects it.
In practice, of course, we never know what's so
for the total population; that's why we select samples. So if we selected
a sample and found the strong relationship presented in Figures 17-7
and 17-8, we'd need to decide whether that finding accurately reflected
the population or was simply a product of sampling error.
The fundamental logic of tests of statistical
significance, then, is this: Faced with any discrepancy between the
assumed independence of variables in a population and the observed
distribution of sample elements, we may explain that discrepancy in
either of two ways: (1) we may attribute it to an unrepresentative
sample, or (2) we may reject the assumption of independence. The logic
and statistics associated with probability sampling methods offer
guidance about the varying probabilities of varying degrees of unrepresentativeness
(expressed as sampling error). Most simply put, there is a high probability
of a small degree of unrepresentativeness and a low probability of
a large degree of unrepresentativeness.
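A small simulation can make this point concrete. The sketch below is only an illustration under stated assumptions: it builds a population like the one in Figure 17-5 (256 people, 64 of each combination, so gender and attitude are unrelated), draws repeated random samples of 64, and counts how often the women and men in a sample differ by a little or by a lot. The cutoffs of 10 and 50 percentage points, the number of samples, and the seed are arbitrary choices of mine.

```python
import random

# Hypothetical population modeled on Figure 17-5: 64 people of each
# combination of gender and attitude, so the variables are independent.
population = (
    [("woman", "favor")] * 64 + [("woman", "oppose")] * 64
    + [("man", "favor")] * 64 + [("man", "oppose")] * 64
)

def percentage_gap(sample):
    """Women's percent favoring minus men's percent favoring, in points."""
    def percent_favoring(gender):
        attitudes = [a for g, a in sample if g == gender]
        if not attitudes:
            return 0.0
        return 100 * attitudes.count("favor") / len(attitudes)
    return percent_favoring("woman") - percent_favoring("man")

rng = random.Random(17)  # fixed seed so the run is repeatable
gaps = [abs(percentage_gap(rng.sample(population, 64))) for _ in range(2000)]

# Share of samples showing a small gap versus a gap as large as the
# 75 percent versus 25 percent split in Figure 17-7.
small = sum(gap < 10 for gap in gaps) / len(gaps)
large = sum(gap >= 50 for gap in gaps) / len(gaps)
print(f"share of samples with a gap under 10 points: {small:.3f}")
print(f"share of samples with a gap of 50 points or more: {large:.3f}")
```

Small gaps should turn up in a majority of the samples, whereas a gap as large as the one in Figure 17-7 should almost never appear when the variables are unrelated in the population.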
The statistical significance of a relationship observed in a set
of sample data, then, is always expressed in terms of probabilities.
"Significant at the .05 level (p ≤ .05)" simply means that the probability
that a relationship as strong as the observed one can be attributed
to sampling error alone is no more than 5 in 100. Put somewhat differently,
if two variables are independent of one another in the population,
and if 100 probability samples are selected from that population,
no more than 5 of those samples should provide a relationship as strong
as the one that has been observed.
There is, then, a corollary to confidence intervals in tests of
significance, which represents the probability of the measured associations
being due only to sampling error. This is called the level of significance.
Like confidence intervals, levels of significance are derived from
a logical model in which several samples are drawn from a given population.
In the present case, we assume that there is no association between
the variables in the population, and then we ask what proportion of
the samples drawn from that population would produce associations
at least as great as those measured in the empirical data. Three levels
of significance are frequently used in research reports: .05, .01,
and .001. These mean, respectively, that the chances of obtaining
the measured association as a result of sampling error are 5/100,
1/100, and 1/1,000.
Researchers who use tests of significance normally
follow one of two patterns. Some specify in advance the level of significance
they'll regard as sufficient. If any measured association is statistically
significant at that level, they'll regard it as representing a genuine
association between the two variables. In other words, they're willing
to discount the possibility of its resulting from sampling error only.
Other researchers prefer to report the specific
level of significance for each association, disregarding the conventions
of .05, .01, and .001. Rather than reporting that a given association
is significant at the .05 level, they might report significance at
the .023 level, indicating the chances of its having resulted from
sampling error as 23 out of 1,000.
Chi Square
Chi square (χ²) is a frequently used test of
significance in social science. It is based on the null hypothesis:
the assumption that there is no relationship between the two variables
in the total population. Given the observed distribution of values
on the two separate variables, we compute the conjoint distribution
that would be expected if there were no relationship between the two
variables. The result of this operation is a set of expected frequencies
for all the cells in the contingency table. We then compare this expected
distribution with the distribution of cases actually found in the
sample data, and we determine the probability that the discovered
discrepancy could have resulted from sampling error alone. An example
will illustrate this procedure.
Let's assume we're interested in the possible
relationship between church attendance and gender for the members
of a particular church. To test this relationship, we select a sample
of 100 church members at random. We find that our sample is made up
of 40 men and 60 women and that 70 percent of our sample say they
attended church during the preceding week, whereas the remaining 30
percent say they did not.
If there is no relationship between gender and church attendance,
then 70 percent of the men in the sample should have attended church
during the preceding week, and 30 percent should have stayed away.
Moreover, women should have attended in the same proportion. Table
17-7 (part I) shows that, based on this model, 28 men and 42 women
would have attended church, with 12 men and 18 women not attending.
[Table 17-7 about here]
Part II of Table 17-7 presents the observed attendance
for the hypothetical sample of 100 church members. Note that 20 of
the men report having attended church during the preceding week, and
the remaining 20 say they did not. Among the women in the sample,
50 attended church and 10 did not. Comparing the expected and observed
frequencies (parts I and II), we note that somewhat fewer men attended
church than expected, whereas somewhat more women attended than expected.
Chi square is computed as follows. For each cell
in the tables, the researcher (1) subtracts the expected frequency
for that cell from the observed frequency, (2) squares this quantity,
and (3) divides the squared difference by the expected frequency.
This procedure is carried out for each cell in the tables, and the
several results are added together. (Part III of Table 17-7 presents
the cell-by-cell computations.) The final sum is the value of chi
square: 12.70 in the example.
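The three steps just described can be traced in a short Python sketch. It uses the observed frequencies from the church attendance example and derives the expected frequencies from the marginal totals; the variable names are my own and are only illustrative.

```python
# Observed frequencies from the church attendance example.
observed = {
    ("men", "attended"): 20, ("men", "did not attend"): 20,
    ("women", "attended"): 50, ("women", "did not attend"): 10,
}

genders = ["men", "women"]
responses = ["attended", "did not attend"]
total = sum(observed.values())

# Marginal totals for each variable.
gender_totals = {g: sum(observed[(g, r)] for r in responses) for g in genders}
response_totals = {r: sum(observed[(g, r)] for g in genders) for r in responses}

chi_square = 0.0
for g in genders:
    for r in responses:
        # Expected frequency if gender and attendance were unrelated.
        expected = gender_totals[g] * response_totals[r] / total
        # (observed - expected) squared, divided by expected.
        contribution = (observed[(g, r)] - expected) ** 2 / expected
        print(f"{g:5s} {r:14s} expected={expected:5.1f} "
              f"contribution={contribution:.2f}")
        chi_square += contribution

print(f"chi square = {chi_square:.2f}")
```

The four cell contributions sum to the 12.70 reported in the text.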
This value is the overall discrepancy between
the observed conjoint distribution in the sample and the distribution
we would expect if the two variables were unrelated to each other.
Of course, the mere discovery of a discrepancy does not prove that
the two variables are related, since normal sampling error might produce
discrepancies even when there is no relationship in the total population.
The magnitude of the value of chi square, however, permits us to estimate
the probability of that having happened.
Degrees of Freedom:
To determine the statistical significance of the observed relationship,
we must use a standard set of chi square values. This will require
the computation of the degrees of freedom, which refers to the possibilities
for variation within a statistical model. Suppose I challenge you
to find three numbers whose mean is 11. There is an infinite number
of solutions to this problem: (11, 11, 11), (10, 11, 12), (−11, 11,
33), etc. Now, suppose I require that one of the numbers be 7. There
would still be an infinite number of possibilities for the other two
numbers.
If I told you one number had to be 7 and another
10, there would be only one possible value for the third. If the average
of three numbers is 11, their sum must be 33. If two of the numbers
total 17, the third must be 16. In this situation, we say there are
two degrees of freedom. Two of the numbers could have any values we
choose, but once they are specified, the third number is determined.
More generally, whenever we are examining the
mean of N values, we can see that the degrees of freedom is N − 1.
Thus in the case of the mean of 23 values, we could make 22 of them
anything we liked, but the 23rd would then be determined.
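To see the constraint at work, here is a minimal sketch. The function name is hypothetical; it simply recovers the one value that is no longer free once the mean and the other N − 1 values are fixed.

```python
def remaining_value(known_values, mean, n):
    """Return the one value still undetermined when the mean of n values
    and n - 1 of the values are already fixed."""
    if len(known_values) != n - 1:
        raise ValueError("exactly n - 1 values must be known")
    # The n values must sum to mean * n, so the last one is what is left.
    return mean * n - sum(known_values)

# Two numbers are chosen freely (7 and 10); the mean of three must be 11.
print(remaining_value([7, 10], mean=11, n=3))  # 16
```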
A similar logic applies to bivariate tables, such
as those analyzed by chi square. Consider a table reporting the relationship
between two dichotomous variables: gender (men/women) and abortion
attitude (approve/disapprove). Notice that the table provides the
marginal frequencies of both variables.
Abortion Attitude     Men    Women    Total
Approve                                 500
Disapprove                              500
Total                 500      500    1,000
Despite the conveniently round numbers in this hypothetical example,
notice that there are numerous possibilities for the cell frequencies.
For example, it could be the case that all 500 men approve and all
500 women disapprove, or it could be just the reverse. Or there could
be 250 cases in each cell. Notice there are numerous other possibilities.
Now the question is, How many cells could we fill in pretty much
as we choose before the remainder are determined by the marginal frequencies?
The answer is only one. If we know that 300 men approved, for example,
then 200 men would have had to disapprove, and the distribution would
need to be just the opposite for the women.
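The following sketch shows how a single cell, together with the marginal frequencies, fixes the rest of a two-by-two table. The function and argument names are illustrative only.

```python
def complete_table(men_approve, column_totals, row_totals):
    """Fill in a two-by-two table from one cell and the marginal totals.

    column_totals maps "men" and "women" to their totals; row_totals
    maps "approve" and "disapprove" to their totals.
    """
    men_disapprove = column_totals["men"] - men_approve
    women_approve = row_totals["approve"] - men_approve
    women_disapprove = column_totals["women"] - women_approve
    return {
        ("men", "approve"): men_approve,
        ("men", "disapprove"): men_disapprove,
        ("women", "approve"): women_approve,
        ("women", "disapprove"): women_disapprove,
    }

# Knowing only that 300 men approve determines every other cell.
table = complete_table(
    300,
    column_totals={"men": 500, "women": 500},
    row_totals={"approve": 500, "disapprove": 500},
)
for cell, count in table.items():
    print(cell, count)
```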
In this instance, then, we say the table has one
degree of freedom. Now, take a few minutes to construct a three-by-three
table. Assume you know the marginal frequencies for each variable,
and see if you can determine how many degrees of freedom it has.
For chi square, the degrees of freedom are computed
as follows: the number of rows in the table of observed frequencies,
minus 1, is multiplied by the number of columns, minus 1. This may
be written as (r − 1)(c − 1). For a three-by-three table, then, there
are four degrees of freedom: (3 − 1)(3 − 1) = (2)(2) = 4.
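As a quick check on the formula, it can be written as a one-line function. The function name is my own.

```python
def degrees_of_freedom(rows, columns):
    """Degrees of freedom for chi square: (r - 1)(c - 1), totals excluded."""
    return (rows - 1) * (columns - 1)

print(degrees_of_freedom(2, 2))  # 1
print(degrees_of_freedom(3, 3))  # 4
```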
In the example of gender and church attendance,
we have two rows and two columns (discounting the totals), so there
is one degree of freedom. Turning to a table of chi square values
(see Appendix F), we find that for one degree of freedom and random
sampling from a population in which there is no relationship between
two variables, 10 percent of the time we should expect a chi square
of at least 2.7. Thus, if we selected 100 samples from such a population,
we should expect about 10 of those samples to produce chi squares
equal to or greater than 2.7. Moreover, we should expect chi square
values of at least 6.6 in only 1 percent of the samples and chi square
values of 7.9 in only half a percent (.005) of the samples. The higher
the chi square value, the less probable it is that the value could
be attributed to sampling error alone.
In our example, the computed value of chi square
is 12.70. If there were no relationship between gender and church
attendance in the church member population and a large number of samples
had been selected and studied, then we would expect a chi square of
this magnitude in fewer than 1/10 of 1 percent (.001) of those samples.
Thus, the probability of obtaining a chi square of this magnitude
is less than .001, if random sampling has been used and there is no
relationship in the population. We report this finding by saying the
relationship is statistically significant at the .001 level. Because
it is so improbable that the observed relationship could have resulted
from sampling error alone, we're likely to reject the null hypothesis
and assume that there is a relationship between the two variables
in the population of church members.
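If no table of chi square values is at hand, the probabilities for one degree of freedom can be computed directly. The sketch below relies on a standard result that holds only for one degree of freedom: chi square is then the square of a standard normal variable, so the tail probability can be obtained from the complementary error function in Python's math module. The function name is illustrative.

```python
import math

def chi_square_p_value_one_df(chi_square):
    """Probability of a chi square at least this large when df = 1.

    With one degree of freedom, chi square is the square of a standard
    normal variable, so the tail probability is erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(chi_square / 2))

# The table values cited in the text, followed by the computed 12.70.
for value in (2.7, 6.6, 7.9, 12.70):
    p = chi_square_p_value_one_df(value)
    print(f"chi square {value:5.2f}: p = {p:.4f}")
```

The first three probabilities come out close to .10, .01, and .005, and the probability for 12.70 falls below .001, in line with the table values cited above.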
Most measures of association can be tested for
statistical significance in a similar manner. Standard tables of values
permit us to determine whether a given association is statistically
significant and at what level. Any standard statistics textbook provides
instructions on the use of such tables.
Some Words of Caution
Tests of significance provide
an objective yardstick that we can use to estimate the statistical
significance of associations between variables. They help us rule
out associations that may not represent genuine relationships in the
population under study. However, the researcher who uses or reads
reports of significance tests should remain wary of several dangers
in their interpretation.
First, we have been discussing tests of statistical
significance; there are no objective tests of substantive significance.
Thus, we may be legitimately convinced that a given association is
not due to sampling error, but we may be in the position of asserting
without fear of contradiction that two variables are only slightly
related to each other. Recall that sampling error is an inverse function
of sample size: the larger the sample, the smaller the expected error.
Thus, a correlation of, say, .1 might very well be significant (at
a given level) if discovered in a large sample, whereas the same correlation
between the same two variables would not be significant if found in
a smaller sample. This makes perfectly good sense given the basic
logic of tests of significance: In the larger sample, there is less
chance that the correlation could be simply the product of sampling
error. In both samples, however, it might represent an essentially
zero correlation.
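Here is a sketch of how the same weak correlation fares in two samples of different sizes. It rests on assumptions that go beyond the text: both variables are dichotomous, so the correlation is phi and chi square equals n times phi squared with one degree of freedom, and the sample sizes of 100 and 1,000 are arbitrary illustrations.

```python
import math

def p_value_for_correlation(phi, n):
    """Chi square and p-value for a correlation phi between two
    dichotomous variables in a sample of size n.

    Assumes a two-by-two table, where chi square = n * phi**2 with one
    degree of freedom.
    """
    chi_square = n * phi ** 2
    return chi_square, math.erfc(math.sqrt(chi_square / 2))

for n in (100, 1000):
    chi_square, p = p_value_for_correlation(0.1, n)
    verdict = "significant" if p <= 0.05 else "not significant"
    print(f"n = {n:4d}: chi square = {chi_square:5.2f}, "
          f"p = {p:.4f} ({verdict} at .05)")
```

The correlation is .1 in both cases; only the sample size, and therefore the verdict of the significance test, changes.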
The distinction between statistical and substantive
significance is perhaps best illustrated by those cases where there
is absolute certainty that observed differences cannot be a result
of sampling error. This would be the case when we observe an entire
population. Suppose we were able to learn the ages of every public
official in the United States and of every public official in Russia.
For argument's sake, let's assume further that the average age of
U.S. officials was 45 years old compared with, say, 46 for the Russian
officials. Because we would have the ages of all officials, there
would be no question of sampling error. We would know with certainty
that the Russian officials were older than their U.S. counterparts.
At the same time, we would say that the difference was of no substantive
significance. We'd conclude, in fact, that they were essentially the
same age.
Second, lest you be misled by this hypothetical
example, realize that statistical significance should not be calculated
on relationships observed in data collected from whole populations.
Remember, tests of statistical significance measure the likelihood
of relationships between variables being only a product of sampling
error; if there's no sampling, there's no sampling error.
Third, tests of significance are based on the same sampling assumptions
we used in computing confidence intervals. To the extent that these
assumptions are not met by the actual sampling design, the tests of
significance are not strictly legitimate.
While we have examined statistical significance
here in the form of chi square, there are several other measures commonly
used by social scientists. Analysis of variance and t-tests are two
examples you may run across in your studies.
As is the case for most matters covered in this
book, I have a personal prejudice. In this instance, it is against
tests of significance. I don't object to the statistical logic of
those tests, because the logic is sound. Rather, I'm concerned that
such tests seem to mislead more than they enlighten. My principal
reservations are the following:
1. Tests of significance make sampling assumptions
that are virtually never satisfied by actual sampling designs.
2. They depend on the absence of nonsampling errors, a questionable
assumption in most actual empirical measurements.
3. In practice, they are too often applied to measures of association
that have been computed in violation of the assumptions made by those
measures (for example, product-moment correlations computed from ordinal
data).
4. Statistical significance is too easily misinterpreted
as "strength of association," or substantive significance.
These concerns are underscored by a recent study
(Sterling, Rosenbaum, and Weinkam 1995) examining the publication
policies of nine psychology and three medical journals. As the researchers
discovered, the journals were quite unlikely to publish articles that
did not report statistically significant correlations among variables.
They quote the following from a rejection letter:
Unfortunately, we are not able to publish this
manuscript. The manuscript is very well written and the study was
well documented. Unfortunately, the negative results translates into
a minimal contribution to the field. We encourage you to continue
your work in this area and we will be glad to consider additional
manuscripts that you may prepare in the future.
(Sterling et al. 1995:109)
Let's suppose a researcher conducts a scientifically
excellent study to determine whether X causes Y. The results indicate
no statistically significant correlation. That's good to know. If
we're interested in what causes cancer, war, or juvenile delinquency,
it's good to know that a possible cause actually does not cause it.
That knowledge would free researchers to look elsewhere for causes.
As we've seen, however, such a study might very
well be rejected by journals. As such, other researchers would continue
testing whether X causes Y, not knowing that previous studies found
no causal relationship. This would produce many wasted studies, none
of which would see publication and draw a close to the analysis of
X as a cause of Y.
From what you've learned about probabilities,
however, you can understand that if enough studies are conducted,
one will eventually measure a statistically significant correlation
between X and Y. If there is absolutely no relationship between the
two variables, we would expect a correlation significant at the .05
level five times out of a hundred, since that's what the .05 level
of significance means. If a hundred studies were conducted, therefore,
we could expect five to suggest a causal relationship where there
was actually none, and those five studies would be published!
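The arithmetic here is simple enough to check directly. This sketch assumes the hundred studies are independent of one another, which is my simplification rather than something the argument requires.

```python
alpha = 0.05      # level of significance
studies = 100     # independent studies of two unrelated variables

# Expected number of studies reaching significance by sampling error alone.
expected_false_positives = alpha * studies

# Probability that at least one study reaches significance by chance.
prob_at_least_one = 1 - (1 - alpha) ** studies

print(f"expected spurious findings: {expected_false_positives:.0f}")
print(f"chance of at least one: {prob_at_least_one:.3f}")
```

Under these assumptions, it is a near certainty that at least one of the hundred studies will report a "significant" relationship that does not exist.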
There are, then, serious problems inherent in
too much reliance on tests of statistical significance. At the same
time (perhaps paradoxically) I would suggest that tests of significance
can be a valuable asset to the researcher, useful tools for understanding
data. Although many of my comments suggest an extremely conservative
approach to tests of significance, that you should use them only when
all assumptions are met, my general perspective is just the reverse.
I encourage you to use any statistical technique,
any measure of association or test of significance, if it will help
you understand your data. If the computation of product-moment correlations
among nominal variables and the testing of statistical significance
in the context of uncontrolled sampling will meet this criterion,
then I encourage such activities. I say this in the spirit of what
Hanan Selvin, another pioneer in developing the elaboration model,
referred to as "data-dredging techniques." Anything goes, if it leads
ultimately to the understanding of data and of the social world under
study.
The price of this radical freedom, however, is
the giving up of strict, statistical interpretations. You will not
be able to base the ultimate importance of your finding solely on
a significant correlation at the .05 level. Whatever the avenue of
discovery, empirical data must ultimately be presented in a legitimate
manner, and their importance must be argued logically.