The correlation coefficient and causal relationship: the formulas and their interpretation. Limitations of correlation analysis

The correlation coefficient is a measure of the strength of association between two variables. Its calculation gives an idea of whether there is a dependence between two data arrays. Unlike regression, correlation does not allow one to predict values. Nevertheless, calculating the coefficient is an important step of preliminary statistical analysis. For example, suppose we find that the correlation coefficient between the level of foreign direct investment and the GDP growth rate is high. This suggests that to ensure prosperity one needs to create a favorable climate for foreign entrepreneurs. Not such an obvious conclusion at first glance!

Correlation and causality

Perhaps no other statistical concept has entered our lives so firmly. The correlation coefficient is used in all areas of social knowledge. Its main danger is that its high values are often used to speculate, to convince people and make them believe in certain conclusions. In reality, however, a strong correlation does not at all indicate a causal relationship between the quantities.

Correlation coefficients: the Pearson and Spearman formulas

There are several basic indicators that characterize the relationship between two variables. Historically the first is Pearson's linear correlation coefficient; it is still taught at school. It was developed by K. Pearson and G. Yule on the basis of the work of Francis Galton. This coefficient shows the linear relationship between quantitative variables. It is always greater than -1 and less than 1. A negative value indicates an inverse relationship. If the coefficient is zero, there is no linear connection between the variables; if it is positive, there is a direct relationship between the quantities under study. Spearman's rank correlation coefficient simplifies the calculations by replacing the values of the variables with their ranks.
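As a minimal sketch of how both coefficients can be computed, here is a Python example using scipy; the FDI and GDP-growth figures are invented purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical illustration data: foreign direct investment (bn USD)
# and GDP growth rate (%); the numbers are invented.
fdi = np.array([1.2, 2.5, 3.1, 4.0, 5.2, 6.8, 7.5])
gdp_growth = np.array([0.8, 1.4, 1.9, 2.1, 3.0, 3.4, 4.1])

r, p_r = stats.pearsonr(fdi, gdp_growth)       # linear (Pearson) correlation
rho, p_rho = stats.spearmanr(fdi, gdp_growth)  # rank (Spearman) correlation

print(f"Pearson r = {r:.3f} (p = {p_r:.4f})")
print(f"Spearman rho = {rho:.3f} (p = {p_rho:.4f})")
```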

Relationship between variables

Correlation helps answer two questions. First, is the connection between the variables positive or negative? Second, how strong is the dependence? Correlation analysis is a powerful tool for obtaining this important information. It is easy to see that family income and expenses rise and fall proportionally; such a connection is considered positive. On the contrary, as the price of a good grows, demand for it falls; such a connection is called negative. The values of the correlation coefficient lie in the range between -1 and 1. Zero means there is no dependence between the quantities studied. The closer the indicator is to the extreme values, the stronger the connection (negative or positive). A coefficient between -0.1 and 0.1 indicates an absence of dependence. It should be understood that such a value indicates only the absence of a linear relationship.

Features of application

The use of both indicators rests on certain assumptions. First, the presence of a strong connection does not prove that one quantity determines the other; there may well exist a third quantity that determines each of them. Second, a high Pearson correlation coefficient does not indicate a causal relationship between the variables studied. Third, it captures an exclusively linear dependence. Correlation can be used to evaluate meaningful quantitative data (for example, atmospheric pressure or air temperature), but not categories such as sex or favorite color.

Multiple correlation coefficient

Pearson and Spearman investigated the relationship between two variables. But what if there are three or even more? A multiple correlation coefficient comes to the rescue. For example, the gross national product is affected not only by foreign direct investment but also by the state's monetary and fiscal policy and by the level of exports. The growth rate and volume of GDP are the result of the interaction of a number of factors. However, it must be understood that the multiple correlation model rests on a number of simplifications and assumptions. First, multicollinearity between the quantities is excluded. Second, the relationship between the dependent variable and the variables influencing it is considered linear.

Areas of using correlation and regression analysis

This method of finding a relationship between quantities is widely used in statistics. It is most often resorted to in three basic cases:

  1. To test causal relationships between the values of two variables. As a result, the researcher hopes to find a linear dependence and derive a formula describing the relationship; the units of measurement may differ.
  2. To check whether there is a relationship between quantities at all. In this case, neither variable is designated as dependent. It may turn out that some other factor determines the values of both.
  3. To derive an equation. In this case, one can simply substitute numbers into it and find the values of the unknown variable.

Man in search of causal relationship

Our consciousness is arranged in such a way that we need to explain the events happening around us. A person is always looking for connections between the picture of the world he lives in and the information he receives. The brain often creates order out of chaos: it can easily see a causal relationship where there is none. Scientists have to learn specifically to overcome this tendency. The ability to evaluate connections between data objectively is essential in an academic career.

The bias of the media

Consider how the presence of a correlation can be misinterpreted. A group of British students noted for bad behavior were asked whether their parents smoked, and the study was then published in a newspaper. The result showed a strong correlation between parents' smoking and their children's misbehavior. The professor who conducted the study even proposed putting a warning about this on cigarette packs. However, there are several problems with this conclusion. First, a correlation does not show which of the quantities is independent; it is therefore quite possible that the parents' destructive habit is caused by their children's disobedience. Second, one cannot say with confidence that both problems did not arise from some third factor — for example, low family income. The emotional aspect of the professor's initial conclusions should also be noted: he was an ardent opponent of smoking, so it is not surprising that he interpreted the results of his research this way.

Conclusions

Misinterpreting correlation as a causal relationship between two variables can lead to embarrassing errors in research. The problem is that this tendency lies at the basis of human consciousness; many marketing tricks are built on it. Understanding the difference between a causal link and a correlation allows one to analyze information rationally, both in everyday life and in a professional career.

The correlation coefficient is a quantity that can vary from +1 to -1. In the case of a complete positive correlation the coefficient is plus 1 (as the value of one variable increases, the value of the other increases too), and with a complete negative correlation minus 1 (indicating an inverse relationship: as the values of one variable increase, the values of the other decrease).

Example 1:

A plot of shyness versus depressiveness. As you can see, the points (subjects) are not scattered chaotically but cluster around a line, and looking at this line we can say that the more pronounced a person's shyness, the greater the depressiveness; that is, these phenomena are interrelated.

Example 2: A plot of shyness versus sociability. We see that as shyness increases, sociability decreases. Their correlation coefficient is -0.43. Thus, a correlation coefficient between 0 and 1 indicates a directly proportional relationship (the greater ... the more ...), and a coefficient between -1 and 0 an inversely proportional one (the more ... the less ...).

If the correlation coefficient is 0, there is no linear relationship between the two variables.

Correlation is a connection in which the influence of individual factors appears only as a tendency (on average) in mass observations of actual data. Examples of correlation dependences are the dependences between the size of a bank's assets and the bank's profit, or between labor productivity and employees' length of service.

Two systems for classifying correlations by their strength are used: general and particular.

The general classification of correlations: 1) strong, or close, with a correlation coefficient r > 0.70; 2) medium at 0.50 < r < 0.69; 3) moderate at 0.30 < r < 0.49; 4) weak at 0.20 < r < 0.29; 5) very weak at r < 0.19. The particular classification is based on the significance level of the coefficient — what matters there is a correlation of a high level of significance, not merely a large coefficient.

The following table gives the names of the correlation coefficients for the various types of scales.

Dichotomous scale (1/0) × dichotomous scale: Pearson's association coefficient, Pearson's four-fold (tetrachoric) coefficient.
Dichotomous scale (1/0) × rank (ordinal) scale: rank-biserial correlation.
Dichotomous scale (1/0) × interval or ratio scale: biserial correlation.
Rank (ordinal) scale × rank (ordinal) scale: Spearman's or Kendall's rank correlation coefficient.
Rank (ordinal) scale × interval or ratio scale: the interval values are converted to ranks and a rank coefficient is used.
Interval or ratio scale × interval or ratio scale: Pearson correlation coefficient (linear correlation coefficient).

For r = 0 there is no linear correlation. In this case the group means of the variables coincide with their overall means, and the regression lines are parallel to the coordinate axes.

The equality r = 0 speaks only of the absence of a linear correlation dependence (uncorrelatedness of the variables), but not of the absence of correlation in general, much less of statistical independence.

Sometimes a conclusion about the absence of correlation is more important than the presence of a strong one. A zero correlation of two variables may indicate that one variable has no influence on the other, provided we trust the measurement results.

In SPSS: 11.3.2 Correlation coefficients

So far we have established only the fact that a statistical dependence exists between two features. Next we will try to find out what conclusions can be drawn about the strength or weakness of this dependence, as well as about its form and direction. The criteria for a quantitative assessment of the relationship between variables are called correlation coefficients, or measures of association. Two variables correlate positively if there is a direct, unidirectional relationship between them: small values of one variable correspond to small values of the other, and large values to large ones. Two variables correlate negatively if there is an inverse, opposite relationship between them: small values of one variable correspond to large values of the other, and vice versa. The values of correlation coefficients always lie in the range from -1 to +1.

For variables belonging to an ordinal scale the Spearman coefficient is used, and for variables belonging to an interval scale the Pearson correlation coefficient (product-moment correlation). Note that every dichotomous variable, that is, a variable belonging to a nominal scale and having two categories, can be treated as ordinal.

To begin with, we check whether there is a correlation between the SEX and PSYCHE variables from the file studium.sav. In doing so, we take into account that the dichotomous variable SEX can be treated as ordinal. Follow these steps:

· In the Analyze menu, select Descriptive Statistics > Crosstabs... (contingency tables).

· Move the SEX variable into the row list and the PSYCHE variable into the column list.

· Click the Statistics... button. In the Crosstabs: Statistics dialog, check the Correlations box. Confirm your choice with the Continue button.

· In the Crosstabs dialog, suppress the output of the tables themselves by checking the Suppress tables box. Click OK.

The Spearman and Pearson correlation coefficients will be calculated and their significance tested.
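Outside SPSS, the same check can be sketched in Python with scipy; the SEX/PSYCHE values below are invented stand-ins for the studium.sav data:

```python
import pandas as pd
from scipy import stats

# Invented stand-in for the studium.sav variables: SEX is dichotomous (0/1),
# PSYCHE is ordinal (1..4); a dichotomous variable can be treated as ordinal.
df = pd.DataFrame({
    "SEX":    [0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    "PSYCHE": [2, 3, 1, 4, 3, 2, 1, 4, 2, 3],
})

rho, p = stats.spearmanr(df["SEX"], df["PSYCHE"])
print(f"Spearman rho = {rho:.3f}, two-tailed p = {p:.4f}")
```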


Task No. 10. Correlation analysis

Concept of correlation

Correlation, or the correlation coefficient, is a statistical indicator of the probabilistic relationship between two variables measured on quantitative scales. In contrast to a functional connection, in which each value of one variable corresponds to a strictly defined value of the other, a probabilistic connection is characterized by the fact that each value of one variable corresponds to a set of values of the other. An example of a probabilistic connection is the relationship between people's height and weight: clearly, people of the same height may have different weights, and vice versa.

The correlation is a quantity between -1 and +1 and is denoted by the letter r. The closer the value is to 1 in absolute terms, the stronger the connection; the closer to 0, the weaker. A correlation of less than 0.2 is considered weak, over 0.5 high. A negative correlation coefficient means an inverse relationship: the higher the value of one variable, the lower the value of the other.

Depending on the coefficient's value, various types of correlation can be distinguished:

Strict positive correlation is defined by the value r = 1. The term "strict" means that the value of one variable is uniquely determined by the value of the other, and "positive" that as the value of one variable increases, the value of the other increases as well.

Strict correlation is a mathematical abstraction and practically never occurs in real studies.

Positive correlation corresponds to values 0 < r < 1.

Lack of correlation is defined by the value r = 0. A zero correlation coefficient suggests that the values of the variables are not related to each other.

The absence of correlation, H0: r_xy = 0, is what the null hypothesis of correlation analysis states.

Negative correlation corresponds to values -1 < r < 0.

Strict negative correlation is defined by the value r = -1. Like strict positive correlation, it is an abstraction and does not occur in practical research.

Table 1. Types of correlation and their definitions (summarized in the definitions above)

The method for calculating the correlation coefficient depends on the type of scale on which the values of the variables are measured.

The Pearson correlation coefficient r is the basic one and can be used for variables measured on interval (and partially ordered) scales whose distribution of values corresponds to the normal distribution (product-moment correlation). The Pearson correlation coefficient gives fairly accurate results even in cases of non-normal distributions.

For distributions that are not normal, it is preferable to use the Spearman and Kendall rank correlation coefficients. They are called rank coefficients because the program pre-ranks the variables being correlated.

SPSS computes the Spearman correlation as follows: first the variables are converted to ranks, and then the Pearson formula is applied to the ranks.
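A minimal sketch of this two-step procedure (rank, then apply the Pearson formula), using made-up data:

```python
import numpy as np
from scipy import stats

x = np.array([3.0, 1.5, 4.2, 2.8, 5.1, 3.9])
y = np.array([30, 18, 45, 24, 50, 38])

# Step 1: convert the values to ranks (ties get average ranks).
rx, ry = stats.rankdata(x), stats.rankdata(y)
# Step 2: apply the Pearson formula to the ranks.
rho_by_hand = stats.pearsonr(rx, ry)[0]

print(rho_by_hand, stats.spearmanr(x, y)[0])  # the two values coincide
```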

The correlation proposed by M. Kendall is based on the idea that the direction of the relationship can be judged by comparing the subjects pairwise. If for a pair of subjects the change in X coincides in direction with the change in Y, this indicates a positive connection; if it does not coincide, a negative one. This coefficient is used mainly by psychologists working with small samples. Since sociologists work with large data arrays, enumerating the concordances and inversions of all pairs of subjects in the sample is laborious, and the most common coefficient is Pearson's.
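The pairwise-comparison idea behind Kendall's coefficient can be sketched directly; this toy version ignores ties:

```python
from itertools import combinations
from scipy import stats

x = [3.0, 1.5, 4.2, 2.8, 5.1, 3.9]
y = [30, 18, 45, 24, 50, 38]

# Compare every pair of subjects: concordant if x and y change in the
# same direction, discordant if in opposite directions (ties ignored here).
concordant = discordant = 0
for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
    s = (x1 - x2) * (y1 - y2)
    if s > 0:
        concordant += 1
    elif s < 0:
        discordant += 1

tau_by_hand = (concordant - discordant) / (concordant + discordant)
print(tau_by_hand, stats.kendalltau(x, y)[0])  # agree when there are no ties
```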

Since the Pearson correlation coefficient r is the basic one and can be used (with a certain error depending on the type of scale and the degree of non-normality of the distribution) for all variables measured on quantitative scales, we will consider examples of its use and compare the results with those obtained with other correlation coefficients.

The formula for calculating the Pearson coefficient r:

r_xy = Σ(x_i − x̄)(y_i − ȳ) / ((n − 1) · σ_x · σ_y)

where: x_i, y_i are the values of the two variables;

x̄, ȳ are the means of the two variables;

σ_x, σ_y are the standard deviations;

n is the number of observations.
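A sketch implementing the formula above verbatim and checking it against numpy's built-in result (the data are arbitrary illustration values):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r exactly as in the formula above: the sum of deviation
    products divided by (n - 1) times the product of the standard deviations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    numerator = np.sum((x - x.mean()) * (y - y.mean()))
    return numerator / ((n - 1) * x.std(ddof=1) * y.std(ddof=1))

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(pearson_r(x, y))          # ~0.999
print(np.corrcoef(x, y)[0, 1])  # numpy's built-in gives the same value
```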

Paired correlations

Suppose we want to find out how the answers about the various types of traditional values relate to one another in students' ideas of an ideal place of work (variables a9.1, a9.3, a9.5, a9.7), and then how the liberal values relate (a9.2, a9.4, a9.6, a9.8). These variables are measured on 5-point ordinal scales.

We use the procedure "Analyze" > "Correlate" > "Bivariate". The Pearson coefficient is set by default in the dialog box; we use the Pearson coefficient.

Test variables are transferred to the selection window: A9.1, A9.3, A9.5, A9.7

Pressing OK produces the calculation:

Correlations

a9.1. How important is it to have enough time for family and personal life?
a9.3. How important is it not to be afraid of losing your job?
a9.5. How important is it to have a boss who consults you before making a decision?
a9.7. How important is it to work in a close-knit team and feel part of it?

For each pair of variables the output reports the Pearson correlation and its two-tailed significance (the numeric values of the matrix are not reproduced in the source).

** Correlation is significant at the 0.01 level (2-tailed).

Table of the numeric values of the constructed correlation matrix

Partial correlations:

To begin with, we construct a pairwise correlation between the two variables indicated:

Correlations

c8. Feel closeness with those who live near you, your neighbors
c12. Feel closeness with your family

For this pair the output reports the Pearson correlation and its two-tailed significance (the numeric values are not reproduced in the source).

**. Correlation is significant at the 0.01 level (2-tailed).

Then we use the procedure for building a partial correlation: "Analyze" > "Correlate" > "Partial".

Suppose that the value "It is important to determine and change the order of your work yourself" is, with respect to the indicated variables, the decisive factor under whose influence the previously found connection will disappear or turn out to be spurious.

Correlations

Excluded variables:
c8. Feel closeness with those who live near you, your neighbors
c12. Feel closeness with your family
c16. Feel closeness with people who have the same wealth as you

For the pair c8–c12 the output reports the partial correlation and its two-tailed significance (the numeric values are not reproduced in the source).

As can be seen from the table, under the influence of the control variable the connection decreased slightly: from 0.120 to 0.102. However, this slight decrease does not allow us to assert that the previously found connection is a reflection of a spurious correlation, because it remains high enough to reject the null hypothesis with practically zero error.
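A first-order partial correlation can be computed from the three pairwise coefficients by the standard formula r_xy.z = (r_xy − r_xz·r_yz) / √((1 − r_xz²)(1 − r_yz²)); in the sketch below only the 0.120 figure comes from the example above, the other two coefficients are invented:

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Only the 0.120 pair correlation comes from the example above; the two
# correlations with the control variable are invented for illustration.
print(partial_r(r_xy=0.120, r_xz=0.20, r_yz=0.15))  # ~0.093
```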

Correlation coefficient

The most accurate way to determine the closeness and nature of a correlation is to find the correlation coefficient. The correlation coefficient is the number defined by the formula:

r_xy = Σ(x_i − x̄)(y_i − ȳ) / √( Σ(x_i − x̄)² · Σ(y_i − ȳ)² )          (32)

where r_xy is the correlation coefficient;

x_i are the values of the first feature;

y_i are the values of the second feature;

x̄ is the arithmetic mean of the first feature;

ȳ is the arithmetic mean of the second feature.

To use formula (32), we construct a table that provides the necessary sequence of steps in preparing the numbers for finding the numerator and denominator of the correlation coefficient.

As formula (32) shows, the sequence of actions is this: we find the arithmetic means of both features, x̄ and ȳ; we find the differences between each value of a feature and its mean, (x_i − x̄) and (y_i − ȳ); we then find their products (x_i − x̄)(y_i − ȳ), whose sum gives the numerator of the correlation coefficient. For the denominator, we square the differences (x_i − x̄) and (y_i − ȳ), find the sums of the squares, and take the square root of their product.

Thus, for Example 31, the correlation coefficient according to formula (32) can be computed as follows (Table 50).

The resulting value of the correlation coefficient makes it possible to establish the presence, closeness, and nature of the relationship.

1. If the correlation coefficient is zero, there is no connection between the features.

2. If the correlation coefficient is equal to one, the connection between the features is so strong that it becomes functional.

3. The absolute value of the correlation coefficient does not go beyond the interval from zero to one:

0 ≤ |r_xy| ≤ 1

This makes it possible to judge the closeness of the connection: the closer the coefficient is to zero, the weaker the connection, and the closer to one, the closer the connection.

4. A "plus" sign of the correlation coefficient means a direct correlation, a "minus" sign an inverse one.

Table 50

 x_i     y_i     (x_i − x̄)  (y_i − ȳ)  (x_i − x̄)(y_i − ȳ)  (x_i − x̄)²  (y_i − ȳ)²
14.00   12.10     −1.70      −2.30           +3.91             2.89        5.29
14.20   13.80     −1.50      −0.60           +0.90             2.25        0.36
14.90   14.20     −0.80      −0.20           +0.16             0.64        0.04
15.40   13.00     −0.30      −1.40           +0.42             0.09        1.96
16.00   14.60     +0.30      +0.20           +0.06             0.09        0.04
17.20   15.90     +1.50      +1.50           +2.25             2.25        2.25
18.10   17.40     +2.40      +2.00           +4.80             5.76        4.00
Σ 109.80 101.00                              12.50            13.97       13.94


Thus, the correlation coefficient calculated in Example 31, r_xy = +0.9, allows the following conclusions: there is a correlation between the muscular strength of the right and left hands in the schoolchildren studied (the coefficient r_xy = +0.9 differs from zero); the connection is very close (the coefficient is close to one); the correlation is direct (the coefficient is positive), i.e., as the muscular strength of one hand increases, the strength of the other hand increases as well.
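Recomputing r directly from the raw values of Table 50 is a useful check; note that the exact result, ≈ 0.92, differs slightly from the +0.9 obtained above because the table works with deviations rounded to two decimals:

```python
import numpy as np

# Muscular strength of the right (x) and left (y) hands from Table 50.
x = np.array([14.0, 14.2, 14.9, 15.4, 16.0, 17.2, 18.1])
y = np.array([12.1, 13.8, 14.2, 13.0, 14.6, 15.9, 17.4])

dx, dy = x - x.mean(), y - y.mean()
r = np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))
print(round(r, 3))  # ~0.916, in line with the r = +0.9 obtained above
```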

When calculating and using the correlation coefficient, it should be borne in mind that the conclusions give correct results when the features are normally distributed and when the relationship between a large number of values of both features is considered.

In Example 31 only 7 values of each feature were analyzed, which, of course, is not enough for such studies. We remind you again that the examples in this book in general, and in this chapter in particular, illustrate the methods rather than present any scientific experiments in detail. For this reason a small number of values is considered and the measurements are rounded — all so that bulky calculations do not obscure the idea of the method.

Special attention should be paid to the substance of the relationship under consideration. The correlation coefficient cannot lead to correct research results if the analysis of the relationship between features is carried out formally. Let us return to Example 31. Both features considered were the values of the muscular strength of the right and left hands. But imagine that by the feature x_i in Example 31 (14.0; 14.2; 14.9 ... 18.1) we mean the length of randomly caught fish in centimeters, and by the feature y_i (12.1; 13.8; 14.2 ... 17.4) the weight of instruments in the laboratory in kilograms. Formally applying the computational apparatus to find the correlation coefficient and obtaining r_xy = +0.9 in this case too, we would have to conclude that there is a close relationship between the length of the fish and the weight of the instruments. The meaninglessness of such a conclusion is obvious.

To avoid a formal approach to using the correlation coefficient, one should first use some other method — mathematical, logical, experimental, theoretical — to identify the possibility of a correlation between the features, that is, to detect their organic unity. Only after that can one proceed to correlation analysis and establish the magnitude and nature of the relationship.

Mathematical statistics also has the concept of multiple correlation — a relationship between three or more features. In these cases a multiple correlation coefficient is used, composed of the paired correlation coefficients described above.

For example, the multiple correlation coefficient of three features x_i, y_i, z_i is:

R_x(yz) = √( (r_xy² + r_xz² − 2·r_xy·r_xz·r_yz) / (1 − r_yz²) )

where R_x(yz) is the multiple correlation coefficient, expressing how the feature x depends on the features y and z;

r_xy is the correlation coefficient between the features x and y;

r_xz is the correlation coefficient between the features x and z;

r_yz is the correlation coefficient between the features y and z.
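A sketch of this formula in Python; the three pairwise coefficients are invented for illustration:

```python
import math

def multiple_r(r_xy, r_xz, r_yz):
    """Multiple correlation of x with (y, z), built from pairwise coefficients."""
    return math.sqrt((r_xy**2 + r_xz**2 - 2 * r_xy * r_xz * r_yz)
                     / (1 - r_yz**2))

# The three pairwise coefficients are invented for illustration.
print(multiple_r(r_xy=0.6, r_xz=0.5, r_yz=0.3))  # ~0.69
```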

Correlation analysis

Correlation is a statistical relationship between two or more random variables (or quantities that can be regarded as such with some acceptable accuracy). Changes in one or more of these quantities are accompanied by a systematic change in the other quantity or quantities. The mathematical measure of the correlation of two random variables is the correlation coefficient.

Correlation can be positive or negative (there is also the possibility of no statistical relationship, for example, for independent random variables). Negative correlation is correlation in which an increase in one variable is associated with a decrease in the other, and the correlation coefficient is negative. Positive correlation is correlation in which an increase in one variable is associated with an increase in the other, and the correlation coefficient is positive.

Autocorrelation is the statistical relationship between random variables from the same series taken with a shift — for example, for a random process, with a shift in time.

The method of processing statistical data that consists in studying the coefficients of correlation between variables is called correlation analysis.

Correlation coefficient

The correlation coefficient, or pair correlation coefficient, in probability theory and statistics is an indicator of the nature of the joint change of two random variables. The correlation coefficient is denoted by the Latin letter r and can take values between -1 and +1. If the absolute value is closer to 1, this means the presence of a strong connection (with a correlation coefficient equal to one in absolute value, one speaks of a functional connection); if closer to 0, a weak one.

Pearson correlation coefficient

For metric quantities the Pearson correlation coefficient is used, the exact formula of which was introduced by Francis Galton:

Let X, Y be two random variables defined on a common probability space. Their correlation coefficient is given by the formula:

ρ(X, Y) = cov(X, Y) / √( D[X] · D[Y] )

where cov denotes the covariance and D the variance; equivalently,

ρ(X, Y) = E[(X − E X)(Y − E Y)] / √( E[(X − E X)²] · E[(Y − E Y)²] )

where the symbol E denotes the mathematical expectation.

One can use a rectangular coordinate system whose axes correspond to the two variables. Each pair of values is marked with a particular symbol. Such a chart is called a scatter diagram.

The method of calculating the correlation coefficient depends on the type of scale on which the variables are measured. For variables on interval or ratio scales the Pearson correlation coefficient (product-moment correlation) should be used. If at least one of the two variables has an ordinal scale, or is not normally distributed, Spearman's rank correlation or Kendall's τ (tau) must be used. When one of the two variables is dichotomous, point-biserial correlation is used, and if both variables are dichotomous, four-fold (tetrachoric) correlation. Calculating the correlation coefficient between two non-dichotomous variables makes sense only when the relationship between them is linear (unidirectional).

Kendall correlation coefficient

Used to measure the mutual disorder (degree of disagreement) of two rankings.

Spearman correlation coefficient

Properties of the correlation coefficient

  • The Cauchy–Bunyakovsky inequality: if the covariance is taken as the inner product of two random variables, then the norm of a random variable equals √D[X], and a consequence of the Cauchy–Bunyakovsky inequality is |ρ(X, Y)| ≤ 1.
  • ρ(X, Y) = ±1 if and only if X and Y are related linearly, Y = kX + b; moreover, in this case the signs of ρ and k coincide.

Correlation analysis

Correlation analysis is a method of processing statistical data that consists in studying the coefficients of correlation between variables. The correlation coefficients are compared between one pair or many pairs of features in order to establish statistical relationships between them.

The purpose of correlation analysis is to provide some information about one variable by means of another variable. In cases where it is possible to achieve this purpose, the variables are said to correlate. In the most general form, accepting the hypothesis of the presence of correlation means that a change in the value of variable A will occur simultaneously with a proportional change in the value of B: if both variables grow, the correlation is positive; if one variable grows while the second decreases, the correlation is negative.

Correlation reflects only a linear dependence between quantities but does not reflect their functional connectedness. For example, if we calculate the correlation coefficient between the quantities A = sin(x) and B = cos(x), it will be close to zero, i.e., there is no linear dependence between the quantities. Meanwhile, A and B are obviously related functionally by the law sin²(x) + cos²(x) = 1.
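This is easy to verify numerically; a minimal sketch:

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 1000)
a, b = np.sin(x), np.cos(x)

print(np.corrcoef(a, b)[0, 1])           # ~0: no linear correlation
print(np.max(np.abs(a**2 + b**2 - 1)))   # ~0: yet an exact functional tie
```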

Limitations of correlation analysis



[Figure: scatter plots of (x, y) pairs with the corresponding correlation coefficients. The correlation coefficient reflects a linear dependence (top row), does not describe a curved dependence (middle row), and is entirely unsuitable for describing complex nonlinear dependencies (bottom row).]
  1. Application is possible only when there are enough cases to study: for particular types of correlation coefficient, from 25 to 100 pairs of observations.
  2. The second limitation follows from the hypothesis of correlation analysis, which assumes a linear dependence between the variables. In many cases, even when it is reliably known that a dependence exists, correlation analysis may give no result simply because the dependence is nonlinear (expressed, for example, as a parabola).
  3. By itself, the fact of a correlation dependence gives no grounds to assert which of the variables precedes or causes the changes, or that the variables are causally connected with each other at all; the effect may be due, for example, to the action of a third factor.

Application area

This method of processing statistical data is very popular in economics and the social sciences (in particular, in psychology and sociology), although the scope of correlation coefficients is broader: quality control of industrial products, metallurgy, agrochemistry, hydrobiology, biometrics, and others.

The popularity of the method is due to two things: the correlation coefficients are relatively simple to calculate, and their use requires no special mathematical training. Combined with its ease of interpretation, the simplicity of the coefficient has led to its widespread use in statistical data analysis.

Spurious correlation

The tempting simplicity of correlation research often pushes the researcher toward false intuitive conclusions about the presence of a causal relationship between pairs of features, whereas correlation coefficients establish only statistical relationships.

In the modern quantitative methodology of the social sciences there has, in fact, been a refusal to attempt to establish causal relationships between observed variables by empirical methods. Therefore, when researchers in the social sciences speak of establishing relationships between the variables studied, they mean either a general theoretical assumption or a statistical dependence.

See also

  • Autocorrelation function
  • Cross-correlation function
  • Covariance
  • Coefficient of determination
  • Regression analysis


The sample linear correlation coefficient can be written as

r_xy = ( mean(x·y) − x̄ · ȳ ) / ( σ(x) · σ(y) )

where mean(x·y), x̄, ȳ are the sample means and σ(x), σ(y) the standard deviations.
Moreover, the Pearson linear pair correlation coefficient can be determined through the regression coefficient b:

r_xy = b · σ(x) / σ(y)

where σ(x) = s(x), σ(y) = s(y) are the standard deviations and b is the coefficient of x in the regression equation y = a + bx.

Another equivalent form:

r_xy = K_xy / ( σ(x) · σ(y) )

where K_xy is the correlation moment (covariance coefficient).

To find the linear Pearson correlation coefficient, one must find the sample means x̄ and ȳ and the standard deviations σ_x = s(x), σ_y = s(y):

The linear correlation coefficient indicates the presence of a connection and takes values from -1 to +1 (see the Chaddock scale). For example, if in analyzing the closeness of the linear correlation between two variables a pair linear correlation coefficient of -1 is obtained, this means there is an exact inverse linear dependence between the variables.


Geometric meaning of the correlation coefficient: r_xy shows how much the slopes of the two regression lines, y(x) and x(y), differ — that is, how much the results of minimizing deviations along x and along y differ. The larger the angle between the lines, the smaller r_xy.
The sign of the correlation coefficient coincides with the sign of the regression coefficient and determines the slope of the regression line, i.e., the general direction of the dependence (increasing or decreasing). The absolute value of the correlation coefficient is determined by how close the points lie to the regression line.

Properties of the correlation coefficient

  1. |r_xy| ≤ 1;
  2. if x and y are independent, then r_xy = 0; the converse is not always true;
  3. if |r_xy| = 1, then y = ax + b, i.e., |r_xy(x, ax + b)| = 1, where a and b are constants, a ≠ 0;
  4. |r_xy(x, y)| = |r_xy(a₁x + b₁, a₂y + b₂)|, where a₁, a₂, b₁, b₂ are constants, a₁, a₂ ≠ 0.

Therefore, to check for a connection, a hypothesis test using the Pearson correlation coefficient is chosen, with subsequent verification of reliability using the t-test (see the example below).

Typical tasks (see also nonlinear regression)

The dependence of labor productivity y on the level of mechanization of work x (%) is investigated using data from 14 industrial enterprises. The statistical data are shown in the table.
Required:
1) Find estimates of the parameters of the linear regression of y on x. Build a scatter diagram and draw the regression line on it.
2) At significance level α = 0.05, test the hypothesis that the linear regression agrees with the observations.
3) With reliability γ = 0.95, find confidence intervals for the parameters of the linear regression.


Example. Based on the data given in Appendix 1 and corresponding to your option (Table 2), you are required to:

  1. Calculate the linear pair correlation coefficient and construct the linear pair regression equation of one feature on the other. One of the features corresponding to your option will play the role of the factor (x), the other the role of the outcome (y). Establish the causal relationships between the features yourself on the basis of economic analysis. Interpret the parameters of the equation.
  2. Determine the theoretical coefficient of determination and the residual variance (unexplained by the regression equation). Draw a conclusion.
  3. Assess the statistical significance of the regression equation as a whole at the five percent level using Fisher's F-test. Draw a conclusion.
  4. Forecast the expected value of the outcome feature y when the forecast value of the factor x is 105% of its mean level. Assess the accuracy of the forecast by calculating the forecast error and its confidence interval with probability 0.95.
Solution. The equation has the form y = ax + b.
Sample means:


Sample variances:


Standard deviations:

The relationship between feature y and factor x is strong and direct (determined by the Chaddock scale).
Regression equation

Regression coefficient: k = a = 4.01
Coefficient of determination
R² = 0.99² = 0.97, i.e., 97% of the variation in y is explained by the variation in x. In other words, the regression equation fits the data well. Residual variation: 3%.
x    y     x²    y²      x·y    y(x)     (y_i − ȳ)²  (y − y(x))²  (x − x̄)²
1    107   1     11449   107    103.19   333.06      14.5         30.25
2    109   4     11881   218    107.2    264.06      3.23         20.25
3    110   9     12100   330    111.21   232.56      1.47         12.25
4    113   16    12769   452    115.22   150.06      4.95         6.25
5    120   25    14400   600    119.23   27.56       0.59         2.25
6    122   36    14884   732    123.24   10.56       1.55         0.25
7    123   49    15129   861    127.26   5.06        18.11        0.25
8    128   64    16384   1024   131.27   7.56        10.67        2.25
9    136   81    18496   1224   135.28   115.56      0.52         6.25
10   140   100   19600   1400   139.29   217.56      0.51         12.25
11   145   121   21025   1595   143.3    390.06      2.9          20.25
12   150   144   22500   1800   147.31   612.56      7.25         30.25
78   1503  650   190617  10343  1503     2366.25     66.23        143

Note: the values y(x) are found from the regression equation obtained:
y (1) \u003d 4.01 * 1 + 99.18 \u003d 103.19
y (2) \u003d 4.01 * 2 + 99.18 \u003d 107.2
... ... ...

The significance of the correlation coefficient

We put forward the hypotheses:
H0: r_xy = 0 — there is no linear relationship between the variables;
H1: r_xy ≠ 0 — there is a linear relationship between the variables.
To test the null hypothesis that the general correlation coefficient of a normal two-dimensional random variable equals zero, against the competing hypothesis H1: r_xy ≠ 0, we calculate the observed value of the statistic:

t_obs = r · √(n − m − 1) / √(1 − r²)

From the Student t-table we find t_table(n − m − 1; α/2) = t(10; 0.025) = 2.228.
Since t_obs > t_table, we reject the hypothesis that the correlation coefficient equals 0. In other words, the correlation coefficient is statistically significant.
Interval estimate for the correlation coefficient (confidence interval):

r − Δr ≤ r ≤ r + Δr
Δr = t_table · m_r = 2.228 · 0.0529 = 0.118
0.986 − 0.118 ≤ r ≤ 0.986 + 0.118
Confidence interval for the correlation coefficient: 0.868 ≤ r ≤ 1
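A sketch of this significance test with the numbers from the example (r = 0.986, n = 12, m = 1):

```python
import math
from scipy import stats

r, n, m = 0.986, 12, 1  # values from the example above
t_obs = r * math.sqrt(n - m - 1) / math.sqrt(1 - r**2)
t_crit = stats.t.ppf(1 - 0.05 / 2, df=n - m - 1)  # two-tailed, alpha = 0.05

print(f"t_obs = {t_obs:.2f}, t_table = {t_crit:.3f}")  # 18.70 > 2.228
print("reject H0" if abs(t_obs) > t_crit else "cannot reject H0")
```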

Analysis of the accuracy of the estimates of the regression coefficients

S_a = 0.2152

Confidence intervals for the dependent variable

Let us calculate the boundaries of the interval in which 95% of the possible values of y will be concentrated for an unlimited number of observations and x = 7:
(122.4; 132.11)
Testing hypotheses about the coefficients of the linear regression equation

1) t-statistics

The statistical significance of the regression coefficient is confirmed.

Confidence intervals for the coefficients of the regression equation
Let us determine the confidence intervals of the regression coefficients, which with 95% reliability are as follows:
(a − t_a·S_a; a + t_a·S_a) = (3.6205; 4.4005)
(b − t_b·S_b; b + t_b·S_b) = (96.3117; 102.0519)

The purpose of correlation analysis is to identify and estimate the strength of the connection between random variables (features) that characterize some real process.
The tasks of correlation analysis:
a) measuring the degree of connectedness (closeness, strength, strictness, intensity) of two or more phenomena;
b) selecting the factors that have the most significant effect on the outcome feature, based on measuring the degree of connectedness between phenomena; factors significant in this respect are then used in regression analysis;
c) detecting unknown causal connections.

The forms in which relationships manifest themselves are very diverse. The most common types are functional (complete) and correlational (incomplete) connections.
A correlational connection appears on average, in mass observations, when a given value of the independent variable corresponds to a range of probabilistic values of the dependent variable; a connection is called functional if each value of the factor feature corresponds to a completely determined, non-random value of the outcome feature.
A clear image of a correlation table is the correlation field. It is a graph on which the x values are plotted on the abscissa, the y values on the ordinate, and the combinations of x and y are shown by points. The presence of a connection can be judged by the arrangement of the points.
Dispersion indicators can be used to characterize the dependence of the variation of the outcome feature on the variation of the factor feature.
A more refined indicator of the degree of closeness of a correlational connection is the linear correlation coefficient. When calculating this indicator, not only the deviations of individual feature values from the mean are taken into account, but also the magnitudes of these deviations.

The key issues of this topic are: the regression equation between the outcome feature and the explanatory variable, the least squares method for estimating the parameters of the regression model, analysis of the quality of the resulting regression equation, and the construction of confidence intervals for predicting the values of the outcome feature from the regression equation.

Example 2.


System of normal equations:
a·n + b·Σx = Σy
a·Σx + b·Σx² = Σ(y·x)
For our data the system of equations has the form:
30a + 5763b = 21460
5763a + 1200261b = 3800360
From the first equation we express a and substitute it into the second equation;
we obtain b = -3.46, a = 1379.33.
Regression equation:
y = -3.46x + 1379.33

2. Calculation of the parameters of the regression equation.
Sample means:


Sample variances:


Standard deviations:


1.1. Correlation coefficient
Covariance:

Let us calculate the indicator of the closeness of the connection. This indicator is the sample linear correlation coefficient, which is calculated by the formula:

The linear correlation coefficient takes values from -1 to +1.
Connections between features may be weak or strong (close). Their criteria are evaluated on the Chaddock scale:
0.1 < r_xy < 0.3: weak;
0.3 < r_xy < 0.5: moderate;
0.5 < r_xy < 0.7: noticeable;
0.7 < r_xy < 0.9: high;
0.9 < r_xy < 1: very high.
In our example, the connection between feature Y and factor X is high and inverse.
In addition, the linear pair correlation coefficient can be determined through the regression coefficient b:

1.2. Regression equation (estimation of the regression equation).

The linear regression equation has the form y = -3.46x + 1379.33.

The coefficient b = -3.46 shows the average change in the outcome indicator (in units of y) when the factor x increases or decreases by one unit of its measurement. In this example, when x increases by 1 unit, y decreases on average by 3.46.
The coefficient a = 1379.33 formally shows the predicted level of y when x = 0, but only if x = 0 is close to the sample values.
If x = 0 is far from the sample values of x, a literal interpretation may lead to incorrect results, and even if the regression line describes the observed sample values fairly accurately, there is no guarantee that the same will hold when extrapolating to the left or to the right.
By substituting the appropriate values of x into the regression equation, one can determine the aligned (predicted) values of the outcome indicator y(x) for each observation.
The connection between y and x determines the sign of the regression coefficient b (if b > 0, a direct connection; otherwise, inverse). In our example the connection is inverse.
1.3. The coefficient of elasticity.
It is undesirable to use regression coefficients (b in the example) for a direct assessment of the influence of factors on the outcome feature when the units of measurement of the outcome indicator y and the factor x differ.
For these purposes, elasticity coefficients and beta coefficients are calculated.
The average elasticity coefficient E shows by how many percent, on average, the result y changes from its mean value when the factor x changes by 1% from its mean value.
The elasticity coefficient is found by the formula:

E = b · x̄ / ȳ

The elasticity coefficient is less than 1 in absolute value. Consequently, when x changes by 1%, y changes by less than 1%. In other words, the influence of x on y is not substantial.
The beta coefficient shows by what part of its standard deviation the value of the outcome feature changes, on average, when the factor feature changes by one of its standard deviations, the remaining independent variables being held constant:

β = b · S_x / S_y

I.e., an increase of x by one standard deviation S_x decreases the average value of Y by 0.74 of its standard deviation S_y.
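A sketch of both coefficients for this example; x̄ = 5763/30 and ȳ = 21460/30 follow from the normal equations above, while S_x and S_y are illustrative values chosen to be consistent with the β ≈ -0.74 reported in the text:

```python
# Elasticity and beta coefficients for the pair regression y = a + b*x.
b = -3.46                        # regression coefficient from the example
x_mean = 5763 / 30               # = 192.1, from the normal equations above
y_mean = 21460 / 30              # = 715.33
s_x, s_y = 10.0, 46.8            # illustrative values (not from the source)

elasticity = b * x_mean / y_mean  # % change in y per 1% change in x
beta = b * s_x / s_y              # change in y, in units of S_y, per S_x of x
print(f"E = {elasticity:.2f}, beta = {beta:.2f}")  # E = -0.93, beta = -0.74
```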
1.4. Approximation error.
Let us assess the quality of the regression equation using the average approximation error — the average deviation of the calculated values from the actual ones:

A = (1/n) · Σ| (y − y(x)) / y | · 100%

Since the error is less than 15%, this equation can be used as the regression.
1.5. Analysis of variance.
The task of the analysis of variance is to analyze the variance of the dependent variable:
Σ(y_i − ȳ)² = Σ(y(x) − ȳ)² + Σ(y − y(x))²
where
Σ(y_i − ȳ)² is the total sum of squared deviations;
Σ(y(x) − ȳ)² is the sum of squared deviations due to regression ("explained" or "factor");
Σ(y − y(x))² is the residual sum of squared deviations.
The theoretical correlation ratio for a linear connection is equal to the correlation coefficient r_xy.
For any form of dependence, the closeness of the connection is determined by the multiple correlation coefficient:

R = √( 1 − Σ(y − y(x))² / Σ(y − ȳ)² )

This coefficient is universal: it reflects both the closeness of the connection and the accuracy of the model, and it can be used for any form of connection between the variables. When a one-factor correlation model is constructed, the multiple correlation coefficient equals the pair correlation coefficient r_xy.
1.6. The coefficient of determination.
The square of the (multiple) correlation coefficient is called the coefficient of determination, which shows the proportion of the variation of the outcome feature explained by the variation of the factor feature.
Most often, when interpreting the coefficient of determination, it is expressed as a percentage.
R² = (-0.74)² = 0.5413
i.e., in 54.13% of cases changes in x lead to a change in y. In other words, the accuracy of the fit of the regression equation is average. The remaining 45.87% of the change in y is explained by factors not taken into account in the model.
