
    Standard deviation formula. Statistical parameters. Mean linear and standard deviation

    An approximate way to assess the variability of a variation series is to determine its limits and amplitude (range), but this ignores the values of the variants inside the series. The main generally accepted measure of the variability of a quantitative characteristic within a variation series is the standard deviation (σ, sigma). The larger the standard deviation, the greater the variability of the series.

    The method for calculating the standard deviation includes the following steps:

    1. Find the arithmetic mean (M).

    2. Determine the deviations of the individual variants from the arithmetic mean (d = V − M). In medical statistics, deviations from the mean are denoted d (deviate). The sum of all deviations is zero.

    3. Square each deviation: d².

    4. Multiply the squared deviations by the corresponding frequencies: d²·p.

    5. Find the sum of the products: Σ(d²·p).

    6. Calculate the standard deviation using the formula:

    $\sigma = \sqrt{\frac{\sum d^2 p}{n}}$ when n is greater than 30, or $\sigma = \sqrt{\frac{\sum d^2 p}{n - 1}}$ when n is less than or equal to 30, where n is the number of all variants.
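
    The six steps above can be sketched in Python. This is a minimal illustration, not a reference implementation: the function name and the pulse-rate data are hypothetical, and the divisor switches between n and n − 1 at n = 30 as the text describes.

```python
import math

def standard_deviation(variants, frequencies):
    """Standard deviation of a variation series, following the six steps above."""
    n = sum(frequencies)                                    # number of all observations
    mean = sum(v * p for v, p in zip(variants, frequencies)) / n       # step 1: M
    deviations = [v - mean for v in variants]               # step 2: d = V - M
    weighted_squares = [d ** 2 * p for d, p in zip(deviations, frequencies)]  # steps 3-4
    total = sum(weighted_squares)                           # step 5: sum of d^2 * p
    divisor = n if n > 30 else n - 1                        # step 6: n or n - 1
    return math.sqrt(total / divisor)

# Hypothetical series: pulse rate (beats per minute) and how many people showed it.
variants = [60, 65, 70, 75, 80]
frequencies = [2, 5, 10, 5, 2]
print(round(standard_deviation(variants, frequencies), 2))
```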

    Standard deviation value:

    1. The standard deviation characterizes the spread of the variants around the average value (i.e., the variability of the variation series). The larger the sigma, the greater the diversity of the series.

    2. The standard deviation is used for a comparative assessment of the degree of correspondence of the arithmetic mean to the variation series for which it was calculated.

    Variations of mass phenomena obey the law of normal distribution. The curve representing this distribution is a smooth, symmetrical, bell-shaped curve (Gaussian curve). According to probability theory, in phenomena that obey the law of normal distribution there is a strict mathematical relationship between the arithmetic mean and the standard deviation. The theoretical distribution of the variants in a homogeneous variation series obeys the three-sigma rule.

    If, in a system of rectangular coordinates, the values of a quantitative characteristic (variants) are plotted on the abscissa axis and the frequency of occurrence of each variant in the variation series is plotted on the ordinate axis, then variants with larger and smaller values are located symmetrically on either side of the arithmetic mean.



    It has been established that with a normal distribution of the trait:

    68.3% of the variant values are within M ± 1σ

    95.5% of the variant values are within M ± 2σ

    99.7% of the variant values are within M ± 3σ

    3. The standard deviation makes it possible to establish normal values for clinical and biological parameters. In medicine, the interval M ± 1σ is usually taken as the normal range for the phenomenon being studied. A deviation of the estimated value from the arithmetic mean by more than 1σ indicates that the studied parameter deviates from the norm.

    4. In medicine, the three-sigma rule is used in pediatrics for individual assessment of the level of physical development of children (the sigma deviation method) and for developing standards for children's clothing.

    5. The standard deviation is necessary to characterize the degree of diversity of the characteristic being studied and to calculate the error of the arithmetic mean.
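
    The shares quoted above for the intervals M ± 1σ, M ± 2σ and M ± 3σ can be checked numerically. The sketch below uses the standard identity P(|X − M| < kσ) = erf(k/√2) for a normal variable; the function name is hypothetical. The results agree with the figures in the text to within rounding.

```python
import math

def share_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"M ± {k}σ: {share_within(k) * 100:.2f}%")
```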

    The standard deviation is usually used to compare the variability of series of the same type. If two series with different characteristics are compared (height and weight, average length of hospital stay and hospital mortality, etc.), a direct comparison of the sigmas is impossible, because the standard deviation is a named quantity expressed in absolute numbers. In such cases the coefficient of variation (Cv) is used, which is a relative value: the standard deviation as a percentage of the arithmetic mean.

    The coefficient of variation is calculated using the formula:

    $C_v = \frac{\sigma}{M} \cdot 100\%$

    The higher the coefficient of variation, the greater the variability of the series. A coefficient of variation above 30% is taken to indicate qualitative heterogeneity of the population.
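
    As an illustration of why a relative measure is needed, the sketch below compares two series measured in different units. The height and weight values are hypothetical, and the population form of the standard deviation (`statistics.pstdev`) is an assumption made for simplicity.

```python
import statistics

def coefficient_of_variation(values):
    """Cv: standard deviation as a percentage of the arithmetic mean."""
    return statistics.pstdev(values) / statistics.mean(values) * 100

# Hypothetical measurements for the same five people.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [50, 60, 70, 80, 90]

# The sigmas (cm vs kg) are not comparable, but the percentages are.
print(round(coefficient_of_variation(heights_cm), 1))
print(round(coefficient_of_variation(weights_kg), 1))
```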

    The standard deviation is also used in statistical hypothesis testing and when measuring the linear relationship between random variables.

    Root-mean-square deviation:

    $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

    Standard deviation (an estimate of the root-mean-square deviation of the random variable x relative to its mathematical expectation, based on an unbiased estimate of its variance):

    $s = \sqrt{\frac{n}{n - 1} \sigma^2} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$

    where $\sigma^2$ is the variance; $x_i$ is the i-th element of the sample; $n$ is the sample size; $\bar{x}$ is the arithmetic mean of the sample:

    $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

    It should be noted that both estimates are biased. In the general case, it is impossible to construct an unbiased estimate. However, the estimate based on the unbiased variance estimate is consistent.
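
    The two estimates differ only in the divisor under the root, n versus n − 1. A small sketch with Python's standard `statistics` module shows the difference; the four-value sample is an illustrative assumption.

```python
import math
import statistics

sample = [3, 4, 4, 5]
n = len(sample)

sigma = statistics.pstdev(sample)   # divides the sum of squares by n
s = statistics.stdev(sample)        # divides the sum of squares by n - 1

# The two are linked by the factor sqrt(n / (n - 1)).
print(round(sigma, 4), round(s, 4), round(sigma * math.sqrt(n / (n - 1)), 4))
```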

    Three sigma rule

    The three-sigma rule ($3\sigma$): almost all values of a normally distributed random variable lie in the interval $(\bar{x} - 3\sigma;\ \bar{x} + 3\sigma)$. More strictly, with no less than 99.7% confidence the value of a normally distributed random variable lies in this interval (provided that $\bar{x}$ is the true value and not one obtained by processing a sample).

    If the true value is unknown, then s should be used instead of σ. Thus, the three-sigma rule becomes the rule of three s.

    Interpretation of the standard deviation value

    A large standard deviation indicates a wide spread of the values in the set around its mean; a small standard deviation indicates that the values in the set are grouped closely around the mean.

    For example, we have three number sets: (0, 0, 14, 14), (0, 6, 8, 14) and (6, 6, 8, 8). All three sets have mean values ​​equal to 7, and standard deviations, respectively, equal to 7, 5 and 1. The last set has a small standard deviation, since the values ​​in the set are grouped around the mean value; the first set has the largest standard deviation value - the values ​​within the set diverge greatly from the average value.
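
    The three sets above can be checked with a few lines of Python. The sketch assumes the population form of the standard deviation (divisor n), which is the form that reproduces the values 7, 5 and 1.

```python
import statistics

sets = [(0, 0, 14, 14), (0, 6, 8, 14), (6, 6, 8, 8)]

means = [statistics.mean(values) for values in sets]
deviations = [statistics.pstdev(values) for values in sets]  # divisor n

for values, m, sd in zip(sets, means, deviations):
    print(values, "mean:", m, "standard deviation:", sd)
```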

    In a general sense, standard deviation can be considered a measure of uncertainty. For example, in physics the standard deviation is used to determine the error of a series of successive measurements of some quantity. This value is very important for judging the plausibility of the phenomenon under study in comparison with the value predicted by theory: if the mean of the measurements differs greatly from the value predicted by theory (by many standard deviations), then the obtained values or the method of obtaining them should be rechecked.

    Practical use

    In practice, standard deviation allows you to determine how much the values ​​in a set may differ from the average value.

    Climate

    Suppose there are two cities with the same average maximum daily temperature, one on the coast and the other inland. Coastal cities are known to have a narrower range of maximum daily temperatures than inland cities. Therefore, the standard deviation of the maximum daily temperatures for the coastal city will be smaller than for the inland city, even though their averages are the same. In practice this means that the probability that the maximum air temperature on any given day of the year differs strongly from the average is higher for the inland city.

    Sport

    Let's assume that there are several football teams that are rated on some set of parameters, for example the number of goals scored and conceded, scoring chances, etc. The best team in this group will most likely have better values on more parameters. The smaller the team's standard deviation for each of the parameters, the more predictable the team's result; such teams are balanced. On the other hand, the result of a team with a large standard deviation is difficult to predict, which in turn is explained by an imbalance, for example a strong defense but a weak attack.

    Using the standard deviation of team parameters makes it possible, to some degree, to predict the result of a match between two teams by assessing their strengths and weaknesses, and therefore the tactics they are likely to choose.


    Variance (dispersion) is the arithmetic mean of the squared deviations of each value of the characteristic from the overall mean. Depending on the source data, the variance can be unweighted (simple) or weighted.

    The variance is calculated using the following formulas:

    · for ungrouped data

    $\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n}$

    · for grouped data

    $\sigma^2 = \frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}$

    The procedure for calculating the weighted variance:

    1. determine the weighted arithmetic mean

    2. determine the deviations of the variants from the mean

    3. square the deviation of each variant from the mean

    4. multiply the squared deviations by the weights (frequencies)

    5. sum the resulting products

    6. divide the resulting sum by the sum of the weights

    The formula for the simple variance can be transformed into the following form:

    $\sigma^2 = \overline{x^2} - \bar{x}^2 = \frac{\sum x_i^2}{n} - \left(\frac{\sum x_i}{n}\right)^2$

    The procedure for calculating the simple variance with this formula:

    1. determine the arithmetic mean

    2. square the arithmetic mean

    3. square each value in the series

    4. find the sum of the squared values

    5. divide the sum of squares by their number, i.e. determine the mean square

    6. determine the difference between the mean square of the characteristic and the square of the mean

    The formula for the weighted variance can be transformed in the same way:

    $\sigma^2 = \frac{\sum x_i^2 f_i}{\sum f_i} - \left(\frac{\sum x_i f_i}{\sum f_i}\right)^2$

    i.e. the variance is equal to the difference between the mean of the squared values of the characteristic and the square of the arithmetic mean. With the transformed formula there is no need to calculate the deviations of individual values from x̄, which also removes the calculation error caused by rounding the deviations.
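
    The equivalence of the two ways of computing the weighted variance (by definition and by the "mean of squares minus square of the mean" shortcut) can be checked with a short sketch. The function names and the small grouped series are illustrative assumptions.

```python
def variance_by_definition(values, weights):
    """Weighted mean of squared deviations from the weighted mean."""
    total = sum(weights)
    mean = sum(x * f for x, f in zip(values, weights)) / total
    return sum((x - mean) ** 2 * f for x, f in zip(values, weights)) / total

def variance_by_shortcut(values, weights):
    """Mean of the squares minus the square of the mean."""
    total = sum(weights)
    mean = sum(x * f for x, f in zip(values, weights)) / total
    mean_of_squares = sum(x ** 2 * f for x, f in zip(values, weights)) / total
    return mean_of_squares - mean ** 2

values, weights = [3, 4, 5], [1, 2, 1]   # illustrative grouped data
print(variance_by_definition(values, weights), variance_by_shortcut(values, weights))
```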

    Variance has a number of properties, some of which simplify its calculation:

    1) the variance of a constant is zero;

    2) if all values of the characteristic are reduced by the same number, the variance does not change;

    3) if all values of the characteristic are reduced by the same factor k, the variance decreases by a factor of k².
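
    These properties are easy to confirm numerically. The sketch below uses `statistics.pvariance` (population variance, divisor n); the data, the shift of 3 and the factor k = 2 are hypothetical values chosen for illustration.

```python
import statistics

values = [2, 4, 4, 6, 9]   # hypothetical series
k = 2

base = statistics.pvariance(values)
constant = statistics.pvariance([5, 5, 5, 5])            # property 1: zero
shifted = statistics.pvariance([x - 3 for x in values])  # property 2: unchanged
scaled = statistics.pvariance([x / k for x in values])   # property 3: divided by k squared

print(constant, base, shifted, scaled, base / k ** 2)
```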

    The standard deviation S is the square root of the variance:

    · for ungrouped data:

    $S = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$

    · for the variation series:

    $S = \sqrt{\frac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}}$

    The range of variation, the mean linear deviation and the standard deviation are named quantities. They have the same units of measurement as the individual values of the characteristic.

    Variance and standard deviation are the most widely used measures of variation. This is explained by the fact that they are included in most theorems of probability theory, which serves as the foundation of mathematical statistics. In addition, the variance can be decomposed into its component elements, allowing one to evaluate the influence of various factors that determine the variation of a trait.

    The calculation of variation indicators for banks grouped by amount of profit is shown in the table.

    Profit, million rubles   Number of banks (f)   Midpoint (x)   x·f      x − x̄    |x − x̄|·f   (x − x̄)²·f
    3.7 - 4.6                2                     4.15           8.30     -1.935   3.870       7.489
    4.6 - 5.5                4                     5.05           20.20    -1.035   4.140       4.285
    5.5 - 6.4                6                     5.95           35.70    -0.135   0.810       0.109
    6.4 - 7.3                5                     6.85           34.25    +0.765   3.825       2.926
    7.3 - 8.2                3                     7.75           23.25    +1.665   4.995       8.317
    Total:                   20                                   121.70            17.640      23.126

    The mean linear deviation and the standard deviation show how much the value of a characteristic fluctuates on average among the units of the population under study. In this case, the average fluctuation in profit is 0.882 million rubles by the mean linear deviation and 1.075 million rubles by the standard deviation. The standard deviation is never less than the mean linear deviation. If the distribution of the characteristic is close to normal, then S and d are related: S ≈ 1.25d, or d ≈ 0.8S. The standard deviation shows how the bulk of the population units are located relative to the arithmetic mean. Regardless of the shape of the distribution, at least 75% of the values of the characteristic fall into the interval x̄ ± 2S, and at least 89% of all values fall into the interval x̄ ± 3S (P. L. Chebyshev's theorem).
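
    The figures for the bank example can be reproduced with the sketch below. The numbers of banks per group (2, 4, 6, 5, 3) are an assumption: they are recovered by dividing the x·f column of the table by the interval midpoints.

```python
import math

# (lower bound, upper bound, number of banks); counts recovered from the x*f column.
groups = [(3.7, 4.6, 2), (4.6, 5.5, 4), (5.5, 6.4, 6), (6.4, 7.3, 5), (7.3, 8.2, 3)]

midpoints = [(low + high) / 2 for low, high, _ in groups]
counts = [f for _, _, f in groups]
n = sum(counts)

mean = sum(x * f for x, f in zip(midpoints, counts)) / n
mean_linear_deviation = sum(abs(x - mean) * f for x, f in zip(midpoints, counts)) / n
variance = sum((x - mean) ** 2 * f for x, f in zip(midpoints, counts)) / n
std_dev = math.sqrt(variance)

print(round(mean, 3), round(mean_linear_deviation, 3), round(std_dev, 3))
```

    With these inputs the mean linear deviation comes out at about 0.882 and the standard deviation at about 1.075 million rubles, matching the values quoted in the text.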

    The average value is a general indicator of a statistical population that smooths out individual differences in the values of statistical quantities and makes it possible to compare different populations with each other.

    There are two classes of averages: power averages and structural averages.

    Structural averages include the mode and the median, but power averages of various types are used most often.

    Power averages

    Power averages can be simple or weighted.

    A simple average is calculated when there are two or more ungrouped statistical values, in arbitrary order, using the following general formula:

    $\bar{X} = \sqrt[m]{\frac{\sum X^m}{N}}$

    A weighted average is calculated from grouped statistical values using the following general formula:

    $\bar{X} = \sqrt[m]{\frac{\sum X^m f}{\sum f}}$

    where X are the individual statistical values or the midpoints of the grouping intervals;
    m is the exponent, whose value determines the following types of power averages:
    at m = -1, the harmonic mean;
    at m = 0, the geometric mean;
    at m = 1, the arithmetic mean;
    at m = 2, the quadratic mean (root mean square);
    at m = 3, the cubic mean.

    Using general formulas for simple and weighted averages for different exponents m, we obtain particular formulas of each type, which will be discussed in detail below.
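
    The general power-average formula can be sketched as a single Python function. The name `power_mean` is hypothetical, the values are assumed to be positive, and m = 0 is handled separately as the geometric mean, since that is the limit of the general formula as m approaches 0.

```python
import math

def power_mean(values, m, weights=None):
    """Simple or weighted power average of positive values for exponent m."""
    if weights is None:
        weights = [1] * len(values)        # simple average: each value counted once
    total = sum(weights)
    if m == 0:
        # Geometric mean, the limiting case of the general formula
        return math.exp(sum(f * math.log(x) for x, f in zip(values, weights)) / total)
    return (sum(f * x ** m for x, f in zip(values, weights)) / total) ** (1 / m)

grades = [3, 4, 4, 5]
for m, name in [(-1, "harmonic"), (0, "geometric"), (1, "arithmetic"),
                (2, "quadratic"), (3, "cubic")]:
    print(name, round(power_mean(grades, m), 3))
```

    For the same data the averages do not decrease as m grows, so the harmonic mean is the smallest and the cubic mean the largest of the five.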

    Arithmetic mean

    The arithmetic mean is the most commonly used average; it is obtained by substituting m = 1 into the general formula. The simple arithmetic mean has the following form:

    $\bar{X} = \frac{\sum X}{N}$

    where X are the values for which the average must be calculated; N is the total number of X values (the number of units in the population being studied).

    For example, a student passed 4 exams and received the following grades: 3, 4, 4 and 5. Let's calculate the average score using the simple arithmetic average formula: (3+4+4+5)/4 = 16/4 = 4.

    The weighted arithmetic mean has the following form:

    $\bar{X} = \frac{\sum X f}{\sum f}$

    where f is the number of values equal to X (the frequency).

    For example, a student passed 4 exams and received the following grades: 3, 4, 4 and 5. Let's calculate the average score using the weighted arithmetic average formula: (3*1 + 4*2 + 5*1)/4 = 16/4 = 4.

    If the X values are given as intervals, the calculations use the interval midpoints, defined as half the sum of the upper and lower boundaries of the interval. If an interval has no lower or upper boundary (an open interval), the missing boundary is found using the width (the difference between the upper and lower boundaries) of the adjacent interval.

    For example, an enterprise has 10 employees with up to 3 years of experience, 20 with 3 to 5 years of experience, 5 employees with more than 5 years of experience. Then we calculate the average length of service of employees using the weighted arithmetic average formula, taking as X the midpoint of the length of service intervals (2, 4 and 6 years):
    (2*10+4*20+6*5)/(10+20+5) = 3.71 years.

    The arithmetic average is used most often, but there are times when it is necessary to use other types of averages. Let's consider such cases further.

    Harmonic mean

    The harmonic mean is used when the source data do not contain the frequencies f for the individual X values but give their products Xf. Denoting Xf = w, we express f = w/X and, substituting this into the formula for the weighted arithmetic mean, obtain the formula for the weighted harmonic mean:

    $\bar{X} = \frac{\sum w}{\sum \frac{w}{X}}$

    Thus, the weighted harmonic mean is used when the frequencies f are unknown and w = Xf is known. When all w = 1, that is, each value of X occurs once, the simple harmonic mean formula is applied:

    $\bar{X} = \frac{N}{\sum \frac{1}{X}}$

    For example, a car traveled from point A to point B at 90 km/h and back at 110 km/h. To determine the average speed we apply the simple harmonic mean, since in this example the distances are equal, w₁ = w₂ (the distance from A to B is the same as from B to A), and distance is the product of speed (X) and time (f). Average speed = (1+1)/(1/90+1/110) = 99 km/h.
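
    The car example can be verified in two ways: with `statistics.harmonic_mean` and by dividing total distance by total time. The one-way distance of 100 km is a hypothetical value; any positive distance gives the same average speed.

```python
import statistics

speeds = [90, 110]                      # km/h, there and back
average_speed = statistics.harmonic_mean(speeds)

distance = 100                          # km one way, hypothetical
total_time = distance / 90 + distance / 110
check = 2 * distance / total_time       # total distance / total time

print(round(average_speed, 6), round(check, 6))
```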

    Geometric mean

    The geometric mean is used to determine average relative changes, as discussed in the topic on time series. The geometric mean gives the most accurate averaging result when the task is to find a value of X that is equidistant from both the maximum and the minimum values of X.

    For example, between 2005 and 2008 the inflation index in Russia was 1.109 in 2005, 1.090 in 2006, 1.119 in 2007 and 1.133 in 2008. Since the inflation index is a relative change (a dynamic index), the average must be calculated as the geometric mean: (1.109*1.090*1.119*1.133)^(1/4) = 1.1126, that is, from 2005 to 2008 prices grew by an average of 11.26% per year. Using the arithmetic mean instead would give an incorrect result of 11.28%.
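
    The inflation example can be reproduced with the standard library. This sketch assumes Python 3.8 or later, where `statistics.geometric_mean` is available.

```python
import statistics

indices = [1.109, 1.090, 1.119, 1.133]          # annual inflation indices, 2005-2008

geometric = statistics.geometric_mean(indices)  # correct average for relative changes
arithmetic = statistics.mean(indices)           # shown only for comparison

print(round(geometric, 4), round(arithmetic, 5))
```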

    Quadratic mean

    The quadratic mean (root mean square) is used when the initial values of X can be both positive and negative, for example when calculating average deviations.

    The main application of the quadratic mean is measuring the variation of X values, which is discussed below.

    Cubic mean

    The cubic mean is used extremely rarely, for example in calculating the human poverty indices for developing countries (HPI-1) and for developed countries (HPI-2), proposed and calculated by the UN.

    Structural averages

    The most frequently used structural averages are the statistical mode and the statistical median.

    Statistical mode

    Statistical mode is the most frequently repeated value of X in a statistical population.

    If X is given discretely, the mode is determined without calculation as the value of the characteristic with the highest frequency. If a statistical population has two or more modes, it is considered bimodal (two modes) or multimodal (more than two modes), which indicates heterogeneity of the population.

    For example, the company employs 16 people: 4 of them have 1 year of experience, 3 people have 2 years of experience, 5 have 3 years of experience, and 4 people have 4 years of experience. Thus, modal experience Mo = 3 years, since the frequency of this value is maximum (f = 5).

    If X is given in equal intervals, the modal interval is first identified as the interval with the highest frequency f. Within this interval, the conditional value of the mode is found using the formula:

    $Mo = X_{NMo} + h_{Mo} \cdot \frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})}$

    where Mo is the mode;
    $X_{NMo}$ is the lower limit of the modal interval;
    $h_{Mo}$ is the width of the modal interval (the difference between its upper and lower boundaries);
    $f_{Mo}$ is the frequency of the modal interval;
    $f_{Mo-1}$ is the frequency of the interval preceding the modal one;
    $f_{Mo+1}$ is the frequency of the interval following the modal one.

    For example, an enterprise has 10 employees with up to 3 years of experience, 20 with 3 to 5 years of experience, 5 employees with more than 5 years of experience. Let's calculate the modal work experience in the modal interval from 3 to 5 years: Mo = 3 + 2*(20-10)/(2*20-10-5) = 3.8 (years).
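
    The modal calculation for interval data can be sketched as follows. The function name is hypothetical; the open-ended groups are closed as 1-3 and 5-7 years using the width of the adjacent interval, and a missing neighbouring interval is treated as having frequency 0, which is an assumption of this sketch.

```python
def grouped_mode(intervals):
    """Mode of equal-width interval data given as (lower, upper, frequency) tuples."""
    freqs = [f for _, _, f in intervals]
    i = freqs.index(max(freqs))                        # position of the modal interval
    lower, upper, f_mo = intervals[i]
    f_prev = freqs[i - 1] if i > 0 else 0              # interval before the modal one
    f_next = freqs[i + 1] if i < len(freqs) - 1 else 0 # interval after the modal one
    return lower + (upper - lower) * (f_mo - f_prev) / ((f_mo - f_prev) + (f_mo - f_next))

experience = [(1, 3, 10), (3, 5, 20), (5, 7, 5)]       # years of experience, employees
print(round(grouped_mode(experience), 2))              # 3.8
```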

    If the range of intervals h is different, then instead of frequencies f it is necessary to use interval densities, calculated by dividing the frequencies f by the range of the interval h.

    Statistical median

    The statistical median is the value of X that divides a statistical population, ordered in ascending or descending order, into two equal parts. As a result, one half of the values is greater than the median and the other half is less than the median.

    If X is given discretely, then to determine the median all values are numbered from 1 to N in ascending order. For an even N the median lies midway between the values of X with numbers 0.5N and (0.5N+1), and for an odd N it is the value of X with number 0.5(N+1).

    For example, there is data on the age of part-time students in a group of 10 people - X: 18, 19, 19, 20, 21, 23, 23, 25, 28, 30 years. These data are already ordered in ascending order, and their number N=10 is even, so the median will be between X with numbers 0.5*10=5 and (0.5*10+1)=6, which correspond to the values ​​X 5 = 21 and X 6 =23, then the median: Me = (21+23)/2 = 22 (years).
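
    The positional rule for the median can be written out directly and compared with `statistics.median`. The helper name `median_by_position` is hypothetical; positions are counted from 1 as in the text, so the list indices are shifted by one.

```python
import statistics

def median_by_position(values):
    """Median found by item numbers 0.5N and 0.5N+1 (even N) or 0.5(N+1) (odd N)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]                  # item number 0.5(N+1)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2    # items 0.5N and 0.5N+1

ages = [18, 19, 19, 20, 21, 23, 23, 25, 28, 30]
print(median_by_position(ages), statistics.median(ages))
```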

    If X is given in equal intervals, the median interval is determined first (the interval in which one half of the frequencies f ends and the other half begins), and the conditional value of the median within it is found using the formula:

    $Me = X_{NMe} + h_{Me} \cdot \frac{0.5 \sum f - f_{Me-1}}{f_{Me}}$

    where Me is the median;
    $X_{NMe}$ is the lower limit of the median interval;
    $h_{Me}$ is the width of the median interval (the difference between its upper and lower boundaries);
    $f_{Me}$ is the frequency of the median interval;
    $f_{Me-1}$ is the sum of the frequencies of the intervals preceding the median interval.

    Let us calculate the median length of service for the example used for the modal length of service (the enterprise has 10 employees with up to 3 years of experience, 20 with 3 to 5 years of experience and 5 with more than 5 years of experience). Half of the total number of workers is (10+20+5)/2 = 17.5. The first interval, up to 3 years, contains only 10 workers, while the first two contain (10+20) = 30, which is more than 17.5, so the interval from 3 to 5 years is the median interval. Within it, the conditional value of the median is Me = 3 + 2*(0.5*35-10)/20 = 3.75 (years).
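
    The median for interval data can be sketched in the same way. The function name is hypothetical and the open-ended groups are closed as 1-3 and 5-7 years, which is an assumption based on the width of the adjacent interval. With half of the total frequency equal to 17.5, the formula gives 3.75 years.

```python
def grouped_median(intervals):
    """Median of interval data given as (lower, upper, frequency) tuples in ascending order."""
    half = sum(f for _, _, f in intervals) / 2    # half of the total frequency
    cumulative = 0                                # frequencies before the current interval
    for lower, upper, f in intervals:
        if cumulative + f >= half:                # the median interval is reached
            return lower + (upper - lower) * (half - cumulative) / f
        cumulative += f

experience = [(1, 3, 10), (3, 5, 20), (5, 7, 5)]  # years of experience, employees
print(grouped_median(experience))                 # 3.75
```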

    Just as in the case of mode, when determining the median, if the range of intervals h is different, then instead of frequencies f it is necessary to use interval densities, calculated by dividing the frequencies f by the range of the interval h.

    Variation indicators

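    Variation is the difference in the values of X among individual units of the statistical population. To study the strength of variation, the following indicators of variation are calculated: the range of variation, the average linear deviation, the linear coefficient of variation, the variance and the quadratic coefficient of variation.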

    Range of variation

    The range of variation is the difference between the maximum and minimum values of X in the statistical population under study:

    $H = X_{max} - X_{min}$

    The disadvantage of H is that it only shows the maximum difference in X values ​​and cannot measure the strength of variation in the entire population.

    Average linear deviation

    The average linear deviation is the mean absolute value of the deviations of the X values from the arithmetic mean. It can be calculated using the simple arithmetic mean formula, which gives the simple average linear deviation:

    $L = \frac{\sum |X - \bar{X}|}{N}$

    For example, a student passed 4 exams and received the grades 3, 4, 4 and 5, so X̄ = 4. The simple average linear deviation is L = (|3-4| + |4-4| + |4-4| + |5-4|)/4 = 0.5.

    If the source data X are grouped (there are frequencies f), the average linear deviation is calculated using the weighted arithmetic mean formula, which gives the weighted average linear deviation:

    $L = \frac{\sum |X - \bar{X}| f}{\sum f}$

    Let's return to the example of the student who passed 4 exams and received the grades 3, 4, 4 and 5: X̄ = 4 and L = 0.5. The weighted average linear deviation is L = (|3-4|·1 + |4-4|·2 + |5-4|·1)/4 = 0.5.

    Linear coefficient of variation

    The linear coefficient of variation is the ratio of the average linear deviation to the arithmetic mean:

    $V_L = \frac{L}{\bar{X}}$

    Using the linear coefficient of variation, you can compare the variation of different populations because, unlike the average linear deviation, its value does not depend on the units of measurement X.

    In the example under consideration about a student who passed 4 exams and received the following grades: 3, 4, 4 and 5, the linear coefficient of variation will be 0.5/4 = 0.125 or 12.5%.
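
    The student example for the average linear deviation and the linear coefficient of variation can be reproduced with the sketch below; both function names are hypothetical.

```python
def mean_linear_deviation(values, frequencies=None):
    """Average absolute deviation from the arithmetic mean (simple or weighted)."""
    if frequencies is None:
        frequencies = [1] * len(values)        # ungrouped data: each value counts once
    n = sum(frequencies)
    mean = sum(x * f for x, f in zip(values, frequencies)) / n
    return sum(abs(x - mean) * f for x, f in zip(values, frequencies)) / n

def linear_coefficient_of_variation(values, frequencies=None):
    """Average linear deviation divided by the arithmetic mean."""
    if frequencies is None:
        frequencies = [1] * len(values)
    mean = sum(x * f for x, f in zip(values, frequencies)) / sum(frequencies)
    return mean_linear_deviation(values, frequencies) / mean

simple = mean_linear_deviation([3, 4, 4, 5])
weighted = mean_linear_deviation([3, 4, 5], [1, 2, 1])
ratio = linear_coefficient_of_variation([3, 4, 4, 5])
print(simple, weighted, ratio)  # 0.5 0.5 0.125
```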

    Dispersion

    Variance (dispersion) is the mean squared deviation of the X values from the arithmetic mean. It can be calculated using the simple arithmetic mean formula, which gives the simple variance:

    $D = \frac{\sum (X - \bar{X})^2}{N}$

    In the familiar example of the student who passed 4 exams and received the grades 3, 4, 4 and 5, X̄ = 4. Then the simple variance is D = ((3-4)² + (4-4)² + (4-4)² + (5-4)²)/4 = 0.5.

    If the source data X are grouped (there are frequencies f), the variance is calculated using the weighted arithmetic mean formula, which gives the weighted variance:

    $D = \frac{\sum (X - \bar{X})^2 f}{\sum f}$

    In the same example of the student who passed 4 exams and received the grades 3, 4, 4 and 5, the weighted variance is D = ((3-4)²·1 + (4-4)²·2 + (5-4)²·1)/4 = 0.5.

    If you transform the variance formula (expand the brackets in the numerator, divide term by term by the denominator and collect like terms), you get another formula, which expresses the variance as the difference between the mean of the squares and the square of the mean:

    $D = \overline{X^2} - \bar{X}^2$

    Once the variance has been calculated, the standard deviation is even easier to find, as its square root:

    $\sigma = \sqrt{D}$

    In the example about the student, where D = 0.5 was found above, the standard deviation is σ = √0.5 ≈ 0.707.

    Quadratic coefficient of variation

    The quadratic coefficient of variation is the most popular relative measure of variation:

    $V = \frac{\sigma}{\bar{X}}$

    The criterion value of the quadratic coefficient of variation V is 0.333, or 33.3%: if V is less than or equal to 0.333 the variation is considered weak, and if it is greater than 0.333 it is considered strong. With strong variation, the statistical population under study is considered heterogeneous, and the average value is atypical and cannot be used as a general indicator of this population.

    In the example about the student, where σ = 0.707 was found above, the quadratic coefficient of variation is V = 0.707/4 = 0.177, which is less than the criterion value of 0.333, so the variation is weak and equal to 17.7%.
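
    The variance, standard deviation and quadratic coefficient of variation for the student's grades can be checked with the standard `statistics` module. The sketch uses the population forms (`pvariance`, `pstdev`), which divide by N as the formulas in this section do; the 0.333 threshold is the criterion value quoted in the text.

```python
import statistics

grades = [3, 4, 4, 5]

mean = statistics.mean(grades)
variance = statistics.pvariance(grades)   # D: mean squared deviation, divisor N
sigma = statistics.pstdev(grades)         # square root of D
v = sigma / mean                          # quadratic coefficient of variation

print(variance, round(sigma, 3), round(v, 3))
print("weak variation" if v <= 0.333 else "strong variation")
```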

    The square root of the variance is called the standard deviation from the mean, which is calculated as follows:

    $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \bar{x})^2 f}{\sum f}}$

    An elementary algebraic transformation of the standard deviation formula leads it to the following form:

    $\sigma = \sqrt{\overline{x^2} - \bar{x}^2}$

    This formula often turns out to be more convenient in calculation practice.

    The standard deviation, just like the average linear deviation, shows how much, on average, specific values of a characteristic deviate from their mean. The standard deviation is never less than the mean linear deviation. For distributions close to normal, there is the following relationship between them:

    $\sigma \approx 1.25\,\bar{d}$

    Knowing this ratio, you can use the known indicator to determine the unknown one, for example calculate σ from d̄ and vice versa. The standard deviation measures the absolute size of the variability of a characteristic and is expressed in the same units of measurement as the values of the characteristic (rubles, tons, years, etc.). It is an absolute measure of variation.

    For alternative attributes, for example the presence or absence of higher education or insurance, the formulas for the variance and standard deviation are as follows:

    $\sigma_p^2 = pq, \quad \sigma_p = \sqrt{pq}$

    where p is the share of units that possess the attribute and q = 1 − p is the share of units that do not.

    Let us show the calculation of the standard deviation according to the data of a discrete series characterizing the distribution of students in one of the university faculties by age (Table 6.2).

    Table 6.2.

    The results of auxiliary calculations are given in columns 2-5 of table. 6.2.

    The average age of a student, years, is determined by the weighted arithmetic mean formula (column 2):

    The squared deviations of the student's individual age from the average are contained in columns 3-4, and the products of the squared deviations and the corresponding frequencies are contained in column 5.

    We find the variance of the students' age using formula (6.2): σ² = 3.43.

    Then σ = √3.43 ≈ 1.85 years, i.e. each specific value of a student's age deviates from the average by 1.85 years on average.

    The coefficient of variation

    In its absolute value, the standard deviation depends not only on the degree of variation of the characteristic, but also on the absolute levels of options and the average. Therefore, it is impossible to directly compare the standard deviations of variation series with different average levels. To be able to make such a comparison, you need to find the share of the average deviation (linear or quadratic) in the arithmetic average, expressed as a percentage, i.e. calculate relative measures of variation.

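    The linear coefficient of variation is calculated by the formula

    $V_{\bar{d}} = \frac{\bar{d}}{\bar{x}} \cdot 100\%$

    The coefficient of variation is determined by the following formula:

    $V_{\sigma} = \frac{\sigma}{\bar{x}} \cdot 100\%$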

    In coefficients of variation, not only the incomparability associated with different units of measurement of the characteristic being studied is eliminated, but also the incomparability that arises due to differences in the value of arithmetic means. In addition, the indicators of variation characterize the homogeneity of the population. The population is considered homogeneous if the coefficient of variation does not exceed 33%.

    According to the table. 6.2 and the calculation results obtained above, we determine the coefficient of variation, %, according to formula (6.3):

    If the coefficient of variation exceeds 33%, this indicates heterogeneity of the population being studied. The value obtained in our case indicates that the population of students is homogeneous in age. Thus, an important function of the generalizing indicators of variation is to assess the reliability of averages. The smaller d̄, σ² and V are, the more homogeneous the resulting set of phenomena and the more reliable the resulting average. According to the "three-sigma rule" of mathematical statistics, in series that are normally distributed or close to normal, deviations from the arithmetic mean not exceeding ±3σ occur in 997 cases out of 1000. Thus, knowing x̄ and σ, you can get a general initial idea of the variation series. If, for example, the average salary of an employee in a company is 25,000 rubles and σ is 100 rubles, then with a probability close to certainty the wages of the company's employees lie within the range 25,000 ± 3 × 100, i.e. from 24,700 to 25,300 rubles.