5 Measures in statistics
- Measure of central tendency
- Measure of dispersion
- Gaussian Distribution
- Z-score
- Standard Normal Distribution
Central Tendency- Refers to the measures used to locate the center of a distribution of data. The three common measures are the mean, the median, and the mode.
- Mean is the average of all the data, that is, the sum of the values divided by the number of values.
- Median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution.
- Mode is the most frequent value, that is, the value that occurs the highest number of times.
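As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The list `sample` is a made-up illustration and is not taken from the text.

```python
import statistics

# Hypothetical sample, chosen only to illustrate the three measures
sample = [2, 3, 3, 5, 7]

mean_value = statistics.mean(sample)      # sum of values / number of values
median_value = statistics.median(sample)  # middle value of the sorted data
modes = statistics.multimode(sample)      # all values tied for most frequent

print("mean:", mean_value)
print("median:", median_value)
print("mode(s):", modes)
```

`multimode` is used here rather than `mode` because a data set can have several values tied for the highest frequency.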
The measure of Dispersion refers to how the data are scattered around the central tendency. To measure dispersion, we calculate two quantities: variance and standard deviation.
- Standard Deviation in statistics is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.
- Variance is the expectation of the squared deviation from the mean of a random variable. The standard deviation is obtained as the square root of the variance.
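The two definitions above can be sketched in a few lines of Python. The list `sample` is a hypothetical example, and the calculation assumes the population form of the variance (dividing by the number of values), which matches the "expectation of the squared deviation" wording.

```python
import math
import statistics

# Hypothetical sample, not taken from the text
sample = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = sum(sample) / len(sample)

# Population variance: average of the squared deviations from the mean
variance = sum((x - mean_value) ** 2 for x in sample) / len(sample)

# Standard deviation: square root of the variance
std_dev = math.sqrt(variance)

print("variance:", variance)
print("standard deviation:", std_dev)

# The standard library offers the same population quantities directly
print(statistics.pvariance(sample), statistics.pstdev(sample))
```

Note that `statistics.variance` and `statistics.stdev` divide by n - 1 instead (the sample estimates), so they give slightly larger values on the same data.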
Outliers- In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement, may indicate novel data, or may be the result of experimental error; outliers caused by error are sometimes excluded from the data set.
For example, take these 10 numbers:
{1,1,2,2,3,3,4,5,5,6}
Mean = Sum of observations / Number of observations = 32/10 = 3.2
Now suppose we add a very large number, say 100, to the given set of numbers:
{1,1,2,2,3,3,4,5,5,6,100}
Now the mean comes out to be 132/11 = 12
Previous mean = 3.2
Mean in the presence of the outlier = 12
As we can see, the presence of a single outlier changes the mean significantly. So, for calculations based on the mean to be representative, outliers should be identified and, where justified, removed. The middle value, or median, is not affected here: it is 3 both with and without the outlier.
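The comparison can be reproduced with a short sketch that starts from the first set of 10 numbers and appends 100. The variable names are illustrative only.

```python
import statistics

original = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6]
with_outlier = original + [100]  # add one very large value

mean_before = statistics.mean(original)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(original)
median_after = statistics.median(with_outlier)

# The mean jumps from 3.2 to 12, while the median stays at 3
print("mean:", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```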
Percentile- A percentile is a value below which a certain percentage of observations lie.
For example:
Dataset:2,2,3,4,5,5,5,6,7,8,8,8,8,9,9,10,11,11,12
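What is the percentile rank of 10?
Percentile rank of 10 = (Number of values below 10 / Total number of values) * 100 = 15/19 * 100 ≈ 78.9, so 10 sits at roughly the 79th percentile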
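A minimal sketch of the percentile-rank calculation follows, applied to the values exactly as listed in the dataset above. The helper `percentile_rank` is a hypothetical name, and it assumes the "strictly below" convention from the definition; other conventions also count values equal to x and give a different result.

```python
def percentile_rank(values, x):
    """Percentage of observations that are strictly below x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


dataset = [2, 2, 3, 4, 5, 5, 5, 6, 7, 8, 8, 8, 8, 9, 9, 10, 11, 11, 12]

rank = percentile_rank(dataset, 10)
print(f"{len(dataset)} values, percentile rank of 10: {rank:.1f}")
```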
To detect and remove outliers, we use the five-number summary:
- Minimum
- First Quartile(Q1)
- Median
- Third Quartile(Q3)
- Maximum
The minimum and maximum values define the range of the data set.
The lower quartile, or first quartile (Q1), is the value under which 25% of data points are found when they are arranged in increasing order. The upper quartile, or third quartile (Q3), is the value under which 75% of data points are found when arranged in increasing order.
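The five-number summary can be sketched with `statistics.quantiles` from the standard library. The helper name `five_number_summary` is hypothetical, and the `method="inclusive"` setting is an assumption: several quartile conventions exist, and they can give slightly different Q1 and Q3 values on small data sets.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # quantiles() sorts the data itself and returns the three cut points
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# The earlier example set, including the outlier 100
data = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 100]

summary = five_number_summary(data)
print(summary)
```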
To remove outliers, we follow these steps:
- IQR (Interquartile Range) = Q3 - Q1
- Lower Fence = Q1 - 1.5 * IQR
- Upper Fence = Q3 + 1.5 * IQR
Any value below the Lower Fence or above the Upper Fence is considered an outlier and can be removed.
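The fence rule above can be sketched as follows. The function `remove_outliers` and its parameter `k` are illustrative names, and the quartiles are computed with `statistics.quantiles(..., method="inclusive")`, which is one of several possible quartile conventions.

```python
import statistics


def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]; return kept values and fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    kept = [v for v in values if lower_fence <= v <= upper_fence]
    return kept, lower_fence, upper_fence


# The earlier example set, including the outlier 100
data = [1, 1, 2, 2, 3, 3, 4, 5, 5, 6, 100]

kept, lower_fence, upper_fence = remove_outliers(data)
print("fences:", lower_fence, upper_fence)
print("kept:", kept)
```

On this data Q1 is 2 and Q3 is 5, so the IQR is 3 and the fences are -2.5 and 9.5; only the value 100 falls outside them.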