This is an introductory tutorial on Business Analytics offered by Simplilearn. It is part of the Data Science with R Language Certification Training course.
After completing the Statistical Concepts And Their Application In Business Tutorial, you will be able to understand:
Statistical Methods overview
Population and Samples
How to develop a sampling plan and sampling methods
What Descriptive Statistics is and its components
The business usage of Descriptive Statistics via a Case Study
Probability theory and distributions
Confidence Interval
The concept of tests of significance
One-sided and two-sided hypothesis testing
The various tests of significance
Nonparametric testing
We will start the lesson with an introduction to statistical methods.
Statistics is a branch of applied or business mathematics in which we collect, organize, analyze, and interpret numerical facts. Statistical methods are the concepts, models, and formulas of mathematics used in the statistical analysis of data.
They can be subdivided into two main categories - Descriptive Statistics and Inferential Statistics.
Descriptive statistics consists of measures of central tendency and measures of dispersion, while inferential statistics consists of estimation and hypothesis testing. Descriptive statistical methods involve summarizing or describing the sample of data in various forms to get an overall gist of the data.
Most often the results are presented in quantitative or graphical form.
In contrast, inferential statistics draws conclusions about the population from which the sample came, or predicts various outcomes.
In the next section, we will talk about population and samples.
A population is an entire collection of objects or observations from which we may collect data. It is the entire group we are interested in, which we wish to describe or draw conclusions about.
For each population, there are many possible samples. It is important that the investigator carefully and completely defines the population, including a description of the members to be included, before collecting the sample.
A sample is a group of units selected from a larger group (the population). By studying the sample, it is hoped to draw valid conclusions about the larger group. A sample is generally selected for study because the population is too large to study in its entirety. The sample should be representative of the general population. This is often best achieved by random sampling.
In the next section, we will look at how to develop a sampling plan.
Let us now look at the steps involved in developing a sampling plan:
We define the target population in terms of the number of elements, sampling unit, extent, and time. The sampling element is the object from which the information is desired. Then, depending on the objective, financial resources, time limit, and nature of the problem, we choose between two types of sampling – probability sampling and non-probability sampling.
The two are distinguished as follows: in probability sampling, every element has an equal probability of being chosen for the sample, whereas in non-probability sampling, the elements do not all have an equal probability of being chosen.
Next, we obtain the sampling frame.
It is a list of elements from which we select the units to be sampled. The frame must be easy to administer and it must take into account all the potential factors that could bias the sample. After this, we determine a sample size.
The sample size is decided based on the level of accuracy that one is looking for. It also depends on other factors like the amount of variability in the population and confidence level required for the estimation of population value. After this step, we have the data collection method.
Data collection can be done by various methods of observation, experimentation, and surveying. Next, we develop an operational procedure to be followed within the sampling framework. This specifies whether a probability or non-probability procedure is used and which sampling technique is chosen. After this step, we have the operational plan.
Finally, we execute the operational plan and verify the procedure.
In the next section, we will look at sampling techniques.
Sample, as we have just learned, is a subgroup of the population from which information is collected. In this section, we will look at the classification of sampling methods with a short description. Sampling is broadly classified into probability and non-probability sampling.
Let’s look at different probability sampling techniques.
Simple random sampling is the purest form of sampling, where every element has an equal opportunity of being selected. All the elements are selected independently of each other.
Systematic sampling involves the selection of elements at regular intervals from an ordered sampling frame.
Stratified sampling is the process of dividing the members of the population into homogeneous subgroups, or strata, before sampling.
Cluster sampling is used when natural but relatively homogeneous groupings are evident in a statistical population.
Non-probability sampling can be further classified as follows:
In convenience sampling, attempts are made to obtain a sample that is convenient to collect, like interviewing people in a railway station.
In judgmental sampling, population elements are selected for a specific attribute based on the judgment of the researcher. If drawing inferences is not essential, these samples are quite useful.
Quota sampling may be viewed as two stages of judgmental sampling-
The first stage is to develop control categories, or quotas, of population elements so that different groups are represented in the total sample.
In the second stage, the sample elements are selected based on convenience or judgment to fill each quota.
In snowball sampling, an initial group of respondents is selected, usually at random or from contacts of existing customers, and each respondent then refers further respondents. It is used in studies involving customers who are hard to find.
In the next section, we will look at descriptive statistics.
Descriptive statistics is the method whereby data is analyzed to extract meaningful information, such as patterns in the data.
The descriptive methods show the important characteristics or summarize the data. However, these methods do not help us in making any conclusions from the data – it simply describes the data.
Let’s look at an example of descriptive methods.
The table given shows the number of students belonging to a particular score range as mentioned in the first column. Histograms are a graphical way of describing data.
From the plotted histogram, it is easier to deduce that the maximum number of students fall in the 50-60 mark category.
In the next section, we will discuss the measures of central tendency.
A measure of central tendency is a method of descriptive statistics that summarizes a data set with a single value. The single value generally indicates the central position of the distribution, and hence these are also known as measures of central location. They fall into the category of summary statistics.
There are three measures of central tendency –
Mean
Median and
Mode
With mean being the most commonly used statistic.
All three are different but valid measures of central tendency, and different measures are well suited to different sets of data. For example, for data with outliers, the median is a better measure than the mean.
Going forward, we will learn about the mean, median, and mode, and how to calculate them.
Mean is the sum of all numbers in a given data set divided by the total number of numbers. It can also be interpreted as a calculated central value of a set of numbers. The chart shows us how a mean is calculated.
The four cubes placed on different numbers represent the numbers of a given data set. In our case, we have four numbers in the set (2, 2, 6, 10). The mean comes out to be twenty divided by four; hence the mean here is five.
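The calculation above can be reproduced with a short sketch (shown in Python for illustration, using the standard library):

```python
# Mean of the example data set (2, 2, 6, 10): sum = 20, count = 4.
from statistics import mean

data = [2, 2, 6, 10]
print(mean(data))  # → 5
```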
In the next section, we will calculate the median value.
Median is the central most number in a sorted data set. Total numbers of values above and below the median are the same. The chart shows us how median is calculated. We have a dataset which has 6 numbers (3,3,4,5,7,8).
The median is calculated by arranging the numbers in ascending or descending order. Here the value of the median is (4 + 5) divided by 2, which equals 4.5. The blocks in blue are below the median and the blocks in white are above it.
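The same median calculation, sketched in Python with the standard library:

```python
from statistics import median

# The six-number data set from the chart; median() sorts internally.
data = [3, 3, 4, 5, 7, 8]
print(median(data))  # → 4.5 (average of the two middle values, 4 and 5)
```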
In the next section, we will discuss the mode of a distribution.
For the mode, we count the frequency of occurrence of each number in the data set. The number with the highest frequency is the mode. The chart here shows that the number 2 occurred thirty-five times in our data set; hence the mode for this data set is 2.
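A minimal sketch of the frequency count, using a hypothetical data set in which 2 occurs thirty-five times, as in the chart:

```python
from collections import Counter

# Hypothetical data set where the value 2 occurs most often (35 times).
data = [2] * 35 + [1] * 20 + [3] * 10
counts = Counter(data)
mode_value, frequency = counts.most_common(1)[0]
print(mode_value, frequency)  # → 2 35
```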
In the next section, we will look at measures of dispersion.
Measures of dispersion describe the amount of heterogeneity or variation within a given distribution – how the values are spread or dispersed around some central value, usually the mean. The most commonly used measures of dispersion are variance and standard deviation. We will discuss them in detail in the next section.
Variance is the average of the squared deviations about the mean. Mathematically, it is the sum of the squared differences between each number in the distribution and the mean, divided by the number of samples.
In the given formula, variance = Σ(x − x̄)² / n, where x is each sample value, x̄ is the mean, and n is the number of samples.
Standard deviation is the square root of variance. It is the most commonly used statistic and is a measure of how “spread” the distribution is. Let’s look at an example – the data consists of the numbers 2,5,5,4,6,8. So here the number of data, n = 6. Mean is the sum of all the numbers divided by n, which equals (2+5+5+4+6+8)/6 = 5.
Now, the variance is calculated as the sum of squared differences between each number and the mean, divided by n: (9 + 0 + 0 + 1 + 1 + 9) / 6 ≈ 3.33. The standard deviation is the square root of the variance, which equals approximately 1.83.
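The worked example can be checked with the population (divide-by-n) versions of variance and standard deviation from the Python standard library:

```python
from statistics import pvariance, pstdev  # population (divide-by-n) versions

# The example data set: mean = 5, sum of squared deviations = 20, n = 6.
data = [2, 5, 5, 4, 6, 8]
print(round(pvariance(data), 2))  # → 3.33
print(round(pstdev(data), 2))     # → 1.83
```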
In the next section, we will look at these concepts with the help of a business case study.
Let’s look at the case study to illustrate the concepts of descriptive statistics and how to make inferences based on the statistics.
The business case is given as follows -
A telecommunications company maintains a customer database that includes, among other things, information on how much each customer spent on long distance, toll-free, equipment rental, calling card, and wireless services in the previous month.
The telecom company surveyed a sample of 1000 of its customers on all the above services.
In the next section, we will use descriptive analysis to study customer spending to determine which services are most profitable.
We will use descriptive analysis to study customer spending and determine which services are most profitable. Here we have a table with the number of customers and the minimum, maximum, mean, and standard deviation of spending on the various telecom services. Valid N refers to the number of non-empty records, since not every customer uses every service.
From the given table, we can make out the following insights –
From the third row, we can see that on average, customers spend the most on equipment rental, but there is a lot of variation in the amount spent.
Looking at the fourth row, we can infer that customers with the calling card service spend only slightly less, on average, than equipment rental customers, and there is much less variation in the values.
As seen from the valid N column, the problem here is that most customers don't have all services, so a lot of 0's are being counted. One solution to this problem is to treat 0's as missing values so that the analysis for each service becomes conditional on having that service.
Next, we will look at probability theory.
Probability, as the name suggests, is the likelihood of a particular event happening.
It’s a branch of mathematics that deals with the uncertainty of an event happening in the future.
Probability value always occurs within a range of 0 to 1.
The probability of an event is given by P(E) = number of favorable occurrences divided by the number of possible occurrences.
Let’s take a simple example of tossing an unbiased coin:
If an unbiased coin is flipped, what is the probability that the result is heads? As we all know, a coin has two faces: one is heads, the other is tails. As a result, there are two possible outcomes.
The probability of heads equals the probability of tails, which is one over two, that is, 0.5 or 50%. So we can say there is a 50% chance that the result of the toss is heads.
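The formula P(E) = favorable / possible can be sketched directly (in Python, for illustration); `classical_probability` is a hypothetical helper name:

```python
from fractions import Fraction

def classical_probability(favorable, possible):
    """P(E) = favorable outcomes / possible outcomes (classical probability)."""
    return Fraction(favorable, possible)

print(classical_probability(1, 2))         # coin lands heads → 1/2
print(float(classical_probability(1, 2)))  # → 0.5
print(classical_probability(1, 6))         # rolling a given face of a die → 1/6
```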
In the next section, we will see how to assign probabilities.
The probability of an event can be determined based on the kind of data available. One type is classical probability, which applies when all outcomes are equally likely.
Let’s take the example of rolling a dice -
So there is an equal probability for the outcome to be 1, 2, 3, 4, 5, or 6. Another type is the relative frequency method, wherein the probability is determined from previously observed data.
For example, consider a car rental agency that has 5 cars, whose usage over the last 60 days is given in the table.
According to the past record,
There were no cars used for 3 days,
There was 1 car used for 10 days,
There were 2 cars used for 16 days,
3 cars for 15 days,
4 cars for 9 days and
5 cars for 7 days.
Now we determine the relative frequency of each occurrence by dividing each of the counts by the total of 60 days.
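The relative-frequency computation for the table above can be sketched as:

```python
# Days on which a given number of cars was rented, out of 60 days total.
days_by_cars = {0: 3, 1: 10, 2: 16, 3: 15, 4: 9, 5: 7}
total_days = sum(days_by_cars.values())  # 60

# Relative frequency = days observed / total days; these act as probabilities.
relative_freq = {cars: days / total_days for cars, days in days_by_cars.items()}
print(relative_freq[2])  # 16/60 ≈ 0.267
```

Note that the relative frequencies sum to 1, which is what lets us treat them as a probability distribution.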
The last type is the subjective method, where the probability is decided intuitively based on judgment – for example, there is a 75% chance that England will adopt the Euro currency by 2020.
In the next section, we will look at probability distributions.
Now, let’s talk about probability distributions. The probability distribution of a random variable gives information about how the probabilities are distributed over the values of that random variable. It is defined by a function f(x), which gives the probability of each value.
Now let’s take an example to understand the topic better.
Suppose we have sales data for AC units sold in the last 300 days, given in the table. It shows the number of AC units sold and the number of days.
So we find the relative frequency first and then plot the probability against the number of units sold. Thus, the plot on the right shows how the variables are distributed.
Tests of significance, generally known as hypothesis testing, are used to evaluate the evidence in favor of or against a given value of a population parameter. We will discuss what hypothesis testing involves in the following sections.
All tests of significance begin with a null hypothesis, H0. The tests either validate the null hypothesis or reject it in favor of an alternative hypothesis, Ha. The alternative hypothesis is generally the hypothesis that we are trying to prove.
The tests can be broadly classified into one-sided and two-sided tests, and the results are decided by calculating the “p-value”.
The p-value can be defined as the probability that the test statistic takes a value at least as extreme as the observed one, given that the null hypothesis is true. This might sound complicated, but we will discuss it in simpler terms in the next sections.
The interpretation of the tests of significance takes place as follows –
If the calculated p-value is less than a significance level alpha, the null hypothesis is rejected in favor of the alternative hypothesis.
It is very important to understand here that the alternative hypothesis is not validated; by rejecting the null hypothesis, the test declares only that the alternative hypothesis MAY be true.
Common values of alpha are 0.05 and 0.01.
An alpha value of 0.05 specifies that this conclusion might be wrong 5% of the time.
There are general assumptions for the tests of significance. The two important assumptions are that –
The distribution is approximately normal
The samples in the distribution have approximately equal variances.
In the next section, we will discuss one-sided tests in detail.
One-sided hypothesis testing proposes an alternative hypothesis that the sample statistic is greater OR less than the given statistic, but not both. Simply put, one-sided testing is used when the direction in which the given value might differ is known. In statistical terms, μ0 is the null value, or the given value.
The null hypothesis states that there is no difference, that is, μ = μ0. The alternative hypothesis states that there is a difference – either positive or negative, but not both.
Let’s understand this with an example:
Suppose we are given a sample dataset of heights of 100 males in New York, and their average height has been given as 5 feet 9 inches – which is the null value.
We would like to know if, in the recent years, there has been an increase in the average height of males in New York – this translates into the alternative hypothesis.
Expressing this statement in statistical form, we have –
Null Value: μ0 = 5 feet 9 inches
Null Hypothesis: μ = μ0
Alternative Hypothesis: μ > μ0
There are a variety of hypothesis tests, which can be used to calculate the p-value using a test statistic. Once the p-value is calculated, we can validate or reject the null hypothesis, based on the confidence level – which is generally 0.05.
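One such test is the one-sample z-test; a minimal sketch (in Python, for illustration) with hypothetical numbers for the heights example – the sample mean, standard deviation, and null mean below are made up, not taken from real data:

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def one_sided_p(sample_mean, null_mean, sample_sd, n):
    """p-value for H1: mean > null mean (upper-tailed one-sample z-test)."""
    z = (sample_mean - null_mean) / (sample_sd / math.sqrt(n))
    return 1 - normal_cdf(z)

# Hypothetical data: 100 heights averaging 70.0 in, sd 3.0 in, null mean 69.0 in.
p = one_sided_p(sample_mean=70.0, null_mean=69.0, sample_sd=3.0, n=100)
print(round(p, 4))  # a small p-value → reject H0 at alpha = 0.05
```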
In the next section, we will discuss two-sided hypothesis tests.
Two-sided hypothesis testing proposes an alternative hypothesis that the sample statistic is different compared to the given statistic; it does not matter if it is greater or lesser.
As seen previously, one-sided testing is used where the direction of the difference is known; two-sided testing is used where the direction is not known.
Similar to the previous section, we have μ0 – the null value, or the given value.
The null hypothesis states that there is no difference, or μ = μ0.
The alternative hypothesis states that there is a difference – irrespective of the direction. Let’s understand this with the same example as earlier:
So, given a sample dataset of heights of 100 males in New York, with an average height of 5 feet 9 inches, the two-sided test helps in validating whether, in recent years, there has been a difference in the average height of males in New York – either an increase or a decrease.
Expressing this statement in statistical form, we have –
Null Value: μ0 = 5 feet 9 inches
Null Hypothesis: μ = μ0
Alternative Hypothesis: μ ≠ μ0
As mentioned earlier, the p-value is calculated using a suitable test, and the null hypothesis is validated or rejected.
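A two-sided version of the same sketch doubles the tail probability; as before, the numbers are hypothetical, not from real data:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def two_sided_p(sample_mean, null_mean, sample_sd, n):
    """p-value for H1: mean != null mean (two-tailed one-sample z-test)."""
    z = (sample_mean - null_mean) / (sample_sd / math.sqrt(n))
    return 2 * (1 - normal_cdf(abs(z)))

# Hypothetical heights data: if the sample mean equals the null mean, p = 1.
p = two_sided_p(sample_mean=70.0, null_mean=69.0, sample_sd=3.0, n=100)
print(round(p, 4))
```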
In the next section, we will look at various tests of significance.
This section lists out the various tests of significance that are used in calculating the p-value. They are broadly classified into z-test, t-test, and f-test based on the test statistic.
The z-test is used to compare the mean with a given standard (the one-sample z-test) or to compare the means of two groups (the two-sample z-test) when the population standard deviation is known. This test is generally used where the number of samples is greater than 30.
The t-test is also used with means, but when the population standard deviation is not known. This test is preferred where the number of samples is less than 30. The t-test can be a one-sample, two-sample, or paired t-test.
The two-sample t-test is used when the compared groups are independent, and the paired t-test is used when the observations in the compared groups are paired (for example, the marks obtained by a student in the same subject before and after training).
The f-test is used to compare the variances of two or more groups. The most used f-test is the analysis of variance, or ANOVA; a less commonly used application arises in regression analysis. In all the above tests, the null hypothesis states that there is no difference between means or variances, and the alternative hypothesis suggests otherwise. You are encouraged to explore the different tests in detail.
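The one-way ANOVA F statistic – the ratio of between-group to within-group variance – can be computed by hand; a minimal sketch with made-up scores for three groups:

```python
from statistics import mean

def f_statistic(groups):
    """One-way ANOVA F statistic: between-group vs within-group mean squares."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical scores for three groups; a large F suggests the means differ.
f = f_statistic([[4, 5, 6], [7, 8, 9], [10, 11, 12]])
print(round(f, 2))  # → 27.0
```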
Next, we will talk about nonparametric testing.
In the earlier sections, we learned about parametric testing like one sample, two sample and paired t-tests. They all had some or the other assumptions such as -
The individual samples were approximately normal,
Samples came from populations with approximately equal variance,
The sample size was greater than 30, etc.
We might come across data that doesn’t satisfy the above assumptions. If the samples are still independent, with approximately equal variance, we use nonparametric tests in lieu of t-tests.
These are also referred to as “distribution-free” tests, as they don’t make assumptions about the distribution of the data. They have lower power than parametric tests and hence are given second preference, after the parametric tests.
These tests typically focus on the median rather than the mean. They involve straightforward procedures like counting and ordering. In fact, there is at least one nonparametric test for each parametric test, and they are classified into the following categories.
We will now see the parametric test and the nonparametric alternative for each category:
Usually, when we have two samples that we want to compare concerning their mean value for some variable of interest, we would use the t-test for independent samples or analysis of variance for multiple groups.
The nonparametric alternative for this test is Mann-Whitney U test.
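The Mann-Whitney U statistic itself is simple to compute by counting pairwise comparisons; a minimal sketch (the tie handling below uses the common 0.5 convention, and no normal approximation or p-value is computed):

```python
def mann_whitney_u(x, y):
    """U statistic via pairwise comparisons; ties between samples count 0.5."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1
            elif xi == yj:
                u += 0.5
    return u

# Hypothetical samples: U near n1*n2/2 means heavy overlap; U near 0 or n1*n2
# means the two samples are well separated.
u = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u)  # → 0.0 (every x value is below every y value)
```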
If we want to compare two variables measured on the same sample, we would customarily use the t-test for dependent samples.
A nonparametric alternative to this test is the Sign test.
To express a relationship between two variables one usually computes the correlation coefficient.
Nonparametric equivalents to the standard correlation coefficient are Spearman R, Kendall Tau, and coefficient Gamma.
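Spearman's R, for example, is just the correlation of ranks; a minimal sketch using the classic formula 1 − 6Σd²/(n(n² − 1)), which assumes no tied values:

```python
def spearman_rho(x, y):
    """Spearman rank correlation via 1 - 6*sum(d^2)/(n(n^2-1)); assumes no ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# A perfectly monotonic (even if nonlinear) relationship gives rho = 1.
print(spearman_rho([1, 2, 3, 4], [1, 4, 9, 16]))  # → 1.0
```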
Appropriate nonparametric statistics for testing the relationship between two categorical variables are the Chi-square test, the Phi coefficient, and the Fisher exact test.
In addition, a simultaneous test for relationships between multiple cases is available: Kendall coefficient of concordance.
This test is often used for expressing inter-rater agreement among independent judges who are rating (ranking) the same stimuli.
In the next section, we will look at a tabulation of the various tests.
Here you can see the tabulation of the nonparametric and parametric test for better understanding and comparison.
For one qualitative response variable, the parametric choice is a one-sample test and the nonparametric choice is the sign test.
For one quantitative response variable with two values from paired samples, the parametric test is the paired-sample t-test and the nonparametric test is the Wilcoxon signed rank test.
For one quantitative response variable and one qualitative independent variable with two groups, the parametric test is the two-independent-sample t-test and the nonparametric test is the Wilcoxon rank sum (Mann-Whitney U) test.
For one quantitative response variable and one qualitative independent variable with three or more groups, the parametric test is ANOVA and the nonparametric test is Kruskal-Wallis.
To quickly summarize what we have learned in this statistical concepts and their application in business tutorial, we have discussed:
Overview of statistical methods
Descriptive statistics – Measures of Central Tendency and Measures of Dispersion
A business case study to understand the concepts of descriptive statistics
Probability distribution
What are tests of significance
The process flow of hypothesis testing
One-sided and two-sided hypothesis tests
Various tests used in calculating the p-value
What is nonparametric testing and why is it used
Nonparametric alternatives for the usual tests of significance
With this, we come to the end of the Statistical Concepts and Their Application in Business tutorial. In the next chapter, we will discuss the Basic Analytic Techniques - Using R tutorial.