Fundamentals of Statistics in Data Science
Evan Esar aptly stated, “Statistics is the only science that enables different experts using the same figures to draw different conclusions.” Undoubtedly, statistics provides a lot of scope for experimentation and learning.
Data science is a dynamic field that has recently won over everyone who enjoys working with numbers, algorithms and technology. It is therefore crucial for aspiring data scientists to equip themselves with the fundamentals of data science.
A firm grasp of basic statistics is armour for budding data scientists: it protects them from errors and makes learning much smoother.
If you need more arrows in your quiver, keep reading.
Regression is used to determine the relationship between a dependent variable and one or more independent variables. It estimates how the value of the dependent variable changes when one of the independent variables changes. Regression is widely used in prediction and forecasting, and several methods have been devised for carrying out a regression analysis on given data.
- Linear Regression:
- This is the simplest form of regression and uses the equation of a straight line, y = mx + c, where ‘m’ is the slope of the line and ‘c’ is the intercept (the value of y when x is 0). In simple terms, if we plot the independent variable on the X-axis and the dependent variable on the Y-axis, the points fall roughly along a straight line. For instance, if the demand for smartphones in the market increases, the supply must also increase to meet the demand and reach equilibrium. In this case, as the value on the X-axis increases, the value on the Y-axis increases as well.
To understand linear regression, you should first be familiar with basic terms such as correlation, variance, standard deviation, normal distribution and residual.
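To make the straight-line equation concrete, here is a minimal sketch that fits y = mx + c by ordinary least squares using only the Python standard library. The function name `fit_line` and the demand and supply figures are hypothetical, invented purely for illustration; they are not real market data.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (m, c) for the least-squares line y = m*x + c."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line always passes through (x_bar, y_bar).
    c = y_bar - m * x_bar
    return m, c


# Hypothetical data: smartphone demand (x) and supply (y), in thousands of units.
demand = [10, 20, 30, 40, 50]
supply = [12, 19, 33, 41, 48]

m, c = fit_line(demand, supply)

# A residual is the gap between an observed value and the fitted line.
residuals = [y - (m * x + c) for x, y in zip(demand, supply)]

print(f"slope m = {m:.2f}, intercept c = {c:.2f}")
print(f"predicted supply at demand 60: {m * 60 + c:.1f}")
```

The residuals computed at the end are the "residual" term listed among the prerequisites: with a least-squares fit that includes an intercept, they sum to zero (up to rounding).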
- Multiple regression:
- This technique is used to determine how a dependent variable varies when several independent variables change. For instance, suppose a census is conducted to determine the quality of houses.
- There are various determinants of housing quality, such as the size of the kitchen, the number of bedrooms, the space in the balconies and the price of the house. To predict whether a house is worth living in, the role of every factor has to be considered. Thus, to predict a target value, many independent variables need to be studied.
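The housing example can be sketched as a small multiple regression. The code below is one possible approach, not the only one: it solves the least-squares normal equations with plain Gauss-Jordan elimination so that it needs no third-party libraries. The helper names `solve` and `fit_multiple`, the two chosen factors and the quality scores are all hypothetical, and the sketch assumes the factors are not perfectly correlated with each other.

```python
def solve(a, b):
    """Solve the linear system a * x = b by Gauss-Jordan elimination.

    Assumes the system has a unique solution (non-singular matrix).
    """
    n = len(b)
    # Build the augmented matrix [a | b]; copying leaves the inputs untouched.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_multiple(rows, ys):
    """Return least-squares coefficients [intercept, b1, b2, ...]."""
    # Prepend a constant 1 to every row so the first coefficient is the intercept.
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations: (X^T X) * beta = X^T y.
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical houses: (kitchen size in square metres, number of bedrooms).
houses = [(8, 1), (10, 2), (12, 2), (15, 3), (9, 3), (20, 4)]
# Hypothetical quality scores assigned to those houses.
quality = [7.2, 9.8, 11.1, 14.4, 11.6, 19.0]

coeffs = fit_multiple(houses, quality)
intercept, per_sq_metre, per_bedroom = coeffs

# Predicted score for a new house with a 14 square metre kitchen and 3 bedrooms.
predicted = intercept + per_sq_metre * 14 + per_bedroom * 3
print(f"coefficients: {[round(v, 3) for v in coeffs]}")
print(f"predicted quality score: {predicted:.2f}")
```

Each coefficient estimates how much the target changes when that one factor increases by one unit while the others are held fixed. In practice a numerical library would normally replace the hand-written solver.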
Statistical distributions describe how the values of a variable are spread out. Three of the most commonly used are:
- Normal distribution: This gives a bell-shaped curve when the data is plotted. Most of the data values cluster around the mean. It applies to continuous values and is commonly used to model quantities such as height, blood pressure and pulse.
- Binomial distribution: It is used on discrete values where each trial has only two outcomes, such as ‘yes’ or ‘no’. For instance, each throw of a die either shows a 6 or it does not, and the binomial distribution gives the probability that a 6 occurs a given number of times in a fixed number of throws.
- Poisson distribution: It is used to model the number of times an event occurs in a fixed interval of time or space. It finds use in industry and real estate, where the quantities of interest are discrete counts. For example, it can be used to model the number of calls received in an hour, the number of births per hour or the number of students absent in a day.
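The three distributions above can be explored with a short sketch using only the Python standard library (`statistics.NormalDist` and `math.comb` require Python 3.8 or later). The helper names `binomial_pmf` and `poisson_pmf` are introduced here for illustration, and the parameters (mean height of 170 cm, standard deviation of 10 cm, an average of 5 calls per hour) are assumed values, not measurements.

```python
from math import comb, exp, factorial
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent yes/no trials,
    each succeeding with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k events in an interval when events occur
    at an average rate of lam per interval."""
    return lam**k * exp(-lam) / factorial(k)


# Normal: assumed heights with mean 170 cm and standard deviation 10 cm.
heights = NormalDist(mu=170, sigma=10)
# Share of values within one standard deviation of the mean (160 to 180 cm).
within_one_sd = heights.cdf(180) - heights.cdf(160)

# Binomial: chance of exactly two 6s in ten throws of a fair die.
two_sixes = binomial_pmf(2, 10, 1 / 6)

# Poisson: chance of exactly 3 calls in an hour, assuming 5 calls per hour on average.
three_calls = poisson_pmf(3, 5)

print(f"P(height between 160 and 180 cm) = {within_one_sd:.4f}")
print(f"P(exactly two 6s in ten throws)  = {two_sixes:.4f}")
print(f"P(exactly 3 calls in an hour)    = {three_calls:.4f}")
```

Note the difference in inputs: the normal distribution is queried over a range of continuous values, while the binomial and Poisson functions return the probability of one specific whole-number count.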
Beyond these basics, there are other regression techniques and statistical distributions as well. In addition, there are theorems and algorithms for working with data, such as Bayes' theorem, the k-nearest neighbors algorithm (often termed a lazy algorithm because it builds no model in advance and defers all computation until a prediction is requested) and bootstrap aggregating (bagging).