What is Data Science?
A problem faced by many beginners in Data Science, especially those looking for data science jobs, is that the field has only gained clear classification relative to other disciplines very recently. In the past it was suggested as an alternative name for computer science and statistics, and it was not until the 21st century that any agreed-upon parameters for it emerged. Even now it lacks a universally accepted definition, and sceptics have at times dismissed it as nothing more than a buzzword.
However, it is generally agreed that data science utilises and unifies theories from multiple fields, including but not limited to mathematics, statistics, information science and computer science. It shares many commonalities with the latter two while still having significant differences. To get an idea of the relevance of data science, it has been termed the "fourth paradigm" of science (alongside the pre-existing theoretical, empirical and computational paradigms) by Jim Gray, a Turing Award-winning computer scientist.
This article gives a basic walkthrough of the different aspects of data science: the various roles available to students of data science, its commonalities with related fields like information science and computer science, the prerequisites for data science, its applications, and other topics.
Prerequisites for Data Science
The prerequisites for Data Science vary in specificity depending upon the role in question. Listed below are the prerequisites for Data Science at the beginner level.
- 1. A working knowledge of statistics. Statistics is arguably the foundational subject on which data science is built. As such, those seeking a career in data science must be familiar with the different statistical methods utilised in the field, ranging from Bayes' theorem to probability distributions.
- 2. A working knowledge of artificial intelligence, machine learning and neural networks. AI is the branch of computer science that enables machines to mimic human behaviour, and it is a superset of machine learning, deep learning and neural networks. Machine learning and its algorithms focus on teaching machines how to learn from data; it can be broadly classified into three groups: supervised, unsupervised and reinforcement learning. Neural networks are a processing architecture inspired by their biological equivalent; they are a component of AI, ML and DL, and they perform data processing through multiple layers of neurons.
- 3. Programming skills in high-level languages like R and Python, along with their associated libraries. These are useful for tasks such as data visualisation and statistical analysis, and, unlike in computer science, only a working knowledge of these high-level languages is needed rather than the full range of languages expected of developers.
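As a taste of the statistical analysis mentioned above, here is a minimal sketch using only Python's standard library (real projects would typically reach for pandas, NumPy, or R instead; the sales figures are made up for illustration):

```python
# Basic descriptive statistics with Python's standard library.
from statistics import mean, median, stdev

# Hypothetical sample: monthly sales figures
sales = [120, 135, 150, 110, 160, 145, 155, 130]

print("mean:  ", mean(sales))    # average value -> 138.125
print("median:", median(sales))  # middle value  -> 140
print("stdev: ", stdev(sales))   # spread of the data around the mean
```

Summaries like these are usually the first step of any analysis, before moving on to heavier tooling.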
The Steps involved in Solving a Problem using Data Science:
As mentioned before, solving problems using Data Science is a relatively new concept. The idea of using data to make predictions from past precedents, however, is hardly alien. The broad steps for solving such problems are outlined below to give a general idea of the process:
- The Planning Phase
- This phase is often termed defining the problem statement or defining the business requirements, depending upon the situation. Either way, this phase of a Data Science project involves clearly defining the objectives, whether that means overcoming a problem or fulfilling a requirement. This step matters because it allows project managers to develop a clear plan for the project: its expected lifetime, budget, resource allocation and so on.
- Data Collection
- Data collection is the first and arguably most fundamental task of a data science project. However, in many original research projects or similar high-end projects there is no ready-made database where the data comes pre-structured and cleansed. Instead, teams have to rely on data scraping, surveys or similar methods to gather the relevant data.
- Data Cleaning
- This is arguably the most tedious stage of a Data Science project and, as such, the least liked among data scientists. It generally involves removing redundant or missing data, or any other data unsuitable for the database and its intended project. While time-consuming, it is an essential stage, as it reduces the chance of incorrect predictions.
- Data Analysis
- This is the phase of the data science project where relevant information is extracted from the cleaned data. By observing the patterns and trends in the data, useful information and insights can be gathered. This culminates in the formulation of hypotheses from the data.
- Data Modelling
- This stage of the Data Science project involves creating a model, such as a machine learning model, to solve the problem. The model is trained and tested on separate portions of the data, obtained through a process of data splitting (sometimes called data splicing): the entire dataset is divided into two portions, one used to train the model and one used to test its performance, termed the training set and the testing set respectively.
- Optimisation and Deployment
- Optimisation and deployment are the final steps of the Data Science project's life cycle, essentially consisting of final tweaking of the model before it is deployed.
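The cleaning and splitting steps above can be sketched in plain Python (the field names and rows here are hypothetical; real projects would typically use pandas or a similar library):

```python
# Sketch of data cleaning followed by a train/test split, in plain Python.
import random

raw_rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},  # missing value -> removed in cleaning
    {"age": 29, "income": 48000},
    {"age": 45, "income": None},     # missing value -> removed in cleaning
    {"age": 29, "income": 48000},    # duplicate -> removed in cleaning
    {"age": 51, "income": 75000},
]

# Data cleaning: drop rows with missing fields, then drop duplicates.
complete = [r for r in raw_rows if None not in r.values()]
seen, cleaned = set(), []
for r in complete:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        cleaned.append(r)

# Data splitting: shuffle, then hold out 25% of the rows for testing.
random.seed(0)  # fixed seed so the split is reproducible
random.shuffle(cleaned)
cut = int(len(cleaned) * 0.75)
train_set, test_set = cleaned[:cut], cleaned[cut:]

print(len(cleaned), "clean rows ->", len(train_set), "train /", len(test_set), "test")
```

The 75/25 ratio is only a common convention; the right split depends on how much data is available.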
Common Jobs to Find in Data Science
Despite lacking a clear, universally accepted definition, as described before, Data Science has seen rapid growth in the number of people who consider themselves to be working in it. Reasons for this include the rapid growth of data generated across many industries, the growth of Big Data and related fields, and a better understanding of what Data Science consists of, which allows for better discovery of its possible applications. Some of the most common jobs in Data Science are:
1. Data Scientists
Data Scientists are not limited to analysing data; they perform tasks across the whole pipeline, from information collection to analysis. This requires a thorough understanding of the data pipeline, and data scientists today combine data analysis with data mining, computer science, statistical methods and machine learning.
2. Data Analysts
Data analyst is often an entry-level role on the path to data scientist. As such, the requirements in areas like programming and algorithm design are leaner for data analysts than for data scientists. They rely on programming knowledge to collect, organise and analyse data in order to gain insight and information from it.
3. Data Engineers
Data engineers, unlike the above two, are not involved in analytical tasks. Instead, they focus on the architecture and operation of the data pipeline. As such, they are responsible for designing, building and managing the information systems that collect, store and retrieve the information used by the above two roles.
4. Data Architects
As the name suggests, the relation between this role and the former is similar to that between architects and engineers: data architects are responsible for designing the data management pipeline, while data engineers are mostly responsible for building it. There is considerable overlap between the two, and in smaller companies a single person may hold both roles.
The Comparisons Between Data Science and Similar Fields
Previously, it was discussed that data science, computer science and information science are similar in nature; not only do they have much in common, they even overlap in some situations. It is therefore worth having a clear idea of where the differences between these fields lie and, depending on the role, what strengths and weaknesses each might carry.
Outside of this, it is also reasonable to compare statistics and data science, since it has been argued not only that data science descended from statistics, but that it is nothing more than a synonym for statistics. The argument against this is that, unlike statistics, data science focuses on data unique to the digital medium. Both fields focus on quantitative data, but statistics also emphasises data description, whereas data science also handles qualitative data while emphasising prediction and action.
The primary characteristics of data science are as follows:
- Data Science is the study of data, whether structured, semi-structured or unstructured in nature.
- Data Scientists collect and analyse data in various forms in order to solve specific questions or problems.
- The primary objectives of data scientists are data mining, i.e. finding patterns in data, and data transformation, i.e. changing the structure of data.
The primary characteristics of computer science are as follows:
- Computer Science focuses on the operation of computer hardware, software and systems.
- Computer scientists focus on the functional workings of computer systems and other aspects of computers and computation.
- The objectives of computer scientists vary based on their specialisation, which can include UX/UI design, web or application development, or cybersecurity.
The primary characteristics of information science are as follows:
- Information science is the oldest of the three fields, if its older incarnation is counted. In its modern form, information science can be broadly defined as tackling problems from the perspective of the stakeholders involved and applying information and technology as required.
- Some of the career choices in information science are information scientist, systems analyst and information professional.
Common Terms for Data Science
The field of Data Science is multidisciplinary in nature, so there are many terms that those interested in the field should know. Listed below are some terms commonly used in data science, often referring to common methods and tools:
5Ps of Data Science: An acronym referring to the five factors that retailers and enterprises optimise through Data Science: product, price, promotion, place and people. Tweaking the balance between them is the main objective.
Backpropagation: Backpropagation is a method used to minimise error in a neural network when the predicted output differs greatly from the actual output. It works by determining the error at the output and propagating it back through the network, updating the weights and biases to minimise the error. It is useful in data science whenever practitioners work with neural networks.
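A toy illustration of the idea, for a single linear neuron trained by gradient descent (real networks have many layers, and libraries like Keras handle the propagation automatically; the numbers here are made up):

```python
# One-neuron sketch of the backpropagation idea: compute the output
# error, then push it back to update the weight and bias.
w, b = 0.0, 0.0   # weight and bias, initially zero
lr = 0.1          # learning rate

# Training examples follow y = 2x, so the ideal parameters are w=2, b=0.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

for _ in range(200):
    for x, y in samples:
        y_hat = w * x + b      # forward pass: compute the prediction
        error = y_hat - y      # error at the output
        # backward pass: gradient of the squared error w.r.t. w and b
        w -= lr * error * x
        b -= lr * error

print(round(w, 3), round(b, 3))   # w approaches 2, b approaches 0
```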
Bagging: Bagging, or bootstrap aggregating, involves combining predictions from multiple different models, each trained on a bootstrap resample of the data, to create a final prediction.
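A minimal sketch of the technique, where each "model" is simply the mean of its bootstrap resample so the example stays self-contained (the data values are hypothetical):

```python
# Bagging sketch: fit many trivial models on bootstrap resamples
# (sampling with replacement) and average their predictions.
import random
from statistics import mean

random.seed(42)
data = [3.1, 2.9, 3.4, 3.0, 3.3, 2.8, 3.2]

predictions = []
for _ in range(100):
    # Bootstrap resample: draw len(data) values with replacement.
    resample = [random.choice(data) for _ in data]
    predictions.append(mean(resample))  # one "model's" prediction

# The bagged prediction is the average over all models.
bagged = mean(predictions)
print(round(bagged, 2))
```

In practice the individual models would be decision trees or other real learners, but the combining step is the same.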
Bayes Theorem: A probability theorem used to calculate conditional probability. Its formula is P(A|B) = P(B|A)·P(A) / P(B), where the left-hand side is the probability of event A given that event B has occurred. This is one of the most commonly applied probability results in data science.
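A worked example with hypothetical numbers, for a test that detects a condition affecting 1% of a population:

```python
# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B), with made-up numbers.
p_a = 0.01              # P(A): prior probability of having the condition
p_b_given_a = 0.95      # P(B|A): probability the test is positive given the condition
p_b_given_not_a = 0.05  # false positive rate

# P(B): total probability of a positive test, over both cases
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 3))   # 0.161: a positive test is far from conclusive
```

The counter-intuitively low posterior, driven by the rare prior, is exactly the kind of result that makes Bayes' theorem valuable in practice.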
Data Mining: It is the study of extracting useful information from structured or unstructured data. It possesses obvious applications in data science in many occupational roles.
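As a tiny illustration of pattern finding, the sketch below counts which items most often appear together in a set of hypothetical shopping baskets:

```python
# Co-occurrence counting: a minimal form of pattern mining.
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

# Count every pair of items that occurs together in a basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))   # ('bread', 'milk') co-occurs most often
```

Real data mining systems scale this idea up with algorithms such as Apriori, but the underlying goal of surfacing frequent patterns is the same.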
Dplyr: Dplyr is a data manipulation package for the R language. It can work with both local and remote datasets for data manipulation and cleaning.
Flume: A tool designed for streaming data logs into the Hadoop environment. Multiple Flume agents can be configured simultaneously to collect and aggregate large amounts of data.
Ggplot2: This is another data visualisation package for the R programming language, primarily used for creating plots.
Hadoop: It is an open source framework that uses parallel processing to handle large quantities of data.
Hive: Hive, or Apache Hive, is data warehouse software built on top of Apache's earlier Hadoop framework. It processes structured data in Hadoop while providing data summarisation, query and analysis through an SQL-like interface.
Keras: A neural network library written in Python. It can run on top of TensorFlow and Theano to make working with neural networks easier.
Conclusion
Despite the confusion around Data Science's definition in its early stages, the field has not stopped growing, and what falls under it and what applications it may have are now better understood. As the points above show, there is now a clearer picture of how Data Science works, some of the common tools it uses, what it offers those in the field, and what its prerequisites are.