Data visualization is one of the most important steps in gaining insight into data. It is up to us to decide which of the many tactics and plot types to use. There are many plot types to choose from, including barplots, histograms, pie charts, scatterplots, boxplots, lineplots, time series plots, jointplots, bubble plots, geoplots, distplots, funnel charts, heatmaps, and 3-D surface plots. It can be difficult to decide which one we want, but the correct question is: which plot does the data need? Personally, I often struggle with this step and am unsure which option to pick; I have spent hours selecting the right layout.
Why data visualization is needed:
- Recognize the trends and patterns in the data.
- Analyze the occurrence and other relevant data properties.
- Understand how the data’s variables are distributed.
- Visualize any possible connections between the various variables.
The abovementioned four processes are extremely important for comprehending what the data means, what it wants to communicate, and what kinds of predictions we want from the dataset. Understanding the data is one thing; selecting the right model is quite another.
The data can be univariate, bivariate, or multivariate depending on how many variables are present in it. For instance, a dataset is considered univariate if it contains just one variable of interest.
In this blog, we will focus primarily on different univariate data visualizations and their implementations in Python and R. Univariate analysis essentially summarizes one variable at a time and searches for patterns in it.
DOT PLOT
A dot plot, also commonly referred to as a strip plot or dot chart, is a straightforward type of data visualization in which data points are represented as dots on a graph with an x- and y-axis. These graphs are used to visually represent particular data patterns or groupings. A dot plot displays the quantitative values in the data and shows how frequently each value occurs. In simple terms, it can be considered equivalent to a bar graph, except that frequencies are represented by stacked points rather than rectangular blocks.
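To make the frequency idea concrete, here is a minimal sketch of the counting that a dot plot visualizes, using only the Python standard library. The helper name text_dot_plot is hypothetical and introduced only for illustration; it draws one "o" per occurrence of each value.

```python
from collections import Counter

def text_dot_plot(data):
    """Return a text dot plot: one row per distinct value, one dot per occurrence."""
    counts = Counter(data)
    rows = []
    for value in sorted(counts):
        rows.append(f"{value:>3} | " + "o" * counts[value])
    return "\n".join(rows)

data = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10]
print(text_dot_plot(data))
```

The value 4 occurs three times in this sample, so its row carries three dots, which is the same information a bar of height 3 would carry in a bar graph.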
Advantages and Disadvantages
ADVANTAGES:
- They serve as an effective approach to depict frequencies and ratios.
- Because they display the outliers and population distributions in the data, they are more in-depth.
- Better results can be observed by using different colors and changing the shape and size of the marker for representations.
DISADVANTAGES:
- It is typically challenging to read exact frequencies from a dot plot. When a value occurs very often, the dots have to be counted individually, which might not be practical.
- Not recommended for huge datasets because the points end up being congested and challenging to read.
Implementation of dot chart in Python
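import matplotlib.pyplot as plt
from collections import Counter
# Define the data
data = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10]
# Stack one dot per occurrence of each value so the height shows its frequency
counts = Counter(data)
x = [value for value, n in counts.items() for _ in range(n)]
y = [k for n in counts.values() for k in range(1, n + 1)]
# Create the dot plot
plt.plot(x, y, 'o', markersize=8)
# Add a title and x- and y-axis labels
plt.title("Dot plot in python")
plt.xlabel('Data Point')
plt.ylabel('Frequency')
# Show the plot
plt.show()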
Implementation of dot chart in R
data <- c(1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10)
dotchart(data, main = "Dot Plot in R")
HISTOGRAMS
A histogram is a graphic description of a dataset’s distribution and an estimation of its probability distribution for a continuous variable. In a histogram, the data is divided into a number of bins, and the height of a bar indicates how many observations fall into each bin.
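The binning step described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not how any particular plotting library implements it: the helper name bin_counts is hypothetical, the bins are assumed to be equal-width, and the data is assumed to contain at least two distinct values.

```python
def bin_counts(data, bins):
    """Count how many observations fall into each of `bins` equal-width bins."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / bins
    counts = [0] * bins
    for value in data:
        # The maximum value belongs to the last bin, so clamp the index.
        index = min(int((value - lo) / width), bins - 1)
        counts[index] += 1
    return counts

data = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10]
print(bin_counts(data, 5))  # bar heights with 5 bins
print(bin_counts(data, 3))  # same data, coarser bins
```

With five bins the bar heights are [3, 5, 3, 3, 2], while three bins give [5, 6, 5]. The same data looks right-skewed in the first case and almost flat in the second, so the number of bins is a choice that shapes what the reader sees.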
Advantages and Disadvantages
ADVANTAGES:
- It is easy to use and adaptable.
- In comparison to scatter or dot plots, histogram graphs tell the story better.
- Typically, a bell-shaped histogram suggests a normal distribution, while spikes in the graph show variation that must be dealt with.
DISADVANTAGES:
- It depends (too heavily) on the maximum and minimum of the variable.
- It heavily depends on how many bins there are.
- It is impossible to distinguish between continuous and discrete variables.
- It obscures the exact values of individual observations, because each one is absorbed into a bin.
Implementation of histogram in Python
Using the seaborn library, you can make a histogram in Python. An illustration of a Python histogram can be seen here:
import matplotlib.pyplot as plt
import seaborn as sns
# Define the data
data = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10]
# Create the histogram
sns.histplot(data)
plt.title("histogram in python")
# Show the plot
plt.show()
Implementation of histogram in R
The hist() function from the graphics package in R can be used to generate a histogram. Here is an example of a histogram made in R:
data <- c(1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10)
hist(data, main = "Histogram in R", xlab = "Data Point", ylab = "Frequency", col = "steelblue")
DENSITY PLOTS
A density plot is like a smoothed version of a histogram. The curve is normalized so that the total area under it equals 1.
It is most useful when examining the distribution of one variable across groups of another variable, which is sometimes called a segmented univariate distribution.
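As a rough sketch of what "smoothed" and "normalized" mean here, the following builds a Gaussian kernel density estimate by hand and checks that the area under the curve is close to 1. The function name gaussian_kde, the bandwidth of 1.0, and the integration grid are all assumptions chosen for illustration; plotting libraries pick their own defaults.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Each observation contributes one bell curve centred on itself.
        return norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)

    return density

data = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10]
density = gaussian_kde(data, bandwidth=1.0)

# Approximate the area under the curve with the trapezoidal rule on [-10, 20].
n_steps = 3000
step = 30 / n_steps
xs = [-10 + i * step for i in range(n_steps + 1)]
ys = [density(x) for x in xs]
area = sum((ys[i] + ys[i + 1]) / 2 * step for i in range(n_steps))
print(round(area, 3))
```

Because every observation contributes a small bell curve whose own area is 1/n, the summed curve integrates to 1 regardless of how many data points there are.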
Advantages and Disadvantages
ADVANTAGES:
- Since they are unaffected by the number of bins, they are effective at revealing the shape of a distribution.
- It is not heavily influenced by plot sizes.
DISADVANTAGES:
- The plot depends on choosing the right bandwidth to display the data in the best manner possible; if the bandwidth is chosen poorly, the data may be distorted by over- or under-smoothing.
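The effect of the bandwidth can be illustrated by counting the peaks of the estimated curve for a very small and a very large bandwidth. This is only a sketch: the helper names kde_curve and count_peaks are hypothetical, and the bandwidths 0.3 and 3.0 are arbitrary values chosen to exaggerate the difference.

```python
from statistics import NormalDist

def kde_curve(data, bandwidth, grid):
    """Evaluate a Gaussian KDE on `grid` by averaging one normal pdf per observation."""
    kernels = [NormalDist(mu=xi, sigma=bandwidth) for xi in data]
    return [sum(k.pdf(x) for k in kernels) / len(kernels) for x in grid]

def count_peaks(ys):
    """Count strict local maxima in a sequence of curve heights."""
    return sum(1 for i in range(1, len(ys) - 1) if ys[i - 1] < ys[i] > ys[i + 1])

data = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10]
grid = [i * 0.05 for i in range(-100, 321)]  # from -5.0 to 16.0

narrow = count_peaks(kde_curve(data, 0.3, grid))  # under-smoothed
wide = count_peaks(kde_curve(data, 3.0, grid))    # over-smoothed
print(narrow, wide)
```

The narrow bandwidth produces a bumpy curve with many more peaks than the wide one, which smooths most of the structure away. Neither extreme describes the data well, which is why the bandwidth needs care.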
Implementation of density chart in Python
By leveraging the seaborn library in Python, a density chart may be produced. Here is an illustration of a Python density chart:
import seaborn as sns
import matplotlib.pyplot as plt
# Define the data
data = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10]
# Create the density chart
sns.kdeplot(data, color='g')
plt.title("Density chart in Python")
# Add x axis label
plt.xlabel('Data Point')
# Show the plot
plt.show()
Implementation of density chart in R
The stats package’s density() function in R can be used to generate a density chart. Here is an illustration of how to use R’s density() function to make a density chart:
data <- c(1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10)
plot(density(data), col = "steelblue", main = "Density plot in R")
BOX PLOTS
Box plots show the distribution of the data through a five-number summary:
The minimum, first quartile, second quartile (median), third quartile, and maximum.
They are preferable when examining the distribution, center, and skewness of a quantitative feature in the dataset. These charts are also well suited to spotting outliers.
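The five-number summary and the usual outlier check can be computed directly with the Python standard library. The sketch below makes two assumptions worth stating: it uses the "inclusive" quartile method of statistics.quantiles (libraries differ slightly in how they interpolate quartiles), and it flags outliers with the common 1.5 × IQR rule. The helper names are hypothetical.

```python
from statistics import quantiles

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

def outliers(data, k=1.5):
    """Return the values lying more than k * IQR beyond the quartiles."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    return [x for x in data if x < q1 - k * iqr or x > q3 + k * iqr]

data = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10, 21]
print(five_number_summary(data))
print(outliers(data))
```

For this sample the quartiles are 3, 5, and 8, so the interquartile range is 5 and anything above 8 + 7.5 = 15.5 is flagged. The value 21 is the only point beyond that fence, which is why it would appear as a separate marker on the box plot.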
Advantages and Disadvantages
ADVANTAGES:
- Provides a visual summary of variation in huge datasets.
- Reveals anomalies.
- Allows various distributions to be compared.
- Hints at some skewness and symmetry.
DISADVANTAGES:
- It is often difficult to find the mean.
- For some, it may seem confusing.
- Conceals the multimodality in the dataset as well as other distributional characteristics.
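The last point can be demonstrated with two small made-up samples that share the same five-number summary even though one is evenly spread and the other has two clusters. The sample values and the helper name are invented for illustration, and the quartiles again assume the "inclusive" method of statistics.quantiles.

```python
from statistics import multimode, quantiles

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9]
two_clusters = [1, 3, 3, 3, 5, 7, 7, 7, 9]

print(five_number_summary(evenly_spread))
print(five_number_summary(two_clusters))
print(multimode(two_clusters))  # the two most frequent values
```

Both samples would produce the same box, yet the second one has two modes, at 3 and 7. A histogram, density plot, or violin plot would reveal that difference.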
Implementation of box plot in Python
Using the seaborn package, you can build a box plot in Python. Here is an illustration of a box plot written in Python:
import matplotlib.pyplot as plt
import seaborn as sns
# Define the data
data = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10, 21]
# Create the box plot (x= draws it horizontally)
sns.boxplot(x=data, color="steelblue")
plt.title("Box plot in python")
# Show the plot
plt.show()
Implementation of box plot in R
Using the boxplot() function from the graphics package, you may make a box plot in R. Here is an illustration of how to use R’s boxplot() function to produce a box plot:
data <- c(1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10, 21)
boxplot(data, col = "steelblue", main = "Boxplot in R", horizontal = TRUE)
VIOLIN PLOTS
A violin plot is a graphical representation of a probability density function that displays the distribution of a continuous variable using a kernel density estimate (KDE). The violin’s breadth represents the probability density of the data at various values, and it combines a box plot and a kernel density plot.
Advantages and Disadvantages
ADVANTAGES:
- Utilized to evaluate the dispersion of numerical data and are particularly helpful for comparing distributions between different groups.
- Violin graphs are visually simple and appealing.
- These graphs function well even for datasets that are not normally distributed, making this tool useful when the dataset is relatively small.
DISADVANTAGE:
- The only potential drawback that comes to mind is that newcomers who are unfamiliar with the representation may find it difficult to absorb or understand the information in it.
Implementation of violin plot in Python
By leveraging the seaborn package in Python, a violin plot can be produced. Here is an illustration of a Python violin plot:
import matplotlib.pyplot as plt
import seaborn as sns
# Define the data
data = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10]
# Create the violin plot (x= draws it horizontally)
sns.violinplot(x=data, color="steelblue")
plt.title("Violin plot in Python")
# Show the plot
plt.show()
Implementation of violin plot in R
Using the vioplot() function from the vioplot package, you can make a violin plot in R. Here is an illustration of how to use the vioplot() function in R to produce a violin plot:
# install.packages("vioplot")
library(vioplot)
data <- c(1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10)
vioplot(data, col = "steelblue", horizontal = TRUE, main = "Violin plot in R")
DISTPLOTS
A distplot is a graph that combines a histogram and a density plot. It is used to display how a group of numerical values are distributed.
In Python, this plot comes from seaborn, which is built on top of the matplotlib plotting library. A seaborn distplot displays a histogram with a density line drawn over it.
Advantages and Disadvantages
ADVANTAGES:
- It offers a visual depiction of a dataset’s distribution, which is helpful for spotting trends and outliers.
- By overlaying several datasets on top of one another, it makes them simple to compare.
- It can be used to determine whether different modes are present in the data.
- For detecting skewness and kurtosis in a dataset, it is a potent tool.
- Finding outliers in your data may be done quickly and easily with this method.
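Skewness and kurtosis, mentioned in the list above, can also be computed numerically to back up what the plot suggests. The sketch below uses the moment-based definitions with population (divide by n) moments; this is one common convention, and statistical packages may apply small-sample corrections that give slightly different numbers. The function name is hypothetical.

```python
from statistics import fmean

def skewness_and_kurtosis(data):
    """Return (skewness, excess kurtosis) based on population central moments."""
    mu = fmean(data)
    n = len(data)
    m2 = sum((x - mu) ** 2 for x in data) / n
    m3 = sum((x - mu) ** 3 for x in data) / n
    m4 = sum((x - mu) ** 4 for x in data) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

data = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10]
skew, kurt = skewness_and_kurtosis(data)
print(f"skewness={skew:.2f}, excess kurtosis={kurt:.2f}")
```

A positive skewness, as this sample has, corresponds to the longer right-hand tail visible in the plot; a value near zero would indicate a roughly symmetric distribution.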
DISADVANTAGES:
- When dealing with small datasets, it might be deceptive because it might not truly reflect the underlying distribution of the data.
- The interpretation of the plot may be sensitive to the selection of the bin width and number.
- Being limited to continuous variables, it is not appropriate for categorical data.
- Multiple distributions with varied scales and shapes can be challenging to compare.
- When working with huge datasets that contain many observations, it can be challenging to interpret the plot.
Implementation of distplot in Python
A distplot can be produced in Python by leveraging the seaborn library. Here is an illustration of a Python distplot:
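import matplotlib.pyplot as plt
import seaborn as sns
# Define the data
data = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10]
# Create the distplot. sns.distplot() is deprecated since seaborn 0.11,
# so histplot() with kde=True is used to draw a comparable histogram plus density line
sns.histplot(data, kde=True, stat="density", color="steelblue")
plt.title("Distplot in Python")
# Show the plot
plt.show()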
Implementation of distplot in R
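ggplot2 is the most widely used package in R for making distribution charts. A distribution plot can be generated by layering geom_density() over geom_histogram(). The following is an illustration of how to make a distplot in ggplot2:
library(ggplot2)
data <- c(1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10)
ggplot(data.frame(x = data), aes(x = x)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 0.5, color = "steelblue") +
  geom_density() +
  ggtitle("Distplot in R")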
COUNT PLOT
A count plot shows the frequency of the various classes of a categorical feature. Think of a bar chart where the height of each bar represents how often that class appears in the data.
It is comparable to plotting the output of the pandas value_counts() function.
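The counting behind a count plot can be sketched with collections.Counter from the standard library, which plays the same role here that value_counts() plays in pandas. The text bars are only a stand-in for the real chart.

```python
from collections import Counter

names = ["John", "Jane", "Mike", "John", "Jane", "Mike", "John", "Mike"]
counts = Counter(names)

# Print one text bar per class, most frequent first.
for name, n in counts.most_common():
    print(f"{name:<5} {'#' * n} ({n})")
```

Each class becomes one bar whose length is its count, so John and Mike (three occurrences each) get longer bars than Jane (two occurrences).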
Advantages and Disadvantages
ADVANTAGES:
- It is a quick and straightforward approach to see how a categorical variable is distributed.
- It is a helpful tool for figuring out which categories are most or least prevalent in a dataset.
- It can be used to contrast how a categorical variable is distributed among several populations or subpopulations.
DISADVANTAGES:
- When dealing with small datasets, it might be deceptive because it might not truly reflect the underlying distribution of the data.
- It only functions with categorical data, hence it is not appropriate for continuous variables.
- When working with enormous datasets that contain numerous categories, it might be challenging to spot patterns or trends.
- Unbalanced or missing data may have an impact on it.
Implementation of count plot in Python
Using the seaborn package, you can build a count plot in Python. Here is an illustration of a Python count plot function for analyzing the categorical values:
import matplotlib.pyplot as plt
import seaborn as sns
# Define the data
data = ["John", "Jane", "Mike", "John", "Jane", "Mike", "John", "Mike"]
# Create the count plot
sns.countplot(x=data)
plt.title("Count plot in Python")
# Show the plot
plt.show()
Implementation of count plot in R
Combining the count() function from the dplyr package with geom_bar() from the ggplot2 package produces a count plot in R. Here is an illustration showing how to make a count plot in R:
library(ggplot2)
library(dplyr)
df <- data.frame(Name = c("John", "Jane", "Mike", "John", "Jane", "Mike", "John", "Mike"),
                 Age = c(25, 32, 45, 25, 32, 45, 25, 45),
                 Salary = c(50000, 60000, 70000, 50000, 60000, 70000, 50000, 70000))
data_count <- count(df, Name)
ggplot(data_count, aes(x = Name, y = n)) +
  geom_bar(stat = "identity", color = "black", fill = "steelblue") +
  ggtitle("Countplot in R")
PIE CHART
A pie chart is a circular graph that illustrates the percentage distribution of a categorical variable.
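The arithmetic behind a pie chart is simple: each category's share of the total becomes a percentage and a matching slice of the 360-degree circle. The following is a small sketch of that calculation; the helper name pie_slices is hypothetical.

```python
def pie_slices(labels, values):
    """Return (label, percentage, angle in degrees) for each category."""
    total = sum(values)
    return [(label, value * 100 / total, value * 360 / total)
            for label, value in zip(labels, values)]

for label, pct, angle in pie_slices(["Apples", "Bananas", "Oranges"], [40, 20, 40]):
    print(f"{label}: {pct:.0f}% of the circle, {angle:.0f} degrees")
```

The percentages always sum to 100 and the angles to 360, which is why a pie chart only makes sense for parts of a single whole.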
Advantages and Disadvantages
ADVANTAGES:
- Show the relative ratios of various data classes.
- Visually compress a vast data set.
- Allow for visual verification of the precision of calculations or logic.
- The size of the circle can be easily adjusted according to the data size.
DISADVANTAGES:
- When there are more than four groups, the graph looks congested.
- In some cases, the relative sizes of the slices are not immediately apparent.
Implementation of Pie chart in Python
You can use the plotly package to create interactive pie charts. By providing additional arguments to go.Pie(), you can alter the color, labels, legend, and other aspects of your pie chart.
import plotly.graph_objects as go
# Create sample data
labels = ['Apples', 'Bananas', 'Oranges']
values = [40, 20, 40]
# Create pie chart
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.show()
Implementation of Pie chart in R
ggplot2 is one of the most widely used packages in R for making pie charts. Here is an illustration of a straightforward ggplot2 pie chart:
library(ggplot2)
# Create sample data
data <- data.frame(
  category = c("A", "B", "C"),
  value = c(30, 20, 50)
)
# Create pie chart
ggplot(data, aes(x = "", y = value, fill = category)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  theme_void()
STRIP PLOTS
To summarise a univariate data collection, a strip plot is used. The response variable’s value is shown on the horizontal axis on the strip plot. All values on the vertical axis are set to 1.
Strip plots are preferred for small datasets, unlike histograms which are preferred for larger ones.
Plotting the variable distribution for each category as a series of discrete data points is also useful.
Advantages and Disadvantages
ADVANTAGES:
- It is helpful for displaying a set of continuous data’s frequency distribution.
- Finding trends and outliers in the data is made easier by using it.
- It can be used to contrast how a variable is distributed throughout several populations or subpopulations.
- It is helpful for determining whether there are multiple modes present in the data.
DISADVANTAGES:
- When dealing with small datasets, it might be deceptive because it might not truly reflect the underlying distribution of the data.
- It can be sensitive to the jitter option, which can change how the plot is seen.
- Being limited to continuous variables, it is not appropriate for categorical data.
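To show what the jitter option does, the sketch below builds the coordinates of a strip plot by hand: every value keeps its position on the horizontal axis, while the vertical position is 1 plus a small random offset so that repeated values do not sit exactly on top of each other. The function name strip_points, the jitter width of 0.1, and the fixed seed are assumptions made for illustration.

```python
import random

def strip_points(values, jitter=0.1, seed=0):
    """Return (x, y) pairs: x is the data value, y is 1 plus a small random offset."""
    rng = random.Random(seed)
    return [(v, 1 + rng.uniform(-jitter, jitter)) for v in values]

data = [1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 7, 8, 8, 9, 10]
for x, y in strip_points(data)[:5]:
    print(x, round(y, 3))
```

With jitter set to 0 all points collapse onto the line y = 1, and duplicates overlap completely. A larger jitter separates them, but it also changes the look of the plot, which is the sensitivity noted above.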
Implementation of strip plot in Python
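import matplotlib.pyplot as plt
import seaborn as sns
# Load seaborn's example "tips" dataset (downloaded on first use)
tips = sns.load_dataset("tips")
# Create the strip plot
sns.stripplot(data=tips, x="total_bill")
plt.title("stripplot in python")
# Show the plot
plt.show()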
Implementation of strip chart in R
library(ggplot2)
# Create sample data
set.seed(0)
data <- data.frame(x = rnorm(100), y = rnorm(100))
# Create a strip plot
ggplot(data, aes(x = x, y = y)) +
  geom_jitter() +
  ggtitle("Stripplot in R")
Conclusion
Univariate analysis, along with its implementations in Python and R, covers only one component of data exploration. It assesses a feature's significance by looking at how that feature is distributed across the dataset. Understanding the connections and interactions between features, referred to as bivariate and multivariate analysis, is the next stage.