Learning data visualization on Kaggle (using seaborn)
1. Hello, Seaborn
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Set up code checking
import os
if not os.path.exists("../input/fifa.csv"):
    os.symlink("../input/data-for-datavis/fifa.csv", "../input/fifa.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.data_viz_to_coder.ex1 import *
print("Setup Complete")
# Path of the file to read
fifa_filepath = "../input/fifa.csv"
# Read the file into a variable fifa_data
fifa_data = pd.read_csv(fifa_filepath, index_col="Date", parse_dates=True)
# Set the width and height of the figure
plt.figure(figsize=(16,6))
# Line chart showing how FIFA rankings evolved over time
sns.lineplot(data=fifa_data)
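A small sketch of what index_col="Date" and parse_dates=True do in the read_csv call above. The CSV text here is made up for illustration (it is not the real fifa.csv), and demo_data is a hypothetical name.

```python
import io

import pandas as pd

# Made-up sample in the shape of a date-indexed CSV (values are illustrative only)
csv_text = """Date,ARG,BRA
1993-08-08,5.0,8.0
1993-09-23,12.0,1.0
1993-10-22,9.0,1.0
"""

# index_col="Date" makes the Date column the row index;
# parse_dates=True converts that index from strings to timestamps
demo_data = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

print(type(demo_data.index).__name__)
print(list(demo_data.columns))
```

Because the index holds real timestamps, sns.lineplot(data=...) can place the dates on the x-axis and draw one line per remaining column.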
2. Line Charts
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Path of the file to read
museum_filepath = "../input/museum_visitors.csv"
# Fill in the line below to read the file into a variable museum_data
museum_data = pd.read_csv(museum_filepath, index_col="Date", parse_dates=True)
# Line chart showing the number of visitors to each museum over time
sns.lineplot(data=museum_data)
# Line plot showing the number of visitors to Avila Adobe over time
plt.figure(figsize=(12,6))
sns.lineplot(data=museum_data["Avila Adobe"])
3. Bar Charts and Heatmaps
Goal: be able to draw bar charts and heatmaps.
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Path of the file to read
ign_filepath = "../input/ign_scores.csv"
# Fill in the line below to read the file into a variable ign_data
ign_data = pd.read_csv(ign_filepath, index_col="Platform")
# Bar chart showing average score for racing games by platform
plt.figure(figsize=(10,6))
sns.barplot(y=ign_data.index, x=ign_data["Racing"])
# Heatmap showing average game score by platform and genre
sns.heatmap(data=ign_data, annot=True)
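To try the two chart types without the Kaggle input file, here is a minimal sketch on a made-up table. The scores and the name demo_scores are assumptions for illustration; the matplotlib.use("Agg") line is only there so it runs outside a notebook and can be dropped when %matplotlib inline is active.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up scores (illustrative only): rows are platforms, columns are genres
demo_scores = pd.DataFrame(
    {"Racing": [7.0, 6.5, 8.1], "Puzzle": [7.9, 6.0, 7.2]},
    index=pd.Index(["PC", "Wii", "Xbox One"], name="Platform"),
)

# Horizontal bar chart: one bar per platform for a single genre column
fig_bar, ax_bar = plt.subplots(figsize=(10, 6))
sns.barplot(x=demo_scores["Racing"], y=demo_scores.index, ax=ax_bar)

# Heatmap: one colored cell per (platform, genre) pair; annot=True prints each value
fig_heat, ax_heat = plt.subplots(figsize=(10, 6))
sns.heatmap(data=demo_scores, annot=True, ax=ax_heat)

n_bars = len(ax_bar.patches)    # bars drawn by barplot
n_labels = len(ax_heat.texts)   # value labels drawn by annot=True
```

The bar chart compares a single column across groups, while the heatmap shows the whole table at once.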
4. Scatter Plots
Examples from the tutorial (insurance_data is assumed to be loaded already):
sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'])
sns.regplot(x=insurance_data['bmi'], y=insurance_data['charges'])
sns.scatterplot(x=insurance_data['bmi'], y=insurance_data['charges'], hue=insurance_data['smoker'])
sns.lmplot(x="bmi", y="charges", hue="smoker", data=insurance_data)
sns.swarmplot(x=insurance_data['smoker'], y=insurance_data['charges'])
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Path of the file to read
candy_filepath = "../input/candy.csv"
# Fill in the line below to read the file into a variable candy_data
candy_data = pd.read_csv(candy_filepath, index_col="id")
candy_data.head(5)
# Scatter plot showing the relationship between 'sugarpercent' and 'winpercent'
sns.scatterplot(x=candy_data['sugarpercent'], y=candy_data['winpercent'])
# Scatter plot w/ regression line showing the relationship between 'sugarpercent' and 'winpercent'
sns.regplot(x=candy_data['sugarpercent'], y=candy_data['winpercent'])
# Scatter plot showing the relationship between 'pricepercent', 'winpercent', and 'chocolate'
sns.scatterplot(x=candy_data['pricepercent'], y=candy_data['winpercent'], hue=candy_data['chocolate'])
# Color-coded scatter plot w/ regression lines
sns.lmplot(x="sugarpercent", y="winpercent", hue="chocolate", data=candy_data)
# Scatter plot showing the relationship between 'chocolate' and 'winpercent'
sns.swarmplot(x=candy_data['chocolate'], y=candy_data['winpercent'])
5. Distributions
Histograms and density plots.
# Histogram
sns.histplot(iris_data['Petal Length (cm)'])
# KDE (kernel density estimate) plot
sns.kdeplot(data=iris_data['Petal Length (cm)'], fill=True)
# 2D KDE plot
sns.jointplot(x=iris_data['Petal Length (cm)'], y=iris_data['Sepal Width (cm)'], kind="kde")
# Color-coded plots (pass the whole DataFrame plus hue, and seaborn seems to handle the grouping on its own)
# Histograms for each species
sns.histplot(data=iris_data, x='Petal Length (cm)', hue='Species')
# KDE plots for each species
sns.kdeplot(data=iris_data, x='Petal Length (cm)', hue='Species', fill=True)
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Path of the files to read
cancer_filepath = "../input/cancer.csv"
# Fill in the line below to read the file into a variable cancer_data
cancer_data = pd.read_csv(cancer_filepath, index_col="Id")
# Print the first five rows of the data
cancer_data.head(5)
# Histograms for benign and malignant tumors
sns.histplot(data=cancer_data, x='Area (mean)', hue='Diagnosis')
# KDE plots for benign and malignant tumors
sns.kdeplot(data=cancer_data, x='Radius (worst)', hue='Diagnosis', fill=True)
6. Choosing Plot Types and Custom Styles
Trends - A trend is defined as a pattern of change.
sns.lineplot - Line charts are best to show trends over a period of time, and multiple lines can be used to show trends in more than one group.
Relationship - There are many different chart types that you can use to understand relationships between variables in your data.
sns.barplot - Bar charts are useful for comparing quantities corresponding to different groups.
sns.heatmap - Heatmaps can be used to find color-coded patterns in tables of numbers.
sns.scatterplot - Scatter plots show the relationship between two continuous variables; if color-coded, we can also show the relationship with a third categorical variable.
sns.regplot - Including a regression line in the scatter plot makes it easier to see any linear relationship between two variables.
sns.lmplot - This command is useful for drawing multiple regression lines, if the scatter plot contains multiple, color-coded groups.
sns.swarmplot - Categorical scatter plots show the relationship between a continuous variable and a categorical variable.
Distribution - We visualize distributions to show the possible values that we can expect to see in a variable, along with how likely they are.
sns.histplot - Histograms show the distribution of a single numerical variable.
sns.kdeplot - KDE plots (or 2D KDE plots) show an estimated, smooth distribution of a single numerical variable (or two numerical variables).
sns.jointplot - This command is useful for simultaneously displaying a 2D KDE plot with the corresponding KDE plots for each individual variable.
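The grouping above can be kept as a small lookup table for quick reference. This is only a sketch: PLOT_GUIDE and suggest_plots are hypothetical names of my own, not part of seaborn, and the entries are plain strings taken from the list above.

```python
# Hypothetical lookup table (names are illustrative, not part of seaborn)
PLOT_GUIDE = {
    "trend": ["sns.lineplot"],
    "relationship": [
        "sns.barplot", "sns.heatmap", "sns.scatterplot",
        "sns.regplot", "sns.lmplot", "sns.swarmplot",
    ],
    "distribution": ["sns.histplot", "sns.kdeplot", "sns.jointplot"],
}


def suggest_plots(goal):
    """Return the seaborn functions listed under a goal, ignoring case and whitespace."""
    key = goal.strip().lower()
    if key not in PLOT_GUIDE:
        raise ValueError(f"unknown goal {goal!r}; expected one of {sorted(PLOT_GUIDE)}")
    return PLOT_GUIDE[key]


print(suggest_plots("Trend"))
```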
# Path of the file to read
spotify_filepath = "../input/spotify.csv"
# Read the file into a variable spotify_data
spotify_data = pd.read_csv(spotify_filepath, index_col="Date", parse_dates=True)
# Change the style of the figure
sns.set_style("dark")
# Line chart
plt.figure(figsize=(12,6))
sns.lineplot(data=spotify_data)
Other styles to try: "darkgrid", "whitegrid", "dark", "white", "ticks"
7. Final Project
See https://www.kaggle.com/datasets
Search for a dataset there, download it, and take a look at it.
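A minimal sketch of a first look at a downloaded CSV with pandas. The inline text stands in for the file so the snippet runs by itself; with a real download, pass its path to pd.read_csv instead. The columns and values are made up, and my_data is a hypothetical name.

```python
import io

import pandas as pd

# Stand-in for a downloaded file (made-up content, illustrative only)
csv_text = """name,score
a,1
b,2
c,3
"""

my_data = pd.read_csv(io.StringIO(csv_text))

print(my_data.shape)         # (number of rows, number of columns)
print(my_data.head())        # first rows, to check the columns were parsed as expected
print(my_data.dtypes)        # column types, which matter when choosing a plot type
print(my_data.isna().sum())  # missing values per column
```

The column types hint at which chart fits: numeric columns suit histograms and scatter plots, while categorical columns work as hue or as groups in a bar chart.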
8. Creating Your Own Notebooks
https://www.kaggle.com/code
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")