[S][데싸]pandas 전처리

2 minute read

Filtering and sorting

24번 문제

item_price컬럼의 달러 표시를 제거하고 float 타입으로 저장하여 new_price 칼럼에 저장하라. $표시는 숫자 제일 앞에 있다.

df['new_price'] = df['item_price'].str[1:].astype('float')

28번 문제 : sort_values()

Ans = df.sort_values('new_price').reset_index(drop=True)

29번 문제 : str.contains()

Ans = df.loc[df.item_name.str.contains('Chips')]   

41번 문제 : str.startswith() (N으로 시작되는)

Ans = df.loc[df.item_name.str.startswith('N')]   

42번 문제 : str.len() 길이

lst =[1.69, 2.39, 3.39, 4.45, 9.25, 10.98, 11.75, 16.98]
Ans = df.loc[df.new_price.isin(lst)]

43번 문제 : isin() 포함되어 있는지

lst = [1.69.
Ans = df[df.item_name.str.len() >= 15]   

Grouping

45번 문제

데이터의 각 host_name의 빈도수를 구하고 host_name으로 정렬하여 상위 5개를 출력하라

Ans = df.groupby('host_name').size()
# or
Ans = df.host_name.value_counts().sort_index()

49번 문제

neighbourhood_group 값에 따른 price값의 평균, 분산, 최대, 최소 값을 구하여라

Ans = df[['neighbourhood_group','price']].groupby('neighbourhood_group').agg(['mean','var','max','min'])

50번 문제

neighbourhood_group 값에 따른 reviews_per_month 평균, 분산, 최대, 최소 값을 구하여라

Ans = df[['neighbourhood_group','reviews_per_month']].groupby('neighbourhood_group').agg(['mean','var','max','min'])

Apply, Map

57번 문제

Income_Category의 카테고리를 map 함수를 이용하여 다음과 같이 변경하여 newIncome 컬럼에 매핑하라 Unknown : N Less than $40K : a $40K - $60K : b $60K - $80K : c $80K - $120K : d $120K +’ : e

dic = {
    'Unknown'        : 'N',
    'Less than $40K' : 'a',
    '$40K - $60K'    : 'b',
    '$60K - $80K'    : 'c',
    '$80K - $120K'   : 'd',
    '$120K +'        : 'e'   
}

df['newIncome']  =df.Income_Category.map(lambda x: dic[x])

Ans = df['newIncome']

58번 문제

Income_Category의 카테고리를 apply 함수를 이용하여 다음과 같이 변경하여 newIncome 컬럼에 매핑하라 Unknown : N Less than $40K : a $40K - $60K : b $60K - $80K : c $80K - $120K : d $120K +’ : e

def changeCategory(x):
    if x =='Unknown':
        return 'N'
    elif x =='Less than $40K':
        return 'a'
    elif x =='$40K - $60K':   
        return 'b'
    elif x =='$60K - $80K':    
        return 'c'
    elif x =='$80K - $120K':   
        return 'd'
    elif x =='$120K +' :     
        return 'e'

df['newIncome']  =df.Income_Category.apply(changeCategory)

Ans = df['newIncome']

59번 문제

Customer_Age의 값을 이용하여 나이 구간을 AgeState 컬럼으로 정의하라. (0~9 : 0 , 10~19 :10 , 20~29 :20 … 각 구간의 빈도수를 출력하라

df['AgeState']  = df.Customer_Age.map(lambda x: x//10 *10)
Ans = df['AgeState'].value_counts().sort_index()

60번 문제

Education_Level의 값중 Graduate단어가 포함되는 값은 1 그렇지 않은 경우에는 0으로 변경하여 newEduLevel 컬럼을 정의하고 빈도수를 출력하라

df['newEduLevel'] = df.Education_Level.map(lambda x : 1 if 'Graduate' in x else 0)
Ans = df['newEduLevel'].value_counts()

62번 문제

Marital_Status 컬럼값이 Married 이고 Card_Category 컬럼의 값이 Platinum인 경우 1 그외의 경우에는 모두 0으로 하는 newState컬럼을 정의하라. newState의 각 값들의 빈도수를 출력하라

def check(x):
    if x.Marital_Status =='Married' and x.Card_Category =='Platinum':
        return 1
    else:
        return 0


df['newState'] = df.apply(check,axis=1)

Ans  = df['newState'].value_counts()

63번 문제

Gender 컬럼값 M인 경우 male F인 경우 female로 값을 변경하여 Gender 컬럼에 새롭게 정의하라. 각 value의 빈도를 출력하라

def changeGender(x):
    if x =='M':
        return 'male'
    else:
        return 'female'
df['Gender'] = df.Gender.apply(changeGender)
Ans = df['Gender'].value_counts()

apply 사용법

apply(lambda x == “xxxx”)
apply(lambda x: x.replace({‘Yes’: 1, ‘No’: 0}))

Time, Series

72번문제

모든 결측치는 컬럼기준 직전의 값으로 대체하고 첫번째 행에 결측치가 있을경우 뒤에 있는 값으로 대채하라

df = df.fillna(method='ffill').fillna(method='bfill')

74번문제

RPT 컬럼의 값을 일자별 기준으로 1차차분하라

Ans = df['RPT'].diff()

75번문제

RPT와 VAL의 컬럼을 일주일 간격으로 각각 이동평균한값을 구하여라

Ans= df[['RPT','VAL']].rolling(7).mean()

Pivot

85번문제 pivot

84번 문제에서 추출한 데이터로 아래와 같이 나라에 따른 년도별 사망률을 데이터 프레임화 하라

Ans = target.pivot(index='Location',columns='Period',values='First Tooltip')

90번문제 pivot_table

전체 데이터에서 Discipline종류에 따른 따른 Medal수를 구하여라

Ans = df.pivot_table(index='Discipline',columns='Medal',aggfunc='size')

Merge, Concat

94번문제

df5과 df6 데이터를 하나의 데이터 프레임으로 merge함수를 이용하여 합쳐라. Algeria컬럼을 key로 하고 두 데이터 모두 포함하는 데이터만 출력하라

Ans = pd.merge(df5,df6,on='Algeria',how='inner')

95번문제

df5과 df6 데이터를 하나의 데이터 프레임으로 merge함수를 이용하여 합쳐라. Algeria컬럼을 key로 하고 합집합으로 합쳐라

Ans =pd.merge(df5,df6,on='Algeria',how='outer')

Share on

Twitter Facebook LinkedIn

설악이