[Pandas] How to manipulate textual data

주댕이 2024. 2. 3. 17:50

https://pandas.pydata.org/docs/getting_started/intro_tutorials/10_text_data.html

How to manipulate textual data — pandas 2.2.0 documentation

This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns: PassengerId: Id of every passenger. Survived: Indication whether passenger survived. 0 for no and 1 for yes. Pclass: One out of the 3 ticket classes: C

pandas.pydata.org

# Loading the data

import pandas as pd 

titanic = pd.read_csv("https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv")
titanic.head()
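
`pd.read_csv` also accepts a file-like object, so the steps in this post can be tried without a network connection. A minimal sketch, assuming a tiny hand-made excerpt (`sample_csv` and `sample` are hypothetical names, and the excerpt has only three rows and three columns, not the full data set):

```python
import io

import pandas as pd

# Hypothetical three-row excerpt; the real file has more rows and columns.
sample_csv = io.StringIO(
    'PassengerId,Name,Sex\n'
    '1,"Braund, Mr. Owen Harris",male\n'
    '2,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female\n'
    '3,"Heikkinen, Miss. Laina",female\n'
)

sample = pd.read_csv(sample_csv)

# Quoted fields keep their embedded comma, so "Name" stays a single column.
print(sample.shape)
print(sample["Name"].tolist())
```

Because the names are quoted in the CSV, the comma inside each name is not treated as a column separator.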

# Converting to lowercase

# Make all name characters lowercase

# Convert the values in the "Name" column of the titanic data to lowercase
titanic["Name"].str.lower()
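
The `.str` accessor returns a new Series and leaves the original untouched, and missing values pass through as missing. A minimal sketch on a hypothetical three-element Series (`names` is an illustrative stand-in for the "Name" column):

```python
import pandas as pd

# Hypothetical sample with one missing value.
names = pd.Series(["Braund, Mr. Owen Harris", None, "Heikkinen, Miss. Laina"])

lowered = names.str.lower()

# The missing entry stays missing instead of raising an error.
print(lowered)

# The original Series is unchanged; assign the result if it should be kept.
print(names[0])
```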

# Extracting part of the text into a new column

# Create a new column Surname that contains the surname of the passengers by extracting the part before the comma

# Split the "Name" values of the titanic data on the comma
titanic["Name"].str.split(",")

# Split the "Name" values on the comma, take only the first part, and store it in a new column called "Surname"
titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)
titanic["Surname"]
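
As a variation on the split above, `str.split` can return the pieces as separate columns with `expand=True`, and `n=1` limits the split to the first comma. A minimal sketch on a hypothetical two-element Series (`names`, `parts`, `surname`, and `rest` are illustrative names):

```python
import pandas as pd

# Hypothetical sample standing in for the "Name" column.
names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

# n=1 splits only at the first comma; expand=True returns a DataFrame
# with one column per piece instead of a Series of lists.
parts = names.str.split(",", n=1, expand=True)

surname = parts[0]
rest = parts[1].str.strip()  # drop the space that followed the comma

print(surname.tolist())
print(rest.tolist())
```

This form is handy when both halves of the name are needed, since the remainder would otherwise have to be extracted with a second `str.get` call.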

# Selecting rows that contain specific text

# Extract the passenger data about the countesses on board the Titanic

# Check which rows of the titanic data contain 'Countess' in the "Name" value
titanic["Name"].str.contains("Countess")

# Select only the rows of the titanic data whose "Name" value contains 'Countess'
titanic[titanic["Name"].str.contains("Countess")]
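
If the column might contain missing values, the result of `str.contains` is missing for those rows and may not be usable as a filter. A minimal sketch, assuming a hypothetical three-row frame (`people`) with one missing name, that uses the `na` and `case` parameters:

```python
import pandas as pd

# Hypothetical sample; the second name is missing on purpose.
people = pd.DataFrame(
    {
        "Name": [
            "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
            None,
            "Braund, Mr. Owen Harris",
        ]
    }
)

# na=False treats missing names as "no match"; case=False ignores letter case.
mask = people["Name"].str.contains("countess", case=False, na=False)
countesses = people[mask]

print(mask.tolist())
print(countesses)
```

The pattern is interpreted as a regular expression by default; passing `regex=False` treats it as a literal substring instead.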

# Selecting the row with the longest text value

# Which passenger of the Titanic has the longest name?

# Check the length of each "Name" value in the titanic data
titanic["Name"].str.len()

# Find the index label of the row with the longest "Name" value
titanic["Name"].str.len().idxmax()

# Find the index label of the row with the longest "Name" value and extract that row's "Name" column
titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]
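
`idxmax` returns an index label rather than a row position, which is why the lookup uses `loc` and not `iloc`. A minimal sketch on a hypothetical frame (`people`) whose index deliberately does not start at 0:

```python
import pandas as pd

# Hypothetical sample with a non-default index to show label-based lookup.
people = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Heikkinen, Miss. Laina",
            "Allen, Mr. William Henry",
        ]
    },
    index=[10, 20, 30],
)

lengths = people["Name"].str.len()

# idxmax gives the index label of the largest value (30 here, not position 2).
longest_label = lengths.idxmax()
longest_name = people.loc[longest_label, "Name"]

print(longest_label, longest_name)
```

With the default index used in the Titanic example, labels and positions coincide, so the distinction only becomes visible after filtering or re-indexing.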

# Replacing values with new values

# In the “Sex” column, replace values of “male” by “M” and values of “female” by “F”

# Replace "male" with "M" and "female" with "F" in the "Sex" column of the titanic data, and store the result in a new column called "Sex_short"
titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})
titanic["Sex_short"]
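
`Series.replace` with a dictionary matches whole values and leaves anything not listed as it was, whereas `Series.map` with the same dictionary turns unlisted values into missing values. A minimal sketch comparing the two on a hypothetical Series (`sex`) that contains one value outside the mapping:

```python
import pandas as pd

# Hypothetical sample; "unknown" is not a key in the mapping.
sex = pd.Series(["male", "female", "unknown"])
mapping = {"male": "M", "female": "F"}

replaced = sex.replace(mapping)  # unlisted values are kept unchanged
mapped = sex.map(mapping)        # unlisted values become missing

print(replaced.tolist())
print(mapped.tolist())
```

Which behaviour is preferable depends on whether unexpected values should be preserved or surfaced as missing. Note that `Series.replace` is not a `.str` method: `Series.str.replace` performs substring replacement inside each string instead.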
