[Day 7] Working with Strings and Text File Data

2022. 12. 28. 16:11 · Data Engineering Course/python

Contents
1. Working with strings
2. Reading and processing data from a text file

1. Working with strings

1. split(): splitting a string

Use split() to break a string into a list of substrings.
Usage
str.split([sep])

coffee_menu_str = "에스프레소,아메리카노,카페라떼,카푸치노"

coffee_menu_str.split(',')
>>> ['에스프레소', '아메리카노', '카페라떼', '카푸치노']

split() can also be called directly on a string literal.

# Split on commas (,)
"에스프레소,아메리카노,카페라떼,카푸치노".split(',') 
>>> ['에스프레소', '아메리카노', '카페라떼', '카푸치노']

# Split on spaces
"에스프레소 아메리카노 카페라떼 카푸치노".split(' ') 
>>> ['에스프레소', '아메리카노', '카페라떼', '카푸치노']

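Calling split() with no arguments splits on any run of whitespace (spaces, tabs, newlines) and drops empty strings, so extra spaces and newline characters never appear in the result.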

"에스프레소 아메리카노 카페라떼 카푸치노".split()
>>> ['에스프레소', '아메리카노', '카페라떼', '카푸치노']

"에스프레소 \n\n 아메리카노 \n 카페라떼 카푸치노 \n\n\n".split()
>>> ['에스프레소', '아메리카노', '카페라떼', '카푸치노']
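
The difference between split(' ') and split() only shows up when the input has consecutive spaces or a trailing newline. A minimal sketch, using made-up sample text and variable names that are not from the original post:

```python
# Illustrative sample: two consecutive spaces and a trailing newline
menu_text = "espresso  americano latte\n"

# split(' ') cuts at every single space, so an empty string and '\n' survive
by_space = menu_text.split(' ')

# split() cuts at runs of whitespace and discards empty strings
by_whitespace = menu_text.split()

print(by_space)       # ['espresso', '', 'americano', 'latte\n']
print(by_whitespace)  # ['espresso', 'americano', 'latte']
```

For data read from a file, split() with no arguments is usually the safer choice.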

To split only a limited number of times, pass maxsplit:
str.split([sep], maxsplit=number_of_splits)

"에스프레소 아메리카노 카페라떼 카푸치노".split(maxsplit=2)
>>> ['에스프레소', '아메리카노', '카페라떼 카푸치노']

phone_number = "+82-10-2345-6789" # phone number including the country code
split_num = phone_number.split("-", maxsplit=1) # separate the country code from the rest of the number

print(split_num)
print("국내전화번호: {0}".format(split_num[1]))
>>> ['+82', '10-2345-6789']
>>> 국내전화번호: 10-2345-6789
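
Because maxsplit=1 always yields at most two pieces, the result can be unpacked straight into two names. The sketch below also shows rsplit(), which counts splits from the right; the variable names are only illustrative.

```python
phone_number = "+82-10-2345-6789"

# Unpack the two pieces produced by a single split from the left
country_code, local_number = phone_number.split("-", maxsplit=1)
print(country_code)   # +82
print(local_number)   # 10-2345-6789

# rsplit() performs the same limited split, but starting from the right
head, last_block = phone_number.rsplit("-", maxsplit=1)
print(head)           # +82-10-2345
print(last_block)     # 6789
```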

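2. strip(): removing unwanted characters

Removes leading and trailing whitespace or other unwanted characters.
Usage
str.strip([chars])
Several characters can be specified at once; any of them found at either end is removed.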

test_str = "aaaabbPythonbbbaa"

test_str.strip('ab') # remove 'a' and 'b' from both ends of the string
>>> 'Python'

test_str.strip('ba')
>>> 'Python'

➡ The order of the characters in chars does not matter!

An example that removes more kinds of characters

test_str_multi = "##***!!!##.... Python is powerful!.!...  %%!#.. "

test_str_multi.strip('*.#! %')
>>> 'Python is powerful'

test_str_multi.strip('%* !#.')
>>> 'Python is powerful'

.strip(' ') , .strip(): removing spaces

"   Python   ".strip(' ')
>>> 'Python'

.strip('\n ') , .strip(): removing newline characters

"\n Python \n\n".strip('\n ')
>>> 'Python'

"\n Python \n\n".strip()
>>> 'Python'

Characters are removed only until a character outside the given set is encountered.

"aaaBallaaa".strip('a')
>>> 'Ball'

➡ Matching characters in the middle of the string are not removed.

"\n This is very \n fast. \n\n".strip()
>>> 'This is very \n fast.'
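
Since the chars argument is treated as a set of individual characters rather than a prefix or suffix, strip() can remove more than intended. The sketch below uses a hypothetical file name to show the pitfall and one way around it with endswith() and slicing.

```python
file_name = "report.txt"   # hypothetical file name, for illustration only

# rstrip('.txt') removes any trailing '.', 't' or 'x', including the 't' of "report"
too_much = file_name.rstrip(".txt")
print(too_much)            # repor

# To drop an exact suffix, check for it and slice it off
suffix = ".txt"
base_name = file_name[:-len(suffix)] if file_name.endswith(suffix) else file_name
print(base_name)           # report
```

On Python 3.9 and later, str.removesuffix() and str.removeprefix() do the same exact-match removal.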

lstrip(): strips only the left side of the string
rstrip(): strips only the right side of the string

str_lr = "000Python is easy to learn. 000"
print(str_lr.strip('0'))
print(str_lr.lstrip('0'))
print(str_lr.rstrip('0'))

>>> Python is easy to learn. 
>>> Python is easy to learn. 000
>>> 000Python is easy to learn.

How can we tidy up the spaces around the items of a comma-separated string?

coffee_menu = "  에스프레소, 아메리카노, 카페라떼, 카푸치노 "
coffee_menu_list = coffee_menu.split(',')

coffee_menu_list
>>> ['  에스프레소', ' 아메리카노', ' 카페라떼', ' 카푸치노 ']

coffee_list = [] # create an empty list
for coffee in coffee_menu_list:
    temp = coffee.strip() # remove surrounding whitespace from the string
    coffee_list.append(temp) # append the cleaned string to the list
    
print(coffee_list) # print the final list of strings
>>> ['에스프레소', '아메리카노', '카페라떼', '카푸치노']
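
The same split-then-strip loop can be written more compactly as a list comprehension. This is an equivalent sketch, not a replacement for the loop above:

```python
coffee_menu = "  에스프레소, 아메리카노, 카페라떼, 카푸치노 "

# Split on commas, then strip the whitespace around each item in one expression
coffee_list = [coffee.strip() for coffee in coffee_menu.split(',')]

print(coffee_list)  # ['에스프레소', '아메리카노', '카페라떼', '카푸치노']
```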

3. + , join(): concatenating strings

+ : string concatenation

name1= "영미"
name2 = "철수"
hello="님, 주소와 전화 번호를 입력해 주세요."

print(name1 + hello)
print(name2 + hello)
>>> 영미님, 주소와 전화 번호를 입력해 주세요.
>>> 철수님, 주소와 전화 번호를 입력해 주세요.

join()
- Inserts the separator string (str) between the items of a sequence (seq) of strings and returns the result as a single string.
- Usage: str.join(seq)

address_list = ["서울시", "서초구", "반포대로", "201(반포동)"]

address_list
>>> ['서울시', '서초구', '반포대로', '201(반포동)']

# Join with a space
a = " "
a.join(address_list)
>>> '서울시 서초구 반포대로 201(반포동)'

" ".join(address_list)
>>> '서울시 서초구 반포대로 201(반포동)'

"*^-^*".join(address_list)
>>> '서울시*^-^*서초구*^-^*반포대로*^-^*201(반포동)'
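
join() only accepts strings as items, so a list of numbers has to be converted first. A small sketch with made-up sales figures (the list and variable names are assumptions for illustration):

```python
sales = [10, 50, 45, 20]   # illustrative numbers, not read from any file

# Passing integers directly raises TypeError
try:
    ",".join(sales)
except TypeError as error:
    print("join failed:", error)

# Convert each item to str before joining
sales_text = ",".join(str(value) for value in sales)
print(sales_text)          # 10,50,45,20
```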

4. Searching in strings

(1) find()

Returns the index in the string (str) of the first match of the search string (search_str), or -1 if there is no match.
Usage
str.find(search_str)
A start and end position can also be given:
str.find(search_str, start, end)

str_f_se = "Python is powerful. Python is easy to learn"

print(str_f_se.find("Python",0, 30)) # search between the start and end positions
print(str_f_se.find("Python",35)) # search from the given start position
>>> 0
>>> -1
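
find() reports only the first match, but its start argument makes it easy to collect every match. The helper below, find_all, is a hypothetical function written for this sketch and is not part of Python's str API.

```python
def find_all(text, search_str):
    """Return the start index of every non-overlapping match of search_str."""
    if not search_str:
        raise ValueError("search_str must not be empty")
    positions = []
    start = text.find(search_str)
    while start != -1:                 # find() returns -1 when nothing is left
        positions.append(start)
        # Continue searching just past the match that was found
        start = text.find(search_str, start + len(search_str))
    return positions

str_f_se = "Python is powerful. Python is easy to learn"
print(find_all(str_f_se, "Python"))  # [0, 20]
print(find_all(str_f_se, "Java"))    # []
```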

(2) count()

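Use count() to find out how many times a substring occurs.
It returns the number of matches of the search string (search_str) in the string (str), or 0 if there are none.
As with find(), the search range can be limited with start and end.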

str_c = "Python is powerful. Python is easy to learn. Python is open."

print("Python의 개수는?:", str_c.count("Python"))
print("powerful의 개수는?:", str_c.count("powerful"))
print("IPython의 개수는?:", str_c.count("IPython"))
>>> Python의 개수는?: 3
>>> powerful의 개수는?: 1
>>> IPython의 개수는?: 0
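
Two details are worth seeing in action: count() counts non-overlapping matches, and start and end restrict the search range. A short sketch:

```python
# Matches are non-overlapping: "aa" fits twice into "aaaa", not three times
overlap_count = "aaaa".count("aa")
print(overlap_count)    # 2

str_c = "Python is powerful. Python is easy to learn. Python is open."

# Only the first 30 characters are searched
first_part = str_c.count("Python", 0, 30)
print(first_part)       # 2

# Search from index 21 to the end of the string
rest = str_c.count("Python", 21)
print(rest)             # 1
```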

(3) startswith()와 endswith()

These check whether a string starts or ends with the given substring.
They return True if it does and False otherwise.

str_se = "Python is powerful. Python is easy to learn"

print("Python으로 시작?:", str_se.startswith("Python"))
print("is으로 시작?:", str_se.startswith("is"))
print(".으로 끝?:", str_se.endswith("."))
print("learn으로 끝?:", str_se.endswith("learn"))
>>> Python으로 시작?: True
>>> is으로 시작?: False
>>> .으로 끝?: False
>>> learn으로 끝?: True
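
Both methods also accept a tuple of candidates, which is handy for filtering file names by extension. The file names below are hypothetical examples.

```python
# Hypothetical file names, for illustration only
file_names = ["coffeeShopSales.txt", "notes.md", "sales.csv", "photo.png"]

# endswith() returns True if the string ends with any item of the tuple
data_files = [name for name in file_names if name.endswith((".txt", ".csv"))]

print(data_files)  # ['coffeeShopSales.txt', 'sales.csv']
```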

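5. replace(): replacing text

Finds the given substring in a string and replaces it. The optional third argument limits how many matches are replaced.
Usage
str.replace(old, new[, count])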

str_a = 'Python is fast. Python is friendly. Python is open'

print(str_a.replace('Python','IPython'))
print(str_a.replace('Python','IPython',2))
>>> IPython is fast. IPython is friendly. IPython is open
>>> IPython is fast. IPython is friendly. Python is open

It can also be used to delete text:
str.replace('text_to_remove', '')

str_b = '[Python] [is] [fast]'
str_b1 = str_b.replace('[','') # remove '[' from the string
str_b2 = str_b1.replace(']','') # then remove ']' from the result

print(str_b)
print(str_b1)
print(str_b2)
>>> [Python] [is] [fast]
>>> Python] is] fast]
>>> Python is fast
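
Because replace() returns a new string, the two steps can be chained. When many single characters need deleting, str.translate() with a deletion table is an alternative. Both variants are sketched here for comparison:

```python
str_b = '[Python] [is] [fast]'

# Chain the two replace() calls instead of using intermediate variables
chained = str_b.replace('[', '').replace(']', '')
print(chained)      # Python is fast

# The third argument of str.maketrans() lists the characters to delete
delete_brackets = str.maketrans('', '', '[]')
translated = str_b.translate(delete_brackets)
print(translated)   # Python is fast
```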

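6. Checking what a string is made of

These methods check what kinds of characters a string contains.
isalpha() : True only when the string is non-empty and every character is a letter (no digits, special characters, or whitespace)

print('Python'.isalpha()) # no whitespace, special characters, or digits
>>> True

print('Ver. e.x'.isalpha()) # contains whitespace and special characters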
>>> False

isdigit() : True only when the string consists entirely of digits

print('12345'.isdigit()) # the string consists only of digits
>>> True

print('12345abc'.isdigit()) # the string is not made up of digits alone
>>> False

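isalnum() : True only when the string consists of letters and digits, with no special characters or whitespace

print('abc1234'.isalnum()) # letters and digits only, no special characters or whitespace
>>> True

print('        abc1234'.isalnum()) # the string contains whitespace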
>>> False

isspace() : True only when the string consists entirely of whitespace characters

print('     '.isspace()) # the string consists only of whitespace
>>> True

print('  1  '.isspace()) # the string contains a non-whitespace character
>>> False

isupper() : True only when all cased letters in the string are uppercase (and there is at least one)

print('PYTHON'.isupper()) # all letters are uppercase
>>> True

print('Python'.isupper()) # the string mixes uppercase and lowercase letters
>>> False

islower() : True only when all cased letters in the string are lowercase (and there is at least one)

print('python'.islower()) # all letters are lowercase
>>> True

print('Python'.islower()) # the string mixes uppercase and lowercase letters
>>> False
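
A common use of these checks is validating text before converting it with int(). The helper below, to_int_or_none, is a hypothetical name introduced for this sketch. It assumes only plain, non-negative ASCII integers should be accepted.

```python
def to_int_or_none(text):
    """Return int(text) for a plain non-negative ASCII integer, else None."""
    cleaned = text.strip()
    # isascii() rules out characters such as '²', which pass isdigit()
    # but cannot be converted by int()
    if cleaned.isascii() and cleaned.isdigit():
        return int(cleaned)
    return None

print(to_int_or_none(" 45 "))   # 45
print(to_int_or_none("-45"))    # None (the minus sign is not a digit)
print(to_int_or_none("10.15"))  # None (the dot is not a digit)
print(to_int_or_none(""))       # None (an empty string is never isdigit)
```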

7. Changing case

str.lower() converts everything to lowercase
str.upper() converts everything to uppercase

string1 = 'Python is powerful. PYTHON IS EASY TO LEARN'

print(string1.lower())
>>> python is powerful. python is easy to learn

print(string1.upper())
>>> PYTHON IS POWERFUL. PYTHON IS EASY TO LEARN

Is Python case-sensitive?

'Python' == 'python'
>>> False

➡ Yes, it is!

Comparing strings with lower and upper

print('Python'.lower() == 'python'.lower())
>>> True

print('Python'.upper() == 'python'.upper())
>>> True
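
For case-insensitive comparison of text beyond plain English, str.casefold() is a more aggressive variant of lower(). The helper name equals_ignore_case is an assumption made for this sketch.

```python
def equals_ignore_case(left, right):
    """Compare two strings while ignoring letter case."""
    return left.casefold() == right.casefold()

print(equals_ignore_case('Python', 'PYTHON'))    # True

# lower() keeps the German 'ß', while casefold() expands it to 'ss'
print('Straße'.lower() == 'STRASSE'.lower())     # False
print(equals_ignore_case('Straße', 'STRASSE'))   # True
```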

2. Reading and processing data from a text file

1. Preparing and reading the data file

(1) Move to the folder where the data file will be stored

cd c:\myPyCode\data\

(2) Create the txt file

f=open('coffeeShopSales.txt', 'w')
f.write('날짜     에스프레소 아메리카노 카페라떼 카푸치노 \n')
f.write('10.15       10         50         45       20    \n')
f.write('10.16       12         45         41       18    \n')
f.write('10.17       11         53         32       25    \n')
f.write('10.18       15         49         38       22    \n')
f.close()

(3) Check the contents using the file name

!type c:\myPyCode\data\coffeeShopSales.txt
>>> 날짜     에스프레소 아메리카노 카페라떼 카푸치노 
    10.15       10         50         45       20    
    10.16       12         45         41       18    
    10.17       11         53         32       25    
    10.18       15         49         38       22

(4) Read and print one line at a time

# file_name = 'c:\myPyCode\data\coffeeShopSales.txt'
file_name = 'c:/myPyCode/data/coffeeShopSales.txt'

f=open(file_name)        # open the file for reading
for line in f:          # read one line at a time
    print(line, end="") # print each line (it already ends with a newline)
f.close()               # close the file

>>> 날짜     에스프레소 아메리카노 카페라떼 카푸치노 
>>> 10.15       10         50         45       20    
>>> 10.16       12         45         41       18    
>>> 10.17       11         53         32       25    
>>> 10.18       15         49         38       22
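
The open()/close() pair can also be written with a with statement, which closes the file automatically even if an error occurs inside the block. The sketch below writes a small temporary file so that it runs anywhere. The path and the file contents are illustrative assumptions, not the coffeeShopSales.txt file above.

```python
import os
import tempfile

# Create a small sample file in a temporary folder (illustrative path and data)
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_sales.txt')

with open(sample_path, 'w', encoding='utf-8') as f:
    f.write('date espresso americano\n')
    f.write('10.15 10 50\n')

lines = []
with open(sample_path, encoding='utf-8') as f:  # closed automatically on exit
    for line in f:
        lines.append(line.rstrip('\n'))         # drop the trailing newline

print(lines)     # ['date espresso americano', '10.15 10 50']
print(f.closed)  # True
```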

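2. Processing the string data read from the file

(1) Read the column names from the first line and split them on whitespace, then process the values from the second line onward

f=open(file_name)      # open the file
header = f.readline()  # read the first line of the data
f.close()              # close the file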

header
>>> '날짜     에스프레소 아메리카노 카페라떼 카푸치노 \n'

(2) Build a list with split

header_list = header.split() # split the first line into a list

header_list
>>> ['날짜', '에스프레소', '아메리카노', '카페라떼', '카푸치노']

(3) Use a for loop to put the items of each line into data_list

f=open(file_name)            # open the file
header = f.readline()        # read the first line of the data
header_list = header.split() # split the first line into a list

for line in f:               # process the data line by line, starting from the second line
    data_list = line.split() # split the string into a list
    print(data_list)         # print the list to check the result
    
f.close()                    # close the file
>>> ['10.15', '10', '50', '45', '20']
>>> ['10.16', '12', '45', '41', '18']
>>> ['10.17', '11', '53', '32', '25']
>>> ['10.18', '15', '49', '38', '22']

(4) Convert the string items to numbers, then collect the sales figures by coffee type

f=open(file_name)            # open the file
header = f.readline()        # read the first line of the data
headerList = header.split() # split the first line into a list

espresso = []                # create an empty list for each coffee type
americano = []
cafelatte = []
cappucino = []

for line in f:               # process the data line by line, starting from the second line
    dataList = line.split() # split the string into a list
    
    # convert each value to an integer and append it to the list for its coffee type
    espresso.append(int(dataList[1]))
    americano.append(int(dataList[2]))
    cafelatte.append(int(dataList[3]))
    cappucino.append(int(dataList[4]))
    
f.close()                   # close the file

print("{0}: {1}".format(headerList[1], espresso))  # print the values assigned to each variable
print("{0}: {1}".format(headerList[2], americano))
print("{0}: {1}".format(headerList[3], cafelatte))
print("{0}: {1}".format(headerList[4], cappucino))
>>> 에스프레소: [10, 12, 11, 15]
>>> 아메리카노: [50, 45, 53, 49]
>>> 카페라떼: [45, 41, 32, 38]
>>> 카푸치노: [20, 18, 25, 22]
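
With one list variable per coffee type, adding a column means editing the code in several places. One possible alternative is a dictionary keyed by the header names. The sketch below holds the same data in memory with io.StringIO so it runs without the file. The names sample, header_names, and sales are assumptions introduced here.

```python
import io

# Same layout as coffeeShopSales.txt, kept in memory for a self-contained sketch
sample = io.StringIO(
    '날짜     에스프레소 아메리카노 카페라떼 카푸치노 \n'
    '10.15       10         50         45       20    \n'
    '10.16       12         45         41       18    \n'
    '10.17       11         53         32       25    \n'
    '10.18       15         49         38       22    \n'
)

header_names = sample.readline().split()          # first line: column names
sales = {name: [] for name in header_names[1:]}   # one list per coffee type

for line in sample:                               # remaining lines: data rows
    fields = line.split()
    # Pair each coffee name with its value in this row (the date is skipped)
    for name, value in zip(header_names[1:], fields[1:]):
        sales[name].append(int(value))

for name, values in sales.items():
    print("{0}: {1}".format(name, values))
```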

(5) Compute the four-day total and the daily average sales for each menu item

total_sum = [sum(espresso) ,sum(americano), sum(cafelatte), sum(cappucino)]
total_avg = [sum(espresso)/len(espresso), sum(americano)/len(americano), 
             sum(cafelatte)/len(cafelatte),sum(cappucino)/len(cappucino)]

for k in range(len(total_sum)):
    print('[{0}] 판매량'.format(headerList[k+1]))
    print(f'- 나흘 전체: {total_sum[k]}, 하루 평균: {total_avg[k]}')

>>> [에스프레소] 판매량
>>> - 나흘 전체: 48, 하루 평균: 12.0
>>> [아메리카노] 판매량
>>> - 나흘 전체: 197, 하루 평균: 49.25
>>> [카페라떼] 판매량
>>> - 나흘 전체: 156, 하루 평균: 39.0
>>> [카푸치노] 판매량
>>> - 나흘 전체: 85, 하루 평균: 21.25
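
The totals and averages can also be computed in a single loop over a dictionary, with the average formatted to two decimal places. The sales dictionary below repeats the figures from the file, and the summary name is an assumption for this sketch.

```python
# Sales figures per coffee type, copied from the data above
sales = {
    '에스프레소': [10, 12, 11, 15],
    '아메리카노': [50, 45, 53, 49],
    '카페라떼': [45, 41, 32, 38],
    '카푸치노': [20, 18, 25, 22],
}

summary = {}
for name, values in sales.items():
    total = sum(values)              # total over all days
    average = total / len(values)    # daily average
    summary[name] = (total, average)
    # :.2f prints the average with two decimal places
    print(f'[{name}] total: {total}, daily average: {average:.2f}')
```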
