차밍이

Python

[Python] Handling Large Datasets with Pandas

2022. 6. 3. 18:13

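1. Loading data

1-1. Reading CSV file data in chunks

Loading a dataset with more than 1 million rows in one go makes everything heavy and very slow.
The chunksize parameter of pandas.read_csv helps here.
It sets how many rows are read into a DataFrame at a time, so that each piece fits in local memory.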

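import pandas as pd

# With chunksize set, read_csv returns an iterator of DataFrames, not a single DataFrame
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)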
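
Because chunksize makes read_csv return an iterator rather than a DataFrame, the result is usually consumed in a loop. A minimal sketch, assuming a small in-memory CSV stands in for the real file; the column name `value` is hypothetical:

```python
import io

import pandas as pd

# Hypothetical stand-in for a large CSV file: 10 rows, one numeric column
csv_text = "value\n" + "\n".join(str(i) for i in range(10))

total = 0
n_rows = 0
# Each iteration yields a DataFrame with at most `chunksize` rows
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()
    n_rows += len(chunk)

print(total, n_rows)
```

Only one chunk is held in memory at a time, so it is better to aggregate or filter inside the loop than to collect every chunk into a list.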

1-2. When the file won't load: reading each column with a smaller dtype

When there is too much data to load, you can check each column's type and load it with a dtype that takes less space.

import numpy as np
import pandas as pd


def check_dtypes(file_path):
    """Return a {column: dtype} mapping with the smallest dtype that fits each column."""
    print(file_path)
    # Read only the header row to get the column names
    tmp = pd.read_csv(file_path, nrows=0)
    col_dtypes = {}
    for col in tmp.columns:
        # Read one column at a time to keep peak memory low
        df = pd.read_csv(file_path, usecols=[col])
        dtype = str(df[col].dtype)
        # Default: keep the inferred dtype (covers bool and anything not handled below)
        col_dtype = dtype

        if "int" in dtype:
            c_min = df[col].min()
            c_max = df[col].max()
            # Try the narrowest types first; the bounds are inclusive
            for candidate in ("int8", "uint8", "int16", "uint16",
                              "int32", "uint32", "int64", "uint64"):
                info = np.iinfo(candidate)
                if c_min >= info.min and c_max <= info.max:
                    col_dtype = candidate
                    break

        elif "float" in dtype:
            c_min = df[col].min()
            c_max = df[col].max()
            # Float limits come from np.finfo (np.iinfo only accepts integer types)
            # Note: float32 keeps about 7 significant digits, so precision may be lost
            if c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                col_dtype = "float32"
            else:
                col_dtype = "float64"

        elif dtype == "object":
            # Ratio of unique values to rows; a low ratio means many repeated values
            threshold = df[col].nunique() / df.shape[0]
            col_dtype = "object" if threshold > 0.7 else "category"

        col_dtypes[col] = col_dtype

    return col_dtypes


file_path = r"../test.csv"
data_types = check_dtypes(file_path)
df = pd.read_csv(file_path, dtype=data_types)
df
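
To see whether a dtype mapping actually helps, the memory footprint can be compared before and after. A rough sketch, assuming an in-memory CSV with the hypothetical columns `id` and `city`; the dtype dictionary is written by hand here, in the same {column: dtype} shape that check_dtypes returns:

```python
import io

import pandas as pd

# Hypothetical data: small integers and a heavily repeated string column
rows = "\n".join(f"{i % 100},{'Seoul' if i % 2 else 'Daegu'}" for i in range(1000))
csv_text = "id,city\n" + rows

# Default inference: 64-bit integers for id, strings for city
df_default = pd.read_csv(io.StringIO(csv_text))
# Hand-written mapping passed through the dtype argument
df_small = pd.read_csv(io.StringIO(csv_text), dtype={"id": "int8", "city": "category"})

# deep=True counts the actual string storage, not just the pointers
mem_default = df_default.memory_usage(deep=True).sum()
mem_small = df_small.memory_usage(deep=True).sum()
print(mem_default, mem_small)
```

The exact numbers depend on the pandas version and the data, so it is worth measuring on a sample of the real file.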

2. Managing loaded data

2-1. Reducing the data's memory usage

Unless the data was already loaded with reduced dtypes as in 1-2, shrinking it once makes it easier and faster to work with.

import numpy as np


def reduce_mem_usage(df, verbose=True):
    """Downcast numeric columns in place to the smallest dtype that holds their values."""
    # Narrowest types first, unsigned before signed at each width
    int_types = [np.uint8, np.int8, np.uint16, np.int16,
                 np.uint32, np.int32, np.uint64, np.int64]
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = str(df[col].dtype)
        if col_type.startswith(('int', 'uint')):
            c_min = df[col].min()
            c_max = df[col].max()
            # Pick the first type whose inclusive range covers the column
            for int_type in int_types:
                if c_min >= np.iinfo(int_type).min and c_max <= np.iinfo(int_type).max:
                    df[col] = df[col].astype(int_type)
                    break
        elif col_type.startswith('float'):
            c_min = df[col].min()
            c_max = df[col].max()
            # Caution: float16 keeps only about 3 significant digits
            if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                df[col] = df[col].astype(np.float16)
            elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                df[col] = df[col].astype(np.float32)
            else:
                df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} MB ({:.1f}% reduction)'.format(
            end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
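
As an alternative to a hand-written function, pandas has a built-in downcast option on pd.to_numeric. This is a different technique from the function above, shown only as a sketch; the column names `count` and `ratio` are hypothetical:

```python
import pandas as pd

# Hypothetical DataFrame with default 64-bit columns
df = pd.DataFrame({"count": [1, 2, 300], "ratio": [0.5, 1.5, 2.5]})

# downcast="integer" picks the smallest signed integer type that fits
df["count"] = pd.to_numeric(df["count"], downcast="integer")
# downcast="float" goes no lower than float32
df["ratio"] = pd.to_numeric(df["ratio"], downcast="float")

print(df.dtypes)
```

Unlike the function above, this route never produces float16, which avoids the heaviest precision loss at the cost of a smaller saving.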

2-2. Filtering out unneeded columns

To save time and memory, keep only the columns you actually need.

df = df[['col_1','col_2', 'col_3', 'col_4', 'col_5', 'col_6']]
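
The selection above happens after the whole file is already in memory. To skip the unneeded columns at read time instead, read_csv accepts a usecols argument. A minimal sketch, assuming an in-memory CSV with the hypothetical extra column `unused`:

```python
import io

import pandas as pd

# Hypothetical CSV containing one column we do not need
csv_text = "col_1,col_2,unused\n1,2,x\n3,4,y\n"

# Only the listed columns are parsed and stored
df = pd.read_csv(io.StringIO(csv_text), usecols=["col_1", "col_2"])
print(df.columns.tolist())
```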

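2-3. Save variables you are done with to a file, or delete them from RAM

Python's statement for deleting a variable is del.
When an object is large, it is a good idea to manage memory explicitly.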
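
A rough sketch of that workflow, assuming a pickle file in a temporary directory is an acceptable place to park the data; the variable name `df_big` and the path are hypothetical:

```python
import gc
import os
import tempfile

import pandas as pd

df_big = pd.DataFrame({"x": range(1000)})

# Save the result first if it will be needed again (temporary example path)
out_path = os.path.join(tempfile.mkdtemp(), "df_big.pkl")
df_big.to_pickle(out_path)

# Remove the name; the memory can be reclaimed once no references remain
del df_big
# Optionally ask the garbage collector to run right away
gc.collect()

# The data can be loaded back later from the file
restored = pd.read_pickle(out_path)
print(len(restored))
```

Note that del only removes one name. If another variable still points to the same DataFrame, the memory stays in use.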

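2-4. Encoding values as codes

Instead of keeping str values as an object-dtype column, converting them to numeric codes reduces the data size considerably.

- Male -> 0
  Female -> 1

- Seoul (서울특별시) -> 11
  Daegu (대구광역시) -> 45

- Normal -> 0
  Abnormal -> 1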
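
One way to apply such a mapping is Series.map with a dictionary. A minimal sketch using the male/female example; the column names `sex` and `sex_code` are hypothetical:

```python
import pandas as pd

# Hypothetical column of repeated string labels
df = pd.DataFrame({"sex": ["Male", "Female"] * 500})

# Explicit mapping from label to numeric code
sex_codes = {"Male": 0, "Female": 1}
df["sex_code"] = df["sex"].map(sex_codes).astype("int8")

# Compare per-column memory; deep=True includes the string contents
mem_str = df["sex"].memory_usage(deep=True)
mem_code = df["sex_code"].memory_usage(deep=True)
print(mem_str, mem_code)
```

Any label missing from the dictionary becomes NaN after map, and the cast to int8 then fails, so it is worth checking that the mapping covers every value in the column.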

3. Using other libraries

Libraries such as modin and dask offer ways to handle data faster, and at a larger scale, than pandas.
I haven't used them yet, so I'll update this section later.
