차밍이

[Python] When large data is hard to load in Pandas, convert the data types on read

Python

2022. 6. 2. 17:01

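1. When the file cannot be loaded

Rough outline

Read only the column names first.
From the data table, load only one specific column at a time.
Check that column's data type.
Check the range of values in that column, and if a smaller data type can hold them, switch to the smaller type.
Repeat for every column.
Afterwards, when reading the data with pd.read_csv, pass the chosen data types so the data is read with them.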
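The first and last steps above can be sketched as follows. This is only a minimal sketch: the in-memory CSV, its column names, and the `dtypes` mapping are made up for illustration and stand in for a real large file and for the result of inspecting each column.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a large file on disk.
csv_text = "id,score,city\n1,0.5,Seoul\n2,1.5,Busan\n3,2.5,Seoul\n"

# nrows=0 reads only the header row, so this is cheap even for a huge file.
columns = pd.read_csv(io.StringIO(csv_text), nrows=0).columns.tolist()

# Hypothetical mapping; in practice it comes from checking each column's range.
dtypes = {"id": "int8", "score": "float32", "city": "category"}

# Passing dtype= makes pandas parse straight into the smaller types.
df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)
print(columns)
print(df.dtypes)
```

Because the types are fixed at parse time, pandas does not first build the default 64-bit columns for the final result.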

Source code - when the file cannot be loaded

import os

import numpy as np
import pandas as pd

# Integer types to try, narrowest first (same order as the original checks).
INT_TYPES = ["int8", "uint8", "int16", "uint16", "int32", "uint32", "int64", "uint64"]


def check_dtypes(file_path: str) -> dict:
    print(file_path)
    # Read only the header row to get the column names.
    tmp = pd.read_csv(file_path, nrows=0)
    col_dtypes = {}
    for col in tmp.columns:
        # Load a single column so the full table never has to fit in memory.
        df = pd.read_csv(file_path, usecols=[col])
        dtype = str(df[col].dtype)
        # Fallback: keep the inferred dtype (bool, empty column, ...).
        col_dtype = dtype

        if "int" in dtype:
            c_min = df[col].min()
            c_max = df[col].max()
            # Bounds are inclusive: -128 and 127 both fit in int8.
            for name in INT_TYPES:
                info = np.iinfo(np.dtype(name))
                if c_min >= info.min and c_max <= info.max:
                    col_dtype = name
                    break

        elif "float" in dtype:
            c_min = df[col].min()
            c_max = df[col].max()
            # Floats need np.finfo; np.iinfo only accepts integer types.
            # float32 keeps about 7 significant digits, so this trades precision for size.
            info = np.finfo(np.float32)
            if c_min >= info.min and c_max <= info.max:
                col_dtype = "float32"
            else:
                col_dtype = "float64"

        elif dtype == "object" and len(df) > 0:
            # Share of distinct values; mostly-unique text gains nothing from category.
            threshold = df[col].nunique() / len(df)
            col_dtype = "object" if threshold > 0.7 else "category"

        col_dtypes[col] = col_dtype

    return col_dtypes

check_dtypes(os.path.join(tar_path, "2a997e23-6334-45c1-a075-8c2f4bfd89d2_015.csv"))

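This is based on the source code from "뚱뚱하고 굼뜬 판다스(Pandas)를 위한 효과적인 다이어트 전략" (An Effective Diet Strategy for Fat, Sluggish Pandas) by 오성우. I updated a few parts that have changed in the current version, and modified and extended it so that it can be applied directly to a file.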

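Pros and cons

Pros

When a file is too large to load, you can shrink the size of the data table and load it anyway.

Cons

It is very slow.

The function goes through every column of the data table and re-reads the file each time to check the size of the data, so it cannot avoid being very slow.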
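One possible way to soften this cost, which is not part of the original code, is to scan the file a single time in chunks and keep a running minimum and maximum per numeric column. The sketch below is only an illustration under assumptions: `scan_numeric_ranges` and the chunk size are hypothetical, the CSV is a tiny in-memory stand-in, and pandas may infer a different dtype for the same column in different chunks.

```python
import io

import pandas as pd


def scan_numeric_ranges(source, chunksize=100_000):
    """Return {column: (min, max)} for numeric columns, reading the file once."""
    ranges = {}
    # With chunksize set, read_csv yields DataFrames of at most `chunksize` rows.
    for chunk in pd.read_csv(source, chunksize=chunksize):
        numeric = chunk.select_dtypes(include="number")
        for col in numeric.columns:
            c_min, c_max = numeric[col].min(), numeric[col].max()
            if pd.isna(c_min):
                # The chunk holds only missing values for this column.
                continue
            if col in ranges:
                old_min, old_max = ranges[col]
                c_min, c_max = min(old_min, c_min), max(old_max, c_max)
            ranges[col] = (c_min, c_max)
    return ranges


# Hypothetical data: three rows read in chunks of two.
csv_text = "a,b,c\n1,0.5,x\n300,-2.0,y\n-7,9.25,z\n"
ranges = scan_numeric_ranges(io.StringIO(csv_text), chunksize=2)
print(ranges)
```

The resulting ranges can then be compared against np.iinfo and np.finfo limits in the same way as in the function above, at the price of holding one chunk of all columns in memory at a time.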

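2. When the file can be loaded

Reducing memory usage

Once the data can be loaded as a DataFrame, it is more effective to load it first and then reduce its memory usage than to use the function above.

Source code - reducing memory usage after loading the data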

import numpy as np


def reduce_mem_usage(df, verbose=True):
    # Note: the columns of df are converted in place, and df is also returned.
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if str(col_type) in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Inclusive bounds, so values exactly at a type's limit still fit.
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                else:
                    df[col] = df[col].astype(np.int64)
            else:
                # Only the range is checked; float16 and float32 lose precision.
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} MB ({:.1f}% reduction)'.format(
            end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
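pandas also ships a built-in helper for the numeric part of this idea: `pd.to_numeric` with its `downcast` argument. The sketch below is not the function above, only a minimal illustration on a tiny made-up DataFrame; the column names are hypothetical. Note that `downcast="float"` goes no smaller than float32.

```python
import pandas as pd

# Hypothetical data: pandas stores these as int64 and float64 by default.
df = pd.DataFrame({"count": [1, 2, 300], "ratio": [0.25, 0.5, 0.75]})
before = df.memory_usage(deep=True).sum()

small = df.copy()
# downcast="integer" picks the smallest signed integer type that fits.
small["count"] = pd.to_numeric(small["count"], downcast="integer")
# downcast="float" tries float32 and keeps float64 if values would not survive.
small["ratio"] = pd.to_numeric(small["ratio"], downcast="float")
after = small.memory_usage(deep=True).sum()

print(small.dtypes)
print(before, after)
```

Working on a copy keeps the original DataFrame untouched, which makes it easy to compare the two memory figures.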
