Python Study Notes: Feather, an Efficient Data Format

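I. Background

Day-to-day data loading in Python usually starts from a json, csv, txt, or xlsx file, or from a direct database query.

Large datasets are commonly stored as csv, but the files take up a lot of disk space and are slow to save and load.

Feather is a binary storage format that is faster and, once compressed, more lightweight.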

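II. What is Feather?

Feather is a file format for storing data frames.

In one sentence: a compressed binary file format with fast reads and writes.

Feather is a format that belongs to the Apache Arrow project, but because of its excellent performance it is also packaged separately and can be installed through pip.

pandas also supports reading and writing Feather files.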

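It was originally designed for fast data exchange between Python and R. The goal was simple: move data frames between the two as efficiently as possible.

R, Julia, and Python can all parse Feather, which makes it a strong tool for exchanging data among the three languages, with first-rate read and write speed.

Feather is no longer limited to Python and R; most mainstream programming languages can now work with Feather files.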

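However, the format was not designed for long-term storage and is intended for ordinary short-term storage.

Note: this statement is hard to interpret. What counts as long-term, what counts as short-term, and where is the line?

Note: for long-term storage, Feather's compression is not the best available, and Parquet is worth a look. Feather can be used for long-term storage too; it is just not the optimal choice.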

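III. Usage

In Python, Feather files can be handled in two ways: through pandas or through the feather package.

The recommendation here is to avoid pandas' built-in to_feather and read_feather. Because of version compatibility problems, calling the feather package's own API directly is the better choice.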

1. Installation

Note: do not install with pip install feather. The installation appears to succeed, but reading a file then fails with ImportError: cannot import name 'getuid' from 'os' (D:\anaconda\lib\os.py).

# pip
pip install feather-format
# installed as a dependency: pyarrow-5.0.0-cp38-cp38-win_amd64.whl

# conda
conda install -c conda-forge feather-format  # raised an error when tested
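
Because the similarly named feather and feather-format distributions are easy to confuse, it can help to check what is actually installed before importing anything. The following is a minimal sketch using only the standard library; the helper name distribution_versions is made up for illustration, and the list of distribution names is an assumption based on the packages mentioned above.

```python
from importlib import metadata


def distribution_versions(names=("feather-format", "feather", "pyarrow", "pandas")):
    """Return {distribution name: installed version, or None if it is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one line per distribution so a wrong "feather" install is easy to spot.
for name, version in distribution_versions().items():
    print(f"{name}: {version if version is not None else 'not installed'}")
```

If feather shows a version while feather-format does not, that is most likely the situation described in the note above.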

2. Test dataset

Build a DataFrame of random numbers with 5 columns and 10 million rows.

import feather
import pandas as pd
import numpy as np

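import os
os.chdir(r'C:\Users\111\Desktop')

# Seed the generator by calling seed(); the original assigned to np.random.seed
# instead, so the sample output below came from an unseeded run and will differ.
np.random.seed(2021)
df_size = 10000000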

df = pd.DataFrame({
    'a': np.random.rand(df_size),
    'b': np.random.rand(df_size),
    'c': np.random.rand(df_size),
    'd': np.random.rand(df_size),
    'e': np.random.rand(df_size)
    })
df.head()
'''
          a         b         c         d         e
0  0.515694  0.879751  0.346675  0.998066  0.647965
1  0.648172  0.044250  0.546985  0.668001  0.460173
2  0.774530  0.354780  0.034965  0.259252  0.037479
3  0.843657  0.956277  0.059882  0.394459  0.088319
4  0.263218  0.409887  0.149357  0.971544  0.657425
'''
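
For a reproducible test set, NumPy's Generator API is an alternative to the global seed. The sketch below is not the code used for the measurements in this post: the helper name make_random_frame is hypothetical, and it builds a much smaller frame so it runs quickly.

```python
import numpy as np
import pandas as pd


def make_random_frame(n_rows, seed=2021, columns=("a", "b", "c", "d", "e")):
    """Build a DataFrame of uniform random floats in [0, 1), reproducible for a given seed."""
    rng = np.random.default_rng(seed)  # local generator, no global state involved
    return pd.DataFrame({name: rng.random(n_rows) for name in columns})


small = make_random_frame(1_000)
print(small.shape)
print(make_random_frame(1_000).equals(small))  # same seed, same data
```

Passing n_rows=10_000_000 would give a frame of the same shape as the one used in the rest of this post.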

3. Using pandas

Save

Data can be saved directly with DataFrame.to_feather(). The syntax is:

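df.to_feather(path, compression=..., compression_level=...)
# -- path: file path
# -- compression: whether and how to compress; one of zstd, lz4, or uncompressed
#    (pass it by keyword; pandas forwards it to pyarrow)
# -- compression_level: compression level (not supported for lz4)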

df.to_feather('data.feather')

Load

df = pd.read_feather('data.feather')

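4. Using feather

The native feather API works much like the pandas one. Read speed is about the same, although in the write test in section 5 native feather was noticeably faster.

Save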

feather.write_dataframe(df, 'data2.feather')

Load

df = feather.read_dataframe('data2.feather')

5. csv vs. feather

Write speed comparison

# Import the time module
import time

# 1. Plain csv
start = time.time()
df.to_csv('data_csv.csv')
end = time.time()
print('CSV Running time: %s Seconds' % (end-start))

# 2. Native feather ("YS" in the labels and file names abbreviates yuansheng, "native")
start = time.time()
feather.write_dataframe(df, 'data_feather_ys.feather')
end = time.time()
print('YS-feather Running time: %s Seconds' % (end-start))

# 3. pandas feather
start = time.time()
df.to_feather('data_feather_pd.feather')
end = time.time()
print('Pd-feather Running time: %s Seconds' % (end-start))
'''
CSV Running time: 93.85435080528259 Seconds
YS-feather Running time: 0.3590412139892578 Seconds
Pd-feather Running time: 4.7694432735443115 Seconds
'''

Read speed comparison

# Import the time module
import time

# 1. Plain csv
start = time.time()
df1 = pd.read_csv('data_csv.csv')
end = time.time()
print('CSV Running time: %s Seconds' % (end-start))

# 2. Native feather ("YS" abbreviates yuansheng, "native")
start = time.time()
df2 = feather.read_dataframe('data_feather_ys.feather')
end = time.time()
print('YS-feather Running time: %s Seconds' % (end-start))

# 3. pandas feather
start = time.time()
df3 = pd.read_feather('data_feather_pd.feather')
end = time.time()
print('Pd-feather Running time: %s Seconds' % (end-start))

'''
CSV Running time: 11.32979965209961 Seconds
YS-feather Running time: 0.34105563163757324 Seconds
Pd-feather Running time: 0.45678043365478516 Seconds
'''
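
The start/end pattern repeated in both comparisons can be wrapped in a small context manager. This is only a sketch: the name timed and the results dictionary are hypothetical, and it uses time.perf_counter, which is better suited to measuring intervals than time.time. A stand-in workload replaces the file operations so the block runs on its own.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block, print the elapsed seconds, and optionally record them."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label} Running time: {elapsed:.3f} Seconds")


timings = {}
with timed("sum", timings):
    total = sum(range(1_000_000))  # stand-in for df.to_csv(...), df.to_feather(...), etc.
```

With the DataFrame from this post, each measurement would become something like `with timed("CSV", timings): df.to_csv('data_csv.csv')`.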

File size comparison

# By eye, from the sizes shown in the file manager
data_csv.csv             -- 0.97 GB
data_feather_ys.feather  -- 381 MB
data_feather_pd.feather  -- 381 MB

# Get the file sizes with os (unit: MB)
import os
def get_FileSize(filePath):
    filePath = str(filePath)
    fsize = os.path.getsize(filePath)        # size in bytes
    fsize = fsize / float(1024 * 1024)       # bytes -> MB
    return round(fsize, 2)

print(get_FileSize('data_feather_ys.feather'), 'MB')
print(get_FileSize('data_feather_pd.feather'), 'MB')
print(get_FileSize('data_csv.csv'), 'MB')
'''
381.57 MB
381.57 MB
1003.63 MB
'''

# Compute the compression ratio (feather size / csv size)
standard_ratio = os.stat('data_feather_ys.feather').st_size / os.stat('data_csv.csv').st_size
print(f'Standard feather compression ratio is {standard_ratio*100:.1f}%')
# Standard feather compression ratio is 38.0%
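
As a sanity check on these numbers, the raw size of the test data can be worked out by hand: 5 columns of 10 million float64 values at 8 bytes each. The short calculation below suggests that the 381.57 MB Feather file is almost exactly the raw in-memory size, so for this random data the saving over csv seems to come mainly from binary encoding rather than from compression (random floats compress poorly). The csv size is the value measured above.

```python
n_rows = 10_000_000
n_cols = 5
bytes_per_float64 = 8

raw_bytes = n_rows * n_cols * bytes_per_float64
raw_mb = raw_bytes / (1024 * 1024)           # same unit as get_FileSize above
print(f"raw float64 payload: {raw_mb:.2f} MB")   # 381.47 MB

csv_mb = 1003.63                             # csv size measured above
print(f"payload / csv: {raw_mb / csv_mb:.1%}")   # 38.0%
```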

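IV. Summary

Feather offers a clear performance gain over csv.

It suits medium-sized data (on the order of gigabytes); for example, a 4 GB csv file might take only about 700 MB as a Feather file.
Reads and writes are far faster than csv, and it is more portable than a database, so it works well as an intermediate format for moving data around.
Like csv, Feather supports reading only the columns you need from the file, which reduces memory use.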

df = pd.read_feather(path='data.feather', columns=["a","b","c"])

Parquet is a format that aims for greater compression and is also worth considering as a replacement for csv.


Published by

风君子
