Problem description
I have a data frame with alphanumeric keys which I want to save as a CSV and read back later. For various reasons I need to explicitly read this key column as strings: some of my keys are strictly numeric, or even worse, look like 1234e5, which pandas interprets as a float. This obviously makes the key completely useless.
The problem is that when I specify a string dtype for the data frame, or for any column of it, I just get garbage back. Here is some example code:
    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(2, 2), index=['1a', '1b'], columns=['a', 'b'])
    df.to_csv(savefile)
The data frame looks like this:

               a         b
    1a  0.209059  0.275554
    1b  0.742666  0.721165
Then I read it back like this:

    df_read = pd.read_csv(savefile, dtype=str, index_col=0)

and the result is:
       a  b
    b  (  <
Is this a problem with my computer, something I'm doing wrong here, or just a bug?
Recommended answer
Update: this has been fixed. From 0.11.1 onward, passing str/np.str is equivalent to using object.
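To illustrate the fixed behaviour, here is a minimal sketch, assuming a reasonably recent pandas. The CSV text, the column name `key`, and the variable names are made up for the example and do not come from the question. It contrasts the default type inference with an explicit string dtype on the key column:

```python
import io

import pandas as pd

# Hypothetical sample data: every key looks numeric to the parser.
csv_text = "key,a\n1234e5,0.5\n0042,0.25\n"

# Default behaviour: pandas infers a float column, so the keys are mangled.
inferred = pd.read_csv(io.StringIO(csv_text))

# Explicit string dtype for the key column only: the text is kept verbatim.
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"key": str})

print(inferred["key"].tolist())
print(as_str["key"].tolist())
```

Restricting the dtype to the key column leaves the other columns free to be parsed as numbers; `as_str.set_index("key")` can then be used if the keys should become the index.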
Use the object dtype:

    In [11]: pd.read_csv('a', dtype=object, index_col=0)
    Out[11]:
                          a                     b
    1a  0.35633069074776547     0.745585398803751
    1b  0.20037376323337375  0.013921830784260236
Or better yet, just don't specify a dtype:

    In [12]: pd.read_csv('a', index_col=0)
    Out[12]:
               a         b
    1a  0.356331  0.745585
    1b  0.200374  0.013922
But bypassing the type sniffer and truly returning only strings requires a hacky use of converters:

    In [13]: pd.read_csv('a', converters={i: str for i in range(100)})
    Out[13]:
                          a                     b
    1a  0.35633069074776547     0.745585398803751
    1b  0.20037376323337375  0.013921830784260236
where 100 is some number equal to or greater than your total number of columns.
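If guessing an upper bound such as 100 feels fragile, one possible variation is to read only the header first and build the converters from the actual column names. This is a sketch rather than part of the original answer; the CSV text and the names `csv_text`, `header`, and `all_str` are assumptions made for illustration:

```python
import io

import pandas as pd

# Hypothetical sample data with numeric-looking keys.
csv_text = "key,a,b\n1234e5,0.35,0.74\n0042,0.20,0.01\n"

# Read zero data rows, just to learn the column names.
header = pd.read_csv(io.StringIO(csv_text), nrows=0)

# One str converter per real column, so no arbitrary upper bound is needed.
all_str = pd.read_csv(
    io.StringIO(csv_text),
    converters={name: str for name in header.columns},
)

print(all_str["key"].tolist())
print(all_str["a"].tolist())
```

The cost is that the header is parsed twice, which is usually negligible compared with reading the data itself.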
It's best to avoid the str dtype; see for example here.