pandas: reading a CSV as string type

Question

I have a data frame with alphanumeric keys which I want to save as a CSV and read back later. For various reasons I need to explicitly read this key column as strings: some keys are strictly numeric, and, even worse, some look like 1234e5, which pandas interprets as a float. This obviously makes the key completely useless.

The problem is that when I specify a string dtype for the data frame or any column of it, I just get garbage back. Here is some example code:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(2, 2),
                  index=['1a', '1b'],
                  columns=['a', 'b'])
df.to_csv(savefile)  # savefile: path of the CSV file to write

The data frame looks like this:

           a         b
1a  0.209059  0.275554
1b  0.742666  0.721165

Then I read it back like so:

df_read = pd.read_csv(savefile, dtype=str, index_col=0)

The result is:

   a  b
b  (  <

Is this a problem with my computer, something I'm doing wrong here, or just a bug?

Recommended answer

Update: this has been fixed. From 0.11.1 on, passing str/np.str is equivalent to using object.
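As a minimal sketch of what this means for the key column in the question, the snippet below reads a small in-memory CSV twice: once with default type inference, and once with a string dtype requested for one column. The column name `key`, the variable names, and the sample values are made up for illustration, and it assumes a pandas version at or after the fix mentioned above.

```python
import io

import pandas as pd

# Hypothetical sample data: both keys look numeric to the type sniffer.
csv_text = "key,a\n1234e5,0.5\n0042,0.25\n"

# Default inference parses the keys as floats, losing the original text.
inferred = pd.read_csv(io.StringIO(csv_text))
print(inferred["key"].tolist())

# Requesting a string dtype for that column keeps the text as written.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"key": str})
print(as_text["key"].tolist())
```

Passing a dict to dtype restricts the string treatment to the key column, so the remaining columns still go through normal numeric inference.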

Use the object dtype:

In [11]: pd.read_csv('a', dtype=object, index_col=0)
Out[11]:
                      a                     b
1a  0.35633069074776547     0.745585398803751
1b  0.20037376323337375  0.013921830784260236

Or better yet, just don't specify a dtype:

In [12]: pd.read_csv('a', index_col=0)
Out[12]:
           a         b
1a  0.356331  0.745585
1b  0.200374  0.013922

But bypassing the type sniffer and truly returning only strings requires a hacky use of converters:

In [13]: pd.read_csv('a', converters={i: str for i in range(100)})
Out[13]:
                      a                     b
1a  0.35633069074776547     0.745585398803751
1b  0.20037376323337375  0.013921830784260236

where 100 is some number equal to or greater than your total number of columns.
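Instead of guessing an upper bound such as 100, one option is to read just the header first and build the converters mapping from the actual column names. This is a sketch rather than part of the original answer: the sample data and variable names are hypothetical, and `nrows=0` is used here only to fetch the header row.

```python
import io

import pandas as pd

# Hypothetical sample data with numeric-looking keys.
csv_text = "key,a\n1234e5,0.5\n0042,0.25\n"

# Read only the header row to learn the column names.
columns = pd.read_csv(io.StringIO(csv_text), nrows=0).columns

# Map every column to str so each cell comes back as its raw text.
df = pd.read_csv(io.StringIO(csv_text),
                 converters={name: str for name in columns})
print(df)
```

With a file on disk the same two calls would take the path instead of a StringIO object, at the cost of opening the file twice.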

It's best to avoid the str dtype; see for example here.