Problem description
I have a series with some datetimes (as strings) and some nulls stored as the string 'nan':
    import pandas as pd, numpy as np, datetime as dt

    df = pd.DataFrame({'date': ['2014-10-20 10:44:31', '2014-10-23 09:33:46',
                                'nan', '2014-10-01 09:38:45']})
I'm trying to convert these to datetime:
    df['date'] = df['date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
But I get the error:
    time data 'nan' does not match format '%Y-%m-%d %H:%M:%S'
So I tried to turn these into actual nulls:
    df.loc[df['date'] == 'nan', 'date'] = np.nan
and repeated the conversion:
    df['date'] = df['date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
but then I get the error:
    must be string, not float
What is the quickest way to solve this problem?
Recommended answer
Just use to_datetime and set errors='coerce' to handle the duff data:
    In [321]: df['date'] = pd.to_datetime(df['date'], errors='coerce')
              df
    Out[321]:
                     date
    0 2014-10-20 10:44:31
    1 2014-10-23 09:33:46
    2                 NaT
    3 2014-10-01 09:38:45

    In [322]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 4 entries, 0 to 3
    Data columns (total 1 columns):
    date    3 non-null datetime64[ns]
    dtypes: datetime64[ns](1)
    memory usage: 64.0 bytes
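For anyone who wants to paste this into a script rather than an IPython session, here is a minimal self-contained sketch of the same idea. The explicit format argument and the names parsed and n_missing are my additions for illustration and are not part of the session above; errors='coerce' alone is enough for this data.

```python
import pandas as pd

df = pd.DataFrame({'date': ['2014-10-20 10:44:31', '2014-10-23 09:33:46',
                            'nan', '2014-10-01 09:38:45']})

# errors='coerce' turns anything that cannot be parsed (here the string 'nan')
# into NaT instead of raising.
# Passing format is optional (an assumption added for this sketch); it pins
# down the layout that the strings are expected to follow.
parsed = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Count how many values could not be parsed.
n_missing = int(parsed.isna().sum())

print(parsed)
print('unparseable values:', n_missing)
```

Checking the NaT count after coercing is a cheap way to notice when more rows were silently dropped than you expected.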
The problem with calling strptime is that it raises an error if the string does not match the format (a ValueError, as with 'nan') or if the value is not a string at all (a TypeError, as with the float np.nan).
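To make the two failure modes concrete, here is a small sketch that feeds strptime a valid string, the string 'nan', and the float np.nan, and records which exception comes back. The helper try_parse and the dict results are hypothetical names used only for this illustration.

```python
import datetime as dt
import numpy as np

FMT = '%Y-%m-%d %H:%M:%S'

def try_parse(value):
    """Return 'ok' if strptime succeeds, else the name of the exception raised."""
    try:
        dt.datetime.strptime(value, FMT)
        return 'ok'
    except (ValueError, TypeError) as exc:
        return type(exc).__name__

results = {
    'valid string': try_parse('2014-10-20 10:44:31'),
    "string 'nan'": try_parse('nan'),   # text does not match the format
    'float np.nan': try_parse(np.nan),  # not a string at all
}

for label, outcome in results.items():
    print(label, '->', outcome)
```

This is why replacing 'nan' with np.nan only swaps one error for another: strptime has no notion of a missing value.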
If you wrap the call so that it catches those errors, then it works:
    In [324]: def func(x):
                  try:
                      return dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
                  except (ValueError, TypeError):
                      # bad string or non-string value: treat as missing
                      return pd.NaT

              df['date'].apply(func)
    Out[324]:
    0   2014-10-20 10:44:31
    1   2014-10-23 09:33:46
    2                   NaT
    3   2014-10-01 09:38:45
    Name: date, dtype: datetime64[ns]
But it will be faster to use the built-in to_datetime rather than call apply, which essentially just loops over your series in Python.
Timings
    In [326]: %timeit pd.to_datetime(df['date'], errors='coerce')
              %timeit df['date'].apply(func)
    10000 loops, best of 3: 65.8 μs per loop
    10000 loops, best of 3: 186 μs per loop
We see here that using to_datetime is roughly 3x faster (65.8 μs versus 186 μs), even on this four-row frame.
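If you want to repeat the comparison outside IPython, something like the following sketch should do. The 10,000-row series, the repeat count, and the names s, t_vec, t_apply and same are assumptions of mine, and the numbers you get will depend on your machine and pandas version, so I have not quoted any output.

```python
import timeit
import datetime as dt
import pandas as pd

FMT = '%Y-%m-%d %H:%M:%S'
base = ['2014-10-20 10:44:31', '2014-10-23 09:33:46',
        'nan', '2014-10-01 09:38:45']

# Repeat the four sample values to get a larger series (10,000 rows).
s = pd.Series(base * 2500)

def func(x):
    """Row-by-row parser equivalent to the apply version above."""
    try:
        return dt.datetime.strptime(x, FMT)
    except (ValueError, TypeError):
        return pd.NaT

vectorised = pd.to_datetime(s, errors='coerce')
looped = s.apply(func)

# Both approaches should mark the same rows as missing and agree elsewhere.
same = (bool((vectorised.isna() == looped.isna()).all())
        and vectorised.dropna().tolist() == looped.dropna().tolist())

# Time each approach over a handful of runs.
t_vec = timeit.timeit(lambda: pd.to_datetime(s, errors='coerce'), number=5)
t_apply = timeit.timeit(lambda: s.apply(func), number=5)

print('results agree:', same)
print('to_datetime: %.4f s, apply: %.4f s' % (t_vec, t_apply))
```

On a larger series the gap between the two may differ from the 3x seen on four rows, so it is worth measuring on data of the size you actually have.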