I have a Series with some datetimes (as strings) and some nulls as 'nan':

    import pandas as pd, numpy as np, datetime as dt
    df = pd.DataFrame({'date': ['2014-10-20 10:44:31', '2014-10-23 09:33:46', 'nan', '2014-10-01 09:38:45']})
I'm trying to convert these to datetime:

    df['date'] = df['date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))

    time data 'nan' does not match format '%Y-%m-%d %H:%M:%S'
So I try to turn these into actual nulls:

    df.loc[df['date'] == 'nan', 'date'] = np.nan
    df['date'] = df['date'].apply(lambda x: dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S'))
must be string, not float
What is the quickest way to solve this problem?
Just use to_datetime and set errors='coerce' to handle duff data:

    In [321]: df['date'] = pd.to_datetime(df['date'], errors='coerce')
              df
    Out[321]:
                     date
    0 2014-10-20 10:44:31
    1 2014-10-23 09:33:46
    2                 NaT
    3 2014-10-01 09:38:45

    In [322]: df.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 4 entries, 0 to 3
    Data columns (total 1 columns):
    date    3 non-null datetime64[ns]
    dtypes: datetime64[ns](1)
    memory usage: 64.0 bytes
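For anyone who wants to run this outside an IPython session, here is a minimal self-contained sketch of the same idea. The explicit format= argument and the names parsed and n_bad are my additions, not part of the original session; passing format= assumes every valid row follows that exact layout.

```python
import pandas as pd

# Same sample data as in the question, with the literal string 'nan' as the bad value.
df = pd.DataFrame({'date': ['2014-10-20 10:44:31', '2014-10-23 09:33:46',
                            'nan', '2014-10-01 09:38:45']})

# errors='coerce' turns anything unparseable into NaT instead of raising.
# format= is optional (an assumption here: all valid rows share this layout).
parsed = pd.to_datetime(df['date'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Count how many rows could not be parsed before overwriting the column.
n_bad = parsed.isna().sum()
df['date'] = parsed
```

Checking n_bad before assigning is a cheap way to notice when coerce is silently swallowing more rows than expected.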
The problem with calling strptime is that it will raise an error if the string or the dtype is incorrect.
If you did this then it would work:

    In [324]: def func(x):
                  try:
                      return dt.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')
                  except (ValueError, TypeError):
                      return pd.NaT

              df['date'].apply(func)
    Out[324]:
    0   2014-10-20 10:44:31
    1   2014-10-23 09:33:46
    2                   NaT
    3   2014-10-01 09:38:45
    Name: date, dtype: datetime64[ns]
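The two failure modes from the question can be reproduced in isolation, which shows why the wrapper has to guard against both. This is only a sketch; error_type is a hypothetical helper name I made up for illustration.

```python
import datetime as dt
import numpy as np

FMT = '%Y-%m-%d %H:%M:%S'

def error_type(value):
    """Return the name of the exception strptime raises for value, or None."""
    try:
        dt.datetime.strptime(value, FMT)
    except (ValueError, TypeError) as exc:
        return type(exc).__name__
    return None

# A string that does not match the format (the first error in the question).
string_case = error_type('nan')
# A float rather than a string (the second error, after assigning np.nan).
float_case = error_type(np.nan)
# A well-formed timestamp string parses without error.
ok_case = error_type('2014-10-20 10:44:31')

print(string_case, float_case, ok_case)
```

On Python 3 the first case is a ValueError and the second a TypeError, so catching just one of them in func would not be enough.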
But it will be faster to use the inbuilt to_datetime rather than call apply, which essentially just loops over your Series.

    In [326]: %timeit pd.to_datetime(df['date'], errors='coerce')
              %timeit df['date'].apply(func)
    10000 loops, best of 3: 65.8 μs per loop
    10000 loops, best of 3: 186 μs per loop
We see here that using to_datetime is roughly 3x faster (65.8 μs versus 186 μs on this four-row frame).
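If you want to check the comparison on your own machine without %timeit, something like the following sketch should do. The larger input (the question's four values repeated 2,500 times), the explicit format=, and the names parse_or_nat, t_vec and t_loop are assumptions of mine; the timings will differ by pandas version and hardware, so none are quoted here.

```python
import datetime as dt
import timeit

import pandas as pd

FMT = '%Y-%m-%d %H:%M:%S'

def parse_or_nat(x):
    """Row-wise fallback: parse with strptime, return NaT on bad input."""
    try:
        return dt.datetime.strptime(x, FMT)
    except (ValueError, TypeError):
        return pd.NaT

# Hypothetical larger input: the question's four values repeated 2,500 times.
values = ['2014-10-20 10:44:31', '2014-10-23 09:33:46',
          'nan', '2014-10-01 09:38:45'] * 2500
s = pd.Series(values)

# Both approaches should mark the same rows as missing.
vectorised = pd.to_datetime(s, format=FMT, errors='coerce')
looped = s.apply(parse_or_nat)

# Time each approach over a handful of repetitions.
t_vec = timeit.timeit(lambda: pd.to_datetime(s, format=FMT, errors='coerce'), number=5)
t_loop = timeit.timeit(lambda: s.apply(parse_or_nat), number=5)
print(f'to_datetime: {t_vec:.4f}s  apply: {t_loop:.4f}s')
```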