Problem description
I have a DataFrame in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to timestamps using pd.to_timestamp. I am trying to calculate people's ages from the time difference between 'entry_date' and 'dob'. To do this I need the difference in days between the two columns, so that I can then do something like round(days/365.25). I cannot find a way to do this with a vectorized operation. When I do munged_data.entry_date-munged_data.dob I get the following:
    internal_quote_id
    2     15685977 days, 23:54:30.457856
    3     11651985 days, 23:49:15.359744
    4      9491988 days, 23:39:55.621376
    7     11907004 days, 0:10:30.196224
    9     15282164 days, 23:30:30.196224
    15    15282227 days, 23:50:40.261632
However, I cannot extract the days as an integer so that I can continue with my calculation. Any help appreciated.
Recommended answer
You need pandas 0.11 for this (0.11rc1 is out; the final release is probably next week).
    # assumes: from pandas import DataFrame, Timestamp

    In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])

    In [10]: df
    Out[10]:
                        0
    0 2001-01-01 00:00:00
    1 2004-06-01 00:00:00

    In [11]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ],columns=['age'])

    In [12]: df
    Out[12]:
                      age
    0 2001-01-01 00:00:00
    1 2004-06-01 00:00:00

    In [13]: df['today'] = Timestamp('20130419')

    In [14]: df['diff'] = df['today']-df['age']

    In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)

    In [17]: df
    Out[17]:
                      age               today                diff      years
    0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00  12.304110
    1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00   8.887671
You need this odd apply at the end because timedelta64[ns] scalars are not yet fully supported (for example, the way Timestamp is now used for datetime64[ns] values); that support is coming in 0.12.
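For readers on a much later pandas release than the 0.11 discussed above, the same calculation can be sketched without apply. This is only a sketch and assumes a pandas version that provides the .dt accessor on timedelta columns; it is not the method from the answer. The column names 'dob' and 'entry_date' follow the question, and the sample dates are the ones from the session above.

```python
import pandas as pd

# Sample data reusing the dates from the answer's session;
# the column names follow the question ('dob', 'entry_date').
df = pd.DataFrame({"dob": pd.to_datetime(["2001-01-01", "2004-06-01"])})
df["entry_date"] = pd.Timestamp("2013-04-19")

# Subtracting two datetime64[ns] columns gives a timedelta64[ns] column.
diff = df["entry_date"] - df["dob"]

# Vectorized extraction of whole days as integers (assumes the .dt accessor exists).
df["days"] = diff.dt.days

# Approximate age in years, using the question's round(days/365.25) idea.
df["age_years"] = (df["days"] / 365.25).round()

print(df)
```

The day counts should match the 'diff' column shown in the session above (4491 and 3244 days). Dividing by 365.25 is the asker's approximation, not an exact calendar age.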