Below is my table, with the Date column stored as object dtype.
I want to convert it to string dtype and split the Date column into day, month and year.
I have tried many ways but had no luck.
Can someone help with this?
Date x y z a b
09.05.2013 4 31 12472 199.0 1.0
25.12.2013 11 26 1856 1699.0 1.0
18.11.2014 22 25 15263 699.0 1.0
05.03.2015 26 28 5037 2599.0 1.0
14.10.2015 33 6 17270 199.0 1.0
If you are using a pandas DataFrame, you could do as follows.
df['Date'] = df['Date'].astype(str)
df['Day'] = df['Date'].str[0:2]
df['Month'] = df['Date'].str[3:5]
df['Year'] = df['Date'].str[6:]
df
which gives you the following output.
Date x y z a b Day Month Year
0 09.05.2013 4 31 12472 199.0 1.0 09 05 2013
1 25.12.2013 11 26 1856 1699.0 1.0 25 12 2013
2 18.11.2014 22 25 15263 699.0 1.0 18 11 2014
3 05.03.2015 26 28 5037 2599.0 1.0 05 03 2015
4 14.10.2015 33 6 17270 199.0 1.0 14 10 2015
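If you'd rather avoid fixed-position slicing, a sketch of an alternative using str.split (assuming the same '.'-separated format as above) also tolerates single-digit days or months:

```python
import pandas as pd

# Small frame mirroring the question's Date column (object dtype)
df = pd.DataFrame({'Date': ['09.05.2013', '25.12.2013', '18.11.2014']})

# Split once on '.' and expand the pieces into three new columns
df[['Day', 'Month', 'Year']] = df['Date'].str.split('.', expand=True)
print(df)
```

Both approaches keep the new columns as strings; convert with astype(int) if you need numbers.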
Related
I have the following dataframe:
rate year month week day pct_day
1973-01-02 8.02 1973 1 1 2 NaN
1973-01-03 8.02 1973 1 1 3 0.000000
1973-01-04 8.00 1973 1 1 4 -0.002494
1973-01-05 8.01 1973 1 1 5 0.001250
1973-01-08 8.00 1973 1 2 8 -0.001248
... ... ... ... ... ...
2020-05-22 75.99 2020 5 21 22 0.004760
2020-05-26 75.43 2020 5 22 26 -0.007369
2020-05-27 75.88 2020 5 22 27 0.005966
2020-05-28 75.67 2020 5 22 28 -0.002768
2020-05-29 75.59 2020 5 22 29 -0.001057
How can I remove dates earlier than 1998-09-09? To do this I have done:
date1 = date.datetime(2008, 1, 1)
date1 = date1.strftime('%Y-%m-%d')
data[pd.to_datetime(data.index) >= pd.to_datetime('date1')]
but after the last line of code I am getting:
ParserError: Unknown string format: date1
data[pd.to_datetime(data.index) >= pd.to_datetime('date1')]
should be something like
data[pd.to_datetime(data.index) >= pd.to_datetime(date1)]
since date1 is a variable you've defined, but you are passing the string literal 'date1' instead.
Alternatively, pandas has a query system built in that allows you to do things like
data_less_than_data = data.query("index >= 1998-09-09")
my syntax for querying the index might be off, but that's the basic idea.
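To make the query version concrete, the date literal needs quotes inside the query string; here is a sketch with hypothetical sample data, assuming the index is already a DatetimeIndex:

```python
import pandas as pd

# Hypothetical frame with a DatetimeIndex like the question's
data = pd.DataFrame(
    {'rate': [8.02, 8.00, 75.99]},
    index=pd.to_datetime(['1973-01-02', '1998-09-09', '2020-05-22']))

# Quote the date inside the query string; pandas parses it as a datetime
filtered = data.query("index >= '1998-09-09'")
print(filtered)
```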
date1 is a string. Instead try passing pd.to_datetime(date1), or just pd.to_datetime(df['date1']) if you create a column called date1.
I have a csv file in the format:
20 05 2019 12:00:00, 100
21 05 2019 12:00:00, 200
22 05 2019 12:00:00, 480
And I want to access the second variable. I've tried a variety of different alterations but none have worked.
Initially I tried:
import pandas as pd
import numpy as np
col = [i for i in range(2)]
col[1] = "Power"
data = pd.read_csv('FILENAME.csv', names=col)
df1 = data.sum(data, axis=1)
df2 = np.cumsum(df1)
print(df2)
You can use the cumsum function:
data['Power'].cumsum()
Output:
0 100
1 300
2 780
Name: Power, dtype: int64
Use df.cumsum:
In [1820]: df = pd.read_csv('FILENAME.csv', names=col)
In [1821]: df
Out[1821]:
0 Power
0 20 05 2019 12:00:00 100
1 21 05 2019 12:00:00 200
2 22 05 2019 12:00:00 480
In [1823]: df['cumulative sum'] = df['Power'].cumsum()
In [1824]: df
Out[1824]:
0 Power cumulative sum
0 20 05 2019 12:00:00 100 100
1 21 05 2019 12:00:00 200 300
2 22 05 2019 12:00:00 480 780
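If you also want the first column as real datetimes rather than strings, the same running total works after parsing it explicitly; a sketch, using an inline stand-in for the CSV file and a hypothetical 'Timestamp' column name:

```python
import io
import pandas as pd

# Inline stand-in for FILENAME.csv
csv_text = """20 05 2019 12:00:00, 100
21 05 2019 12:00:00, 200
22 05 2019 12:00:00, 480
"""

# Name the columns, then parse the first one with its exact format
df = pd.read_csv(io.StringIO(csv_text), names=['Timestamp', 'Power'])
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%d %m %Y %H:%M:%S')
df['cumulative sum'] = df['Power'].cumsum()
print(df)
```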
Code to generate a random database for the question (minimal reproducible example):
df_random = pd.DataFrame(np.random.random((2000,3)))
df_random['order_date'] = pd.date_range(start='1/1/2015',
periods=len(df_random), freq='D')
df_random['customer_id'] = np.random.randint(1, 20, df_random.shape[0])
df_random
Output df_random
0 1 2 order_date customer_id
0 0.018473 0.970257 0.605428 2015-01-01 12
... ... ... ... ... ...
1999 0.800139 0.746605 0.551530 2020-06-22 11
Code to extract the mean of unique transactions, month- and year-wise:
for y in (2015,2019):
    for x in (1,13):
        df2 = df_random[(df_random['order_date'].dt.month == x)&(df_random['order_date'].dt.year== y)]
        df2.sort_values(['customer_id','order_date'],inplace=True)
        df2["days"] = df2.groupby("customer_id")["order_date"].apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D"))
        df_mean=round(df2['days'].mean(),2)
        data2 = data.append(pd.DataFrame({'Mean': df_mean , 'Month': x, 'Year': y}, index=[0]), ignore_index=True)
print(data2)
Expected output
Mean Month Year
0 5.00 1 2015
.......................
11 6.62 12 2015
..............Mean values of days after which one transaction occurs in order_date for years 2016 and 2017 Jan to Dec
36 6.03 1 2018
..........................
47 6.76 12 2018
48 8.40 1 2019
.......................
48 8.40 12 2019
Basically I want a single dataframe from January 2015 through December 2019.
Instead of the expected output, I am getting a dataframe from Jan 2015 to Dec 2018, then Jan 2015 data again, and then the entire dataset repeats from 2015 to 2018 many more times.
Please help.
Try this:
data2 = pd.DataFrame([])
for y in range(2015,2020):
    for x in range(1,13):
        df2 = df_random[(df_random['order_date'].dt.month == x)&(df_random['order_date'].dt.year== y)]
        df_mean=df2.groupby("customer_id")["order_date"].apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D")).mean().round(2)
        data2 = data2.append(pd.DataFrame({'Mean': df_mean , 'Month': x, 'Year': y}, index=[0]), ignore_index=True)
print(data2)
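A side note: DataFrame.append was removed in pandas 2.0, so on recent versions the same accumulation pattern can be written by collecting the per-month frames in a list and concatenating once. A minimal sketch of the pattern, with a placeholder value standing in for the computed mean:

```python
import pandas as pd

# Collect one small frame per (year, month), then concatenate at the end
rows = []
for y in range(2015, 2017):        # shortened range, for illustration only
    for x in range(1, 3):
        rows.append(pd.DataFrame({'Mean': [0.0], 'Month': [x], 'Year': [y]}))
data2 = pd.concat(rows, ignore_index=True)
print(data2)
```

Concatenating once at the end is also faster than appending inside the loop, since each append copies the whole accumulated frame.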
Try this :
df_random.order_date = pd.to_datetime(df_random.order_date)
df_random = df_random.set_index(pd.DatetimeIndex(df_random['order_date']))
output = df_random.groupby(pd.Grouper(freq="M"))[[0,1,2]].agg(np.mean).reset_index()
output['month'] = output.order_date.dt.month
output['year'] = output.order_date.dt.year
output = output.drop('order_date', axis=1)
output
Output
0 1 2 month year
0 0.494818 0.476514 0.496059 1 2015
1 0.451611 0.437638 0.536607 2 2015
2 0.476262 0.567519 0.528129 3 2015
3 0.519229 0.475887 0.612433 4 2015
4 0.464781 0.430593 0.445455 5 2015
... ... ... ... ... ...
61 0.416540 0.564928 0.444234 2 2020
62 0.553787 0.423576 0.422580 3 2020
63 0.524872 0.470346 0.560194 4 2020
64 0.530440 0.469957 0.566077 5 2020
65 0.584474 0.487195 0.557567 6 2020
Avoid any looping and simply include year and month in groupby calculation:
np.random.seed(1022020)
...
# ASSIGN MONTH AND YEAR COLUMNS, THEN SORT COLUMNS
df_random = (df_random.assign(month = lambda x: x['order_date'].dt.month,
                              year = lambda x: x['order_date'].dt.year)
                      .sort_values(['customer_id', 'order_date']))
# GROUP BY CALCULATION
df_random["days"] = (df_random.groupby(["customer_id", "year", "month"])["order_date"]
                              .apply(lambda x: (x - x.shift()) / np.timedelta64(1, "D")))
# FINAL MEAN AGGREGATION BY YEAR AND MONTH
final_df = (df_random.groupby(["year", "month"], as_index=False)["days"].mean().round(2)
                     .rename(columns={"days":"mean"}))
print(final_df.head())
# year month mean
# 0 2015 1 8.43
# 1 2015 2 5.87
# 2 2015 3 4.88
# 3 2015 4 10.43
# 4 2015 5 8.12
print(final_df.tail())
# year month mean
# 61 2020 2 8.27
# 62 2020 3 8.41
# 63 2020 4 8.81
# 64 2020 5 9.12
# 65 2020 6 7.00
For multiple aggregates, replace the single groupby.mean() with groupby.agg():
final_df = (df_random.groupby(["year", "month"])["days"]
                     .agg(['count', 'min', 'mean', 'median', 'max']))
print(final_df.head())
# count min mean median max
# year month
# 2015 1 14 1.0 8.43 5.0 25.0
# 2 15 1.0 5.87 5.0 17.0
# 3 16 1.0 4.88 5.0 9.0
# 4 14 1.0 10.43 7.5 23.0
# 5 17 2.0 8.12 8.0 17.0
print(final_df.tail())
# count min mean median max
# year month
# 2020 2 15 1.0 8.27 6.0 21.0
# 3 17 1.0 8.41 7.0 16.0
# 4 16 1.0 8.81 7.0 20.0
# 5 16 1.0 9.12 7.0 22.0
# 6 7 2.0 7.00 7.0 17.0
I have a pandas column like this :
yrmnt
--------
2015 03
2015 03
2013 08
2015 08
2014 09
2015 10
2016 02
2015 11
2015 11
2015 11
2017 02
How do I fetch the lowest year-month combination (2013 08) and the highest (2017 02),
and find the difference in months between the two, i.e. 42?
You can convert the column with to_datetime and then find the indices of the max and min values with idxmax and idxmin:
a = pd.to_datetime(df['yrmnt'], format='%Y %m')
print (a)
0 2015-03-01
1 2015-03-01
2 2013-08-01
3 2015-08-01
4 2014-09-01
5 2015-10-01
6 2016-02-01
7 2015-11-01
8 2015-11-01
9 2015-11-01
10 2017-02-01
Name: yrmnt, dtype: datetime64[ns]
print (df.loc[a.idxmax(), 'yrmnt'])
2017 02
print (df.loc[a.idxmin(), 'yrmnt'])
2013 08
Difference in months:
b = a.dt.to_period('M')
d = b.max() - b.min()
print (d)
42
Another solution works only with the month periods created by Series.dt.to_period:
b = pd.to_datetime(df['yrmnt'], format='%Y %m').dt.to_period('M')
print (b)
0 2015-03
1 2015-03
2 2013-08
3 2015-08
4 2014-09
5 2015-10
6 2016-02
7 2015-11
8 2015-11
9 2015-11
10 2017-02
Name: yrmnt, dtype: object
Then convert the minimal and maximal values to a custom format with Period.strftime:
min_d = b.min().strftime('%Y %m')
print (min_d)
2013 08
max_d = b.max().strftime('%Y %m')
print (max_d)
2017 02
And subtract for difference:
d = b.max() - b.min()
print (d)
42
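One caveat worth noting: on newer pandas versions, subtracting two monthly Periods returns a DateOffset (printed as something like <42 * MonthEnds>) rather than a bare integer; the integer is available from the offset's .n attribute. A sketch that works either way:

```python
import pandas as pd

# Rebuild the month periods from the two endpoint values
b = pd.to_datetime(pd.Series(['2013 08', '2017 02']),
                   format='%Y %m').dt.to_period('M')

diff = b.max() - b.min()          # int on old pandas, DateOffset on recent ones
months = int(getattr(diff, 'n', diff))
print(months)
```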
I have a Data frame like this
Temp_in_C Temp_in_F Date Year Month Day
23 65 2011-12-12 2011 12 12
12 72 2011-12-12 2011 12 12
NaN 67 2011-12-12 2011 12 12
0 0 2011-12-12 2011 12 12
7 55 2011-12-13 2011 12 13
I am trying to get output in this format (the NaN and zero values of a particular day are replaced by the average temperature of that day only).
Output will be
Temp_in_C Temp_in_F Date Year Month Day
23 65 2011-12-12 2011 12 12
12 72 2011-12-12 2011 12 12
17.5 67 2011-12-12 2011 12 12
17.5 68 2011-12-12 2011 12 12
7 55 2011-12-13 2011 12 13
These values should be replaced by the mean of that particular day. I am trying to do this:
temp_df = csv_data_df[csv_data_df["Temp_in_C"] != 0]
temp_df["Temp_in_C"] = temp_df["Temp_in_C"].replace('*', np.nan)
x = temp_df["Temp_in_C"].mean()
csv_data_df["Temp_in_C"] = csv_data_df["Temp_in_C"].replace(0.0, x)
csv_data_df["Temp_in_C"] = csv_data_df["Temp_in_C"].fillna(x)
This code takes the mean of the whole column and replaces values with it directly.
How can I group by day, take the mean, and then replace values for that particular day only?
First, replace zeros with NaN:
df = df.replace(0,np.nan)
Then fill the missing values using transform (see this post)
df.groupby('Date').transform(lambda x: x.fillna(x.mean()))
Gives:
Temp_in_C Temp_in_F Year Month Day
0 23.0 65.0 2011 12 12
1 12.0 72.0 2011 12 12
2 17.5 67.0 2011 12 12
3 17.5 68.0 2011 12 12
4 7.0 55.0 2011 12 13
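If you only want to fill Temp_in_C and keep the Date column intact (the bare transform above drops the grouping column from its output), a sketch that transforms just that one column and assigns it back:

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question's frame
df = pd.DataFrame({
    'Temp_in_C': [23.0, 12.0, np.nan, 0.0, 7.0],
    'Date': ['2011-12-12'] * 4 + ['2011-12-13'],
})

# Treat zeros as missing, then fill with each day's mean
df['Temp_in_C'] = df['Temp_in_C'].replace(0, np.nan)
df['Temp_in_C'] = df.groupby('Date')['Temp_in_C'].transform(
    lambda x: x.fillna(x.mean()))
print(df)
```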