For example I have the following map:
{'df1': Jan Feb Mar
          1   3   5
          2   4   6
 'df2': Jan Feb Mar
          7   9  11
          8  10  12
 ...}
And I want the following output:
Jan 1
Jan 2
Feb 3
Feb 4
Mar 5
Mar 6
Jan 7
Jan 8
Feb 9
Feb 10
Mar 11
Mar 12
Does anyone know if it's possible to do it this way?
What I have tried is to iterate through the DataFrames to get
{'df1': Jan 1
Jan 2
Feb 3
Feb 4
Mar 5
Mar 6
'df2': Jan 7
Jan 8
Feb 9
Feb 10
Mar 11
Mar 12
by using
for x in dfMap:
    df = pd.melt(list(x.values()))
Then I tried to concat it with
df1m = pd.concat(df.values(), ignore_index=True)
which gave me the error:
AttributeError: 'list' object has no attribute 'columns'
I am fairly new to programming and really want to learn. It would be nice if someone could explain how this works, and why a list or dict_values object has no attribute 'columns'.
Thanks in advance!
You can concat and stack:
out = pd.concat(d.values()).stack().droplevel(0)
Or:
out = pd.concat(d.values()).melt()
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 10).reshape(-1, 3), columns=['Jan', 'Feb', 'Mar'])
d = {}
for e, i in df.iterrows():
    d[f"df{e+1}"] = i.to_frame().T
print(d, '\n')

out = pd.concat(d.values()).stack().droplevel(0)
print(out)
{'df1': Jan Feb Mar
0 1 2 3, 'df2': Jan Feb Mar
1 4 5 6, 'df3': Jan Feb Mar
2 7 8 9}
Jan 1
Feb 2
Mar 3
Jan 4
Feb 5
Mar 6
Jan 7
Feb 8
Mar 9
dtype: int32
With melt:
out = pd.concat(d.values()).melt()
print(out)
variable value
0 Jan 1
1 Jan 4
2 Jan 7
3 Feb 2
4 Feb 5
5 Feb 8
6 Mar 3
7 Mar 6
8 Mar 9
EDIT: for the edited question, try:
out = pd.concat(d).stack().sort_index(level=[0,-1]).droplevel([0,1])
Example below:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 13).reshape(3, -1).T, columns=['Jan', 'Feb', 'Mar'])
d = {}
for e, i in df.groupby(df.index // 2):
    d[f"df{e+1}"] = i
print(d, '\n')

out = pd.concat(d).stack().sort_index(level=[0, -1]).droplevel([0, 1])
print(out)
{'df1': Jan Feb Mar
0 1 5 9
1 2 6 10, 'df2': Jan Feb Mar
2 3 7 11
3 4 8 12}
Jan 1
Jan 2
Feb 5
Feb 6
Mar 9
Mar 10
Jan 3
Jan 4
Feb 7
Feb 8
Mar 11
Mar 12
dtype: int32
Or you can convert the DataFrame names to int and then sort:
out = (pd.concat(d.values(), keys=[int(key[2:]) for key in d.keys()])
         .stack().sort_index(level=[0, -1]).droplevel([0, 1]))
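As for the AttributeError in the question: pd.melt operates on a single DataFrame and reads its .columns attribute, which a plain list does not have; pd.concat is the function that takes an iterable of DataFrames. A minimal sketch with made-up frames:

```python
import pandas as pd

df1 = pd.DataFrame({"Jan": [1, 2], "Feb": [3, 4]})
df2 = pd.DataFrame({"Jan": [7, 8], "Feb": [9, 10]})
d = {"df1": df1, "df2": df2}

# pd.melt expects one DataFrame; handing it a list fails as soon as it
# tries to access .columns on its argument
try:
    pd.melt(list(d.values()))
except AttributeError as e:
    print(e)

# pd.concat, by contrast, accepts an iterable of DataFrames
combined = pd.concat(d.values(), ignore_index=True)
print(combined.shape)
```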
I have dates as below:
date
0 Today, 12 Mar
1 Tomorrow, 13 Mar
2 Tomorrow, 13 Mar
3 Tomorrow, 13 Mar
4 Tomorrow, 13 Mar
5 14 Mar 2021
6 14 Mar 2021
7 14 Mar 2021
8 14 Mar 2021
9 15 Mar 2021
How do I parse it as datetime in pandas?
Your date column contains 'Today' and 'Tomorrow', which are not a valid datetime format, so first replace them with the year 2021 (assuming the year is fixed, i.e. 2021):
df['date'] = df['date'].str.replace('Today', '2021')
df['date'] = df['date'].str.replace('Tomorrow', '2021')
Now just use the to_datetime() method:
df['date'] = pd.to_datetime(df['date'])
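A minimal end-to-end sketch of this replace-then-parse idea (sample strings taken from the question). Since the replaced strings ("2021, 12 Mar") and the already-dated ones ("14 Mar 2021") have different layouts, this sketch parses elementwise so each string's format is inferred on its own:

```python
import pandas as pd

df = pd.DataFrame({"date": ["Today, 12 Mar", "Tomorrow, 13 Mar", "14 Mar 2021"]})

# Replace the relative words with the (assumed fixed) year 2021
df["date"] = (df["date"].str.replace("Today", "2021")
                        .str.replace("Tomorrow", "2021"))

# Parse each string individually, since the two layouts differ
df["date"] = df["date"].apply(pd.to_datetime)
print(df)
```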
I'm trying to sort grouped data using Pandas.
My code :
df = pd.read_csv("./data3.txt")
grouped = df.groupby(['cust','year','month'])['price'].count()
print(grouped)
My data:
cust,year,month,price
astor,2015,Jan,100
astor,2015,Jan,122
astor,2015,Feb,200
astor,2016,Feb,234
astor,2016,Feb,135
astor,2016,Mar,169
astor,2017,Mar,321
astor,2017,Apr,245
tor,2015,Jan,100
tor,2015,Feb,122
tor,2015,Feb,200
tor,2016,Mar,234
tor,2016,Apr,135
tor,2016,May,169
tor,2017,Mar,321
tor,2017,Apr,245
This is my result.
cust year month
astor 2015 Feb 1
Jan 2
2016 Feb 2
Mar 1
2017 Apr 1
Mar 1
tor 2015 Feb 2
Jan 1
2016 Apr 1
Mar 1
May 1
2017 Apr 1
Mar 1
How to get output sorted by month?
Add parameter sort=False to groupby:
grouped = df.groupby(['cust','year','month'], sort=False)['price'].count()
print (grouped)
cust year month
astor 2015 Jan 2
Feb 1
2016 Feb 2
Mar 1
2017 Mar 1
Apr 1
tor 2015 Jan 1
Feb 2
2016 Mar 1
Apr 1
May 1
2017 Mar 1
Apr 1
Name: price, dtype: int64
If it's not possible to use the first solution, you can convert the months to datetimes and convert them back at the end:
df['month'] = pd.to_datetime(df['month'], format='%b')
f = lambda x: x.strftime('%b')
grouped = df.groupby(['cust','year','month'])['price'].count().rename(f, level=2)
print (grouped)
cust year month
astor 2015 Jan 2
Feb 1
2016 Feb 2
Mar 1
2017 Mar 1
Apr 1
tor 2015 Jan 1
Feb 2
2016 Mar 1
Apr 1
May 1
2017 Mar 1
Apr 1
Name: price, dtype: int64
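A further option, not used in the answer above (sketched here on a cut-down version of the data): store month as an ordered Categorical, so groupby's default sorting follows calendar order instead of alphabetical order:

```python
import pandas as pd

df = pd.DataFrame({
    "cust": ["astor"] * 4,
    "year": [2015, 2015, 2016, 2016],
    "month": ["Feb", "Jan", "Feb", "Mar"],
    "price": [200, 100, 234, 169],
})

order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df["month"] = pd.Categorical(df["month"], categories=order, ordered=True)

# observed=True keeps only the month values that actually occur;
# within each group the months come out in calendar order
grouped = df.groupby(["cust", "year", "month"], observed=True)["price"].count()
print(grouped)
```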
I have two pandas dataframes:
df1=pd.DataFrame({'month':['jun', 'jul', 'aug'],'a':[3,4,5], 'b':[2,3,4], 'c':[4,5,5]}).set_index('month')
a b c
month
jun 3 2 4
jul 4 3 5
aug 5 4 5
and
df2=pd.DataFrame({'year':[2009,2009,2009, 2010,2010,2010,2011,2011,2011],'month':['jun', 'jul', 'aug','jun', 'jul', 'aug','jun', 'jul', 'aug'],'a':[2,2,2,2,2,2,2,2,2], 'b':[1,2,3,4,5,6,7,8,9], 'c':[3,3,3,3,3,3,3,3,3]}).set_index('year')
month a b c
year
2009 jun 2 1 3
2009 jul 2 2 3
2009 aug 2 3 3
2010 jun 2 4 3
2010 jul 2 5 3
2010 aug 2 6 3
2011 jun 2 7 3
2011 jul 2 8 3
2011 aug 2 9 3
I would like to multiply df2's elements with df1's according to the months. Is there a quick way to do it?
Thanks in advance.
Use DataFrame.mul with months converted to a MultiIndex by DataFrame.set_index:
df = df2.set_index('month', append=True).mul(df1, level=1).reset_index(level=1)
print (df)
month a b c
year
2009 jun 6 2 12
2009 jul 8 6 15
2009 aug 10 12 15
2010 jun 6 8 12
2010 jul 8 15 15
2010 aug 10 24 15
2011 jun 6 14 12
2011 jul 8 24 15
2011 aug 10 36 15
I'm certain that this has been asked and answered, but I'm too stupid to find it. I've got a file that has the form:
StationID, Year, JanValue, FebValue, MarValue, AprilValue,...,DecValue
and I want to convert it from a short fat file with 12 months in each row to a long skinny file with only StationID, Date, Value, Year, Month.
I put together code to do it, and it works. It takes in a pandas dataframe as input and outputs a dataframe. But it's slow and I'm certain I'm doing it wildly inefficiently. Any help would be appreciated.
def long_skinny(df):
    # df is a pandas dataframe
    # get min and max year from dataframe
    min_year = df['year'].min()
    max_year = df['year'].max()
    # set startdate to Jan. 1st of the first year.
    startdate = str(min_year) + "0101"
    # final file will have this many periods
    num_periods = ((max_year - min_year) + 1) * 12
    # generate a pandas dataframe with a datetime index
    dates = pandas.date_range(start=startdate, periods=num_periods, freq='M')
    # set up an empty list
    tmps = []
    # find years that are in the input dataframe
    avail_years = df['year'].tolist()
    id_tmp = df['id']
    for iyear in range(min_year, max_year + 1):
        # check to see if year is in the original file
        if iyear in avail_years:
            year_rec = df[(df['year'] == iyear)]
            tmps.append(int(year_rec['tmp1']))
            tmps.append(int(year_rec['tmp2']))
            tmps.append(int(year_rec['tmp3']))
            tmps.append(int(year_rec['tmp4']))
            tmps.append(int(year_rec['tmp5']))
            tmps.append(int(year_rec['tmp6']))
            tmps.append(int(year_rec['tmp7']))
            tmps.append(int(year_rec['tmp8']))
            tmps.append(int(year_rec['tmp9']))
            tmps.append(int(year_rec['tmp10']))
            tmps.append(int(year_rec['tmp11']))
            tmps.append(int(year_rec['tmp12']))
        else:
            # one missing-value flag per month
            tmps.extend([-9999] * 12)
    tmps_np = np.asarray(tmps, dtype=np.int64)
    var_names = ["temp"]
    ls_df = pandas.DataFrame(tmps_np, index=dates, columns=var_names)
    # add two fields for the year and month
    ls_df['year'] = ls_df.index.year
    ls_df['month'] = ls_df.index.month
    ls_df['id'] = id_tmp
    return ls_df
With an assumed example
StationID,Year,JanValue,FebValue,MarValue,AprValue,DecValue
A,2017,1,2,8,4,5
B,2017,1,2,8,4,5
A,2018,1,2,3,4,5
B,2018,1,2,3,4,5
The code would look like this
df = df.melt(id_vars=['StationID', 'Year'], var_name='Month', value_vars=['JanValue','FebValue','MarValue','AprValue','DecValue'])
after which you can fix the month names with
df['Month'] = df['Month'].str.replace('Value','')
Result
StationID Year Month value
0 A 2017 Jan 1
1 B 2017 Jan 1
2 A 2018 Jan 1
3 B 2018 Jan 1
4 A 2017 Feb 2
5 B 2017 Feb 2
6 A 2018 Feb 2
7 B 2018 Feb 2
8 A 2017 Mar 8
9 B 2017 Mar 8
10 A 2018 Mar 3
11 B 2018 Mar 3
12 A 2017 Apr 4
13 B 2017 Apr 4
14 A 2018 Apr 4
15 B 2018 Apr 4
16 A 2017 Dec 5
17 B 2017 Dec 5
18 A 2018 Dec 5
19 B 2018 Dec 5
So the only thing left is to sort the lines in the way you want them sorted.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df['Month'] = pd.Categorical(df['Month'], categories=months, ordered=True)
df.sort_values(['StationID','Year','Month'], inplace=True)
For the result
StationID Year Month value
0 A 2017 Jan 1
4 A 2017 Feb 2
8 A 2017 Mar 8
12 A 2017 Apr 4
16 A 2017 Dec 5
2 A 2018 Jan 1
6 A 2018 Feb 2
10 A 2018 Mar 3
14 A 2018 Apr 4
18 A 2018 Dec 5
1 B 2017 Jan 1
5 B 2017 Feb 2
9 B 2017 Mar 8
13 B 2017 Apr 4
17 B 2017 Dec 5
3 B 2018 Jan 1
7 B 2018 Feb 2
11 B 2018 Mar 3
15 B 2018 Apr 4
19 B 2018 Dec 5
Oh man that seems like a lot of work I wouldn't do.
df = df.melt(id_vars=("StationID", "Year"), var_name="Month", value_name="Value")
Then you can replace the variable names with months using something like:
df["Month"] = df["Month"].str.replace(...)
Pack up the dates however you want into:
df["Date"] = pd.to_datetime(...)
Etc. I'd be more specific, but without an example of your actual data this is the best I can do...
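For instance (purely an assumed construction, since the answer above leaves the to_datetime call open): the Date column could be built by combining Year with the month abbreviation and parsing with an explicit format:

```python
import pandas as pd

# Hypothetical long-format data, as it might look after the melt step
long_df = pd.DataFrame({
    "StationID": ["A", "A"],
    "Year": [2017, 2017],
    "Month": ["Jan", "Feb"],
    "Value": [1, 2],
})

# "2017-Jan" parsed with %Y-%b lands on the first day of each month
long_df["Date"] = pd.to_datetime(
    long_df["Year"].astype(str) + "-" + long_df["Month"], format="%Y-%b"
)
print(long_df)
```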
I have the following dataframe:
Year Month Booked
0 2016 Aug 55999.0
6 2017 Aug 60862.0
1 2016 Jul 54062.0
7 2017 Jul 58417.0
2 2016 Jun 42044.0
8 2017 Jun 48767.0
3 2016 May 39676.0
9 2017 May 40986.0
4 2016 Oct 39593.0
10 2017 Oct 41439.0
5 2016 Sep 49677.0
11 2017 Sep 53969.0
I want to obtain the percentage change with respect to the same month from last year. I have tried the following code:
df['pct_ch'] = df.groupby(['Month','Year'])['Booked'].pct_change()
but I get the following, which is not at all what I want:
Year Month Booked pct_ch
0 2016 Aug 55999.0 NaN
6 2017 Aug 60862.0 0.086841
1 2016 Jul 54062.0 -0.111728
7 2017 Jul 58417.0 0.080556
2 2016 Jun 42044.0 -0.280278
8 2017 Jun 48767.0 0.159904
3 2016 May 39676.0 -0.186417
9 2017 May 40986.0 0.033017
4 2016 Oct 39593.0 -0.033987
10 2017 Oct 41439.0 0.046624
5 2016 Sep 49677.0 0.198798
11 2017 Sep 53969.0 0.086398
Do not groupby Year, otherwise you won't get, for instance, Aug 2017 and Aug 2016 together. Also, use transform to broadcast the results back to the original indices.
Try:
df['pct_ch'] = df.groupby(['Month'])['Booked'].transform(lambda s: s.pct_change())
Year Month Booked pct_ch
0 2016 Aug 55999.0 NaN
6 2017 Aug 60862.0 0.086841
1 2016 Jul 54062.0 NaN
7 2017 Jul 58417.0 0.080556
2 2016 Jun 42044.0 NaN
8 2017 Jun 48767.0 0.159904
3 2016 May 39676.0 NaN
9 2017 May 40986.0 0.033017
4 2016 Oct 39593.0 NaN
10 2017 Oct 41439.0 0.046624
5 2016 Sep 49677.0 NaN
11 2017 Sep 53969.0 0.086398
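As a cross-check on a subset of the same numbers, a pivot-based alternative: reshape to a Year-by-Month table and take pct_change down the rows. Unlike the groupby version, this does not depend on each month's rows already appearing in year order:

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [2016, 2017, 2016, 2017],
    "Month": ["Aug", "Aug", "Jul", "Jul"],
    "Booked": [55999.0, 60862.0, 54062.0, 58417.0],
})

# One column per month, one row per year, then year-over-year change
wide = df.pivot(index="Year", columns="Month", values="Booked")
yoy = wide.pct_change()
print(yoy)
```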