I'm trying to sort a grouped data using Pandas
My code :
df = pd.read_csv("./data3.txt")
grouped = df.groupby(['cust','year','month'])['price'].count()
print(grouped)
My data:
cust,year,month,price
astor,2015,Jan,100
astor,2015,Jan,122
astor,2015,Feb,200
astor,2016,Feb,234
astor,2016,Feb,135
astor,2016,Mar,169
astor,2017,Mar,321
astor,2017,Apr,245
tor,2015,Jan,100
tor,2015,Feb,122
tor,2015,Feb,200
tor,2016,Mar,234
tor,2016,Apr,135
tor,2016,May,169
tor,2017,Mar,321
tor,2017,Apr,245
This is my result.
cust year month
astor 2015 Feb 1
Jan 2
2016 Feb 2
Mar 1
2017 Apr 1
Mar 1
tor 2015 Feb 2
Jan 1
2016 Apr 1
Mar 1
May 1
2017 Apr 1
Mar 1
How to get output sorted by month?
Add parameter sort=False to groupby:
grouped = df.groupby(['cust','year','month'], sort=False)['price'].count()
print (grouped)
cust year month
astor 2015 Jan 2
Feb 1
2016 Feb 2
Mar 1
2017 Mar 1
Apr 1
tor 2015 Jan 1
Feb 2
2016 Mar 1
Apr 1
May 1
2017 Mar 1
Apr 1
Name: price, dtype: int64
If not possible use first solution is possible convert months to datetimes and last convert back:
df['month'] = pd.to_datetime(df['month'], format='%b')
f = lambda x: x.strftime('%b')
grouped = df.groupby(['cust','year','month'])['price'].count().rename(f, level=2)
print (grouped)
cust year month
astor 2015 Jan 2
Feb 1
2016 Feb 2
Mar 1
2017 Mar 1
Apr 1
tor 2015 Jan 1
Feb 2
2016 Mar 1
Apr 1
May 1
2017 Mar 1
Apr 1
Name: price, dtype: int64
Related
I want to compare the two columns (date1 and date2) for same ID and set the value of column NewColumn to 'Yes' if date1 matches with the previous date2.
INPUT:
ID
Date1
date2
NewColumn
1
31 Jan 2022
1 Feb 2022
1
1 Feb 2022
2 Feb 2022
1
7 Feb 2022
8 Feb 2022
2
2 Feb 2022
2 Feb 2022
3
2 Feb 2022
3 Feb 2022
Input in CSV format:
ID,date1,date2,NewColumn
1,31/01/2022,01/02/2022,
1,01/02/2022,02/02/2022,
1,07/02/2022,08/02/2022,
2,02/02/2022,02/02/2022,
3,02/02/2022,03/02/2022,
Output:
ID
date1
date2
NewColumn
1
31 Jan 2022
1 Feb 2022
1
1 Feb 2022
2 Feb 2022
YES
1
7 Feb 2022
8 Feb 2022
2
2 Feb 2022
2 Feb 2022
3
2 Feb 2022
3 Feb 2022
In CSV format:
ID,date1,date2,NewColumn
1,31/01/2022,01/02/2022,
1,01/02/2022,02/02/2022, YES
1,07/02/2022,08/02/2022,
2,02/02/2022,02/02/2022,
3,02/02/2022,03/02/2022,
You can use groupby and apply to apply a custom function to each group. The function then needs to compare the date1 with the previous row's date2 which can be done using shift. This will give a boolean value (True or False), to get a string value you can use np.where. For example:
import numpy as np
def func(x):
return x['date1'] == x['date2'].shift(1)
df['NewColumn'] = np.where(df.groupby('ID').apply(func), 'YES', '')
Result:
ID date1 date2 NewColumn
0 1 31/01/2022 01/02/2022
1 1 01/02/2022 02/02/2022 YES
2 1 07/02/2022 08/02/2022
3 2 02/02/2022 02/02/2022
4 3 02/02/2022 03/02/2022
For example I have the following map:
{'df1': Jan Feb Mar
1 3 5
2 4 6
'df2': Jan Feb Mar
7 9 11
8 10 12
......}
And I want the following output:
Jan 1
Jan 2
Feb 3
Feb 4
Mar 5
Mar 6
Jan 7
Jan 8
Feb 9
Feb 10
Mar 11
Mar 12
Does anyone knows if its possible to do it this way?
What I have tried is to iterate through DataFrames to try getting
{'df1': Jan 1
Jan 2
Feb 3
Feb 4
Mar 5
Mar 6
'df2': Jan 7
Jan 8
Feb 9
Feb 10
Mar 11
Mar 12
by using
for x in dfMap:
df = pd.melt(list(x.values()))
Then try to concat it with df1m =
pd.concat(df.values(), ignore_index=True)
Which gave me error
AttributeError: 'list' object has no attribute 'columns'
I am fairly new to programming and really wanted to learn, will be nice if anyone can explain how this works, and why list or dict_values object has no attribute 'columns'.
Thanks in advance!
You can concat and stack:
out = pd.concat(d.values()).stack().droplevel(0)
Or:
out = pd.concat(d.values()).melt()
Example:
df = pd.DataFrame(np.arange(1,10).reshape(-1,3),columns=['Jan','Feb','Mar'])
d = {}
for e,i in df.iterrows():
d[f"df{e+1}"] = i.to_frame().T
print(d,'\n')
out = pd.concat(d.values()).stack().droplevel(0)
print(out)
{'df1': Jan Feb Mar
0 1 2 3, 'df2': Jan Feb Mar
1 4 5 6, 'df3': Jan Feb Mar
2 7 8 9}
Jan 1
Feb 2
Mar 3
Jan 4
Feb 5
Mar 6
Jan 7
Feb 8
Mar 9
dtype: int32
With melt:
out = pd.concat(d.values()).melt()
print(out)
variable value
0 Jan 1
1 Jan 4
2 Jan 7
3 Feb 2
4 Feb 5
5 Feb 8
6 Mar 3
7 Mar 6
8 Mar 9
EDIT, for edited question , try:
out = pd.concat(d).stack().sort_index(level=[0,-1]).droplevel([0,1])
Example below:
df = pd.DataFrame(np.arange(1,13).reshape(3,-1).T,columns=['Jan','Feb','Mar'])
d = {}
for e,i in df.groupby(df.index//2):
d[f"df{e+1}"] = i
print(d,'\n')
out = pd.concat(d).stack().sort_index(level=[0,-1]).droplevel([0,1])
print(out)
{'df1': Jan Feb Mar
0 1 5 9
1 2 6 10, 'df2': Jan Feb Mar
2 3 7 11
3 4 8 12}
Jan 1
Jan 2
Feb 5
Feb 6
Mar 9
Mar 10
Jan 3
Jan 4
Feb 7
Feb 8
Mar 11
Mar 12
dtype: int32
Or you can also convert the dataframe names as int and then sort:
out = (pd.concat(d.values(),keys=[int(key[2:]) for key in d.keys()])
.stack().sort_index(level=[0,-1]).droplevel([0,1]))
Hello guys i have a data which have two cloumns so want to generate unique sequence of id for that...
This is data:
Year Month
0 2010 Jan
1 2010 Feb
2 2010 Mar
3 2010 Mar
4 2010 Mar
I want to join that service id to these two column for that i have write a code:
data['Sr_ID'] = data.groupby(['Month','Year']).ngroup()
data.head()
this give this output:
Year Month Sr_ID
0 2010 Jan 20
1 2010 Feb 15
2 2010 Mar 35
3 2010 Mar 35
4 2010 Mar 35
but i don't want "Sr_ID" like this i want to be like "Sr_0001...Sr_0002"
it should be in a sequence of number this "Sr" so for this
I want a output like this:
Year Month Sr_ID
0 2010 Jan Sr_0001
1 2010 Feb Sr_0002
2 2010 Mar Sr_0003
3 2010 Mar Sr_0004
4 2010 Mar Sr_0005
I want to generate different id for different row because I have 8 columns, with no repeated rows.
np.arange + str.zfill
You can use a range, then pad with zeros to the left:
df['Sr_ID'] = 'Sr_' + pd.Series(np.arange(1, len(df.index)+1)).astype(str).str.zfill(4)
print(df)
Year Month Sr_ID
0 2010 Jan Sr_0001
1 2010 Feb Sr_0002
2 2010 Mar Sr_0003
3 2010 Mar Sr_0004
4 2010 Mar Sr_0005
I'm certain that this has been asked and answered, but I'm too stupid to find it. I've got a file that has the form:
StationID, Year, JanValue, FebValue, MarValue, AprilValue,...,DecValue
and I want to convert it from a short fat file with 12 months in each row to a long skinny file with only StationID, Date, Value, Year, Month.
I put together code to do it, and it works. It takes in a pandas dataframe as input and outputs a dataframe. But it's slow and I'm certain I'm doing it wildly inefficiently. Any help would be appreciated.
def long_skinny(df):
# df is a pandas dataframe
# get min and max year from dataframe
min_year = df['year'].min()
max_year = df['year'].max()
# set startdate to Jan. 1st of the first year.
startdate = str(min_year) + "0101"
# final file will have this many periods
num_periods = ((max_year - min_year)+1)*12
# generate a pandas dataframe with a datetime index
dates = pandas.date_range(start=startdate ,periods=num_periods,freq = 'M' )
# set up an empty list
tmps = []
# find years that are in the input dataframe
avail_years = df['year'].tolist()
id_tmp = df['id']
for iyear in range(min_year, max_year+1):
# check to see if year is in the original file
if iyear in avail_years:
year_rec = df[(df['year'] == iyear)]
tmps.append(int(year_rec['tmp1']))
tmps.append(int(year_rec['tmp2']))
tmps.append(int(year_rec['tmp3']))
tmps.append(int(year_rec['tmp4']))
tmps.append(int(year_rec['tmp5']))
tmps.append(int(year_rec['tmp6']))
tmps.append(int(year_rec['tmp7']))
tmps.append(int(year_rec['tmp8']))
tmps.append(int(year_rec['tmp9']))
tmps.append(int(year_rec['tmp10']))
tmps.append(int(year_rec['tmp11']))
tmps.append(int(year_rec['tmp12']))
else:
tmps.append(-9999)
tmps.append(-9999)
tmps.append(-9999)
tmps.append(-9999)
tmps.append(-9999)
tmps.append(-9999)
tmps.append(-9999)
tmps.append(-9999)
tmps.append(-9999)
tmps.append(-9999)
tmps.append(-9999)
tmps.append(-9999)
tmps_np = np.asarray(tmps, dtype=np.int64)
var_names = ["temp"]
ls_df = pandas.DataFrame(tmps_np, index = dates, columns = var_names)
# add two fields for the year and month
ls_df['year']=ls_df.index.year
ls_df['month']=ls_df.index.month
ls_df['id'] = id_tmp
return(ls_df)
With an assumed example
StationID,Year,JanValue,FebValue,MarValue,AprValue,DecValue
A,2017,1,2,8,4,5
B,2017,1,2,8,4,5
A,2018,1,2,3,4,5
B,2018,1,2,3,4,5
The code would look like this
df = df.melt(id_vars=['StationID', 'Year'], var_name='Month', value_vars=['JanValue','FebValue','MarValue','AprValue','DecValue'])
after which you can fix the month names with
df['Month'] = df['Month'].str.replace('Value','')
Result
StationID Year Month value
0 A 2017 Jan 1
1 B 2017 Jan 1
2 A 2018 Jan 1
3 B 2018 Jan 1
4 A 2017 Feb 2
5 B 2017 Feb 2
6 A 2018 Feb 2
7 B 2018 Feb 2
8 A 2017 Mar 8
9 B 2017 Mar 8
10 A 2018 Mar 3
11 B 2018 Mar 3
12 A 2017 Apr 4
13 B 2017 Apr 4
14 A 2018 Apr 4
15 B 2018 Apr 4
16 A 2017 Dec 5
17 B 2017 Dec 5
18 A 2018 Dec 5
19 B 2018 Dec 5
So the only thing left is to sort the lines in the way you want
them sorted.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
df['Month'] = pd.Categorical(df['Month'], categories=months, ordered=True)
df.sort_values(['StationID','Year','Month'], inplace=True)
For the result
StationID Year Month value
0 A 2017 Jan 1
4 A 2017 Feb 2
8 A 2017 Mar 8
12 A 2017 Apr 4
16 A 2017 Dec 5
2 A 2018 Jan 1
6 A 2018 Feb 2
10 A 2018 Mar 3
14 A 2018 Apr 4
18 A 2018 Dec 5
1 B 2017 Jan 1
5 B 2017 Feb 2
9 B 2017 Mar 8
13 B 2017 Apr 4
17 B 2017 Dec 5
3 B 2018 Jan 1
7 B 2018 Feb 2
11 B 2018 Mar 3
15 B 2018 Apr 4
19 B 2018 Dec 5
Oh man that seems like a lot of work I wouldn't do.
df = df.melt(id_vars=("StationID", "Year"), var_name="Month", value_name="Value")
Then you can replace the variable names with months using something like:
df["Month"] = df["Month"].str.replace(...)
Pack up the dates however you want into:
df["Date"] = pd.to_datetime(...)
Etc. I'd be more specific but without an example of your actual data this is the best I can do...
I have the following dataframe:
Year Month Booked
0 2016 Aug 55999.0
6 2017 Aug 60862.0
1 2016 Jul 54062.0
7 2017 Jul 58417.0
2 2016 Jun 42044.0
8 2017 Jun 48767.0
3 2016 May 39676.0
9 2017 May 40986.0
4 2016 Oct 39593.0
10 2017 Oct 41439.0
5 2016 Sep 49677.0
11 2017 Sep 53969.0
I want to obtain the percentage change with respect to the same month from last year. I have tried the following code:
df['pct_ch'] = df.groupby(['Month','Year'])['Booked'].pct_change()
but I get the following, which is not at all what I want:
Year Month Booked pct_ch
0 2016 Aug 55999.0 NaN
6 2017 Aug 60862.0 0.086841
1 2016 Jul 54062.0 -0.111728
7 2017 Jul 58417.0 0.080556
2 2016 Jun 42044.0 -0.280278
8 2017 Jun 48767.0 0.159904
3 2016 May 39676.0 -0.186417
9 2017 May 40986.0 0.033017
4 2016 Oct 39593.0 -0.033987
10 2017 Oct 41439.0 0.046624
5 2016 Sep 49677.0 0.198798
11 2017 Sep 53969.0 0.086398
Do not groupby Year otherwise you won't get, for instance, Aug 2017 and Aug 2016 together. Also, use transform to broadcast back results to original indices
Try:
df['pct_ch'] = df.groupby(['Month'])['Booked'].transform(lambda s: s.pct_change())
Year Month Booked pct_ch
0 2016 Aug 55999.0 NaN
6 2017 Aug 60862.0 0.086841
1 2016 Jul 54062.0 NaN
7 2017 Jul 58417.0 0.080556
2 2016 Jun 42044.0 NaN
8 2017 Jun 48767.0 0.159904
3 2016 May 39676.0 NaN
9 2017 May 40986.0 0.033017
4 2016 Oct 39593.0 NaN
10 2017 Oct 41439.0 0.046624
5 2016 Sep 49677.0 NaN
11 2017 Sep 53969.0 0.086398