I am really stuck on how to approach adding columns to pandas dynamically. I've been trying to search for an answer, but I'm afraid I may be using the wrong terminology to describe what I'm attempting to do.
I have a dataframe returned from a query that looks like the following:
department action date
marketing close 09-01-2017
marketing close 07-01-2018
marketing close 06-01-2017
marketing close 10-21-2019
marketing open 08-01-2018
marketing other 07-14-2018
sales open 02-01-2019
sales open 02-01-2017
sales close 02-22-2019
The ultimate goal is a count of each type of action, grouped within particular date ranges.
My desired output is something along the lines of:
department 01/01/2017-12/31/2017 01/01/2018-12/31/2018 01/01/2019-12/31/2019
open close other open close other open close other
marketing 0 2 0 1 1 1 0 1 0
sales 1 0 0 0 0 0 1 1 0
"Department" would be my index, then the contents would be filtered by date ranges specified in a list I provide, followed by the action taken (with counts). Being newer to this, I am confused as to what approach I should take - for example should I use Python (should I be looping or iterating), or should the heavy lifting be done in PANDAS. If in PANDAS, I am having difficulty determining what function to use (I've been looking at get_dummy() etc.).
I'd imagine this would be accomplished with either 1. Some type or FOR loop iterating through, 2. Adding a column to the dataframe based on the list then filtering the data underneath based on the value(s), or 3. using a function I am not aware of in Pandas
I have explained more of my thought process in this question, though it may have gone unanswered because it was unclear:
Building a dataframe with dynamic date ranges using filtered results from another dataframe
There are quite a few concepts you need at once here.
First, you don't yet have the counts. From your desired output I take it you want them yearly, but you can specify any time frame you want. Then just count with groupby() and count():
In [66]: df2 = df.groupby([pd.to_datetime(df.date).dt.year, "action", "department"]).count().squeeze().rename("count")

In [67]: df2
Out[67]:
date action department
2017 close marketing 2
open sales 1
2018 close marketing 1
open marketing 1
other marketing 1
2019 close marketing 1
sales 1
open sales 1
Name: count, dtype: int64
The squeeze() and rename() are there because otherwise both the count column and the year would be called date, and you would get a name conflict later. You could equivalently use rename(columns={'date': 'count'}) and not cast to a Series.
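A sketch of that equivalent, for reference:
# same counts, kept as a one-column DataFrame instead of a Series
df2 = (df.groupby([pd.to_datetime(df.date).dt.year, "action", "department"])
         .count()
         .rename(columns={"date": "count"}))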
The second step is a pivot_table. This creates column names from values. Because there are combinations of date and action without a corresponding value, you need pivot_table.
In [62]: df2.reset_index().pivot_table(index="department", columns=["date", "action"])
Out[62]:
count
date 2017 2018 2019
action close open close open other close open
department
marketing 2.0 NaN 1.0 1.0 1.0 1.0 NaN
sales NaN 1.0 NaN NaN NaN 1.0 1.0
Because NaN is internally represented as floating point, your counts were also converted to floating point. To fix that, just append fillna(0) and convert back to int:
In [65]: df2.reset_index().pivot_table(index="department", columns=["date", "action"]).fillna(0).astype(int)
Out[65]:
count
date 2017 2018 2019
action close open close open other close open
department
marketing 2 0 1 1 1 1 0
sales 0 1 0 0 0 1 1
To get exactly your output you would need to modify pd.to_datetime(df.date).dt.year. You can do this with strftime (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.strftime.html). Furthermore, the column ["2017", "other"] was dropped because there was no value for it. If this creates problems, you need to include the missing combinations beforehand: after the first step, a reindex and a fillna should do the trick.
EDIT: Yes, it does:
In [77]: new_index = pd.MultiIndex.from_product([[2017, 2018, 2019], ["close", "open", "other"], ['marketing', 'sales']], names=['date', 'action', 'department'])
In [78]: df3 = df2.reindex(new_index).fillna(0).astype(int).reset_index()

In [79]: df3
Out[79]:
date action department count
0 2017 close marketing 2
1 2017 close sales 0
2 2017 open marketing 0
3 2017 open sales 1
4 2017 other marketing 0
5 2017 other sales 0
6 2018 close marketing 1
.. ... ... ... ...
11 2018 other sales 0
12 2019 close marketing 1
13 2019 close sales 1
14 2019 open marketing 0
15 2019 open sales 1
16 2019 other marketing 0
17 2019 other sales 0
In [80]: df3.pivot_table(index="department", columns=["date", "action"])
Out[80]:
count
date 2017 2018 2019
action close open other close open other close open other
department
marketing 2 0 0 1 1 1 1 0 0
sales 0 1 0 0 0 0 1 1 0
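To get the exact column labels from the desired output, a sketch of the strftime idea mentioned above (the label format is an assumption read off the desired output):
# calendar-year bucket labels like "01/01/2017-12/31/2017"
labels = pd.to_datetime(df.date).dt.strftime("01/01/%Y-12/31/%Y")
# use `labels` in place of pd.to_datetime(df.date).dt.year in the first groupby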
Related
Month Year Open High Low Close/Price Volume
6 2019 86.78 87.11 86.06 86.55 1507828
6 2019 86.63 87.23 84.81 85.06 2481284
6 2019 85.38 85.81 84.75 85.33 2034693
6 2019 85.65 86.86 85.13 86.43 1394847
6 2019 86.66 87.74 86.66 87.55 3025379
7 2019 88.84 89.72 87.77 88.45 4017249
7 2019 89.21 90 87.95 88.87 2237183
7 2019 89.14 91.08 89.14 90.67 1647124
7 2019 90.39 90.95 89.07 90.59 3227673
I want to get the monthly average of: Open High Low Close/Price
How do I set two values (Month, Year) as parameters for getting a value that is in another column?
df = pd.read_excel('DatosUnited.xlsx')
month = df.groupby('Month')
year = df.groupby('Year')
june2019 = month.get_group("6")
year2019 = year.get_group('2019')
I tried something like this, but I don't know how to use both as a filter simultaneously.
You can use .groupby() with multiple columns, and then you can use .mean() to get the desired averages:
df.groupby(["Month", "Year"]).mean()
This outputs:
Open High Low Close/Price Volume
Month Year
6 2019 86.220 86.9500 85.4820 86.184 2088806.20
7 2019 89.395 90.4375 88.4825 89.645 2782307.25
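If you also need to pull out a single (Month, Year) combination - what the get_group attempts were after - a sketch:
monthly = df.groupby(["Month", "Year"]).mean()
monthly.loc[(6, 2019)]  # averages for June 2019
# or fetch the raw rows for that combination in one step
june_2019 = df.groupby(["Month", "Year"]).get_group((6, 2019))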
df: (DataFrame)
Open High Close Volume
2020/1/1 1 2 3 323232
2020/1/2 2 3 4 321321
....
2020/12/31 4 5 6 123213
....
2021
The result I need is (Graph No. 1):
Open High Close Volume Year_Sum_Volume
2020/1/1 1 2 3 323232 (323232 + 321321 +....+ 123213)
2020/1/2 2 3 4 321321 (323232 + 321321 +....+ 123213)
....
2020/12/31 4 5 6 123213 (323232 + 321321 +....+ 123213)
....
2021 (x+x+x.....x)
I want the sum of Volume for each year (Year_Sum_Volume is the total volume of that year).
This is the code I tried for calculating the sum of volume in each year, but how can I add this data back to the daily data, i.e. add Year_Sum_Volume to df as in Graph No. 1?
df.resample('Y', on='Date')['Volume'].sum()
Thank you for answering.
I believe groupby.sum() and merge should be your friends:
import pandas as pd

df = pd.DataFrame({"date": ['2021-12-30', '2021-12-31', '2022-01-01'], "a": [1, 2.1, 3.2]})
df.date = pd.to_datetime(df.date)
df["year"] = df.date.dt.year

# per-year totals; select the numeric column so the datetime column isn't summed
df_sums = df.groupby("year")[["a"]].sum().rename(columns={"a": "a_sum"})

# attach each year's total back onto every daily row
df = df.merge(df_sums, right_index=True, left_on="year")
which gives:
        date    a  year  a_sum
0 2021-12-30  1.0  2021    3.1
1 2021-12-31  2.1  2021    3.1
2 2022-01-01  3.2  2022    3.2
Based on your output, Year_Sum_Volume is the same value for every row and can be calculated using df['Volume'].sum().
Then you join a column built by repeating that value:
df.join(pd.DataFrame({'Year_Sum_Volume': [your_sum_val] * len(df['Volume'])}))
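Under that same single-total reading, an equivalent one-liner (pandas broadcasts the scalar):
df['Year_Sum_Volume'] = df['Volume'].sum()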
Try the code below (after converting the date column with pd.to_datetime):
df.assign(Year_Sum_Volume = df.groupby(df['date'].dt.year)['a'].transform('sum'))
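Applied to the question's actual columns, a sketch (assuming the date column is named 'Date' and is already datetime):
df['Year_Sum_Volume'] = df.groupby(df['Date'].dt.year)['Volume'].transform('sum')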
I am new to pandas and trying to figure out how to calculate the percentage change (difference) between two years, given that sometimes there is no previous year.
I am given a dataframe as follows:
company date amount
1 Company 1 2020 3
2 Company 1 2021 1
3 COMPANY2 2020 7
4 Company 3 2020 4
5 Company 3 2021 4
.. ... ... ...
766 Company N 2021 9
765 Company N 2020 1
767 Company XYZ 2021 3
768 Company X 2021 3
769 Company Z 2020 2
I wrote something like this:
for company in unique(df2.company):
    company_df = df2[df2.company == company]
    company_df.sort_values(by="date")
    company_df_year = company_df.amount.tolist()
    company_df_year.pop()
    company_df_year.insert(0, 0)
    company_df["value_year_before"] = company_df_year
    if any in company_df.value_year_before == None:
        company_df["diff"] = 0
    else:
        company_df["diff"] = (company_df.amount - company_df.value_year_before) / company_df.value_year_before
    df2["ratio"] = company_df["diff"]
But I keep getting NaN.
Where did I make a mistake?
The main issue is that you are overwriting company_df in each iteration of the loop and only keeping the last one.
However, with pandas, reaching for a for loop is usually a sign there is an easier way to accomplish the goal. Here you can use groupby and pct_change to compute the ratio within each group.
df = df.sort_values(['company', 'date'])
df['ratio'] = df.groupby('company')['amount'].pct_change()
df['ratio'] = df['ratio'].fillna(0.0)
groupby keeps the order of the rows within each group, so we sort beforehand to ensure the dates are in the correct order, and fillna replaces the NaNs (a company's first year has no previous value) with 0.
Result:
company date amount ratio
3 COMPANY2 2020 7 0.000000
1 Company 1 2020 3 0.000000
2 Company 1 2021 1 -0.666667
4 Company 3 2020 4 0.000000
5 Company 3 2021 4 0.000000
765 Company N 2020 1 0.000000
766 Company N 2021 9 8.000000
768 Company X 2021 3 0.000000
767 Company XYZ 2021 3 0.000000
769 Company Z 2020 2 0.000000
Apply an anonymous function that calculates the percentage change and returns it if there is more than one value. Use:
df = pd.DataFrame({'company': [1,1,3], 'date':[2020,2021,2020], 'amount': [4,5,7]})
df.groupby('company')['amount'].apply(lambda x: (list(x)[1]-list(x)[0])/list(x)[0] if len(x)>1 else 'not enough values')
Output:
company
1                 0.25
3    not enough values
Name: amount, dtype: object
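A slightly more idiomatic variant of the same lambda, using positional access instead of converting to a list (a sketch; with exactly two rows per company it is equivalent):
df.groupby('company')['amount'].apply(
    lambda x: (x.iloc[-1] - x.iloc[0]) / x.iloc[0] if len(x) > 1 else 'not enough values')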
I have a df:
company year revenues
0 company 1 2019 1,425,000,000
1 company 1 2018 1,576,000,000
2 company 1 2017 1,615,000,000
3 company 1 2016 1,498,000,000
4 company 1 2015 1,569,000,000
5 company 2 2019 nan
6 company 2 2018 1,061,757,075
7 company 2 2017 nan
8 company 2 2016 573,414,893
9 company 2 2015 599,402,347
I would like to fill the nan values in a specific order: linearly interpolate first, then forward fill, and then backward fill. I currently have:
f_2_impute = [x for x in cl_data.columns if cl_data[x].dtypes != 'O' and 'total' not in x and 'year' not in x]
def ffbf(x):
    return x.ffill().bfill()
group_with = ['company']
for x in cl_data[f_2_impute]:
    cl_data[x] = cl_data.groupby(group_with)[x].apply(lambda fill_it: ffbf(fill_it))
which performs ffill() and bfill(). Ideally I want a function that first tries to linearly interpolate the missing values, then forward fills them, and then backward fills them.
Any quick ways of achieving this? Thanking you in advance.
I believe you first need to convert the column to floats if it contains thousands separators, either when reading:
df = pd.read_csv(file, thousands=',')
Or:
df['revenues'] = df['revenues'].replace(',','', regex=True).astype(float)
and then add DataFrame.interpolate:
def ffbf(x):
    return x.interpolate().ffill().bfill()
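With interpolate in place, transform (rather than apply) keeps the original index, so the result aligns back to the frame without extra work - a sketch reusing the question's cl_data, f_2_impute and group_with names:
for x in f_2_impute:
    cl_data[x] = cl_data.groupby(group_with)[x].transform(lambda s: s.interpolate().ffill().bfill())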
I have a dataframe which looks like this
In []: df.head()
Out[]:
DATE NAME AMOUNT CURRENCY
2018-07-27 John 100 USD
2018-06-25 Jane 150 GBP
...
The contents under the DATE column are of date type.
I want to aggregate the data so I can see, for each day of the month, the count of transactions that happened on that day.
I also want to group it by year as well as day.
The end result would look something like this:
YEAR DAY COUNT
2018 1 0
2 1
3 0
4 0
5 3
6 4
and so on
I used the following code, but the numbers are all wrong. Please help.
In []: df = pd.DataFrame({'DATE':pd.date_range(start=dt.datetime(2018,7,27),end=dt.datetime(2020,7,21))})
df.groupby([df['DATE'].dt.year, df['DATE'].dt.day]).agg({'count'})
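For reference, a sketch of one way to get the desired YEAR/DAY/COUNT shape (assuming DATE is already a datetime column; size() counts the rows in each group, and the reindex fills in the days with no transactions):
counts = df.groupby([df['DATE'].dt.year.rename('YEAR'),
                     df['DATE'].dt.day.rename('DAY')]).size()
# build the full (year, day-of-month) grid so days with 0 transactions appear too
full = pd.MultiIndex.from_product([counts.index.levels[0], range(1, 32)],
                                  names=['YEAR', 'DAY'])
counts = counts.reindex(full, fill_value=0).rename('COUNT').reset_index()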