I have a dataframe called data that looks like this:
org_id  commit_date  commit_amt
123     2020-06-01   50000
123     2020-06-01   50000
123     2021-06-01   60000
234     2019-07-01   30000
234     2020-07-01   40000
234     2021-07-01   50000
I want the dataframe to look like this:
org_id  date_1      date_2      date_3      amt_1  amt_2  amt_3
123     2020-06-01  2021-06-01  2022-06-01  50000  50000  60000
234     2019-07-01  2020-07-01  2021-07-01  30000  40000  50000
I've gotten the date columns and org_id column by:
dates = data.groupby('org_id').apply(lambda x: x['commit_date'].unique())  # all unique commit_date values for each org_id
dates = dates.apply(pd.Series)  # put each unique commit_date into its own column; NaN if the org_id doesn't have enough commit_dates
c_dates = pd.DataFrame()  # create empty dataframe
c_dates['org_id'] = dates.index  # I had to specify each col bc the dates df was too hard to work with
c_dates['date_1'] = dates[0].values.tolist()
c_dates['date_2'] = dates[1].values.tolist()
c_dates['date_3'] = dates[2].values.tolist()
I cannot figure out how to get the amt_1, amt_2, and amt_3 columns. I can't just repeat the date-column code, bc unique() would drop the repeated 50000 for org_id 123. And bc the c_dates dataframe does not have the same length as the original data dataframe, I can't just compare c_dates to data.
EXCITING UPDATE!
I haven't totally solved my problem yet, but I have made a bit of progress:
dates = data.groupby(['org_id','commit_amt']).apply(lambda x: x['commit_date'].unique())  # all unique commit_date values for each (org_id, commit_amt) pair
dates = dates.apply(pd.Series)  # put each unique commit_date into its own column; NaN if the pair doesn't have enough commit_dates
gives me the data I want, however, it is not formatted how I want. It gives results that look like:
org_id  commit_amt
123     50000       2020-06-01  2021-06-01
123     60000       2022-06-01
234     30000       2019-07-01
234     40000       2020-07-01
234     50000       2021-07-01
I would appreciate any help in getting me to the format I want. I ultimately want to be able to take the difference between amt_1 and amt_2, etc.
Hope this makes sense.
P.S. Thanks to the hero who edited this thereby teaching me how to make tables!
EXCITINGER NEWS!! I HAVE SOLVED MY PROBLEM!!!
Long story short, the function I needed was unstack. I am tired now but tomorrow, I will edit this with the solution! w00t!
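The exact solution is not shown here. As a minimal sketch of how unstack could produce the requested layout: number each org's commitments with cumcount, then unstack on that counter. The sample rows are an assumption taken from the desired output above (three distinct dates for org 123), and the names n, wide and amt_change_1_2 are made up for illustration.

```python
import pandas as pd

# Sample rows assumed from the desired output table (not the asker's real data).
data = pd.DataFrame({
    "org_id": [123, 123, 123, 234, 234, 234],
    "commit_date": ["2020-06-01", "2021-06-01", "2022-06-01",
                    "2019-07-01", "2020-07-01", "2021-07-01"],
    "commit_amt": [50000, 50000, 60000, 30000, 40000, 50000],
})

# Number the commitments within each org: 1, 2, 3, ...
data = data.sort_values(["org_id", "commit_date"])
data["n"] = data.groupby("org_id").cumcount() + 1

# Move the counter from the row index into the columns.
wide = data.set_index(["org_id", "n"])[["commit_date", "commit_amt"]].unstack("n")

# Flatten the (column, n) MultiIndex into date_1, ..., amt_3.
prefix = {"commit_date": "date", "commit_amt": "amt"}
wide.columns = [f"{prefix[col]}_{n}" for col, n in wide.columns]
wide = wide.reset_index()

# The differences between consecutive amounts are then plain column arithmetic.
wide["amt_change_1_2"] = wide["amt_2"] - wide["amt_1"]
print(wide)
```

Because the counter does not depend on the values being unique, the repeated 50000 for org 123 is kept. Orgs with fewer commitments than the maximum would get NaN in the trailing columns.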
I think you can use pandas.pivot() to reshape your data. The catch is that pivot() raises an error if any index/column combination is duplicated.
So first drop the duplicated rows, then use pivot:
data = data.drop_duplicates()
data.pivot(index='org_id', columns=['commit_amt'], values=['commit_date'])
Related
I have a dataframe like as shown below
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10
1 1234 1231 1256 1239
2 5678 3425 3255 2345
I would like to do the below
a) get average of revenue for each customer based on latest two columns (revenue_m9 and revenue_m10)
b) get average of revenue for each customer based on latest four columns (revenue_m7, revenue_m8, revenue_m9 and revenue_m10)
So, I tried the below
df['revenue_mean_2m'] = (df['revenue_m10']+df['revenue_m9'])/2
df['revenue_mean_4m'] = (df['revenue_m10']+df['revenue_m9']+df['revenue_m8']+df['revenue_m7'])/4
df['revenue_mean_4m'] = df.mean(axis=1) # i also tried this but how to do for only two columns (and not all columns)
But if I want the average for the past 12 months, writing it out this way is not elegant. Is there a better or more efficient way to write this, where I just key in the number of columns to look back and the average is computed from that input?
I expect my output to be like as below
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 revenue_mean_2m revenue_mean_4m
1 1234 1231 1256 1239 1247.5 1240
2 5678 3425 3255 2345 2800 3675.75
Use filter and slicing:
# keep only the "revenue_" columns
df2 = df.filter(like='revenue_')
# or
# df2 = df.filter(regex=r'revenue_m\d+')
# get last 2/4 columns and aggregate as mean
df['revenue_mean_2m'] = df2.iloc[:, -2:].mean(axis=1)
df['revenue_mean_4m'] = df2.iloc[:, -4:].mean(axis=1)
Output:
customer_id revenue_m7 revenue_m8 revenue_m9 revenue_m10 \
0 1 1234 1231 1256 1239
1 2 5678 3425 3255 2345
revenue_mean_2m revenue_mean_4m
0 1247.5 1240.00
1 2800.0 3675.75
If column order is not guaranteed
Sort the columns with natural sorting:
# shuffle the DataFrame columns for demo
df = df.sample(frac=1, axis=1)
# filter and reorder the needed columns
from natsort import natsort_key
df2 = df.filter(regex=r'revenue_m\d+').sort_index(key=natsort_key, axis=1)
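You could try something like this, in reference to this post:
n_months = 4 # you could also do this in a loop over range(1, 13) for every window up to 12 months
revenue = df.filter(regex=r'^revenue_m\d+$') # monthly columns only, so earlier mean columns are not averaged in
df[f'revenue_mean_{n_months}m'] = revenue.iloc[:, -n_months:].mean(axis=1)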
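To make the "key in the number of months" idea concrete, here is a minimal sketch that wraps the filter-and-slice approach in a helper. The function name add_trailing_means and its parameters are hypothetical; it assumes the monthly columns are named revenue_m<number> and that a higher number means a later month.

```python
import pandas as pd

def add_trailing_means(df, windows, pattern=r"^revenue_m(\d+)$"):
    """Return a copy of df with one revenue_mean_<n>m column per window size n."""
    monthly = df.filter(regex=pattern)
    # Order the monthly columns by their month number so "last n" is well defined
    # even if the columns arrive shuffled.
    ordered = sorted(monthly.columns, key=lambda c: int(c.rsplit("m", 1)[1]))
    monthly = monthly[ordered]
    out = df.copy()
    for n in windows:  # each n is expected to be >= 1
        out[f"revenue_mean_{n}m"] = monthly.iloc[:, -n:].mean(axis=1)
    return out

df = pd.DataFrame({
    "customer_id": [1, 2],
    "revenue_m7": [1234, 5678],
    "revenue_m8": [1231, 3425],
    "revenue_m9": [1256, 3255],
    "revenue_m10": [1239, 2345],
})
result = add_trailing_means(df, windows=[2, 4])
print(result[["customer_id", "revenue_mean_2m", "revenue_mean_4m"]])
```

The means are always taken from the filtered monthly frame, not from the frame being extended, so columns added in earlier iterations never leak into later averages.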
I have a pandas dataframe with 2 columns ("Date" and "Gross Margin"). I want to delete rows based on the value in the "Date" column. This is my dataframe:
Date Gross Margin
0 2021-03-31 44.79%
1 2020-12-31 44.53%
2 2020-09-30 44.47%
3 2020-06-30 44.36%
4 2020-03-31 43.69%
.. ... ...
57 2006-12-31 49.65%
58 2006-09-30 52.56%
59 2006-06-30 49.86%
60 2006-03-31 46.20%
61 2005-12-31 40.88%
I want to delete every row where the "Date" value doesn't end with "12-31". I read some similar posts on this and the pandas.drop() function seemed to be the solution, but I haven't figured out how to use it for this specific case.
Please leave any suggestions as to what I should do.
you can try the following code, where you match the day and month.
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
df = df[df['Date'].dt.strftime('%m-%d') == '12-31']
Assuming the dates are strings formatted as year-month-day, keep only the rows that end with 12-31:
df = df[df['Date'].str.endswith('12-31')]
If the dates are using a consistent format, you can do it like this:
df = df[df['Date'].str.contains("12-31", regex=False)]
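As a small sketch of the same filter using datetime components instead of string matching, the following keeps only the year-end rows. The four sample rows are copied from the question; the variable names dates and year_end are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2021-03-31", "2020-12-31", "2020-09-30", "2005-12-31"],
    "Gross Margin": ["44.79%", "44.53%", "44.47%", "40.88%"],
})

# Parse once, then compare month and day numerically.
dates = pd.to_datetime(df["Date"], format="%Y-%m-%d")
year_end = df[(dates.dt.month == 12) & (dates.dt.day == 31)]
print(year_end)
```

Boolean indexing like this selects the rows to keep, which is usually simpler than collecting labels to pass to drop().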
I have a list of transactions for a business.
Example dataframe:
userid date amt start_of_day_balance
123 2017-01-04 10 100.0
123 2017-01-05 20 NaN
123 2017-01-02 30 NaN
123 2017-01-04 40 100.0
The start of day balance is not always retrieved (in that case we receive a NaN). But from the moment that we know the start of day balance for any day, we can accurately estimate the balance after each transaction afterwards.
In this example the new column should look as follows:
userid date amt start_of_day_balance calculated_balance
123 2017-01-04 10 100.0 110
123 2017-01-05 20 NaN 170
123 2017-01-02 30 NaN NaN
123 2017-01-04 40 100.0 150
Note that there is no way to tell the exact order of the transactions that occurred on the same day - I'm happy to overlook that in this case.
My question is how to create this new column. Something like:
df['calculated_balance'] = df.sort_values(['date']).groupby(['userid'])\
['amt'].cumsum() + df['start_of_day_balance'].min()
wouldn't work because of the NaNs.
I also don't want to filter out any transactions that happened before the first recorded start of day balance.
I came up with a solution that seems to work. I'm not sure how elegant it is.
def calc_estimated_balance(g):
    # find the first date which has a start of day balance
    first_date_with_bal = g.loc[g['start_of_day_balance'].first_valid_index(), 'date']
    # only calculate the balance if date is greater than or equal to the date of the first balance
    g['calculated_balance'] = g[g['date'] >= first_date_with_bal]['amt'].cumsum().add(g['start_of_day_balance'].min())
    return g

df = df.sort_values(['date']).groupby(['userid']).apply(calc_estimated_balance)
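For comparison, here is a minimal sketch of the same idea run on the example data. It differs from the code above in two assumed ways: it uses the balance recorded on the first date that has one (rather than the minimum over all balances), and it returns all NaN for a user with no recorded balance instead of failing. The helper name estimate_balance is hypothetical, and the sketch assumes the dataframe has a unique index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "userid": [123, 123, 123, 123],
    "date": pd.to_datetime(["2017-01-04", "2017-01-05", "2017-01-02", "2017-01-04"]),
    "amt": [10, 20, 30, 40],
    "start_of_day_balance": [100.0, np.nan, np.nan, 100.0],
})

def estimate_balance(g):
    """Running balance for one user, NaN before the first known start-of-day balance."""
    g = g.sort_values("date", kind="stable")
    known = g["start_of_day_balance"].notna()
    if not known.any():
        return pd.Series(np.nan, index=g.index)
    first_date = g.loc[known, "date"].iloc[0]
    opening = g.loc[known, "start_of_day_balance"].iloc[0]
    on_or_after = g["date"] >= first_date
    # cumsum only over transactions from the first known date on; earlier rows become NaN
    return opening + g.loc[on_or_after, "amt"].cumsum().reindex(g.index)

# Compute per user, then let pandas align the result on the original index.
parts = [estimate_balance(g) for _, g in df.groupby("userid")]
df["calculated_balance"] = pd.concat(parts)
print(df)
```

Transactions on the same day are accumulated in their original row order, which matches the question's note that the exact intra-day order is unknown.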
I have a pandas dataframe (originally generated from a sql query) that looks like:
index AccountId ItemID EntryDate
1 100 1000 1/1/2016
2 100 1000 1/2/2016
3 100 1000 1/3/2016
4 101 1234 9/15/2016
5 101 1234 9/16/2016
etc....
I'd like to get this whittled down to a unique list, returning only the entry with the earliest date available, something like this:
index AccountId ItemID EntryDate
1 100 1000 1/1/2016
4 101 1234 9/15/2016
etc....
Any pointers or direction for a fairly new pandas dev? The unique function doesn't appear to be able to handle these types of rules, and looping through the array and working out which one to drop seems like a lot of trouble for a simple task... Is there a function that I'm missing that does this?
Let's use groupby, idxmin, and .loc:
df_out = df2.loc[df2.groupby('AccountId')['EntryDate'].idxmin()]
print(df_out)
Output:
AccountId ItemID EntryDate
index
1 100 1000 2016-01-01
4 101 1234 2016-09-15
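A self-contained sketch of the approach above, assuming EntryDate arrives as month/day/year strings and has to be parsed before idxmin can compare dates. The frame is rebuilt from the question's sample rows; the names earliest and alt are made up, and the second variant (sort, then drop_duplicates) is a swapped-in alternative rather than part of the answer.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "AccountId": [100, 100, 100, 101, 101],
        "ItemID": [1000, 1000, 1000, 1234, 1234],
        "EntryDate": ["1/1/2016", "1/2/2016", "1/3/2016", "9/15/2016", "9/16/2016"],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="index"),
)

# Parse the strings so comparisons are chronological, not alphabetical.
df["EntryDate"] = pd.to_datetime(df["EntryDate"], format="%m/%d/%Y")

# Variant 1: index label of the earliest date per account, then select those rows.
earliest = df.loc[df.groupby("AccountId")["EntryDate"].idxmin()]

# Variant 2: sort by date and keep the first row seen for each account.
alt = df.sort_values("EntryDate").drop_duplicates("AccountId", keep="first")

print(earliest)
```

Both variants return whole rows, so ItemID and any other columns come from the same entry as the earliest date.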
So, I have a DataFrame with a multiindex which looks like this:
info1 info2 info3
abc-8182 2012-05-08 10:00:00 1 6.0 "yeah!"
2012-05-08 10:01:00 2 25.0 ":("
pli-9230 2012-05-08 11:00:00 1 30.0 "see yah!"
2012-05-08 11:15:00 1 30.0 "see yah!"
...
The index is an id and a datetime representing when that info about that id was recorded. What we needed to do was to find, for each id, the earliest record. We tried a lot of options from the dataframe methods but we ended up doing it by looping through the DataFrame:
import pandas

df = pandas.read_csv(...)
empty = pandas.DataFrame()
ids = df.index.get_level_values(0)
for id in ids:
    minDate = df.xs(id).index.min()
    row = df.xs(id).xs(minDate)
    mindf = pandas.DataFrame(row).transpose()
    mindf.index = pandas.MultiIndex.from_tuples([(id, minDate)])
    empty = pandas.concat([empty, mindf])  # DataFrame.append was removed in pandas 2.0
print(empty.groupby(lambda x: x).first())
Which gives me:
x0 x1 x2
('abc-8182', <Timestamp: 2012-05-08 10:00:00>) 1 6 yeah!
('pli-9230', <Timestamp: 2012-05-08 11:00:00>) 1 30 see yah!
I feel that there must be a simple, "pandas idiomatic", very immediate way to do this without looping though the data frame like this. Is there? :)
Thanks.
To get the first item in each group, you can do:
df.reset_index(level=1).groupby(level=0).first()
which moves the datetime level out of the index into a column before grouping, so it is still present in the result.
If you need to make sure the earliest time is kept, sort by that column before you call first (this assumes the index level is named "datetime"):
df.reset_index(level=1).sort_values("datetime").groupby(level=0).first()
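A runnable sketch of that recipe on data shaped like the question's. The level names "id" and "datetime" are assumptions (the question does not show them), and the rows are deliberately stored out of time order to show why the sort matters.

```python
import pandas as pd

# Index levels are named here by assumption.
idx = pd.MultiIndex.from_tuples(
    [
        ("abc-8182", pd.Timestamp("2012-05-08 10:01:00")),
        ("abc-8182", pd.Timestamp("2012-05-08 10:00:00")),
        ("pli-9230", pd.Timestamp("2012-05-08 11:15:00")),
        ("pli-9230", pd.Timestamp("2012-05-08 11:00:00")),
    ],
    names=["id", "datetime"],
)
df = pd.DataFrame(
    {
        "info1": [2, 1, 1, 1],
        "info2": [25.0, 6.0, 30.0, 30.0],
        "info3": [":(", "yeah!", "see yah!", "see yah!"],
    },
    index=idx,
)

earliest = (
    df.reset_index(level="datetime")  # turn the datetime level into a column
      .sort_values("datetime")        # earliest record first
      .groupby(level=0)               # group on the remaining id level
      .first()
)
print(earliest)
```

Note that first() takes the first non-null value in each column, so if the earliest row can contain missing values, head(1) on the sorted groups is the safer way to get that exact row.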