I have five datasets that I have added a 'Year' column to, like this:
newyork2014['Year'] = 2014
newyork2015['Year'] = 2015
newyork2016['Year'] = 2016
newyork2017['Year'] = 2017
newyork2018['Year'] = 2018
However, I'm wondering if there's a more Pythonic way of doing this, perhaps with a function? I don't want to change the actual dataframe into a string though, but I want to "stringify" the name of the dataframe. Here's what I was thinking:
def get_year(df):
    df['Year'] = ...  # the last four digits of the dataframe's name
    return df
You may need to adjust things a little when you create the dataframe: you need to assign it a name first.
newyork2014.name = 'newyork2014'

def get_year(df):
    df['Year'] = df.name[-4:]  # note: this is the string '2014'; wrap it in int(...) if you need a number
    return df
get_year(newyork2014)
Out[42]:
            ID  Col1  Col2  New  Year
2018-06-01   A    10   100  0.5  2014
2018-06-02   B     5    25  2.1  2014
2018-06-03   A    25    25  0.6  2014
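One caveat: .name here is just an ad-hoc attribute, and many pandas operations silently drop it. A common alternative, sketched below assuming you can gather the five frames from the question into a dict keyed by year, avoids relying on variable names altogether:
import pandas as pd

# hypothetical dict; assumes the five frames from the question already exist
frames = {2014: newyork2014, 2015: newyork2015, 2016: newyork2016,
          2017: newyork2017, 2018: newyork2018}

for year, df in frames.items():
    df['Year'] = year  # assign the integer year directly, no string parsing

# optional: combine everything into a single frame
combined = pd.concat(frames.values(), ignore_index=True)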
I have two pandas dataframes with same columns say name, jan, feb, march, april. I want to compare the two dataframes and find out the name, month combination for which I have value in my first dataframe but not in my second dataframe.
df1:
Name   jan   feb  March
ABC    125   225    NaN
DEF    NaN    30    214
df2:
Name   jan   feb  March
ABC    125   NaN    NaN
XYZ    254   130    NaN
Expected output:
Name  Month
ABC   feb
DEF   feb
DEF   March
I tried to merge the two dataframes, but it is not giving me the expected result. I'm not sure how to proceed with this.
Here is a possible approach:
# only if 'Name' is not already the index
# df1 = df1.set_index('Name')
# df2 = df2.set_index('Name')

# True where df1 has a value and df2's aligned value is missing or different
s = ((df1.notna() & df1.ne(df2.reindex_like(df1)))
     .rename_axis('Month', axis=1)
     .stack())

# keep only the True entries, then drop the boolean column
s[s].reset_index().drop(0, axis=1)
output:
Name Month
0 ABC feb
1 DEF feb
2 DEF March
Another option: set_index to "Name", stack, and reset_index both DataFrames, then outer merge on name and month. Finally, filter the merged DataFrame by the condition:
out = (df1.set_index('Name').stack().reset_index()
          .merge(df2.set_index('Name').stack().reset_index(),
                 on=['Name', 'level_1'], how='outer'))

# keep rows with a value in df1 ('0_x') but no value in df2 ('0_y')
out = (out.loc[out['0_x'].notna() & out['0_y'].isna(), ['Name', 'level_1']]
          .rename(columns={'level_1': 'Month'}))
Output:
Name Month
1 ABC feb
2 DEF feb
3 DEF March
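A related idiom (my own variation, not part of either answer above) is a left merge with indicator=True, which labels rows that exist only on the left side:
left = (df1.set_index('Name').stack().reset_index(name='val')
           .rename(columns={'level_1': 'Month'}))
right = (df2.set_index('Name').stack().reset_index(name='val')
            .rename(columns={'level_1': 'Month'}))

# stack() drops NaNs, so combinations present only in df1 come out as 'left_only'
out = left.merge(right, on=['Name', 'Month'], how='left', indicator=True)
result = out.loc[out['_merge'] == 'left_only', ['Name', 'Month']]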
I have a df:
company year revenues
0 company 1 2019 1,425,000,000
1 company 1 2018 1,576,000,000
2 company 1 2017 1,615,000,000
3 company 1 2016 1,498,000,000
4 company 1 2015 1,569,000,000
5 company 2 2019 nan
6 company 2 2018 1,061,757,075
7 company 2 2017 nan
8 company 2 2016 573,414,893
9 company 2 2015 599,402,347
I would like to fill the nan values in a particular order: linearly interpolate first, then forward fill, then backward fill. I currently have:
f_2_impute = [x for x in cl_data.columns
              if cl_data[x].dtypes != 'O' and 'total' not in x and 'year' not in x]

def ffbf(x):
    return x.ffill().bfill()

group_with = ['company']
for x in cl_data[f_2_impute]:
    cl_data[x] = cl_data.groupby(group_with)[x].apply(lambda fill_it: ffbf(fill_it))
which performs ffill() and bfill(). Ideally I want a function that first tries to linearly interpolate the missing values, then forward fills them, and then backward fills them.
Any quick ways of achieving it? Thank you in advance.
I believe you first need to convert the revenues column to floats if there are thousands separators, either while reading:
df = pd.read_csv(file, thousands=',')
Or:
df['revenues'] = df['revenues'].replace(',','', regex=True).astype(float)
and then add Series.interpolate to the chain:
def ffbf(x):
    return x.interpolate().ffill().bfill()
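Putting it together with the grouping loop from the question (a sketch, assuming cl_data, f_2_impute and group_with are defined as above; transform replaces apply here, which is equivalent for a length-preserving fill like this):
for col in f_2_impute:
    cl_data[col] = cl_data.groupby(group_with)[col].transform(ffbf)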
I have a data set like this:
YEAR MONTH VALUE
2018 3 59.507
2018 3 26.03
2018 5 6.489
2018 2 -3.181
I am trying to perform a calculation like
((VALUE1 + 1) * (VALUE2 + 1) * (VALUE3 + 1) * ... * (VALUEn + 1)) - 1
over the VALUE column. What's the best way to accomplish this?
Use:
df['VALUE'].add(1).prod() - 1
# -26714.522733572892
If you want a cumulative product to create a new column, use Series.cumprod:
df['new_column'] = df['VALUE'].add(1).cumprod().sub(1)
print(df)
YEAR MONTH VALUE new_column
0 2018 3 59.507 59.507000
1 2018 3 26.030 1634.504210
2 2018 5 6.489 12247.291029
3 2018 2 -3.181 -26714.522734
I think you're after...
cum_prod = (1 + df['VALUE']).cumprod() - 1
First you should understand the objects you're dealing with and what attributes and methods they have. This is a DataFrame, and the VALUE column is a Series.
Here is the documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html
I need to get the month-end balance from a series of entries.
Sample data:
date contrib totalShrs
0 2009-04-23 5220.00 10000.000
1 2009-04-24 10210.00 20000.000
2 2009-04-27 16710.00 30000.000
3 2009-04-30 22610.00 40000.000
4 2009-05-05 28909.00 50000.000
5 2009-05-20 38409.00 60000.000
6 2009-05-28 46508.00 70000.000
7 2009-05-29 56308.00 80000.000
8 2009-06-01 66108.00 90000.000
9 2009-06-02 78108.00 100000.000
10 2009-06-12 86606.00 110000.000
11 2009-08-03 95606.00 120000.000
The output would look something like this:
2009-04-30 40000
2009-05-31 80000
2009-06-30 110000
2009-07-31 110000
2009-08-31 120000
Is there a simple pandas method?
I don't see how I can do this with something like a groupby.
Or would I have to use something like iterrows: find all the monthly entries, order them by date, and pick the last one?
Thanks.
Use pd.Grouper with GroupBy.last, forward fill missing values with ffill, and finish with Series.reset_index. The ffill matters because July has no entries, so last() yields NaN for 2009-07-31, which is then filled from June:
# if necessary
# df['date'] = pd.to_datetime(df['date'])
df = df.groupby(pd.Grouper(freq='M', key='date'))['totalShrs'].last().ffill().reset_index()
# alternative
# df = df.resample('M', on='date')['totalShrs'].last().ffill().reset_index()
print(df)
date totalShrs
0 2009-04-30 40000.0
1 2009-05-31 80000.0
2 2009-06-30 110000.0
3 2009-07-31 110000.0
4 2009-08-31 120000.0
The following gives you the information you want, i.e. end-of-month values, though the format is not exactly what you asked for:
df['month'] = df['date'].str.split('-', expand=True)[1]  # split date column to get month column
newdf = pd.DataFrame(columns=df.columns)  # create a new dataframe for output
grouped = df.groupby('month')  # get grouped values

for g in grouped:  # for each group, take the last row
    gdf = pd.DataFrame(data=g[1])
    newdf.loc[len(newdf), :] = gdf.iloc[-1, :]  # fill new dataframe with the last row obtained

newdf = newdf.drop('date', axis=1)  # drop date column, since month column is there
print(newdf)
Output:
contrib totalShrs month
0 22610 40000 04
1 56308 80000 05
2 86606 110000 06
3 95606 120000 08
I am trying to group hospital staff working hours bi-monthly. I have raw data on a daily basis which looks like below.
date        hours_spent  emp_id
9/11/2016   8            1
15/11/2016  8            1
22/11/2016  8            2
23/11/2016  8            1
How I want to group it is:
cycle                  hours_spent  emp_id
1/11/2016-15/11/2016   16           1
16/11/2016-30/11/2016  8            2
16/11/2016-30/11/2016  8            1
I am trying to do this with a Grouper and a frequency in pandas, something like below:
data.set_index('date', inplace=True)
print(data.head())
dt = (data.groupby(['emp_id', pd.Grouper(key='date', freq='MS')])['hours_spent']
          .sum().reset_index().sort_values('date'))
# df.resample('10d').mean().interpolate(method='linear',axis=0)
print(dt.resample('SMS').sum())
I also tried resampling
df1 = dt.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
data.set_index('date',inplace=True)
df1 = data.resample('MS', loffset=pd.Timedelta(15, 'd')).sum()
But this is giving data of 15 days interval not like 1 to 15 and 15 to 31.
Please let me know what I am doing wrong here.
You were almost there. This will do it:
dt = (df.groupby(['emp_id', pd.Grouper(key='date', freq='SM')])['hours_spent']
        .sum().reset_index().sort_values('date'))
emp_id        date  hours_spent
     1  2016-10-31            8
     1  2016-11-15           16
     2  2016-11-15            8
freq='SM' is the semi-month frequency, which anchors on the 15th and the last day of every month.
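As a quick illustration (a sketch) of where the 'SM' anchor points fall:
import pandas as pd

print(pd.date_range('2016-11-01', '2016-12-31', freq='SM'))
# DatetimeIndex(['2016-11-15', '2016-11-30', '2016-12-15', '2016-12-31'],
#               dtype='datetime64[ns]', freq='SM-15')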
Put DateTime-Values into Bins
If I understood you right, you basically want to put the values in your date column into bins. For this, pandas includes the pd.cut() function, which does exactly what you want.
Here's an approach which might help you:
import pandas as pd

df = pd.DataFrame({
    'hours': 8,
    'emp_id': [1, 1, 2, 1],
    'date': [pd.Timestamp(2016, 11, 9),
             pd.Timestamp(2016, 11, 15),
             pd.Timestamp(2016, 11, 22),
             pd.Timestamp(2016, 11, 23)]
})

# bin edges at the semi-month ends: 2016-10-31, 2016-11-15, 2016-11-30
bins_dt = pd.date_range('2016-10-16', freq='SM', periods=3)
cycle = pd.cut(df.date, bins_dt)

df.groupby([cycle, 'emp_id'])['hours'].sum()
Which gets you:
cycle                     emp_id  hours
------------------------  ------  -----
(2016-10-31, 2016-11-15]       1     16
                               2    NaN
(2016-11-15, 2016-11-30]       1      8
                               2      8
Had a similar question, here was my solution:
df1['BiMonth'] = df1['Date'] + pd.DateOffset(days=-1) + pd.offsets.SemiMonthEnd()
df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')
The construction "df1['Date'] + pd.DateOffset(days=-1)" will take whatever is in the date column and -1 day.
The construction "+ pd.offsets.SemiMonthEnd()" converts it to a bimonthly basket, but its off by a day unless you reduce the reference date by 1.
The construction "df1['BiMonth'] = df1['BiMonth'].dt.to_period('D')" cleans out the time so you just have days.