Extracting multiple values in different rows - python

I have a dataset
ID col1 col2 year
1 A 111,222,3334 2010
2 B 344, 111 2010
3 C 121,123 2011
I want to rearrange the dataset in the following way:
ID col1 col2 year
1 A 111 2010
1 A 222 2010
1 A 3334 2010
2 B 344 2010
2 B 111 2010
3 C 121 2011
3 C 123 2011
I can do it using the following code:
a = df.COMP_MONITOR_TYPE_CODE.str[:3]
df['col2'] = np.where(a == '111', '111', df['col2'])
Since I have a very long dataset, it would be time-consuming to do it one by one like this. Is there any other way to do it?

split + explode:
df.assign(col2 = df.col2.str.split(',')).explode('col2')
# ID col1 col2 year
#0 1 A 111 2010
#0 1 A 222 2010
#0 1 A 3334 2010
#1 2 B 344 2010
#1 2 B 111 2010
#2 3 C 121 2011
#2 3 C 123 2011
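For completeness, a minimal self-contained sketch of the same split + explode idea (assuming pandas 0.25+, where explode was added; the pd.to_numeric step is only needed if you want numbers rather than strings back):
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'col1': ['A', 'B', 'C'],
                   'col2': ['111,222,3334', '344, 111', '121,123'],
                   'year': [2010, 2010, 2011]})

# Split the comma-separated strings into lists, then explode to one value per row.
out = df.assign(col2=df.col2.str.split(',')).explode('col2')

# Optional: strip stray spaces and convert the exploded strings back to numbers.
out['col2'] = pd.to_numeric(out['col2'].str.strip())
print(out.reset_index(drop=True))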

Related

Loop through timeseries and fill missing data - Python

I have a DF such as the one below:
ID  Year  Value
1   2007  1
1   2008  1
1   2009  1
1   2011  1
1   2013  1
1   2014  1
1   2015  1
2   2008  1
2   2010  1
2   2011  1
2   2012  1
2   2013  1
2   2014  1
3   2009  1
3   2010  1
3   2011  1
3   2012  1
3   2013  1
3   2014  1
3   2015  1
As you can see, ID '1' is missing values for 2010 and 2012; ID '2' is missing 2007, 2009, and 2015; and ID '3' is missing 2007 and 2008. I would like to fill these gaps with the value '1'. What I would like to achieve is shown below:
ID  Year  Value
1   2007  1
1   2008  1
1   2009  1
1   2010  1
1   2011  1
1   2012  1
1   2013  1
1   2014  1
1   2015  1
2   2007  1
2   2008  1
2   2009  1
2   2010  1
2   2011  1
2   2012  1
2   2013  1
2   2014  1
2   2015  1
3   2007  1
3   2008  1
3   2009  1
3   2010  1
3   2011  1
3   2012  1
3   2013  1
3   2014  1
3   2015  1
I have created the code below so far; however, it only fills for one ID, and I was struggling to find a way to loop through each ID, adding a 'Value' for each year that is missing:
idx = pd.date_range('2007', '2020', freq='Y')
DF.index = pd.DatetimeIndex(DF.index)
DF_s = DF.reindex(idx, fill_value=0)
Any ideas would be helpful, please.
I'm not sure I got what you want to achieve, but if you want to fill NaNs in the "Value" column between 2007 and 2015 (suggesting that there are more years where you don't want to fill the column), you could do something like this:
import math
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                    'Year': [2007, 2010, 2020, 2007, 2010, 2015],
                    'Value': [1, None, None, None, 1, None]})

# Write a function with your logic
def func(x, y):
    return 0 if math.isnan(y) and 2007 <= x <= 2015 else y

# Apply it to the df and update the column
df1['Value'] = df1.apply(lambda x: func(x.Year, x.Value), axis=1)
# ID Year Value
# 0 1 2007 1.0
# 1 1 2010 0.0
# 2 1 2020 NaN
# 3 2 2007 0.0
# 4 2 2010 1.0
# 5 2 2015 0.0
Answering my own question :). I needed to apply a lambda function after doing the groupby(['Org']) that adds a NaN for each year that is missing. The reset_index effectively ungroups it back into the original flat frame.
f = lambda x: x.reindex(pd.date_range(pd.to_datetime('2007'), pd.to_datetime('2020'), name='date', freq='Y'))
DF_fixed = DF.set_index('Year').groupby(['Org']).apply(f).drop(['Org'], axis=1)
DF_fixed = DF_fixed.reset_index()
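For the integer ID/Year layout shown in the question, a minimal self-contained sketch of the same idea using a MultiIndex reindex instead of groupby + apply (assuming the target range is 2007-2015 and that gaps should be filled with the value 1):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                   'Year': [2007, 2009, 2011, 2008, 2010],
                   'Value': [1, 1, 1, 1, 1]})

# Every (ID, Year) combination we want to end up with.
full = pd.MultiIndex.from_product([df['ID'].unique(), range(2007, 2016)],
                                  names=['ID', 'Year'])

# Reindex onto the full grid, filling the missing rows with 1.
filled = (df.set_index(['ID', 'Year'])
            .reindex(full, fill_value=1)
            .reset_index())
print(filled)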

Changing an existing column conditional on two other columns

I have a data set:
ID Fv_year HP_b_year HP_e_year
1 2010 0 2012
2 0 2009 2011
3 2000 0 2008
4 2001 0 0
I want to generate:
ID Fv_year HP_b_year HP_e_year
1 2010 2010 2012
2 0 2009 2011
3 2000 2000 2008
4 2001 0 0
In words: when Fv_year > 0, HP_b_year = 0, and HP_e_year > 0, I want to set HP_b_year = Fv_year; otherwise keep HP_b_year as it was before. I have used the following code:
def myfunc(x, y, z):
    if x == 0 and y > 0 and z > 0:
        return y
    else:
        return x
df['HP_b_year'] = df.apply(lambda x: myfunc(x.HP_b_year, x.Fv_year, x.HP_e_year), axis=1)
But it's not working.
You can use loc with conditions
m = (df['HP_e_year'] > 0) & (df['Fv_year'].ne(0))
df.loc[m, 'HP_b_year'] = df['Fv_year'][m]
ID Fv_year HP_b_year HP_e_year
0 1 2010 2010 2012
1 2 0 2009 2011
2 3 2000 2000 2008
3 4 2001 0 0
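If you also want to require HP_b_year == 0, exactly as stated in the question, here is a hedged sketch with numpy.where and the full three-part mask (for the sample data it gives the same result as the loc answer above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Fv_year': [2010, 0, 2000, 2001],
                   'HP_b_year': [0, 2009, 0, 0],
                   'HP_e_year': [2012, 2011, 2008, 0]})

# All three conditions from the question: Fv_year > 0, HP_b_year == 0, HP_e_year > 0.
mask = (df['Fv_year'] > 0) & (df['HP_b_year'] == 0) & (df['HP_e_year'] > 0)
df['HP_b_year'] = np.where(mask, df['Fv_year'], df['HP_b_year'])
print(df)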

Fill a column in a dataframe if a condition is met

I have the following dataframe:
PersonID AmountPaid PaymentReceivedDate StartDate withinNYears
1 100 2017 2016
2 20 2014 2014
1 30 2017 2016
1 40 2016 2016
4 300 2015 2000
5 150 2005 2002
What I'm looking for is that AmountPaid should appear in the withinNYears column if the payment was made within N years of the start date; otherwise you get NaN.
N can be any number, but let's say 2 for this example (as I will be playing with this to see findings).
So basically the above dataframe would come out like this if the amount was paid within 2 years:
PersonID AmountPaid PaymentReceivedDate StartDate withinNYears
1 100 2017 2016 100
2 20 2014 2014 20
1 30 2017 2016 30
1 40 2016 2016 40
4 300 2015 2000 NaN
5 150 2005 2002 NaN
Does anyone know how to achieve this? Cheers.
Subtract the columns and compare with a scalar to get a boolean mask, then set the values with numpy.where, Series.where, or DataFrame.loc:
m = (df['PaymentReceivedDate'] - df['StartDate']) < 2
df['withinNYears'] = np.where(m, df['AmountPaid'], np.nan)
#alternatives
#df['withinNYears'] = df['AmountPaid'].where(m)
#df.loc[m, 'withinNYears'] = df['AmountPaid']
print (df)
PersonID AmountPaid PaymentReceivedDate StartDate \
0 1 100 2017 2016
1 2 20 2014 2014
2 1 30 2017 2016
3 1 40 2016 2016
4 4 300 2015 2000
5 5 150 2005 2002
withinNYears
0 100.0
1 20.0
2 30.0
3 40.0
4 NaN
5 NaN
EDIT:
If the StartDate column holds datetimes:
m = (df['PaymentReceivedDate'] - df['StartDate'].dt.year) < 2
Just do the assignment with loc:
df.loc[(df['PaymentReceivedDate'] - df['StartDate']<2),'withinNYears']=df.AmountPaid
df
Out[37]:
PersonID AmountPaid ... StartDate withinNYears
0 1 100 ... 2016 100.0
1 2 20 ... 2014 20.0
2 1 30 ... 2016 30.0
3 1 40 ... 2016 40.0
4 4 300 ... 2000 NaN
5 5 150 ... 2002 NaN
[6 rows x 5 columns]
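Since the question says N may vary, here is a hedged sketch that parameterizes it; the helper name within_n_years is purely illustrative, not part of any library:
import pandas as pd

df = pd.DataFrame({'PersonID': [1, 2, 1, 1, 4, 5],
                   'AmountPaid': [100, 20, 30, 40, 300, 150],
                   'PaymentReceivedDate': [2017, 2014, 2017, 2016, 2015, 2005],
                   'StartDate': [2016, 2014, 2016, 2016, 2000, 2002]})

def within_n_years(frame, n):
    # AmountPaid where the payment came within n years of StartDate, NaN elsewhere.
    mask = (frame['PaymentReceivedDate'] - frame['StartDate']) < n
    return frame['AmountPaid'].where(mask)

df['withinNYears'] = within_n_years(df, 2)
print(df)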

Loop Optimization in python

I have a dataframe df like this
Product Yr Value
A 2014 1
A 2015 3
A 2016 2
B 2015 2
B 2016 1
I want to compute a cumulative max, i.e.
Product Yr Value
A 2014 1
A 2015 3
A 2016 3
B 2015 2
B 2016 2
My actual data has about 50000 products
I am writing code like this:
df2 = pd.DataFrame()
for i in df['Product'].unique():
    data3 = df[df['Product'] == i]
    data3 = data3.sort_values(by=['Yr'])
    data3['Value'] = data3['Value'].cummax()
    df2 = df2.append(data3)
# df2 is my result
This code is taking a lot of time (~3 days) for about 50000 products and 10 years. Is there some way to speed it up?
You can use groupby.cummax instead:
df['Value'] = df.sort_values('Yr').groupby('Product').Value.cummax()
df
#Product Yr Value
#0 A 2014 1
#1 A 2015 3
#2 A 2016 3
#3 B 2015 2
#4 B 2016 2
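A minimal end-to-end sketch of that one-liner; note that assigning back to df['Value'] relies on pandas aligning the result on the original index, so the pre-sorting does not scramble the rows:
import pandas as pd

df = pd.DataFrame({'Product': ['A', 'A', 'A', 'B', 'B'],
                   'Yr': [2014, 2015, 2016, 2015, 2016],
                   'Value': [1, 3, 2, 2, 1]})

# Sort by year, take the running max within each product,
# then assign back; the result is aligned on the original index.
df['Value'] = df.sort_values('Yr').groupby('Product')['Value'].cummax()
print(df)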

reshape a pandas DataFrame by transposing two columns and repeating another

The title might be a bit confusing, this is what I want to do:
I would like to convert this dataframe
pd.DataFrame({'name':['A','B','C'],'date1':[1999,2000,2001],'date2':[2011,2012,2013]})
date1 date2 name
0 1999 2011 A
1 2000 2012 B
2 2001 2013 C
Into the following:
dates name
0 1999 A
1 2011 A
2 2000 B
3 2012 B
4 2001 C
5 2013 C
I've been trying to do pivot tables and transposing, but with no luck.
You can use melt, drop the helper 'variable' column with drop, and finally sort_values:
print (pd.melt(df, id_vars='name', value_name='dates')
.drop('variable', axis=1)
.sort_values('name')[['dates','name']])
dates name
0 1999 A
3 2011 A
1 2000 B
4 2012 B
2 2001 C
5 2013 C
Another solution with unstack and sort_index:
print (df.set_index('name')
.unstack()
.reset_index(drop=True, level=0)
.sort_index()
.reset_index(name='dates')[['dates','name']])
dates name
0 1999 A
1 2011 A
2 2000 B
3 2012 B
4 2001 C
5 2013 C
Solution with lreshape and sort_values:
print (pd.lreshape(df, {'dates':['date1', 'date2']}).sort_values('name')[['dates','name']])
dates name
0 1999 A
3 2011 A
1 2000 B
4 2012 B
2 2001 C
5 2013 C
Numpy solution with numpy.repeat and flattening by numpy.ravel:
df2 = pd.DataFrame({
"name": np.repeat(df.name, 2),
"dates": df[['date1','date2']].values.ravel()})
print (df2)
dates name
0 1999 A
0 2011 A
1 2000 B
1 2012 B
2 2001 C
2 2013 C
EDIT:
lreshape is now undocumented and may be removed in the future (along with pd.wide_to_long).
A possible outcome is merging all three functions into one, perhaps melt, but that is not implemented yet. Maybe in some future version of pandas; then this answer will be updated.
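Since pd.wide_to_long is mentioned above, here is a hedged sketch of the same reshape with it (assuming the date1/date2 column names keep their numeric suffixes, which wide_to_long uses to build the long format):
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C'],
                   'date1': [1999, 2000, 2001],
                   'date2': [2011, 2012, 2013]})

# Stub name 'date' gathers date1/date2; the numeric suffix lands in column 'num'.
out = (pd.wide_to_long(df, stubnames='date', i='name', j='num')
         .reset_index()
         .rename(columns={'date': 'dates'})
         .sort_values('name')[['dates', 'name']])
print(out)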
