I have this initial DataFrame in Pandas
A B C D E
0 23 2015 1 14937 16.25
1 23 2015 1 19054 7.50
2 23 2015 2 14937 16.75
3 23 2015 2 19054 17.25
4 23 2015 3 14937 71.75
5 23 2015 3 19054 15.00
6 23 2015 4 14937 13.00
7 23 2015 4 19054 37.75
8 23 2015 5 14937 4.25
9 23 2015 5 19054 18.25
10 23 2015 6 14937 16.50
11 23 2015 6 19054 1.00
If I want to obtain this result, how could I do it?
A B C D E
0 23 2015 1 14937 NaN
1 23 2015 2 14937 NaN
2 23 2015 2 14937 16.6
3 23 2015 1 14937 35.1
4 23 2015 2 14937 33.8
5 23 2015 3 14937 29.7
6 23 2015 4 14937 11.3
7 23 2015 4 19054 NaN
8 23 2015 5 19054 NaN
9 23 2015 5 19054 13.3
10 23 2015 6 19054 23.3
11 23 2015 6 19054 23.7
12 23 2015 6 19054 19.0
I tried a GroupBy but I dind't get it
DfMean = pd.DataFrame(DfGby.rolling(center=False,window=3)['E'].mean())
I think you can use groupby with rolling (need at least pandas 0.18.1):
s = df.groupby('D').rolling(3)['E'].mean()
print (s)
D
14937 0 NaN
2 NaN
4 34.916667
6 33.833333
8 29.666667
10 11.250000
19054 1 NaN
3 NaN
5 13.250000
7 23.333333
9 23.666667
11 19.000000
Name: E, dtype: float64
Then set_index by D with swaplevel for same order for matching output:
df = df.set_index('D', append=True).swaplevel(0,1)
df['E'] = s
Last reset_index and reorder columns:
df = df.reset_index(level=0).sort_values(['D','C'])
df = df[['A','B','C','D','E']]
print (df)
A B C D E
0 23 2015 1 14937 NaN
2 23 2015 2 14937 NaN
4 23 2015 3 14937 34.916667
6 23 2015 4 14937 33.833333
8 23 2015 5 14937 29.666667
10 23 2015 6 14937 11.250000
1 23 2015 1 19054 NaN
3 23 2015 2 19054 NaN
5 23 2015 3 19054 13.250000
7 23 2015 4 19054 23.333333
9 23 2015 5 19054 23.666667
11 23 2015 6 19054 19.000000
Related
ID LIST_OF_TUPLE (2col)
1 [('2012','12'), ('2012','33'), ('2014', '82')]
2 NA
3 [('2012','12')]
4 [('2012','12'), ('2012','33'), ('2014', '82'), ('2022', '67')]
Result:
ID TUP_1 TUP_2(3col)
1 2012 12
1 2012 33
1 2014 82
3 2012 12
4 2012 12
4 2012 33
4 2014 82
4 2022 67
Thanks in advance.
This is explode then create a dataframe and then join:
s = df['LIST_OF_TUPLE'].explode()
out = (df[['ID']].join(pd.DataFrame(s.tolist(),index=s.index)
.add_prefix("TUP_")).reset_index(drop=True)) #you can chain a dropna if reqd
print(out)
ID TUP_0 TUP_1
0 1 2012 12
1 1 2012 33
2 1 2014 82
3 2 NaN None
4 3 2012 12
5 4 2012 12
6 4 2012 33
7 4 2014 82
8 4 2022 67
My goal is to replace the last value (or the last several values) of each id with NaN. My real dataset is quite large and has groups of different sizes.
Example:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2010,2011,2012,2013,2014,2015]
percent = [120,70,37,40,50,110,140,100,90,5,52,80,60,40,70,60,50,110]
dictex ={"id":ids,"year":year,"percent [%]": percent}
dfex = pd.DataFrame(dictex)
print(dfex)
id year percent [%]
0 1 2000 120
1 1 2001 70
2 1 2002 37
3 1 2003 40
4 1 2004 50
5 1 2005 110
6 2 1990 140
7 2 1991 100
8 2 1992 90
9 2 1993 5
10 2 1994 52
11 2 1995 80
12 3 2010 60
13 3 2011 40
14 3 2012 70
15 3 2013 60
16 3 2014 50
17 3 2015 110
My goal is to replace the last 1 / or 2 / or 3 values of the "percent [%]" column for each id (group) with NaN.
The result should look like this: (here: replace the last 2 values of each id)
id year percent [%]
0 1 2000 120
1 1 2001 70
2 1 2002 37
3 1 2003 40
4 1 2004 NaN
5 1 2005 NaN
6 2 1990 140
7 2 1991 100
8 2 1992 90
9 2 1993 5
10 2 1994 NaN
11 2 1995 NaN
12 3 2010 60
13 3 2011 40
14 3 2012 70
15 3 2013 60
16 3 2014 NaN
17 3 2015 NaN
I know there should be a relatively easy solution for this, but i'm new to python and simply haven't been able to figure out an elegant way.
Thanks for the help!
try using groupby, tail and index to find the index of those rows that will be modified and use loc to change the values
nrows = 2
idx = df.groupby('id').tail(nrows).index
df.loc[idx, 'percent [%]'] = np.nan
#output
id year percent [%]
0 1 2000 120.0
1 1 2001 70.0
2 1 2002 37.0
3 1 2003 40.0
4 1 2004 NaN
5 1 2005 NaN
6 2 1990 140.0
7 2 1991 100.0
8 2 1992 90.0
9 2 1993 5.0
10 2 1994 NaN
11 2 1995 NaN
12 3 2010 60.0
13 3 2011 40.0
14 3 2012 70.0
15 3 2013 60.0
16 3 2014 NaN
17 3 2015 NaN
I am trying to create a new variable which performs the SALES_AMOUNT difference between years-month on the following dataframe. I think my code should be think with this groupby but i dont know how to add the condition [df2 df.Control - df.Control.shift(1) == 12] after the groupby so as to perform a correct difference between years
df['LY'] = df.groupby(['month']).SALES_AMOUNT.shift(1)
Dataframe:
SALES_AMOUNT Store Control year month
0 16793.14 A 3 2013 3
1 42901.61 A 5 2013 5
2 63059.72 A 6 2013 6
3 168471.43 A 10 2013 10
4 58570.72 A 11 2013 11
5 67526.71 A 12 2013 12
6 50649.07 A 14 2014 2
7 48819.97 A 18 2014 6
8 97100.77 A 19 2014 7
9 67778.40 A 21 2014 9
10 90327.52 A 22 2014 10
11 75703.12 A 23 2014 11
12 26098.50 A 24 2014 12
13 81429.36 A 25 2015 1
14 19539.85 A 26 2015 2
15 71727.66 A 27 2015 3
16 20117.79 A 28 2015 4
17 44252.19 A 29 2015 6
18 68578.82 A 30 2015 7
19 91483.39 A 31 2015 8
20 39220.87 A 32 2015 10
21 12224.11 A 33 2015 11
result should look like this:
SALES_AMOUNT Store Control year month year_diff
0 16793.14 A 3 2013 3 Nan
1 42901.61 A 5 2013 5 Nan
2 63059.72 A 6 2013 6 Nan
3 168471.43 A 10 2013 10 Nan
4 58570.72 A 11 2013 11 Nan
5 67526.71 A 12 2013 12 Nan
6 50649.07 A 14 2014 2 Nan
7 48819.97 A 18 2014 6 -14239.75
8 97100.77 A 19 2014 7 Nan
9 67778.40 A 21 2014 9 Nan
10 90327.52 A 22 2014 10 -78143.91
11 75703.12 A 23 2014 11 17132.4
12 26098.50 A 24 2014 12 -41428.21
13 81429.36 A 25 2015 1 Nan
14 19539.85 A 26 2015 2 -31109.22
15 71727.66 A 27 2015 3 Nan
16 20117.79 A 28 2015 4 Nan
17 44252.19 A 29 2015 6 -4567.78
18 68578.82 A 30 2015 7 -28521.95
19 91483.39 A 31 2015 8 Nan
20 39220.87 A 32 2015 10 -51106.65
21 12224.11 A 33 2015 11 -63479.01
I think what you're looking for is the below:
df = df.sort_values(by=['month', 'year'])
df['SALES_AMOUNT_shifted'] = df.groupby(['month'])['SALES_AMOUNT'].shift(1).tolist()
df['LY'] = df['SALES_AMOUNT'] - df['SALES_AMOUNT_shifted']
Once you sort by month and year, the month groups will be organized in a consistent way and then the shift makes sense.
-- UPDATE --
After applying the solution above, you could set to None all instances where the year difference is greater than 1.
df['year_diff'] = df['year'] - df.groupby(['month'])['year'].shift()
df['year_diff'] = df['year_diff'].fillna(0)
df.loc[df['year_diff'] != 1, 'LY'] = None
Using this I'm getting the desired output that you added.
Does this work? I would also greatly appreciate a pandas-centric solution, as I spent some time on this and could not come up with one.
df = pd.read_clipboard().set_index('Control')
df['yoy_diff'] = np.nan
for i in df.index:
for j in df.index:
if j - i == 12:
df['yoy_diff'].loc[j] = df.loc[j, 'SALES_AMOUNT'] - df.loc[i, 'SALES_AMOUNT']
df
Output:
SALES_AMOUNT Store year month yoy_diff
Control
3 16793.14 A 2013 3 NaN
5 42901.61 A 2013 5 NaN
6 63059.72 A 2013 6 NaN
10 168471.43 A 2013 10 NaN
11 58570.72 A 2013 11 NaN
12 67526.71 A 2013 12 NaN
14 50649.07 A 2014 2 NaN
18 48819.97 A 2014 6 -14239.75
19 97100.77 A 2014 7 NaN
21 67778.40 A 2014 9 NaN
22 90327.52 A 2014 10 -78143.91
23 75703.12 A 2014 11 17132.40
24 26098.50 A 2014 12 -41428.21
25 81429.36 A 2015 1 NaN
26 19539.85 A 2015 2 -31109.22
27 71727.66 A 2015 3 NaN
28 20117.79 A 2015 4 NaN
29 44252.19 A 2015 6 NaN
30 68578.82 A 2015 7 19758.85
31 91483.39 A 2015 8 -5617.38
32 39220.87 A 2015 10 NaN
33 12224.11 A 2015 11 -55554.29
i am a stata user and i trying to switch to python and i having problem with some codes. If i have the following panel data
id year quarter fecha jobs
1 2007 1 220 10
1 2007 2 221 12
1 2007 3 222 12
1 2007 4 223 12
1 2008 1 224 12
1 2008 2 225 13
1 2008 3 226 14
1 2008 4 227 9
1 2009 1 228 12
1 2009 2 229 15
1 2009 3 230 18
1 2009 4 231 15
1 2010 1 232 15
1 2010 2 233 16
1 2010 3 234 17
1 2010 4 235 18
2 2007 1 220 10
2 2007 2 221 12
2 2007 3 222 12
2 2007 4 223 12
2 2008 1 224 12
2 2008 2 225 13
2 2008 3 226 14
2 2008 4 227 9
2 2009 1 228 12
2 2009 2 229 15
2 2009 3 230 18
2 2009 4 231 15
2 2010 1 232 15
2 2010 2 233 16
2 2010 4 235 18
(My panel data is much bigger than the example, is just to illustrate my problem). I want to calculate the variation of jobs of the same quarter and three year before
So result should look like these
id year quarter fecha jobs jobs_variation
1 2007 1 220 10 Nan
1 2007 2 221 12 Nan
1 2007 3 222 12 Nan
1 2007 4 223 12 Nan
1 2008 1 224 12 Nan
1 2008 2 225 13 Nan
1 2008 3 226 14 Nan
1 2008 4 227 9 Nan
1 2009 1 228 12 Nan
1 2009 2 229 15 Nan
1 2009 3 230 18 Nan
1 2009 4 231 15 Nan
1 2010 1 232 15 0.5
1 2010 2 233 16 0.33
1 2010 3 234 17 0.30769
1 2010 4 235 18 0.5
2 2007 1 220 10 Nan
2 2007 4 223 12 Nan
2 2008 1 224 12 Nan
2 2008 2 225 13 Nan
2 2008 3 226 14 Nan
2 2008 4 227 9 Nan
2 2009 1 228 12 Nan
2 2009 2 229 15 Nan
2 2009 3 230 18 Nan
2 2009 4 231 15 Nan
2 2010 1 232 15 0.5
2 2010 2 233 16 Nan
2 2010 3 234 20 Nan
2 2010 4 235 18 0.5
Check that in the second id year 2010 in the second and thir quarter calculation must not be me made because the id was not present at 2007Q2 and 2007Q3.
In stata the code would be,
bys id: gen jobs_variation=jobs/jobs[_n-12]-1 if fecha[_n-12]==fecha-12
IIUC, you need a groupby on id and quarter followed by apply:
df['jobs_variation'] = df.groupby(['id', 'quarter']).jobs\
.apply(lambda x: x / x.shift(3) - 1)
df
id year quarter fecha jobs jobs_variation
0 1 2007 1 220 10 NaN
1 1 2007 2 221 12 NaN
2 1 2007 3 222 12 NaN
3 1 2007 4 223 12 NaN
4 1 2008 1 224 12 NaN
5 1 2008 2 225 13 NaN
6 1 2008 3 226 14 NaN
7 1 2008 4 227 9 NaN
8 1 2009 1 228 12 NaN
9 1 2009 2 229 15 NaN
10 1 2009 3 230 18 NaN
11 1 2009 4 231 15 NaN
12 1 2010 1 232 15 0.500000
13 1 2010 2 233 16 0.333333
14 1 2010 3 234 17 0.416667
15 1 2010 4 235 18 0.500000
16 2 2007 1 220 10 NaN
17 2 2007 4 223 12 NaN
18 2 2008 1 224 12 NaN
19 2 2008 2 225 13 NaN
20 2 2008 3 226 14 NaN
21 2 2008 4 227 9 NaN
22 2 2009 1 228 12 NaN
23 2 2009 2 229 15 NaN
24 2 2009 3 230 18 NaN
25 2 2009 4 231 15 NaN
26 2 2010 1 232 15 0.500000
27 2 2010 2 233 16 NaN
28 2 2010 3 234 20 NaN
29 2 2010 4 235 18 0.500000
x / x.shift(3) will divide the current year's job count (for that quarter) by the corresponding value from 3 years ago.
l have a few years data set but some of values are missing.l would like to fill these rows with "NAN"
here is an example data:
year month day min
2011 1 1 -2.3
2011 1 2 -9.1
2011 1 3 -4.7
2011 1 4 -3.5
2011 1 6 -1.4
2011 1 7 0.1
2011 1 9 -6.3
2011 1 10 -9.4
2011 1 11 -13.3
2011 1 12 -17.9
2011 1 14 -11.8
2011 1 15 -11.2
2011 1 16 -7.1
2011 1 17 -7.6
2011 1 18 -9.9
2011 1 20 -6.9
2011 1 21 -8.8
2011 1 22 -11.3
2011 1 24 -3.1
2011 1 25 -0.7
2011 1 26 0.8
2011 1 27 -0.9
2011 1 28 -6.9
2011 1 29 -3.2
2011 1 30 -2.3
2011 1 31 -7
as you see ,in first month of 2011 , many value missing and l need to open a row for this values and then fill. is there any way to do it ?
You need reindex by MultiIndex.from_arrays created by date_range:
start = '2011-01-01'
end = '2011-01-31'
rng = pd.date_range(start, end)
mux = pd.MultiIndex.from_arrays([rng.year, rng.month, rng.day], names=('year','month','day'))
df = df.set_index(['year','month','day'])
print (df.reindex(mux).reset_index())
year month day min
0 2011 1 1 -2.3
1 2011 1 2 -9.1
2 2011 1 3 -4.7
3 2011 1 4 -3.5
4 2011 1 5 NaN
5 2011 1 6 -1.4
6 2011 1 7 0.1
7 2011 1 8 NaN
8 2011 1 9 -6.3
9 2011 1 10 -9.4
10 2011 1 11 -13.3
11 2011 1 12 -17.9
12 2011 1 13 NaN
13 2011 1 14 -11.8
14 2011 1 15 -11.2
15 2011 1 16 -7.1
16 2011 1 17 -7.6
17 2011 1 18 -9.9
18 2011 1 19 NaN
19 2011 1 20 -6.9
20 2011 1 21 -8.8
21 2011 1 22 -11.3
22 2011 1 23 NaN
23 2011 1 24 -3.1
24 2011 1 25 -0.7
25 2011 1 26 0.8
26 2011 1 27 -0.9
27 2011 1 28 -6.9
28 2011 1 29 -3.2
29 2011 1 30 -2.3
30 2011 1 31 -7.0
Convert the DataFrame to a timeseries with a datetime index, and then change the frequency of the index to daily ('D') using asfreq:
import pandas as pd
raw = """2011 1 1 -2.3
2011 1 2 -9.1
2011 1 3 -4.7
2011 1 4 -3.5
2011 1 6 -1.4"""
# Parse the rows into dates and values
new_rows = []
for row in raw.split('\n'):
date = pd.to_datetime('/'.join(row.split()[:3]))
value = row[-1]
new_rows.append({'date': date, 'value': value})
timeseries = pd.DataFrame(new_rows).set_index('date')
timeseries.asfreq('D')
I think df.replace() does the job:
df = pd.DataFrame([
[-0.532681, 'foo', 0],
[1.490752, 'bar', 1],
[-1.387326, 'foo', 2],
[0.814772, 'baz', ' '],
[-0.222552, ' ', 4],
[-1.176781, 'qux', ' '],
], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))
print df.replace(r'\s+', np.nan, regex=True)
Produces:
A B C
2000-01-01 -0.532681 foo 0
2000-01-02 1.490752 bar 1
2000-01-03 -1.387326 foo 2
2000-01-04 0.814772 baz NaN
2000-01-05 -0.222552 NaN 4
2000-01-06 -1.176781 qux NaN
Yeah use Pandas
Create a dataframe with your date as index
Use asfreq
Hope this helps, see http://pandas.pydata.org/pandas-docs/stable/timeseries.html for more information :)