I have a dataset spanning a few years, but some of the values are missing. I would like to fill these rows with NaN.
Here is an example of the data:
year month day min
2011 1 1 -2.3
2011 1 2 -9.1
2011 1 3 -4.7
2011 1 4 -3.5
2011 1 6 -1.4
2011 1 7 0.1
2011 1 9 -6.3
2011 1 10 -9.4
2011 1 11 -13.3
2011 1 12 -17.9
2011 1 14 -11.8
2011 1 15 -11.2
2011 1 16 -7.1
2011 1 17 -7.6
2011 1 18 -9.9
2011 1 20 -6.9
2011 1 21 -8.8
2011 1 22 -11.3
2011 1 24 -3.1
2011 1 25 -0.7
2011 1 26 0.8
2011 1 27 -0.9
2011 1 28 -6.9
2011 1 29 -3.2
2011 1 30 -2.3
2011 1 31 -7
As you can see, many values are missing in the first month of 2011, and I need to insert a row for each missing day and then fill it. Is there any way to do this?
You need to reindex by a MultiIndex.from_arrays created from a date_range:
import pandas as pd

start = '2011-01-01'
end = '2011-01-31'
rng = pd.date_range(start, end)
mux = pd.MultiIndex.from_arrays([rng.year, rng.month, rng.day],
                                names=('year', 'month', 'day'))
df = df.set_index(['year', 'month', 'day'])
print(df.reindex(mux).reset_index())
year month day min
0 2011 1 1 -2.3
1 2011 1 2 -9.1
2 2011 1 3 -4.7
3 2011 1 4 -3.5
4 2011 1 5 NaN
5 2011 1 6 -1.4
6 2011 1 7 0.1
7 2011 1 8 NaN
8 2011 1 9 -6.3
9 2011 1 10 -9.4
10 2011 1 11 -13.3
11 2011 1 12 -17.9
12 2011 1 13 NaN
13 2011 1 14 -11.8
14 2011 1 15 -11.2
15 2011 1 16 -7.1
16 2011 1 17 -7.6
17 2011 1 18 -9.9
18 2011 1 19 NaN
19 2011 1 20 -6.9
20 2011 1 21 -8.8
21 2011 1 22 -11.3
22 2011 1 23 NaN
23 2011 1 24 -3.1
24 2011 1 25 -0.7
25 2011 1 26 0.8
26 2011 1 27 -0.9
27 2011 1 28 -6.9
28 2011 1 29 -3.2
29 2011 1 30 -2.3
30 2011 1 31 -7.0
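The same approach extends to a multi-year span; a sketch, assuming the same df and pd alias as above (the dates are placeholders for your actual span):
rng = pd.date_range('2011-01-01', '2013-12-31')  # full span of the dataset
mux = pd.MultiIndex.from_arrays([rng.year, rng.month, rng.day],
                                names=('year', 'month', 'day'))
df = df.set_index(['year', 'month', 'day']).reindex(mux).reset_index()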
Convert the DataFrame to a timeseries with a datetime index, and then change the frequency of the index to daily ('D') using asfreq:
import pandas as pd
raw = """2011 1 1 -2.3
2011 1 2 -9.1
2011 1 3 -4.7
2011 1 4 -3.5
2011 1 6 -1.4"""
# Parse the rows into dates and values
new_rows = []
for row in raw.split('\n'):
    fields = row.split()
    date = pd.to_datetime('/'.join(fields[:3]))  # year/month/day
    value = float(fields[3])                     # the min temperature
    new_rows.append({'date': date, 'value': value})

timeseries = pd.DataFrame(new_rows).set_index('date')
timeseries.asfreq('D')
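If you need the year/month/day columns back after filling, a sketch (column names assumed from the example above):
filled = timeseries.asfreq('D').reset_index()
filled['year'] = filled['date'].dt.year
filled['month'] = filled['date'].dt.month
filled['day'] = filled['date'].dt.day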
I think df.replace() does the job:
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, ' ', 4],
    [-1.176781, 'qux', ' '],
], columns='A B C'.split(), index=pd.date_range('2000-01-01', '2000-01-06'))

print(df.replace(r'\s+', np.nan, regex=True))
Produces:
A B C
2000-01-01 -0.532681 foo 0
2000-01-02 1.490752 bar 1
2000-01-03 -1.387326 foo 2
2000-01-04 0.814772 baz NaN
2000-01-05 -0.222552 NaN 4
2000-01-06 -1.176781 qux NaN
Yeah, use pandas:
Create a DataFrame with your date as the index.
Use asfreq (sketched below).
Hope this helps; see http://pandas.pydata.org/pandas-docs/stable/timeseries.html for more information :)
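A minimal sketch of those two steps, assuming a small sample with 2011-01-05 missing:
import pandas as pd

df = pd.DataFrame({'min': [-2.3, -9.1, -4.7, -3.5, -1.4]},
                  index=pd.to_datetime(['2011-01-01', '2011-01-02',
                                        '2011-01-03', '2011-01-04',
                                        '2011-01-06']))
print(df.asfreq('D'))  # inserts the missing calendar day with NaN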
I have a DataFrame with an ID column and a LIST_OF_TUPLE column (some rows are NA), and I want to explode each tuple into its own row, with the tuple elements as separate columns:
ID LIST_OF_TUPLE (2col)
1 [('2012','12'), ('2012','33'), ('2014', '82')]
2 NA
3 [('2012','12')]
4 [('2012','12'), ('2012','33'), ('2014', '82'), ('2022', '67')]
Result:
ID TUP_1 TUP_2(3col)
1 2012 12
1 2012 33
1 2014 82
3 2012 12
4 2012 12
4 2012 33
4 2014 82
4 2022 67
Thanks in advance.
This is explode, then create a DataFrame from the exploded tuples, and then join:
s = df['LIST_OF_TUPLE'].explode()
out = (df[['ID']]
       .join(pd.DataFrame(s.tolist(), index=s.index).add_prefix("TUP_"))
       .reset_index(drop=True))  # you can chain a dropna if required
print(out)
ID TUP_0 TUP_1
0 1 2012 12
1 1 2012 33
2 1 2014 82
3 2 NaN None
4 3 2012 12
5 4 2012 12
6 4 2012 33
7 4 2014 82
8 4 2022 67
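For output matching the desired result (dropping the NA row for ID 2), one could chain the dropna mentioned in the comment; TUP_0 is the prefixed column name produced above:
out = out.dropna(subset=['TUP_0']).reset_index(drop=True)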
My goal is to replace the last value (or the last several values) of each id with NaN. My real dataset is quite large and has groups of different sizes.
Example:
import pandas as pd
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year = [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2010,2011,2012,2013,2014,2015]
percent = [120,70,37,40,50,110,140,100,90,5,52,80,60,40,70,60,50,110]
dictex ={"id":ids,"year":year,"percent [%]": percent}
dfex = pd.DataFrame(dictex)
print(dfex)
id year percent [%]
0 1 2000 120
1 1 2001 70
2 1 2002 37
3 1 2003 40
4 1 2004 50
5 1 2005 110
6 2 1990 140
7 2 1991 100
8 2 1992 90
9 2 1993 5
10 2 1994 52
11 2 1995 80
12 3 2010 60
13 3 2011 40
14 3 2012 70
15 3 2013 60
16 3 2014 50
17 3 2015 110
My goal is to replace the last 1 / or 2 / or 3 values of the "percent [%]" column for each id (group) with NaN.
The result should look like this: (here: replace the last 2 values of each id)
id year percent [%]
0 1 2000 120
1 1 2001 70
2 1 2002 37
3 1 2003 40
4 1 2004 NaN
5 1 2005 NaN
6 2 1990 140
7 2 1991 100
8 2 1992 90
9 2 1993 5
10 2 1994 NaN
11 2 1995 NaN
12 3 2010 60
13 3 2011 40
14 3 2012 70
15 3 2013 60
16 3 2014 NaN
17 3 2015 NaN
I know there should be a relatively easy solution for this, but I'm new to Python and simply haven't been able to figure out an elegant way.
Thanks for the help!
Try using groupby, tail, and index to find the index of the rows that will be modified, then use loc to change the values:
import numpy as np

nrows = 2
idx = df.groupby('id').tail(nrows).index
df.loc[idx, 'percent [%]'] = np.nan
#output
id year percent [%]
0 1 2000 120.0
1 1 2001 70.0
2 1 2002 37.0
3 1 2003 40.0
4 1 2004 NaN
5 1 2005 NaN
6 2 1990 140.0
7 2 1991 100.0
8 2 1992 90.0
9 2 1993 5.0
10 2 1994 NaN
11 2 1995 NaN
12 3 2010 60.0
13 3 2011 40.0
14 3 2012 70.0
15 3 2013 60.0
16 3 2014 NaN
17 3 2015 NaN
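A sketch of an alternative with a boolean mask: cumcount(ascending=False) numbers the rows of each group from the end, so the last nrows rows per id are exactly those numbered below nrows:
import numpy as np

nrows = 2
mask = dfex.groupby('id').cumcount(ascending=False) < nrows
dfex.loc[mask, 'percent [%]'] = np.nan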
I am trying to create a new variable that holds the year-over-year SALES_AMOUNT difference for each year-month in the following dataframe. I think my code should be something like the groupby below, but I don't know how to add the condition df.Control - df.Control.shift(1) == 12 after the groupby so as to compute a correct difference between years:
df['LY'] = df.groupby(['month']).SALES_AMOUNT.shift(1)
Dataframe:
SALES_AMOUNT Store Control year month
0 16793.14 A 3 2013 3
1 42901.61 A 5 2013 5
2 63059.72 A 6 2013 6
3 168471.43 A 10 2013 10
4 58570.72 A 11 2013 11
5 67526.71 A 12 2013 12
6 50649.07 A 14 2014 2
7 48819.97 A 18 2014 6
8 97100.77 A 19 2014 7
9 67778.40 A 21 2014 9
10 90327.52 A 22 2014 10
11 75703.12 A 23 2014 11
12 26098.50 A 24 2014 12
13 81429.36 A 25 2015 1
14 19539.85 A 26 2015 2
15 71727.66 A 27 2015 3
16 20117.79 A 28 2015 4
17 44252.19 A 29 2015 6
18 68578.82 A 30 2015 7
19 91483.39 A 31 2015 8
20 39220.87 A 32 2015 10
21 12224.11 A 33 2015 11
result should look like this:
SALES_AMOUNT Store Control year month year_diff
0 16793.14 A 3 2013 3 NaN
1 42901.61 A 5 2013 5 NaN
2 63059.72 A 6 2013 6 NaN
3 168471.43 A 10 2013 10 NaN
4 58570.72 A 11 2013 11 NaN
5 67526.71 A 12 2013 12 NaN
6 50649.07 A 14 2014 2 NaN
7 48819.97 A 18 2014 6 -14239.75
8 97100.77 A 19 2014 7 NaN
9 67778.40 A 21 2014 9 NaN
10 90327.52 A 22 2014 10 -78143.91
11 75703.12 A 23 2014 11 17132.40
12 26098.50 A 24 2014 12 -41428.21
13 81429.36 A 25 2015 1 NaN
14 19539.85 A 26 2015 2 -31109.22
15 71727.66 A 27 2015 3 NaN
16 20117.79 A 28 2015 4 NaN
17 44252.19 A 29 2015 6 -4567.78
18 68578.82 A 30 2015 7 -28521.95
19 91483.39 A 31 2015 8 NaN
20 39220.87 A 32 2015 10 -51106.65
21 12224.11 A 33 2015 11 -63479.01
I think what you're looking for is the below:
df = df.sort_values(by=['month', 'year'])
df['SALES_AMOUNT_shifted'] = df.groupby(['month'])['SALES_AMOUNT'].shift(1).tolist()
df['LY'] = df['SALES_AMOUNT'] - df['SALES_AMOUNT_shifted']
Once you sort by month and year, the month groups will be organized in a consistent way and then the shift makes sense.
-- UPDATE --
After applying the solution above, you can set LY to None wherever the gap to the previous occurrence of that month is not exactly one year.
df['year_diff'] = df['year'] - df.groupby(['month'])['year'].shift()
df['year_diff'] = df['year_diff'].fillna(0)
df.loc[df['year_diff'] != 1, 'LY'] = None
Using this I'm getting the desired output that you added.
Does this work? I would also greatly appreciate a pandas-centric solution, as I spent some time on this and could not come up with one.
import numpy as np
import pandas as pd

df = pd.read_clipboard().set_index('Control')
df['yoy_diff'] = np.nan
for i in df.index:
    for j in df.index:
        if j - i == 12:
            df.loc[j, 'yoy_diff'] = df.loc[j, 'SALES_AMOUNT'] - df.loc[i, 'SALES_AMOUNT']
df
Output:
SALES_AMOUNT Store year month yoy_diff
Control
3 16793.14 A 2013 3 NaN
5 42901.61 A 2013 5 NaN
6 63059.72 A 2013 6 NaN
10 168471.43 A 2013 10 NaN
11 58570.72 A 2013 11 NaN
12 67526.71 A 2013 12 NaN
14 50649.07 A 2014 2 NaN
18 48819.97 A 2014 6 -14239.75
19 97100.77 A 2014 7 NaN
21 67778.40 A 2014 9 NaN
22 90327.52 A 2014 10 -78143.91
23 75703.12 A 2014 11 17132.40
24 26098.50 A 2014 12 -41428.21
25 81429.36 A 2015 1 NaN
26 19539.85 A 2015 2 -31109.22
27 71727.66 A 2015 3 NaN
28 20117.79 A 2015 4 NaN
29 44252.19 A 2015 6 NaN
30 68578.82 A 2015 7 19758.85
31 91483.39 A 2015 8 -5617.38
32 39220.87 A 2015 10 NaN
33 12224.11 A 2015 11 -55554.29
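A vectorized sketch of the same j - i == 12 rule, assuming df is indexed by Control as above: shifting the index forward by 12 and letting pandas align on it replaces the double loop:
prev = df['SALES_AMOUNT'].copy()
prev.index = prev.index + 12  # each value now sits at the Control 12 units later
df['yoy_diff'] = df['SALES_AMOUNT'] - prev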
I have this initial DataFrame in Pandas
A B C D E
0 23 2015 1 14937 16.25
1 23 2015 1 19054 7.50
2 23 2015 2 14937 16.75
3 23 2015 2 19054 17.25
4 23 2015 3 14937 71.75
5 23 2015 3 19054 15.00
6 23 2015 4 14937 13.00
7 23 2015 4 19054 37.75
8 23 2015 5 14937 4.25
9 23 2015 5 19054 18.25
10 23 2015 6 14937 16.50
11 23 2015 6 19054 1.00
If I want to obtain this result, how could I do it?
A B C D E
0 23 2015 1 14937 NaN
1 23 2015 2 14937 NaN
2 23 2015 2 14937 16.6
3 23 2015 1 14937 35.1
4 23 2015 2 14937 33.8
5 23 2015 3 14937 29.7
6 23 2015 4 14937 11.3
7 23 2015 4 19054 NaN
8 23 2015 5 19054 NaN
9 23 2015 5 19054 13.3
10 23 2015 6 19054 23.3
11 23 2015 6 19054 23.7
12 23 2015 6 19054 19.0
I tried a GroupBy but I didn't get it to work:
DfMean = pd.DataFrame(DfGby.rolling(center=False,window=3)['E'].mean())
I think you can use groupby with rolling (need at least pandas 0.18.1):
s = df.groupby('D').rolling(3)['E'].mean()
print (s)
D
14937 0 NaN
2 NaN
4 34.916667
6 33.833333
8 29.666667
10 11.250000
19054 1 NaN
3 NaN
5 13.250000
7 23.333333
9 23.666667
11 19.000000
Name: E, dtype: float64
Then set_index by D with swaplevel, so the index order matches for assigning the output:
df = df.set_index('D', append=True).swaplevel(0,1)
df['E'] = s
Finally, reset_index and reorder the columns:
df = df.reset_index(level=0).sort_values(['D','C'])
df = df[['A','B','C','D','E']]
print (df)
A B C D E
0 23 2015 1 14937 NaN
2 23 2015 2 14937 NaN
4 23 2015 3 14937 34.916667
6 23 2015 4 14937 33.833333
8 23 2015 5 14937 29.666667
10 23 2015 6 14937 11.250000
1 23 2015 1 19054 NaN
3 23 2015 2 19054 NaN
5 23 2015 3 19054 13.250000
7 23 2015 4 19054 23.333333
9 23 2015 5 19054 23.666667
11 23 2015 6 19054 19.000000
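On recent pandas versions, a shorter sketch with transform avoids the index juggling, since transform returns the result in the original row order:
df['E'] = df.groupby('D')['E'].transform(lambda x: x.rolling(3).mean())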
I have a data frame with year, month, day, and prec as column headers. How can I count the longest run of consecutive days with the value 0 in the 'prec' column for each month?
datasub = data[data['prec'] ==0.0]
datasub.groupby(['year','month'])['prec'].count()
This code did not give me the result I expected.
The data looks like this:
Out[70]:
year month day prec
0 1981 1 1 1.5
1 1981 1 2 0.0
2 1981 1 3 0.0
3 1981 1 4 0.4
4 1981 1 5 0.0
5 1981 1 6 1.0
6 1981 1 7 1.9
7 1981 1 8 0.6
8 1981 1 9 3.7
9 1981 1 10 0.0
10 1981 1 11 0.0
11 1981 1 12 0.0
12 1981 1 13 0.0
13 1981 1 14 12.2
14 1981 1 15 1.7
15 1981 1 16 0.6
16 1981 1 17 0.9
17 1981 1 18 0.6
18 1981 1 19 0.4
19 1981 1 20 0.2
20 1981 1 21 1.4
21 1981 1 22 3.2
22 1981 1 23 0.0
23 1981 1 24 0.2
24 1981 1 25 1.2
25 1981 1 26 0.0
26 1981 1 27 0.0
27 1981 1 28 0.0
28 1981 1 29 0.0
29 1981 1 30 0.2
... ... ... ... ...
3987 1991 12 2 0.0
3988 1991 12 3 0.0
3989 1991 12 4 0.0
3990 1991 12 5 0.5
3991 1991 12 6 0.4
3992 1991 12 7 1.2
3993 1991 12 8 0.0
3994 1991 12 9 0.0
3995 1991 12 10 0.0
3996 1991 12 11 0.0
3997 1991 12 12 0.0
import pandas as pd
import numpy as np
# simulate some artificial data
# ============================================
np.random.seed(0)
df = pd.DataFrame(np.random.randn(4000), columns=['prec'], index=pd.date_range('1981-01-01', periods=4000, freq='D'))
df['prec'] = np.where(df['prec'] > 0, df['prec'], 0.0)
df['year'] = df.index.year
df['month'] = df.index.month
df['day'] = df.index.day
df
prec year month day
1981-01-01 1.7641 1981 1 1
1981-01-02 0.4002 1981 1 2
1981-01-03 0.9787 1981 1 3
1981-01-04 2.2409 1981 1 4
1981-01-05 1.8676 1981 1 5
1981-01-06 0.0000 1981 1 6
1981-01-07 0.9501 1981 1 7
1981-01-08 0.0000 1981 1 8
1981-01-09 0.0000 1981 1 9
1981-01-10 0.4106 1981 1 10
1981-01-11 0.1440 1981 1 11
1981-01-12 1.4543 1981 1 12
1981-01-13 0.7610 1981 1 13
1981-01-14 0.1217 1981 1 14
1981-01-15 0.4439 1981 1 15
... ... ... ... ...
1991-11-30 0.9764 1991 11 30
1991-12-01 0.1772 1991 12 1
1991-12-02 0.0000 1991 12 2
1991-12-03 0.1067 1991 12 3
1991-12-04 0.0000 1991 12 4
1991-12-05 0.0000 1991 12 5
1991-12-06 0.5765 1991 12 6
1991-12-07 0.0653 1991 12 7
1991-12-08 0.0000 1991 12 8
1991-12-09 0.3949 1991 12 9
1991-12-10 0.0000 1991 12 10
1991-12-11 1.7796 1991 12 11
1991-12-12 0.0000 1991 12 12
1991-12-13 1.5771 1991 12 13
1991-12-14 0.0000 1991 12 14
[4000 rows x 4 columns]
# processing
# ===========================================
def func(group):
    return (group.prec != 0).astype(int).cumsum().value_counts().values[0] - 1

df.groupby(['year', 'month']).apply(func)
year month
1981 1 2
2 5
3 4
4 2
5 3
6 4
7 3
8 5
9 5
10 2
11 6
12 6
1982 1 5
2 3
3 4
..
1990 10 9
11 4
12 5
1991 1 6
2 4
3 4
4 4
5 4
6 9
7 3
8 5
9 6
10 6
11 3
12 2
dtype: int64
The idea here is to use an impulse for non-zero values and then create a step function.
# take a look at a sample group
# ===========================================
group = df.groupby(['year', 'month']).get_group((1981,1))
group
# create a step function
group['step_func'] = (group.prec != 0).astype(int).cumsum()
prec year month day step_func
1981-01-01 1.7641 1981 1 1 1
1981-01-02 0.4002 1981 1 2 2
1981-01-03 0.9787 1981 1 3 3
1981-01-04 2.2409 1981 1 4 4
1981-01-05 1.8676 1981 1 5 5
1981-01-06 0.0000 1981 1 6 5
1981-01-07 0.9501 1981 1 7 6
1981-01-08 0.0000 1981 1 8 6
1981-01-09 0.0000 1981 1 9 6
1981-01-10 0.4106 1981 1 10 7
1981-01-11 0.1440 1981 1 11 8
1981-01-12 1.4543 1981 1 12 9
1981-01-13 0.7610 1981 1 13 10
1981-01-14 0.1217 1981 1 14 11
1981-01-15 0.4439 1981 1 15 12
1981-01-16 0.3337 1981 1 16 13
1981-01-17 1.4941 1981 1 17 14
1981-01-18 0.0000 1981 1 18 14
1981-01-19 0.3131 1981 1 19 15
1981-01-20 0.0000 1981 1 20 15
1981-01-21 0.0000 1981 1 21 15
1981-01-22 0.6536 1981 1 22 16
1981-01-23 0.8644 1981 1 23 17
1981-01-24 0.0000 1981 1 24 17
1981-01-25 2.2698 1981 1 25 18
1981-01-26 0.0000 1981 1 26 18
1981-01-27 0.0458 1981 1 27 19
1981-01-28 0.0000 1981 1 28 19
1981-01-29 1.5328 1981 1 29 20
1981-01-30 1.4694 1981 1 30 21
1981-01-31 0.1549 1981 1 31 22
# value_counts, pick the max value and subtract 1
group['step_func'].value_counts().values[0] - 1
2
Update:
Using .values[0] causes confusion with an integer index; replace it with .iloc[0]. Filtering to the zero days first also removes the need to subtract 1:
# processing
# ===========================================
def func(group):
    return (group.prec != 0).astype(int).cumsum()[group.prec == 0].value_counts().iloc[0]

df.groupby(['year', 'month']).apply(func)
return (group.prec != 0).astype(int).cumsum()[group.prec == 0].value_counts().iloc[0]