I have a data frame that consists of a number of different item numbers from different locations. The problem is that I am missing dates for all the different combos. For example, for item number 1, I want all the dates that are missing for all the locations. What is the best way to add dates with quantity 0 for every single item at every single location for days that don't exist in the data set? Please and thank you!
I tried the following
df.set_index(data["DATE", "ITEMNUMBER"], inplace=True)
df = data.resample('D').sum().fillna(0)
This gives me the following error: ValueError: Length mismatch: Expected 1 rows, received array of length 749629
So I tried the following -
df.set_index(data["DATE", "ITEMNUMBER"], inplace=True)
df = data.resample('D').sum().fillna(0)
That results in a KeyError, with the traceback ending at the line `if tolerance is not None:`.
To get all combinations of DATE, ITEMNUMBER, and LOCATION, you can try:
import itertools
df2 = df.set_index(["DATE", "ITEMNUMBER", "LOCATION"])
df2 = df2.reindex(list(itertools.product(df['DATE'].unique(),
                                         df['ITEMNUMBER'].unique(),
                                         df['LOCATION'].unique()))
                  ).fillna(0).reset_index()
df2
example input:
DATE ITEMNUMBER LOCATION QUANTITY
0 2021-07-28 1 A 0
1 2021-07-28 2 B 1
2 2021-07-28 1 B 2
3 2021-07-29 1 A 3
4 2021-07-30 2 A 4
output:
DATE ITEMNUMBER LOCATION QUANTITY
0 2021-07-28 1 A 0.0
1 2021-07-28 1 B 2.0
2 2021-07-28 2 A 0.0
3 2021-07-28 2 B 1.0
4 2021-07-29 1 A 3.0
5 2021-07-29 1 B 0.0
6 2021-07-29 2 A 0.0
7 2021-07-29 2 B 0.0
8 2021-07-30 1 A 0.0
9 2021-07-30 1 B 0.0
10 2021-07-30 2 A 4.0
11 2021-07-30 2 B 0.0
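For a fully self-contained version of this approach, here is a sketch using pd.MultiIndex.from_product (equivalent to passing the itertools.product tuples), built on the example input above:

```python
import pandas as pd

df = pd.DataFrame({
    'DATE': ['2021-07-28', '2021-07-28', '2021-07-28', '2021-07-29', '2021-07-30'],
    'ITEMNUMBER': [1, 2, 1, 1, 2],
    'LOCATION': ['A', 'B', 'B', 'A', 'A'],
    'QUANTITY': [0, 1, 2, 3, 4],
})

# Full cartesian product of the three key columns; reindexing onto it
# creates the missing combinations, which fillna(0) then zeroes out.
full = pd.MultiIndex.from_product(
    [sorted(df['DATE'].unique()),
     sorted(df['ITEMNUMBER'].unique()),
     sorted(df['LOCATION'].unique())],
    names=['DATE', 'ITEMNUMBER', 'LOCATION'])

df2 = (df.set_index(['DATE', 'ITEMNUMBER', 'LOCATION'])
         .reindex(full)
         .fillna(0)
         .reset_index())
```

With 3 dates, 2 items, and 2 locations this yields the 12-row frame shown above.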
Using a toy data frame:
>>> df = pd.DataFrame([{'date': '2014-07-14', 'id': 1, 'q': 1}, {'date': '2014-07-15', 'id': 1, 'q': 1}, {'date': '2014-07-17', 'id': 1, 'q': 1}, {'date': '2014-07-18', 'id': 1, 'q': 2}, {'date': '2014-07-14', 'id': 5, 'q': 2}])
>>> df
date id q
0 2014-07-14 1 1
1 2014-07-15 1 1
2 2014-07-17 1 1
3 2014-07-18 1 2
4 2014-07-14 5 2
I convert the dates to datetimes, then within each ID, reindex between the index minimum and maximum, creating empty rows. I then fill the quantity column q with 0 where it is NaN and forward fill the remaining nulls.
>>> df.assign(date=lambda df: pd.to_datetime(df['date'])) \
...     .set_index('date').groupby('id') \
...     .apply(lambda df: df.reindex(pd.date_range(df.index.min(), df.index.max(), freq='D'))) \
...     .assign(q=lambda df: df['q'].fillna(0)) \
...     .groupby(level=0).ffill()
id q
id
1 2014-07-14 1.0 1.0
2014-07-15 1.0 1.0
2014-07-16 1.0 0.0
2014-07-17 1.0 1.0
2014-07-18 1.0 2.0
5 2014-07-14 5.0 2.0
I'm not sure how you want to deal with the location column; my answer simplifies things by removing it entirely. If you don't know either, do not ffill at the end. Instead, group by and assign an ffill of the id column only back to id, leaving the location as NaN.
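A sketch of that variant on the same toy frame (which has no location column; any extra column would simply be left as NaN by this approach):

```python
import pandas as pd

df = pd.DataFrame([
    {'date': '2014-07-14', 'id': 1, 'q': 1},
    {'date': '2014-07-15', 'id': 1, 'q': 1},
    {'date': '2014-07-17', 'id': 1, 'q': 1},
    {'date': '2014-07-18', 'id': 1, 'q': 2},
    {'date': '2014-07-14', 'id': 5, 'q': 2},
])

# Reindex each id's rows onto a daily range, creating gap rows.
res = (df.assign(date=pd.to_datetime(df['date']))
         .set_index('date')
         .groupby('id', group_keys=True)[['id', 'q']]
         .apply(lambda g: g.reindex(
             pd.date_range(g.index.min(), g.index.max(), freq='D'))))
res['q'] = res['q'].fillna(0)
# Forward fill only the id column within each group; other columns
# (a location column, say) would stay NaN.
res['id'] = res.groupby(level=0)['id'].ffill()
```

The inserted 2014-07-16 row for id 1 ends up with q = 0 and id = 1, while anything not explicitly filled remains NaN.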
I have a dataframe that looks like this
d = {'date': ['1999-01-01', '1999-01-02', '1999-01-03', '1999-01-04', '1999-01-05', '1999-01-06'], 'ID': [1,1,1,1,1,1], 'Value':[1,2,3,np.NaN,5,6]}
df = pd.DataFrame(data = d)
date ID Value
0 1999-01-01 1 1
1 1999-01-02 1 2
2 1999-01-03 1 3
3 1999-01-04 1 NaN
4 1999-01-05 1 5
5 1999-01-06 1 6
I would like to fill in NaNs using a rolling mean (e.g. window 2), and extend that to a df with multiple IDs and dates. I tried something like the following, but it takes a very long time and fails with the error "cannot join with no overlapping index names":
df.groupby(['date','ID']).fillna(df.rolling(2, min_periods=1).mean().shift())
or
df.groupby(['date','ID']).fillna(df.groupby(['date','ID']).rolling(2, min_periods=1).mean().shift())
IIUC, here is one way to do it (if you add your expected output, that will help validate this solution):
df2=df.fillna(0).groupby('ID')['Value'].rolling(2).mean().reset_index()
df.update(df2, overwrite=False)
df
date ID Value
0 1999-01-01 1 1.0
1 1999-01-02 1 2.0
2 1999-01-03 1 3.0
3 1999-01-04 1 1.5
4 1999-01-05 1 5.0
5 1999-01-06 1 6.0
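Note that filling with 0 before rolling pulls the mean down when real values are larger. If the intent is "mean of the previous two observed values per ID", one alternative interpretation (a sketch on the same data) uses a shifted rolling mean via transform:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': ['1999-01-01', '1999-01-02', '1999-01-03',
             '1999-01-04', '1999-01-05', '1999-01-06'],
    'ID': [1, 1, 1, 1, 1, 1],
    'Value': [1, 2, 3, np.nan, 5, 6],
})

# Per ID: mean of the (up to) two prior observed values. shift() ensures
# a row never uses its own value; only the NaNs are filled from it.
prior_mean = df.groupby('ID')['Value'].transform(
    lambda v: v.rolling(2, min_periods=1).mean().shift())
df['Value'] = df['Value'].fillna(prior_mean)
```

Here the NaN on 1999-01-04 becomes 2.5, the mean of the two preceding values 2 and 3, rather than 1.5.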
I am working with data like the following. The dataframe is sorted by the date:
category value Date
0 1 24/5/2019
1 NaN 24/5/2019
1 1 26/5/2019
2 2 1/6/2019
1 2 23/7/2019
2 NaN 18/8/2019
2 3 20/8/2019
7 3 1/9/2019
1 NaN 12/9/2019
2 NaN 13/9/2019
I would like to replace the "NaN" values with the previous mean for that specific category.
What is the best way to do this in pandas?
Some approaches I considered:
1) This little riff:
df['mean'] = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
source
This gets me the correct means, but in another column, and it does not replace the NaNs.
2) This riff replaces the NaNs with the average of the columns:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
Source 2
Both of these do not exactly give what I want. If someone could guide me on this it would be much appreciated!
You can replace value with a new Series built from shift + expanding + mean; the first value of group 1 is not replaced, because no previous values exist:
df['Date'] = pd.to_datetime(df['Date'])
s = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
df['value'] = df['value'].fillna(s)
print (df)
category value Date
0 0 1.0 2019-05-24
1 1 NaN 2019-05-24
2 1 1.0 2019-05-26
3 2 2.0 2019-01-06
4 1 2.0 2019-07-23
5 2 2.0 2019-08-18
6 2 3.0 2019-08-20
7 7 3.0 2019-01-09
8 1 1.5 2019-12-09
9 2 2.5 2019-09-13
You can use pandas.Series.fillna to replace NaN values:
df['value']=df['value'].fillna(df.groupby('category')['value'].transform(lambda x: x.shift().expanding().mean()))
print(df)
category value Date
0 0 1.0 24/5/2019
1 1 NaN 24/5/2019
2 1 1.0 26/5/2019
3 2 2.0 1/6/2019
4 1 2.0 23/7/2019
5 2 2.0 18/8/2019
6 2 3.0 20/8/2019
7 7 3.0 1/9/2019
8 1 1.5 12/9/2019
9 2 2.5 13/9/2019
Consider the following dataset:
After running the code:
convert_dummy1 = convert_dummy.pivot(index='Product_Code', columns='Month', values='Sales').reset_index()
The data is in the right form, but my index column is named 'Month', and I cannot seem to remove this at all. I have tried code such as the below, but it does nothing.
del convert_dummy1.index.name
I can save the dataset to a csv, delete the ID column, and then read the csv - but there must be a more efficient way.
Dataset after reset_index():
convert_dummy1
Month Product_Code 0 1 2 3 4
0 10133.9 0 0 0 0 0
1 10146.9 120 80 60 0 100
convert_dummy1.index = pd.RangeIndex(len(convert_dummy1.index))
del convert_dummy1.columns.name
convert_dummy1
Product_Code 0 1 2 3 4
0 10133.9 0 0 0 0 0
1 10146.9 120 80 60 0 100
Since you pivot with columns="Month", each column in the output corresponds to a month. If you reset the index after the pivot, you can check the column names with convert_dummy1.columns.values, which in your case should return:
array(['Product_Code', 1, 2, 3, 4, 5], dtype=object)
while convert_dummy1.columns.names should return:
FrozenList(['Month'])
So to rename Month, use the rename_axis function:
convert_dummy1.rename_axis('index', axis=1)
Output:
index Product_Code 1 2 3 4 5
0 10133 NaN NaN NaN NaN 0.0
1 10234 NaN 0.0 NaN NaN NaN
2 10245 0.0 NaN NaN NaN NaN
3 10345 NaN NaN NaN 0.0 NaN
4 10987 NaN NaN 1.0 NaN NaN
If you wish to reproduce it, this is my code:
df1=pd.DataFrame({'Product_Code':[10133,10245,10234,10987,10345], 'Month': [1,2,3,4,5], 'Sales': [0,0,0,1,0]})
df2=df1.pivot_table(index='Product_Code', columns='Month', values='Sales').reset_index().rename_axis('index',axis=1)
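If the goal is to remove the Month label rather than rename it, assigning None to columns.name also works; a minimal sketch using the same reproduction data:

```python
import pandas as pd

df1 = pd.DataFrame({'Product_Code': [10133, 10245, 10234, 10987, 10345],
                    'Month': [1, 2, 3, 4, 5],
                    'Sales': [0, 0, 0, 1, 0]})

df2 = df1.pivot_table(index='Product_Code', columns='Month',
                      values='Sales').reset_index()
df2.columns.name = None  # drop the leftover 'Month' label entirely
```

After this, printing df2 shows no stray label above the index column.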
I have a dictionary of Pandas Series objects that I want to turn into a Dataframe. The key for each series should be the column heading. The individual series overlap but each label is unique.
I thought I should be able to just do
df = pd.DataFrame(data)
But I keep getting the error InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
I get the same error if I try to turn each series into a frame, and use pd.concat(data, axis=1).
Which doesn't make sense if you take the column label into account. What am I doing wrong, and how do I fix it?
I believe you need reset_index with parameter drop=True for each Series in a dict comprehension, because there are duplicates in the index:
s = pd.Series([1,4,5,2,0], index=[1,2,2,3,5])
s1 = pd.Series([5,7,8,1],index=[1,2,3,4])
data = {'a':s, 'b': s1}
print (s.reset_index(drop=True))
0 1
1 4
2 5
3 2
4 0
dtype: int64
df = pd.concat({k:v.reset_index(drop=True) for k,v in data.items()}, axis=1)
print (df)
a b
0 1 5.0
1 4 7.0
2 5 8.0
3 2 1.0
4 0 NaN
If you need to drop the rows with a duplicated index, use boolean indexing with duplicated:
print (s[~s.index.duplicated()])
1 1
2 4
3 2
5 0
dtype: int64
df = pd.concat({k:v[~v.index.duplicated()] for k,v in data.items()}, axis=1)
print (df)
a b
1 1.0 5.0
2 4.0 7.0
3 2.0 8.0
4 NaN 1.0
5 0.0 NaN
Another solution:
print (s.groupby(level=0).mean())
1 1.0
2 4.5
3 2.0
5 0.0
dtype: float64
df = pd.concat({k:v.groupby(level=0).mean() for k,v in data.items()}, axis=1)
print (df)
a b
1 1.0 5.0
2 4.5 7.0
3 2.0 8.0
4 NaN 1.0
5 0.0 NaN
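For reference, the error in the question can be reproduced directly from the duplicate labels (a minimal sketch; the exact exception type may vary by pandas version):

```python
import pandas as pd

s = pd.Series([1, 4, 5, 2, 0], index=[1, 2, 2, 3, 5])   # label 2 appears twice
s1 = pd.Series([5, 7, 8, 1], index=[1, 2, 3, 4])

# The constructor must align s1 onto the union of both indexes; with a
# non-unique index that alignment is ambiguous, hence the error.
try:
    pd.DataFrame({'a': s, 'b': s1})
except Exception as exc:
    print(type(exc).__name__)
```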
Using 'bfill' or 'ffill' on a groupby element is trivial, but what if you need to fill the na with a specific value in a second column, based on a condition in a third column?
For example:
>>> df=pd.DataFrame({'date':['01/10/2017', '02/09/2017', '02/10/2016','01/10/2017', '01/11/2017', '02/10/2016'], 'a':[1,1,1,2,2,2], 'b':[4,np.nan,6, 5, np.nan, 7]})
>>> df
a b date
0 1 4.0 01/10/2017
1 1 NaN 02/09/2017
2 1 6.0 02/10/2016
3 2 5.0 01/10/2017
4 2 NaN 01/11/2017
5 2 7.0 02/10/2016
I need to group by column 'a', and fill the NaN with the column 'b' value where the date for that row is closest to the date in the NaN row.
So the output should look like:
a b date
0 1 4.0 01/10/2017
1 1 6.0 02/09/2017
2 1 6.0 02/10/2016
3 2 5.0 01/10/2017
4 2 5.0 01/11/2017
5 2 7.0 02/10/2016
Assume there is a closest_date() function that takes the NaN date and the list of other dates in that group, and returns the closest date.
I'm trying to find a clean solution that doesn't have to iterate through rows, ideally able to use apply() with lambdas. Any ideas?
This should work:
df['closest_date_by_a'] = df.groupby('a')['date'].apply(closest_date)
df['b'] = df.groupby(['a', 'closest_date_by_a'])['b'].ffill().bfill()
Given a function (closest_date()), you need to apply that function by group so it calculates the closest dates for rows within each group. Then you can group by both the main grouping column (a) and the closest date column (closest_date_by_a) and perform your filling.
Ensure that your date column actually contains dates, not strings.
df = pd.DataFrame(
{'date': ['01/10/2017', '02/09/2017', '02/10/2016','01/10/2017', '01/11/2017', '02/10/2016'],
'a':[1,1,1,2,2,2], 'b':[4,np.nan,6, 5, np.nan, 7]})
df.date = pd.to_datetime(df.date)
print(df)
a b date
0 1 4.0 2017-01-10
1 1 NaN 2017-02-09
2 1 6.0 2016-02-10
3 2 5.0 2017-01-10
4 2 NaN 2017-01-11
5 2 7.0 2016-02-10
Use reindex with method='nearest' after dropna():
def fill_with_nearest(df):
s = df.set_index('date').b
s = s.dropna().reindex(s.index, method='nearest')
s.index = df.index
return s
df.loc[df.b.isnull(), 'b'] = df.groupby('a').apply(fill_with_nearest).reset_index(0, drop=True)
print(df)
a b date
0 1 4.0 2017-01-10
1 1 4.0 2017-02-09
2 1 6.0 2016-02-10
3 2 5.0 2017-01-10
4 2 5.0 2017-01-11
5 2 7.0 2016-02-10
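Another possible route (a sketch, not from the answers above) is pandas.merge_asof with direction='nearest', which does the nearest-date matching per group without a custom function; both frames just need to be sorted on the date key:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['01/10/2017', '02/09/2017', '02/10/2016',
                            '01/10/2017', '01/11/2017', '02/10/2016']),
    'a': [1, 1, 1, 2, 2, 2],
    'b': [4, np.nan, 6, 5, np.nan, 7],
})

known = df.dropna(subset=['b']).sort_values('date')
missing = df[df['b'].isna()].sort_values('date')

# For each missing row, merge_asof picks the known row (within the same
# 'a' group) whose date is nearest, looking both before and after.
nearest = pd.merge_asof(missing.drop(columns='b'),
                        known[['date', 'a', 'b']],
                        on='date', by='a', direction='nearest')
df.loc[missing.index, 'b'] = nearest['b'].to_numpy()
```

This reproduces the same filled column as the reindex approach on this data.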