I want to add missing dates for a specific date range, but keep all columns. I found many posts using asfreq(), resample(), reindex(), but they seemed to be for Series and I couldn't get them to work for my DataFrame.
Given a sample dataframe:
data = [{'id' : '123', 'product' : 'apple', 'color' : 'red', 'qty' : 10, 'week' : '2019-3-7'}, {'id' : '123', 'product' : 'apple', 'color' : 'blue', 'qty' : 20, 'week' : '2019-3-21'}, {'id' : '123', 'product' : 'orange', 'color' : 'orange', 'qty' : 8, 'week' : '2019-3-21'}]
df = pd.DataFrame(data)
color id product qty week
0 red 123 apple 10 2019-3-7
1 blue 123 apple 20 2019-3-21
2 orange 123 orange 8 2019-3-21
My goal is to return the result below: every missing week is added with qty set to 0, while the other columns keep their values. Of course, I have many other ids. I would like to be able to specify the start/end dates to fill; this example uses 3/7 to 3/21.
color id product qty week
0 red 123 apple 10 2019-3-7
1 blue 123 apple 20 2019-3-21
2 orange 123 orange 8 2019-3-21
3 red 123 apple 0 2019-3-14
4 red 123 apple 0 2019-3-21
5 blue 123 apple 0 2019-3-7
6 blue 123 apple 0 2019-3-14
7 orange 123 orange 0 2019-3-7
8 orange 123 orange 0 2019-3-14
How can I keep the remainder of my DataFrame intact?
In your case, you just need unstack and stack plus reindex:
df.week=pd.to_datetime(df.week)
s=pd.date_range(df.week.min(),df.week.max(),freq='7 D')
df=df.set_index(['color','id','product','week']).\
qty.unstack().reindex(columns=s,fill_value=0).stack().reset_index()
df
color id product level_3 0
0 blue 123 apple 2019-03-14 0.0
1 blue 123 apple 2019-03-21 20.0
2 orange 123 orange 2019-03-14 0.0
3 orange 123 orange 2019-03-21 8.0
4 red 123 apple 2019-03-07 10.0
5 red 123 apple 2019-03-14 0.0
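Note that the output above has only six rows: a combination that is missing on a date already present in the data (for example blue on 2019-03-07) stays NaN after unstack and is then dropped by stack. A minimal sketch that fills those cells as well, assuming the sample data from the question; passing fill_value=0 to unstack and restoring the column names with rename_axis and reset_index(name=...) are additions here, not part of the answer above:

```python
import pandas as pd

data = [
    {'id': '123', 'product': 'apple', 'color': 'red', 'qty': 10, 'week': '2019-3-7'},
    {'id': '123', 'product': 'apple', 'color': 'blue', 'qty': 20, 'week': '2019-3-21'},
    {'id': '123', 'product': 'orange', 'color': 'orange', 'qty': 8, 'week': '2019-3-21'},
]
df = pd.DataFrame(data)
df['week'] = pd.to_datetime(df['week'])

# Weekly dates to fill; replace min()/max() with explicit start/end dates if needed
weeks = pd.date_range(df['week'].min(), df['week'].max(), freq='7D')

filled = (
    df.set_index(['color', 'id', 'product', 'week'])['qty']
      .unstack(fill_value=0)                 # fill gaps on dates that already exist
      .reindex(columns=weeks, fill_value=0)  # add dates that are missing entirely
      .rename_axis(columns='week')           # keep the column level name for stack
      .stack()
      .reset_index(name='qty')
)
print(filled)
```

This should give one row per (color, id, product) combination and week, nine rows for the sample data.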
One option is to use the complete function from pyjanitor to expose the implicitly missing rows; afterwards you can fill with fillna:
# pip install pyjanitor
import pandas as pd
import janitor
df.week = pd.to_datetime(df.week)
# create new dates, which will be used to expand the dataframe
new_dates = {"week": pd.date_range(df.week.min(), df.week.max(), freq="7D")}
# use the complete function
# note how color, id and product are wrapped together
# this ensures only combinations already present in the dataframe are exposed
# if you want all combinations, drop the tuple
(df
.complete(("color", "id", "product"), new_dates, sort = False)
.fillna({'qty': 0}, downcast='infer')
)
id product color qty week
0 123 apple red 10 2019-03-07
1 123 apple blue 20 2019-03-21
2 123 orange orange 8 2019-03-21
3 123 apple red 0 2019-03-14
4 123 apple red 0 2019-03-21
5 123 apple blue 0 2019-03-07
6 123 apple blue 0 2019-03-14
7 123 orange orange 0 2019-03-07
8 123 orange orange 0 2019-03-14
This is part 2 of the original question.
Is there a way to count, per location, the IDs that have both Apple and Strawberry, the IDs that have only Apple, and the IDs that have only Strawberry?
df:
ID Fruit Location
0 ABC Apple NY <-ABC has Apple and Strawberry
1 ABC Strawberry NY <-ABC has Apple and Strawberry
2 EFG Apple LA <-EFG has Apple only
3 XYZ Apple HOUSTON <-XYZ has Apple and Strawberry
4 XYZ Strawberry HOUSTON <-XYZ has Apple and Strawberry
5 CDF Strawberry BOSTON <-CDF has Strawberry
6 AAA Apple CHICAGO <-AAA has Apple only
Desired output:
IDs that has Apple and Strawberry:
NY 1
HOUSTON 1
IDs that has Apple only:
LA 1
CHICAGO 1
IDs that has Strawberry only:
BOSTON 1
The previous code was:
v = ['Apple','Strawberry']
out = df.groupby('ID')['Fruit'].apply(lambda x: set(x) == set(v)).sum()
print(out)
>>> 2
I tried the following, but it did not work and gave me the same result:
v = ['Apple','Strawberry']
out = df.groupby('ID', 'LOCATION')['Fruit'].apply(lambda x: set(x) == set(v)).sum()
print(out)
>>> 2
Thanks!
Inefficient solution using groupby and agg:
x = df.groupby('ID').agg({ 'Fruit': lambda x: tuple(x), 'Location': 'first'})
y=x.groupby('Fruit')['Location'].value_counts()
y:
Fruit Location
(Apple,) CHICAGO 1
LA 1
(Apple, Strawberry) HOUSTON 1
NY 1
(Strawberry,) BOSTON 1
Name: Location, dtype: int64
for index in set(y.index.get_level_values(0)):
    if len(index)==2:
        print(f"IDs that has {index[0]} and {index[1]}:")
        print(y.loc[index].to_string())
    else:
        print(f"IDs that has {index[0]} only:")
        print(y.loc[index].to_string())
IDs that has Apple only:
Location
CHICAGO 1
LA 1
IDs that has Apple and Strawberry:
Location
HOUSTON 1
NY 1
IDs that has Strawberry only:
Location
BOSTON 1
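Because tuple(x) depends on row order, an ID whose rows list Strawberry before Apple would land in a separate group. A sketch that sidesteps this by grouping on both ID and Location and building an order-independent label; the frame is rebuilt from the question, and the column name `combo` is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['ABC', 'ABC', 'EFG', 'XYZ', 'XYZ', 'CDF', 'AAA'],
    'Fruit': ['Apple', 'Strawberry', 'Apple', 'Apple', 'Strawberry', 'Strawberry', 'Apple'],
    'Location': ['NY', 'NY', 'LA', 'HOUSTON', 'HOUSTON', 'BOSTON', 'CHICAGO'],
})

# One row per (ID, Location), labelled with the sorted set of its fruits
combos = (
    df.groupby(['ID', 'Location'])['Fruit']
      .agg(lambda s: ' and '.join(sorted(set(s))))
      .reset_index(name='combo')
)

# Number of IDs per fruit combination and location
counts = combos.groupby(['combo', 'Location']).size()
print(counts)
```

Selecting one combination, for example counts.loc['Apple'], then gives the per-location counts for IDs that have only Apple.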
DataFrame 1 - Price of Fruits by date (Index is a date)
fruits_price = {'Apple': [9,5,14],
'Orange': [10,12,10],
'Kiwi': [5,4,20],
'Watermelon': [4.4,5.4,6.4]}
df1 = pd.DataFrame(fruits_price,
columns = ['Apple','Orange','Kiwi','Watermelon'],
index=['2020-01-01','2020-01-02','2020-01-10'])
date Apple Oranges Kiwi Watermelon ... Fruit_100
2020-01-01 9 10 5 4.4
2002-01-02 5 12 4 5.4
...
2002-12-10 14 10 20 6.4
Dataframe 2 (Top fruits by Rank) (Index is a date)
top_fruits = {'Fruit_1': ['Apple','Apple','Apple'],
'Fruit_2': ['Kiwi','Orange','Kiwi'],
'Fruit_3': ['Orange','Watermelon','Watermelon'],
'Fruit_4': ['Watermelon','Kiwi','Orange']}
df2 = pd.DataFrame(top_fruits,
columns = ['Fruit_1','Fruit_2','Fruit_3','Fruit_4'],
index=['2020-01-01','2020-01-02','2020-01-10'])
date Fruit_1 Fruit_2 Fruit_3 Fruit_4 ... Fruit_100
2020-01-01 Apple Kiwi Oranges Watermelon Pineapple
2002-01-02 Apple Oranges Watermelon Kiwi Pineapple
...
2002-12-10 Apple Kiwi Watermelon Oranges Pineapple
I want DataFrame 3 (Price of the top fruit for the given date)
which actually tells me the price of the top fruit at the given date
date Price_1 Price_2 Price_3 Price_4 ..... Price_100
2020-01-01 9 5 10 4.4
2002-01-02 5 12 5.4 4
...
2002-12-10 14 20 6.4 10
I spent almost a whole night on this. I tried iterating over DataFrame 2 with an inner loop over DataFrame 1 and adding values to DataFrame 3, in 6-7 different ways using iterrows and iteritems and then storing the output directly via iloc in df3. None of those worked.
Just wondering whether there is an easier way to do this.
I will later multiply this with the sales of fruits, which are in the same dataframe format.
Just use apply with axis=1. This goes row by row; each row is a Series whose name is the date, and its values are replaced using the corresponding row in df1.
df2.apply(lambda x: x.replace(df1.to_dict('index')[x.name]), axis=1)
Build a dict from df1, then use replace on df2:
import pandas as pd
fruits_price = {'Apple': [9,5,14],
'Orange': [10,12,10],
'Kiwi': [5,4,20],
'Watermelon': [4.4,5.4,6.4]}
df1 = pd.DataFrame(fruits_price,
columns = ['Apple','Orange','Kiwi','Watermelon'],
index=['2020-01-01','2020-01-02','2020-01-10'])
top_fruits = {'Fruit_1': ['Apple','Apple','Apple'],
'Fruit_2': ['Kiwi','Orange','Kiwi'],
'Fruit_3': ['Orange','Watermelon','Watermelon'],
'Fruit_4': ['Watermelon','Kiwi','Orange']}
df2 = pd.DataFrame(top_fruits,
columns = ['Fruit_1','Fruit_2','Fruit_3','Fruit_4'],
index=['2020-01-01','2020-01-02','2020-01-10'])
result = df2.T.replace(df1.T.to_dict()).T
result.columns = [f"Price_{i}" for i in range(1, len(result.columns)+1)]
result
output:
Price_1 Price_2 Price_3 Price_4
2020-01-01 9.0 5.0 10.0 4.4
2020-01-02 5.0 12.0 5.4 4.0
2020-01-10 14.0 20.0 6.4 10.0
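With 100 fruit columns, a positional lookup may scale better than replace. A sketch of that idea under two assumptions of mine: every date in df2 also exists in df1, and every fruit name in df2 is a column of df1 (get_indexer returns -1 for unknown names, which would silently pick the last column, so real data may need a check):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {'Apple': [9, 5, 14], 'Orange': [10, 12, 10],
     'Kiwi': [5, 4, 20], 'Watermelon': [4.4, 5.4, 6.4]},
    index=['2020-01-01', '2020-01-02', '2020-01-10'])
df2 = pd.DataFrame(
    {'Fruit_1': ['Apple', 'Apple', 'Apple'],
     'Fruit_2': ['Kiwi', 'Orange', 'Kiwi'],
     'Fruit_3': ['Orange', 'Watermelon', 'Watermelon'],
     'Fruit_4': ['Watermelon', 'Kiwi', 'Orange']},
    index=['2020-01-01', '2020-01-02', '2020-01-10'])

# Align prices to the ranking's dates, then find each fruit's column position in df1
prices = df1.reindex(df2.index).to_numpy()
col_pos = df1.columns.get_indexer(df2.to_numpy().ravel()).reshape(df2.shape)
row_pos = np.arange(len(df2))[:, None]

# Pick prices[row, column-of-that-fruit] for every cell of df2
df3 = pd.DataFrame(
    prices[row_pos, col_pos],
    index=df2.index,
    columns=[f"Price_{i}" for i in range(1, df2.shape[1] + 1)])
print(df3)
```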
I have a DF as follows:
Date Bought | Fruit
2018-01 Apple
2018-02 Orange
2018-02 Orange
2018-02 Lemon
I wish to group the data by 'Date Bought' & 'Fruit' and count the occurrences.
Expected result:
Date Bought | Fruit | Count
2018-01 Apple 1
2018-02 Orange 2
2018-02 Lemon 1
What I get:
Date Bought | Fruit | Count
2018-01 Apple 1
2018-02 Orange 2
Lemon 1
Code used:
Initial attempt:
df.groupby(['Date Bought','Fruit'])['Fruit'].agg('count')
#2
df.groupby(['Date Bought','Fruit'])['Fruit'].agg('count').reset_index()
ERROR: Cannot insert Fruit, already exists
#3
df.groupby(['Date Bought','Fruit'])['Fruit'].agg('count').reset_index(inplace=True)
ERROR: Type Error: Cannot reset_index inplace on a Series to create a DataFrame
Documentation shows that the groupby function returns a 'groupby object' not a standard DF. How can I group the data as mentioned above and retain the DF format?
The problem here is that resetting the index would give you two columns with the same name. Because you are working with a Series, you can set the name parameter in Series.reset_index:
df1 = (df.groupby(['Date Bought','Fruit'], sort=False)['Fruit']
.agg('count')
.reset_index(name='Count'))
print (df1)
Date Bought Fruit Count
0 2018-01 Apple 1
1 2018-02 Orange 2
2 2018-02 Lemon 1
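A possible alternative, sketched on the sample data from the question: GroupBy.size counts the rows per group without selecting a column first, so there is no name clash to begin with. Note that size counts every row, while count skips missing values.

```python
import pandas as pd

df = pd.DataFrame({
    'Date Bought': ['2018-01', '2018-02', '2018-02', '2018-02'],
    'Fruit': ['Apple', 'Orange', 'Orange', 'Lemon'],
})

# size() returns a Series indexed by the group keys; name it while resetting the index
counts = (
    df.groupby(['Date Bought', 'Fruit'], sort=False)
      .size()
      .reset_index(name='Count')
)
print(counts)
```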
I have a dataframe that records the number and type of fruits owned by various people. I'd like to add a column that indicates the top fruit(s) for each person. If a person has 2+ top-ranking fruits (aka, a tie), I want a list (or tuple) of them all.
Input
For example, let's say my input is this dataframe:
# Create all the fruit data
data = [{'fruit0':'strawberry','fruit0_count':23,'fruit1':'orange','fruit1_count':4,'fruit2':'grape','fruit2_count':27},
{'fruit0':'apple','fruit0_count':45,'fruit1':'mango','fruit1_count':45,'fruit2':'orange','fruit2_count':12},
{'fruit0':'blueberry','fruit0_count':30,'fruit1':'grapefruit','fruit1_count':32,'fruit2':'cherry','fruit2_count':94},
{'fruit0':'pineapple','fruit0_count':4,'fruit1':'grape','fruit1_count':4,'fruit2':'lemon','fruit2_count':67}]
# Add people's names as an index
df = pd.DataFrame(data, index=['Shawn', 'Monica','Jamal','Tracy'])
# Print the dataframe
df
. . . which creates the input dataframe:
fruit0 fruit0_count fruit1 fruit1_count fruit2 fruit2_count
Shawn strawberry 23 orange 4 grape 27
Monica apple 45 mango 45 orange 12
Jamal blueberry 30 grapefruit 32 cherry 94
Tracy pineapple 4 grape 4 lemon 67
Target output
What I'd like to get is a new column that gives the name of the top fruit for each person. If the person has two (or more) fruits that tied for first, I'd like a list or a tuple of those fruits:
fruit0 fruit0_count fruit1 fruit1_count fruit2 fruit2_count top_fruit
Shawn strawberry 23 orange 4 grape 27 grape
Monica apple 45 mango 45 orange 12 (apple,mango)
Jamal blueberry 30 grapefruit 32 cherry 94 cherry
Tracy pineapple 4 grape 4 lemon 67 lemon
My attempt so far
The closest I've gotten is based on https://stackoverflow.com/a/38955365/6480859.
Problems:
If there is a tie for top fruit, it only captures one top fruit (Monica's top fruit is only apple.)
It's really complicated. Not really a problem, but if there is a more straightforward path, I'd like to learn it.
# List the columns that contain count numbers
cols = ['fruit0_count', 'fruit1_count', 'fruit2_count']
# Make a new dataframe with just those columns.
only_counts_df=pd.DataFrame()
only_counts_df[cols]=df[cols].copy()
# Indicate how many results you want. Note: If you increase
# this from 1, it gives you the #2, #3, etc. ranking -- it
# doesn't represent tied results.
nlargest = 1
# The next two lines are suggested from
# https://stackoverflow.com/a/38955365/6480859. I don't totally
# follow along . . .
order = np.argsort(-only_counts_df.values, axis=1)[:, :nlargest]
result = pd.DataFrame(only_counts_df.columns[order],
columns=['top{}'.format(i) for i in range(1, nlargest+1)],
index=only_counts_df.index)
# Join the results back to our original dataframe
result = df.join(result).copy()
# The dataframe now reports the name of the column that
# contains the top fruit. Convert this to the fruit name.
def id_fruit(row):
    if row['top1'] == 'fruit0_count':
        return row['fruit0']
    elif row['top1'] == 'fruit1_count':
        return row['fruit1']
    elif row['top1'] == 'fruit2_count':
        return row['fruit2']
    else:
        return "Failed"
result['top_fruit'] = result.apply(id_fruit,axis=1)
result = result.drop(['top1'], axis=1).copy()
result
. . . which outputs:
fruit0 fruit0_count fruit1 fruit1_count fruit2 fruit2_count top_fruit
Shawn strawberry 23 orange 4 grape 27 grape
Monica apple 45 mango 45 orange 12 apple
Jamal blueberry 30 grapefruit 32 cherry 94 cherry
Tracy pineapple 4 grape 4 lemon 67 lemon
Monica's top fruit should be apple and mango.
Any tips are welcome, thanks!
The idea is to split the even-positioned (name) and odd-positioned (count) columns into df1 and df2, compare the counts with the row maximum, keep only the matching names with DataFrame.where, and finally collect the non-missing values in apply:
df1 = df.iloc[:, ::2]
df2 = df.iloc[:, 1::2]
mask = df2.eq(df2.max(axis=1), axis=0)
df['top'] = df1.where(mask.to_numpy()).apply(lambda x: x.dropna().tolist(), axis=1)
print (df)
fruit0 fruit0_count fruit1 fruit1_count fruit2 \
Shawn strawberry 23 orange 4 grape
Monica apple 45 mango 45 orange
Jamal blueberry 30 grapefruit 32 cherry
Tracy pineapple 4 grape 4 lemon
fruit2_count top
Shawn 27 [grape]
Monica 12 [apple, mango]
Jamal 94 [cherry]
Tracy 67 [lemon]
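The result above is always a list, whereas the question asks for a plain name when there is a single winner and a tuple on ties. A sketch of that last step, assuming the columns follow the fruitN / fruitN_count naming from the question; selecting them with filter(regex=...) and the helper name `collapse` are illustrative choices, not part of the answer above:

```python
import pandas as pd

data = [
    {'fruit0': 'strawberry', 'fruit0_count': 23, 'fruit1': 'orange', 'fruit1_count': 4, 'fruit2': 'grape', 'fruit2_count': 27},
    {'fruit0': 'apple', 'fruit0_count': 45, 'fruit1': 'mango', 'fruit1_count': 45, 'fruit2': 'orange', 'fruit2_count': 12},
    {'fruit0': 'blueberry', 'fruit0_count': 30, 'fruit1': 'grapefruit', 'fruit1_count': 32, 'fruit2': 'cherry', 'fruit2_count': 94},
    {'fruit0': 'pineapple', 'fruit0_count': 4, 'fruit1': 'grape', 'fruit1_count': 4, 'fruit2': 'lemon', 'fruit2_count': 67},
]
df = pd.DataFrame(data, index=['Shawn', 'Monica', 'Jamal', 'Tracy'])

# Select name and count columns by pattern instead of by position
names = df.filter(regex=r'^fruit\d+$')
counts = df.filter(regex=r'^fruit\d+_count$')

# True where a count equals the maximum of its own row
is_top = counts.eq(counts.max(axis=1), axis=0).to_numpy()

def collapse(row):
    """Return the single top fruit, or a tuple of fruits on a tie."""
    tops = row.dropna().tolist()
    return tops[0] if len(tops) == 1 else tuple(tops)

df['top_fruit'] = names.where(is_top).apply(collapse, axis=1)
print(df['top_fruit'])
```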
Here's what I've come up with:
maxes = df[[f"fruit{i}_count" for i in range(3)]].max(axis=1)
mask = df[[f"fruit{i}_count" for i in range(3)]].isin(maxes)
df_masked = df[[f"fruit{i}" for i in range(3)]][
mask.rename(lambda x: x.replace("_count", ""), axis=1)
]
df["top_fruit"] = df_masked.apply(lambda x: x.dropna().tolist(), axis=1)
This will return
fruit0 fruit0_count ... fruit2_count top_fruit
Shawn strawberry 23 ... 27 [grape]
Monica apple 45 ... 12 [apple, mango]
Jamal blueberry 30 ... 94 [cherry]
Tracy pineapple 4 ... 67 [lemon]
I have a dataframe which contains a list of tuples in one of its columns. I need to split the list of tuples into corresponding columns. My dataframe df looks like this:
A B
[('Apple',50),('Orange',30),('banana',10)] Winter
[('Orange',69),('WaterMelon',50)] Summer
The expected output should be:
Fruit rate B
Apple 50 winter
Orange 30 winter
banana 10 winter
Orange 69 summer
WaterMelon 50 summer
You can use DataFrame constructor with numpy.repeat and numpy.concatenate:
df1 = pd.DataFrame(np.concatenate(df.A), columns=['Fruit','rate']).reset_index(drop=True)
df1['B'] = np.repeat(df.B.values, df['A'].str.len())
print (df1)
Fruit rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
Another solution with chain.from_iterable:
from itertools import chain
df1 = (pd.DataFrame(list(chain.from_iterable(df.A)), columns=['Fruit','rate'])
.reset_index(drop=True))
df1['B'] = np.repeat(df.B.values, df['A'].str.len())
print (df1)
Fruit rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
This should work:
fruits = []
rates = []
seasons = []
def create_lists(row):
    tuples = row['A']
    season = row['B']
    for t in tuples:
        fruits.append(t[0])
        rates.append(t[1])
        seasons.append(season)
df.apply(create_lists, axis=1)
new_df = pd.DataFrame({"Fruit" :fruits, "Rate": rates, "B": seasons})[["Fruit", "Rate", "B"]]
output:
Fruit Rate B
0 Apple 50 winter
1 Orange 30 winter
2 banana 10 winter
3 Orange 69 summer
4 WaterMelon 50 summer
You can do this in a chained operation:
(
df.apply(lambda x: [[k,v,x.B] for k,v in x.A],axis=1)
.apply(pd.Series)
.stack()
.apply(pd.Series)
.reset_index(drop=True)
.rename(columns={0:'Fruit',1:'rate',2:'B'})
)
Out[1036]:
Fruit rate B
0 Apple 50 Winter
1 Orange 30 Winter
2 banana 10 Winter
3 Orange 69 Summer
4 WaterMelon 50 Summer
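Another possible route is DataFrame.explode, sketched below on the sample data from the question. It assumes pandas 1.1 or later for the ignore_index parameter. Building the new frame from a list of tuples keeps rate as integers, whereas NumPy coerces mixed string/integer tuples to a common string dtype.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [[('Apple', 50), ('Orange', 30), ('banana', 10)],
          [('Orange', 69), ('WaterMelon', 50)]],
    'B': ['Winter', 'Summer'],
})

# One row per tuple; B is repeated alongside each tuple
exploded = df.explode('A', ignore_index=True)

# Split each (fruit, rate) tuple into two columns, then attach B
out = pd.DataFrame(exploded['A'].tolist(), columns=['Fruit', 'rate'])
out['B'] = exploded['B']
print(out)
```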