Filling missing values based on a specific column condition - python

I have a data frame like this:
Day          Type   From   to
01/09/2021   car    170    Nan
02/09/2021   car    140    Nan
03/09/2021   none   120    77
04/09/2021   car    15     45
05/09/2021   car    34     Nan
06/09/2021   car    36     84
07/09/2021   none   23     11
08/09/2021   car    36     Nan
The logic is:
For each row containing Type none:
- fill the preceding Nan rows in column 'to' with values from column 'From' (only from the beginning of the dataset until the first row with Type none)
- fill the following Nan rows in column 'to' with values from column 'to'
The values used to fill the missing entries need to be taken from the latest row containing Type none.
Desired output:
Day          Type   From   to
01/09/2021   car    170    120
02/09/2021   car    140    120
03/09/2021   none   120    77
04/09/2021   car    15     45
05/09/2021   car    34     77
06/09/2021   car    36     84
07/09/2021   none   23     11
08/09/2021   car    36     11
I tried using ffill and bfill, but I'm not sure how to apply the conditions.

Here, ind holds the indexes of the rows where 'Type' == 'none'. The dataframe is sliced up to the first element of ind into aaa. ind1 holds the indices of the leading rows with 'to' == 'Nan', and their values are set via loc.
The elements of ind_to are then fed into a list comprehension, and the desired values are set through the my_finc function.
import pandas as pd

df = pd.read_csv('df.csv', header=0)

ind = df[df['Type'] == 'none'].index           # rows where Type is 'none'
aaa = df[:ind[0]]                              # slice up to the first 'none' row
ind1 = aaa[aaa['to'] == 'Nan'].index           # leading rows with missing 'to'
df.loc[ind1, 'to'] = df.loc[ind[0], 'From']    # fill them from the first 'none' row's 'From'

ind_to = df[df['to'] == 'Nan'].index           # remaining rows with missing 'to'

def my_finc(x):
    bbb = df.loc[:x, 'Type']
    kkk = bbb[bbb == 'none'].index             # 'none' rows at or before x
    df.loc[x, 'to'] = df.loc[kkk[-1], 'to']    # fill from the latest one

[my_finc(i) for i in ind_to]
print(df)
Output
Day Type From to
0 01/09/2021 car 170 120
1 02/09/2021 car 140 120
2 03/09/2021 none 120 77
3 04/09/2021 car 15 45
4 05/09/2021 car 34 77
5 06/09/2021 car 36 84
6 07/09/2021 none 23 11
7 08/09/2021 car 36 11
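For reference, ffill can express the same logic more compactly if the 'Nan' strings are parsed as real missing values. A minimal sketch under that assumption (reading the CSV with na_values=['Nan']); it is not part of the original answer, but on the sample data it should reproduce the desired output:

import pandas as pd

df = pd.read_csv('df.csv', na_values=['Nan'])  # turn the 'Nan' strings into real NaN

is_none = df['Type'].eq('none')

# for each row, the 'to' value of the latest preceding 'none' row
fill = df['to'].where(is_none).ffill()

# before the first 'none' row there is nothing to carry forward,
# so use that row's 'From' value instead
first_none = is_none.idxmax()
fill.loc[:first_none] = df.loc[first_none, 'From']

df['to'] = df['to'].fillna(fill)
print(df)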


Pandas: calculating mean value of multiple columns using datetime and Grouper removes columns or doesn't return correct Dataframe

As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
Which I can't seem to use as a regular dataframe. The datetime is also off: it shows the last day of each month rather than the monthly mean's period. The station name is a single index level, not a value repeated down a column, and the mean value doesn't have a "column name" at all. This isn't a dataframe but a pandas.core.series.Series, and while the .to_frame() method nominally turns it back into a DataFrame, the result still isn't what I need. I don't get this part.
I found that in order to return a normal dataframe, I should use as_index=False in the groupby method. But this results in the months not being shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index=False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to do this on multiple columns. It returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
OK, how about this:
df = df.groupby(['Station_Name', df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
   Station_Name     Date  Value
0             2  2006-01   14.6
1            45  2006-12   38.2
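A slightly fuller sketch of that approach, assuming 'Date' is a regular column parsed as datetime (the file name stations.csv is a placeholder, not from the original post):

import pandas as pd

df = pd.read_csv('stations.csv', parse_dates=['Date'])  # placeholder file name

out = (df.groupby(['Station_Name', df['Date'].dt.to_period('M')])['Value']
         .mean()
         .reset_index())

# 'Date' now holds monthly Period values such as 2006-01;
# cast to string if plain text labels are preferred
out['Date'] = out['Date'].astype(str)
print(out)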

Pandas matching elements

I have a database named df1 and a sheet named df2.
I want to fill df2 with values from df1 using pandas.
DF1:
name SCORE height weight
1 JACK 66 150 100
2 PAUL 50 165 22
3 MLKE 30 132 33
4 Meir 20 110 20
5 Payne 10 175 21
DF2:
name SCORE height weight
1 JACK
2 PAUL
3 MLKE
*the names may not be in the same order
My mistaken code:
import openpyxl
import pandas as pd
df1 = pd.DataFrame(pd.read_excel('df1.xlsx',sheet_name =0))
df2 = pd.DataFrame(pd.read_excel('df2.xlsx',sheet_name = 0))
result = df1.merge(df2,on = ['NAME'],how="left")
Expected result (DF2):
name SCORE height weight
1 JACK 66 150 100
2 PAUL 50 165 22
3 MLKE 30 132 33
As you mentioned, the names may be out of order. Therefore, if you want to use df1 to fill up df2, you can try setting name as the index in both df1 and df2 and then use .update(), as follows:
df1a = df1.set_index('name')
df2a = df2.set_index('name')
df2a.update(df1a)
df2 = df2a.reset_index()
Result:
(using the df1 data shown above):
print(df2)
name SCORE height weight
0 JACK 66 150 100
1 PAUL 50 165 22
2 MLKE 30 132 33
If you want to keep the original row index of df2, you can save the index and then restore it later, as follows:
df1a = df1.set_index('name')
df2a = df2.set_index('name')
df2a.update(df1a)
df2_idx = df2.index
df2 = df2a.reset_index()
df2.index = df2_idx
Result:
print(df2)
name SCORE height weight
1 JACK 66 150 100
2 PAUL 50 165 22
3 MLKE 30 132 33
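A merge can work here too. A minimal sketch, assuming the key column is spelled 'name' in both frames (the attempt above used 'NAME', which raises a KeyError) and that df2's other columns are empty and can simply be replaced:

# keep df2's keys (and their order) and pull the remaining columns from df1
result = df2[['name']].merge(df1, on='name', how='left')
print(result)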

python: replacing values in a dataframe with values from another dataframe at specific indices

I have two dataframes df1 and df2.
import pandas as pd

d = {'ID': [31, 42, 63, 44, 45, 26],
     'lat': [64, 64, 64, 64, 64, 64],
     'lon': [152, 152, 152, 152, 152, 152],
     'other1': [12, 13, 14, 15, 16, 17],
     'other2': [21, 22, 23, 24, 25, 26]}
df1 = pd.DataFrame(data=d)

d2 = {'ID': [27, 48, 31, 45, 49, 10],
      'LAT': [63, 63, 63, 63, 63, 63],
      'LON': [153, 153, 153, 153, 153, 153]}
df2 = pd.DataFrame(data=d2)
df1 has incorrect values for columns lat and lon, but has correct data in the other columns that I need to keep track of. df2 has correct LAT and LON values but only has a few common IDs with df1. There are two things I would like to accomplish. First, I want to split df1 into two dataframes: df3 which has IDs that are present in df2; and df4 which has everything else. I can get df3 with:
import numpy as np
from functools import reduce

df3 = pd.DataFrame()
for i in reduce(np.intersect1d, [df1.ID, df2.ID]):
    df3 = df3.append(df1.loc[df1.ID == i])
but how do I get df4 to be the remaining data?
Second, I want to replace the lat and lon values in df3 with the correct data from df2.
I figure there is a slick python way to do something like:
for j in range(len(df3)):
    for k in range(len(df2)):
        if df3.ID[j] == df2.ID[k]:
            df3.lat[j] = df2.LAT[k]
            df3.lon[j] = df2.LON[k]
But I can't even get the above nested loop working correctly. I don't want to spend a lot of time getting it working if there is a better way to accomplish this in python.
For question 1, you can use boolean indexing:
m = df1.ID.isin(df2.ID)
df3 = df1[m]
df4 = df1[~m]
print(df3)
print(df4)
Prints:
ID lat lon other1 other2
0 31 64 152 12 21
4 45 64 152 16 25
ID lat lon other1 other2
1 42 64 152 13 22
2 63 64 152 14 23
3 44 64 152 15 24
5 26 64 152 17 26
For question 2:
x = df3.merge(df2, on="ID")[["ID", "other1", "other2", "LAT", "LON"]]
print(x)
Prints:
ID other1 other2 LAT LON
0 31 12 21 63 153
1 45 16 25 63 153
EDIT: For question 2 you can do:
x = df3.merge(df2, on="ID").drop(columns=["lat", "lon"])
print(x)
You can merge with indicator=True, keep LAT and LON where they exist and fall back to lat and lon otherwise, then use the indicator column as a grouper to build a dictionary of the two frames and grab the key you want:
u = df1.merge(df2,on='ID',how='left',indicator='I')
u[['LAT','LON']] = np.where(u[['LAT','LON']].isna(),u[['lat','lon']],u[['LAT','LON']])
u = u.drop(['lat','lon'], axis=1)
u['I'] = np.where(u['I'].eq("left_only"),"left_df","others")
d = dict(iter(u.groupby("I")))
print(d['left_df'],'\n--------\n',d['others'])
ID other1 other2 LAT LON I
1 42 13 22 64.0 152.0 left_df
2 63 14 23 64.0 152.0 left_df
3 44 15 24 64.0 152.0 left_df
5 26 17 26 64.0 152.0 left_df
--------
ID other1 other2 LAT LON I
0 31 12 21 63.0 153.0 others
4 45 16 25 63.0 153.0 others
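For completeness, a map-based sketch for question 2 that keeps df3's original column names and row order (an alternative, not taken from either answer above):

# look up the correct coordinates by ID; every ID in df3 exists in df2 by construction
lookup = df2.set_index('ID')
df3 = df3.assign(lat=df3['ID'].map(lookup['LAT']),
                 lon=df3['ID'].map(lookup['LON']))
print(df3)

Using assign returns a new frame, which also sidesteps the SettingWithCopyWarning that in-place assignment on the boolean-indexed df3 would trigger.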

Pandas groupby two columns and only keep records satisfying condition based on count

Trying to filter the actions a user has done in a session once the number of actions reaches a threshold.
Here is the data set (only a few records):
user_id,session_id,item_id,rating,length,time
123,36,28,3.5,6243.0,2015-03-07 22:44:40
123,36,29,2.5,4884.0,2015-03-07 22:44:14
123,36,30,3.5,6846.0,2015-03-07 22:44:28
123,36,54,6.5,10281.0,2015-03-07 22:43:56
123,36,61,3.5,7639.0,2015-03-07 22:43:44
123,36,62,7.5,18640.0,2015-03-07 22:43:34
123,36,63,8.5,7189.0,2015-03-07 22:44:06
123,36,97,2.5,7627.0,2015-03-07 22:42:53
123,36,98,4.5,9000.0,2015-03-07 22:43:04
123,36,99,7.5,7514.0,2015-03-07 22:43:13
223,63,30,8.0,5412.0,2015-03-22 01:42:10
123,36,30,5.5,8046.0,2015-03-07 22:42:05
223,63,32,8.5,4872.0,2015-03-22 01:42:03
123,36,32,7.5,11914.0,2015-03-07 22:41:54
225,63,35,7.5,6491.0,2015-03-22 01:42:19
123,36,35,5.5,7202.0,2015-03-07 22:42:15
123,36,36,6.5,6806.0,2015-03-07 22:42:43
123,36,37,2.5,6810.0,2015-03-07 22:42:34
225,63,41,5.0,15026.0,2015-03-22 01:42:37
225,63,45,6.5,8532.0,2015-03-07 22:42:25
I can groupby the data using user_id and session_id and get a count of items a user has rated in a session:
df.groupby(['user_id', 'session_id']).agg({'item_id':'count'}).rename(columns={'item_id': 'count'})
List of items that user has rated in a session can be obtained:
df.groupby(['user_id','session_id'])['item_id'].apply(list)
The goal is the following: if a user has rated more than 3 items in a session, I want to pick only the first three items (keep only the first three per user per session) from the original data frame. Maybe use the time to sort the items?
I first tried to find which sessions contain more than 3 items, but I'm struggling to go beyond that:
df.groupby(['user_id', 'session_id'])['item_id'].apply(
    lambda x: (x > 3).count())
Example: from the original df, user 123 should keep only the first three records of session 36.
It seems like you want to use groupby with head:
In [8]: df.groupby([df.user_id, df.session_id]).head(3)
Out[8]:
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
10 223 63 30 8.0 5412.0 2015-03-22 01:42:10
12 223 63 32 8.5 4872.0 2015-03-22 01:42:03
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25
One way is to use sort_values followed by groupby.cumcount. A method I find useful is to extract any series or MultiIndex data before applying any filtering.
The example below filters for a minimum of 3 items per user_id / session_id combination and only takes the first 3 in each group.
sizes = df.groupby(['user_id', 'session_id']).size()
counter = df.groupby(['user_id', 'session_id']).cumcount() + 1 # counting begins at 0
indices = df.set_index(['user_id', 'session_id']).index
df = df.sort_values('time')
res = df[(indices.map(sizes.get) >= 3) & (counter <=3)]
print(res)
user_id session_id item_id rating length time
0 123 36 28 3.5 6243.0 2015-03-07 22:44:40
1 123 36 29 2.5 4884.0 2015-03-07 22:44:14
2 123 36 30 3.5 6846.0 2015-03-07 22:44:28
14 225 63 35 7.5 6491.0 2015-03-22 01:42:19
18 225 63 41 5.0 15026.0 2015-03-22 01:42:37
19 225 63 45 6.5 8532.0 2015-03-07 22:42:25
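Combining the two answers, a minimal sketch that sorts chronologically before taking the first three rows per group (the chronological ordering is an assumption; the question only hints at it):

res = (df.sort_values('time')                    # "first" in chronological order
         .groupby(['user_id', 'session_id'])
         .head(3)                                # at most three rows per group
         .sort_index())                          # restore the original row order
print(res)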

calculate values between two pandas dataframe based on a column value

EDITED: let me copy the whole data set
df is the store sales/inventory data
branch daqu store store_name style color size stocked sold in_stock balance
0 huadong wenning C301 EE #��#��##�� EEBW52301M 39 160 7 4 3 -5
1 huadong wenning C301 EE #��#��##�� EEBW52301M 39 165 1 0 1 1
2 huadong wenning C301 EE #��#��##�� EEBW52301M 39 170 6 3 3 -3
dh is the transaction data (move 'amount' from store 'from' to store 'to'):
branch daqu from to style color size amount box_sum
8 huadong shanghai C306 C30C EEOM52301M 59 160 1 162
18 huadong shanghai C306 C30C EEOM52301M 39 160 1 162
25 huadong shanghai C306 C30C EETJ52301M 52 160 9 162
26 huadong shanghai C306 C30C EETJ52301M 52 155 1 162
32 huadong shanghai C306 C30C EEOW52352M 19 160 2 162
What I want is the store inventory data after the transaction, which would look exactly the same format as the df, but only 'in_stock' numbers would have changed from the original df according to numbers in dh.
Below is what I tried:
df['full_code'] = df['store'] + df['style'] + df['color'].astype(str) + df['size'].astype(str)
dh['from_code'] = dh['from'] + dh['style'] + dh['color'].astype(str) + dh['size'].astype(str)
dh['to_code'] = dh['to'] + dh['style'] + dh['color'].astype(str) + dh['size'].astype(str)

# subtract from 'from' store
dh_from = pd.DataFrame(dh.groupby('from_code')['amount'].sum())
for code, stock in dh_from.iterrows():
    df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] - stock

# add to 'to' store
dh_to = pd.DataFrame(dh.groupby('to_code')['amount'].sum())
for code, stock in dh_to.iterrows():
    df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] + stock

df.to_csv('d:/after_dh.csv')
But when I open the csv file, the 'in_stock' values for the rows where a transaction occurred are all blank.
I think df.loc[df['full_code'] == code, 'in_stock'] = df.loc[df['full_code'] == code, 'in_stock'] + stock has some problem. What's the correct way of updating the values?
ORIGINAL: I have two pandas dataframes: df1 is for the inventory, df2 is for the transactions.
df1 looks something like this:
full_code in_stock
1 AAA 200
2 BBB 150
3 CCC 150
df2 looks something like this:
from to full_code amount
1 XX XY AAA 30
2 XX XZ AAA 35
3 ZY OI BBB 50
4 AQ TR AAA 15
What I want is the inventory after all transactions are done.
In this case,
full_code in_stock
1 AAA 120
2 BBB 100
3 CCC 150
Note that full_code is unique in df1, but not unique in df2.
Is there any pandas way of doing this? I got tangled up between the original dataframe and a view of it, and solved it by turning them into numpy arrays and finding matching full_codes. But the resulting code is also a mess, and I wonder if there is a simpler way of doing this without turning everything into numpy arrays.
What I would do is set the index of df1 to the 'full_code' column and then call sub to subtract the other df. What we pass for the values is the result of grouping on 'full_code' and calling sum on the 'amount' column.
An additional param for sub is fill_value; this is because product 'CCC' does not exist on the rhs, so we want its value to be preserved, otherwise it becomes NaN:
In [25]:
total = df1.set_index('full_code')['in_stock'].sub(df2.groupby('full_code')['amount'].sum(), fill_value=0)
total.reset_index()

Out[25]:
  full_code  in_stock
0       AAA       120
1       BBB       100
2       CCC       150
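As for the edited two-store version: the blanks in the CSV most likely come from iterrows, where each stock is a whole row (a Series), so in_stock + stock aligns on labels and produces NaN; using stock['amount'] inside the loop would fix it. A vectorized sketch, assuming the *_code columns from the question have already been built:

out_amt = dh.groupby('from_code')['amount'].sum()   # total shipped out per code
in_amt = dh.groupby('to_code')['amount'].sum()      # total received per code

codes = df['full_code']
df['in_stock'] = (df['in_stock']
                  - out_amt.reindex(codes, fill_value=0).to_numpy()
                  + in_amt.reindex(codes, fill_value=0).to_numpy())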
