Fill missing values - Python

I have a dataset with a column that has 2439 missing values.
The values aren't missing at random: for a given identifier, some rows have a value and some don't (compare the columns 'Item_Identifier' and 'Item_Weight').
Looking closely, a specific Item_Identifier has Item_Weight filled in some rows and missing in others, and there are many more Item_Identifier values like this. Is there a way, using Python, to fill the missing Item_Weight values from the rows of the same Item_Identifier that do have a value?

You can load the table into a pandas DataFrame; df['item_weight'].fillna(15.5, inplace=True) would fill every gap with one constant, but you can instead fill per Item_Identifier.

Reproducible example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'col2': [10, np.nan, np.nan, np.nan, 20, 30]})
col1 col2
0 a 10.0
1 a NaN
2 b NaN
3 b NaN
4 b 20.0
5 c 30.0
You can group by col1 and aggregate with 'first' (which takes the first non-null value per group):
vals = df.groupby('col1').agg('first')
col2
col1
a 10.0
b 20.0
c 30.0
Then set the same index and use fillna() to align on it and fill the values:
df = df.set_index('col1').fillna(vals).reset_index()
col1 col2
0 a 10.0
1 a 10.0
2 b 20.0
3 b 20.0
4 b 20.0
5 c 30.0
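The same fill can be done in one step with groupby plus transform, which avoids changing the index; a sketch on the example above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b', 'b', 'b', 'c'],
                   'col2': [10, np.nan, np.nan, np.nan, 20, 30]})

# Fill each NaN with the first non-null value observed in its col1 group;
# transform('first') broadcasts that group value back to every row.
df['col2'] = df['col2'].fillna(df.groupby('col1')['col2'].transform('first'))
print(df)
```

This keeps the original row order and index, so no set_index/reset_index round trip is needed.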

Update a DataFrame with duplicate destination

I would like to update a dataframe with another one but with multiple "destination". Here is an example
df1 = pd.DataFrame({'name':['A', 'B', 'C', 'A'], 'category':['X', 'X', 'Y', 'Y'], 'value1':[None, 1, None, None], 'value2':[None, 10, None, None]})
name category value1 value2
0 A X NaN NaN
1 B X 1.0 10.0
2 C Y NaN NaN
3 A Y NaN NaN
df2 = pd.DataFrame({'name':['A', 'C'], 'value1':[2, 3], 'value2':[11, 12]})
name value1 value2
0 A 2 11
1 C 3 12
And the desired result would be
name category value1 value2
0 A X 2.0 11.0
1 B X 1.0 10.0
2 C Y 3.0 12.0
3 A Y 2.0 11.0
I don't think pd.update works, since 'A' appears twice in my first DataFrame.
pd.merge creates extra columns, and I suspect there is a more elegant way than merging and then manually cleaning up the duplicated columns.
Thanks in advance for your help!
You can use fillna after mapping the name column in df1 to the corresponding values from df2. For a single column, e.g. value1:
mapping = df2.set_index('name')['value1']
df1['value1'] = df1['value1'].fillna(df1['name'].map(mapping))
If you want to map multiple columns:
mapping = df2.set_index('name')
for col in mapping:
df1[col] = df1[col].fillna(df1['name'].map(mapping[col]))
Alternatively you can try merge and then collapse the suffixed columns:
df = df1.merge(df2, on='name', how='left', suffixes=['', '_r'])
df.groupby(df.columns.str.rstrip('_r'), axis=1, sort=False).first()
Note that str.rstrip('_r') strips trailing '_' and 'r' characters rather than the literal suffix, so this only works because none of the original column names end in those characters.
name category value1 value2
0 A X 2.0 11.0
1 B X 1.0 10.0
2 C Y 3.0 12.0
3 A Y 2.0 11.0

Filtering rows with some NaNs in DFs

I have a dataframes with many rows, and some values are NaNs.
For example -
index col1 col2 col3
0 1.0 NaN 3.0
1 NaN 4.0 NaN
3 1.0 5.0 NaN
I would like to filter the DF and return only the rows with 2+ values.
The number should be configurable.
The resulted DF will be -
index col1 col2 col3
0 1.0 NaN 3.0
3 1.0 5.0 NaN
Any idea how I can achieve this result? I've tried creating a new column, but it doesn't seem like the right way.
Thanks!
Code to create the DF:
d = {'col1': [1, None, 1], 'col2': [None, 4, 5], 'col3': [3, None, None]}
df = pd.DataFrame(data=d)
df
You can use dropna(), setting the threshold to 2 with thresh=2 (keep rows with at least 2 non-null values) and operating along the rows with axis=0:
res = df.dropna(thresh=2,axis=0)
res
col1 col2 col3
0 1.00 NaN 3.00
2 1.00 5.00 NaN
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html
Alternatively, you can delete the second row by label using the drop() method, though this hardcodes the row instead of using a configurable threshold:
ax = df.drop([1])
print(ax)
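The thresh logic can also be spelled out with a boolean mask, which makes the minimum-count parameter explicit and configurable; a sketch on the same frame:

```python
import pandas as pd

d = {'col1': [1, None, 1], 'col2': [None, 4, 5], 'col3': [3, None, None]}
df = pd.DataFrame(data=d)

min_values = 2  # keep rows with at least this many non-null values
res = df[df.notna().sum(axis=1) >= min_values]
print(res)
```

This is equivalent to df.dropna(thresh=min_values, axis=0), but the counting step is visible and easy to adapt (e.g. counting only a subset of columns).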

apply function only within the same row index?

I have a DataFrame with a sorted two-level index, and I want to apply diff to col3 only within each col1 group, in the order given by col2.
mini_df = pd.DataFrame({'col1': ['A', 'B', 'C', 'A'], 'col2': [1,2,3,4], 'col3': [1,4,7,3]})
mini_df = mini_df.set_index(['col1', 'col2']).sort_index()
mini_df['diff'] = mini_df.col3.diff(1)
This gives me
col3 diff
col1 col2
__________________________
A 1 1 nan
4 3 2
B 2 4 1
C 3 7 3
Above, it applies diff across all rows, ignoring the groups.
What I want is
col3 diff
col1 col2
__________________________
A 1 1 nan
4 3 2
B 2 4 nan
C 3 7 nan
You'll want to use groupby to apply diff to each group:
import pandas as pd

mini_df = pd.DataFrame({'col1': ['A', 'B', 'C', 'A'], 'col2': [1, 2, 3, 4], 'col3': [1, 4, 7, 3]})
mini_df = mini_df.set_index(['col1', 'col2']).sort_index()
mini_df['diff'] = mini_df.groupby(level='col1')['col3'].diff()
Since you already do the heavy lifting of sorting, you can diff once and keep only the values within each group. You can't shift non-datetime indexes directly, so either build a Series from the index level (as below) or use np.roll; note that np.roll wraps around, which would give the wrong answer for a single-group DataFrame.
import pandas as pd
s = pd.Series(mini_df.index.get_level_values('col1'))
mini_df['diff'] = mini_df.col3.diff().where(s.eq(s.shift(1)).values)
col3 diff
col1 col2
A 1 1 NaN
4 3 2.0
B 2 4 NaN
C 3 7 NaN

Pandas: move row (index and values) from last to first [duplicate]

This question already has answers here:
add a row at top in pandas dataframe [duplicate]
(6 answers)
Closed 4 years ago.
I would like to move an entire row (index and values) from the last row to the first row of a DataFrame. Every other example I can find either uses an ordered row index (to be specific - my row index is not a numerical sequence - so I cannot simply add at -1 and then reindex with +1) or moves the values while maintaining the original index. My DF has descriptions as the index and the values are discrete to the index description.
I'm adding a row and then would like to move that into row 1. Here is the setup:
df = pd.DataFrame({
'col1' : ['A', 'A', 'B', 'F', 'D', 'C'],
'col2' : [2, 1, 9, 8, 7, 4],
'col3': [0, 1, 9, 4, 2, 3],
}).set_index('col1')
#output
In [7]: df
Out[7]:
col2 col3
col1
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
I then add a new row as follows:
df.loc["Deferred Description"] = pd.Series([''])
In [9]: df
Out[9]:
col2 col3
col1
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
Deferred Description NaN NaN
I would like the resulting output to be:
In [9]: df
Out[9]:
col2 col3
col1
Deferred Description NaN NaN
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
I've tried using df.shift(), but only the values shift. I've also tried df.sort_index(), but that requires the index to be ordered (there are several SO examples using df.loc[-1] = ... and then reindexing with df.index = df.index + 1). In my case I need the Deferred Description to be the first row.
Your problem is not one of cyclic shifting, but a simpler one: insertion (which is why I've chosen to mark this question as duplicate).
Construct an empty DataFrame and then concatenate the two using pd.concat.
pd.concat([pd.DataFrame(columns=df.columns, index=['Deferred Description']), df])
col2 col3
Deferred Description NaN NaN
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
If this were columns, it'd have been easier. Funnily enough, pandas has a DataFrame.insert function that works for columns, but not rows.
Generalized Cyclic Shifting
If you were curious to know how you'd cyclically shift a dataFrame, you can use np.roll.
# apply this fix to your existing DataFrame
pd.DataFrame(np.roll(df.values, 1, axis=0),
index=np.roll(df.index, 1), columns=df.columns
)
col2 col3
Deferred Description NaN NaN
A 2 0
A 1 1
B 9 9
F 8 4
D 7 2
C 4 3
This, thankfully, also works when you have duplicate index values. If the index or columns aren't important, then pd.DataFrame(np.roll(df.values, 1, axis=0)) works well enough.
You can use append (note: DataFrame.append was removed in pandas 2.0, so with recent versions use pd.concat as above instead):
pd.DataFrame({'col2': [np.nan], 'col3': [np.nan]}, index=["Deferred Description"]).append(df)
Out[294]:
col2 col3
Deferred Description NaN NaN
A 2.0 0.0
A 1.0 1.0
B 9.0 9.0
F 8.0 4.0
D 7.0 2.0
C 4.0 3.0
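To echo the np.roll idea without rebuilding the frame, you can roll the row positions instead of the values and index with iloc; this keeps dtypes and the duplicate 'A' labels intact. A sketch assuming the question's setup:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', 'B', 'F', 'D', 'C'],
    'col2': [2, 1, 9, 8, 7, 4],
    'col3': [0, 1, 9, 4, 2, 3],
}).set_index('col1')
df.loc["Deferred Description"] = pd.Series([''])  # appends a NaN row, as in the question

# Roll the integer positions by one: the last row moves to the front,
# index labels travel with their values.
out = df.iloc[np.roll(np.arange(len(df)), 1)]
print(out)
```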

for loop for searching value in dataframe and updating values next to it

I want Python to update the values next to a value found in both dataframes (somewhat similar to VLOOKUP in MS Excel). So, given:
import pandas as pd
df1 = pd.DataFrame(data = {'col1':['a', 'b', 'd'], 'col2': [1, 2, 4], 'col3': [2, 3, 4]})
df2 = pd.DataFrame(data = {'col1':['a', 'f', 'c', 'd']})
In [3]: df1
Out[3]:
col1 col2 col3
0 a 1 2
1 b 2 3
2 d 4 4
In [4]: df2
Out[4]:
col1
0 a
1 f
2 c
3 d
Outcome must be the following:
In [6]: df3 = *somecode*
df3
Out[6]:
col1 col2 col3
0 a 1 2
1 f
2 c
3 d 4 4
The main part is that I want some sort of "for loop" to do this.
So, for instance, Python searches for the first value of col1 in df2, finds it in df1, updates col2 and col3 respectively, then moves on to the next one.
First: a for loop in pandas is best avoided if a vectorized solution exists.
I think merge with a left join is what you need; the on parameter can be omitted since col1 is the only column shared by both DataFrames:
df3 = df2.merge(df1, how='left')
print (df3)
col1 col2 col3
0 a 1.0 2.0
1 f NaN NaN
2 c NaN NaN
3 d 4.0 4.0
Try this; a simple left join will solve your problem:
pd.merge(df2,df1,how='left',on=['col1'])
col1 col2 col3
0 a 1.0 2.0
1 f NaN NaN
2 c NaN NaN
3 d 4.0 4.0
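If you do want the literal row-by-row loop the question asks for, here is a sketch; the merge answers above remain the better choice for anything large, and the loop exists only to mirror the VLOOKUP mental model:

```python
import pandas as pd

df1 = pd.DataFrame(data={'col1': ['a', 'b', 'd'], 'col2': [1, 2, 4], 'col3': [2, 3, 4]})
df2 = pd.DataFrame(data={'col1': ['a', 'f', 'c', 'c']})

lookup = df1.set_index('col1')  # col1 -> (col2, col3), like a VLOOKUP table
df3 = df2.copy()
for i, key in df3['col1'].items():
    if key in lookup.index:
        # Match found: copy col2/col3 from df1 next to the matched value.
        # .loc enlargement creates the columns (NaN elsewhere) on first write.
        df3.loc[i, 'col2'] = lookup.at[key, 'col2']
        df3.loc[i, 'col3'] = lookup.at[key, 'col3']
print(df3)
```

Unmatched keys ('f', 'c' here) are simply left as NaN, matching the blank cells in the desired output.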
