Example dataframe:
name stuff floats ints
0 Mike a 1.0 1
1 Joey d 2.2 3
2 Zendaya c NaN 8
3 John a 1.0 1
4 Abruzzi d NaN 3
I have a 'to_delete' list:
[['Abruzzi', 'd', pd.NA, 3], ['Mike', 'a', 1.0, 1]]
How can I remove rows from the dataframe based on the 'to_delete' list?
Which pandas method suits this?
So I will get a new dataframe like:
name stuff floats ints
1 Joey d 2.2 3
2 Zendaya c NaN 8
3 John a 1.0 1
Thanks,
*I'm new to pandas
I would use a merge with indicator:
keep = (
    df.merge(pd.DataFrame(to_delete, columns=df.columns), how='left', indicator=True)
      .query('_merge == "left_only"').index
)
out = df.loc[keep]
print(out)
Output:
name stuff floats ints
1 Joey d 2.2 3
2 Zendaya c <NA> 8
3 John a 1.0 1
You can use the drop function to delete rows and columns in a Pandas DataFrame.
The following code shows one way to find the matching rows and remove them:
import pandas as pa
res = pa.DataFrame({
    'name': ['Mike', 'Joey', 'Zendaya', 'John', 'Abruzzi'],
    'stuff': ['a', 'd', 'c', 'a', 'd'],
    'floats': [1.0, 2.2, pa.NA, 1.0, pa.NA],
    'ints': [1, 3, 8, 1, 3]
})
remove = [['Abruzzi', 'd', pa.NA, 3], ['Mike', 'a', 1.0, 1]]
res = res[~res.isin(remove)].dropna(how='all')
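Alternatively, here is a minimal sketch that actually uses drop (instead of the isin line above, and assuming the same res and remove): find the index labels of the matching rows with an inner merge, then drop them by label. Like the merge answer earlier, this relies on missing join keys matching each other in merge.
# Rows whose values match an entry in 'remove'
to_drop = (
    res.reset_index()
       .merge(pa.DataFrame(remove, columns=res.columns), how='inner')['index']
)
res = res.drop(index=to_drop)   # drop by original index label
print(res)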
Consider two DataFrames:
>>> df1 = pd.DataFrame({'key': [1, 2, 3, 4, 5],
                        'bar': ['w', 'x', 'y', 'z', 'h'],
                        'foo': ['A', 'B', 'C', 'D', 'E']})
>>> df2 = pd.DataFrame({'key': [1, 2, 3, 8, 9, 10],
                        'foo': [np.nan, np.nan, np.nan, 'I', 'J', 'K']})
Imagine we want to join the DataFrames on 'key' so that ONLY the keys in df1 are returned, along with those keys in df2 that are greater than 8. You can do this by
first doing a left join via df3 = pd.merge(df1,df2,on='key',how='left')
Then, doing an outer join with a slice of df2 via df4 = pd.merge(df3,df2.loc[df2['key']>8],on='key',how='outer')
However, rather than aligning the 'foo' columns from each DataFrame, each 'foo' column is added to df4 as a separate column, with suffixes to distinguish them. Several more lines of code are then required to combine the three 'foo' columns into a single 'foo' column. Is there a more concise way to do this?
EDIT:
I guess my example belies the true question. Let's use these DataFrames:
>>> df1 = pd.DataFrame({'key': [1, 2, 3, 4, 5],
                        'bar': ['w', 'x', 'y', 'z', 'h'],
                        'foo': [np.nan, np.nan, 'C', 'D', 'E']})
>>> df2 = pd.DataFrame({'key': [1, 2, 3, 8, 9, 10],
                        'foo': ['A', 'B', np.nan, 'I', 'J', 'K']})
If I use a left and then outer join as described above, I will get this...
key bar foo_x foo_y foo
0 1 w NaN A NaN
1 2 x NaN B NaN
2 3 y C NaN NaN
3 4 z D NaN NaN
4 5 h E NaN NaN
5 9 NaN NaN NaN J
6 10 NaN NaN NaN K
Because combining the three 'foo' columns will require many lines of code, I am wondering if there is a more concise way to do all this. That is, merge the two DataFrames and combine the 'foo' columns so that the returned DataFrame is this:
key bar foo
0 1 w A
1 2 x B
2 3 y C
3 4 z D
4 5 h E
5 9 NaN J
6 10 NaN K
Let's try concat and groupby:
(pd.concat((df1, df2.query('key>8')))
   .groupby('key', as_index=False).first()
)
Output:
key foo bar
0 1 A w
1 2 B x
2 3 C y
3 4 D z
4 5 E h
5 9 J NaN
6 10 K NaN
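For the edited df1/df2 in the question, where the 'foo' values for keys 1 and 2 have to come from df2, here is a minimal self-contained sketch along the same lines (my reading of the intent, not part of the answer above):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3, 4, 5],
                    'bar': ['w', 'x', 'y', 'z', 'h'],
                    'foo': [np.nan, np.nan, 'C', 'D', 'E']})
df2 = pd.DataFrame({'key': [1, 2, 3, 8, 9, 10],
                    'foo': ['A', 'B', np.nan, 'I', 'J', 'K']})

# Bring df2's 'foo' alongside df1's, prefer df1's value where present,
# then append the df2-only rows with key > 8.
merged = df1.merge(df2, on='key', how='left', suffixes=('', '_df2'))
merged['foo'] = merged['foo'].fillna(merged['foo_df2'])
out = pd.concat([merged.drop(columns='foo_df2'), df2.query('key > 8')],
                ignore_index=True)
print(out)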
I'm learning Python/Pandas with a DataFrame having the following structure:
df1 = pd.DataFrame({'unique_id': [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                    'brand': ['A', 'B', 'A', 'C', 'X', 'A', 'C', 'X', 'X', 'X']})
print(df1)
unique_id brand
0 1 A
1 1 B
2 2 A
3 2 C
4 2 X
5 3 A
6 3 C
7 3 X
8 3 X
9 3 X
My goal is to make some calculations on the above DataFrame.
Specifically, for each unique_id, I want to:
Count the number of brands without taking brand X into account;
Count only how many times brand 'X' appears.
Visually, using the above example, the resulting DataFrame I'm looking for should look like this:
unique_id count_brands_not_x count_brand_x
0 1 2 0
1 2 2 1
2 3 2 3
I have used the groupby method on simple examples in the past but I don't know how to specify conditions in a groupby to solve this new problem I have. Any help would be appreciated.
You can use GroupBy and merge:
maskx = df1['brand'].eq('X')
d1 = df1[~maskx].groupby('unique_id')['brand'].size().reset_index()
d2 = df1[maskx].groupby('unique_id')['brand'].size().reset_index()
df = d1.merge(d2, on='unique_id', how='outer', suffixes=['_not_x', '_x']).fillna(0)
unique_id brand_not_x brand_x
0 1 2 0.00
1 2 2 1.00
2 3 2 3.00
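If you want integer counts and the column names from the question, a small hedged follow-up (assuming the df produced by the merge above):
df = (df.rename(columns={'brand_not_x': 'count_brands_not_x',
                         'brand_x': 'count_brand_x'})
        .astype({'count_brands_not_x': int, 'count_brand_x': int}))  # counts back to int
print(df)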
I use pd.crosstab on a True/False mask obtained by comparing against the value 'X':
s = df1.brand.eq('X')
df_final = (pd.crosstab(df1.unique_id, s)
              .rename({False: 'count_brands_not_x', True: 'count_brand_x'}, axis=1))
Out[134]:
brand count_brands_not_x count_brand_x
unique_id
1 2 0
2 2 1
3 2 3
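To get a flat frame matching the layout in the question, a small follow-up (assuming the df_final from above):
df_final = df_final.reset_index()  # unique_id back to a regular column
df_final.columns.name = None       # drop the leftover 'brand' axis label
print(df_final)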
You can subset the original DataFrame and use the appropriate groupby operations for each calculation. concat joins the results.
import pandas as pd
s = df1.brand.eq('X')
res = (pd.concat([df1[~s].groupby('unique_id').brand.nunique().rename('unique_not_X'),
                  df1[s].groupby('unique_id').size().rename('count_X')],
                 axis=1)
         .fillna(0))
# unique_not_X count_X
#unique_id
#1 2 0.0
#2 2 1.0
#3 2 3.0
If, instead of the number of unique brands, you just want the number of rows with brands that are not "X", then we can perform a single groupby and unstack the result.
(df1.groupby(['unique_id', df1.brand.eq('X').map({True: 'count_X', False: 'count_not_X'})])
    .size().unstack(-1).fillna(0))
#brand count_X count_not_X
#unique_id
#1 0.0 2.0
#2 1.0 2.0
#3 3.0 2.0
I would first create groups and later count the elements in each group.
But maybe there is a better function to count items in agg().
import pandas as pd
df1 = pd.DataFrame({'unique_id': [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                    'brand': ['A', 'B', 'A', 'C', 'X', 'A', 'C', 'X', 'X', 'X']})
g = df1.groupby('unique_id')
df = pd.DataFrame()
df['count_brand_x'] = g['brand'].agg(lambda data: sum(data == 'X'))
df['count_brands_not_x'] = g['brand'].agg(lambda data: sum(data != 'X'))
df = df.reset_index()
print(df)
EDIT: If I already have df['count_brand_x'], then the other count can be computed from it:
df['count_brands_not_x'] = g['brand'].count() - df['count_brand_x']
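Along the same lines, a hedged sketch (not part of the answer above, and needing pandas 0.25+ for named aggregation) that sums a boolean helper column instead of using lambdas:
out = (df1.assign(is_x=df1['brand'].eq('X'),
                  not_x=df1['brand'].ne('X'))
          .groupby('unique_id', as_index=False)
          .agg(count_brands_not_x=('not_x', 'sum'),   # rows that are not 'X'
               count_brand_x=('is_x', 'sum')))        # rows that are 'X'
print(out)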
I am trying to replace hours in df with hours from replacements for project IDs that exist in both dataframes:
import pandas as pd
df = pd.DataFrame({
    'project_ids': [1, 2, 3, 4, 5],
    'hours': [111, 222, 333, 444, 555],
    'else': ['a', 'b', 'c', 'd', 'e']
})
replacements = pd.DataFrame({
    'project_ids': [2, 5, 3],
    'hours': [666, 999, 1000],
})
for project in replacements['project_ids']:
    df.loc[df['project_ids'] == project, 'hours'] = replacements.loc[replacements['project_ids'] == project, 'hours']
print(df)
However, only project ID 3 gets the correct assignment (1000); both 2 and 5 get NaN:
project_ids hours else
0 1 111.0 a
1 2 NaN b
2 3 1000.0 c
3 4 444.0 d
4 5 NaN e
How can I fix it?
Is there a better way to do this?
Use Series.map with a Series created from replacements via DataFrame.set_index:
s = replacements.set_index('project_ids')['hours']
df['hours'] = df['project_ids'].map(s).fillna(df['hours'])
print(df)
project_ids hours else
0 1 111.0 a
1 2 666.0 b
2 3 1000.0 c
3 4 444.0 d
4 5 999.0 e
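As an aside on why the original loop produced NaN: the right-hand side of the assignment is a Series indexed by replacements' row labels, which do not line up with df's row labels, so pandas fills NaN on alignment (project 3 only worked because its label happened to coincide). Extracting the scalar first makes the loop itself work; a minimal sketch with the data above:
for project in replacements['project_ids']:
    # take the scalar so no index alignment happens on assignment
    new_hours = replacements.loc[replacements['project_ids'] == project, 'hours'].iloc[0]
    df.loc[df['project_ids'] == project, 'hours'] = new_hours
print(df)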
Another way using df.update():
m = df.set_index('project_ids')
m.update(replacements.set_index('project_ids')['hours'])
print(m.reset_index())
project_ids hours else
0 1 111.0 a
1 2 666.0 b
2 3 1000.0 c
3 4 444.0 d
4 5 999.0 e
Another solution would be to use pandas.merge and then use fillna:
df_new = pd.merge(df, replacements, on='project_ids', how='left', suffixes=['_1', ''])
df_new['hours'].fillna(df_new['hours_1'], inplace=True)
df_new.drop('hours_1', axis=1, inplace=True)
print(df_new)
project_ids else hours
0 1 a 111.0
1 2 b 666.0
2 3 c 1000.0
3 4 d 444.0
4 5 e 999.0
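A hedged note: with copy-on-write in newer pandas, the chained df_new['hours'].fillna(..., inplace=True) may warn or fail to modify df_new; plain assignment does the same job:
df_new = pd.merge(df, replacements, on='project_ids', how='left', suffixes=['_1', ''])
df_new['hours'] = df_new['hours'].fillna(df_new['hours_1'])  # keep original where no replacement
df_new = df_new.drop(columns='hours_1')
print(df_new)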
I want to reshape the data by Date in Python as a dataframe.
Required:
Is there any Pandas function for this?
Create an additional key using cumcount, then pivot (data taken from jpp):
df.assign(key=df.groupby('Col1').cumcount()).pivot(index='key', columns='Col1', values='Col2')
Out[29]:
Col1 A B C
key
0 1.0 4.0 6.0
1 2.0 5.0 7.0
2 3.0 NaN 8.0
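For reference, a self-contained version of the same idea (assuming the Col1/Col2 sample data that the next answer defines):
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                   'Col2': [1, 2, 3, 4, 5, 6, 7, 8]})

# cumcount numbers the rows within each Col1 group (0, 1, 2, ...);
# that running number becomes the row index of the pivoted frame.
out = (df.assign(key=df.groupby('Col1').cumcount())
         .pivot(index='key', columns='Col1', values='Col2'))
print(out)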
One way is to use pandas.concat on series derived from unique values in your key column.
Here is a minimal example.
import pandas as pd
df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                   'Col2': [1, 2, 3, 4, 5, 6, 7, 8]})
res = pd.concat({k: df.loc[df['Col1'] == k, 'Col2'].reset_index(drop=True)
                 for k in df['Col1'].unique()}, axis=1)
print(res)
A B C
0 1 4.0 6
1 2 5.0 7
2 3 NaN 8
Consider two dataframes:
import numpy as np
import pandas as pd

df_a = pd.DataFrame([
    ['a', 1],
    ['b', 2],
    ['c', np.nan],
], columns=['name', 'value'])
df_b = pd.DataFrame([
    ['a', 1],
    ['b', np.nan],
    ['c', 3],
    ['d', 4]
], columns=['name', 'value'])
So they look like this:
# df_a
name value
0 a 1
1 b 2
2 c NaN
# df_b
name value
0 a 1
1 b NaN
2 c 3
3 d 4
I want to merge these two dataframes and fill in the NaN values of the value column with the existing values from the other dataframe. In other words, I want:
# DESIRED RESULT
name value
0 a 1
1 b 2
2 c 3
3 d 4
Sure, I can do this with a custom .map or .apply, but I want a solution that uses merge or the like, rather than writing a custom merge function. How can this be done?
I think you can use combine_first:
print (df_b.combine_first(df_a))
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
Or fillna:
print (df_b.fillna(df_a))
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
A solution with update is not as common as combine_first:
df_b.update(df_a)
print (df_b)
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
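One caveat that applies to all three approaches: they align on the index (row labels), not on 'name'. If the rows were not already in the same order, setting 'name' as the index first makes the alignment explicit; a minimal sketch with the same data:
out = (df_b.set_index('name')
           .combine_first(df_a.set_index('name'))  # fill df_b's NaN from df_a by name
           .reset_index())
print(out)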