I want to filter my dataframe, use the part of filter with condition. And I do't know how to do it
import numpy as np
table = pd.DataFrame({'movie': ['thg', 'thg', 'mol', 'mol', 'lob', 'lob'],
'rating': [3., 4., 5., np.nan, np.nan, np.nan],
'name': ['John', 'Paul', 'Adam', 'Graham', 'Eva', 'Thomas']})
filter = pd.DataFrame({'name': ['John', 'Paul','Adam', 'Graham', 'Eva', 'Thomas'],
'qty': [1, 1, 3, 10, 7, 5]})
>>> table
movie name rating
0 thg John 3
1 thg Paul 4
3 mol Adam 5
4 mol Graham NaN
5 lob Eva NaN
6 lob Thomas NaN
I know that this doesn't work, but I can't change this, help me please
result=df[(df['name'] == filter[qty<3]) ]
>>> result
movie name rating
0 thg John 3
1 thg Paul 4
I believe you need:
table[table['name'].isin(filt.loc[filt['qty']<3,'name'])]
movie rating name
0 thg 3.0 John
1 thg 4.0 Paul
Note: i have changed the filter variable to filt since filter is a builtin function and you should not name a variable with such name
You can try with. I am undeleting this answer because of not using loc unlike the other answers, although it is essentially the same:
result = table[table['name'].isin(filter[filter['qty']<3]['name'].values)]
Use Series.isin with callable :
table[table['name'].isin(df_filter.loc[lambda x: x['qty']<3, 'name'])]
movie rating name
0 thg 3.0 John
1 thg 4.0 Paul
or DataFrame.merge
table.merge(df_filter.loc[lambda x: x['qty'].lt(3), ['name']])
I have a more concise solution, why not use query:
df=table.merge(filter,on="name")
df.query("qty<3")
Related
I need to make a function to expand a dataframe. For example, the input of the function is :
df = pd.DataFrame({
'Name':['Ali', 'Ali', 'Ali', 'Sasha', 'Sasha', 'Sasha'],
'Cart':['book', 'phonecase', 'shirt', 'phone', 'food', 'bag']
})
suppose the n value is 3. Then, for each person inside the Name column, I have to add 3 more new rows and leave the Cart as np.nan. The output should be like this :
df = pd.DataFrame({
'Name':['Ali', 'Ali', 'Ali', 'Ali', 'Ali', 'Ali', 'Sasha', 'Sasha', 'Sasha', 'Sasha', 'Sasha', 'Sasha'],
'Cart':['book', 'phonecase', 'shirt', np.nan, np.nan, np.nan, 'phone', 'food', 'bag', np.nan, np.nan, np.nan]
})
How can I solve this with using copy() and append()?
You can use np.repeat with pd.Series.unique:
n = 3
print (df.append(pd.DataFrame(np.repeat(df["Name"].unique(), n), columns=["Name"])))
Name Cart
0 Ali book
1 Ali phonecase
2 Ali shirt
3 Sasha phone
4 Sasha food
5 Sasha bag
0 Ali NaN
1 Ali NaN
2 Ali NaN
3 Sasha NaN
4 Sasha NaN
5 Sasha NaN
Try this one: (it adds n rows to each group of rows with the same Name value)
import pandas as pd
import numpy as np
n = 3
list_of_df_unique_names = [df[df["Name"]==name] for name in df["Name"].unique()]
df2 = pd.concat([d.append(pd.DataFrame({"Name":np.repeat(d["Name"].values[-1], n)}))\
for d in list_of_df_unique_names]).reset_index(drop=True)
print(df2)
Output:
Name Cart
0 Ali book
1 Ali phonecase
2 Ali shirt
3 Ali NaN
4 Ali NaN
5 Ali NaN
6 Sasha phone
7 Sasha food
8 Sasha bag
9 Sasha NaN
10 Sasha NaN
11 Sasha NaN
Maybe not the most beautiful of all solutions, but it works. Say that you want to add 4 NaN rows by group. Then, given your df:
df = pd.DataFrame({
'Name':['Ali', 'Ali', 'Ali', 'Sasha', 'Sasha', 'Sasha'],
'Cart':['book', 'phonecase', 'shirt', 'phone', 'food', 'bag']
})
you can creat an empty dataframe DF and loop trough the range (1,4), filter the df you had and in every loop add an empty row:
DF = []
names = list(set(df.Name))
for i in range(4):
for name in names:
gf = df[df['Name']=='{}'.format(name)]
a = pd.concat([gf, gf.groupby('Name')['Cart'].apply(lambda x: x.shift(-1).iloc[-1]).reset_index()]).sort_values('Name').reset_index(drop=True)
DF.append(a)
DF_full = pd.concat(DF)
Now, you'll end up with copies of your original df, so you need to dump them without dumping the NaN rows:
DFF = DF_full.sort_values(['Name','Cart'])
DFF = DFF[(~DFF.duplicated()) | (DFF['Cart'].isnull())]
which gives:
Name Cart
0 Ali book
1 Ali phonecase
2 Ali shirt
3 Ali NaN
3 Ali NaN
3 Ali NaN
3 Ali NaN
2 Sasha bag
1 Sasha food
0 Sasha phone
3 Sasha NaN
3 Sasha NaN
3 Sasha NaN
3 Sasha NaN
I have a DataFrame with 3 columns: ID, BossID and Name. Each row has a unique ID and has a corresponding name. BossID is the ID of the boss of the person in that row. Suppose I have the following DataFrame:
df = pd.DataFrame({'id':[1,2,3,4,5], 'bossId':[np.nan, 1, 2, 2, 3],
'name':['Anne Boe','Ben Coe','Cate Doe','Dan Ewe','Erin Aoi']})
So here, Anne is Ben's boss and Ben Coe is Cate and Dan's boss, etc.
Now, I want to have another column that has the boss's name for each person.
The desired output is:
id boss name boss_name
0 1 NaN Anne NaN
1 2 1.0 Ben Anne
2 3 2.0 Cate Ben
3 4 2.0 Dan Ben
4 5 3.0 Erin Cate
I can get my output using an ugly double for-loop. Is there a cleaner way to obtain the desired output?
This should work:
bossmap = df.set_index('id')['name'].squeeze()
df['boss_name'] = df['bossId'].map(bossmap)
You can set id as index, then use pd.Series.reindex
df = df.set_index('id')
df['boss_name'] = df['name'].reindex(df['bossId']).to_numpy() # or .to_list()
id bossId name boss_name
0 1 NaN Anne Boe NaN
1 2 1.0 Ben Coe Anne Boe
2 3 2.0 Cate Doe Ben Coe
3 4 2.0 Dan Ewe Ben Coe
4 5 3.0 Erin Aoi Cate Doe
Create a separate dataframe for 'name' and 'id'.
Renamed 'name' and set 'id' as the index
.merge df with the new dataframe
import pandas as pd
# test dataframe
df = pd.DataFrame({'id':[1,2,3,4,5], 'bossId':[np.nan, 1, 2, 2, 3], 'name':['Anne Boe','Ben Coe','Cate Doe','Dan Ewe','Erin Aoi']})
# separate dataframe with id and name
names = df[['id', 'name']].dropna().set_index('id').rename(columns={'name': 'boss_name'})
# merge the two
df = df.merge(names, left_on='bossId', right_index=True, how='left')
# df
id bossId name boss_name
0 1 NaN Anne Boe NaN
1 2 1.0 Ben Coe Anne Boe
2 3 2.0 Cate Doe Ben Coe
3 4 2.0 Dan Ewe Ben Coe
4 5 3.0 Erin Aoi Cate Doe
I have a dataset with several columns.
Now what I want is to basically calculate score based on a particular column ("name") but grouped on the "id" column.
_id fName lName age
0 ABCD Andrew Schulz
1 ABCD Andreww 23
2 DEFG John boy
3 DEFG Johnn boy 14
4 CDGH Bob TANNA 13
5 ABCD. Peter Parker 45
6 DEFGH Clark Kent 25
So what I am looking is whether for the same id, I am getting similar entries, so I can remove those entries based on a threshold score values. Like here if i run it for col "fName". I should be able to reduce this dataframe to based on a score threshold:
_id fName lName age
0 ABCD Andrew Schulz 23
2 DEFG John boy 14
4 CDGH Bob TANNA 13
5 ABCD Peter Parker 45
6 DEFG Clark Kent 25
I intend to use pyjarowinkler.
If I had two independent columns (without all the group by stuff) to check, this is how I use it.
df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'],df['name_2'])]
df = df[df['score'] > 0.87]
Can someone suggest a pythonic and fast way of doing this
UPDATE
So, I have tried using record linkage library for this. And I have ended up at a dataframe containing pair of indexes that are similar called 'matches'. Now I just want to basically combine the data.
# Indexation step
indexer = recordlinkage.Index()
indexer.block(left_on='_id')
candidate_links = indexer.index(df)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.string('fName', 'fName', method='jarowinkler', threshold=threshold, label='full_name')
features = compare_cl.compute(candidate_links, df)
# Classification step
matches = features[features.sum(axis=1) >= 1]
print(len(matches))
This is how matches looks:
index1 index2 fName
0 1 1.0
2 3 1.0
I need someone to suggest a way to combine the similar rows in a way that takes data from similar rows
just wanted to clear some doubts regarding your ques. Couldn't clear them in comments due to low reputation.
Like here if i run it for col "fName". I should be able to reduce this
dataframe to based on a score threshold:
So basically your function would return the DataFrame containing the first row in each group (by ID)? This will result in the above listed resultant DataFrame.
_id fName lName age
0 ABCD Andrew Schulz 23
2 DEFG John boy 14
4 CDGH Bob TANNA 13
I hope this code answer your question
r0 =['ABCD','Andrew','Schulz', '' ]
r1 =['ABCD','Andrew', '' , '23' ]
r2 =['DEFG','John' ,'boy' , '' ]
r3 =['DEFG','John' ,'boy' , '14' ]
r4 =['CDGH','Bob' ,'TANNA' , '13' ]
Rx =[r0,r1,r2,r3,r4]
print(Rx)
print()
Dict= dict()
for i in Rx:
if (Dict.__contains__(i[0]) == True):
if (i[2] != ''):
Dict[i[0]][2] = i[2]
if (i[3] != ''):
Dict[i[0]][3] = i[3]
else:
Dict[i[0]]=i
Rx[:] = Dict.values()
print(Rx)
I am lost with the 'score' part of your question, but if what you need is to fill the gaps in data with values from other rows and then drop the duplicates by id, maybe this can help:
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
First make sure that empty values are replaced with nulls. Then use fillna to 'back fill' the data. Then drop duplicates keeping the first occurrence of Id. fillna will fill the values from the next value found in the column, which may correspond to other Id, but since you will discard the duplicated rows, I believe drop_duplicates keeping the first occurrence will do the job. (This assumes that at least one value is provided in every column for every Id)
I've tested with this dataset and code:
data = [
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', 'Schulz'],
['AABBCC', 'Andrew', '', 23],
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', '',],
['DDEEFF', 'Karl', 'boy'],
['DDEEFF', 'Karl', ''],
['DDEEFF', 'Karl', '', 14],
['GGHHHH', 'John', 'TANNA', 13],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', 'Blob'],
['HLHLHL', 'Bob', 'Blob', 15],
['HLHLHL', 'Bob','', 15],
['JLJLJL', 'Nick', 'Best', 20],
['JLJLJL', 'Nick', '']
]
df = pd.DataFrame(data, columns=['Id', 'fName', 'lName', 'Age'])
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
Output:
Id fName lName Age
0 AABBCC Andrew Schulz 23.0
5 DDEEFF Karl boy 14.0
8 GGHHHH John TANNA 13.0
9 HLHLHL Bob Blob 15.0
14 JLJLJL Nick Best 20.0
Hope this helps and apologies if I misunderstood the question.
I´m trying to apply to my pandas dataframe something similar to R's tidyr::spread . I saw in some places people using pd.pivot but so far I had no success.
So in this example I have the following dataframe DF:
df = pd.DataFrame({'action_id' : [1,2,1,4,5],
'name': ['jess', 'alex', 'jess', 'cath', 'mary'],
'address': ['house', 'house', 'park', 'park', 'park'],
'date': [ '01/01', '02/01', '03/01', '04/01', '05/01']})
How does it look like:
Ok, so what I want is a multi-index pivot table having 'action_id' and 'name' as index, "spread" the address column and fill it with the 'date' column. So my df would look like this:
What I tryed to do was:
df.pivot(index = ['action_id', 'name'], columns = 'address', values = 'date')
And I got the error TypeError: MultiIndex.name must be a hashable type
Does anyone know what am I doing wrong?
You do not need to mention the index in pd.pivot
This will work
import pandas as pd
df = pd.DataFrame({'action_id' : [1,2,1,4,5],
'name': ['jess', 'alex', 'jess', 'cath', 'mary'],
'address': ['house', 'house', 'park', 'park', 'park'],
'date': [ '01/01', '02/01', '03/01', '04/01', '05/01']})
df = pd.concat([df, pd.pivot(data=df, index=None, columns='address', values='date')], axis=1) \
.reset_index(drop=True).drop(['address','date'], axis=1)
print(df)
action_id name house park
0 1 jess 01/01 NaN
1 2 alex 02/01 NaN
2 1 jess NaN 03/01
3 4 cath NaN 04/01
4 5 mary NaN 05/01
And to arrive at what you want, you need to do a groupby
df = df.groupby(['action_id','name']).agg({'house':'first','park':'first'}).reset_index()
print(df)
action_id name house park
0 1 jess 01/01 03/01
1 2 alex 02/01 NaN
2 4 cath NaN 04/01
3 5 mary NaN 05/01
Dont forget to accept the answer if it helped you
Another option:
df2 = df.set_index(['action_id','name','address']).date.unstack().reset_index()
df2.columns.name = None
I have a very large dataset, that looks like
df = pd.DataFrame({'B': ['john smith', 'john doe', 'adam smith', 'john doe', np.nan], 'C': ['indiana jones', 'duck mc duck', 'batman','duck mc duck',np.nan]})
df
Out[173]:
B C
0 john smith indiana jones
1 john doe duck mc duck
2 adam smith batman
3 john doe duck mc duck
4 NaN NaN
I need to create a ID variable, that is unique for every B-C combination. That is, the output should be
B C ID
0 john smith indiana jones 1
1 john doe duck mc duck 2
2 adam smith batman 3
3 john doe duck mc duck 2
4 NaN NaN 0
I actually dont care about whether the index starts at zero or not, and whether the value for the missing columns is 0 or any other number. I just want something fast, that does not take a lot of memory and can be sorted quickly.
I use:
df['combined_id']=(df.B+df.C).rank(method='dense')
but the output is float64 and takes a lot of memory. Can we do better?
Thanks!
I think you can use factorize:
df['combined_id'] = pd.factorize(df.B+df.C)[0]
print df
B C combined_id
0 john smith indiana jones 0
1 john doe duck mc duck 1
2 adam smith batman 2
3 john doe duck mc duck 1
4 NaN NaN -1
Making jezrael's answer a little more general (what if the columns were not string?), you can use this compact function:
def make_identifier(df):
str_id = df.apply(lambda x: '_'.join(map(str, x)), axis=1)
return pd.factorize(str_id)[0]
df['combined_id'] = make_identifier(df[['B','C']])
jezrael's answer is great. But since this is for multiple columns, I prefer to use .ngroup() since this way NaN could remain NaN.
df['combined_id'] = df.groupby(['B', 'C'], sort = False).ngroup()
df