Find index value of a dataframe by comparing with another series - python

I am having a problem extracting index values from a DataFrame by comparing one of its columns with a list.
list = ['a', 'b', 'c', 'd']
DataFrame (comparing the list with column X):
X Y Z
0 a r t
1 e t y
2 c f h
3 d r t
4 b g q
This should return the index values like:
X
0 a
4 b
2 c
3 d
I tried this method:
z=dataframe.loc[(dataframe['X'] == list)]

You should use isin as you are comparing to a list of elements:
dataframe = pd.DataFrame(columns = ['X','Y','Z'])
dataframe['X'] = ['a','e','c','d','b']
dataframe['Y'] = ['r','t','f','r','g']
dataframe['Z'] = ['t','y','h','y','k']
mylist = ['a','b','c','d']
(Always post a way to create your dataframe in your question; it makes answering faster.)
dataframe[dataframe['X'].isin(mylist)].X
0 a
2 c
3 d
4 b
Name: X, dtype: object
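If you also want the rows ordered by value, as in the expected output in the question, one option (using the same dataframe built above) is to sort the result of that expression:
dataframe[dataframe['X'].isin(mylist)].X.sort_values()
# 0    a
# 4    b
# 2    c
# 3    d
# Name: X, dtype: object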

You need to use isin:
Make sure your list is a list of strings, then use dropna to get rid of unwanted rows and columns.
list = ['a','b','c','d']
df[df.isin(list)].dropna(how='all').dropna(axis=1)
Or, if you only wanted to compare with column X:
df.X[df.X.isin(list)]
Output:
X
0 a
2 c
3 d
4 b
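(This snippet assumes df already exists; a minimal construction from the question's data might look like:)
import pandas as pd
df = pd.DataFrame({'X': ['a', 'e', 'c', 'd', 'b'],
                   'Y': ['r', 't', 'f', 'r', 'g'],
                   'Z': ['t', 'y', 'h', 't', 'q']})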

Related

Identify rows in dataframe with value that exists in list column of another dataframe and is duplicate

I have a pandas dataframe with ~2.5M rows. Each row has a location identifier (one of ~2000 locations). I want to drop duplicates, keeping the first value, of any row that has the same attributes and is within a distance of X miles of another location.
After initially filtering for duplicate values, the dataframe looks similar to the following:
index location_id attr1 attr2
1 1 A Z
2 1 B Y
3 2 A Z
4 3 B Y
5 5 A Z
6 5 B Y
I have another dataframe that contains the location ids and, for each, a list of locations considered "nearby", which looks like this:
location_id locations
1 [3,5]
2 []
3 [1,5]
5 [1,3]
How can I filter out rows 4,5, and 6 from the dataframe without filtering rows 2 and 3, using the nearby location dataframe?
EDIT: The expected output is as follows:
index location_id attr1 attr2
1 1 A Z
2 1 B Y
3 2 A Z
To anyone trying to solve a similar issue in the future, I solved it using the following steps.
# Collect the unique location_ids for each (attr1, attr2) combination
grouped = duplicate_df.groupby(["attr1", "attr2"])["location_id"].apply(lambda x: x.unique())
grouped_df = grouped.to_frame().reset_index()
grouped_df = grouped_df[grouped_df["location_id"].apply(lambda x: len(x) > 1)]
grouped_df.reset_index(drop=True, inplace=True)
grouped_df["LocationList"] = grouped_df["location_id"]
gdf_exploded = grouped_df.explode("location_id")

# nearly_location_list is the nearby-locations dataframe mentioned above
def get_overlap_set(location_id, location_list):
    try:
        nearby = nearly_location_list[nearly_location_list["location_id"] == location_id]["locations"].values[0]
        return list(set(location_list).intersection(set(nearby))) + [location_id]
    except IndexError:
        return [location_id]

gdf_exploded["overlap"] = gdf_exploded.apply(lambda x: get_overlap_set(x["location_id"], x["LocationList"]), axis=1)
gdf_exploded["overlap"] = gdf_exploded["overlap"].apply(lambda x: sorted(x))
gdf_exploded.reset_index(drop=True, inplace=True)
gdf_exploded["overlap_string"] = gdf_exploded["overlap"].apply(lambda x: ",".join(map(str, x)))
dedupped_df = gdf_exploded.drop_duplicates(subset=["attr1", "attr2", "overlap_string"])
The code groups the rows by their attributes and then uses set intersections to determine where the locations overlap.
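As a minimal illustration of that intersection step (hypothetical values taken from the sample data above): in the (A, Z) group the location_ids are [1, 2, 5], and location 1's nearby list is [3, 5], so:
sorted(list({1, 2, 5} & {3, 5}) + [1])  # -> [1, 5]
# location 5 in the same group also yields [1, 5], so its row is dropped as a duplicate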

dictionary values to excel columns

I would like to convert a dictionary of key-value pairs to an Excel file, where each value in a dictionary entry is matched against the column of that name and counted.
For example:
I have an excel file with column names as:
a,b,c,d,e,f,g and h.
I have a dictionary like:
{1:['c','d'],2:['a','h'],3:['a','b','b','f']}.
I need the output to be:
   a  b  c  d  e  f  g  h
1        1  1
2  1                    1
3  1  2           1
The 1, 2, 3 are the keys from the dictionary.
The rest of the columns could be either 0 or null.
I have tried splitting the dictionary and am getting
1 = ['c','d']
2 = ['a','h']
3 = ['a','b','b','f']
but I don't know how to match these lists against the columns of the Excel file.
Your problem can be solved with pandas and collections (there may exist a more efficient solution):
import pandas as pd
from collections import Counter
d = {...} # Your dictionary
series = pd.Series(d) # Convert the dict into a Series
counts = series.apply(Counter) # Count items row-wise
counts = counts.apply(pd.Series) # Convert the counters to Series
table = counts.fillna(0).astype(int) # Fill the gaps and make the counts integer
print(table)
# a b c d f h
1 0 0 1 1 0 0
2 1 0 0 0 0 1
3 1 2 0 0 1 0
It is not clear what type of output you expect, so I leave it to you to convert the DataFrame to the output of your choice.
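If you do want all columns a through h and an actual Excel file, a possible follow-up (a sketch, assuming the openpyxl engine is available for to_excel; 'output.xlsx' is just an example filename) is:
table = table.reindex(columns=list('abcdefgh'), fill_value=0)  # add the missing e and g columns as 0
table.to_excel('output.xlsx')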
A simple solution based only on standard lists and dictionaries. It generates a 2D list, which is then easy to convert into a CSV file that can be loaded by Excel.
d = {1:['c','d'],2:['a','h'],3:['a','b','b','f']}
cols = dict((c,n) for n,c in enumerate('abcdefgh'))
rows = dict((k, n) for n, k in enumerate(d))  # the keys of d are the ints 1, 2, 3, not strings
table = [[0 for col in cols] for row in rows]
for row, values in d.items():
    for col in values:
        table[rows[row]][cols[col]] += 1
print(table)
# output:
# [[0,0,1,1,0,0,0,0], [1,0,0,0,0,0,0,1], [1,2,0,0,0,1,0,0]]
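As a sketch of that CSV step (reusing the table, cols and rows objects defined above; 'output.csv' is just an example filename):
import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow([''] + sorted(cols, key=cols.get))  # header row: a..h
    for key in rows:
        writer.writerow([key] + table[rows[key]])       # one row per dictionary key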

Convert rows of pandas Dataframe into an iterable list of strings

Imagine I have a dataframe, df, that has 2 columns, a USER_ID and a PRODUCT they bought.
df
USER_ID | PRODUCT
1 a
1 b
1 c
2 d
2 a
2 k
I want to convert this DataFrame into a new one, df2, where each row is one user and the products are aggregated into a list of strings.
df2
USER_ID | PRODUCT
1 [a,b,c]
2 [d,a,k]
Finally, I’d like to be able to find the intersection between two user’s PRODUCT lists.
I’m able to create the second dataframe, but the method I’m using results in a list that is somehow not iterable.
Specifically I do:
df2 = df1.groupby('USER_ID')['PRODUCT'].agg(lambda x: x.tolist())
which gives me a series that I convert back to a dataframe.
df2 = df2.to_frame()
which gives me the df2 I'm looking for, but each list of products has length = 1, so I can't compare one to another to find the intersection of products. For instance, when I execute:
s1 = df2.PRODUCT[df2.USER_ID == 1]
s2 = df2.PRODUCT[df2.USER_ID == 2]
common_elements = list(set(s1).intersection(set(s2)))
common_elements
The result is an empty list rather than [a]. What am I doing wrong?
Try this:
df3 = pd.crosstab(df2.PRODUCT,df2.USER_ID, margins= True)
print(df3[df3['All'] > 1])
# USER_ID 1 2 All
# PRODUCT
# a 1 1 2
# All 3 3 6
My solution is very similar to @Nikil's, so using his:
df2 = df.groupby('USER_ID')['PRODUCT'].apply(list)
df2 = df2.reset_index()
print(df2)
# USER_ID PRODUCT
# 0 1 [a, b, c]
# 1 2 [d, a, k]
More info on crosstab (it returns a DataFrame):
pd.crosstab(df2.PRODUCT,df2.USER_ID, margins= True)
# USER_ID 1 2 All
# PRODUCT
# a 1 1 2
# b 1 0 1
# c 1 0 1
# d 0 1 1
# k 0 1 1
# All 3 3 6
You can perform groupby followed by finding intersection between the two lists as shown:
>>>df2 = df.groupby('USER_ID')['PRODUCT'].apply(list).reset_index()
>>>df2
USER_ID PRODUCT
0 1 [a, b, c]
1 2 [d, a, k]
>>>list(set(df2['PRODUCT'].loc[0]).intersection(df2['PRODUCT'].loc[1]))
['a']
Or, in a shorter way:
df2 = df.groupby('USER_ID')['PRODUCT'].apply(list)
>>>list(set(df2.loc[1]).intersection(df2.loc[2]))
['a']
Is that what you want?
In [7]: pd.Series(np.intersect1d(df.loc[df.USER_ID == 1, 'PRODUCT'], df.loc[df.USER_ID == 2, 'PRODUCT']))
Out[7]:
0 a
dtype: object
or using index.intersection():
In [18]: (df.set_index('PRODUCT').query('USER_ID == 1').index
....: .intersection(df.set_index('PRODUCT').query('USER_ID == 2').index)
....: .to_series()
....: )
Out[18]:
PRODUCT
a a
Name: PRODUCT, dtype: object
PS: I wouldn't convert your df into df2, as you will most probably have a lot of difficulties with this data model (I mean having lists in columns).
This will give you a generalizable solution to find the intersection of any two users' product lists without the sloppy second dataframe:
from collections import defaultdict

user1 = 1
user2 = 2
products = defaultdict(set)
for record in df.to_dict('records'):
    products[record['USER_ID']].add(record['PRODUCT'])
common_elements = products[user1].intersection(products[user2])
print(common_elements)
And then, if you wanted all intersections for all pairs of users:
from itertools import combinations
common_elements = {(x,y): products[x].intersection(products[y]) for x,y in combinations(products.keys(),2)}
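With the sample data above, this should give something like:
print(common_elements)
# {(1, 2): {'a'}}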

Counting differences from the consensus in each row via Pandas

I have a DataFrame that looks like this:
import pandas as pd
df = pd.DataFrame({'A':['a','b','c','d'],'B':['a','b','c','x'],'C':['y','b','c','d']})
df
A B C
0 a a y
1 b b b
2 c c c
3 d x d
I want to identify the most common character in each row, and total the number of differences from the consensus:
A B C Consensus
0 a a y a
1 b b b b
2 c c c c
3 d x d d
Total 0 1 1 0
Running through loops is one approach, but it seems inefficient:
consensus = []
for idx in df.index:
    consensus.append(df.loc[idx].value_counts().index[0])
df['Consensus'] = consensus
(and so on)
Is there a straightforward way to get the consensus and count differences from it?
You could use the mode to get the consensus value:
>>> df.mode(axis=1)
0
0 a
1 b
2 c
3 d
Note the caveats in the docs though:
Gets the mode(s) of each element along the axis selected. Empty if nothing has 2+ occurrences. Adds a row for each mode per label, fills in gaps with nan.
Note that there could be multiple values returned for the selected axis (when more than one item share the maximum frequency), which is the reason why a dataframe is returned. If you want to impute missing values with the mode in a dataframe df, you can just do this: df.fillna(df.mode().iloc[0])
To count the differences from the consensus for each column you could compare with ne and then sum:
>>> df['consensus'] = df.mode(axis=1)
>>> df.loc[:, 'A':'C'].ne(df['consensus'], axis=0).sum(axis=0)
A 0
B 1
C 1
dtype: int64
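Given the caveat about multiple modes, a slightly more defensive variant (a sketch, not part of the original answer) is to take the first mode column explicitly before comparing:
df['consensus'] = df.mode(axis=1).iloc[:, 0]
df.loc[:, 'A':'C'].ne(df['consensus'], axis=0).sum(axis=0)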

Vectorized update to pandas DataFrame?

I have a dataframe for which I'd like to update a column with some values from an array. The array is a different length to the dataframe, but I have the indices of the dataframe rows that I'd like to update.
I can do this with a loop through the rows (below), but I expect there is a much more efficient, vectorized way to do this; I can't seem to get the syntax correct.
In the example below I just fill the column with nan and then use the indices directly through a loop.
df['newcol'] = np.nan
j = 0
for i in update_idx:
    df['newcol'][i] = new_values[j]
    j += 1
If you have a list of indices already, then you can use loc to perform label (row) selection. You can pass the new column name; where your existing rows are not selected, these will have NaN assigned:
df.loc[update_idx, 'new_col'] = new_value
Example:
In [4]:
df = pd.DataFrame({'a':np.arange(5), 'b':np.random.randn(5)}, index = list('abcde'))
df
Out[4]:
a b
a 0 1.800300
b 1 0.351843
c 2 0.278122
d 3 1.387417
e 4 1.202503
In [5]:
idx_list = ['b','d','e']
df.loc[idx_list, 'c'] = np.arange(3)
df
Out[5]:
a b c
a 0 1.800300 NaN
b 1 0.351843 0
c 2 0.278122 NaN
d 3 1.387417 1
e 4 1.202503 2
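If update_idx holds positional row numbers rather than index labels (as in the original loop), one option (a sketch, assuming new_values has the same length as update_idx) is to translate the positions into labels first:
df.loc[df.index[update_idx], 'newcol'] = new_values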
