I have a data frame df1 with one of the columns being "values". It looks like -
values
['acd3f','rt5gh8','5ty7e']
['rt5gh8','t67ui']
I have another dataframe df2 which contains two columns '0' and '1', with values like -
0       1
acd3f   I am cool
rt5gh8  I am not cool
5ty7e   ok_sir
t67ui   no_sir
I want to modify df1 to add a new column "value_names", which should look like -
values                      value_names
['acd3f','rt5gh8','5ty7e']  ['I am cool','I am not cool','ok_sir']
['rt5gh8','t67ui']          ['I am not cool','no_sir']
I am trying the below code -
df1['value_names'] = df1['values'].replace(df2.set_index('0')['1'].dropna())
It doesn't seem to work and gives me an error -
KeyError: '1'
Note:
What I had before, instead of df2, was a list containing the mapping. I converted that list to the data frame df2, and the column names "0" and "1" in df2 were assigned automatically.
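The full traceback isn't shown, so this is only a guess, but when pandas builds a frame from a plain list of pairs it assigns the integer labels 0 and 1 rather than the strings '0' and '1', which would make any string-based lookup fail. A minimal sketch of that check (the pair data here is reconstructed from the example above):
import pandas as pd

# assumption: df2 was built from a plain list of (key, value) pairs
pairs = [('acd3f', 'I am cool'), ('rt5gh8', 'I am not cool'),
         ('5ty7e', 'ok_sir'), ('t67ui', 'no_sir')]
df2 = pd.DataFrame(pairs)

print(df2.columns.tolist())   # [0, 1] -- integer labels, not the strings '0'/'1'
lookup = df2.set_index(0)[1]  # integer labels work here
Note that even with the right labels, Series.replace compares whole cell values, and a cell holding a list never equals a single key, so the lists would not be translated element by element; that is why the answers below build a plain dict and loop over each list.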
Create a dictionary (mapping) of keys to their mapped values from df2 (column 0 holds the keys and column 1 their corresponding values). Then use a nested list comprehension to look up each value and attach the result to df1 with assign.
df1 = pd.DataFrame({'values': [['acd3f','rt5gh8','5ty7e'], ['rt5gh8','t67ui']]})
df2 = pd.DataFrame({0: ['acd3f', 'rt5gh8', '5ty7e', 't67ui'],
                    1: ["I am cool", "I am not cool", "ok_sir", "no_sir"]})
mapping = {k: v for k, v in zip(df2[0], df2[1])}
>>> df1.assign(value_names=[[mapping.get(val) for val in sublist]
                            for sublist in df1['values']])
values value_names
0 [acd3f, rt5gh8, 5ty7e] [I am cool, I am not cool, ok_sir]
1 [rt5gh8, t67ui] [I am not cool, no_sir]
A simpler version (imo) of Alexander's code:
In [484]: mapping = dict(df2.values[:, :2])
In [485]: df1.assign(value_names=df1['values'].apply(lambda x: [mapping[k] for k in x]))
Out[485]:
values value_names
0 [acd3f, rt5gh8, 5ty7e] [I am cool, I am not cool, ok_sir]
1 [rt5gh8, t67ui] [I am not cool, no_sir]
You can create the mapping from the 2D NumPy array returned by df2.values.
Then use DataFrame.assign to build the value_names column of lists.
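If the pandas version in use has Series.explode (0.25 or newer, an assumption here), a third variant avoids the explicit Python loop entirely: explode the lists, map each element, and regroup on the original index. A sketch:
mapping = dict(zip(df2[0], df2[1]))

# explode repeats the original row index for every list element,
# so grouping on the index (level=0) reassembles one list per row
df1['value_names'] = (df1['values']
                      .explode()
                      .map(mapping)
                      .groupby(level=0)
                      .agg(list))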
Related
I'm new to pandas and I want to know if there is a way to map a column of lists in a dataframe to values stored in a dictionary.
Let's say I have the dataframe 'df' and the dictionary 'dic'. I want to create a new column named 'Description' in the dataframe where I can see the description of each code shown. The values in the new column should be stored in lists as well.
import pandas as pd
data = {'Codes':[['E0'],['E0','E1'],['E3']]}
df = pd.DataFrame(data)
dic = {'E0':'Error Code', 'E1':'Door Open', 'E2':'Door Closed'}
Most efficient would be to use a list comprehension.
df['Description'] = [[dic.get(x, None) for x in l] for l in df['Codes']]
output:
Codes Description
0 [E0] [Error Code]
1 [E0, E1] [Error Code, Door Open]
2 [E3] [None]
If needed, you can post-process the result to replace the empty lists with NaN. To avoid non-matches altogether, use an alternative list comprehension: [[dic[x] for x in l if x in dic] for l in df['Codes']], but this can become ambiguous when only one of several codes in a row has no match (you can no longer tell which one is missing).
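A minimal sketch of that post-processing step, combining the filtered comprehension above with a pass that blanks out rows where nothing matched (np.nan is assumed here as the placeholder):
import numpy as np

# keep only codes that exist in the dictionary, then blank out empty results
desc = [[dic[x] for x in l if x in dic] for l in df['Codes']]
df['Description'] = [lst if lst else np.nan for lst in desc]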
Say I have a list:
mylist = ['a','b','c']
and a Pandas dataframe (df) that has a column named "rating". How can I get the count for number of occurrence of a rating while iterating my list? For example, here is what I need:
for item in mylist:
    # Do a bunch of stuff in here that takes a long time
    # want to do a print statement below to show progress
    # print(df['rating'].value_counts().a)  <- I can do this,
    # but want to use the variable 'item'
    # print(df['rating'].value_counts().item)  <- Or something like this
I know I can get counts for all distinct values of 'rating', but that is not what I am after.
If you must do it this way, you can use .loc to filter the df prior to getting the size of the resulting df.
mylist = ['a','b','c']
df = pd.DataFrame({'rating':['a','a','b','c','c','c','d','e','f']})
for item in mylist:
    print(item, df.loc[df['rating'] == item].size)
Output
a 2
b 1
c 3
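One note on .size: it counts cells (rows times columns) of the filtered frame, which only equals the row count here because there is a single column. A sketch of a variant that counts matching rows directly, whatever the shape of df:
for item in mylist:
    # summing the boolean mask counts only the rows where the rating matches
    print(item, (df['rating'] == item).sum())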
Instead of thinking about this problem as one of going "from the list to the Dataframe" it might be easiest to flip it around:
mylist = ['a','b','c']
df = pd.DataFrame({'rating':['a','a','b','c','c','c','d','e','f']})
ValueCounts = df['rating'].value_counts()
ValueCounts[ValueCounts.index.isin(mylist)]
Output:
c 3
a 2
b 1
Name: rating, dtype: int64
You don't even need a for loop, just do:
df['rating'].value_counts()[mylist]
Or to make it a dictionary:
df['rating'].value_counts()[['a', 'b', 'c']].to_dict()
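One caveat: if an item in the list never occurs in the column, indexing value_counts() with it either raises a KeyError or fills in NaN, depending on the pandas version. A small sketch that guards against that with reindex:
# items missing from the column get a count of 0 instead of raising
df['rating'].value_counts().reindex(mylist, fill_value=0)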
I have two dataframes, each with two columns. The rows are value pairs, where order is not important: a-b == b-a for me. I need to compare these value pairs between the two dataframes.
I have a solution, but it is terribly slow for dataframes with 300k rows.
import pandas as pd
df1 = pd.DataFrame({"col1" : [1,2,3,4], "col2":[2,1,5,6]})
df2 = pd.DataFrame({"col1" : [2,1,3,4], "col2":[1,9,8,9]})
mysets = [{x[0],x[1]} for x in df1.values.tolist()]
df1sets = []
for element in mysets:
    if element not in df1sets:
        df1sets.append(element)
mysets = [{x[0],x[1]} for x in df2.values.tolist()]
df2sets = []
for element in mysets:
    if element not in df2sets:
        df2sets.append(element)
intersect_sets = [x for x in df1sets if x in df2sets]
This works, but it is terribly slow, and there must be an easier way to do it. One of my problems is that I cannot add a set to a set: I cannot create {{1,2}, {2,3}}, etc.
A pandas solution is to merge on the row-wise sorted values of the columns, remove duplicates, and convert the result to sets:
import numpy as np

intersect_sets = [set(x) for x in pd.DataFrame(np.sort(df1.to_numpy(), axis=1))
                                    .merge(pd.DataFrame(np.sort(df2.to_numpy(), axis=1)))
                                    .drop_duplicates()
                                    .to_numpy()]
print (intersect_sets)
[{1, 2}]
Another idea is a set of frozensets (a frozenset is hashable, so it can be a set element, which solves the "set of sets" problem):
intersect_sets = (set([frozenset(x) for x in df1.to_numpy()]) &
                  set([frozenset(x) for x in df2.to_numpy()]))
print (intersect_sets)
{frozenset({1, 2})}
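If the intersection is needed as a two-column frame again rather than a set of frozensets, a small follow-up sketch (this assumes every pair has two distinct values, since a frozenset collapses duplicates, and sorting loses which value originally sat in col1 versus col2):
common = pd.DataFrame([sorted(s) for s in intersect_sets], columns=['col1', 'col2'])
print(common)
#    col1  col2
# 0     1     2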
I am trying to take a list of lists and transform it into a dataframe such that the dataframe has only one column and each sublist takes one row in the dataframe. Below is an image of what I have attempted, but each word within each sublist is being put into a different column.
[image: current dataframe]
Essentially, I want a table that looks like this:
[image: how I want the dataframe to look]
How about something like this, using a list comprehension:
import pandas as pd
data = [[1,2,3], [4,5,6]]
# the list comprehension loops over each sublist i in data
# and joins every element j of i into one string,
# so the end result is one string per row
pd.DataFrame([' '.join(str(j) for j in i) for i in data], columns=['Review'])
  Review
0  1 2 3
1  4 5 6
Here you go.
import pandas as pd
data = [['a b'], ['c d']]  # assuming each sublist holds one review string
data = [i[0] for i in data]  # flatten into one list of strings
df = pd.DataFrame({'review':data})
print(df)
Output:
review
0 a b
1 c d
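If the goal is literally one column with each sublist kept intact as a list (rather than joined into a string), passing the list of lists in through a dict also works; a sketch with the same toy data as above:
import pandas as pd

data = [[1, 2, 3], [4, 5, 6]]
# wrapping the data in a dict makes pandas treat each sublist as a single cell
df = pd.DataFrame({'Review': data})
print(df)
#       Review
# 0  [1, 2, 3]
# 1  [4, 5, 6]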
I have a large data set with a column that contains personal names; in total there are 60 unique names according to value_counts(). I don't want to show those names when I analyze the data; instead I want to rename them to participant_1, ..., participant_60.
I also want to assign the new names in alphabetical order so that I will be able to find out later who participant_1 is.
I started by creating a list of new names:
newnames = [f"participant_{i}" for i in range(1,61)]
Then I tried to use df.replace:
df.replace('names', 'newnames')
However, I don't know where to specify that participant_1 should replace the name that comes first in alphabetical order. Any suggestions or better solutions?
If you need to replace the values in the column in alphabetical order, use Categorical.codes:
df = pd.DataFrame({
    'names': list('bcdada'),
})
df['new'] = [f"participant_{i}" for i in pd.Categorical(df['names']).codes + 1]
#alternative solution
#df['new'] = [f"participant_{i}" for i in pd.CategoricalIndex(df['names']).codes + 1]
print (df)
names new
0 b participant_2
1 c participant_3
2 d participant_4
3 a participant_1
4 d participant_4
5 a participant_1
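To cover the second part of the question (finding out later who participant_1 is), the categories of the Categorical are already sorted alphabetically, so the reverse lookup falls out directly; a small sketch building on the frame above:
cats = pd.Categorical(df['names']).categories  # alphabetically sorted unique names
lookup = {f"participant_{i}": name for i, name in enumerate(cats, start=1)}
print(lookup)
# {'participant_1': 'a', 'participant_2': 'b', 'participant_3': 'c', 'participant_4': 'd'}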
Use rename:
df.rename({'old_column_name': 'new_column_name', ...}, axis=1, inplace=True)
You can generate the mapping using a dict comprehension like this -
mapper = {k: v for (k,v) in zip(sorted(df.columns), newnames)}
If I understood correctly, you want to replace column values, not column names.
Create a dict of old names to new names; then you can use df.replace:
import pandas as pd
df = pd.DataFrame()
df['names'] = ['sam','dean','jack','chris','mark']
x = ["participant_{}".format(i+1) for i in range(len(df))]
rep_dict = {k:v for k,v in zip(df['names'].sort_values(), x)}
print(df.replace(rep_dict))
Output:
names
0 participant_5
1 participant_2
2 participant_3
3 participant_1
4 participant_4
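One caveat with the zip above: if names repeat in the column (the question mentions 60 unique names in a large data set), the duplicates coming out of sort_values() shift the numbering, and the resulting labels no longer run from participant_1 upward. A sketch of the same idea built from the unique names instead:
# deduplicate first, then number the names in alphabetical order
unique_sorted = sorted(df['names'].unique())
rep_dict = {name: "participant_{}".format(i) for i, name in enumerate(unique_sorted, start=1)}
df['names'] = df['names'].map(rep_dict)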