I have a large data set with a column containing personal names; value_counts() shows 60 distinct names. I don't want to show those names when I analyze the data; instead I want to rename them to participant_1, ..., participant_60.
I also want to assign the new names in alphabetical order so that I can find out later who participant_1 is.
I started by creating a list of new names:
newnames = [f"participant_{i}" for i in range(1,61)]
Then I tried to use df.replace:
df.replace('names', 'newnames')
However, I don't know where to specify that participant_1 should replace the name that comes first in alphabetical order. Any suggestions or better solutions?
If you need to replace the column values in alphabetical order, use Categorical.codes:
df = pd.DataFrame({
'names':list('bcdada'),
})
df['new'] = [f"participant_{i}" for i in pd.Categorical(df['names']).codes + 1]
#alternative solution
#df['new'] = [f"participant_{i}" for i in pd.CategoricalIndex(df['names']).codes + 1]
print (df)
names new
0 b participant_2
1 c participant_3
2 d participant_4
3 a participant_1
4 d participant_4
5 a participant_1
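Since the asker also wants to recover who participant_1 is later, note that the sorted categories themselves give that mapping; a minimal sketch building on the answer above:

```python
import pandas as pd

df = pd.DataFrame({'names': list('bcdada')})
cat = pd.Categorical(df['names'])

df['new'] = [f"participant_{i}" for i in cat.codes + 1]

# cat.categories is sorted alphabetically, so position 0 is participant_1
mapping = {f"participant_{i + 1}": name
           for i, name in enumerate(cat.categories)}
print(mapping)  # {'participant_1': 'a', 'participant_2': 'b', ...}
```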
Use rename:
df.rename({'old_column_name': 'new_column_name', ...}, axis=1, inplace=True)
You can generate the mapping using a dict comprehension like this:
mapper = {k: v for k, v in zip(sorted(df.columns), newnames)}
If I understood correctly, you want to replace column values, not column names. Create a dict mapping old names to new names; then you can use df.replace:
import pandas as pd
df = pd.DataFrame()
df['names'] = ['sam','dean','jack','chris','mark']
x = ["participant_{}".format(i+1) for i in range(len(df))]
rep_dict = {k:v for k,v in zip(df['names'].sort_values(), x)}
print(df.replace(rep_dict))
Output:
names
0 participant_5
1 participant_2
2 participant_3
3 participant_1
4 participant_4
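As a variation on the answer above (my addition, not part of the original answer), Series.map with the same dictionary does a straight one-for-one substitution and is typically faster than DataFrame.replace:

```python
import pandas as pd

df = pd.DataFrame({'names': ['sam', 'dean', 'jack', 'chris', 'mark']})

# same alphabetical mapping as above, applied with Series.map
new = [f"participant_{i + 1}" for i in range(len(df))]
rep_dict = dict(zip(df['names'].sort_values(), new))
df['names'] = df['names'].map(rep_dict)
print(df['names'].tolist())
```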
Say I have a list:
mylist = ['a','b','c']
and a Pandas dataframe (df) that has a column named "rating". How can I get the count of occurrences of a rating while iterating over my list? For example, here is what I need:
for item in mylist:
# Do a bunch of stuff in here that takes a long time
# want to do print statement below to show progress
# print df['rating'].value_counts().a <- I can do this,
# but want to use variable 'item'
# print df['rating'].value_counts().item <- Or something like this
I know I can get counts for all distinct values of 'rating', but that is not what I am after.
If you must do it this way, you can use .loc to filter the df prior to getting the size of the resulting df.
mylist = ['a','b','c']
df = pd.DataFrame({'rating':['a','a','b','c','c','c','d','e','f']})
for item in mylist:
    print(item, df.loc[df['rating'] == item].size)
Output
a 2
b 1
c 3
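A variant worth knowing (my addition, not from the original answer): summing the boolean mask counts matches without materializing a filtered frame:

```python
import pandas as pd

mylist = ['a', 'b', 'c']
df = pd.DataFrame({'rating': ['a', 'a', 'b', 'c', 'c', 'c', 'd', 'e', 'f']})

for item in mylist:
    # True counts as 1, so summing the comparison counts matches directly
    print(item, (df['rating'] == item).sum())
```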
Instead of thinking about this problem as one of going "from the list to the Dataframe" it might be easiest to flip it around:
mylist = ['a','b','c']
df = pd.DataFrame({'rating':['a','a','b','c','c','c','d','e','f']})
ValueCounts = df['rating'].value_counts()
ValueCounts[ValueCounts.index.isin(mylist)]
Output:
c 3
a 2
b 1
Name: rating, dtype: int64
You don't even need a for loop, just do:
df['rating'].value_counts()[mylist]
Or to make it a dictionary:
df['rating'].value_counts()[['a', 'b', 'c']].to_dict()
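One caveat with indexing value_counts() by a list: in recent pandas versions, a label that never occurs in the column raises a KeyError. reindex sidesteps this by filling missing labels; a small sketch:

```python
import pandas as pd

mylist = ['a', 'b', 'z']  # 'z' never occurs in the data
df = pd.DataFrame({'rating': ['a', 'a', 'b', 'c', 'c', 'c', 'd', 'e', 'f']})

# reindex tolerates absent labels, filling their count with 0
counts = df['rating'].value_counts().reindex(mylist, fill_value=0)
print(counts.to_dict())  # {'a': 2, 'b': 1, 'z': 0}
```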
I want an empty column in pandas. For example, data['dict']. I want every element in this column to be an empty dictionary. For example:
>>> data['dict']
{}
{}
{}
{}
How can I write this? Thank you very much.
Use a list comprehension.
For existing DataFrame:
df['dict'] = [{} for _ in range(len(df))]
For new object:
pd.DataFrame([{} for _ in range(100)])
One caution is that you lose some of the abilities of Pandas to vectorize operations when you use a complex Pandas data structure inside each (row, column) cell.
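To see why the comprehension matters, compare it with the tempting shortcut [{}] * len(df), which repeats one shared dict object; a quick demonstration of the pitfall:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

# WRONG: [{}] * len(df) repeats the *same* dict object, so mutating
# one row's dict silently mutates them all
shared = [{}] * len(df)
shared[0]['k'] = 1
print(shared)  # [{'k': 1}, {'k': 1}, {'k': 1}]

# RIGHT: the comprehension builds a fresh dict for every row
df['dict'] = [{} for _ in range(len(df))]
df['dict'].iloc[0]['k'] = 1
print(df['dict'].tolist())  # [{'k': 1}, {}, {}]
```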
To avoid every row sharing the same dictionary object (the copy problem) when assigning the values, you can use apply, which creates a fresh dict per row:
df['dict'] = df.apply(lambda x: {}, axis=1)
df
Out[730]:
0 1 2 dict
0 a b c {}
1 a NaN b {}
2 NaN t a {}
3 a d b {}
I have an Ordered Dictionary, where the keys are the worksheet names and the values contain the worksheet items. Thus, the question: How do I use each of the keys and convert to an individual dataframe?
import pandas as pd

powerbipath = 'PowerBI_Ingestion.xlsx'
dfs = pd.read_excel(powerbipath, None)

values = []
for idx, eachdf in enumerate(dfs):
    eachdf = dfs[eachdf]
    new_list1.append(eachdf)
    eachdf = pd.DataFrame(new_list1[idx])
Examples I have seen only show how to convert from an ordered dictionary to 1 pandas dataframe. I want to convert to multiple dataframes. Thus, if there are 5 keys, there will be 5 dataframes.
You may want to do something like this (assuming your dictionary looks like d):
d = {'first': [1, 2], 'second': [3, 4]}
for i in d:
    df = pd.DataFrame(d.get(i), columns=[i])
    print(df)
Output looks like :
first
0 1
1 2
second
0 3
1 4
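Applied to the original question, a dict comprehension yields one DataFrame per key, each retrievable later by sheet name. A minimal sketch with made-up sheet names (note that pd.read_excel(path, None) already returns a dict of DataFrames, so in that exact case no conversion is needed):

```python
import pandas as pd
from collections import OrderedDict

# stand-in for an ordered mapping of sheet name -> sheet contents
# (sheet names here are made up for illustration)
dfs = OrderedDict([('Sheet1', {'a': [1, 2]}),
                   ('Sheet2', {'b': [3, 4]})])

# one DataFrame per key, kept in a dict so each can be looked up by name
frames = {name: pd.DataFrame(data) for name, data in dfs.items()}
print(frames['Sheet1'])
```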
Here is a basic answer using one of these ideas:
keys = df["key_column"].unique()
df_array = {}
for k in keys:
    df_array[k] = df[df['key_column'] == k]
There might be a more efficient way to do it, though.
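A possibly more efficient variant of this idea is to let groupby do the split in a single pass; a small sketch with a hypothetical key_column:

```python
import pandas as pd

# hypothetical frame with a key_column to split on
df = pd.DataFrame({'key_column': ['x', 'x', 'y'], 'val': [1, 2, 3]})

# groupby partitions the frame in one pass instead of filtering once per key
df_array = {k: sub for k, sub in df.groupby('key_column')}
print(df_array['y'])
```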
I have a data frame df1 with one of the columns being "values". It looks like -
values
['acd3f','rt5gh8','5ty7e']
['rt5gh8','t67ui']
I have another dataframe df2 which contains two columns '0' and '1', with values like -
0 1
acd3f I am cool
rt5gh8 I am not cool
5ty7e ok_sir
t67ui no_sir
I want to modify df1 to add a new column "value_names", which should look like -
values value_names
['acd3f','rt5gh8','5ty7e'] ['I am cool','I am not cool','ok_sir']
['rt5gh8','t67ui'] ['I am not cool','no_sir']
I am trying the below code -
df1['value_names'] = df1['values'].replace(df2.set_index('0')['1'].dropna())
It doesn't seem to work and gives me an error -
KeyError: '1'
Note:
Basically, what I had before instead of df2 was a list with mapping. I converted that to data frame df2 and these column names "0" and "1" in df2 are automatically assigned.
Create a dictionary (mapping) of keys to their mapped values from df2 (column 0 holds the keys and column 1 their corresponding values).
Then use a nested list comprehension to look up the values and append them to df1 using assign.
df1 = pd.DataFrame({'values': [['acd3f','rt5gh8','5ty7e'], ['rt5gh8','t67ui']]})
df2 = pd.DataFrame({0: ['acd3f', 'rt5gh8', '5ty7e', 't67ui'],
1: ["I am cool", "I am not cool", "ok_sir", "no_sir"]})
mapping = {k: v for k, v in zip(df2[0], df2[1])}
>>> df1.assign(value_names=[[mapping.get(val) for val in sublist]
for sublist in df1['values'] ])
values value_names
0 [acd3f, rt5gh8, 5ty7e] [I am cool, I am not cool, ok_sir]
1 [rt5gh8, t67ui] [I am not cool, no_sir]
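One detail in the comprehension above: mapping.get quietly yields None for IDs missing from df2, whereas plain mapping[k] indexing would raise a KeyError. A tiny illustration (with a made-up missing key):

```python
mapping = {'acd3f': 'I am cool', 'rt5gh8': 'I am not cool'}

# .get returns None for keys absent from the mapping instead of raising
names = [mapping.get(v) for v in ['acd3f', 'zzz']]
print(names)  # ['I am cool', None]
```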
A simpler version (imo) of Alexander's code:
In [484]: mapping = dict(df2.values[:, :2])
In [485]: df1.assign(value_names=df1['values'].apply(lambda x: [mapping[k] for k in x]))
Out[485]:
values value_names
0 [acd3f, rt5gh8, 5ty7e] [I am cool, I am not cool, ok_sir]
1 [rt5gh8, t67ui] [I am not cool, no_sir]
You can create a mapping from the 2D np array retrieved using df2.values.
Then, use df.assign to create the value_names list.
Is it possible to select the negation of a given list from a pandas dataframe? For instance, say I have the following dataframe
T1_V2 T1_V3 T1_V4 T1_V5 T1_V6 T1_V7 T1_V8
1 15 3 2 N B N
4 16 14 5 H B N
1 10 10 5 N K N
and I want to get out all columns but column T1_V6. I would normally do that this way:
df = df[["T1_V2","T1_V3","T1_V4","T1_V5","T1_V7","T1_V8"]]
My question is whether there is a way to do this the other way around, something like this:
df = df[!["T1_V6"]]
Do:
df[df.columns.difference(["T1_V6"])]
Notes from comments:
This will sort the columns. If you don't want them sorted, call difference with sort=False.
difference won't raise an error if the dropped column name doesn't exist. If you want an error when the column doesn't exist, use drop as suggested in other answers: df.drop(columns=["T1_V6"])
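The sorting behaviour noted above can be checked directly; a small sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=['b', 'c', 'a'])

# default: the remaining labels come back sorted
print(df.columns.difference(['c']).tolist())              # ['a', 'b']

# sort=False keeps the original column order
print(df.columns.difference(['c'], sort=False).tolist())  # ['b', 'a']
```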
For completeness, you can also easily use drop for this:
df.drop(["T1_V6"], axis=1)
Another way to exclude columns that you don't want:
df[df.columns[~df.columns.isin(['T1_V6'])]]
I would suggest using DataFrame.drop():
columns_to_exclude = ['T1_V6']
old_dataframe = ...  # has all columns
new_dataframe = old_dataframe.drop(columns_to_exclude, axis=1)
You can use inplace to make the changes to the original dataframe itself:
old_dataframe.drop(columns_to_exclude, axis=1, inplace=True)
# old_dataframe is changed
You can use a list comprehension over the columns:
df[[col for col in df.columns if col != 'T1_V6']]