I want to add a column to a pandas DataFrame, for example data['dict'], in which every element is an empty dictionary. For example:
>>> data['dict']
{}
{}
{}
{}
How can I write this? Thank you very much.
Use a list comprehension.
For existing DataFrame:
df['dict'] = [{} for _ in range(len(df))]
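For a new object:
# name the column explicitly; a bare list of empty dicts would be read as 100 records with no columns
pd.DataFrame({'dict': [{} for _ in range(100)]})
One caution: storing a complex Python object such as a dict in each (row, column) cell gives the column object dtype, so you lose much of pandas' ability to vectorize operations on it.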
To avoid every row sharing the same dictionary object, which causes problems later when you assign values into it, create a new dictionary for each row:
df['dict'] = df.apply(lambda x: {}, axis=1)
df
Out[730]:
0 1 2 dict
0 a b c {}
1 a NaN b {}
2 NaN t a {}
3 a d b {}
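Both answers above build one dictionary per row. As a minimal sketch of why that matters (the column names "shared" and "own" and the small frame are made up for illustration), compare list multiplication, which repeats a single dict object, with a list comprehension:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

# Pitfall: list multiplication repeats ONE dict object in every row.
df["shared"] = [{}] * len(df)
df["shared"].iloc[0]["k"] = 1      # mutate the dict stored in row 0
print(df["shared"].iloc[1])        # row 1 holds the same object

# A comprehension builds a separate dict for each row.
df["own"] = [{} for _ in range(len(df))]
df["own"].iloc[0]["k"] = 1
print(df["own"].iloc[1])           # row 1 is unaffected
```

With the shared version, a change made through one row shows up in all of them, which is usually not what you want.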
Say I have a list:
mylist = ['a','b','c']
and a Pandas dataframe (df) that has a column named "rating". How can I get the count for number of occurrence of a rating while iterating my list? For example, here is what I need:
for item in mylist:
    # Do a bunch of stuff in here that takes a long time
    # want to do print statement below to show progress
    # print df['rating'].value_counts().a <- I can do this,
    # but want to use variable 'item'
    # print df['rating'].value_counts().item <- Or something like this
I know I can get counts for all distinct values of 'rating', but that is not what I am after.
If you must do it this way, you can use .loc to filter the df and then count the rows of the result with len(). (Avoid .size here: it counts every cell, so it only matches the row count when the df has a single column.)
import pandas as pd
mylist = ['a','b','c']
df = pd.DataFrame({'rating':['a','a','b','c','c','c','d','e','f']})
for item in mylist:
    print(item, len(df.loc[df['rating']==item]))
Output
a 2
b 1
c 3
Instead of thinking about this problem as one of going "from the list to the Dataframe" it might be easiest to flip it around:
mylist = ['a','b','c']
df = pd.DataFrame({'rating':['a','a','b','c','c','c','d','e','f']})
ValueCounts = df['rating'].value_counts()
ValueCounts[ValueCounts.index.isin(mylist)]
Output:
c 3
a 2
b 1
Name: rating, dtype: int64
You don't even need a for loop, just do:
df['rating'].value_counts()[mylist]
Or to make it a dictionary:
df['rating'].value_counts()[['a', 'b', 'c']].to_dict()
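One caveat with indexing value_counts() by a list: a label that never occurs in the column raises a KeyError. A hedged sketch of two ways around that, using a hypothetical extra label 'z' that is not in the data: compute the counts once, then either reindex with a fill value or look up each item with a default inside the loop.

```python
import pandas as pd

mylist = ["a", "b", "c", "z"]  # 'z' is a made-up label that never occurs
df = pd.DataFrame({"rating": ["a", "a", "b", "c", "c", "c", "d", "e", "f"]})

# Count once, outside the loop.
counts = df["rating"].value_counts()

# Option 1: reindex to the labels of interest, filling absent ones with 0.
selected = counts.reindex(mylist, fill_value=0)
print(selected.to_dict())

# Option 2: look up each item with a default while iterating.
for item in mylist:
    print(item, counts.get(item, 0))
```

Computing the counts a single time also keeps the progress print cheap when the loop body is slow.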
I am trying to take a list of lists and transform it into a dataframe such that the dataframe has only one column and each sublist takes one row in the dataframe. Below is an image of what I have attempted, but each word within each sublist is being put in different columns.
[Image: current dataframe]
Essentially, I want a table that looks like this:
[Image: how I want the dataframe to look]
How about something like this, using list comprehension:
import pandas as pd
data = [[1,2,3], [4,5,6]]
# list comp. loops over each list in data (i)
# then appends every element j in i to a string
# end result is one string per row
pd.DataFrame([' '.join(str(j) for j in i) for i in data], columns=['Review'])
>>> Review
0 1 2 3
1 4 5 6
Here you go.
import pandas as pd
data=[['a b'],['c d']] # assuming each sublist has reviews
data=[ i[0] for i in data] # make one list
df = pd.DataFrame({'review':data})
print(df)
Output:
review
0 a b
1 c d
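Since the question describes sublists of words, here is a small sketch under that assumption (the sample reviews are made up). It shows the join approach from the answers above applied to sublists of different lengths, plus an alternative that keeps each sublist intact as a single cell value:

```python
import pandas as pd

# Hypothetical data: each sublist holds the words of one review.
data = [["good", "movie"], ["bad", "plot", "twist"]]

# One string per row: join the words of each sublist.
joined = pd.DataFrame({"Review": [" ".join(words) for words in data]})
print(joined)

# Alternative: keep each sublist as a list object in a single column.
as_lists = pd.DataFrame({"Review": data})
print(as_lists)
```

Passing the data inside a dict keyed by the column name is what stops pandas from spreading the words across several columns.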
I have the following data frame of the form:
1 2 3 4 5 6 7 8
A C C T G A T C
C A G T T A D N
Y F V H Q A F D
I need to randomly select a column k times where k is the number of columns in the given sample. My program creates a list of empty lists of size k and then randomly selects a column from the dataframe to be appended to the list. Each list must be unique and cannot have duplicates.
From the above example dataframe, an expected output should be something like:
[[2][4][6][1][7][3][5][8]]
However I am obtaining results like:
[[1][1][3][6][7][8][8][2]]
What is the most pythonic way to go about doing this? Here is my sorry attempt:
k = len(df.columns)
k_clusters = [[] for i in range(k)]
for i in range(len(k_clusters)):
    for j in range(i + 1, len(k_clusters)):
        k_clusters[i].append((df.sample(1, axis=1)))
        if k_clusters[i] == k_clusters[j]:
            k_clusters[j].pop(0)
            k_clusters[j].append(df.sample(1, axis=1))
Aside from the shuffling step, your question is very similar to "How to change the order of DataFrame columns?". Shuffling can be done in any number of ways in Python:
import numpy as np
cols = np.array(df.columns)
np.random.shuffle(cols)
Or using the standard library:
import random
cols = list(df.columns)
random.shuffle(cols)
You do not want to do cols = df.columns.values, because that will give you write access to the underlying column name data. You will then end up shuffling the column names in-place, messing up your dataframe.
Rearranging your columns is then easy:
df = df[cols]
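Putting the steps together, a minimal self-contained sketch, assuming the example frame from the question with column labels 1 to 8. It uses random.sample, which draws without replacement, so no column label can repeat; the names order and k_clusters are illustrative.

```python
import random
import pandas as pd

# Example frame from the question, with column labels 1..8.
df = pd.DataFrame(
    [list("ACCTGATC"), list("CAGTTADN"), list("YFVHQAFD")],
    columns=range(1, 9),
)

# Draw every column label exactly once, in random order.
order = random.sample(list(df.columns), k=len(df.columns))
k_clusters = [[col] for col in order]   # e.g. [[2], [4], [6], ...]

# Reorder the frame to match the shuffled labels.
shuffled = df[order]

# pandas-only alternative: sample all columns without replacement.
shuffled_alt = df.sample(frac=1, axis=1)
print(k_clusters)
```

Either route leaves the data untouched and only changes the column order.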
You can use numpy.random.shuffle to shuffle just the column labels, which is what I assume you want from your question.
An example:
import numpy as np
to_shuffle = np.array(df.columns)
np.random.shuffle(to_shuffle)
print(to_shuffle)
I have one dictionary of several pandas dataframes. It looks like this:
key Value
A pandas dataframe here
B pandas dataframe here
C pandas dataframe here
I need to extract each dataframe from the dict as a separate variable and use its dict key as the variable name.
The desired output is as many separate dataframes as there are values in my dict.
A = dict.values() - this is first dataframe
B = dict.values() - this is second dataframe
Note that dataframes names are dict keys.
I tried this code but without any success.
for key, value in my_dict_name.items():
    key = pd.DataFrame.from_dict(value)
Any help would be appreciated.
It is not recommended, but possible:
Thanks to @Willem Van Onsem for the better explanation:
"It is a quite severe anti-pattern, especially since it can override existing variables, and one can never exclude that scenario."
a = pd.DataFrame({'a':['a']})
b = pd.DataFrame({'b':['b']})
c = pd.DataFrame({'c':['c']})
d = {'A':a, 'B':b, 'C':c}
print (d)
{'A': a
0 a, 'B': b
0 b, 'C': c
0 c}
for k, v in d.items():
    globals()[k] = v
print (A)
a
0 a
I think the best option here is a MultiIndex, if the DataFrames share the same columns or index values. Keeping a dictionary of DataFrames is also perfectly OK.
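A rough sketch of both alternatives mentioned above, assuming small made-up frames that share a column "x": work with the dictionary directly, or stack the frames with pd.concat, which uses the dict keys as the outer level of a MultiIndex.

```python
import pandas as pd

# Hypothetical frames sharing the same column.
frames = {
    "A": pd.DataFrame({"x": [1, 2]}),
    "B": pd.DataFrame({"x": [3]}),
    "C": pd.DataFrame({"x": [4, 5, 6]}),
}

# Option 1: keep the dict and address each frame by its key.
for name, frame in frames.items():
    print(name, frame.shape)

# Option 2: one frame with a MultiIndex; the dict keys become the outer level.
combined = pd.concat(frames)
b_rows = combined.loc["B"]   # rows that came from frames["B"]
print(combined)
```

Neither option creates variables dynamically, so nothing in the surrounding namespace can be overwritten by accident.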
I have a large data set with a column that contains personal names; value_counts() shows 60 distinct names in total. I don't want to show those names when I analyze the data, so I want to rename them to participant_1, ..., participant_60.
I also want to rename the values in alphabetical order so that I will be able to find out who is participant_1 later.
I started by creating a list of new names:
newnames = [f"participant_{i}" for i in range(1,61)]
Then I tried to use the function df.replace.
df.replace('names', 'newnames')
However, I don't know where to specify that participant_1 should replace the name that comes first in alphabetical order. Any suggestions or better solutions?
If you need to replace the values in a column in alphabetical order, use Categorical.codes:
df = pd.DataFrame({
'names':list('bcdada'),
})
df['new'] = [f"participant_{i}" for i in pd.Categorical(df['names']).codes + 1]
#alternative solution
#df['new'] = [f"participant_{i}" for i in pd.CategoricalIndex(df['names']).codes + 1]
print (df)
names new
0 b participant_2
1 c participant_3
2 d participant_4
3 a participant_1
4 d participant_4
5 a participant_1
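The question also asks to be able to find out later who participant_1 is. A hedged sketch of one way to keep that lookup, swapping in pd.factorize with sort=True (not the Categorical approach above, though it should yield the same codes for this data); the name key is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"names": list("bcdada")})

# sort=True numbers the unique names in sorted (alphabetical) order.
codes, uniques = pd.factorize(df["names"], sort=True)
df["new"] = [f"participant_{c + 1}" for c in codes]

# Reverse lookup from participant label back to the original name.
key = {f"participant_{i + 1}": name for i, name in enumerate(uniques)}
print(df)
print(key["participant_1"])
```

Store the key separately from the anonymized data if the names must stay hidden during analysis.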
Use rename:
df.rename({'old_column_name': 'new_column_name'}, axis=1, inplace=True)  # add further old/new pairs to the dict as needed
You can generate the mapping using a dict comprehension like this -
mapper = {k: v for (k,v) in zip(sorted(df.columns), newnames)}
If I understood correctly, you want to replace column values, not column names.
Create a dict that maps the old names to the new names, then use df.replace:
import pandas as pd
df = pd.DataFrame()
df['names'] = ['sam','dean','jack','chris','mark']
# sort the unique names so that a repeated name still maps to a single participant label
unique_names = sorted(df['names'].unique())
rep_dict = {name: "participant_{}".format(i + 1) for i, name in enumerate(unique_names)}
print(df.replace(rep_dict))
Output:
names
0 participant_5
1 participant_2
2 participant_3
3 participant_1
4 participant_4