Say I have a list:
mylist = ['a','b','c']
and a Pandas dataframe (df) that has a column named "rating". How can I get the count of occurrences of a rating while iterating over my list? For example, here is what I need:
for item in mylist:
    # Do a bunch of stuff in here that takes a long time
    # want to do a print statement below to show progress
    # print df['rating'].value_counts().a <- I can do this,
    # but want to use the variable 'item'
    # print df['rating'].value_counts().item <- Or something like this
I know I can get counts for all distinct values of 'rating', but that is not what I am after.
If you must do it this way, you can use .loc to filter the df and then count the rows of the result. (Note that .size counts cells, i.e. rows × columns; it matches the row count here only because the filtered frame has a single column, so len() is the safer choice.)
mylist = ['a','b','c']
df = pd.DataFrame({'rating':['a','a','b','c','c','c','d','e','f']})
for item in mylist:
    print(item, len(df.loc[df['rating'] == item]))
Output
a 2
b 1
c 3
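Another direct option for counting inside the loop, sketched here on the same example data, is to sum a boolean comparison (the sum of a boolean Series is the number of True values):

```python
import pandas as pd

mylist = ['a', 'b', 'c']
df = pd.DataFrame({'rating': ['a', 'a', 'b', 'c', 'c', 'c', 'd', 'e', 'f']})

counts = {}
for item in mylist:
    # comparing the column to a scalar yields a boolean Series;
    # its sum is the number of matching rows
    counts[item] = int((df['rating'] == item).sum())
    print(item, counts[item])
```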
Instead of thinking about this problem as one of going "from the list to the Dataframe" it might be easiest to flip it around:
mylist = ['a','b','c']
df = pd.DataFrame({'rating':['a','a','b','c','c','c','d','e','f']})
ValueCounts = df['rating'].value_counts()
ValueCounts[ValueCounts.index.isin(mylist)]
Output:
c 3
a 2
b 1
Name: rating, dtype: int64
You don't even need a for loop, just do:
df['rating'].value_counts()[mylist]
Or to make it a dictionary:
df['rating'].value_counts()[['a', 'b', 'c']].to_dict()
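One caveat with indexing value_counts() by a list: it raises a KeyError if any item in the list never occurs in the column. If that can happen in your data, reindex with a fill value instead; a small sketch (here 'z' is a deliberately absent rating):

```python
import pandas as pd

mylist = ['a', 'b', 'z']  # 'z' never appears in the column
df = pd.DataFrame({'rating': ['a', 'a', 'b', 'c', 'c', 'c', 'd', 'e', 'f']})

# reindex keeps the requested order and fills absent ratings with 0
counts = df['rating'].value_counts().reindex(mylist, fill_value=0)
print(counts)
```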
I have a pandas dataframe containing a list of strings in a column called contains_and. Now I want to select the rows from that dataframe whose words in contains_and are all contained in a given string, e.g.
example: str = "I'm really satisfied with the quality and the price of product X"
df: pd.DataFrame = pd.DataFrame({"columnA": [1,2], "contains_and": [["price","quality"],["delivery","speed"]]})
resulting in a dataframe like this:
columnA contains_and
0 1 [price, quality]
1 2 [delivery, speed]
Now, I would like to select only the first row (columnA == 1), as example contains all the words in its contains_and list.
My initial instinct was to do the following:
df.loc[
    all([word in example for word in df["contains_and"]])
]
However, doing that results in the following error:
TypeError: 'in <string>' requires string as left operand, not list
I'm not quite sure how to best do this, but it seems like something that shouldn't be all too difficult. Does someone know of a good way to do this?
One way:
df = df[df.contains_and.apply(lambda x: all(i in example for i in x))]
OUTPUT:
columnA contains_and
0 1 [price, quality]
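One caveat with the "i in example" check above: it is a substring test, so "price" would also match inside a longer word such as "priceless". If whole-word matching matters, a sketch that splits the sentence into a set of words first is safer:

```python
import pandas as pd

example = "I'm really satisfied with the quality and the price of product X"
df = pd.DataFrame({"columnA": [1, 2],
                   "contains_and": [["price", "quality"], ["delivery", "speed"]]})

# compare against whole words only, not substrings
words = set(example.split())
mask = df["contains_and"].apply(lambda lst: set(lst).issubset(words))
print(df[mask])
```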
another way is exploding the list of candidate words and checking (per row) whether they are all among the words of example, which are found with str.split:
# a Series of words
ex = pd.Series(example.split())
# boolean array reduced with `all`
to_keep = df["contains_and"].explode().isin(ex).groupby(level=0).all()
# keep only "True" rows
new_df = df[to_keep]
to get
>>> new_df
columnA contains_and
0 1 [price, quality]
Based on @Nk03's answer, you could also try:
df = df[df.contains_and.apply(lambda x: all(q in example for q in x))]
(note that all, not any, is needed here, since every word in the list must be contained). In my opinion it is more intuitive to check whether the words are in example, rather than the opposite, as your first attempt shows.
I am trying to take a list of lists and transform it into a dataframe such that the dataframe has only one column and each sublist takes one row in the dataframe. Below is an image of what I have attempted, but each word within each sublist is being put in different columns.
Current dataframe
Essentially, I want a table that looks like this:
How I want the dataframe to look
How about something like this, using list comprehension:
import pandas as pd
data = [[1,2,3], [4,5,6]]
# list comp. loops over each list in data (i)
# then appends every element j in i to a string
# end result is one string per row
pd.DataFrame([' '.join(str(j) for j in i) for i in data], columns=['Review'])
  Review
0  1 2 3
1  4 5 6
Here you go.
import pandas as pd
data=[['a b'],['c d']] # assuming each sublist has reviews
data=[ i[0] for i in data] # make one list
df = pd.DataFrame({'review':data})
print(df)
Output:
review
0 a b
1 c d
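If each sublist may contain several review strings rather than one (an assumption; adjust to your data), joining each sublist first handles both shapes in one step:

```python
import pandas as pd

data = [['great product', 'fast shipping'], ['too expensive']]

# join each sublist into a single string so every sublist becomes one row
df = pd.DataFrame({'review': [' '.join(map(str, sub)) for sub in data]})
print(df)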
I have several dataframes in a list, obtained after using np.array_split, and I want to concat some of them into a single dataframe. In this example, I want to concat the 3 dataframes contained in b (all but the 2nd chunk, a[1], which is left out of the list):
df = pd.DataFrame({'country':['a','b','c','d'],
'gdp':[1,2,3,4],
'iso':['x','y','z','w']})
a = np.array_split(df,4)
i = 1
b = a[:i]+a[i+1:]
desired_final_df = pd.DataFrame({'country':['a','c','d'],
'gdp':[1,3,4],
'iso':['x','z','w']})
I have tried to create an empty df and then use append through a loop over the elements in b, but without complete success:
CV = pd.DataFrame()
CV = [CV.append[(b[i])] for i in b] #try1
CV = [CV.append(b[i]) for i in b] #try2
CV = pd.DataFrame([CV.append[(b[i])] for i in b]) #try3
for i in b:
    CV.append(b) #try4
I have reached a solution which works, but it is not efficient:
CV = pd.DataFrame()
CV = [CV.append(b) for i in b][0]
In this case, CV ends up holding three copies of the same full dataframe and I just take the first of them. However, in my real case, with big datasets, building the same result three times would cost much more computation.
How could I do that without repeating operations?
According to the docs, DataFrame.append does not work in place the way list.append does; it returns the resulting DataFrame instead. Catching that returned object should be enough for what you need:
df = pd.DataFrame()
for next_df in list_of_dfs:
    df = df.append(next_df)
You may want to use the keyword argument ignore_index=True in the append call so that the indices become continuous, instead of starting from 0 for each appended DataFrame (assuming that the index of the DataFrames in the list all start from 0).
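Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the loop above needs pd.concat instead. A minimal sketch, using a made-up list_of_dfs in place of your own frames:

```python
import pandas as pd

list_of_dfs = [pd.DataFrame({'x': [1, 2]}), pd.DataFrame({'x': [3]})]

# pd.concat replaces the append-in-a-loop pattern with one call;
# ignore_index=True renumbers the rows 0..n-1
df = pd.concat(list_of_dfs, ignore_index=True)
print(df)
```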
To concatenate multiple DataFrames, resetting the index, use pandas.concat:
pd.concat(b, ignore_index=True)
Output:
country gdp iso
0 a 1 x
1 c 3 z
2 d 4 w
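Putting it together with the question's own setup, the whole pipeline (split, drop the second chunk, concatenate) can be sketched as:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'country': ['a', 'b', 'c', 'd'],
                   'gdp': [1, 2, 3, 4],
                   'iso': ['x', 'y', 'z', 'w']})

a = np.array_split(df, 4)   # four single-row DataFrames
i = 1
b = a[:i] + a[i + 1:]       # drop the second chunk
CV = pd.concat(b, ignore_index=True)
print(CV)
```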
I have a large data set with a column that contains personal names; in total there are 60 distinct names according to value_counts(). I don't want to show those names when I analyze the data; instead I want to rename them to participant_1, ..., participant_60.
I also want to rename the values in alphabetical order so that I will be able to find out who is participant_1 later.
I started with create a list of new names:
newnames = [f"participant_{i}" for i in range(1,61)]
Then I try to use the function df.replace.
df.replace('names', 'newnames')
However, I don't know where to specify that I want participant_1 replace the name that comes first in alphabetical order. Any suggestions or better solutions?
If you need to replace the values in the column in alphabetical order, use Categorical.codes:
df = pd.DataFrame({
'names':list('bcdada'),
})
df['new'] = [f"participant_{i}" for i in pd.Categorical(df['names']).codes + 1]
#alternative solution
#df['new'] = [f"participant_{i}" for i in pd.CategoricalIndex(df['names']).codes + 1]
print (df)
names new
0 b participant_2
1 c participant_3
2 d participant_4
3 a participant_1
4 d participant_4
5 a participant_1
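An alternative sketch that also keeps an explicit lookup table (useful for the requirement of finding out later who participant_1 is): build a dict from the sorted unique names and use Series.map, shown here on the same small example:

```python
import pandas as pd

df = pd.DataFrame({'names': list('bcdada')})

# sorted unique names -> participant_1, participant_2, ... in alphabetical order
mapping = {name: f"participant_{i}"
           for i, name in enumerate(sorted(df['names'].unique()), start=1)}
df['new'] = df['names'].map(mapping)
print(mapping)   # keep this dict to recover the original names later
print(df)
```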
Use rename:
df.rename({'old_column_name': 'new_column_name', ...}, axis=1, inplace=True)
You can generate the mapping using a dict comprehension like this -
mapper = {k: v for (k,v) in zip(sorted(df.columns), newnames)}
If I understood correctly, you want to replace column values, not column names.
Create a dict with old_names and new_names then You can use df.replace
import pandas as pd
df = pd.DataFrame()
df['names'] = ['sam','dean','jack','chris','mark']
x = ["participant_{}".format(i+1) for i in range(len(df))]
rep_dict = {k:v for k,v in zip(df['names'].sort_values(), x)}
print(df.replace(rep_dict))
Output:
names
0 participant_5
1 participant_2
2 participant_3
3 participant_1
4 participant_4
I want an empty column in pandas. For example, data['dict']. I want every element in this column to be an empty dictionary. For example:
>>> data['dict']
{}
{}
{}
{}
How can I write this code? Thank you very much.
Use a list comprehension.
For existing DataFrame:
df['dict'] = [{} for _ in range(len(df))]
For new object:
pd.DataFrame([{} for _ in range(100)])
One caution is that you lose some of the abilities of Pandas to vectorize operations when you use a complex Pandas data structure inside each (row, column) cell.
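A related pitfall worth showing: multiplying a one-element list, as in [{}] * len(df), makes every row reference the same dict object, so mutating one "cell" appears to change them all. The list comprehension creates a distinct dict per row:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})

shared = pd.Series([{}] * len(df))                   # three references to ONE dict
distinct = pd.Series([{} for _ in range(len(df))])   # three separate dicts

shared[0]['k'] = 1
distinct[0]['k'] = 1

print(shared[1])    # the change leaked into every row
print(distinct[1])  # other rows unaffected
```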
To make sure each row gets its own dict object (rather than all rows sharing one copy) when assigning the values, you can also create one per row with apply:
df['dict'] = df.apply(lambda x: {}, axis=1)
df
Out[730]:
0 1 2 dict
0 a b c {}
1 a NaN b {}
2 NaN t a {}
3 a d b {}