get the length of dictionary within a dataframe - python

I am currently learning pandas and would like to know how I can filter the rows whose column (which holds a dictionary) has more than 3 keys in it. For example,
import pandas as pd
data = {'id':[1,2,3], 'types': [{1: 'a', 2:'b', 3:'c'},{1: 'a', 2:'b', 3:'c', 4:'d'}, {1: 'a', 2:'b', 3:'c'}]}
df = pd.DataFrame(data)
How can I get the rows where the length of the dictionary in column types is > 3?
I tried doing
df[len(df['types']) > 3]
but it doesn't work. Is there a simple solution?

Use Series.apply or Series.map:
df = df[df['types'].apply(len) > 3]
#alternative
#df = df[df['types'].map(len) > 3]
print (df)
id types
1 2 {1: 'a', 2: 'b', 3: 'c', 4: 'd'}
Or Series.str.len:
df = df[df['types'].str.len() > 3]
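The attempt in the question fails because len() applied to the whole Series returns the number of rows, so the comparison yields a single boolean rather than a per-row mask. A minimal sketch of the difference, reusing the question's data (the names n_rows, key_counts and result are only illustrative):

```python
import pandas as pd

data = {'id': [1, 2, 3],
        'types': [{1: 'a', 2: 'b', 3: 'c'},
                  {1: 'a', 2: 'b', 3: 'c', 4: 'd'},
                  {1: 'a', 2: 'b', 3: 'c'}]}
df = pd.DataFrame(data)

# len() on the Series counts rows, not dictionary keys
n_rows = len(df['types'])

# apply(len) calls len() on each dictionary, giving one count per row
key_counts = df['types'].apply(len)

# A boolean Series can be used as a row mask
result = df[key_counts > 3]
print(result)
```

Here n_rows is 3, while key_counts holds 3, 4, 3, so only the row with id 2 is kept.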

Related

Count frequencies (unique rows) from a pandas list type column

I have a dataframe (df) like this:
id col
1 [A, B, C, C]
2 [B, C, D]
3 [C, D, E]
And, I have list like this:
l = ["A", "C", "F"]
For each element in l, I want to count the number of unique rows of df in which it appears.
'A': 1, 'C': 3, 'F': 0
But I'm not getting the part where I can check if the value exists in the list-column of the dataframe.
d = {}
for i in l:
    df_tmp = df[i.isin(df['col'])]['id'] ## wrong, showing error, isin is not a string attribute
    d[i] = len(df_tmp)
Is there any way I can fix this? Or is there a cleaner or more efficient way?
N.B. There is a similar question Frequency counts for a pandas column of lists, but it is different as I have an external list to check the frequency.
Here we use the apply method, which applies a given function to each element of the column (in our case, a function that tests whether a value belongs to the list). We then sum the True values, i.e. the rows in which the requested value was found, and save the result to a dictionary. We do this for every requested letter. I have not tested the performance of this solution.
import pandas as pd
df = pd.DataFrame([
{'id': 1, 'col': ['A', 'B', 'C', 'C']},
{'id': 2, 'col': ['B', 'C', 'D']},
{'id': 3, 'col': ['C', 'D', 'E']}])
letters = ["A", "C", "D", "F"]
res = {v: df['col'].apply(lambda x: v in x).sum()
       for v in letters}
# output
# {'A': 1, 'C': 3, 'D': 2, 'F': 0}
You can just check the membership in the list for each value in ['A', 'C', 'F'] and compute sum() like:
vals = ['A', 'C', 'F']
{val: df['col'].apply(lambda x: val in x).sum() for val in vals}
output:
{'A': 1, 'C': 3, 'F': 0}
You can explode the col column, keep the rows whose value is in the list l, then use value_counts() to count the values in the resulting Series.
l = ["A", "C", "D", "F"]
col = df['col'].apply(set).explode(ignore_index=True)
out = col[col.isin(l)].value_counts().reindex(l, fill_value=0).to_dict()
# or without defining `col`
out = (df['col'].apply(set).explode(ignore_index=True)
       [lambda d: d.isin(l)]
       .value_counts().reindex(l, fill_value=0).to_dict())
print(out)
{'A': 1, 'C': 3, 'D': 2, 'F': 0}
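As a further alternative, the same counts can be sketched without apply by making a single pass over the column with collections.Counter. This is only an illustration, not a benchmarked solution; the names counts and out are arbitrary, and the data is the question's.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3],
                   'col': [['A', 'B', 'C', 'C'],
                           ['B', 'C', 'D'],
                           ['C', 'D', 'E']]})
l = ["A", "C", "F"]

# set(row) removes duplicates within a row, so each row counts at most once per value
counts = Counter(v for row in df['col'] for v in set(row))

# A Counter returns 0 for keys it has never seen
out = {v: counts[v] for v in l}
print(out)
```

For the question's data this gives 1 for 'A', 3 for 'C' and 0 for 'F'.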

groupby and get entry with highest occurrence pandas [duplicate]

This question already has answers here:
GroupBy pandas DataFrame and select most common value
(13 answers)
Closed 1 year ago.
I have a dataframe with data on cities that have different product types, such as:
city product_type
A    B
A    B
A    D
A    E
X    B
X    C
X    C
X    C
I want to know what the most common product type is, for each city. For the above df, it would be product B for city A and product C for city X.
I am trying to solve this by first grouping then iterating over the groups and trying to find the product type with max occurrence but it doesn't seem to work:
d = df.groupby('city')['product_type']
prods=[]
for name,group in d:
    l = [group]
    prod = max(l, key=l.count)
    prods.append(prod)
print(prods)# this is list of products with highest occurrence in each city
This piece of code seems to give me ALL the product types, not just the most frequent ones.
You can try something like this:
data = pd.DataFrame({
    'city': ['A', 'A', 'A', 'A', 'X', 'X', 'X', 'X'],
    'product_type': ['B', 'B', 'D', 'E', 'B', 'C', 'C', 'C']
})
result_dict = {city: city_data.product_type.value_counts().index[0]
               for city, city_data in data.groupby('city')}
print(result_dict)
This results in the dictionary {'A': 'B', 'X': 'C'}. Note that if more than one product has the same number of occurrences, this code will only return one of them.
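If ties matter, one possible variation is to use Series.mode, which returns every value that shares the highest count. The sketch below changes the sample data so that city A has a tie between B and D; that change and the name top are assumptions made for illustration.

```python
import pandas as pd

# Sample data modified so that city A has a tie (B and D both occur twice)
data = pd.DataFrame({
    'city': ['A', 'A', 'A', 'A', 'X', 'X', 'X', 'X'],
    'product_type': ['B', 'B', 'D', 'D', 'B', 'C', 'C', 'C'],
})

# mode() returns all most-frequent values, sorted, so ties are preserved
top = {city: city_data['product_type'].mode().tolist()
       for city, city_data in data.groupby('city')}
print(top)
```

Each city then maps to a list: both B and D for city A, and only C for city X.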

Creating variable number of lists from pandas dataframes

I have a pandas dataframe being generated by some other piece of code - the dataframe may have different number of columns each time it is generated: let's call them col1,col2,...,coln where n is not fixed. Please note that col1,col2,... are just placeholders, the actual names of columns can be arbitrary like TimeStamp or PrevState.
From this, I want to convert each column into a list, with the name of the list being the same as the column. So, I want a list named col1 with the entries in the first column of the dataframe and so on till coln.
How do I do this?
Thanks
Creating variables dynamically is not recommended; it is better to create a dictionary:
d = df.to_dict('list')
Then select a list by using the column name as the dictionary key:
print (d['col'])
Sample:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
})
d = df.to_dict('list')
print (d)
{'A': ['a', 'b', 'c', 'd', 'e', 'f'], 'B': [4, 5, 4, 5, 5, 4], 'C': [7, 8, 9, 4, 2, 3]}
print (d['A'])
['a', 'b', 'c', 'd', 'e', 'f']
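Because the column names are not known in advance, the dictionary can also be walked without naming any column explicitly. A small sketch, using the placeholder names TimeStamp and PrevState from the question with made-up values:

```python
import pandas as pd

# Made-up values; only the column names come from the question
df = pd.DataFrame({'TimeStamp': [1, 2, 3],
                   'PrevState': ['x', 'y', 'z']})

d = df.to_dict(orient='list')

# Works for any number of columns and any column names
for name, values in d.items():
    print(name, values)
```

Each list stays reachable as d[name], so no variable has to be created per column.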
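import pandas as pd
df = pd.DataFrame()
df["col1"] = [1,2,3,4,5]
df["colTWO"] = [6,7,8,9,10]
# Creates one variable per column; only works if every column name is a valid identifier
for col_name in df.columns:
    # repr() of a plain list is valid Python source, unlike repr() of a NumPy array
    exec(col_name + " = " + repr(df[col_name].tolist()))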

Grab Updated rows of pandas column while looping through dataframe

I am trying the following:
import pandas as pd
df = pd.DataFrame({'Col1': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'B'},
'Col2': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c'},
'Col3': {0: 42, 1: 28, 2: 56, 3: 62, 4: 48}})
ii = 1
for idx, row in df.iterrows():
    print(row)
    df.at[:, 'Col2'] = 'asd{}'.format(ii)
    ii += 1
But the print statement above doesn't reflect the change df.at[:, 'Col2'] = 'asd{}'.format(ii). I need the print statements to reflect that change.
Edit: Since I am updating all rows of df, I was expecting idx and row to pick up the new values from the dataframe.
If this is not the right way to get updated values from df through idx and row, then what is the correct approach? I need idx and row to reflect the new values.
Expected output:
Col1 A
Col2 a
Col3 42
Name: 0, dtype: object
Col1 A
Col2 asd1
Col3 28
Name: 1, dtype: object
Col1 B
Col2 asd2
Col3 56
.....
From iterrows documentation:
You should never modify something you are iterating over. This is not
guaranteed to work in all cases. Depending on the data types, the
iterator returns a copy and not a view, and writing to it will have no
effect.
As per your request for an alternative solution, here is one using DataFrame.apply:
df['Col2'] = df.apply(lambda row: 'asd{}'.format(row.name), axis=1)
Other examples (also using Series.apply) that may be useful for your eventual goal, which is not yet clear:
df['Col2'] = df['Col2'].apply(lambda x: 'asd{}'.format(x))
df['Col2'] = df.apply(lambda row: 'asd{}'.format(row['Col3']), axis=1)
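If the loop really has to see the updated values, one option is to iterate over the index and look each row up again inside the loop, instead of relying on the row handed out by iterrows. This is a sketch under that assumption: it overwrites the column with a plain column assignment rather than df.at, and the list seen is a hypothetical helper that records what each iteration observed.

```python
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'A', 'B', 'B', 'B'],
                   'Col2': ['a', 'a', 'b', 'b', 'c'],
                   'Col3': [42, 28, 56, 62, 48]})

seen = []  # Col2 value observed at each step (illustration only)
for ii, idx in enumerate(df.index, start=1):
    row = df.loc[idx]                 # fresh lookup, so earlier updates are visible
    print(row)
    seen.append(row['Col2'])
    df['Col2'] = 'asd{}'.format(ii)   # overwrite the whole column

print(seen)
```

The first row still shows the original 'a', and every later row shows the value written in the previous iteration, which matches the expected output in the question.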
Here is something you can try:
import pandas as pd
df = pd.DataFrame({'Col1': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'B'},
'Col2': {0: 'a', 1: 'a', 2: 'b', 3: 'b', 4: 'c'},
'Col3': {0: 42, 1: 28, 2: 56, 3: 62, 4: 48}})
print(
    df.assign(idx=df.index)[['idx', 'Col2']]
    .apply(lambda x: x['Col2'] if x['idx'] == 0 else f"asd{x['idx']}", axis=1)
)
0 a
1 asd1
2 asd2
3 asd3
4 asd4
dtype: object

Pandas comparison

I'm trying to simplify pandas and python syntax when executing a basic Pandas operation.
I have 4 columns:
a_id
a_score
b_id
b_score
I create a new label called doc_type based on the following:
a >= b, doc_type: a
b > a, doc_type: b
I'm struggling with how to handle the case in Pandas where a exists but b doesn't; in this case a needs to be the label. Right now it falls through to the else branch and returns b.
I needed to add 2 additional comparisons, which may be inefficient at scale since I have already compared the data before. I'm looking for a way to improve it.
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
    'a_score': [1, 2, 3, 4, '', 6, 7],
    'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})
print(df)
# Replace empty string with NaN
df = df.replace('', np.nan)
m_score = df['a_score'] >= df['b_score']
m_doc = (df['a_id'].isnull() & df['b_id'].isnull())
# Calculate higher score
df['doc_id'] = df.apply(lambda row: row['a_id'] if row['a_score'] >= row['b_score'] else row['b_id'], axis=1)
# Select type based on higher score
df['doc_type'] = np.where(m_score, 'a',
                          np.where(m_doc, None, 'b'))
# Additional lines looking for improvement:
df.loc[df['a_id'].isnull() & df['b_id'].notnull(), 'doc_type'] = 'b'
df.loc[df['a_id'].notnull() & df['b_id'].isnull(), 'doc_type'] = 'a'
print(df)
Use numpy.where, assuming your logic is:
Both exist, the doc_type will be the one with higher score;
One missing, the doc_type will be the one not null;
Both missing, the doc_type will be null;
Added an extra edge case at the last line:
import numpy as np
df = df.replace('', np.nan)
df['doc_type'] = np.where(df.b_id.isnull() | (df.a_score >= df.b_score),
np.where(df.a_id.isnull(), None, 'a'), 'b')
df
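The same three rules can also be spelled out step by step with boolean masks, which may be easier to extend than nested np.where calls. This is only a sketch: it assumes that "exists" can be judged from the scores rather than the ids, converts the scores with pd.to_numeric, and the names a, b and doc_type are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
    'a_score': [1, 2, 3, 4, '', 6, 7],
    'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})

# Empty strings and None both become NaN
a = pd.to_numeric(df['a_score'], errors='coerce')
b = pd.to_numeric(df['b_score'], errors='coerce')

# Start from 'b', then overwrite where a wins or b is missing
doc_type = pd.Series('b', index=df.index, dtype=object)
doc_type.loc[b.isna() | (a >= b)] = 'a'

# Both missing: no label (no such row in this sample)
doc_type.loc[a.isna() & b.isna()] = None

df['doc_type'] = doc_type
print(df)
```

Comparisons against NaN evaluate to False, so a row with a missing a_score keeps the default 'b' unless b is missing as well.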
Not sure I fully understand all conditions or if this has any particular edge cases, but I think you can just do an np.argmax on the columns and swap the values for 'a' or 'b' when you're done:
In [21]: import numpy as np
In [22]: df['doc_type'] = pd.Series(np.argmax(df[["a_score", "b_score"]].values, axis=1)).replace({0: 'a', 1: 'b'})
In [23]: df
Out[23]:
a_id a_score b_id b_score doc_type
0 A 1 a 0.10 a
1 B 2 b 0.20 a
2 C 3 c 3.10 b
3 D 4 d 4.10 b
4 2 e 5.00 b
5 F f 5.99 a
6 G 7 NaN a
Use the apply method in pandas with a custom function. Trying it out on your dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'a_id': ['A', 'B', 'C', 'D', '', 'F', 'G'],
'a_score': [1, 2, 3, 4, '', 6, 7],
'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, None],
})
df = df.replace('', np.nan)
def func(row):
    if np.isnan(row.a_score) and np.isnan(row.b_score):
        return np.nan
    elif np.isnan(row.b_score) and not np.isnan(row.a_score):
        return 'a'
    elif not np.isnan(row.b_score) and np.isnan(row.a_score):
        return 'b'  # only b has a score
    elif row.a_score >= row.b_score:
        return 'a'
    elif row.b_score > row.a_score:
        return 'b'
df['doc_type'] = df.apply(func, axis=1)
You can make the function as complicated as you need and include any amount of comparisons and add more conditions later if you need to.
