I have a column that looks like this:
group
A
A
A
B
B
C
The value C exists sometimes but not always. The code below works fine when C is present; however, if C does not occur in the column, it throws a KeyError.
value_counts = df.group.value_counts()
new_df["C"] = value_counts.C
I want to check whether C has a count or not. If not, I want to assign new_df["C"] a value of 0. I tried this but I still get a KeyError. What else can I try?
value_counts = df.group.value_counts()
new_df["C"] = value_counts.C
if (df.group.value_counts()['C']):
    new_df["C"] = value_counts.C
else:
    new_df["C"] = 0
One way of doing it is to convert the Series into a dictionary and get the key, returning a default value (0 in your case) if it is not found:
df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'D']})
new_df = {}
character = "C"
new_df[character] = df.group.value_counts().to_dict().get(character, 0)
Output of new_df:
{'C': 0}
However, I am not sure what new_df should be; it seems that it is a dictionary, or it might be a new DataFrame object?
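If new_df is a DataFrame rather than a dict, the same idea works without converting anything, since Series.get also accepts a default value (a minimal sketch, assuming new_df already exists):
value_counts = df.group.value_counts()
new_df["C"] = value_counts.get("C", 0)  # 0 when "C" is absent, instead of a KeyError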
One way could be to convert the group column to a Categorical type with specified categories, e.g.:
df = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B']})
print(df)
# group
# 0 A
# 1 A
# 2 A
# 3 B
# 4 B
categories = ['A', 'B', 'C']
df['group'] = pd.Categorical(df['group'], categories=categories)
df['group'].value_counts()
[out]
A 3
B 2
C 0
Name: group, dtype: int64
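With the categories declared up front, the lookup from the question no longer raises even when there are no C rows (a short sketch, assuming new_df is a dict as in the other answer):
new_df = {}
new_df['C'] = df['group'].value_counts()['C']
print(new_df)
# {'C': 0}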
I have two lists to start with:
delta = ['1','5']
taxa = ['2','3','4']
My dataframe will look like :
data = { 'id': [101,102,103,104,105],
'1_srcA': ['a', 'b','c', 'd', 'g'] ,
'1_srcB': ['a', 'b','c', 'd', 'e'] ,
'2_srcA': ['g', 'b','f', 'd', 'e'] ,
'2_srcB': ['a', 'b','c', 'd', 'e'] ,
'3_srcA': ['a', 'b','c', 'd', 'e'] ,
'3_srcB': ['a', 'b','1', 'd', 'm'] ,
'4_srcA': ['a', 'b','c', 'd', 'e'] ,
'4_srcB': ['a', 'b','c', 'd', 'e'] ,
'5_srcA': ['a', 'b','c', 'd', 'e'] ,
'5_srcB': ['m', 'b','c', 'd', 'e'] }
df = pd.DataFrame(data)
df
I have to do two types of checks on this dataframe, say Delta checks and Taxa checks.
For Delta checks, based on the list delta = ['1','5'], I have to compare 1_srcA vs 1_srcB and 5_srcA vs 5_srcB, since '1' is in 1_srcA, 1_srcB and '5' is in 5_srcA, 5_srcB. If the values differ, I have to populate 2. For Taxa checks (based on values from the taxa list), it should be 1. If there is no difference, it is 0.
This comparison has to happen on all the rows. df is generated by merging two dataframes, so there will be only two columns containing '1', two columns containing '2', and so on.
Conditions I have to check:
I need to check if the columns for values in the delta list differ. If yes, I populate 2.
I need to check if the columns for values in the taxa list differ. If yes, I populate 1.
If both condition 1 and condition 2 are satisfied, populate 2.
If none of the conditions are satisfied, populate 0.
So, my output should look like:
The code I tried:
df_cols_ = df.columns.tolist()[1:]
conditions = []
res = {}
for i, col in enumerate(df_cols_):
    if (i == 0) or (i % 2 == 0):
        continue
    var = 'cond_' + str(i)
    for del_col in delta:
        if del_col in col:
            var = var + '_F'
            break
    print(var)
    cond = f"df.iloc[:, {i}] != df.iloc[:, {i+1}]"
    res[var] = cond
    conditions.append(cond)
The res dict will look like the below. But how can I use these conditions to populate the result?
Is there an optimal way to derive the resultant dataframe? Thanks.
Create a helper function that selects the pair of columns for each value with DataFrame.filter and compares them for inequality, then use np.logical_or.reduce to collapse the list of boolean masks into one mask, and pass the masks to numpy.select. Because the delta mask is listed first in np.select, rows that satisfy both checks get 2, which covers condition 3:
delta = ['1','5']
taxa = ['2','3','4']
def f(x):
    df1 = df.filter(like=x)
    return df1.iloc[:, 0].ne(df1.iloc[:, 1])
d = np.logical_or.reduce([f(x) for x in delta])
print (d)
[ True False False False True]
t = np.logical_or.reduce([f(x) for x in taxa])
print (t)
[ True False True False True]
df['res'] = np.select([d, t], [2, 1], default=0)
print (df)
id 1_srcA 1_srcB 2_srcA 2_srcB 3_srcA 3_srcB 4_srcA 4_srcB 5_srcA 5_srcB \
0 101 a a g a a a a a a m
1 102 b b b b b b b b b b
2 103 c c f c c 1 c c c c
3 104 d d d d d d d d d d
4 105 g e e e e m e e e e
res
0 2
1 0
2 1
3 0
4 2
I have a dataframe with a unique index and columns 'users', 'tweet_times' and 'tweet_id'.
I want to count the number of duplicate tweet_times values per user.
users = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C']
tweet_times = ['01-01-01 01:00', '02-02-02 02:00', '03-03-03 03:00', '09-09-09 09:00',
'04-04-04 04:00', '04-04-04 04:00', '05-05-05 05:00', '09-09-09 09:00',
'06-06-06 06:00', '06-06-06 06:00', '07-07-07 07:00', '07-07-07 07:00']
d = {'users': users, 'tweet_times': tweet_times}
df = pd.DataFrame(data=d)
Desired Output
A: 0
B: 1
C: 2
I managed to get the desired output (except for the A: 0) using the code below. But is there a more pythonic / efficient way to do this?
# group by both columns
df2 = pd.DataFrame(df.groupby(['users', 'tweet_times']).tweet_id.count())
# filter out values < 2
df3 = df2[df2.tweet_id > 1]
# turn multi-index level 1 into column
df3.reset_index(level=[1], inplace=True)
# final groupby
df3.groupby('users').tweet_times.count()
We can use crosstab to create a frequency table, then check for counts greater than 1 to create a boolean mask, then sum this mask along axis=1 (each user/tweet_times cell greater than 1 is one duplicated tweet time for that user):
pd.crosstab(df['users'], df['tweet_times']).gt(1).sum(1)
users
A 0
B 1
C 2
dtype: int64
This works,
df1 = pd.DataFrame(df.groupby(['users'])['tweet_times'].value_counts()).reset_index(level = 0)
df1.groupby('users')['tweet_times'].apply(lambda x: sum(x>1))
users
A 0
B 1
C 2
Name: tweet_times, dtype: int64
You can use a custom boolean with your groupby.
keep=False returns True when a value is duplicated and False if not.
# df['tweet_times'] = pd.to_datetime(df['tweet_times'],errors='coerce')
df.groupby([df.duplicated(subset=['tweet_times'],keep=False),'users']
).nunique().loc[True]
tweet_times
users
A 0
B 1
C 2
There might be a simpler way, but this is all I can come up with for now :)
df.groupby("users")["tweet_times"].agg(lambda x: x.count() - x.nunique()).rename("count_dupe")
Output:
users
A 0
B 1
C 2
Name: count_dupe, dtype: int64
This looks quite pythonic to me:
df.groupby("users")["tweet_times"].count() - df.groupby("users")["tweet_times"].nunique()
Output:
users
A 0
B 1
C 2
Name: tweet_times, dtype: int64
I want to replace the value in the PF column with a value from another DataFrame if it exists (in yellow: no correspondence, so leave the value as it is):
and here is the DataFrame with the old value to compare against and the new value:
I tried to do this but it does not work:
unicite['CustomerID'] = np.where(unicite['CustomerId'] == Fidclients['CustomerId'],Fidclients['Newvalue'] , unicite['CustomerID'])
If I'm understanding the question correctly, you want to replace the values in CustomerID in the table unicite with the values in the column Newvalue if they exist in the column CustomerID within the table Fidclients.
I believe you'll have to merge the tables to achieve this. For example,
unicite = pd.DataFrame({'CustomerID': ['a', 'b', 'c']})
print(unicite)
CustomerID
0 a
1 b
2 c
Fidclients = pd.DataFrame({'CustomerID': ['c', 'f', 'g'], 'Newvalue': ['x', 'y', 'z']})
print(Fidclients)
CustomerID Newvalue
0 c x
1 f y
2 g z
merged = unicite.merge(Fidclients, on='CustomerID', how='left')
merged.loc[merged.Newvalue.notnull(), 'CustomerID'] = merged.Newvalue
merged.drop('Newvalue', axis=1)
CustomerID
0 a
1 b
2 x
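Note that drop returns a new DataFrame rather than modifying merged in place, so assign the result back if you want to keep it (a small sketch, assuming you want to overwrite unicite):
unicite = merged.drop('Newvalue', axis=1)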
Working with data in Python 3+ with pandas. It seems like there should be an easy way to check if two columns have a one-to-one relationship (regardless of column type), but I'm struggling to think of the best way to do this.
Example of expected output:
A B C
0 'a' 'apple'
1 'b' 'banana'
2 'c' 'apple'
A & B are one-to-one? TRUE
A & C are one-to-one? FALSE
B & C are one-to-one? FALSE
Well, you can create your own function to check it:
def isOneToOne(df, col1, col2):
    first = df.groupby(col1)[col2].count().max()
    second = df.groupby(col2)[col1].count().max()
    return first + second == 2
isOneToOne(df, 'A', 'B')
#True
isOneToOne(df, 'A', 'C')
#False
isOneToOne(df, 'B', 'C')
#False
In case your data is more like this:
df = pd.DataFrame({'A': [0, 1, 2, 0],
'C': ["'apple'", "'banana'", "'apple'", "'apple'"],
'B': ["'a'", "'b'", "'c'", "'a'"]})
df
# A B C
#0 0 'a' 'apple'
#1 1 'b' 'banana'
#2 2 'c' 'apple'
#3 0 'a' 'apple'
Then you can use:
def isOneToOne(df, col1, col2):
    first = df.drop_duplicates([col1, col2]).groupby(col1)[col2].count().max()
    second = df.drop_duplicates([col1, col2]).groupby(col2)[col1].count().max()
    return first + second == 2
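A quick usage sketch with the four-row frame above (the results follow from the function as defined):
isOneToOne(df, 'A', 'B')
#True
isOneToOne(df, 'A', 'C')
#False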
df.groupby(col1)[col2]\
  .apply(lambda x: x.nunique() == 1)\
  .all()
should work fine if you want a True or False answer (run it in both directions, col1 grouped against col2 and col2 grouped against col1, for a strict one-to-one check).
A nice way to visualize the relationship between two columns with discrete / categorical values (in case you are using Jupyter notebook) is :
df.groupby([col1, col2])\
.apply(lambda x : x.count())\
.iloc[:,0]\
.unstack()\
.fillna(0)
This matrix will tell you the correspondence between the column values in the two columns.
In the case of a one-to-one relationship there will be only one non-zero value per row (and per column) of the matrix.
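For example, for columns B and C of the three-row sample frame above, the matrix would look roughly like this (a sketch; exact formatting depends on your pandas version). Each row has a single non-zero entry, but the 'apple' column has two, so B and C are not one-to-one:
C        'apple'  'banana'
B
'a'          1.0       0.0
'b'          0.0       1.0
'c'          1.0       0.0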
df.groupby('A').B.nunique().max()==1 #Output: True
df.groupby('C').B.nunique().max()==1 #Output: False
Within each value in the groupby column, count the number of unique values in the other column, then check that the maximum of all such counts is one. For a strict one-to-one test, run the check in both directions (e.g. group by B counting C, and group by C counting B).
One way to solve this:
df['A to B']=df.groupby('B')['A'].transform(lambda x:x.nunique()==1)
df['A to C']=df.groupby('C')['A'].transform(lambda x:x.nunique()==1)
df['B to C']=df.groupby('C')['B'].transform(lambda x:x.nunique()==1)
Output:
A B C A to B A to C B to C
0 0 a apple True False False
1 1 b banana True True True
2 2 c apple True False False
To check column by column:
print((df['A to B'] == True).all())
print((df['A to C'] == True).all())
print((df['B to C'] == True).all())
True
False
False
Here is my solution (only two or three lines of code) to check any number of columns for a one-to-one match (duplicated matches are allowed, see the example below).
cols = ['A', 'B'] # or any number of columns ['A', 'B', 'C']
res = df.groupby(cols).count()
uniqueness = [res.index.get_level_values(i).is_unique
for i in range(res.index.nlevels)]
all(uniqueness)
Let's make it a function and add some docs:
def is_one_to_one(df, cols):
    """Check whether any number of columns are a one-to-one match.

    df: a pandas.DataFrame
    cols: must be a list of column names

    Duplicated matches are allowed:
        a - 1
        b - 2
        b - 2
        c - 3
    (these two cols will return True)
    """
    if len(cols) == 1:
        return True
        # You can define your own rule for a one-column check, or forbid it.

    # MAIN THING: the check for 2 or more columns!
    res = df.groupby(cols).count()
    # The count numbers themselves are not needed.
    # What matters here is the grouped *MultiIndex*
    # and whether each of its levels is unique.
    uniqueness = [res.index.get_level_values(i).is_unique
                  for i in range(res.index.nlevels)]
    return all(uniqueness)
By using this function, you can do the one-to-one match check:
df = pd.DataFrame({'A': [0, 1, 2, 0],
'B': ["'a'", "'b'", "'c'", "'a'"],
'C': ["'apple'", "'banana'", "'apple'", "'apple'"],})
is_one_to_one(df, ['A', 'B'])
is_one_to_one(df, ['A', 'C'])
is_one_to_one(df, ['A', 'B', 'C'])
# Outputs:
# True
# False
# False
I have a given dataframe:
Id Direction Load Unit
1 CN05059815 LoadFWD 0,0 NaN
2 CN05059815 LoadBWD 0,0 NaN
4 ....
....
and a given list:
list =['CN05059830','CN05059946','CN05060010','CN05060064' ...]
I would like to sort or group the data by a given element of the list.
For example,
The new data should have exactly the same order as the list. The first rows would start with CN05059815, which doesn't belong to the list, then CN05059830, CN05059946, ..., which both belong to the list, and then the remaining data.
One way is to use Categorical Data. Here's a minimal example:
# sample dataframe
df = pd.DataFrame({'col': ['A', 'B', 'C', 'D', 'E', 'F']})
# required ordering
lst = ['D', 'E', 'A', 'B']
# convert to categorical
df['col'] = df['col'].astype('category')
# set order, adding values not in lst to the front
order = list(set(df['col']) - set(lst)) + lst
# attach ordering information to categorical series
df['col'] = df['col'].cat.reorder_categories(order)
# apply ordering
df = df.sort_values('col')
print(df)
col
2 C
5 F
3 D
4 E
0 A
1 B
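Applied to the question's data, the same idea would look roughly like this (a sketch; the ids and the Id column name are taken from the question, and pd.Categorical is used directly so the categories can include values that never occur in the frame):
ids = ['CN05059830', 'CN05059946', 'CN05060010', 'CN05060064']
# values not in the list go to the front, keeping their first-seen order
order = list(dict.fromkeys(v for v in df['Id'] if v not in ids)) + ids
df['Id'] = pd.Categorical(df['Id'], categories=order)
df = df.sort_values('Id')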
Consider below approach and example:
df = pd.DataFrame({
'col': ['a', 'b', 'c', 'd', 'e']
})
list_ = ['d', 'b', 'a']
print(df)
Output:
col
0 a
1 b
2 c
3 d
4 e
Then, in order to sort the df by the list and its ordering (values not in the list map to -1, so they sort to the front):
df.reindex(df.assign(dummy=df['col'])['dummy'].apply(lambda x: list_.index(x) if x in list_ else -1).sort_values().index)
Output:
col
2 c
4 e
3 d
1 b
0 a