I have been looking for an answer without success (1,2,3), and a lot of the questions I have found about string aggregation involve only string aggregation when all the columns are strings. This is a mixed aggregation with some specific details.
The df:
df = pd.DataFrame({
'Group': ['Group_1', 'Group_1','Group_1', 'Group_1', 'Group_2', 'Group_2', 'Group_2', 'Group_2', 'Group_2', 'Group_2'],
'Col1': ['A','A','B',np.nan,'B','B','C','C','C','C'],
'Col2': [1,2,3,3,5,5,5,7,np.nan,7],
'Col3': [np.nan, np.nan, np.nan,np.nan,3,2,3,4,5,5],
'Col4_to_Col99': ['some value','some value','some value','some value','some value','some value','some value','some value','some value','some value']
})
The function used for the output (source):
def join_non_nan_values(elements):
    return ";".join([elem for elem in elements if elem == elem]) # elem == elem will fail for Nan values
The output:
df.groupby('Group')[['Col1', 'Col2', 'Col3']].agg({'Col1': join_non_nan_values, 'Col2': 'count', 'Col3':'mean'})
The output expected:
The output for Col1 and Col2 is a counting. The left side is the value, the right side is the count.
PS: If you know a more efficient way to implement the join_non_nan_values function, you are welcome! (It actually takes a while to run.) Just remember that it needs to skip missing values.
You can try this:
def f(x):
    c = x.value_counts().sort_index()
    return ";".join(f"{k}:{v}" for (k, v) in c.items())
df["Col2"] = df["Col2"].astype('Int64')
df.groupby("Group")[["Col1", "Col2", "Col3"]].agg({
"Col1": f,
"Col2": f,
"Col3": 'mean'
})
It gives:
Col1 Col2 Col3
Group
Group_1 A:2;B:1 1:1;2:1;3:2 NaN
Group_2 B:2;C:4 5:3;7:2 3.666667
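For reference, the same idea can be written with named aggregation. This is only a sketch under assumptions: the helper name count_values and the small example frame are made up for illustration, and dropna() is called explicitly even though value_counts() skips missing values by default.

```python
import numpy as np
import pandas as pd


def count_values(series: pd.Series) -> str:
    """Format the non-missing values of a group as 'value:count' pairs."""
    counts = series.dropna().value_counts().sort_index()
    return ";".join(f"{k}:{v}" for k, v in counts.items())


# Small illustrative frame (not the data from the question)
df = pd.DataFrame({
    "Group": ["g1", "g1", "g1", "g2", "g2"],
    "Col1": ["A", "A", np.nan, "B", "C"],
    "Col3": [1.0, 3.0, np.nan, 4.0, 6.0],
})

# Named aggregation: output column = (input column, aggregation)
out = df.groupby("Group").agg(
    Col1=("Col1", count_values),
    Col3=("Col3", "mean"),
)
print(out)
```

Named aggregation keeps the output column names explicit, which can help when the real frame has many columns.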
You can try calling value_counts() inside groupby().apply() and converting the outcome into strings using the str.join() method. To have a DataFrame (not a Series) returned as output, use the as_index=False parameter in groupby().
def func(g):
    """
    (i) Count the values in Col1 and Col2 columns by calling value_counts() on each column
        and convert the output into strings via join() method
    (ii) Calculate mean of Col3
    """
    col1 = ';'.join([f'{k}:{v}' for k,v in g['Col1'].value_counts(sort=False).items()])
    col2 = ';'.join([f'{int(k)}:{v}' for k,v in g['Col2'].value_counts(sort=False).items()])
    col3 = g['Col3'].mean()
    return col1, col2, col3
# group by Group and apply func to specific columns
result = df.groupby('Group', as_index=False)[['Col1','Col2','Col3']].apply(func)
result
Related
I have a dictionary which looks like this: di = {1: "A", 2: "B"}
I would like to apply it to the col1 column of a dataframe similar to:
col1 col2
0 w a
1 1 2
2 2 NaN
to get:
col1 col2
0 w a
1 A 2
2 B NaN
How can I best do this? For some reason googling terms relating to this only shows me links about how to make columns from dicts and vice-versa :-/
You can use .replace. For example:
>>> df = pd.DataFrame({'col2': {0: 'a', 1: 2, 2: np.nan}, 'col1': {0: 'w', 1: 1, 2: 2}})
>>> di = {1: "A", 2: "B"}
>>> df
col1 col2
0 w a
1 1 2
2 2 NaN
>>> df.replace({"col1": di})
col1 col2
0 w a
1 A 2
2 B NaN
or directly on the Series, i.e. df["col1"].replace(di, inplace=True).
map can be much faster than replace
If your dictionary has more than a couple of keys, using map can be much faster than replace. There are two versions of this approach, depending on whether your dictionary exhaustively maps all possible values (and also whether you want non-matches to keep their values or be converted to NaNs):
Exhaustive Mapping
In this case, the form is very simple:
df['col1'].map(di) # note: if the dictionary does not exhaustively map all
# entries then non-matched entries are changed to NaNs
Although map most commonly takes a function as its argument, it can alternatively take a dictionary or Series: see the documentation for pandas.Series.map
Non-Exhaustive Mapping
If you have a non-exhaustive mapping and wish to retain the existing values for non-matches, you can add fillna:
df['col1'].map(di).fillna(df['col1'])
as in @jpp's answer here: Replace values in a pandas series via dictionary efficiently
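To make the difference between the two forms concrete, here is a minimal sketch. The three-element Series is made up for illustration and mirrors col1 from the question.

```python
import pandas as pd

di = {1: "A", 2: "B"}
s = pd.Series(["w", 1, 2])

# Exhaustive form: anything missing from the dict becomes NaN
exhaustive = s.map(di)

# Non-exhaustive form: fall back to the original value for non-matches
kept = s.map(di).fillna(s)

print(exhaustive.tolist())
print(kept.tolist())
```

Here "w" has no entry in di, so it turns into NaN in the first result and is kept in the second.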
Benchmarks
Using the following data with pandas version 0.23.1:
di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H" }
df = pd.DataFrame({ 'col1': np.random.choice( range(1,9), 100000 ) })
and testing with %timeit, it appears that map is approximately 10x faster than replace.
Note that your speedup with map will vary with your data. The largest speedup appears to be with large dictionaries and exhaustive replaces. See @jpp's answer (linked above) for more extensive benchmarks and discussion.
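If you want to repeat the comparison outside IPython, a rough sketch with the standard timeit module could look like the following. The seed, the repeat count and the variable names are arbitrary choices, and the timings will differ between machines and pandas versions, so no particular ratio is claimed here.

```python
import timeit

import numpy as np
import pandas as pd

di = {1: "A", 2: "B", 3: "C", 4: "D", 5: "E", 6: "F", 7: "G", 8: "H"}
rng = np.random.default_rng(0)  # fixed seed so runs are comparable
s = pd.Series(rng.integers(1, 9, size=100_000))  # values 1..8

# Time a few repetitions of each approach
t_map = timeit.timeit(lambda: s.map(di), number=5)
t_replace = timeit.timeit(lambda: s.replace(di), number=5)
print(f"map: {t_map:.3f}s, replace: {t_replace:.3f}s")

# Sanity check: with an exhaustive dict both approaches give the same values
same = s.map(di).tolist() == s.replace(di).tolist()
print(same)
```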
There is a bit of ambiguity in your question. There are at least three possible interpretations:
the keys in di refer to index values
the keys in di refer to df['col1'] values
the keys in di refer to index locations (not the OP's question, but thrown in for fun.)
Below is a solution for each case.
Case 1:
If the keys of di are meant to refer to index values, then you could use the update method:
df['col1'].update(pd.Series(di))
For example,
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
# col1 col2
# 1 w a
# 2 10 30
# 0 20 NaN
di = {0: "A", 2: "B"}
# The value at the 0-index is mapped to 'A', the value at the 2-index is mapped to 'B'
df['col1'].update(pd.Series(di))
print(df)
yields
col1 col2
1 w a
2 B 30
0 A NaN
I've modified the values from your original post so it is clearer what update is doing.
Note how the keys in di are associated with index values. The order of the index values -- that is, the index locations -- does not matter.
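As a hedged alternative for Case 1: calling update through df['col1'] relies on the selected column writing back to df, which may not hold under pandas' copy-on-write mode (an assumption about newer versions; check your own setup). A label-based assignment on the frame itself avoids that question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1, 2, 0])
di = {0: "A", 2: "B"}

# Keys are index labels, so select rows by label with .loc
df.loc[list(di.keys()), 'col1'] = list(di.values())
print(df)
```

Note that, unlike update, this raises a KeyError if a key in di is not present in the index.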
Case 2:
If the keys in di refer to df['col1'] values, then @DanAllan and @DSM show how to achieve this with replace:
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
print(df)
# col1 col2
# 1 w a
# 2 10 30
# 0 20 NaN
di = {10: "A", 20: "B"}
# The values 10 and 20 are replaced by 'A' and 'B'
df['col1'].replace(di, inplace=True)
print(df)
yields
col1 col2
1 w a
2 A 30
0 B NaN
Note how in this case the keys in di were changed to match values in df['col1'].
Case 3:
If the keys in di refer to index locations, then you could use
df['col1'].put(di.keys(), di.values())
since
df = pd.DataFrame({'col1':['w', 10, 20],
'col2': ['a', 30, np.nan]},
index=[1,2,0])
di = {0: "A", 2: "B"}
# The values at the 0 and 2 index locations are replaced by 'A' and 'B'
df['col1'].put(di.keys(), di.values())
print(df)
yields
col1 col2
1 A a
2 10 30
0 B NaN
Here, the first and third rows were altered, because the keys in di are 0 and 2, which with Python's 0-based indexing refer to the first and third locations.
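As far as I know, Series.put was deprecated in pandas 0.25 and removed in 1.0, so the call above may not exist in your version. A sketch of the same positional update with iloc, using the same example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['w', 10, 20],
                   'col2': ['a', 30, np.nan]},
                  index=[1, 2, 0])
di = {0: "A", 2: "B"}

# Keys are positions, so use iloc with the integer position of 'col1'
col_pos = df.columns.get_loc('col1')
df.iloc[list(di.keys()), col_pos] = list(di.values())
print(df)
```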
DSM has the accepted answer, but the code doesn't seem to work for everyone. Here is one that works with the current version of pandas (0.23.4 as of 8/2018):
import pandas as pd
df = pd.DataFrame({'col1': [1, 2, 2, 3, 1],
'col2': ['negative', 'positive', 'neutral', 'neutral', 'positive']})
conversion_dict = {'negative': -1, 'neutral': 0, 'positive': 1}
df['converted_column'] = df['col2'].replace(conversion_dict)
print(df.head())
You'll see it looks like:
col1 col2 converted_column
0 1 negative -1
1 2 positive 1
2 2 neutral 0
3 3 neutral 0
4 1 positive 1
The docs for pandas.DataFrame.replace are here.
Given that map is faster than replace (@JohnE's solution), you need to be careful with non-exhaustive mappings where you intend to map specific values to NaN. The proper method in this case requires that you mask the Series when you .fillna, else you undo the mapping to NaN.
import pandas as pd
import numpy as np
d = {'m': 'Male', 'f': 'Female', 'missing': np.nan}
df = pd.DataFrame({'gender': ['m', 'f', 'missing', 'Male', 'U']})
keep_nan = [k for k,v in d.items() if pd.isnull(v)]
s = df['gender']
df['mapped'] = s.map(d).fillna(s.mask(s.isin(keep_nan)))
gender mapped
0 m Male
1 f Female
2 missing NaN
3 Male Male
4 U U
Adding to this question, in case you ever have more than one column to remap in a dataframe:
def remap(data, dict_labels):
    """
    This function takes a dictionary of labels, dict_labels,
    and replaces the (previously label-encoded) values with their strings.
    ex: dict_labels = {'col1': {1: 'A', 2: 'B'}}
    """
    for field, values in dict_labels.items():
        print("I am remapping %s" % field)
        data.replace({field: values}, inplace=True)
    print("DONE")
    return data
Hope it can be useful to someone.
Cheers
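As a side note, DataFrame.replace also accepts the nested dictionary directly, so the loop is optional. A minimal sketch with made-up columns and labels:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [0, 1, 0]})

# Outer keys are column names, inner dicts map old value -> new value
dict_labels = {
    'col1': {1: 'A', 2: 'B'},
    'col2': {0: 'no', 1: 'yes'},
}

out = df.replace(dict_labels)
print(out)
```

Values without an entry (3 in col1 here) are left unchanged.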
Or do apply:
df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))
Demo:
>>> df['col1']=df['col1'].apply(lambda x: {1: "A", 2: "B"}.get(x,x))
>>> df
col1 col2
0 w a
1 A 2
2 B NaN
>>>
You can update your mapping dictionary with missing pairs from the dataframe. For example:
df = pd.DataFrame({'col1': ['a', 'b', 'c', 'd', np.nan]})
map_ = {'a': 'A', 'b': 'B', 'd': np.nan}
# Get mapping from df
uniques = df['col1'].unique()
map_new = dict(zip(uniques, uniques))
# {'a': 'a', 'b': 'b', 'c': 'c', 'd': 'd', nan: nan}
# Update mapping
map_new.update(map_)
# {'a': 'A', 'b': 'B', 'c': 'c', 'd': nan, nan: nan}
df['col2'] = df['col1'].map(map_new)
Result:
col1 col2
0 a A
1 b B
2 c c
3 d NaN
4 NaN NaN
A nice complete solution that keeps a map of your class labels:
labels = features['col1'].unique()
labels_dict = dict(zip(labels, range(len(labels))))
features = features.replace({"col1": labels_dict})
This way, you can at any point refer to the original class label from labels_dict.
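A related option is pd.factorize, which returns the integer codes and the unique labels in one call. This is a sketch with a made-up features frame, not a replacement for the snippet above:

```python
import pandas as pd

features = pd.DataFrame({'col1': ['cat', 'dog', 'cat', 'bird']})

# codes[i] is the position of features['col1'][i] in uniques
codes, uniques = pd.factorize(features['col1'])
features['col1_code'] = codes

# Same kind of label -> code dictionary as labels_dict above
labels_dict = dict(zip(uniques, range(len(uniques))))
print(labels_dict)
```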
As an extension to what has been proposed by Nico Coallier (applying to multiple columns) and U10-Forward (using apply-style methods), and summarising it into a one-liner, I propose:
df.loc[:,['col1','col2']].transform(lambda x: x.map(lambda x: {1: "A", 2: "B"}.get(x,x)))
The .transform() processes each column as a Series, contrary to .apply(), which passes the columns aggregated in a DataFrame.
Consequently you can apply the Series method map().
Finally (I discovered this behaviour thanks to U10), you can use the whole Series in the .get() expression, unless I have misunderstood its behaviour and it processes the Series sequentially rather than in a vectorised way.
The .get(x,x) accounts for the values you did not mention in your mapping dictionary, which would otherwise be turned into NaN by the .map() method.
A more native pandas approach is to apply a replace function as below:
import re

def multiple_replace(mapping, text):
    # Create a regular expression from the dictionary keys
    regex = re.compile("(%s)" % "|".join(map(re.escape, mapping.keys())))
    # For each match, look up the corresponding value in the dictionary
    return regex.sub(lambda mo: mapping[mo.group()], text)
Once you have defined the function, you can apply it to your dataframe. The function works on strings, so the keys are strings here and the column is cast to str first. Note that it replaces every matching substring, not only whole values.
di = {"1": "A", "2": "B"}
df['col1'] = df['col1'].astype(str).apply(lambda text: multiple_replace(di, text))
I have a table T1 as shown below (stored as dataframe df3 with columns col1, col2 and col3)
df2 has columns 'l', 'm', 'n'...
df1 has columns 'a', 'b', 'c'
col1 col2 col3
x add {'a':'df1','l':'df2','n':'df2'}
y sub {'b':'df1','m':'df2'}
z sqrt {'c': 'df1'}
Value x in col1 is to be calculated using operation add in col2 using parameters key:value pairs in col3 (a in df1, l in df2, ...)
Likewise, value y in col1 is to be calculated using operation sub in col2 using parameters in col3 (b in df1, m in df2); the number of k:v pairs in Col3 could be more OR less depending upon the operation/function defined in col 2, for sqrt for instance, there is only 1 pair
I want to get the output in form a dataframe df4 as mentioned below
x y z
df1['a']+df2['l']+df2['n'] df1['b'] - df2['m'] df1['c']
I am trying to achieve this by building a function as shown below, but I am not sure how to build and pass a dynamic argument list to this function, where the number of arguments to be passed depends on the number of k:v pairs in col3. In my case I have 3 for add, 2 for sub and only 1 for sqrt.
for ix,row in df3.iterrows():
    call_operation = row['col2']
    target_value = row['col1']
    #df4[target_value] = getattr(module,call_operation)(df2[b],df1[a])
    df4[target_value] = getattr(module,call_operation)( <dynamic argument list from col3> )
import builtins
import pandas as pd

# dummy data
df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'l': [4, 5, 6],
                    'n': [7, 8, 9]})
# get your dfs in a dict so we can call them by name
dfs = {'df1': df1, 'df2': df2}
# frame that collects the computed values
results = pd.DataFrame(index=df1.index)
# let's say you are in your for loop on the first row:
ix = 0
target_name = 'x'
call_operation = 'sum'
col3 = {'a': 'df1', 'l': 'df2', 'n': 'df2'}
# actual logic: collect one value per key:value pair, then pass them all to the operation
vars = []
for k, v in col3.items():
    vars.append(dfs[v][k].iloc[ix])
results.loc[ix, target_name] = getattr(builtins, call_operation)(vars)
Depending on how many type of operations you have in your real data you could either use getattr(), if statements or a combination of both.
import math
if call_operation == 'sqrt':
    getattr(math, 'sqrt')(vars[0])
etc.
This doesn't feel like a proper use of pandas though, but I'm not sure of the size of your actual dataset.
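A column-wise variant of the same idea, sketched under assumptions: the OPS dispatch table, the extra dummy columns b, c and m, and the choice to fold add/sub over all arguments are mine, not something the question specifies. Argument unpacking with * handles the varying number of key:value pairs.

```python
import operator
from functools import reduce

import numpy as np
import pandas as pd

# Dummy data (values are made up)
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30], 'c': [4, 9, 16]})
df2 = pd.DataFrame({'l': [4, 5, 6], 'm': [1, 2, 3], 'n': [7, 8, 9]})
dfs = {'df1': df1, 'df2': df2}

df3 = pd.DataFrame({
    'col1': ['x', 'y', 'z'],
    'col2': ['add', 'sub', 'sqrt'],
    'col3': [{'a': 'df1', 'l': 'df2', 'n': 'df2'},
             {'b': 'df1', 'm': 'df2'},
             {'c': 'df1'}],
})

# Hypothetical dispatch table: operation name -> callable taking any number of columns
OPS = {
    'add': lambda *cols: reduce(operator.add, cols),
    'sub': lambda *cols: reduce(operator.sub, cols),
    'sqrt': lambda col: np.sqrt(col),
}

df4 = pd.DataFrame(index=df1.index)
for _, row in df3.iterrows():
    # Build the argument list from the key:value pairs, in dict order
    args = [dfs[frame_name][col] for col, frame_name in row['col3'].items()]
    df4[row['col1']] = OPS[row['col2']](*args)
print(df4)
```

Working on whole columns instead of single cells avoids the per-row iloc lookups, which usually matters more as the frames grow.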
Given a Pandas Series with strings, I'd like to create a DataFrame with columns for each section of the Series based on position.
For example, given this input:
s = pd.Series(['abcdef', '123456'])
ind = [2, 3, 1]
Ideally I'd get this:
target_df = pd.DataFrame({
'col1': ['ab', '12'],
'col2': ['cde', '345'],
'col3': ['f', '6']
})
One way is creating them one-by-one, e.g.:
df = pd.DataFrame()
df['col1'] = s.str[:2]
df['col2'] = s.str[2:5]
df['col3'] = s.str[5:]
But I'm guessing this is slower than a single split.
I tried a regex, but not sure how to parse the result:
pd.DataFrame(s.str.split(r"(^(\w{2})(\w{3})(\w{1}))"))
# 0
# 0 [, abcdef, ab, cde, f, ]
# 1 [, 123456, 12, 345, 6, ]
Your regex is almost there (note Series.str.extract(expand=True) returns a DataFrame):
df = s.str.extract(r"^(\w{2})(\w{3})(\w{1})", expand = True)
df.columns = ['col1', 'col2', 'col3']
# col1 col2 col3
# 0 ab cde f
# 1 12 345 6
Here's a function to generalize this:
def split_series_by_position(s, ind, cols):
    # Construct regex, e.g. ind = [2, 3, 1] gives ^(\w{2})(\w{3})(\w{1})
    regex = r"^(\w{" + r"})(\w{".join(map(str, ind)) + "})"
    df = s.str.extract(regex, expand=True)
    df.columns = cols
    return df
# Example which will produce the result above.
split_series_by_position(s, ind, ['col1', 'col2', 'col3'])
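If the segments may contain characters that \w does not match, a slicing-based variant avoids the regex altogether. This is a sketch; the function name split_by_widths is made up, and it assumes ind holds the widths of consecutive segments.

```python
from itertools import accumulate

import pandas as pd


def split_by_widths(s, widths, cols):
    """Slice each string into consecutive segments of the given widths."""
    ends = list(accumulate(widths))   # [2, 3, 1] -> [2, 5, 6]
    starts = [0] + ends[:-1]          # [0, 2, 5]
    return pd.DataFrame({
        col: s.str[start:end]
        for col, start, end in zip(cols, starts, ends)
    })


s = pd.Series(['abcdef', '123456'])
ind = [2, 3, 1]
out = split_by_widths(s, ind, ['col1', 'col2', 'col3'])
print(out)
```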
I have the following dataframes:
df1 = pd.DataFrame({'col1': ['A','M','C'],
'col2': ['B','N','O'],
# plus many more
})
df2 = pd.DataFrame({'col3': ['A','A','A','B','B','B'],
'col4': ['M','P','Q','J','P','M'],
# plus many more
})
Which look like these:
df1:
col1 col2
A B
M N
C O
#...plus many more
df2:
col3 col4
A M
A P
A Q
B J
B P
B M
#...plus many more
The objective is to create a dataframe containing all elements in col4 for each col3 that occurs in one row in df1. For example, let's look at row 1 of df1. We see that A is in col1 and B is in col2. Then, we go to df2, and check what col4 is for df2[df2['col3'] == 'A'] and df2[df2['col3'] == 'B']. We get, for A: ['M','P','Q'], and for B: ['J','P','M']. The intersection of these is ['M', 'P'], so what I want is something like this:
col1 col2 col4
A B M
A B P
....(and so on for the other rows)
The naive way to go about this is to iterate over rows and then get the intersection, but I was wondering if it's possible to solve this via merging techniques or other faster methods. So far, I can't think of any way how.
This should achieve what you want, using a combination of merge, groupby and set intersection:
# Getting tuple of all col1=col3 values in col4
df3 = pd.merge(df1, df2, left_on='col1', right_on='col3')
df3 = df3.groupby(['col1', 'col2'])['col4'].apply(tuple)
df3 = df3.reset_index()
# Getting tuple of all col2=col3 values in col4
df3 = pd.merge(df3, df2, left_on='col2', right_on='col3')
df3 = df3.groupby(['col1', 'col2', 'col4_x'])['col4_y'].apply(tuple)
df3 = df3.reset_index()
# Taking set intersection of our two tuples
df3['col4'] = df3.apply(lambda row: set(row['col4_x']) & set(row['col4_y']), axis=1)
# Dropping unnecessary columns
df3 = df3.drop(['col4_x', 'col4_y'], axis=1)
print(df3)
col1 col2 col4
0 A B {P, M}
If required, see this answer for examples of how to 'melt' col4.
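One way to get one row per element, sketched here with DataFrame.explode (available from pandas 0.25 as far as I know). The one-row frame stands in for the df3 result above, and each set is converted to a sorted list first so the row order is deterministic.

```python
import pandas as pd

# Stand-in for the df3 result above
df3 = pd.DataFrame({'col1': ['A'], 'col2': ['B'], 'col4': [{'P', 'M'}]})

# Sets have no defined order, so turn each one into a sorted list
df3['col4'] = df3['col4'].apply(sorted)

# One output row per list element; the other columns are repeated
long_df = df3.explode('col4').reset_index(drop=True)
print(long_df)
```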
I am trying to get the column names which have cell values less than .2, without repeating a combination of columns. I tried this to iterate over the column names without success:
pvals2=pd.DataFrame({'col1': [1, .2,.7],
'col2': [.2, 1,.01],
'col3': [.7,.01,1]},
index = ['col1', 'col2', 'col3'])
print(pvals2)
print('---')
pvals2.transpose().join(pvals2, how='outer')
My goal is:
col3 col2 .01
#col2 col3 .01 #NOT INCLUDED (because it is a repeat)
A list comprehension is one way:
pvals2 = pd.DataFrame({'col1': [1, .2,.7], 'col2': [.2, 1,.01], 'col3': [.7,.01,1]},
index = ['col1', 'col2', 'col3'])
res = [col for col in pvals2 if (pvals2[col] < 0.2).any()]
# ['col2', 'col3']
To get values as well, as in your desired output, requires more specification, as a column may have more than one value less than 0.2.
Iterate through the columns and check if any value meets your conditions:
pvals2=pd.DataFrame({'col1': [1, .2,.7],
'col2': [.2, 1,.01],
'col3': [.7,.01,1]})
cols_with_small_values = set()
for col in pvals2.columns:
    if any(i < 0.2 for i in pvals2[col]):
        cols_with_small_values.add(col)
        cols_with_small_values.add(pvals2[col].min())
print(cols_with_small_values)
RESULT: {'col3', 0.01, 'col2'}
any is a built-in. This question has a good explanation for how any works. And we can use a set to assure each column will only appear once.
We use Series.min() on the column to get the smallest value, which is one of the values that caused us to select this column.
You could use stack and then filter out values < 0.2. Then keep the last duplicated value
pvals2.stack()[pvals2.stack().lt(.2)].drop_duplicates(keep='last')
col3 col2 0.01
dtype: float64
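Because the matrix is symmetric, another way to avoid repeated pairs is to keep only the part above the diagonal before stacking. This is a sketch; it reports the pair as (col2, col3) rather than (col3, col2), which is the same combination in the other order.

```python
import numpy as np
import pandas as pd

pvals2 = pd.DataFrame({'col1': [1, .2, .7],
                       'col2': [.2, 1, .01],
                       'col3': [.7, .01, 1]},
                      index=['col1', 'col2', 'col3'])

# True strictly above the diagonal, so each pair of columns appears once
upper = np.triu(np.ones(pvals2.shape, dtype=bool), k=1)

# Cells outside the mask become NaN; NaN never passes the < 0.2 test
pairs = pvals2.where(upper).stack()
small = pairs[pairs < 0.2]
print(small)
```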
pvals2=pd.DataFrame({'col1': [1, .2,.7],
'col2': [.2, 1,.01],
'col3': [.7,.01,1]},
index = ['col1', 'col2', 'col3'])
pvals2.min().where(lambda x : x<0.1).dropna()
Output
col2 0.01
col3 0.01
dtype: float64