Pandas Dataframe reduce Dataframe with List comprehension - python

def check_LM_CapLetters(nex):
    df_ja = nex.copy()
    df_nein = nex.copy()
    criterion_nein = df_nein["NAME2"].map(lambda x: x.endswith("_nein"))
    criterion_ja = df_ja["NAME2"].map(lambda y: y.endswith("_ja"))
    df_ja[criterion_ja]
    df_nein[criterion_nein]
    frames = [df_ja, df_nein]
    df_found_small = pd.concat(frames)
    return df_found_small
I'm trying to reduce my dataframe to the rows where the NAME2 entry ends with "_ja" or "_nein". But the output looks like the two full copies merged together. What I want is a filter that only keeps the rows fulfilling my criteria.
By the way, is there a more elegant and efficient way? This is my first time dealing with list comprehensions and I'm kind of overwhelmed.
My data looks like:
Relation ID; TermID; NAME; bla; blub; TermID2; NAME2

Try the following:
- apply the Pandas string method .str.endswith()
- combine criteria for filtering with the | operator
Note that df_ja[criterion_ja] on its own just returns a filtered view and throws it away; since nothing is assigned back, you end up concatenating the two unfiltered copies, which is why the output looks like the frames merged.
Working example:
df = pd.DataFrame(['aaa', 'bba', 'baa', 'cba', 'xbb'], columns=['name2'])
df_small = df[df.name2.str.endswith('aa') | df.name2.str.endswith('ba')]
>>>
name2
0 aaa
1 bba
2 baa
3 cba
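Applied back to the original function, the same idea collapses to a single mask. A minimal sketch, with made-up NAME2 values standing in for the real data:

```python
import pandas as pd

# Toy stand-in for the asker's frame -- the NAME2 values are invented
nex = pd.DataFrame({"NAME2": ["foo_ja", "bar_nein", "baz_doch", "qux_ja"]})

def check_LM_CapLetters(nex):
    # Build one boolean mask and actually use it -- indexing alone does not
    # modify the frame, which is why the original returned both full copies
    mask = nex["NAME2"].str.endswith("_ja") | nex["NAME2"].str.endswith("_nein")
    return nex[mask]

print(check_LM_CapLetters(nex)["NAME2"].tolist())  # ['foo_ja', 'bar_nein', 'qux_ja']
```

No copies and no concat are needed; the filter is applied once to the original frame.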

Related

Pandas groupby - Can I use it for different functions on different sets of rows?

I have a large pandas dataframe with many different types of observations that need different models applied to them. One column is which model to apply, and that can be mapped to a python function which accepts a dataframe and returns a dataframe. One approach would be just doing 3 steps:
- split the dataframe into n dataframes for n different models
- run each dataframe through its function
- concatenate the output dataframes at the end
This just ends up not being very flexible, particularly as models are added and removed. Looking at groupby, it seems like I should be able to leverage it to make this much cleaner code-wise, but I haven't been able to find a pattern that does what I'd like.
Also because of the size of this data, using apply isn't particularly useful as it would drastically slow down the runtime.
Quick example:
df = pd.DataFrame({"model": ["a", "b", "a"], "a": [1, 5, 8], "b": [1, 4, 6]})

def model_a(df):
    return df["a"] + df["b"]

def model_b(df):
    return df["a"] - df["b"]

model_map = {"a": model_a, "b": model_b}
results = df.groupby("model")...
The expected result would look like [2,1,14]. Is there an easy way code-wise to do this? Note that the actual models are much more complicated and involve potentially hundreds of variables with lots of transformations, this is just a toy example.
Thanks!
You can use groupby/apply:
- x.name contains the name of the group, here a and b
- x contains the sub-dataframe

df['r'] = df.groupby('model') \
            .apply(lambda x: model_map[x.name](x)) \
            .droplevel(level='model')
>>> df
model a b r
0 a 1 1 2
1 b 5 4 1
2 a 8 6 14
Or you can use np.select:
>>> np.select([df['model'] == 'a', df['model'] == 'b'],
...           [model_a(df), model_b(df)])
array([ 2, 1, 14])
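With more than a handful of models, the conditions and choices for np.select can be built directly from model_map instead of being enumerated by hand. A sketch, assuming every model function is vectorised over the full frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"model": ["a", "b", "a"], "a": [1, 5, 8], "b": [1, 4, 6]})

def model_a(d):
    return d["a"] + d["b"]

def model_b(d):
    return d["a"] - d["b"]

model_map = {"a": model_a, "b": model_b}

# One condition and one choice per registered model, in matching order
conditions = [df["model"].eq(name) for name in model_map]
choices = [fn(df) for fn in model_map.values()]
df["r"] = np.select(conditions, choices)
print(df["r"].tolist())  # [2, 1, 14]
```

Adding or removing a model is then just a change to model_map. The trade-off: each model function runs over every row and np.select keeps only the matching results, so you pay extra computation for simpler dispatch.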

How to filter on a pandas dataframe using contains against a list of columns, if I don't know which columns are present?

I want to filter my dataframe to look for columns containing a known string.
I know you can do something like this:
summ_proc = summ_proc[
    summ_proc['data.process.name'].str.contains(indicator) |
    summ_proc['data.win.eventdata.processName'].str.contains(indicator) |
    summ_proc['data.win.eventdata.logonProcessName'].str.contains(indicator) |
    summ_proc['syscheck.audit.process.name'].str.contains(indicator)
]
where I'm using the | operator to check against multiple columns. But there are cases where a certain column name isn't present. So 'data.process.name' might not be present every time.
I tried the following implementation:
summ_proc[summ_proc.apply(lambda x: summ_proc['data.process.name'].str.contains(indicator) if 'data.process.name' in summ_proc.columns else summ_proc)]
And that works. But I'm not sure how I can apply the OR operator to this lambda function.
I want all the rows where either data.process.name or data.win.eventdata.processName or data.win.eventdata.logonProcessName or syscheck.audit.process.name contains the indicator.
EDIT:
I tried the following approach, where I created individual frames and concated all the frames.
summ_proc1 = summ_proc[summ_proc.apply(lambda x: summ_proc['data.process.name'].str.contains(indicator) if 'data.process.name' in summ_proc.columns else summ_proc)]
summ_proc2 = summ_proc[summ_proc.apply(lambda x: summ_proc['data.win.eventdata.processName'].str.contains(indicator) if 'data.win.eventdata.processName' in summ_proc.columns else summ_proc)]
summ_proc3 = summ_proc[summ_proc.apply(lambda x: summ_proc['data.win.eventdata.logonProcessName'].str.contains(indicator) if 'data.win.eventdata.logonProcessName' in summ_proc.columns else summ_proc)]
frames = [summ_proc1, summ_proc2, summ_proc3]
result = pd.concat(frames)
This works, but I'm curious whether there's a better, more pythonic approach, or whether this current method will cause downstream issues?
Something like this should work:
import numpy as np
columns = ['data.process.name', 'data.win.eventdata.processName']
# filter columns that are in summ_proc
available_columns = [c for c in columns if c in summ_proc.columns]
# array of Boolean values indicating if c contains indicator
ss = [summ_proc[c].str.contains(indicator) for c in available_columns]
# reduce without '|' by using 'np.logical_or'
indexer = np.logical_or.reduce(ss)
result = summ_proc[indexer]
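A self-contained version of that answer, with a toy frame in which one of the candidate columns is missing; the na=False guard is my addition, to keep NaN cells from poisoning the combined mask:

```python
import numpy as np
import pandas as pd

# Invented sample data -- only two of the three candidate columns exist
summ_proc = pd.DataFrame({
    "data.process.name": ["evil.exe", "calc.exe", "svchost"],
    "data.win.eventdata.processName": ["foo", "evil.dll", "bar"],
})
indicator = "evil"

columns = [
    "data.process.name",
    "data.win.eventdata.processName",
    "data.win.eventdata.logonProcessName",  # not present -> silently skipped
]
available = [c for c in columns if c in summ_proc.columns]
masks = [summ_proc[c].str.contains(indicator, na=False) for c in available]
result = summ_proc[np.logical_or.reduce(masks)]
print(result.index.tolist())  # [0, 1]
```

One caveat worth knowing: if none of the candidate columns is present, masks is empty and np.logical_or.reduce will fail, so guard that case if it can occur in your data.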

Save a pandas dataframe inside another dataframe

I have a tricky case. Can't wrap my head around it.
I have a pandas dataframe like below:
In [3]: df = pd.DataFrame({'stat_101':[31937667515, 47594388534, 43568256234], 'group_id_101':[1,1,1], 'level_101':[1,2,2], 'stat_102':['00005#60-78','00005#60-78','00005#60-78'], 'avg_104':[27305.34552, 44783.49401, 22990.77442]})
In [4]: df
Out[4]:
stat_101 group_id_101 level_101 stat_102 avg_104
0 31937667515 1 1 00005#60-78 27305.34552
1 47594388534 1 2 00005#60-78 44783.49401
2 43568256234 1 2 00005#60-78 22990.77442
I want to group this on 'group_id_101','stat_102' columns and create another dataframe which will be storing the result of the grouped dataframe inside it.
Expected output:
In [27]: res = pd.DataFrame({'new_stat_101':[1], 'stat_102':['00005#60-78'], 'new_avg':['Dataframe_obj']})
In [28]: res
Out[28]:
new_stat_101 stat_102 new_avg
0 1 00005#60-78 Dataframe_obj
Where the Dataframe_obj will be another dataframe with rows like below:
stat_101 level_101 avg_104
0 31937667515 1 27305.34552
1 47594388534 2 44783.49401
2 43568256234 2 22990.77442
What is the best way to do this? Should I be saving a dataframe inside another dataframe or there's a more cleaner way of doing it?
Hope my question is clear.
Let's try
g = ['group_id_101', 'stat_102']
idx, dfs = zip(*df.groupby(g))
pd.DataFrame({'new_avg': dfs}, index=pd.MultiIndex.from_tuples(idx, names=g))
new_avg
group_id_101 stat_102
1 00005#60-78 stat_101 group_id_101 level_101 st...
"new_avg" is a column of DataFrames accessible by index.
Obligatory disclaimer: This is blatant abuse of DataFrames, you should typically not store objects that cannot take advantage of pandas vectorization.
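If avoiding that abuse is preferred, a plain dict keyed by the group labels gives the same per-group lookup without nesting frames. A sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "stat_101": [31937667515, 47594388534, 43568256234],
    "group_id_101": [1, 1, 1],
    "level_101": [1, 2, 2],
    "stat_102": ["00005#60-78"] * 3,
    "avg_104": [27305.34552, 44783.49401, 22990.77442],
})

keys = ["group_id_101", "stat_102"]
# Each group becomes one dict entry, keyed by its (group_id_101, stat_102) pair
groups = {k: sub.drop(columns=keys) for k, sub in df.groupby(keys)}

inner = groups[(1, "00005#60-78")]
print(list(inner.columns))  # ['stat_101', 'level_101', 'avg_104']
```

The inner frames are ordinary DataFrames, so downstream code can still use all pandas operations on them, and nothing has to be stored as an opaque object inside a cell.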

Using Python, Pandas and Apply/Lambda, how can I write a function which creates multiple new columns?

Apologies for the messy title; the problem is as follows:
I have some data frame of the form:
df1 =
Entries
0 "A Level"
1 "GCSE"
2 "BSC"
I also have a data frame of the form:
df2 =
Secondary Undergrad
0 "A Level" "BSC"
1 "GCSE" "BA"
2 "AS Level" "MSc"
I have a function which searches each entry in df1 for the words in each column of df2. The words that match are saved in Words_Present:
def word_search(df, group, words):
    Yes, No = 0, 0
    Words_Present = []
    for i in words:
        match_object = re.search(i, df)
        if match_object:
            Words_Present.append(i)
            Yes = 1
        else:
            No = 0
    if Yes == 1:
        Attribute = 1
    return Attribute
I apply this function over all entries in df1, and all columns in df2, using the following iteration:
for i in df2:
    terms = df2[i].values.tolist()
    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x, i, terms))
This yields an output df which looks something like:
df1 =
Entries Secondary undergrad
0 "A Level" 1 0
1 "GCSE" 1 0
2 "AS Level" 1 0
I want to amend the word_search function to output the Words_Present list as well as the Attribute, and put these into new columns, so that my eventual df1 array looks like:
Desired dataframe:
Entries Secondary Words Found undergrad Words Found
0 "A Level" 1 "A Level" 0
1 "GCSE" 1 "GCSE" 0
2 "AS Level" 1 "AS Level" 0
If I do:
def word_search(df, group, words):
    Yes, No = 0, 0
    Words_Present = []
    for i in words:
        match_object = re.search(i, df)
        if match_object:
            Words_Present.append(i)
            Yes = 1
        else:
            No = 0
    if Yes == 1:
        Attribute = 1
    if Yes == 0:
        Attribute = 0
    return Attribute, Words_Present
My function therefore now has multiple outputs. So applying the following:
for i in df2:
    terms = df2[i].values.tolist()
    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x, i, terms))
My output looks like this:
Entries Secondary undergrad
0 "A Level" [1,"A Level"] 0
1 "GCSE" [1, "GCSE"] 0
2 "AS Level" [1, "AS Level"] 0
The output of pd.apply() is always a pandas series, so it just shoves everything into the single cell of df[i] where i = secondary.
Is it possible to split the output of .apply into two separate columns, as shown in the desired dataframe?
I have consulted many questions, but none seem to deal directly with yielding multiple columns when the function contained within the apply statement has multiple outputs:
Applying function with multiple arguments to create a new pandas column
Create multiple columns in Pandas Dataframe from one function
Apply pandas function to column to create multiple new columns?
For example, I have also tried:
for i in df2:
    terms = df2[i].values.tolist()
    [df1[i], df1[i]+"Present"] = pd.concat([df1['course'][0:1].apply(lambda x: word_search(x, i, terms))])
but this simply yields errors such as:
raise ValueError('Length of values does not match length of ' 'index')
Is there a way to use apply, but still extract the extra information directly into multiple columns?
Many thanks, apologies for the length.
The direct answer to your question is yes: use the apply method of the DataFrame object, so you'd be doing df1.apply().
However, for this problem, and anything in pandas in general, try to vectorise rather than iterate through rows -- it's faster and cleaner.
It looks like you are trying to classify Entries into Secondary or Undergrad, and saving the keyword used to make the match. If you assume that each element of Entries has no more than one keyword match (i.e. you won't run into 'GCSE A Level'), you can do the following:
df = df1.copy()
df['secondary_words_found'] = df.Entries.str.extract('(A Level|GCSE|AS Level)', expand=False)
df['undergrad_words_found'] = df.Entries.str.extract('(BSC|BA|MSc)', expand=False)
df['secondary'] = df.secondary_words_found.notnull() * 1
df['undergrad'] = df.undergrad_words_found.notnull() * 1
EDIT:
In response to your issue with having many more categories and keywords, you can continue in the spirit of this solution by using an appropriate for loop and doing '(' + '|'.join(df2['Undergrad'].values) + ')' inside the extract method.
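That hint can be sketched end to end. The alternation patterns are built from df2's columns, so the extract calls update automatically as keywords change; the lowercased output column names are my choice, and the keywords are assumed to contain no regex metacharacters (re.escape them otherwise):

```python
import pandas as pd

df1 = pd.DataFrame({"Entries": ["A Level", "GCSE", "BSC"]})
df2 = pd.DataFrame({"Secondary": ["A Level", "GCSE", "AS Level"],
                    "Undergrad": ["BSC", "BA", "MSc"]})

df = df1.copy()
for col in df2.columns:
    # Join the keywords into one alternation pattern, e.g. '(A Level|GCSE|AS Level)'
    pattern = "(" + "|".join(df2[col]) + ")"
    # expand=False returns a Series, so it can be assigned as a column directly
    found = df.Entries.str.extract(pattern, expand=False)
    df[col.lower() + "_words_found"] = found
    df[col.lower()] = found.notnull().astype(int)

print(df["secondary"].tolist())  # [1, 1, 0]
print(df["undergrad"].tolist())  # [0, 0, 1]
```

One loop per category replaces both the hand-written patterns and the per-row apply, and it stays fully vectorised.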
However, if you have exact matches, you can do everything by a combination of pivots and joins:
keywords = df2.stack().to_frame('Entries').reset_index().drop('level_0', axis=1).rename(columns={'level_1': 'category'})
df = df1.merge(keywords, how='left')
for colname in df.category.unique():
    df[colname] = (df.category == colname) * 1  # Your indicator variable
    df.loc[df.category == colname, colname + '_words_found'] = df.loc[df.category == colname, 'Entries']
The first line 'pivots' your table of keywords into a 2-column dataframe of keywords and categories. Your keyword column must be the same as the column in df1; in SQL, this would be called the foreign key that you are going to join these tables on.
Also, you generally want to avoid having duplicate indexes or columns, which in your case, was Words Found in the desired dataframe!
For the sake of completeness, if you insisted on using the apply method, you would iterate over each row of the DataFrame; your code would look something like this:
secondary_words = df2.Secondary.values
undergrad_words = df2.Undergrad.values

def classify(s):
    if s.Entries in secondary_words:
        return pd.Series({'Entries': s.Entries, 'Secondary': 1, 'secondary_words_found': s.Entries, 'Undergrad': 0, 'undergrad_words_found': ''})
    elif s.Entries in undergrad_words:
        return pd.Series({'Entries': s.Entries, 'Secondary': 0, 'secondary_words_found': '', 'Undergrad': 1, 'undergrad_words_found': s.Entries})
    else:
        return pd.Series({'Entries': s.Entries, 'Secondary': 0, 'secondary_words_found': '', 'Undergrad': 0, 'undergrad_words_found': ''})

df1.apply(classify, axis=1)
This second version will only work in the cases you want it to if the element in Entries is exactly the same as its corresponding element in df2. You certainly don't want to do this, as it's messier, and will be noticeably slower if you have a lot of data to work with.

Creating a New Pandas Grouped Object

In some transformations, I seem to be forced to break out of the Pandas grouped-dataframe object, and I would like a way to return to that object.
Given a dataframe of time series data, grouping by one of the columns effectively gives a mapping from key to sub-dataframe.
Once forced to materialise this as a Python dict, the structure cannot be converted back into a DataFrame using .from_dict(), because the values are dataframes rather than scalars.
The only way back to Pandas without some hacky column renaming is, to my knowledge, to convert it back into a grouped object.
Is there any way to do this?
If not, how would I convert a dictionary of instance to dataframe back into a Pandas datastructure?
EDIT ADDING SAMPLE::
rng = pd.date_range('1/1/2000', periods=10, freq='10m')
df = pd.DataFrame({'a': pd.Series(np.random.randn(len(rng)), index=rng), 'b': pd.Series(np.random.randn(len(rng)), index=rng)})
# now we have a dataframe with 'a's and 'b's in a time series
df_dict = {}
for k, v in df.groupby('a'):
    df_dict[k] = v
# now we apply some transformation that cannot be done via aggregate, transform, or apply
# how do we get this back into a groupby object?
If I understand OP's question correctly, you want to group a dataframe by some key(s), do different operations on each group (possibly generating new columns, etc.) and then go back to the original dataframe.
Modifying your example (grouping by random integers instead of floats, which are usually unique):
np.random.seed(200)
rng = pd.date_range('1/1/2000', periods=10, freq='10m')
df = pd.DataFrame({'a':pd.Series(np.random.randn(len(rng)), index=rng), 'b':pd.Series(np.random.randn(len(rng)), index=rng)})
df['group'] = np.random.randint(3,size=(len(df)))
Usually, if I need a single value for each column per group, I'll do this (for example, sum of 'a', mean of 'b'):
In [10]: df.groupby('group').aggregate({'a':np.sum, 'b':np.mean})
Out[10]:
a b
group
0 -0.214635 -0.319007
1 0.711879 0.213481
2 1.111395 1.042313
[3 rows x 2 columns]
However, if I need a series for each group,
In [19]: def func(sub_df):
   ....:     sub_df['c'] = sub_df['a'] * sub_df['b'].shift(1)
   ....:     return sub_df
   ....:
In [20]: df.groupby('group').apply(func)
Out[20]:
a b group c
2000-01-31 -1.450948 0.073249 0 NaN
2000-11-30 1.910953 1.303286 2 NaN
2001-09-30 0.711879 0.213481 1 NaN
2002-07-31 -0.247738 1.017349 2 -0.322874
2003-05-31 0.361466 1.911712 2 0.367737
2004-03-31 -0.032950 -0.529672 0 -0.002414
2005-01-31 -0.221347 1.842135 2 -0.423151
2005-11-30 0.477257 -1.057235 0 -0.252789
2006-09-30 -0.691939 -0.862916 2 -1.274646
2007-07-31 0.792006 0.237631 0 -0.837336
[10 rows x 4 columns]
I'm guessing you want something like the second example, but the original question wasn't very clear even with your example.
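The second example can be reduced to a self-contained sketch; passing group_keys=False keeps the original index, so the result drops straight back into place as an ordinary un-grouped frame:

```python
import numpy as np
import pandas as pd

np.random.seed(200)
df = pd.DataFrame({"a": np.random.randn(10), "b": np.random.randn(10)})
df["group"] = np.random.randint(3, size=len(df))

def func(sub):
    sub = sub.copy()
    # Per-group lagged product: shift() restarts at each group boundary,
    # so the first row of every group gets NaN in 'c'
    sub["c"] = sub["a"] * sub["b"].shift(1)
    return sub

out = df.groupby("group", group_keys=False).apply(func)
# One NaN in 'c' per group -- the first row of each group
print(out["c"].isna().sum() == df["group"].nunique())  # True
```

This is the usual way to "return to the original dataframe" after per-group work: let apply do the splitting and reassembly rather than maintaining a dict of sub-frames by hand.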
