pandas - remove specific sequence from column - python

I want to remove specific sequences from my column, because they appear a lot and don't add much extra information. The database consists of edges between nodes; in this case, there is an edge between node 1 and node 1, node 1 and node 2, node 2 and node 3, and so on.
However, the edge 1-5 occurs around 80,000 times in the real database. I want to filter those out, keeping only the 'not so common' interactions.
Let's say my dataframe looks like this:
>>> datatry
   num  line
0    1    56
1    1    90
2    2    66
3    3     4
4    1    23
5    5    22
6    3   144
7    5    33
What I have so far removes a value that repeats itself consecutively:
c1 = datatry['num'].eq(1)                        # rows whose value is 1
c2 = datatry['num'].eq(datatry['num'].shift(1))  # rows equal to the previous row
datatry2 = datatry[(c1 & ~c2) | ~c1]
How could I alter the code above (which removes all the rows that repeat the integer 1 and keeps only the first row with the value 1) so that it removes all rows that form a specific sequence, for example a 1 followed by a 5? In that case, I want to remove both the row with value 1 and the row with value 5 that appear in that sequence. My end result would ideally be:
>>> datatry
   num  line
0    1    56
1    1    90
2    2    66
3    3     4
4    3   144
5    5    33

Here is one way:
import numpy as np
import pandas as pd

def find_drops(seq, df):
    """Return a boolean Series marking every row that belongs to an occurrence of `seq`."""
    if seq:
        # True where the full sequence starts at this row.
        m = np.logical_and.reduce([df.num.shift(-i).eq(seq[i]) for i in range(len(seq))])
        if len(seq) == 1:
            return pd.Series(m, index=df.index)
        else:
            # Extend each match forward so every row of the sequence is flagged.
            return (pd.Series(m, index=df.index)
                      .replace({False: np.nan})
                      .ffill(limit=len(seq)-1)
                      .fillna(False))
    else:
        return pd.Series(False, index=df.index)
find_drops([1], df)
#0 True
#1 True
#2 False
#3 False
#4 True
#5 False
#6 False
#7 False
#dtype: bool
find_drops([1,1,2,3], df)
#0 True
#1 True
#2 True
#3 True
#4 False
#5 False
#6 False
#7 False
#dtype: bool
Then just use those Series to slice: df[~find_drops([1, 5], df)]
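For completeness, a minimal end-to-end sketch with the sample data from the question (assuming find_drops above is defined):

datatry = pd.DataFrame({'num':  [1, 1, 2, 3, 1, 5, 3, 5],
                        'line': [56, 90, 66, 4, 23, 22, 144, 33]})
datatry2 = datatry[~find_drops([1, 5], datatry)].reset_index(drop=True)
#    num  line
# 0    1    56
# 1    1    90
# 2    2    66
# 3    3     4
# 4    3   144
# 5    5    33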

Did you look at duplicated? It has a default of keep='first', so to keep only the first occurrence of each value you can simply do:
datatry.loc[~datatry['num'].duplicated(), :]
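Negating the duplicated mask keeps the first occurrence of each value; note that duplicated flags repeats anywhere in the column, not only consecutive ones (my own illustration on the sample frame above):
>>> datatry.loc[~datatry['num'].duplicated(), :]
   num  line
0    1    56
2    2    66
3    3     4
5    5    22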

Related

Create a column counting the consecutive True values on a multi-index

Let df be a dataframe of boolean values with a two-column index. For every id I want to calculate the running count of consecutive True values, resetting at each False. For example, this is how it would look in this specific case:
         value  consecutive
id Week
1  1      True            1
1  2      True            2
1  3     False            0
1  4      True            1
1  5      True            2
2  1     False            0
2  2     False            0
2  3      True            1
This is my solution:
def func(id, week):
    M = df.loc[id]
    M = df.loc[id][:week+1]
    consecutive_list = list()
    S = 0
    for index, row in M.iterrows():
        if row['value']:
            S += 1
        else:
            S = 0
        consecutive_list.append(S)
    return consecutive_list[-1]
Then we generate the column "consecutive" as a list in the following way:
Consecutive_list = list()
for k in df.index:
    id = k[0]
    week = k[1]
    Consecutive_list.append(func(id, week))
df['consecutive'] = Consecutive_list
I would like to know if there is a more Pythonic way to do this.
EDIT: I wrote the "consecutive" column in order to show what I expect this to be.
If you are trying to add the consecutive column to the df, this should work:
df.assign(consecutive = df['value'].groupby(df['value'].diff().ne(0).cumsum()).cumsum())
Output:
     value  consecutive
1 a   True            1
  b   True            2
2 a  False            0
  b   True            1
3 a   True            2
  b  False            0
4 a  False            0
  b   True            1
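If the count must also reset at every id boundary even when the value does not change there (the one-liner above relies on the value changing between ids), a variant of the same idea groups on both the id level and the run identifier; a sketch of my own using the question's example data:

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (2, 1), (2, 2), (2, 3)],
    names=['id', 'Week'])
df = pd.DataFrame({'value': [True, True, False, True, True, False, False, True]},
                  index=idx)

ids = df.index.get_level_values('id')
runs = (df['value'] != df['value'].shift()).cumsum()  # increments at every change in 'value'
df['consecutive'] = df['value'].astype(int).groupby([ids, runs]).cumsum()
print(df)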

For each unique value in a pandas DataFrame column, how can I randomly select a proportion of rows?

Python newbie here.
Imagine a csv file that looks something like this:
(...except that in real life, there are 20 distinct names in the Person column, and each Person has 300-500 rows. Also, there are multiple data columns, not just one.)
What I want to do is randomly flag 10% of each Person's rows and mark this in a new column. I came up with a ridiculously convoluted way to do this--it involved creating a helper column of random numbers and all sorts of unnecessarily complicated jiggery-pokery. It worked, but was crazy. More recently, I came up with this:
import pandas as pd

df = pd.read_csv('source.csv')
df['selected'] = ''
names = list(df['Person'].unique())  # gets list of unique names
for name in names:
    df_temp = df[df['Person'] == name]
    samp = int(len(df_temp) / 10)  # I want to sample 10% for each name
    df_temp = df_temp.sample(samp)
    df_temp['selected'] = 'bingo!'  # a new column to mark the rows I've randomly selected
    df = df.merge(df_temp, how='left', on=['Person', 'data'])
    df['temp'] = [f"{a} {b}" for a, b in zip(df['selected_x'], df['selected_y'])]
    # Note: initially instead of the line above, I tried the line below, but it didn't work too well:
    # df['temp'] = df['selected_x'] + df['selected_y']
    df = df[['Person', 'data', 'temp']]
    df = df.rename(columns={'temp': 'selected'})
    df['selected'] = df['selected'].str.replace('nan', '').str.strip()  # cleans up the column
As you can see, essentially I'm pulling out a temporary DataFrame for each Person, using DF.sample(number) to do the randomising, then using DF.merge to get the 'marked' rows back into the original DataFrame. And it involved iterating through a list to create each temporary DataFrame...and my understanding is that iterating is kind of lame.
There's got to be a more Pythonic, vectorising way to do this, right? Without iterating. Maybe something involving groupby? Any thoughts or advice much appreciated.
EDIT: Here's another way that avoids merge...but it's still pretty clunky:
import pandas as pd
import numpy as np
import math

# SETUP TEST DATA:
y = ['Alex'] * 2321 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data': z})
df = df.sample(frac=1)  # shuffle (optional--just to show order doesn't matter)
percent = 10  # CHANGE AS NEEDED

# Add a 'helper' column with random numbers
df['rand'] = np.random.random(df.shape[0])
df = df.sample(frac=1)  # this shuffles data, just to show order doesn't matter

# CREATE A HELPER LIST: one [person, row count] entry per person
helper = df.groupby('persons')['rand'].count().reset_index().values.tolist()
for row in helper:
    df_temp = df[df['persons'] == row[0]][['persons', 'rand']]
    lim = math.ceil(len(df_temp) * percent * 0.01)
    row.append(df_temp.nlargest(lim, 'rand').iloc[-1][1])  # cutoff value for this person

def flag(name, num):
    for row in helper:
        if row[0] == name:
            if num >= row[2]:
                return 'yes'
            else:
                return 'no'

df['flag'] = df.apply(lambda x: flag(x['persons'], x['rand']), axis=1)
You could use groupby.sample (available in pandas 1.1+), either to pick out a sample of the whole dataframe for further processing, or to identify rows of the dataframe to mark if that's more convenient.
import pandas as pd
percentage_to_flag = 0.5
# Toy data: 8 rows, persons A and B.
df = pd.DataFrame(data={'persons':['A']*4 + ['B']*4, 'data':range(8)})
# persons data
# 0 A 0
# 1 A 1
# 2 A 2
# 3 A 3
# 4 B 4
# 5 B 5
# 6 B 6
# 7 B 7
# Pick out random sample of dataframe.
random_state = 41 # Change to get different random values.
df_sample = df.groupby("persons").sample(frac=percentage_to_flag,
                                         random_state=random_state)
# persons data
# 1 A 1
# 2 A 2
# 7 B 7
# 6 B 6
# Mark the random sample in the original dataframe.
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
# persons data marked
# 0 A 0 False
# 1 A 1 True
# 2 A 2 True
# 3 A 3 False
# 4 B 4 False
# 5 B 5 False
# 6 B 6 True
# 7 B 7 True
If you really do not want the sub-sampled dataframe df_sample you can go straight to marking a sample of the original dataframe:
# Mark random sample in original dataframe with minimal intermediate data.
df["marked2"] = False
df.loc[df.groupby("persons")["data"].sample(frac=percentage_to_flag,
                                            random_state=random_state).index,
       "marked2"] = True
# persons data marked marked2
# 0 A 0 False False
# 1 A 1 True True
# 2 A 2 True True
# 3 A 3 False False
# 4 B 4 False False
# 5 B 5 False False
# 6 B 6 True True
# 7 B 7 True True
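A quick sanity check on the per-group flag rate (my own addition, using the frame built above):

# Fraction of rows marked within each group; should be close to percentage_to_flag.
print(df.groupby("persons")["marked"].mean())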
If I understood you correctly, you can achieve this using:
df = pd.DataFrame(data={'persons':['A']*10 + ['B']*10, 'col_1':[2]*20})
percentage_to_flag = 0.5
a = df.groupby(['persons'])['col_1'].apply(lambda x: pd.Series(x.index.isin(x.sample(frac=percentage_to_flag, random_state= 5, replace=False).index))).reset_index(drop=True)
df['flagged'] = a
Input:
persons col_1
0 A 2
1 A 2
2 A 2
3 A 2
4 A 2
5 A 2
6 A 2
7 A 2
8 A 2
9 A 2
10 B 2
11 B 2
12 B 2
13 B 2
14 B 2
15 B 2
16 B 2
17 B 2
18 B 2
19 B 2
Output with 50% flagged rows in each group:
persons col_1 flagged
0 A 2 False
1 A 2 False
2 A 2 True
3 A 2 False
4 A 2 True
5 A 2 True
6 A 2 False
7 A 2 True
8 A 2 False
9 A 2 True
10 B 2 False
11 B 2 False
12 B 2 True
13 B 2 False
14 B 2 True
15 B 2 True
16 B 2 False
17 B 2 True
18 B 2 False
19 B 2 True
This is TMBailey's answer, tweaked so it works in my Python version. (Didn't want to edit someone else's answer, but if I'm doing it wrong I'll take this down.) This works really great and really fast!
EDIT: I've updated this based on an additional suggestion by TMBailey to replace frac=percentage_to_flag with n=math.ceil(percentage_to_flag * len(x)). This ensures that rounding doesn't pull the sampled percentage under the 'percentage_to_flag' threshold. (For what it's worth, you can replace it with frac=(math.ceil(percentage_to_flag * len(x)))/len(x) too.)
import pandas as pd
import math
percentage_to_flag = .10
# Toy data:
y = ['Alex'] * 2321 + ['Eddie'] * 876 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
z = ['xyz'] * len(y)
df = pd.DataFrame({'persons': y, 'data' : z})
df = df.sample(frac = 1) #optional shuffle, just to show order doesn't matter
# Pick out random sample of dataframe.
random_state = 41 # Change to get different random values.
df_sample = df.groupby("persons").apply(lambda x: x.sample(n=(math.ceil(percentage_to_flag * len(x))),random_state=random_state))
#had to use lambda in line above
df_sample = df_sample.reset_index(level=0, drop=True) #had to add this to simplify multi-index DF
# Mark the random sample in the original dataframe.
df["marked"] = False
df.loc[df_sample.index, "marked"] = True
And then to check:
pp = df.pivot_table(index="persons", columns="marked", values="data", aggfunc='count', fill_value=0)
pp.columns = ['no','yes']
pp = pd.concat([pp, pp.sum().rename('Total').to_frame().T]).assign(Total=lambda d: d.sum(axis=1))
pp['% selected'] = 100 * pp.yes/pp.Total
print(pp)
OUTPUT:
no yes Total % selected
persons
Alex 2088 233 2321 10.038776
Bob 8352 929 9281 10.009697
Chuck 1810 202 2012 10.039761
Doug 30710 3413 34123 10.002051
Eddie 788 88 876 10.045662
Total 43748 4865 48613 10.007611
Works like a charm.
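For reference, a fully vectorised variant that skips apply and the intermediate df_sample entirely (a sketch of my own, not from the answers above): rank a random key within each group and flag the rows whose rank falls inside the requested percentage.

import numpy as np
import pandas as pd

percentage_to_flag = 0.10
rng = np.random.default_rng(41)

y = ['Alex'] * 2321 + ['Eddie'] * 876 + ['Doug'] * 34123 + ['Chuck'] * 2012 + ['Bob'] * 9281
df = pd.DataFrame({'persons': y, 'data': ['xyz'] * len(y)})

# Random key per row, ranked within each person; the smallest ranks form the sample.
rand_key = pd.Series(rng.random(len(df)), index=df.index)
rank_in_group = rand_key.groupby(df['persons']).rank(method='first')
group_size = df.groupby('persons')['persons'].transform('size')
df['marked'] = rank_in_group <= np.ceil(percentage_to_flag * group_size)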

Need to do eval on pandas dataframe at row level

I have a scenario where my pandas data frame has a condition stored as a string, which I need to evaluate and store the result as a different column. The example below will help you understand better:
Existing DataFrame:
ID Val Cond
1 5 >10
1 15 >10
Expected DataFrame:
ID Val Cond Result
1 5 >10 False
1 15 >10 True
As you can see, I need to concatenate Val and Cond and do eval at row level.
If your conditions are formed from the basic operations (<, <=, ==, !=, >, >=), then we can do this more efficiently using getattr. We use .str.extract to parse the condition and separate the comparison and the value. Using our dictionary we map the comparison to the Series attributes that we can then call for each unique comparison separately in a simple groupby.
import pandas as pd
print(df)
   ID  Val  Cond
0   1    5   >10
1   1   15   >10
2   1   20  ==20
3   1   25  <=25
4   1   26  <=25
5   1   10  !=10
# All operations we might have.
d = {'>': 'gt', '<': 'lt', '>=': 'ge', '<=': 'le', '==': 'eq', '!=': 'ne'}
# Create a DataFrame with the LHS value, comparator, RHS value
tmp = pd.concat([df['Val'],
                 df['Cond'].str.extract(r'(.*?)(\d+)').rename(columns={0: 'cond', 1: 'comp'})],
                axis=1)
tmp[['Val', 'comp']] = tmp[['Val', 'comp']].apply(pd.to_numeric)
# Val cond comp
#0 5 > 10
#1 15 > 10
#2 20 == 20
#3 25 <= 25
#4 26 <= 25
#5 10 != 10
# Aligns on row Index
df['Result'] = pd.concat([getattr(gp['Val'], d[idx])(gp['comp'])
                          for idx, gp in tmp.groupby('cond')])
# ID Val Cond Result
#0 1 5 >10 False
#1 1 15 >10 True
#2 1 20 ==20 True
#3 1 25 <=25 True
#4 1 26 <=25 False
#5 1 10 !=10 False
A simple, but inefficient and dangerous, alternative is to eval on each row, building a string from your condition. eval is dangerous because it can execute arbitrary code, so only use it if you truly trust and know the data.
df['Result'] = df.apply(lambda x: eval(str(x.Val) + x.Cond), axis=1)
# ID Val Cond Result
#0 1 5 >10 False
#1 1 15 >10 True
#2 1 20 ==20 True
#3 1 25 <=25 True
#4 1 26 <=25 False
#5 1 10 !=10 False
You can also do something like this:
df["Result"] = [eval(x + y) for x, y in zip(df["Val"].astype(str), df["Cond"]]
Make the "Result" column by concatenating the strings df["Val"] and df["Cond"], then applying eval to that.

Finding contiguous, non-unique slices in Pandas series without iterating

I'm trying to parse a logfile of our manufacturing process. Most of the time the process is run automatically but occasionally, the engineer needs to switch into manual mode to make some changes and then switches back to automatic control by the reactor software. When set to manual mode the logfile records the step as being "MAN.OP." instead of a number. Below is a representative example.
steps = [1,2,2,'MAN.OP.','MAN.OP.',2,2,3,3,'MAN.OP.','MAN.OP.',4,4]
ser_orig = pd.Series(steps)
which results in
0 1
1 2
2 2
3 MAN.OP.
4 MAN.OP.
5 2
6 2
7 3
8 3
9 MAN.OP.
10 MAN.OP.
11 4
12 4
dtype: object
I need to detect the 'MAN.OP.' and make them distinct from each other. In this example, the two regions with values == 2 should be one region after detecting the manual mode section like this:
0 1
1 2
2 2
3 Manual_Mode_0
4 Manual_Mode_0
5 2
6 2
7 3
8 3
9 Manual_Mode_1
10 Manual_Mode_1
11 4
12 4
dtype: object
I have code that iterates over this series and produces the correct result when the series is passed to my object. The setter is:
@step_series.setter
def step_series(self, ss):
    """
    On assignment, give the manual mode steps a unique name. Leave
    the steps done on recipe the same.
    """
    manual_mode = "MAN.OP."
    new_manual_mode_text = "Manual_Mode_{}"
    counter = 0
    continuous = False
    for i in ss.index:
        if continuous and ss.at[i] != manual_mode:
            continuous = False
            counter += 1
        elif not continuous and ss.at[i] == manual_mode:
            continuous = True
            ss.at[i] = new_manual_mode_text.format(str(counter))
        elif continuous and ss.at[i] == manual_mode:
            ss.at[i] = new_manual_mode_text.format(str(counter))
    self._step_series = ss
but this iterates over the entire dataframe and is the slowest part of my code other than reading the logfile over the network.
How can I detect these non-unique sections and rename them uniquely without iterating over the entire series? The series is a column selection from a larger dataframe so adding extra columns is fine if needed.
For the completed answer I ended up with:
@step_series.setter
def step_series(self, ss):
    pd.options.mode.chained_assignment = None
    manual_mode = "MAN.OP."
    new_manual_mode_text = "Manual_Mode_{}"
    newManOp = (ss == 'MAN.OP.') & (ss != ss.shift())
    ss[ss == 'MAN.OP.'] = 'Manual_Mode_' + (newManOp.cumsum() - 1).astype(str)
    self._step_series = ss
Here's one way:
steps = [1,2,2,'MAN.OP.','MAN.OP.',2,2,3,3,'MAN.OP.','MAN.OP.',4,4]
steps = pd.Series(steps)
newManOp = (steps=='MAN.OP.') & (steps != steps.shift())
steps[steps=='MAN.OP.'] += newManOp.cumsum().astype(str)
>>> steps
0 1
1 2
2 2
3 MAN.OP.1
4 MAN.OP.1
5 2
6 2
7 3
8 3
9 MAN.OP.2
10 MAN.OP.2
11 4
12 4
dtype: object
To get the exact format you listed (starting from zero instead of one, and changing from "MAN.OP." to "Manual_mode_"), just tweak the last line:
steps[steps=='MAN.OP.'] = 'Manual_Mode_' + (newManOp.cumsum()-1).astype(str)
>>> steps
0 1
1 2
2 2
3 Manual_Mode_0
4 Manual_Mode_0
5 2
6 2
7 3
8 3
9 Manual_Mode_1
10 Manual_Mode_1
11 4
12 4
dtype: object
There is a pandas enhancement request for contiguous groupby, which would make this type of task simpler.
There is a function in matplotlib, mlab.contiguous_regions, that takes a boolean array and returns a list of (start, end) pairs. Each pair represents a contiguous region where the input is True.
import matplotlib.mlab as mlab

manual_mode = "MAN.OP."
new_manual_mode_text = "Manual_Mode_{}"

regions = mlab.contiguous_regions(ser_orig == manual_mode)
for i, (start, end) in enumerate(regions):
    ser_orig[start:end] = new_manual_mode_text.format(i)

ser_orig
0 1
1 2
2 2
3 Manual_Mode_0
4 Manual_Mode_0
5 2
6 2
7 3
8 3
9 Manual_Mode_1
10 Manual_Mode_1
11 4
12 4
dtype: object
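If matplotlib is not available, the same (start, end) pairs can be derived directly in pandas; a sketch of my own, using the sample series from the question:

import pandas as pd

steps = [1, 2, 2, 'MAN.OP.', 'MAN.OP.', 2, 2, 3, 3, 'MAN.OP.', 'MAN.OP.', 4, 4]
ser_orig = pd.Series(steps)

mask = ser_orig == 'MAN.OP.'
# A block starts where the mask turns True and ends where it turns back to False.
starts = ser_orig.index[mask & ~mask.shift(fill_value=False)]
ends = ser_orig.index[mask & ~mask.shift(-1, fill_value=False)] + 1
regions = list(zip(starts, ends))  # [(3, 5), (9, 11)]

for i, (start, end) in enumerate(regions):
    ser_orig[start:end] = 'Manual_Mode_{}'.format(i)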

creating histograms in pandas

I'm trying to create a histogram based on the following groupby,
dfm.groupby(['ID', 'Readings', 'Condition']).size():
578871001 20110603 True 1
20110701 True 1
20110803 True 1
20110901 True 1
20110930 True 1
..
324461897 20130214 False 1
20130318 False 1
20130416 False 1
20130516 False 1
20130617 False 1
532674350 20110616 False 1
20110718 False 1
20110818 False 1
20110916 False 1
20111017 False 1
20111115 False 1
20111219 False 1
However, I'm trying to format the output by Condition, counting how many IDs have each number of Readings. Something like this:
True
# of Readings: # of ID
1 : 5
2 : 8
3 : 15
4 : 10
5 : 4
I've tried grouping just by ID and Readings, and transforming by Condition, but have not gotten very far.
Edit:
This is what the dataframe looked like before the groupby:
CustID Condtion Month Reading Consumption
0 108000601 True June 20110606 28320.0
1 108007000 True July 20110705 13760.0
2 108007000 True August 20110804 16240.0
3 108008000 True September 20110901 12560.0
4 108008000 True October 20111004 12400.0
5 108000601 False November 20111101 9440.0
6 108090000 False December 20111205 12160.0
Is this what you are trying to achieve with your groupby? I've included Counter to tally how many customers have each number of readings. For example, for Condtion = False, there are two CustIDs with a single reading, so the output of the first row is:
Condtion
False 1 2 # One reading, two observations of one reading.
Then, for Condtion = True, there is one customer with one reading (108000601) and two customers with two readings each. The output for this group is:
Condtion
True 1 1 # One customer with one reading.
2 2 # Two customers with two readings each.
from collections import Counter
gb = df.groupby(['Condtion', 'CustID'], as_index=False).Reading.count()
>>> gb
Condtion CustID Reading
0 False 108000601 1
1 False 108090000 1
2 True 108000601 1
3 True 108007000 2
4 True 108008000 2
>>> gb.groupby('Condtion').Reading.apply(lambda group: Counter(group))
Condtion
False 1 2
True 1 1
2 2
dtype: float64
Or, chained together as a single statement:
gb = (df
      .groupby(['Condtion', 'CustID'], as_index=False)['Reading']
      .count()
      .groupby('Condtion')['Reading']
      .apply(lambda group: Counter(group)))
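For what it's worth, the same tally can also be produced without Counter by chaining value_counts (a sketch of my own, assuming the same df):

readings_per_customer = df.groupby(['Condtion', 'CustID'])['Reading'].count()
# For each Condtion, count how many customers have each number of readings.
hist = readings_per_customer.groupby('Condtion').value_counts().sort_index()
print(hist)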
