my data cleaning script is slow, any ideas on how to improve?

my data cleaning script is slow, any ideas on how to improve? - python

I have a Data(csv format) where the first column is an epoch timestamp(strictly increasing) and the other columns are cumulative rows(just increasing or equal).
Sample is as below:
df = pandas.DataFrame([[1515288240, 100, 50, 90, 70],[1515288241, 101, 60, 95, 75],[1515288242, 110, 70, 100, 80],[1515288239, 110, 70, 110, 85],[1515288241, 110, 75, 110, 85],[1515288243,110,70,110,85]],columns =['UNIX_TS','A','B','C','D'])
df =
id UNIX_TS A B C D
0 1515288240 100 50 90 70
1 1515288241 101 60 95 75
2 1515288242 110 70 100 80
3 1515288239 110 70 110 85
4 1515288241 110 75 110 85
5 1515288243 110 70 110 85
import pandas as pd
def clean(df,column_name,equl):
i=0
while(df.shape[0]-2>=i):
if df[column_name].iloc[i]>df[column_name].iloc[i+1]:
df.drop(df[column_name].iloc[[i+1]].index,inplace=True)
continue
elif df[column_name].iloc[i]==df[column_name].iloc[i+1] and equl==1:
df.drop(df[column_name].iloc[[i+1]].index,inplace=True)
continue
i+=1
clean(df,'UNIX_TS',1)
for col in df.columns[1:]:
clean(df,col,0)
df =
id UNIX_TS A B C D
0 1515288240 100 50 90 70
1 1515288241 101 60 95 75
2 1515288242 110 70 100 80
My script works as intended but its too slow, anybody has ideas about how to improve its speed.
I wrote a script to remove all the invalid data based on 2 rules:
Unix_TS must be strictly increasing(because its a time, it cannot flow back or pause),
other columns are increasing and can be constant for example is in one row it is 100 and the next row it can be >=100 but not less.
Based on the rules the index 3 & 4 are invalid because unix_ts 1515288239 is 1515288241 are less than the index 2.
index 5 is wrong because the value B decreased

IIUC, can use
cols = ['A', 'B', 'C', 'D']
mask_1 = df['UNIX_TS'] > df['UNIX_TS'].cummax().shift().fillna(0)
mask_2 = mask_2 = (df[cols] >= df[cols].cummax().shift().fillna(0)).all(1)
df[mask_1 & mask_2]
Outputs
UNIX_TS A B C D
0 1515288240 100 50 90 70
1 1515288241 101 60 95 75
2 1515288242 110 70 100 80

Related

How to name the column when using value_count function in pandas?

I was counting the no of occurrence of angle and dist by the code below:
g = new_df.value_counts(subset=['Current_Angle','Current_dist'] ,sort = False)
the output:
current_angle current_dist 0
-50 30 1
-50 40 2
-50 41 6
-50 45 4
try1:
g.columns = ['angle','Distance','count','Percentage Missed'] - result was no change in the name of column
try2:
When I print the columns using print(g.columns) ended with error AttributeError: 'Series' object has no attribute 'columns'
I want to rename the column 0 as count and add a new column to the dataframe g as percent missed which is calculated by 100 - value in column 0
Expected output
current_angle current_dist count percent missed
-50 30 1 99
-50 40 2 98
-50 41 6 94
-50 45 4 96
1:How to modify the code? I mean instead of value_counts, is there any other function that can give the expected output?
2. How to get the expected output with the current method?
EDIT 1(exceptional case)
data:
angle
distance
velocity
0
124
-3
50
24
-25
50
34
25
expected output:
count is calculated based on distance
angle
distance
velocity
count
percent missed
0
124
-3
1
99
50
24
-25
1
99
50
34
25
1
99

First add Series.reset_index, because DataFrame.value_counts return Series, so possible use parameter name for change column 0 to count column and then subtract 100 to new column by Series.rsub for subtract from right side like 100 - df['count']:
df = (new_df.value_counts(subset=['Current_Angle','Current_dist'] ,sort = False)
.reset_index(name='count')
.assign(**{'percent missed': lambda x: x['count'].rsub(100)}))
Or if need also set new columns names use DataFrame.set_axis:
df = (new_df.value_counts(subset=['Current_Angle','Current_dist'] ,sort = False)
.reset_index(name='count')
.set_axis(['angle','Distance','count'], axis=1)
.assign(**{'percent missed': lambda x: x['count'].rsub(100)}))
If need assign new columns names here is alternative solution:
df = (new_df.value_counts(subset=['Current_Angle','Current_dist'] ,sort = False)
.reset_index())
df.columns = ['angle','Distance','count']
df['percent missed'] = df['count'].rsub(100)

Assuming a DataFrame as input (if not reset_index first), simply use rename and a subtraction:
df = df.rename(columns={'0': 'count'}) # assuming string '0' here, else use 0
df['percent missed'] = 100 - df['count']
output:
current_angle current_dist count percent missed
0 -50 30 1 99
1 -50 40 2 98
2 -50 41 6 94
3 -50 45 4 96
alternative: using groupby.size:
(new_df
.groupby(['current_angle','current_dist']).size()
.reset_index(name='count')
.assign(**{'percent missed': lambda d: 100-d['count']})
)
output:
current_angle current_dist count percent missed
0 -50 30 1 99
1 -50 40 2 98
2 -50 41 6 94
3 -50 45 4 96

determine the range of a value using a look up table

I have a df with numbers:
numbers = pd.DataFrame(columns=['number'], data=[
50,
65,
75,
85,
90
])
and a df with ranges (look up table):
ranges = pd.DataFrame(
columns=['range','range_min','range_max'],
data=[
['A',90,100],
['B',85,95],
['C',70,80]
]
)
I want to determine what range (in second table) a value (in the first table) falls in. Please note ranges overlap, and limits are inclusive.
Also please note the vanilla dataframe above has 3 ranges, however this dataframe gets generated dynamically. It could have from 2 to 7 ranges.
Desired result:
numbers = pd.DataFrame(columns=['number','detected_range'], data=[
[50,'out_of_range'],
[65, 'out_of_range'],
[75,'C'],
[85,'B'],
[90,'overlap'] * could be A or B *
])
I solved this with a for loop but this doesn't scale well to a big dataset I am using. Also code is too extensive and inelegant. See below:
numbers['detected_range'] = nan
for i, row1 in number.iterrows():
for j, row2 in ranges.iterrows():
if row1.number<row2.range_min and row1.number>row2.range_max:
numbers.loc[i,'detected_range'] = row1.loc[j,'range']
else if (other cases...):
...and so on...
How could I do this?

You can use a bit of numpy vectorial operations to generate masks, and use them to select your labels:
import numpy as np
a = numbers['number'].values # numpy array of numbers
r = ranges.set_index('range') # dataframe of min/max with labels as index
m1 = (a>=r['range_min'].values[:,None]).T # is number above each min
m2 = (a<r['range_max'].values[:,None]).T # is number below each max
m3 = (m1&m2) # combine both conditions above
# NB. the two operations could be done without the intermediate variables m1/m2
m4 = m3.sum(1) # how many matches?
# 0 -> out_of_range
# 2 -> overlap
# 1 -> get column name
# now we select the label according to the conditions
numbers['detected_range'] = np.select([m4==0, m4==2], # out_of_range and overlap
['out_of_range', 'overlap'],
# otherwise get column name
default=np.take(r.index, m3.argmax(1))
)
output:
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
edit:
It works with any number of intervals in ranges
example output with extra['D',50,51]:
number detected_range
0 50 D
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap

Pandas IntervalIndex fits in here; however, since your data has overlapping points, a for loop is the approach I'll use here (for unique, non-overlapping indices, pd.get_indexer is a fast approach):
intervals = pd.IntervalIndex.from_arrays(ranges.range_min,
ranges.range_max,
closed='both')
box = []
for num in numbers.number:
bools = intervals.contains(num)
if bools.sum()==1:
box.append(ranges.range[bools].item())
elif bools.sum() > 1:
box.append('overlap')
else:
box.append('out_of_range')
numbers.assign(detected_range = box)
number detected_range
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap

firstly，explode the ranges:
df1=ranges.assign(col1=ranges.apply(lambda ss:range(ss.range_min,ss.range_max),axis=1)).explode('col1')
df1
range range_min range_max col1
0 A 90 100 90
0 A 90 100 91
0 A 90 100 92
0 A 90 100 93
0 A 90 100 94
0 A 90 100 95
0 A 90 100 96
0 A 90 100 97
0 A 90 100 98
0 A 90 100 99
1 B 85 95 85
1 B 85 95 86
1 B 85 95 87
1 B 85 95 88
1 B 85 95 89
1 B 85 95 90
secondly，judge wether each of numbers in first df
def function1(x):
df11=df1.loc[df1.col1==x]
if len(df11)==0:
return 'out_of_range'
if len(df11)>1:
return 'overlap'
return df11.iloc[0,0]
numbers.assign(col2=numbers.number.map(function1))
number col2
0 50 out_of_range
1 65 out_of_range
2 75 C
3 85 B
4 90 overlap
the logic is simple and clear

Sample dataframe by value in column and keep all rows

I want to sample a Pandas dataframe using values in a certain column, but I want to keep all rows with values that are in the sample.
For example, in the dataframe below I want to randomly sample some fraction of the values in b, but keep all corresponding rows in a and c.
d = pd.DataFrame({'a': range(1, 101, 1),'b': list(range(0, 100, 4))*4, 'c' :list(range(0, 100, 2))*2} )
Desired example output from a 16% sample:
Out[66]:
a b c
0 1 0 0
1 26 0 50
2 51 0 0
3 76 0 50
4 4 12 6
5 29 12 56
6 54 12 6
7 79 12 56
8 18 68 34
9 43 68 84
10 68 68 34
11 93 68 84
12 19 72 36
13 44 72 86
14 69 72 36
15 94 72 86
I've tried sampling the series and merging back to the main data, like this:
In [66]: pd.merge(d, d.b.sample(int(.16 * d.b.nunique())))
This creates the desired output, but it seems inefficient. My real dataset has millions of values in b and hundreds of millions of rows. I know I could also use some version of ``isin```, but that also is slow.
Is there a more efficient way to do this?

I really doubt that isin is slow:
uniques = df.b.unique()
# this maybe the bottle neck
samples = np.random.choice(uniques, replace=False, size=int(0.16*len(uniques)) )
# sampling here
df[df.b.isin(samples)]
You can profile the steps above. In case samples=... is slow, you can try:
idx = np.random.rand(len(uniques))
samples = uniques[idx<0.16]
Those took about 100 ms on my system on 10 million rows.
Note: d.b.sample(int(.16 * d.b.nunique())) does not sample 0.16 of the unique values in b.

Efficient method to adjust column values where equal to x - Python

The following multiplies all values in a column where rows are equal to a specific value. Using below, where row is in Item is equal to Up, I want to multiply all other columns by 2. I'm passing this to a single column at at time. Is there a more efficient way to process this?
import pandas as pd
df = pd.DataFrame({
'Item' : ['Up','Up','Down','Up','Down','Up'],
'A' : [50, 50, 60, 60, 40, 30],
'B' : [60, 70, 60, 50, 50, 60],
})
df.loc[df['Item'] == 'Up', 'A'] = df['A'] * 2
df.loc[df['Item'] == 'Up', 'B'] = df['B'] * 2
Out:
Item A B
0 Up 100 120
1 Up 100 140
2 Down 60 60
3 Up 120 100
4 Down 40 50
5 Up 60 120

You can do:
df.loc[df['Item'] == 'Up', ['A','B']] *= 2
Output:
Item A B
0 Up 100 120
1 Up 100 140
2 Down 60 60
3 Up 120 100
4 Down 40 50
5 Up 60 120

Pandas first 5 and last 5 rows in single iloc operation

I need to check df.head() and df.tail() many times.
When using df.head(), df.tail() jupyter notebook dispalys the ugly output.
Is there any single line command so that we can select only first 5 and last 5 rows:
something like:
df.iloc[:5 | -5:] ?
Test example:
df = pd.DataFrame(np.random.rand(20,2))
df.iloc[:5]
Update
Ugly but working ways:
df.iloc[(np.where( (df.index < 5) | (df.index > len(df)-5)))[0]]
or,
df.iloc[np.r_[np.arange(5), np.arange(df.shape[0]-5, df.shape[0])]]

Try look at numpy.r_
df.iloc[np.r_[0:5, -5:0]]
Out[358]:
0 1
0 0.899673 0.584707
1 0.443328 0.126370
2 0.203212 0.206542
3 0.562156 0.401226
4 0.085070 0.206960
15 0.082846 0.548997
16 0.435308 0.669673
17 0.426955 0.030303
18 0.327725 0.340572
19 0.250246 0.162993
Also head + tail is not a bad solution
df.head(5).append(df.tail(5))
Out[362]:
0 1
0 0.899673 0.584707
1 0.443328 0.126370
2 0.203212 0.206542
3 0.562156 0.401226
4 0.085070 0.206960
15 0.082846 0.548997
16 0.435308 0.669673
17 0.426955 0.030303
18 0.327725 0.340572
19 0.250246 0.162993

df.query("index<5 | index>"+str(len(df)-5))
Here's a way to query the index. You can change the values to whatever you want.

Another approach (per this SO post)
uses only Pandas .isin()
Generate some dummy/demo data
df = pd.DataFrame({'a':range(10,100)})
print(df.head())
a
0 10
1 11
2 12
3 13
4 14
print(df.tail())
a
85 95
86 96
87 97
88 98
89 99
print(df.shape)
(90, 1)
Generate list of required indexes
ls = list(range(5)) + list(range(len(df)-5, len(df)))
print(ls)
[0, 1, 2, 3, 4, 85, 86, 87, 88, 89]
Slice DataFrame using list of indexes
df_first_last_5 = df[df.index.isin(ls)]
print(df_first_last_5)
a
0 10
1 11
2 12
3 13
4 14
85 95
86 96
87 97
88 98
89 99

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

my data cleaning script is slow, any ideas on how to improve? - python

Related

How to name the column when using value_count function in pandas?

determine the range of a value using a look up table

Sample dataframe by value in column and keep all rows

Efficient method to adjust column values where equal to x - Python

Pandas first 5 and last 5 rows in single iloc operation

Categories

Resources