Pandas removing duplicate range of data - python

Hi all, I have the following dataframe:
df1
WL WM WH WP
1 low medium high premium
2 26 26 15 14
3 32 32 18 29
4 41 41 19 42
5 apple dog fur napkins
6 orange cat tesla earphone
7 NaN rat tobias controller
8 NaN NaN phone NaN
9 low medium high
10 1 3 5
11 2 4 6
12 low medium high
13 4 8 10
14 5 9 11
Is there a way to remove each repeated 'low' row plus the two rows after it, so that the output looks like this:
df1
WL WM WH WP
1 low medium high premium
2 26 26 15 14
3 32 32 18 29
4 41 41 19 42
5 apple dog fur napkins
6 orange cat tesla earphone
7 NaN rat tobias controller
8 NaN NaN phone NaN
Unfortunately the code must be dynamic because I have multiple dataframes and the placement for 'low' is different in each. My initial attempt:
df1 = df1[~df1.iloc[:,0].isin(['LOW'])+2].reset_index(drop=True)
However, this is not quite what I am looking for. Any help is appreciated!

You can use:
import numpy as np

# get the index values of rows where the first column is 'low'
a = df.index[df.iloc[:, 0] == 'low']
size = 2
# build a range for every occurrence except the first ([1:]);
# min() keeps the ranges from running past the end of df
arr = [np.arange(i, min(i + size + 1, len(df) + 1)) for i in a[1:]]
idx = np.unique(np.concatenate(arr))
print (idx)
[ 9 10 11 12 13 14]
#remove rows
df = df.drop(idx)
print (df)
WL WM WH WP
1 low medium high premium
2 26 26 15 14
3 32 32 18 29
4 41 41 19 42
5 apple dog fur napkins
6 orange cat tesla earphone
7 NaN rat tobias controller
8 NaN NaN phone NaN
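For this particular frame, where the repeated 'low' blocks run to the end, a shorter sketch is to truncate everything from the second occurrence of 'low' onward (the toy column names below are assumptions mirroring the frame above):

```python
import pandas as pd

# toy frame mirroring the structure above: a repeated 'low' header partway down
df = pd.DataFrame({'WL': ['low', '26', '32', 'low', '1', '2'],
                   'WM': ['medium', '26', '32', 'medium', '3', '4']})

# positional indices of rows whose first column is 'low'
pos = (df.iloc[:, 0] == 'low').to_numpy().nonzero()[0]

# keep only the rows before the second occurrence
if len(pos) > 1:
    df = df.iloc[:pos[1]].reset_index(drop=True)
```

This only matches the desired output when nothing useful follows the repeated blocks; the index-based drop above is the more general approach.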

Related

How to replace column values with several conditions

I have a column as follows:
A B
0 0 20.00
1 1 35.00
2 2 75.00
3 3 29.00
4 4 125.00
5 5 16.00
6 6 52.50
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
11 11 NaN
12 12 NaN
13 13 239.91
14 14 22.87
15 15 52.74
16 16 37.20
17 17 27.44
18 18 57.01
19 19 29.88
I want to change the values of the column as follows
if 0 < B < 10.0, replace the cell value of B with "0 to 10"
if 10.1 < B < 20.0, replace the cell value of B with "10 to 20"
and continue like this until the maximum of the range is reached.
I have tried
ds['B'] = np.where(ds['B'].between(10.0,20.0), "10 to 20", ds['B'])
But once I perform this operation, the DataFrame is occupied by the string "10 to 20" so I cannot perform this operation again for the remaining values of the DataFrame. After this step, the DataFrame looks like this:
A B
0 0 10 to 20
1 1 35.0
2 2 75.0
3 3 29.0
4 4 125.0
5 5 10 to 20
6 6 52.5
7 7 nan
8 8 nan
9 9 nan
10 10 nan
11 11 nan
12 12 nan
13 13 239.91
14 14 22.87
15 15 52.74
16 16 37.2
17 17 27.44
18 18 57.01
19 19 29.88
And the following line: ds['B'] = np.where(ds['B'].between(20.0,30.0), "20 to 30", ds['B']) will throw TypeError: '>=' not supported between instances of 'str' and 'float'
How can I code this to change all of the values in the DataFrame to these range strings all at once?
Build your bins and labels and use pd.cut:
bins = np.arange(0, df["B"].max() // 10 * 10 + 20, 10).astype(int)
labels = [' to '.join(t) for t in zip(bins[:-1].astype(str), bins[1:].astype(str))]
df["B"] = pd.cut(df["B"], bins=bins, labels=labels)
>>> df
A B
0 0 10 to 20
1 1 30 to 40
2 2 70 to 80
3 3 20 to 30
4 4 120 to 130
5 5 10 to 20
6 6 50 to 60
7 7 NaN
8 8 NaN
9 9 NaN
10 10 NaN
11 11 NaN
12 12 NaN
13 13 230 to 240
14 14 20 to 30
15 15 50 to 60
16 16 30 to 40
17 17 20 to 30
18 18 50 to 60
19 19 20 to 30
This can be done with much less code as this is actually just a matter of string formatting.
ds['B'] = ds['B'].apply(lambda x: f'{int(x/10) if x>=10 else ""}0 to {int(x/10)+1}0' if pd.notnull(x) else x)
You can create a custom function that maps each value to a range string (for example, 19.0 maps to "10 to 20"), and then apply this function to each row.
I've written the code so that the minimum and maximum of the range are generalized from the DataFrame and take on values that are multiples of 10.
import numpy as np
import pandas as pd

# copy and paste your DataFrame
ds = pd.read_clipboard()

# floor the minimum to the nearest multiple of 10
ds_min = ds['B'].min() // 10 * 10
# round the maximum to the nearest multiple of 10
ds_max = round(ds['B'].max(), -1)
ranges = np.linspace(ds_min, ds_max, int((ds_max - ds_min) / 10) + 1)

def map_value_to_string(value):
    for idx in range(1, len(ranges)):
        low_value, high_value = ranges[idx - 1], ranges[idx]
        if low_value < value <= high_value:
            return f"{int(low_value)} to {int(high_value)}"
    return None

ds['B'] = ds['B'].apply(map_value_to_string)
Output:
>>> ds
A B
0 0 10 to 20
1 1 30 to 40
2 2 70 to 80
3 3 20 to 30
4 4 120 to 130
5 5 10 to 20
6 6 50 to 60
7 7 None
8 8 None
9 9 None
10 10 None
11 11 None
12 12 None
13 13 230 to 240
14 14 20 to 30
15 15 50 to 60
16 16 30 to 40
17 17 20 to 30
18 18 50 to 60
19 19 20 to 30

Trying to divide two columns with different row counts

The main goal I am trying to accomplish is dividing these columns by each other, in Python.
I have an Excel file that I import called 'data'. It has two columns that I need; they are in different tabs, which is not a problem. I get both columns and name them: the first is called 'city_area' and has 27 numbers in it (indexed 0-26); the second is called 'population'. Population has a gap at the top of the column before the numbers are listed, so I use [6:] to get the data I actually need. Population then has 27 numbers as well, but these are indexed 6-32. When I try to divide the two, only the rows whose indices match up (6-26) are divided; the rest of the result is NaN. Both are floats. Any suggestions?
population1 = (df2['Unnamed: 5'])
population = population1[6:].astype(float)
population
6 664000.0
7 3557000.0
8 619000.0
9 3351000.0
10 13974000.0
11 8238000.0
12 2393000.0
13 3474000.0
14 5750000.0
15 6199000.0
16 2866000.0
17 2304000.0
18 19522000.0
19 7136000.0
20 3595886.0
21 10261856.0
22 8518000.0
23 3041000.0
24 15593000.0
25 3051000.0
26 10984000.0
27 1567000.0
28 405000.0
29 5974000.0
30 41164000.0
31 2007000.0
32 1337000.0
city_area = (df3['SQ_KM'])
city_area
0 8835.000
1 511.000
2 6407.000
3 11400.000
4 693.800
5 313.800
6 5802.000
7 93.380
8 739.000
9 827.141
10 3292.000
11 8096.000
12 330.900
13 1059.400
14 211.500
15 432.170
16 218.000
17 1390.000
18 1260.000
19 169.300
20 496.800
21 34091.000
22 5687.000
23 675.400
24 1520.000
25 181.857
26 2220.000
Name: SQ_KM, dtype: float64
answer = population/city_area
answer
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 114.443295
7 38091.668451
8 837.618403
9 4051.304433
10 4244.835966
11 1017.539526
12 7231.792082
13 3279.214650
14 27186.761229
15 14343.892450
16 13146.788991
17 1657.553957
18 15493.650794
19 42150.029533
20 7238.095813
21 301.013640
22 1497.802005
23 4502.517027
24 10258.552632
25 16776.918128
26 4947.747748
27 NaN
28 NaN
29 NaN
30 NaN
31 NaN
32 NaN
dtype: float64
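The NaNs come from index alignment: pandas divides Series by matching index labels, and labels 0-5 and 27-32 each exist in only one of the two Series. A minimal sketch of two fixes, assuming the rows correspond positionally (the values below are the first few from the question):

```python
import pandas as pd

population = pd.Series([664000.0, 3557000.0, 619000.0], index=[6, 7, 8])
city_area = pd.Series([8835.0, 511.0, 6407.0], index=[0, 1, 2])

# option 1: renumber population from 0 so the labels line up
answer = population.reset_index(drop=True) / city_area

# option 2: bypass label alignment entirely via the underlying arrays
answer2 = population.to_numpy() / city_area.to_numpy()
```

Either way the result has 27 values and no NaN padding, because both operands now pair up row by row.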

Pandas - insert rows where data is missing

I have a dataset, here is an example:
df = pd.DataFrame({"Seconds_left":[5,10,15,25,30,35,5,10,15,30], "Team":["ATL","ATL","ATL","ATL","ATL","ATL","SAS","SAS","SAS","SAS"], "Fouls": [1,2,3,3,4,5,5,4,1,1]})
Fouls Seconds_left Team
0 1 5 ATL
1 2 10 ATL
2 3 15 ATL
3 3 25 ATL
4 4 30 ATL
5 5 35 ATL
6 5 5 SAS
7 4 10 SAS
8 1 15 SAS
9 1 30 SAS
Now I would like to insert rows where data in the Seconds_left column is missing:
Id Fouls Seconds_left Team
0 1 5 ATL
1 2 10 ATL
2 3 15 ATL
3 NaN 20 ATL
4 3 25 ATL
5 4 30 ATL
6 5 35 ATL
7 5 5 SAS
8 4 10 SAS
9 1 15 SAS
10 NaN 20 SAS
11 NaN 25 SAS
12 1 30 SAS
13 NaN 35 SAS
I already tried reindexing etc., but it obviously does not work because there are duplicate values.
Does anybody have an idea how to solve this?
Thanks!
Create a MultiIndex and reindex + reset_index:
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([df['Team'].unique(),
                                  np.arange(5, df['Seconds_left'].max() + 1, 5)],
                                 names=['Team', 'Seconds_left'])
df.set_index(['Team', 'Seconds_left']).reindex(idx).reset_index()
Out:
Team Seconds_left Fouls
0 ATL 5 1.0
1 ATL 10 2.0
2 ATL 15 3.0
3 ATL 20 NaN
4 ATL 25 3.0
5 ATL 30 4.0
6 ATL 35 5.0
7 SAS 5 5.0
8 SAS 10 4.0
9 SAS 15 1.0
10 SAS 20 NaN
11 SAS 25 NaN
12 SAS 30 1.0
13 SAS 35 NaN
An approach using groupby and merge:
df_left = pd.DataFrame({'Seconds_left': [5, 10, 15, 20, 25, 30, 35]})
df_out = df.groupby('Team', as_index=False).apply(lambda x: x.merge(df_left, how='right', on='Seconds_left'))
df_out['Team'] = df_out['Team'].ffill()
df_out = df_out.reset_index(drop=True).sort_values(by=['Team', 'Seconds_left'])
print(df_out)
Output:
Fouls Seconds_left Team
0 1.0 5 ATL
1 2.0 10 ATL
2 3.0 15 ATL
6 NaN 20 ATL
3 3.0 25 ATL
4 4.0 30 ATL
5 5.0 35 ATL
7 5.0 5 SAS
8 4.0 10 SAS
9 1.0 15 SAS
11 NaN 20 SAS
12 NaN 25 SAS
10 1.0 30 SAS
13 NaN 35 SAS
Alternatively, you can append a missing row at a time:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['a', 'b'])
df.loc[len(df)] = [1, np.nan]  # appends a new row at the next integer position

pandas drop row below each row containing an 'na'

I have a dataframe with, say, 4 columns [['a','b','c','d']], to which I add another column ['total'] containing the sum of all the other columns for each row. I then add another column ['growth of total'] with the growth rate of the total.
Some of the values in [['a','b','c','d']] are blank, rendering the ['total'] column invalid for those rows. I can easily get rid of these rows with df.dropna(how='any').
However, my growth rate will be invalid not only for rows with missing values in [['a','b','c','d']], but also for the following row. How do I drop all of these rows?
IIUC you can use notnull with all to mask off any rows with NaN and any rows that follow NaN rows:
In [43]:
df = pd.DataFrame({'a':[0,np.NaN, 2, 3,np.NaN], 'b':[np.NaN, 1,2,3,4], 'c':[0, np.NaN,2,3,4]})
df
Out[43]:
a b c
0 0 NaN 0
1 NaN 1 NaN
2 2 2 2
3 3 3 3
4 NaN 4 4
In [44]:
df[df.notnull().all(axis=1) & df.shift().notnull().all(axis=1)]
Out[44]:
a b c
3 3 3 3
Here's one option that I think does what you're looking for:
In [76]: df = pd.DataFrame(np.arange(40).reshape(10,4))
In [77]: df.iloc[1, 2] = np.nan
In [78]: df.iloc[6, 1] = np.nan
In [79]: df['total'] = df.sum(axis=1, skipna=False)
In [80]: df
Out[80]:
0 1 2 3 total
0 0 1 2 3 6
1 4 5 NaN 7 NaN
2 8 9 10 11 38
3 12 13 14 15 54
4 16 17 18 19 70
5 20 21 22 23 86
6 24 NaN 26 27 NaN
7 28 29 30 31 118
8 32 33 34 35 134
9 36 37 38 39 150
In [81]: df['growth'] = df['total'].iloc[1:] - df['total'].values[:-1]
In [82]: df
Out[82]:
0 1 2 3 total growth
0 0 1 2 3 6 NaN
1 4 5 NaN 7 NaN NaN
2 8 9 10 11 38 NaN
3 12 13 14 15 54 16
4 16 17 18 19 70 16
5 20 21 22 23 86 16
6 24 NaN 26 27 NaN NaN
7 28 29 30 31 118 NaN
8 32 33 34 35 134 16
9 36 37 38 39 150 16
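The manual subtraction in In [81] can also be written with Series.diff, which lets one dropna handle both cases at once; a sketch of the full pipeline with the same toy data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(40, dtype=float).reshape(10, 4))
df.iloc[1, 2] = np.nan
df.iloc[6, 1] = np.nan

# skipna=False makes any blank poison the row's total
df['total'] = df.sum(axis=1, skipna=False)
# diff propagates a NaN total into the *next* row's growth as well
df['growth'] = df['total'].diff()
# one dropna removes the NaN rows and the rows immediately after them
clean = df.dropna()
```

Row 0 is also dropped, since the first diff is always NaN, matching the output above.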

Calculating rolling_std on 4 columns in python pandas to calculate a Bollinger Band?

I'm just getting into pandas, trying to do what I would easily do in Excel, just with a large data set. I have a selection of futures price data that I have read into pandas using:
df = pd.read_csv('TData1.csv')
this gives me a DataFrame. The data is in the form below:
Date,Time,Open,High,Low,Close,Volume,Tick Count
02/01/2013,05:01:00,1443.00,1443.75,1438.25,1440.25,20926,4652
02/01/2013,05:02:00,1440.25,1441.75,1440.00,1441.25,7261,1781
02/01/2013,05:03:00,1441.25,1443.25,1441.00,1443.25,5010,1014
Now what I'm essentially trying to do is calculate a Bollinger Band in pandas. If I were in Excel, I would select the whole block of 'High', 'Low', 'Open' and 'Close' columns for, say, 20 rows and calculate the standard deviation.
I see pandas has the rolling_std function, which can calculate the rolling standard deviation, but only on one column. How do I get pandas to calculate a rolling standard deviation on the 'High', 'Low', 'Open' and 'Close' columns for, say, 20 periods?
Thanks.
You can call rolling_std on the whole DataFrame or on a subset:
>>> pd.rolling_std(df[['high','open','close','low']], 5)
like this:
>>> df = pd.DataFrame({'high':np.random.randint(15,25,size=10), 'close':np.random.randint(15,25,size=10), 'low':np.random.randint(15,25,size=10), 'open':np.random.randint(15,25,size=10), 'a':list('abcdefghij')})
>>> df
a close high low open
0 a 16 20 18 15
1 b 21 23 22 15
2 c 20 23 21 23
3 d 19 24 24 17
4 e 23 19 20 17
5 f 15 16 19 17
6 g 19 24 23 19
7 h 21 18 17 22
8 i 22 22 17 15
9 j 19 20 17 18
>>> pd.rolling_std(df[['high','open','close','low']], 5)
high open close low
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 2.167948 3.286335 2.588436 2.236068
5 3.391165 3.033150 2.966479 1.923538
6 3.563706 2.607681 2.863564 2.073644
7 3.633180 2.190890 2.966479 2.880972
8 3.193744 2.645751 3.162278 2.489980
9 3.162278 2.588436 2.683282 2.607681
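Note that the top-level pd.rolling_* functions were deprecated in pandas 0.18 and later removed; the modern spelling calls .rolling() on the subset. A sketch with the same toy values as above (use window=20 for the question's Bollinger Band):

```python
import pandas as pd

df = pd.DataFrame({'high': [20, 23, 23, 24, 19, 16],
                   'open': [15, 15, 23, 17, 17, 17],
                   'close': [16, 21, 20, 19, 23, 15],
                   'low': [18, 22, 21, 24, 20, 19]})

# rolling sample standard deviation over a 5-period window, per column
rolled = df[['high', 'open', 'close', 'low']].rolling(5).std()
```

As with rolling_std, the first window - 1 rows come out as NaN because the window is not yet full.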
