Select specific rows from a groupby DataFrame - python

I have some data in the following format:
56.00 101.85 52.40 101.85 56.000000 101.850000 1
56.00 100.74 50.60 100.74 56.000000 100.740000 2
56.00 100.74 52.10 100.74 56.000000 100.740000 3
56.00 102.96 52.40 102.96 56.000000 102.960000 4
56.00 100.74 55.40 100.74 56.000000 100.740000 5
56.00 103.70 54.80 103.70 56.000000 103.700000 6
56.00 101.85 53.00 101.85 56.000000 101.850000 7
56.00 102.22 52.10 102.22 56.000000 102.220000 8
56.00 101.11 55.40 101.11 56.000000 101.110000 9
56.00 101.11 54.80 101.11 56.000000 101.110000 10
56.00 101.85 52.40 101.85 56.000000 101.850000 1
56.00 100.74 50.60 100.74 56.000000 100.740000 2
........
What I need are the data for a specific id (last column).
With numpy I used to do:
d = np.loadtxt('filename')
wanted = d[d[:, 6] == id]
Now I'm learning pandas and found out that pandas.read_csv() is much faster than loadtxt().
So, logically, I was wondering if there is a possibility to do the same filtering with pandas (maybe it is even faster).
My first thought was trying groupby as follows:
p = pd.read_csv('filename', sep=' ', header=None, names=['a', 'b', 'x', 'y', 'c', 'd', 'id'])
d = p.groupby(['id'])
#[ i, g in p.groupby(['id']) if i ==1] # syntax error, why?
The question is: is there a relatively easy way to select from p the rows with, let's say, id == 1?
EDIT
Trying the proposed solution:
%timeit t_1 = n[ n[:,6]==1 ][:,2:4]
10 loops, best of 3: 60.8 ms per loop
%timeit t_2 = p[ p['id'] == 1 ][['x', 'y']]
10 loops, best of 3: 70.9 ms per loop
It seems that numpy is a bit faster than pandas here.
That means the fastest way to work in this case is:
1) first read the data with pandas' read_csv,
2) convert the data to a numpy array,
3) and then do the work on the array.
Is this conclusion correct?

You can do just the same as you did with numpy, now referring to the column by its name:
wanted = d[d['id'] == id]
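For completeness, a minimal sketch of both routes, assuming the same file and column names as in the question (id == 1 is just an example). The commented-out comprehension in the question fails because a list comprehension needs an expression before the for, e.g. [g for i, g in p.groupby('id') if i == 1].
import pandas as pd

p = pd.read_csv('filename', sep=' ', header=None,
                names=['a', 'b', 'x', 'y', 'c', 'd', 'id'])

wanted = p[p['id'] == 1]               # boolean filtering, as in the answer above
wanted = p.groupby('id').get_group(1)  # the groupby route: pull out a single group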

Related

Can someone turn column names from one dataset into values in another dataset by matching values from a column in the first dataset with the second?

Sorry if I'm not clear, but I've got a challenge. This is the sample data I have generated to try to make my challenge clear.
Sample data (dataset 1):
    B      V      S      F      K
 0.32  10.32  11.32  12.32  13.32
 1.32  11.32  12.32  13.32  14.32
 2.32  12.32  13.32  14.32  15.32
 3.32  13.32  14.32  15.32  16.32
 4.32  14.32  15.32  16.32  17.32
 5.32  15.32  16.32  17.32  18.32
 6.32  16.32  17.32  18.32  19.32
 7.32  17.32  18.32  19.32  20.32
 8.32  18.32  19.32  20.32  21.32
 9.32  19.32  20.32  21.32  22.32
10.32  20.32  21.32  22.32  23.32
My expected output (dataset 2; the desired column M is what I want to compute):
 K      L  M
 1   2.32
 2   3.32
 3   4.32
 4   5.32
 5   6.32
 6  13.32
 7  14.32
 8  15.32
 9  16.32
10  17.32
The second table shows the expected outcome.
I would like to know how to create another column M in dataset 2 that returns the name of the column from dataset 1 that contains the value in column L (which is in dataset 2).
I tried the code below, but it didn't add up, since I got the error that follows it. I hope someone here can help with this; thanks in advance!
spike_cols = [col for col in stata.columns if df['IMAGE NUMBER'] in col]
which returned the following error:
~\AppData\Local\Temp/ipykernel_25368/552331776.py in <module>
----> 1 spike_cols = [col for col in stata.columns if df['IMAGE NUMBER'] in col]
~\AppData\Local\Temp/ipykernel_25368/552331776.py in <listcomp>(.0)
----> 1 spike_cols = [col for col in stata.columns if df['IMAGE NUMBER'] in col]
TypeError: 'in <string>' requires string as left operand, not Series
I created two dataframes here. The for loop searches for matches and records the column names for each row. You can also take the dataframes (df and df1) from my answer and adapt them to your data.
import pandas as pd
import numpy as np

df = pd.DataFrame({'B': [0.32, 1.32, 2.32, 3.32, 4.32, 5.32, 6.32, 7.32, 8.32, 9.32, 10.32],
                   'V': [10.32, 11.32, 12.32, 13.32, 14.32, 15.32, 16.32, 17.32, 18.32, 19.32, 20.32],
                   'S': [11.32, 12.32, 13.32, 14.32, 15.32, 16.32, 17.32, 18.32, 19.32, 20.32, 21.32],
                   'F': [12.32, 13.32, 14.32, 15.32, 16.32, 17.32, 18.32, 19.32, 20.32, 21.32, 22.32],
                   'K': [13.32, 14.32, 15.32, 16.32, 17.32, 18.32, 19.32, 20.32, 21.32, 22.32, 23.32]})
print(df)
df1 = pd.DataFrame({'K': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'L': [2.32, 3.32, 4.32, 5.32, 6.32, 13.32, 14.32, 15.32, 16.32, 17.32]})
M = []
for k in range(0, len(df1)):
    i, c = np.where(df == df1['L'][k])  # get the row/column indices where there was a match
    ttt = df.columns[c]                 # get the names of the matching columns
    M.append(','.join(list(ttt)))       # join multiple matches into a comma-separated string
df1['M'] = M  # add a column with the collected values
print(df1)
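As a possible loop-free alternative (a sketch, continuing from df and df1 above, and assuming exact float equality is acceptable for the lookup), you can melt dataset 1 into long form and group the column names by value:
# melt df to long form: one (column name, value) pair per row
long = df.melt(var_name='col', value_name='val')
# for each distinct value, join the names of all columns it appears in
lookup = long.groupby('val')['col'].agg(','.join)
df1['M'] = df1['L'].map(lookup)
print(df1)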

Slicing pandas dataframe by ordered values into clusters

I have a pandas dataframe where there are longer gaps in time, and I want to slice it into smaller dataframes where the time "clusters" are together:
Time Value
0 56610.41341 8.55
1 56587.56394 5.27
2 56590.62965 6.81
3 56598.63790 5.47
4 56606.52203 6.71
5 56980.44206 4.75
6 56592.53327 6.53
7 57335.52837 0.74
8 56942.59094 6.96
9 56921.63669 9.16
10 56599.52053 6.14
11 56605.50235 5.20
12 57343.63828 3.12
13 57337.51641 3.17
14 56593.60374 5.69
15 56882.61571 9.50
I tried sorting this and taking the time difference of two consecutive points with
df = df.sort_values("Time")
df['t_dif'] = df['Time'] - df['Time'].shift(-1)
And it gives
Time Value t_dif
1 56587.56394 5.27 -3.06571
2 56590.62965 6.81 -1.90362
6 56592.53327 6.53 -1.07047
14 56593.60374 5.69 -5.03416
3 56598.63790 5.47 -0.88263
10 56599.52053 6.14 -5.98182
11 56605.50235 5.20 -1.01968
4 56606.52203 6.71 -3.89138
0 56610.41341 8.55 -272.20230
15 56882.61571 9.50 -39.02098
9 56921.63669 9.16 -20.95425
8 56942.59094 6.96 -37.85112
5 56980.44206 4.75 -355.08631
7 57335.52837 0.74 -1.98804
13 57337.51641 3.17 -6.12187
12 57343.63828 3.12 NaN
Let's say I want to slice this dataframe into smaller dataframes where the time difference between two consecutive points is smaller than 40. How would I go about doing this?
I could loop over the rows, but that is frowned upon, so is there a smarter solution?
Edit: Here is a example:
df1:
Time Value t_dif
1 56587.56394 5.27 -3.06571
2 56590.62965 6.81 -1.90362
6 56592.53327 6.53 -1.07047
14 56593.60374 5.69 -5.03416
3 56598.63790 5.47 -0.88263
10 56599.52053 6.14 -5.98182
11 56605.50235 5.20 -1.01968
4 56606.52203 6.71 -3.89138
df2:
0 56610.41341 8.55 -272.20230
df3:
15 56882.61571 9.50 -39.02098
9 56921.63669 9.16 -20.95425
8 56942.59094 6.96 -37.85112
...
etc.
I think you can just
df1 = df[df['t_dif']<30]
df2 = df[df['t_dif']>=30]
def split_dataframe(df, value):
    df = df.sort_values("Time")
    df = df.reset_index()
    df['t_dif'] = (df['Time'] - df['Time'].shift(-1)).abs()
    # positional indices where the gap to the next point exceeds the threshold
    indxs = df.index[df['t_dif'] > value].tolist()
    indxs.append(-1)
    indxs.append(len(df))
    indxs.sort()
    frames = []
    for i in range(1, len(indxs)):
        # each cluster runs from just after one gap up to and including the next
        val = df.iloc[indxs[i - 1] + 1: indxs[i] + 1]
        frames.append(val)
    return frames
This returns the clusters as a list of dataframes.
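A loop-free sketch of the same idea, assuming the threshold of 40 from the question: mark each row whose time difference from the previous point exceeds the threshold, turn the marks into group ids with cumsum, and group on them:
df = df.sort_values('Time').reset_index(drop=True)
group_id = (df['Time'].diff() > 40).cumsum()  # a new id starts at each large gap
frames = [g for _, g in df.groupby(group_id)]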

Remove substring in a column pandas

I have a dataframe where one column has strings that sometimes contain a word and parentheses around the value I want to keep. How do I remove them? Here's what I have:
import pandas as pd
df = pd.read_csv("Espacios_#cronista.csv")
del df['Espacio']
df[df['Tamano'].str.contains("Variable")]
Output I have:
Tamano Subastas Imp Fill_rate
0 Variable (300x600) 43 13 5.99
1 Variable (266x600) 43 5 4.44
2 266x600 43 5 4.44
Output I need:
Tamano Subastas Imp Fill_rate
0 300x600 43 13 5.99
1 266x600 43 5 4.44
2 266x600 43 5 4.44
This is a good use case for pd.Series.str.extract
pipelined
Meaning, assign creates a copy. You can use fillna to fill in spots that became NaN.
pat = r'Variable\s*\((.*)\)'
df.assign(Tamano=df.Tamano.str.extract(pat, expand=False).fillna(df.Tamano))
Tamano Subastas Imp Fill_rate
0 300x600 43 13 5.99
1 266x600 43 5 4.44
2 266x600 43 5 4.44
in place
Meaning we alter df
pat = r'Variable\s*\((.*)\)'
df.update(df.Tamano.str.extract(pat, expand=False))
df
Tamano Subastas Imp Fill_rate
0 300x600 43 13 5.99
1 266x600 43 5 4.44
2 266x600 43 5 4.44
IIUC, this should work
cond = df.Tamano.str.contains("Variable")
df.loc[cond, "Tamano"] = df.Tamano.str.extract(r"((?<=\()[^)]*)", expand=False)
Tamano Subastas Imp Fill_rate
0 300x600 43 13 5.99
1 266x600 43 5 4.44
2 266x600 43 5 4.44
This selects the rows that fit the condition df.Tamano.str.contains("Variable") and does the replacement there. The regular expression (?<=\() is a lookbehind: it requires a ( and matches what follows it. The matching criterion [^)]* matches anything that is not a ), and thus stops at the closing ). piRSquared's regular expression is simpler and easier to understand.
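Another option (a sketch, assuming the same pattern as above) is str.replace, which substitutes the whole match with its captured group and leaves non-matching rows untouched:
df['Tamano'] = df['Tamano'].str.replace(r'Variable\s*\((.*)\)', r'\1', regex=True)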

Nested if loop with DataFrame is very, very slow

I have 10 million rows to go through, and it will take many hours to process; I must be doing something wrong.
I converted the names of my df variables for ease of typing:
Close=df['Close']
eqId=df['eqId']
date=df['date']
IntDate=df['IntDate']
expiry=df['expiry']
delta=df['delta']
ivMid=df['ivMid']
conf=df['conf']
The code below works fine, it's just ungodly slow. Any suggestions?
print(datetime.datetime.now().time())
for i in range(2, 1000):
    if delta[i] == 90:
        if delta[i-1] == 50:
            if delta[i-2] == 10:
                if expiry[i] == expiry[i-2]:
                    df.Skew[i] = ivMid[i] - ivMid[i-2]
print(datetime.datetime.now().time())
14:02:11.014396
14:02:13.834275
df.head(100)
Close eqId date IntDate expiry delta ivMid conf Skew
0 37.380005 7 2008-01-02 39447 1 50 0.3850 0.8663
1 37.380005 7 2008-01-02 39447 1 90 0.5053 0.7876
2 36.960007 7 2008-01-03 39448 1 50 0.3915 0.8597
3 36.960007 7 2008-01-03 39448 1 90 0.5119 0.7438
4 35.179993 7 2008-01-04 39449 1 50 0.4055 0.8454
5 35.179993 7 2008-01-04 39449 1 90 0.5183 0.7736
6 33.899994 7 2008-01-07 39452 1 50 0.4464 0.8400
7 33.899994 7 2008-01-07 39452 1 90 0.5230 0.7514
8 31.250000 7 2008-01-08 39453 1 10 0.4453 0.7086
9 31.250000 7 2008-01-08 39453 1 50 0.4826 0.8246
10 31.250000 7 2008-01-08 39453 1 90 0.5668 0.6474 0.1215
11 30.830002 7 2008-01-09 39454 1 10 0.4716 0.7186
12 30.830002 7 2008-01-09 39454 1 50 0.4963 0.8479
13 30.830002 7 2008-01-09 39454 1 90 0.5735 0.6704 0.1019
14 31.460007 7 2008-01-10 39455 1 10 0.4254 0.6737
15 31.460007 7 2008-01-10 39455 1 50 0.4929 0.8218
16 31.460007 7 2008-01-10 39455 1 90 0.5902 0.6411 0.1648
17 30.699997 7 2008-01-11 39456 1 10 0.4868 0.7183
18 30.699997 7 2008-01-11 39456 1 50 0.4965 0.8411
19 30.639999 7 2008-01-14 39459 1 10 0.5117 0.7620
20 30.639999 7 2008-01-14 39459 1 50 0.4989 0.8804
21 30.639999 7 2008-01-14 39459 1 90 0.5887 0.6845 0.077
22 29.309998 7 2008-01-15 39460 1 10 0.4956 0.7363
23 29.309998 7 2008-01-15 39460 1 50 0.5054 0.8643
24 30.080002 7 2008-01-16 39461 1 10 0.4983 0.6646
At this rate it will take 7.77 hrs to process
Basically, the whole point of numpy & pandas is to avoid loops like the plague, and do things in a vectorial way. As you noticed, without that, speed is gone.
Let's break your problem into steps.
The Conditions
Here, your first condition can be written like this:
df.delta == 90
(Note how this compares the entire column at once. This is much, much faster than your loop!)
and the second one can be written like this (using shift):
df.delta.shift(1) == 50
The rest of your conditions are similar.
Note that to combine conditions, you need to use parentheses. So, the first two conditions, together, should be written as:
(df.delta == 90) & (df.delta.shift(1) == 50)
You should be able to now write an expression combining all your conditions. Let's call it cond, i.e.,
cond = (df.delta == 90) & (df.delta.shift(1) == 50) & ...
The assignment
To assign things to a new column, use
df['skew'] = ...
We just need to figure out what to put on the right-hand side.
The Right Hand Side
Since we have cond, we can write the right-hand-side as
np.where(cond, df.ivMid - df.ivMid.shift(2), 0)
What this says is: when the condition is true, take the second term; when it's not, take the third term (in this case I used 0, but use whatever you like).
By combining all of this, you should be able to write a very efficient version of your code.
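Putting it together, a sketch of the full vectorized version (column names follow the question's df; NaN is used where the original loop left Skew unset):
import numpy as np

cond = (
    (df.delta == 90)
    & (df.delta.shift(1) == 50)
    & (df.delta.shift(2) == 10)
    & (df.expiry == df.expiry.shift(2))
)
df['Skew'] = np.where(cond, df.ivMid - df.ivMid.shift(2), np.nan)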

Appending data row from one dataframe to another with respect to date

I am brand new to pandas and working with two dataframes. My goal is to append the non-date values of df_ls (below) column-wise to their nearest respective date in df_1. Is the only way to do this a traditional for-loop, or is there some more effective built-in method/function? I have googled this extensively without any luck: I have only found ways to append blocks of dataframes to other dataframes, not a way to search through a dataframe and append a row to another dataframe at the nearest respective date. See the example below:
Example of the first dataframe (let's call it df_ls):
DATE ALBEDO_SUR B13_RATIO B23_RATIO B1_RAW B2_RAW
0 1999-07-04 0.070771 1.606958 1.292280 0.128069 0.103018
1 1999-07-20 0.030795 2.326290 1.728147 0.099020 0.073595
2 1999-08-21 0.022819 2.492871 1.762536 0.096888 0.068502
3 1999-09-06 0.014613 2.792271 1.894225 0.090590 0.061445
4 1999-10-08 0.004978 2.781847 1.790768 0.089291 0.057521
5 1999-10-24 0.003144 2.818474 1.805257 0.090623 0.058054
6 1999-11-09 0.000859 3.146100 1.993941 0.092787 0.058823
7 1999-12-11 0.000912 2.913604 1.656642 0.097239 0.055357
8 1999-12-27 0.000877 2.974692 1.799949 0.098282 0.059427
9 2000-01-28 0.000758 3.092533 1.782112 0.095153 0.054809
10 2000-03-16 0.002933 2.969185 1.727465 0.083059 0.048322
11 2000-04-01 0.016814 2.366437 1.514110 0.089720 0.057398
12 2000-05-03 0.047370 1.847763 1.401930 0.109767 0.083290
13 2000-05-19 0.089432 1.402798 1.178798 0.137965 0.115936
14 2000-06-04 0.056340 1.807828 1.422489 0.118601 0.093328
Example of second dataframe (let's call it df_1)
Sample Date Value
0 2000-05-09 1.68
1 2000-05-09 1.68
2 2000-05-18 1.75
3 2000-05-18 1.75
4 2000-05-31 1.40
5 2000-05-31 1.40
6 2000-06-13 1.07
7 2000-06-13 1.07
8 2000-06-27 1.49
9 2000-06-27 1.49
10 2000-07-11 2.29
11 2000-07-11 2.29
In the end, my goal is to have something like this (note: the appended values are the values closest to the Sample Date, even though they don't match up perfectly):
Sample Date Value ALBEDO_SUR B13_RATIO B23_RATIO B1_RAW B2_RAW
0 2000-05-09 1.68 0.047370 1.847763 1.401930 0.109767 0.083290
1 2000-05-09 1.68 0.047370 1.847763 1.401930 0.109767 0.083290
2 2000-05-18 1.75 0.089432 1.402798 1.178798 0.137965 0.115936
3 2000-05-18 1.75 0.089432 1.402798 1.178798 0.137965 0.115936
4 2000-05-31 1.40 0.056340 1.807828 1.422489 0.118601 0.093328
5 2000-05-31 1.40 0.056340 1.807828 1.422489 0.118601 0.093328
6 2000-06-13 1.07 ETC.... ETC.... ETC ...
7 2000-06-13 1.07
8 2000-06-27 1.49
9 2000-06-27 1.49
10 2000-07-11 2.29
11 2000-07-11 2.29
Thanks for any and all help. As I said, I am new to this; I have experience with this sort of thing in MATLAB, but pandas is new to me.
Thanks
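One built-in option worth trying here (a sketch, assuming both date columns can be parsed as datetimes) is pd.merge_asof with direction='nearest', which joins each row of one frame to the row of the other with the closest key:
import pandas as pd

df_ls['DATE'] = pd.to_datetime(df_ls['DATE'])
df_1['Sample Date'] = pd.to_datetime(df_1['Sample Date'])

# both sides must be sorted on their join keys
merged = pd.merge_asof(
    df_1.sort_values('Sample Date'),
    df_ls.sort_values('DATE'),
    left_on='Sample Date',
    right_on='DATE',
    direction='nearest',
).drop(columns='DATE')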
