I have a dataframe which has several columns. I want to extract rows by a combination of values from two specific columns, so I use the set_index() method to index the dataframe by those columns. I figured that after doing so I would have direct (O(1)) access to rows for a given combination of keys. Currently, that does not seem to be the case: it takes quite some time for a df.ix[ix1,ix2] operation to complete.
Example:
Say I have the following dataframe:
In [228]: df
Out[228]:
ID1 ID2 score
752476 5626887150_0 5626887150_6 96
752477 5626887150_0 5626887150_7 95
752478 5626887150_0 5626887150_2 95
752479 5626887150_0 5626887150_8 93
752480 5626887150_0 5626887150_1 89
752481 5626887150_0 2142280814_5 88
752482 5626887150_0 5626887150_3 84
752483 5626887150_0 6625625104_5 82
752484 5626887150_0 2142280814_4 81
And say I want to look at the score column for different ID1,ID2 combinations. To do this easily, I'm setting ID1 and ID2 as indexes and obtain the following result:
In [230]: df = df.set_index(['ID1','ID2'])
In [231]: df
Out[231]:
score
ID1 ID2
5626887150_0 5626887150_6 96
5626887150_7 95
5626887150_2 95
5626887150_8 93
5626887150_1 89
2142280814_5 88
5626887150_3 84
6625625104_5 82
2142280814_4 81
Now I can easily access my data with ID1,ID2 combinations (e.g. df.ix['5626887150_0','5626887150_6']), that's true. BUT it does not seem to be O(1) access: it takes quite some time to return a value on a large dataframe.
So what exactly does the set_index() method do, and is there a way to force O(1) access to the data?
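For what it's worth, a minimal sketch of the usual remedy, assuming the frame above (.loc is used since .ix has been deprecated in favor of it):

df = df.set_index(['ID1', 'ID2']).sort_index()  # lexsort the MultiIndex

# lookups on a sorted MultiIndex can use binary search (roughly O(log n));
# on an unsorted one pandas may fall back to scanning, which is much slower
score = df.loc[('5626887150_0', '5626887150_6'), 'score']

This is not strictly O(1), but once the index is sorted it is close to it in practice.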
Related
I'm trying to filter rows based on a relatively simple criterion: if the value of Open is less than the maximum value the column has reached up to that row, the row gets dropped; otherwise the row stays and becomes the reference value for the new max.
This is the starting example dataframe:
import pandas as pd
import numpy as np
d = {'Date':['22-01-2019','23-01-2019','24-01-2019','25-01-2019','26-01-2019'],'Open': [40,54,54,79,67], 'Close': [43,53,65,80,61]}
df = pd.DataFrame(data=d)
print(df)
In this case I would like to do the filtering on the column Open:
Date Open Close
0 22-01-2019 40 43 #Max is 40
1 23-01-2019 54 53 #54 is higher than 40 so it stays
2 24-01-2019 54 65 #This is not higher than the previous max, should get dropped
3 25-01-2019 79 80 #This is higher than 54, so it stays
4 26-01-2019 67 61 #This is not higher than 79, should get dropped
The only way I could come up with to solve the problem is a for loop that iterates over each row, uses an auxiliary variable to record the running max, and builds a boolean list. However, it's extremely inefficient when dealing with more than 100k rows. The final goal is to perform the same filter on the Close column and join the two results, to know on which days (the original data is every 15 minutes) both the Open and Close values have risen above the highest value previously recorded.
Finally the output should look like this:
Date Open Close
0 22-01-2019 40 43
1 23-01-2019 54 53
3 25-01-2019 79 80
If doing the same operation for the Close column it should look like:
Date Open Close
0 22-01-2019 40 43
1 23-01-2019 54 53
2 24-01-2019 54 65
3 25-01-2019 79 80
The final goal (which I would know how to do once I can get through the filtering part, but I'm sharing it for the sake of the full case) is:
Date Open Close
0 22-01-2019 40 43
1 23-01-2019 54 53
3 25-01-2019 79 80
My solution is:
max_v = 0
list_for_filtering = []
for i, value in df.iterrows():
    if value['Open'] > max_v:
        # new running max: keep the row
        max_v = value['Open']
        list_for_filtering.append(True)
    else:
        list_for_filtering.append(False)
df['T/F'] = list_for_filtering
And then filter, keeping only the True values.
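For completeness, that last step might look like this (a one-line sketch; dropping the helper column afterwards is optional):

df = df[df['T/F']].drop(columns='T/F')  # keep flagged rows, discard the helper column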
One simple solution is to compare "Open" with the shifted cummax:
# thanks to Andy L. for the simplification!
df[df['Open'] > df['Open'].cummax().shift(fill_value=-np.inf)]
Date Open Close
0 22-01-2019 40 43
1 23-01-2019 54 53
3 25-01-2019 79 80
Where,
df['Open'].cummax().shift()
0 NaN
1 40.0
2 54.0
3 54.0
4 79.0
Name: Open, dtype: float64
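For the stated final goal, a hypothetical extension of the same idea (not part of the original answer) is to build one mask per column and keep only the rows that pass both:

open_mask = df['Open'] > df['Open'].cummax().shift(fill_value=-np.inf)
close_mask = df['Close'] > df['Close'].cummax().shift(fill_value=-np.inf)
df[open_mask & close_mask]  # rows 0, 1 and 3 for the example data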
I started learning pandas and am stuck on the issue below:
I have two large DataFrames:
df1=
ID KRAS ATM
TCGA-3C-AAAU-01A-11R-A41B-07 101 32
TCGA-3C-AALI-01A-11R-A41B-07 101 75
TCGA-3C-AALJ-01A-31R-A41B-07 102 65
TCGA-3C-ARLJ-01A-61R-A41B-07 87 54
df2=
ID BRCA1 ATM
TCGA-A1-A0SP 54 65
TCGA-3C-AALI 191 8
TCGA-3C-AALJ 37 68
The ID is the index in both dataframes. First, I want to cut the ID down to only its first 10 digits (i.e. convert TCGA-3C-AAAU-01A-11R-A41B-07 to TCGA-3C-AAAU) in df1. Then I want to produce a new df from df1 which has only the IDs that exist in df2.
df3 should look like:
ID KRAS ATM
TCGA-3C-AALI 101 75
TCGA-3C-AALJ 102 65
I tried different ways but failed. Any suggestions on this, please?
Here is one way using vectorised functions:
# truncate to the first 10 alphanumeric characters, i.e. 12 including '-'
df1['ID'] = df1['ID'].str[:12]
# filter for IDs in df2
df3 = df1[df1['ID'].isin(df2['ID'])]
Result
ID KRAS ATM
1 TCGA-3C-AALI 101 75
2 TCGA-3C-AALJ 102 65
Explanation
Use .str accessor to limit df1['ID'] to first 12 characters.
Mask df1 to include only IDs found in df2.
IIUC, TCGA-3C-AAAU contains 12 characters :-)
df3 = df1.assign(ID=df1.ID.str[:12]).loc[lambda x: x.ID.isin(df2.ID), :]
df3
Out[218]:
ID KRAS ATM
1 TCGA-3C-AALI 101 75
2 TCGA-3C-AALJ 102 65
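A hedged alternative (assuming ID is a regular column in both frames, as the answers above treat it) is an inner merge on the truncated ID:

df3 = df1.assign(ID=df1['ID'].str[:12]).merge(df2[['ID']], on='ID')  # keeps only IDs present in df2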
I am working with a Pandas dataframe for one of my projects.
I have a column named Count containing integer values.
I have 720 values, one for each hour, i.e. 24 hours * 30 days.
I want to run a loop which initially takes the first 24 values from the dataframe and puts them in a new column, then takes the next 24 and puts them in the next column, and so on.
for example:
input:
34
45
76
87
98
34
output:
34 87
45 98
76 34
Here there is a column of 6 values, and I am taking the first 3 values and putting them in the first column and the next 3 in the second one.
Can someone please help with writing a code/program for the same. It would be of great help.
Thanks!
You can also try numpy's reshape method performed on pd.Series.values.
s = pd.Series(np.arange(720))
df = pd.DataFrame(s.values.reshape((30,24)).T)
Or use np.split (specifying how many arrays you want to split into):
df = pd.DataFrame({"day" + str(i): v for i, v in enumerate(np.split(s.values, 30))})
I read the data into a DataFrame and called it data. I have the following query in python:
data[data["gender"]=="male"].groupby('age').city.nunique().sort_values(ascending=False)
age
29 86
24 85
21 81
25 81
20 81
28 78
27 78
Now I want to find those groups whose count is more than 80. How can I do that in Python?
The result of your aggregation and sorting call is a pandas Series whose index contains the groups you are looking for. So, to find the groups above a certain cutOffValue:
cutOffValue = 80
counts = data[data["gender"]=="male"].groupby('age').city.nunique().sort_values(ascending=False)
groups = counts[counts > cutOffValue].index
And of course, if you want it as a list or set, you can easily cast the final value:
groups = list(groups)
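The same pattern on a self-contained toy Series (hypothetical numbers, mirroring the output above):

import pandas as pd

counts = pd.Series([86, 85, 81, 78], index=[29, 24, 21, 28], name='city')
groups = list(counts[counts > 80].index)  # [29, 24, 21]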
I have a MySQL database table consisting of 8 columns, as given below:
ID C1 C2 C3 C4 C5 C6 C7
1 25 33 76 87 56 76 47
2 67 94 90 56 77 32 84
3 53 66 24 93 33 88 99
4 73 34 52 85 67 82 77
5 78 55 52 100 78 68 32
6 67 35 60 93 88 53 66
I need to fetch 3 rows at a time of all the columns except the ID column. So far I have this code in Python, which fetches the rows with ID values 1,2,3.
import MySQLdb

ex = MySQLdb.connect(host=host, port=port, user=user, passwd=passwd, db=db)
with ex:
    ex_cur = ex.cursor()
    # the cursor, not the connection, runs the query
    ex_cur.execute("SELECT C1,C2,C3,C4,C5,C6,C7 FROM table LIMIT 0, 3;")
In the second cycle I need to fetch the rows with ID values 2,3,4; the third cycle fetches the rows with ID values 3,4,5; and this should continue till the end of the table. What query should I use to iterate through the table so as to get the desired sets of rows?
I believe there are three ways of doing this (I'm going to explain at a very high level):
You can create a queue with a size limit of 3 and read in the rows as a stream. Once the queue reaches its maximum size of 3, do your processing, pop the first element off the queue, and proceed with the stream. (More efficient; see the sketch after this list.)
You can use an iterator and reset your cursor for every set of 3 IDs you have to process, e.g. by re-querying with a moving LIMIT offset.
Since your table is relatively small (I would not suggest this for larger tables), you can load the whole table into a data structure in memory. Perhaps make an object for the rows and use an ORM to map rows to objects. Then you would simply iterate through each object, or each set of 3 objects, and do the necessary processing.
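A minimal sketch of the first, queue-based approach (assuming the connection from the question; process() is a hypothetical callback standing in for your per-window logic):

from collections import deque

import MySQLdb

ex = MySQLdb.connect(host=host, port=port, user=user, passwd=passwd, db=db)
with ex:
    ex_cur = ex.cursor()
    ex_cur.execute("SELECT C1,C2,C3,C4,C5,C6,C7 FROM table ORDER BY ID;")
    window = deque(maxlen=3)  # holds at most the 3 most recent rows
    for row in ex_cur:        # stream rows one at a time
        window.append(row)    # the oldest row is evicted automatically
        if len(window) == 3:
            process(window)   # hypothetical: handle rows (i, i+1, i+2)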