Subsetting a Python DataFrame

I am transitioning from R to Python and have just begun using Pandas. I have R code that subsets nicely:
k1 <- subset(data, Product == p.id & Month < mn & Year == yr, select = c(Time, Product))
Now I want to do something similar in Python. This is what I have so far:
import pandas as pd
data = pd.read_csv("../data/monthly_prod_sales.csv")
#first, index the dataset by Product. And, get all that matches a given 'p.id' and time.
data.set_index('Product')
k = data.ix[[p.id, 'Time']]
# then, index this subset with Time and do more subsetting..
I am beginning to feel that I am doing this the wrong way; perhaps there is an elegant solution. Can anyone help? I need to extract the month and year from the timestamp I have and do the subsetting. Perhaps there is a one-liner that will accomplish all of this:
k1 <- subset(data, Product == p.id & Time >= start_time & Time < end_time, select = c(Time, Product))
Thanks.

I'll assume that Time and Product are columns in a DataFrame, df is an instance of DataFrame, and that other variables are scalar values:
For now, you'll have to reference the DataFrame instance:
k1 = df.loc[(df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time), ['Time', 'Product']]
The parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators. The & operator is an overloaded bitwise operator that binds more tightly than the comparison operators (==, <, >=, etc.), so without the parentheses the expression is grouped in a way you don't intend.
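As a minimal illustration of the precedence pitfall (a toy frame with made-up values, just for demonstration):
import pandas as pd

df = pd.DataFrame({'Product': [1, 2, 1], 'Time': [5, 6, 7]})

# With parentheses, each comparison is evaluated first and the boolean Series are combined bitwise:
k = df.loc[(df.Product == 1) & (df.Time >= 6), ['Time', 'Product']]

# Without them, df.Product == 1 & df.Time >= 6 parses as
# df.Product == ((1 & df.Time) >= 6), a chained comparison on a Series,
# which raises "ValueError: The truth value of a Series is ambiguous".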
In pandas 0.13 a new experimental DataFrame.query() method will be available. It's extremely similar to R's subset, modulo the select argument.
With query() you'd do it like this (local variables are referenced with an @ prefix):
df[['Time', 'Product']].query('Product == @p_id and Month < @mn and Year == @yr')
Here's a simple example:
In [9]: df = pd.DataFrame({'gender': np.random.choice(['m', 'f'], size=10), 'price': np.random.poisson(100, size=10)})
In [10]: df
Out[10]:
gender price
0 m 89
1 f 123
2 f 100
3 m 104
4 m 98
5 m 103
6 f 100
7 f 109
8 f 95
9 m 87
In [11]: df.query('gender == "m" and price < 100')
Out[11]:
gender price
0 m 89
4 m 98
9 m 87
The final query that you're interested in will even be able to take advantage of chained comparisons, like this:
k1 = df[['Time', 'Product']].query('Product == @p_id and @start_time <= Time < @end_time')

Just for someone looking for a solution more similar to R:
df[(df.Product == p_id) & (df.Time > start_time) & (df.Time < end_time)][['Time', 'Product']]
No need for .loc or query, but I do think it is a bit long.

I've found that you can use any subset condition for a given column by wrapping it in []. For instance, you have a df with columns ['Product','Time', 'Year', 'Color']
And let's say you want to include products made before 2014. You could write,
df[df['Year'] < 2014]
to return all the rows where this is the case. You can chain on different conditions:
df[df['Year'] < 2014][df['Color'] == 'Red']
Then just choose the columns you want as directed above. For instance, the product color and key for the df above,
df[df['Year'] < 2014][df['Color'] == 'Red'][['Product','Color']]
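As a side note, the same selection can be written as a single .loc call, which avoids the chained indexing (same hypothetical column names as above):
# One-step equivalent of the chained filters above:
df.loc[(df['Year'] < 2014) & (df['Color'] == 'Red'), ['Product', 'Color']]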

Regarding some points mentioned in previous answers, and to improve readability:
No need for data.loc or query, but I do think it is a bit long.
The parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators.
I like to write such expressions as follows - fewer brackets, faster to type, easier to read. Closer to R, too.
# c is just a convenience
c = lambda v: v.split(',')

q_product = df.Product == p_id
q_start = df.Time > start_time
q_end = df.Time < end_time
df.loc[q_product & q_start & q_end, c('Time,Product')]

Creating an empty DataFrame with known column names:
Names = ['Col1','ActivityID','TransactionID']
df = pd.DataFrame(columns = Names)
Creating a dataframe from a csv:
df = pd.read_csv('...../file_name.csv')
Creating a dynamic filter to subset a dataframe:
i = 12
df[df['ActivityID'] <= i]
Creating a dynamic filter to subset required columns of a dataframe:
df[df['ActivityID'] == i][['TransactionID','ActivityID']]
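The last two snippets can also be collapsed into a single .loc call (same hypothetical column names):
# Dynamic filter and column selection in one step:
df.loc[df['ActivityID'] == i, ['TransactionID', 'ActivityID']]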

Related

Pandas for loop optimization (vectorization) when looking at previous row value

I'm looking to optimize the time taken by a function with a for loop. The code below is OK for smaller dataframes, but for larger dataframes it takes too long. The function effectively creates a new column based on calculations using other column values and parameters. The calculation also considers the value of a previous row for one of the columns. I read that the most efficient way is to use Pandas vectorization, but I'm struggling to understand how to implement this when my for loop considers the previous row value of one column to populate a new column on the current row. I'm a complete novice, but have looked around and can't find anything that suits this specific problem, though I'm searching from a position of relative ignorance, so may have missed something.
The function is below, and I've created a test dataframe and random parameters too. It would be great if someone could point me in the right direction to get the processing time down. Thanks in advance.
def MODE_Gain(Data, rated, MODELim1, MODEin, Normalin, NormalLim600, NormalLim1):
    print('Calculating Gains')
    df = Data
    df.fillna(0, inplace=True)
    df['MODE'] = ""
    df['Nominal'] = ""
    df.iloc[0, df.columns.get_loc('MODE')] = 0
    for i in range(1, len(df.index)):
        print('Computing Status{i}/{r}'.format(i=i, r=len(df.index)))
        if ((df['MODE'].loc[i-1] == 1) & (df['A'].loc[i] > Normalin)):
            df['MODE'].loc[i] = 1
        elif (((df['MODE'].loc[i-1] == 0) & (df['A'].loc[i] > NormalLim600)) | ((df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1))):
            df['MODE'].loc[i] = 1
        else:
            df['MODE'].loc[i] = 0
    df[''] = (df['C'] / 6)
    for i in range(len(df.index)):
        print('Computing MODE Gains {i}/{r}'.format(i=i, r=len(df.index)))
        if ((df['A'].loc[i] > MODEin) & (df['A'].loc[i] < NormalLim600) & (df['B'].loc[i] < NormalLim1)):
            df['Nominal'].loc[i] = rated / 6
        else:
            df['Nominal'].loc[i] = 0
    df["Upgrade"] = df[""] - df["Nominal"]
    return df
import numpy as np
import pandas as pd

A = np.random.randint(0, 28, size=(8000))
B = np.random.randint(0,45,size=(8000))
C = np.random.randint(0,2300,size=(8000))
df = pd.DataFrame()
df['A'] = pd.Series(A)
df['B'] = pd.Series(B)
df['C'] = pd.Series(C)
MODELim600 = 32
MODELim30 = 28
MODELim1 = 39
MODEin = 23
Normalin = 20
NormalLim600 = 25
NormalLim1 = 32
rated = 2150
finaldf = MODE_Gain(df, rated, MODELim1, MODEin, Normalin,NormalLim600,NormalLim1)
Your second loop doesn't evaluate the prior row, so you should be able to use this instead
df['Nominal'] = 0
df.loc[(df['A'] > MODEin) & (df['A'] < NormalLim600) & (df['B'] < NormalLim1), 'Nominal'] = rated/6
For your first loop, the elif statement evaluates ((df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1)) and sets MODE to 1 regardless of the other condition, so you can pull that out and vectorize it. I didn't try it, but this should do it:
df.loc[(df['B'] > NormalLim1) & (df['B'] < MODELim1), 'MODE'] = 1
Then you may be able to collapse the other conditions into one statement using |.
Not sure how much all that will save you, but you should cut the time in half getting rid of the 2nd loop.
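For what it's worth, here is one possible (untested) sketch of that idea: precompute the row-wise conditions as numpy arrays, vectorize the second loop entirely, and keep only the MODE recurrence in a plain Python loop over those arrays:
import numpy as np

a = df['A'].to_numpy()
b = df['B'].to_numpy()

b_cond = (b > NormalLim1) & (b < MODELim1)   # sets MODE to 1 regardless of the previous row
stay_on = a > Normalin                       # keeps MODE at 1 when it was already 1
turn_on = a > NormalLim600                   # switches MODE on when it was 0

mode = np.zeros(len(df), dtype=int)
for i in range(1, len(df)):
    if b_cond[i] or (mode[i - 1] == 1 and stay_on[i]) or (mode[i - 1] == 0 and turn_on[i]):
        mode[i] = 1
df['MODE'] = mode

# The Nominal column has no row-to-row dependency, so it vectorises directly:
df['Nominal'] = np.where((a > MODEin) & (a < NormalLim600) & (b < NormalLim1), rated / 6, 0)
The loop still runs in Python, but it only touches plain numpy scalars instead of doing .loc lookups on the DataFrame each iteration, which is usually far cheaper.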
For vectorizing it I suggest you first shift your column into another one:
df['MODE_1'] = df['MODE'].shift(1)
and then use:
(df['MODE_1'].loc[i] == 1)
After that you should be able to vectorize

Python pandas pyhaystack

I am using a module called pyhaystack to retrieve data (REST API) from a building automation system based on 'tags'. Python will return a dictionary of the data. I'm trying to use pandas with an if/else statement further below that I am having trouble with. The pyhaystack part is working just fine to get the data.
This connects me to the automation system: (works just fine)
from pyhaystack.client.niagara import NiagaraHaystackSession
import pandas as pd
session = NiagaraHaystackSession(uri='http://0.0.0.0', username='Z', password='z', pint=True)
This code finds my tags called 'znt', converts dictionary to Pandas, and filters for time: (works just fine for the two points)
znt = session.find_entity(filter_expr='znt').result
znt = session.his_read_frame(znt, rng= '2018-01-01,2018-02-12').result
znt = pd.DataFrame.from_dict(znt)
znt.index.names=['Date']
znt = znt.fillna(method = 'ffill').fillna(method = 'bfill').between_time('08:00','17:00')
What I am most interested in is the column name, where ultimately I want Python to return the column named based on conditions:
print(znt.columns)
print(znt.values)
Returns:
Index(['C.Drivers.NiagaraNetwork.Adams_Friendship.points.A-Section.AV1.AV1ZN~2dT', 'C.Drivers.NiagaraNetwork.points.A-Section.AV2.AV2ZN~2dT'], dtype='object')
[[ 65.9087 66.1592]
[ 65.9079 66.1592]
[ 65.9079 66.1742]
...,
[ 69.6563 70.0198]
[ 69.6563 70.2873]
[ 69.5673 70.2873]]
I am most interested in this column name of the Pandas dataframe: C.Drivers.NiagaraNetwork.Adams_Friendship.points.A-Section.AV1.AV1ZN~2dT
For my two arrays, I am subtracting the value of 70 from the data in the data frame. (works just fine)
znt_sp = 70
deviation = znt - znt_sp
deviation = deviation.abs()
deviation
And this is where I am getting tripped up in Pandas. I want Python to print the name of the column if the deviation is greater than four, else print that the zone is Normal. Any tips would be greatly appreciated.
if (deviation > 4).any():
    print('Zone %f does not make setpoint' % deviation)
else:
    print('Zone %f is Normal' % deviation)
The column names in Pandas look like this:
C.Drivers.NiagaraNetwork.Adams_Friendship.points.A-Section.AV1.AV1ZN~2dT
I think DataFrame would be a good way to handle what you want.
Starting with znt you can make all the calculations there:
deviation = znt - 70
deviation = deviation.abs()
# and the cool part is filtering in the df
problem_zones = deviation[deviation['C.Drivers.NiagaraNetwork.Adams_Friendship.points.A-Section.AV1.AV1ZN~2dT'] > 4]
You can play with this and figure out a way to iterate through columns, like:
for each in df.columns:
    # if in this column, more than 10 occurrences of deviation GT 4...
    if len(df[df[each] > 4]) > 10:
        print('This zone has a lot of trouble: ', each)
edit
I like adding columns to a DataFrame instead of just building an external Series.
df['error_for_a'] = df['a'] - 70
This opens possibilities and keeps everything together. One could use
df[df['error_for_a'] > 4]
Again, all() or any() can be useful, but in a real-life scenario we would probably need to trigger the "fault detection" only when a certain number of errors are present.
If the schedule has been set 'occupied' at 8 AM, maybe the first entries won't be correct (any() would trigger an error even if the situation gets better 30 minutes later). Another scenario would be a conference room where the error is tiny... but as soon as there are people in it, things go bad (all() would not see that).
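As a rough sketch of that idea (N is a hypothetical tolerance, and deviation is the frame built earlier):
N = 10
miss_counts = (deviation > 4).sum()                      # per-column count of samples more than 4 off setpoint
problem_zones = miss_counts[miss_counts >= N].index.tolist()
print(problem_zones)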
Solution:
You can iterate over columns
for col in df.columns:
    if (df[col] > 4).any():  # or .all() if needed
        print('Zone %s does not make setpoint' % col)
    else:
        print('Zone %s is Normal' % col)
Or by defining a function and using apply
def _print(x):
    if (x > 4).any():
        print('Zone %s does not make setpoint' % x.name)
    else:
        print('Zone %s is Normal' % x.name)

df.apply(lambda x: _print(x))
# you can even do
[_print(df[col]) for col in df.columns]
Advice:
Maybe you would rather keep the result in another structure: change the function to return a boolean Series that says whether a column "is normal":
def is_normal(x):
    return not (x > 4).any()

s = df.apply(lambda x: is_normal(x))
# or directly
s = df.apply(lambda x: not (x > 4).any())
This returns a Series s whose index is the column names of your df and whose values are booleans corresponding to your condition.
You can then use it to get all the Normal column names with s[s].index, or the non-normal ones with s[~s].index.
Ex: I want only the normal columns of my df: df[s[s].index]
A complete example
For the example I will use a sample df with a different condition from yours (a column is Normal when every element is lower than 4, otherwise it does not make the setpoint).
df = pd.DataFrame(dict(a=[1,2,3],b=[2,3,4],c=[3,4,5])) # A sample
print(df)
a b c
0 1 2 3
1 2 3 4
2 3 4 5
Your use case: Print if normal or not - Solution
for col in df.columns:
    if (df[col] >= 4).any():
        print('Zone %s does not make setpoint' % col)
    else:
        print('Zone %s is Normal' % col)
Result
Zone a is Normal
Zone b does not make setpoint
Zone c does not make setpoint
To illustrate my advice: keep the is_normal results in a Series
s = df.apply(lambda x: not (x >= 4).any())  # Build the series
print(s)
a True
b False
c False
dtype: bool
print(df[s[~s].index])  # False columns of the df
b c
0 2 3
1 3 4
2 4 5
print(df[s[s].index])  # True columns of the df
a
0 1
1 2
2 3

Pandas: filter on multiple columns [duplicate]

I am working in Pandas, and I want to apply multiple filters to a data frame across multiple fields.
I am working with another, more complex data frame, but I am simplifying the context for this question. Here is the setup for a sample data frame:
dates = pd.date_range('20170101', periods=16)
rand_df = pd.DataFrame(np.random.randn(16,4), index=dates, columns=list('ABCD'))
Applying one filter to this data frame is well documented and simple:
rand_df.loc[lambda df: df['A'] < 0]
Since the lambda looks like a simple boolean expression, it is tempting to do the following. This does not work because, instead of being a boolean expression, it is a callable, and multiple of these cannot be combined the way boolean expressions can:
rand_df.loc[lambda df: df['A'] < 0 and df['B'] < 0]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-dfa05ab293f9> in <module>()
----> 1 rand_df.loc[lambda df: df['A'] < 0 and df['B'] < 0]
I have found two ways to successfully implement this. I will add them to the potential answers, so you can comment directly on them as solutions. However, I would like to solicit other approaches, since I am not really sure that either of these is a very standard approach for filtering a Pandas data frame.
In [3]: rand_df.query("A < 0 and B < 0")
Out[3]:
A B C D
2017-01-02 -0.701682 -1.224531 -0.273323 -1.091705
2017-01-05 -1.262971 -0.531959 -0.997451 -0.070095
2017-01-06 -0.065729 -1.427199 1.202082 0.136657
2017-01-08 -1.445050 -0.367112 -2.617743 0.496396
2017-01-12 -1.273692 -0.456254 -0.668510 -0.125507
or:
In [6]: rand_df[rand_df[['A','B']].lt(0).all(1)]
Out[6]:
A B C D
2017-01-02 -0.701682 -1.224531 -0.273323 -1.091705
2017-01-05 -1.262971 -0.531959 -0.997451 -0.070095
2017-01-06 -0.065729 -1.427199 1.202082 0.136657
2017-01-08 -1.445050 -0.367112 -2.617743 0.496396
2017-01-12 -1.273692 -0.456254 -0.668510 -0.125507
PS You will find a lot of examples in the Pandas docs
rand_df[(rand_df.A < 0) & (rand_df.B <0)]
To use the lambda, combine the conditions with & (and parentheses) instead of and:
rand_df.loc[lambda x: (x.A < 0) & (x.B < 0)]
# Or
# rand_df[lambda x: (x.A < 0) & (x.B < 0)]
A B C D
2017-01-12 -0.460918 -1.001184 -0.796981 0.328535
2017-01-14 -0.146846 -1.088095 -1.055271 -0.778120
You can speed up the evaluation by using boolean numpy arrays:
c1 = rand_df.A.values < 0
c2 = rand_df.B.values < 0
rand_df[c1 & c2]
A B C D
2017-01-12 -0.460918 -1.001184 -0.796981 0.328535
2017-01-14 -0.146846 -1.088095 -1.055271 -0.778120
Here is an approach that "chains" use of the .loc operation:
rand_df.loc[lambda df: df['A'] < 0].loc[lambda df: df['B'] < 0]
Here is an approach which includes writing a method to do the filtering. I am sure that some filters will be sufficiently complex or complicated that the method is the best way to go (this case is not so complex). Also, when I am using Pandas and I write a "for" loop, I feel like I am doing it wrong.
def lt_zero_ab(df):
    result = []
    for index, row in df.iterrows():
        if row['A'] < 0 and row['B'] < 0:
            result.append(index)
    return result
rand_df.loc[lt_zero_ab]

Merging DataFrames on multiple conditions - not specifically on equal values

Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have tried already.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators, however, I need to make use of LESS THAN and MORE THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference ID and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by looking at either chromosomes or genes separately. There are 22 different chromosome values (ints 1-22) as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql module, while the other approach is to loop through the separate genes.
SQL method
import pandas as pd
import pandasql as psql
pysqldf = lambda q: psql.sqldf(q, globals())
q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps = pysqldf(q)
    all_dfs.append(genic_snps)

all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
    info = line[1]  # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                          (snp_df['BP'] >= info['chr_start']) &
                          (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
I've just thought of a way to solve this - by combining my two methods:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This also doesn't have to make use of any SQL queries either. I've also included a section to immediately identify any redundant genes that don't have any SNPs that fall within their range. This makes use of a double for-loop which I normally try to avoid - but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    # Getting rid of redundant genes
    min_bp = this_chr_snp['BP'].min()
    max_bp = this_chr_snp['BP'].max()
    this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
                                ~(this_genes['chr_stop'] <= min_bp)]
    for line in this_genes.iterrows():
        info = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                    (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly - it does run so that I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
You can use the following to accomplish what you're looking for:
merged_df = snp_df.merge(gene_df, on=['chromosome'], how='inner')
merged_df = merged_df[(merged_df.BP >= merged_df.chr_start) & (merged_df.BP <= merged_df.chr_stop)][['SNP', 'feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
snp_df
Out[193]:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 30400
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
gene_df
Out[194]:
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
merged_df
Out[195]:
SNP feature_id
8 rs3131972 GeneID:100302278
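One caveat on the merge-then-filter approach above: with roughly 6.8 million SNPs, the intermediate merge on 'chromosome' alone can get very large. A possible variant (a sketch, not benchmarked) is to apply the same idea chromosome by chromosome and concatenate the pieces; it is still a cross product within each chromosome, so keep an eye on memory:
pieces = []
for chrom, snps in snp_df.groupby('chromosome'):
    genes = gene_df[gene_df['chromosome'] == chrom]
    m = snps.merge(genes, on='chromosome', how='inner')
    m = m[(m.BP >= m.chr_start) & (m.BP <= m.chr_stop)]
    pieces.append(m[['SNP', 'feature_id']])

all_genic_snps = pd.concat(pieces, ignore_index=True)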

Pandas filter rows based on multiple conditions

I have some values in the risk column that are not Small, Medium, or High. I want to delete the rows where the value is not Small, Medium, or High. I tried the following:
df = df[(df.risk == "Small") | (df.risk == "Medium") | (df.risk == "High")]
But this returns an empty DataFrame. How can I filter them correctly?
I think you want:
df = df[(df.risk.isin(["Small","Medium","High"]))]
Example:
In [5]:
import pandas as pd
df = pd.DataFrame({'risk':['Small','High','Medium','Negligible', 'Very High']})
df
Out[5]:
risk
0 Small
1 High
2 Medium
3 Negligible
4 Very High
[5 rows x 1 columns]
In [6]:
df[df.risk.isin(['Small','Medium','High'])]
Out[6]:
risk
0 Small
1 High
2 Medium
[3 rows x 1 columns]
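If you instead want to look at (or explicitly drop) the offending rows, the same mask can simply be negated:
# Rows whose risk is not one of the three accepted labels:
df[~df.risk.isin(['Small', 'Medium', 'High'])]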
Another nice and readable approach is the following:
small_risk = df["risk"] == "Small"
medium_risk = df["risk"] == "Medium"
high_risk = df["risk"] == "High"
Then you can use it like this:
df[small_risk | medium_risk | high_risk]
or, with & when several conditions must hold at once (note that small_risk & medium_risk would always return an empty frame here, since a single value cannot be both Small and Medium):
df[small_risk & medium_risk]
You could also use query:
df.query('risk in ["Small","Medium","High"]')
You can refer to variables in the environment by prefixing them with @. For example:
lst = ["Small","Medium","High"]
df.query("risk in @lst")
If the column name is multiple words, e.g. "risk factor", you can refer to it by surrounding it with backticks ` `:
df.query('`risk factor` in @lst')
The query method comes in handy if you need to chain multiple conditions. For example, the outcome of the following filter:
df[df['risk factor'].isin(lst) & (df['value']**2 > 2) & (df['value']**2 < 5)]
can be derived using the following expression:
df.query('`risk factor` in #lst and 2 < value**2 < 5')
