Pandas: get rows by their values from dataframes - Python

I have a reference dataframe:
ex:
  time  latitude  longtitude  pm2.5
0    .         0           0      0
1    .         0           5      1
......
And I have a query:
ex:
  time  latitude  longtitude
0    .         1           3
1    .         0           5
.......
I want to get the pm2.5 values that match the rows in the query. I have used row iteration, but it seems very slow:
predications_phy = []
for index, row in X_test.iterrows():
    Y = phyDf[(phyDf["time"] == row["time"]) &
              (phyDf["latitude"] == row["latitude"]) &
              (phyDf["longtitude"] == row["longtitude"])]
    predications_phy.append(Y)
What is the efficient and correct way to get the rows?

Given reference dataframe df1 and query dataframe df2, you can perform a left merge to extract your result:
res = df2.merge(df1, how='left')
print(res)
#    time  latitude  longtitude  pm2.5
# 0     0         1           3    NaN
# 1     1         0           5    1.0
Loops are highly discouraged unless your operation cannot be vectorised.
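For completeness, here is a minimal self-contained sketch of this merge, rebuilding the two frames from the sample values in the output above (the misspelled column name longtitude is kept to match the question's data):

import pandas as pd

# reference frame (df1) and query frame (df2)
df1 = pd.DataFrame({'time': [0, 1], 'latitude': [0, 0],
                    'longtitude': [0, 5], 'pm2.5': [0, 1]})
df2 = pd.DataFrame({'time': [0, 1], 'latitude': [1, 0],
                    'longtitude': [3, 5]})

# with no on= given, merge joins on all shared columns:
# time, latitude and longtitude
res = df2.merge(df1, how='left')
print(res)

Passing on=['time', 'latitude', 'longtitude'] explicitly would guard against accidentally joining on an unexpected shared column.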

Related

Python repeatable cycle for picking only the first values equal to 1

I have a df whose index contains dates and whose values are 0 or 1. I need to pick the first 1 of every run from this data frame.
For example:
2019-11-27 0
2019-11-29 0
2019-12-02 0
2019-12-03 1
2019-12-04 1
2019-12-05 1
2020-06-01 0
2020-06-02 0
2020-06-03 1
2020-06-04 1
2020-06-05 1
So I want to get:
2019-12-03 1
2020-06-03 1
Assuming you want the first date with value 1 of each run, with the dataframe ordered by date ascending, a window operation might be the best way to do this:
df['PrevValue'] = df['value'].rolling(2).agg(lambda rowset: int(rowset.iloc[0]))
This line of code adds an extra column named "PrevValue" to the dataframe containing the value of the previous row or "NaN" for the first row.
Next, you could query the data as follows:
df_filtered = df.query("value == 1 & PrevValue == 0")
Resulting in the following output:
        date  value  PrevValue
3 2019-12-03      1        0.0
8 2020-06-03      1        0.0
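As an equivalent alternative, Series.shift fetches the previous row's value directly and does the same job as the two-row window above; a minimal sketch on the sample data (column names date and value assumed):

import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-11-27", "2019-11-29", "2019-12-02", "2019-12-03",
        "2019-12-04", "2019-12-05", "2020-06-01", "2020-06-02",
        "2020-06-03", "2020-06-04", "2020-06-05",
    ]),
    "value": [0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1],
})

# shift(1) is NaN for the first row, so a leading 1 would not be picked up,
# matching the rolling-window behaviour above
df["PrevValue"] = df["value"].shift(1)
print(df.query("value == 1 & PrevValue == 0"))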
I built a function that satisfies your requirements.
One important note: you should change the col argument to the actual name of your column, or it will raise a KeyError.
def funfun(df, col="values"):
    '''
    df : DataFrame with a default integer (RangeIndex) index
    col (str) : name of the column that you want to scan
    '''
    a = []
    values = df[col].tolist()
    for i in range(len(values) - 1):
        # keep the row that follows every 0 -> 1 transition
        if (values[i], values[i + 1]) == (0, 1):
            a.append(df.iloc[i + 1])
    return a
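A quick usage sketch, assuming the question's frame with its date index and a value column named "value" (hypothetical, since the question leaves the column unnamed); reset_index turns the date index into a column so the positional access lines up:

# returns the two matching rows: 2019-12-03 and 2020-06-03
rows = funfun(df.reset_index(), col="value")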

Delete rows of Dataframe based on multiple conditions from different Dataframe

I have two large DataFrames. The first one contains data, consisting of a date column and a location column, followed by several data columns. The second DataFrame consists of a date column and a location column. I want to remove all the rows where the date and the location of df1 match df2.
I have tried a few ways to fix this, including drop statements, drop statements within for loops, and redefining the dataframe based on multiple conditions. None of them work:
date = pd.to_datetime(['2019-01-01','2019-01-01','2019-01-02','2019-01-02','2019-01-03','2019-01-03'],format='%Y-%m-%d')
location = [1,2,1,2,1,2]
nr = [8,10,15,2,20,38]
df1 = pd.DataFrame(columns=['date','location','nr'])
df1['date']=date
df1['location']=location
df1['nr']=nr
this results in the following dataframe:
        date  location  nr
0 2019-01-01         1   8
1 2019-01-01         2  10
2 2019-01-02         1  15
3 2019-01-02         2   2
4 2019-01-03         1  20
5 2019-01-03         2  38
the second dataframe:
date2 = pd.to_datetime(['2019-01-01','2019-01-02'],format='%Y-%m-%d')
location2 = [2,1]
df2 = pd.DataFrame(columns=['date','location'])
df2['date']=date2
df2['location']=location2
resulting in the following dataframe:
        date  location
0 2019-01-01         2
1 2019-01-02         1
then the drop statement:
for i in range(len(df2)):
    dayA = df2['date'].iloc[i]
    placeA = df2['location'].iloc[i]
    df1.drop(df1.loc[(df1['date']==dayA) & (df1['location']==placeA)], inplace=True)
which, in this example, results in the error:
KeyError: "['date' 'location' 'nr'] not found in axis"
However, on my larger dataframe it results in the error:
TypeError: 'NoneType' object is not iterable
What I need, however, is:
        date  location  nr
0 2019-01-01         1   8
3 2019-01-02         2   2
4 2019-01-03         1  20
5 2019-01-03         2  38
What am I doing wrong?
df1.loc[(df1['date']==dayA) & (df1['location']==placeA)] is the dataframe of rows where the date and location match, but drop expects the index labels of those rows. So you need:
df1.loc[(df1['date']==dayA) & (df1['location']==placeA)].index
However, this loop is a very inefficient method; you can use merge instead, as the other answers discuss. Another option is:
df1 = df1.loc[~df1[['date','location']].apply(tuple, axis=1).isin(zip(df2.date, df2.location))]
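A minimal sketch of the corrected loop (functional, but still far slower than the merge-based approaches):

for i in range(len(df2)):
    dayA = df2['date'].iloc[i]
    placeA = df2['location'].iloc[i]
    # pass the matching row labels to drop, not the matching rows themselves
    df1.drop(df1.loc[(df1['date'] == dayA) &
                     (df1['location'] == placeA)].index,
             inplace=True)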
I would use pandas merge and a little trick:
df2['temp'] = 2
df = pd.merge(df1, df2, how='outer', on=['date', 'location'])
df = df[pd.isna(df.temp)]
del df['temp']
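As an aside, merge has a built-in flag for exactly this pattern, which avoids the temporary column; a sketch assuming the same df1 and df2 (indicator=True adds a _merge column marking each row as 'left_only', 'right_only' or 'both'):

merged = df1.merge(df2, on=['date', 'location'], how='left', indicator=True)
result = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')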
Problem is with this line:
df1.drop(df1.loc[(df1['date']==dayA)& (df1['location']==placeA)],inplace=True)
You can achieve your target by replacing it, inside the loop, with:
df1 = df1.loc[~((df1['date']==dayA) & (df1['location']==placeA))]
Basically, every time you find a match for a row, you remove it from the df1 dataframe.
Output:
        date  location  nr
0 2019-01-01         1   8
3 2019-01-02         2   2
4 2019-01-03         1  20
5 2019-01-03         2  38
Use pandas merge. This should work:
df1['index_col'] = df1.index
matches = df1.merge(df2, on=['date', 'location'], how='inner')
result_df = df1[~df1.index_col.isin(matches.index_col)].drop(columns='index_col')
A left merge followed by dropna() would not work here, because df2 contributes no extra columns that could hold NaN; the inner merge collects the matching rows, whose index_col values are then excluded from df1.

Pandas/Numpy shift rows into column based on existence

I have a dataframe like so:
col_a | col b
    0       1
    0       2
    0       3
    1       1
    1       2
I want to convert it to:
col_a | 1 | 2 | 3
    0   1   1   1
    1   1   1   0
Unfortunately, most questions/answers on this topic simply pivot the values rather than flagging their existence.
Background: For Scikit, I want to use the existence of values in column b as an attribute/feature (like a sort of manual CountVectorizer, but for row values in this case instead of text)
Use get_dummies after moving the first column to the index, then take the maximum per index level so the output contains only 1/0 values:
df = pd.get_dummies(df.set_index('col_a')['col b'], prefix='', prefix_sep='').groupby(level=0).max()
print (df)
       1  2  3
col_a
0      1  1  1
1      1  1  0
You can use groupby.cumcount and use the result as columns for a pivoted dataframe obtained with pd.crosstab, which by default computes a frequency table of the factors:
cols = df.groupby('col_a').cumcount()
pd.crosstab(index = df.col_a, columns = cols)
col_0  0  1  2
col_a
0      1  1  1
1      1  1  0
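For reference, a minimal self-contained reproduction of the sample data. Note that a plain crosstab of col_a against col b already yields the desired 1/2/3 columns; the clip(upper=1) is a guard in case a (col_a, col b) pair occurs more than once:

import pandas as pd

df = pd.DataFrame({"col_a": [0, 0, 0, 1, 1], "col b": [1, 2, 3, 1, 2]})

# count occurrences of each value per group, capped at 1 for pure 0/1 flags
out = pd.crosstab(index=df["col_a"], columns=df["col b"]).clip(upper=1)
print(out)
# col b  1  2  3
# col_a
# 0      1  1  1
# 1      1  1  0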

reset a recurring multiindex in pandas

I have a pandas data frame in python coming from a pd.concat with a recurring multiindex:
     customer_id
0 0     46841769
  1      4683936
1 0      8880872
  1      8880812
0 0      8880873
  1      1000521
1 0      1135488
  1      5388773
Now I want to reset only the first level of the MultiIndex, so that I get an increasing number in the outer level. Something like this:
     customer_id
0 0     46841769
  1      4683936
1 0      8880872
  1      8880812
2 0      8880873
  1      1000521
3 0      1135488
  1      5388773
I have around 5 million records and not the biggest machine, so I'm looking for a memory-efficient solution.
ignore_index=True in pd.concat does not work, because then I lose the MultiIndex.
Many thanks.
You can convert the first level to a Series with get_level_values and to_series, compare it with its shifted values, take the cumsum to number the runs (subtracting 1 to start at 0), and finally rebuild the index with MultiIndex.from_arrays:
a = df.index.get_level_values(0).to_series()
a = a.ne(a.shift()).cumsum() - 1
mux = pd.MultiIndex.from_arrays([a, df.index.get_level_values(1)], names=df.index.names)
df.index = mux
Or:
df = df.set_index(mux)
print (df)
     customer_id
0 0     46841769
  1      4683936
1 0      8880872
  1      8880812
2 0      8880873
  1      1000521
3 0      1135488
  1      5388773
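A self-contained sketch of this approach, rebuilding the sample frame from the question (the duplicated outer level mimics the result of pd.concat):

import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (1, 0), (1, 1), (0, 0), (0, 1), (1, 0), (1, 1)]
)
df = pd.DataFrame({"customer_id": [46841769, 4683936, 8880872, 8880812,
                                   8880873, 1000521, 1135488, 5388773]},
                  index=idx)

# number each run of equal outer-level values: 0, 0, 1, 1, 2, 2, 3, 3
a = df.index.get_level_values(0).to_series()
a = a.ne(a.shift()).cumsum() - 1
df.index = pd.MultiIndex.from_arrays(
    [a, df.index.get_level_values(1)], names=df.index.names
)
print(df)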

how to transpose multiple level pandas dataframe based only on outer index

Below is my dataframe with two-level indexing. I want only the outer index to be transposed into columns. My desired output would be a 2x2 dataframe instead of the 4x1 dataframe as is the case now. Can any of you please help?
        0
0 0   232
  1  3453
1 0   443
  1  3241
Given that you have the MultiIndex, you can use unstack() on level 0:
import pandas as pd

index = pd.MultiIndex.from_tuples([(0, 0), (0, 1), (1, 0), (1, 1)])
df = pd.DataFrame([[1], [2], [3], [4]], index=index, columns=[0])
print(df.unstack(level=[0]))
   0
   0  1
0  1  3
1  2  4
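If the leftover outer column level (the original column label 0) is unwanted, it can be dropped as well; a sketch assuming pandas >= 0.24 for DataFrame.droplevel:

# drop the outer column level so only the original inner labels remain
out = df.unstack(level=0).droplevel(0, axis=1)
print(out)
#    0  1
# 0  1  3
# 1  2  4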
One way to do this would be to reset the index and then pivot the table indexing on the level_1 of the index, and using level_0 as the columns and 0 as the values. Example -
df.reset_index().pivot(index='level_1',columns='level_0',values=0)
Demo -
In [66]: index = pd.MultiIndex.from_tuples([(0,0),(0,1),(1,0),(1,1)])
In [67]: df = pd.DataFrame([[1],[2],[3],[4]] , index=index, columns=[0])
In [68]: df
Out[68]:
     0
0 0  1
  1  2
1 0  3
  1  4
In [69]: df.reset_index().pivot(index='level_1',columns='level_0',values=0)
Out[69]:
level_0  0  1
level_1
0        1  3
1        2  4
Later on, if you want, you can set the .name attribute of the index as well as of the columns to an empty string (or whatever you want) if you don't want the level_* labels there; see the sketch below.
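A short sketch of that cleanup, clearing the axis names left over from the pivot:

out = df.reset_index().pivot(index='level_1', columns='level_0', values=0)
out.index.name = None     # removes the leftover "level_1" label
out.columns.name = None   # removes the leftover "level_0" label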
