I wanted to calculate the percentage of some object per hour ('Time'), so I tried to write a lambda function. I think it does the job, but the columns the dataframe is grouped by (the index columns) disappeared.
df = (df.groupby(['id', 'name', 'time', 'object', 'type'], as_index=True, sort=False)
        [['col1', 'col2', 'col3', 'col4', 'col5']]
        .apply(lambda x: x * 100 / 3600)
        .reset_index())
After running that code I printed df.columns and got this:
Index([u'index', u'col1', u'col2', u'col3',
       u'col4', u'col5'],
      dtype='object')
If needed, I can add a sample table with values for each column.
Thanks in advance.
Moving the loop outward, i.e. looping over the columns yourself rather than relying on groupby/apply, will make the code run significantly faster:
for c in ['col1', 'col2', 'col3', 'col4', 'col5']:
    df[c] *= 100. / 3600
This is because each iteration's calculation (one whole column at a time) is done in a vectorized way.
This also won't modify the index in any way.
pd.DataFrame.groupby is used to aggregate data, not to apply a function to multiple columns.
For simple functions, you should look for a vectorised solution. For example:
import pandas as pd

# set up a simple dataframe
df = pd.DataFrame({'id': [1, 2, 1], 'name': ['A', 'B', 'A'],
                   'col1': [5, 6, 8], 'col2': [9, 4, 5]})

# apply logic in a vectorised way on multiple columns
df[['col1', 'col2']] = df[['col1', 'col2']].values * 100 / 3600
If you wish to set your index as multiple columns, and are keen to use pd.DataFrame.apply, this is possible as two separate steps. For example:
df = df.set_index(['id', 'name'])
df[['col1', 'col2']] = df[['col1', 'col2']].apply(lambda x: x * 100 / 3600)
You apply .reset_index(), which resets the index. If you take a look at the pandas documentation, you'll see that .reset_index() transfers the index into the columns.
Data from jpp's answer above:
df[['col1', 'col2']] *= 100 / 3600
df
Out[110]:
col1 col2 id name
0 0.138889 0.250000 1 A
1 0.166667 0.111111 2 B
2 0.222222 0.138889 1 A
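Putting the pieces together for the five columns in the original question, a minimal sketch (the column names are taken from the question): since nothing is grouped, the key columns such as 'id', 'name' and 'time' never leave the frame, so nothing needs to be reset.
cols = ['col1', 'col2', 'col3', 'col4', 'col5']
df[cols] = df[cols] * 100 / 3600  # 'id', 'name', 'time', 'object', 'type' remain ordinary columns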
Related
I'm looking to merge two dataframes across multiple columns but with some additional conditions.
import pandas as pd
df1 = pd.DataFrame({
'col1': ['a','b','c', 'd'],
'optional_col2': ['X',None,'Z','V'],
'optional_col3': [None,'def', 'ghi','jkl']
})
df2 = pd.DataFrame({
'col1': ['a','b','c', 'd'],
'optional_col2': ['X','Y','Z','W'],
'optional_col3': ['abc', 'def', 'ghi','mno']
})
I would like to always join on col1 but then try to also join on optional_col2 and optional_col3. In df1, the value can be NaN for both columns, but it is always populated in df2. I would like the join to be valid when col1 matches and at least one of optional_col2 or optional_col3 matches.
This would result in ['a', 'b', 'c'] joining: 'a' via the optional_col2 match, 'b' via the optional_col3 match, and 'c' via an exact match on both.
In SQL I suppose you could write the join as this, if it helps explain further:
select
    *
from
    df1
inner join
    df2
on df1.col1 = df2.col1
    AND (df1.optional_col2 = df2.optional_col2 OR df1.optional_col3 = df2.optional_col3)
I've messed around with pd.merge but can't figure out how to do a complex operation like this. I think I could do a merge on ['col1', 'optional_col2'], then a second merge on ['col1', 'optional_col3'], then union and drop duplicates?
Expected DataFrame would be something like:
merged_df = pd.DataFrame({
    'col1': ['a', 'b', 'c'],
    'optional_col2': ['X', 'Y', 'Z'],
    'optional_col3': ['abc', 'def', 'ghi']
})
This solution works by creating an extra column called "temp" in both dataframes. In df1 it will be a column of True values. In df2 the values will be True if there is a match on either of the optional columns. I'm not clear whether you consider a NaN value to be matchable or not. If so, you need to fill in the NaNs of the columns in df1 with the values from df2 before comparing, to fulfil your criteria around missing values (this is what is done below). If this is not required, drop the fillna calls in the example below.
df1["temp"] = True
optional_col2_match = df1["optional_col2"].fillna(df2["optional_col2"]).eq(df2["optional_col2"])
optional_col3_match = df1["optional_col3"].fillna(df2["optional_col3"]).eq(df2["optional_col3"])
df2["temp"] = optional_col2_match | optional_col3_match
Then use the "temp" column in the merge, and then drop it - it has served its purpose
pd.merge(df1, df2, on=["col1", "temp"]).drop(columns="temp")
This gives the following result
col1 optional_col2_x optional_col3_x optional_col2_y optional_col3_y
0 a X abc X abc
1 b Y def Y def
2 c Z ghi Z ghi
You will need to decide what to do here. In the example you gave there are no rows which match on just one of optional_col2 and optional_col3, which is why a three-column result looks reasonable. This won't generally be the case.
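If rows that match on only one optional column do appear, the two-merge idea from the question is one way to handle it. A rough sketch (reusing df1 and df2 from the question; m2, m3 and merged are just illustrative names, not a drop-in implementation):
import pandas as pd

# match via optional_col2, then via optional_col3, keeping df2's values
m2 = df1[['col1', 'optional_col2']].merge(df2, on=['col1', 'optional_col2'])
m3 = df1[['col1', 'optional_col3']].merge(df2, on=['col1', 'optional_col3'])

merged = (pd.concat([m2, m3])[['col1', 'optional_col2', 'optional_col3']]
          .drop_duplicates()          # rows matching on both columns appear twice
          .reset_index(drop=True))
On the sample data this should yield the expected ['a', 'b', 'c'] rows with df2's values.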
I have multiple dataframes having the same columns and the same number of observations:
For example
d1 = {'ID': ['A', 'B', 'C', 'D'], 'Amount': [1, 2, 3, 4]}
df1 = pd.DataFrame(data=d1)

d2 = {'ID': ['A', 'B', 'C', 'D'], 'Amount': [6, 0, 1, 5]}
df2 = pd.DataFrame(data=d2)

d3 = {'ID': ['A', 'B', 'C', 'D'], 'Amount': [8, 1, 2, 3]}
df3 = pd.DataFrame(data=d3)
I need to drop one ID ('D') and its corresponding value in each of the dataframes and then, for each remaining ID, calculate the mean and standard deviation of Amount.
The expected output should be
avg std
A 5 ...
B ... ...
C ... ...
Generally, for a single dataframe, I would use drop and then compute the average using mean() and the standard deviation using std().
How can I do this in an easy and fast way with multiple dataframes? (I have at least 10 of them).
Use concat, remove D with DataFrame.query, and aggregate with GroupBy.agg using named aggregations:
df = (pd.concat([df1, df2, df3])
.query('ID != "D"')
.groupby('ID')
.agg(avg=('Amount', 'mean'), std=('Amount', 'std')))
print (df)
avg std
ID
A 5 3.605551
B 1 1.000000
C 2 1.000000
Or remove D in the last step with DataFrame.drop:
df = (pd.concat([df1, df2, df3])
.groupby('ID')
.agg(avg=('Amount', 'mean'), std=('Amount', 'std'))
.drop('D'))
You can use pivot_table as well:
import numpy as np
pd.concat([df1, df2, df3]).pivot_table(index='ID', aggfunc=[np.mean, np.std]).drop('D')
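Since the question mentions having at least ten such dataframes, it may be easier to keep them in a list and pass that to concat. A sketch, assuming the frames are collected in a list called dfs (a name introduced here for illustration):
dfs = [df1, df2, df3]  # extend with df4, ..., df10 as needed
out = (pd.concat(dfs)
       .query('ID != "D"')
       .groupby('ID')
       .agg(avg=('Amount', 'mean'), std=('Amount', 'std')))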
I have used a simple 'groupby' to condense rows in a Pandas dataframe:
df = df.groupby(['col1', 'col2', 'col3']).sum()
In the new DataFrame 'df', the three columns that were used in the 'groupby' function are now fixed within the index and are no longer column indexes 0, 1 and 2 - what was previously column index 4 is now column index 0.
How do I stop this from happening / reinclude the three 'groupby' columns along with the original data?
Try -
df = df.groupby(['col1', 'col2', 'col3'], as_index=False).sum()
#or
df = df.groupby(['col1', 'col2', 'col3']).sum().reset_index()
Try resetting the index
df = df.reset_index()
I am trying to get the column names which have cell values less than .2, without repeating a combination of columns. I tried this to iterate over the column names without success:
pvals2 = pd.DataFrame({'col1': [1, .2, .7],
                       'col2': [.2, 1, .01],
                       'col3': [.7, .01, 1]},
                      index=['col1', 'col2', 'col3'])
print(pvals2)
print('---')
pvals2.transpose().join(pvals2, how='outer')
My goal is:
col3 col2 .01
#col2 col3 .01 #NOT INCLUDED (because it is a repeat)
A list comprehension is one way:
pvals2 = pd.DataFrame({'col1': [1, .2,.7], 'col2': [.2, 1,.01], 'col3': [.7,.01,1]},
index = ['col1', 'col2', 'col3'])
res = [col for col in pvals2 if (pvals2[col] < 0.2).any()]
# ['col2', 'col3']
To get values as well, as in your desired output, requires more specification, as a column may have more than one value less than 0.2.
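For instance, one way to list every (row, column) pair below the threshold is to stack the frame and filter (a sketch, reusing pvals2 from above; below is just an illustrative name):
below = pvals2.stack()      # Series indexed by (row, column) pairs
below = below[below < 0.2]  # keep only the small values
# expected:
# col2  col3    0.01
# col3  col2    0.01
# dtype: float64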
Iterate through the columns and check if any value meets your conditions:
pvals2 = pd.DataFrame({'col1': [1, .2, .7],
                       'col2': [.2, 1, .01],
                       'col3': [.7, .01, 1]})

cols_with_small_values = set()
for col in pvals2.columns:
    if any(i < 0.2 for i in pvals2[col]):
        cols_with_small_values.add(col)
        cols_with_small_values.add(pvals2[col].min())

print(cols_with_small_values)
RESULT: {'col3', 0.01, 'col2'}
any is a built-in. This question has a good explanation of how any works. And we use a set to ensure each column will only appear once.
We use Series.min() to get the small value that caused us to select this column.
You could use stack, keep only the values less than 0.2, and then drop duplicated values, keeping the last:
pvals2.stack()[pvals2.stack().lt(.2)].drop_duplicates(keep='last')
col3 col2 0.01
dtype: float64
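Note that drop_duplicates here de-duplicates by value, which works because the matrix is symmetric: each below-threshold entry appears twice, once per orientation. If two unrelated column pairs happened to share the same value, one of them would be dropped as well.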
pvals2 = pd.DataFrame({'col1': [1, .2, .7],
                       'col2': [.2, 1, .01],
                       'col3': [.7, .01, 1]},
                      index=['col1', 'col2', 'col3'])

pvals2.min().where(lambda x: x < 0.2).dropna()
Output
col2 0.01
col3 0.01
dtype: float64
I've searched and haven't found an answer to this question: can you perform a merge of pandas dataframes using OR logic? Basically, the equivalent of a SQL join using "where t1.A = t2.A OR t1.A = t2.B".
I have a situation where I am pulling information from one database into a dataframe (df1) and I need to merge it with information from another database, which I pulled into another dataframe (df2), merging based on a single column (col1). If these always used the same value when they matched, it would be very straightforward. The situation I have is that sometimes they match and sometimes they use a synonym. There is a third database that has a table that provides a lookup between synonyms for this data entity (col1 and col1_alias), which could be pulled into a third dataframe (df3). What I am looking to do is merge the columns I need from df1 and the columns I need from df2.
As stated above, in cases where df1.col1 and df2.col1 match, this would work...
df = df1.merge(df2, on='col1', how='left')
However, they don't always have the same value and sometimes use synonyms. I thought about creating df3 based on when df3.col1 was in df1.col1 OR df3.col1_alias was in df1.col1. Then, creating a single list of values from df3.col1 and df3.col1_alias (list1) and selecting df2 based on df2.col1 being in list1. This would give me the rows from df2 I need, but that still wouldn't put me in a position to merge df1 and df2 on the appropriate rows. I think if there were an OR merge option, I could step through this and make it work, but all of the following threw a syntax error:
df = df1.merge((df3, left_on='col1', right_on='col1', how='left')|(df3, left_on='col1', right_on='col1_alias', how='left'))
and
df = df1.merge(df3, (left_on='col1', right_on='col1')|(left_on='col1', right_on='col1_alias'), how='left')
and
df = df1.merge(df3, left_on='col1', right_on='col1'|right_on='col1_alias', how='left')
and several other variations. Any guidance on how to perform an OR merge or suggestions on a completely different approach to merging df1 and df2 using the synonyms in two columns in df3?
I think I would do this as two merges:
In [11]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["A", "B"])
In [12]: df2 = pd.DataFrame([[1, 7], [2, 8], [4, 9]], columns=["C", "D"])
In [13]: res = df.merge(df2, left_on="B", right_on="C", how="left")
In [14]: res.update(df.merge(df2, left_on="A", right_on="C", how="left"))
In [15]: res
Out[15]:
A B C D
0 1 2 1.0 7.0
1 3 4 4.0 9.0
2 5 6 NaN NaN
As you can see this picks A = 1 -> D = 7 rather than B = 2 -> D = 8.
Note: For more extensibility (matching different columns) it might make sense to pull out a single column, although they're both the same in this example:
In [21]: res = df.merge(df2, left_on="B", right_on="C", how="left")["C"]
In [22]: res.update(df.merge(df2, left_on="A", right_on="C", how="left")["C"])
In [23]: res
Out[23]:
0 1.0
1 4.0
2 NaN
Name: C, dtype: float64
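From there, one way to finish (a sketch, not part of the original answer; final is just an illustrative name) is to attach the matched key back onto df and merge on it:
final = df.assign(C=res).merge(df2, on="C", how="left")
# should reproduce the A/B/C/D frame shown in Out[15] above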
# will this work?
df = pd.concat([df1.merge(df3, left_on='col1', right_on='col1', how='left'),
                df1.merge(df3, left_on='col1', right_on='col1_alias', how='left')])
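Roughly, yes, but note that both pieces are left joins, so every df1 row comes through from each merge (non-matches as NaNs), and the clashing col1 column gets _x/_y suffixes in the second merge. You would still need the "union and drop duplicates" cleanup step mentioned in the question to end up with one row per match.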