Python, Pandas: Access multiple columns in a lambda function

Good day,
I would like to ask whether it's possible to access more than one column in a lambda function applied to a pandas DataFrame, or whether there's an alternative.
For example, my dataframe looks something like this:
value_a | value_b | value_c
1 | 17 | 8
2 | 253 | 9
3 | 89 | 8
...
I also have a function that calculates with some of the data:
def some_function(a, b):
    # ...do something...
    return c
Now I want to use a lambda function to apply that calculation, but with the data from two columns. Something like this:
df['value_d'] = df['value_b'].apply(lambda x: some_function(x, df['value_c']))
Is it possible to access more than one column inside such a function or is there a better solution?
Hoping my question is understandable.
Thanks to all of you and have a great day!

Use apply over the whole DataFrame with axis=1, so each row is passed to the lambda and every column is reachable by name:
df['value_d'] = df.apply(lambda row: some_function(row['value_b'], row['value_c']), axis=1)
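For reference, a minimal runnable sketch of that pattern (some_function here is a stand-in, since the question leaves its body unspecified):

import pandas as pd

df = pd.DataFrame({'value_a': [1, 2, 3],
                   'value_b': [17, 253, 89],
                   'value_c': [8, 9, 8]})

# Stand-in for the question's some_function
def some_function(a, b):
    return a + b

# axis=1 passes each row as a Series, so any column is reachable by name
df['value_d'] = df.apply(lambda row: some_function(row['value_b'], row['value_c']), axis=1)
print(df)

As a side note, when some_function is plain arithmetic, calling it on whole columns (some_function(df['value_b'], df['value_c'])) is usually much faster than row-wise apply.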

Related

pandas crosstab for two columns

I am trying to make a contingency table using pd.crosstab from my local dataframe. Imagine we asked 3 people in 2 separate groups the question of whether they like ice cream or not, and here is the result in a dataframe:
group1 | group2
------------------
yes | no
no | maybe
yes | no
And I would like the contingency table to look like this:
| group1 | group2
----------------------------
yes | 2 | 0
no | 1 | 2
maybe | 0 | 1
I have played around with pandas and evidently referenced many different resources, including the docs and other posts, but couldn't figure this out. Does anyone have any ideas? Thanks!
Pandas has a crosstab function that solves this; first you have to melt the dataframe:
box = df.melt()
pd.crosstab(box.value, box.variable)
variable  group1  group2
value
maybe          0       1
no             1       2
yes            2       0
For performance, it is possible that groupby will be faster, even if it involves a few more steps:
box.groupby(["variable", "value"]).size().unstack("variable", fill_value=0)
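For reference, a minimal end-to-end sketch of the melt-then-crosstab approach, built from the sample answers in the question:

import pandas as pd

df = pd.DataFrame({'group1': ['yes', 'no', 'yes'],
                   'group2': ['no', 'maybe', 'no']})

# melt() stacks both columns into (variable, value) pairs,
# so crosstab can count each answer per group
box = df.melt()
print(pd.crosstab(box.value, box.variable))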

Python - Groupby a DataFrameGroupBy object

I have a pandas DataFrame in Python to which I am applying a groupby, and then I want to apply a new groupby + sum on the previous result. To be more specific, first I am doing:
check_df = data_df.groupby(['hotel_code', 'dp_id', 'market', 'number_of_rooms'])[['market', 'number_of_rooms']]
And then I want to do:
check_df = check_df.groupby(['market'])['number_of_rooms'].sum()
So, I am getting the following error:
AttributeError: Cannot access callable attribute 'groupby' of 'DataFrameGroupBy'
objects, try using the 'apply' method
My initial data look like this:
hotel_code | market | number_of_rooms | ....
---------------------------------------------
001 | a | 200 | ...
001 | a | 200 |
002 | a | 300 | ...
Notice that I may have duplicates of pairs like (a, 200); that's why I need the first groupby.
What I want in the end is something like this:
Market | Rooms
--------------
a | 3000
b | 250
I'm just trying to translate the following SQL query into Python:
select a.market, sum(a.number_of_rooms)
from (
    select market, number_of_rooms
    from opinmind_dev..cg_mm_booking_dataset_full
    group by hotel_code, market, number_of_rooms
) as a
group by market;
Any ideas how I can fix that? If you need any more info, let me know.
ps. I am new to Python and data science
IIUC, instead of:
check_df = data_df.groupby(['hotel_code', 'dp_id', 'market', 'number_of_rooms'])[['market', 'number_of_rooms']]
You should simply do:
check_df = data_df.drop_duplicates(subset=['hotel_code', 'dp_id', 'market', 'number_of_rooms'])\
                  .loc[:, ['market', 'number_of_rooms']]\
                  .groupby('market')\
                  .sum()
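To make the equivalence with the SQL concrete, a small sketch on data shaped like the question (the dp_id values are made up to fill out the subset):

import pandas as pd

data_df = pd.DataFrame({'hotel_code': ['001', '001', '002'],
                        'dp_id': [1, 1, 2],
                        'market': ['a', 'a', 'a'],
                        'number_of_rooms': [200, 200, 300]})

# drop_duplicates plays the role of the inner GROUP BY,
# and the final groupby().sum() plays the role of the outer one
check_df = data_df.drop_duplicates(subset=['hotel_code', 'dp_id', 'market', 'number_of_rooms'])\
                  .loc[:, ['market', 'number_of_rooms']]\
                  .groupby('market')\
                  .sum()
print(check_df)  # market 'a' -> 500; the duplicate (001, a, 200) row is counted once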
import numpy as np
import pandas as pd

df = pd.DataFrame({'Market': [1, 1, 1, 2, 2, 2, 3, 3], 'Rooms': range(8), 'C': np.random.rand(8)})
   Market  Rooms         C
0       1      0  0.187793
1       1      1  0.325284
2       1      2  0.095147
3       2      3  0.296781
4       2      4  0.022262
5       2      5  0.201078
6       3      6  0.160082
7       3      7  0.683151
You need to move the column selection away from the grouped DataFrame. Either of the following should work.
df.groupby('Market').sum()[['Rooms']]
df[['Rooms']].groupby(df['Market']).sum()
        Rooms
Market
1           3
2          12
3          13
If you select using ['Rooms'] instead of [['Rooms']] you will get a Series instead of a DataFrame.
The dataframes produced use Market as their index. If you want to convert it back to a normal data column, use:
df.reset_index()
   Market  Rooms
0       1      3
1       2     12
2       3     13
If I understand your question correctly, you could simply do either of:
data_df.groupby('Market').agg({'Rooms': np.sum})
data_df.groupby('Market', as_index=False).agg({'Rooms': np.sum})
import numpy as np
import pandas as pd

data_df = pd.DataFrame({'Market': ['A', 'B', 'C', 'B'],
                        'Hotel': ['H1', 'H2', 'H4', 'H5'],
                        'Rooms': [20, 40, 50, 34]})
data_df.groupby('Market').agg({'Rooms': np.sum})
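For reference, the output of the snippet above (computed from the sample data; it isn't shown in the original answer):

        Rooms
Market
A          20
B          74
C          50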

Unpacking Nested List in Python to a DataFrame (Unsuccessful)

I have written a function that takes in data from a database and appends it to a list, built like this:
df_master = []
# x = arbitrary data from DB
for i in db_list:
    df_tmp = df_tmp.append(ReadDBValues(i, interval, start_date, end_date))
    df_master.append(df_tmp)
However, this also means flattening the data is somewhat troublesome.
I have used the following approach:
flat = [item for sublist in df_master for item in sublist]
which yields [1, 0, 0, 1]; that is, it returns the 4 columns but not the values associated with each column.
I was hoping to be able to convert this into a dataframe as such:
W | X | Y | Z ....
1 | 2 | 3 | 4 ...
| | | ....
I have been using this as my reference:
Making a flat list out of list of lists in Python
But, I can't seem to flatten more than the first two columns.
Could I please get any further guidance?
Thank you very much.
EDIT: I have now managed to create a 'unique' index for the data, so I retain the column names. However, the problem remains: say there are two columns, with 1400 rows in the first and 1400 in the second.
The code will do the following:
Date | Val X | Val Y
...  | 1398  | NaN
...  | 1399  | NaN
...  | 1400  | NaN
...  | NaN   | 1
...  | NaN   | 2
When instead it should be:
Date | Val X | Val Y
...  | 1398  | 523
...  | 1399  | 242
...  | 1400  | 112
Any ideas?
EDIT: Using a GroupBy Index has not proven successful either and results in just NaN values appearing.
(df_master.groupby(df_master.index).sum())
Can anyone please point me in the right direction?
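For context, the staircase of NaNs shown above is the classic symptom of stacking frames row-wise when column-wise alignment is wanted. A minimal sketch of the difference, with made-up column names and values since the question's data isn't available:

import pandas as pd

idx = pd.date_range('2020-01-01', periods=3)
x = pd.DataFrame({'Val X': [1398, 1399, 1400]}, index=idx)
y = pd.DataFrame({'Val Y': [523, 242, 112]}, index=idx)

# Row-wise stacking reproduces the NaN staircase from the question
print(pd.concat([x, y]))

# Column-wise concatenation aligns on the shared index instead
print(pd.concat([x, y], axis=1))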

Pandas: Storing Dataframe in Dataframe

I am rather new to pandas and am currently running into a problem when trying to insert a DataFrame inside a DataFrame.
What I want to do:
I have multiple simulations and corresponding signal files, and I want all of them in one big DataFrame. So I want a DataFrame which has all my simulation parameters and also my signals as nested DataFrames. It should look something like this:
SimName | Date | Parameter 1 | Parameter 2 | Signal 1 | Signal 2 |
Name 1 | 123 | XYZ | XYZ | DataFrame | DataFrame |
Name 2 | 456 | XYZ | XYZ | DataFrame | DataFrame |
Where SimName is my index for the big DataFrame and every entry in Signal 1 and Signal 2 is an individual DataFrame.
My idea was to implement this like this:
big_DataFrame['Signal 1'].loc['Name 1']
But this results in a ValueError:
Incompatible indexer with DataFrame
Is it possible to have nested DataFrames like this in pandas?
Nico
The 'pointers' referred to at the end of ns63sr's answer could be implemented as a class, e.g...
Definition:
class df_holder:
    def __init__(self, df):
        self.df = df
Set:
df.loc[0, 'df_holder'] = df_holder(df)
Get:
df.loc[0].df_holder.df
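A brief note on why the wrapper may help: assigning a bare DataFrame into a single cell is what typically triggers the 'Incompatible indexer with DataFrame' error from the question, whereas pandas stores an arbitrary Python object in a cell without complaint.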
The docs say that only Series can be within a DataFrame. However, passing DataFrames seems to work as well. Here is an example assuming that none of the columns is in a MultiIndex:
import pandas as pd
signal_df = pd.DataFrame({'X': [1, 2, 3],
                          'Y': [10, 20, 30]})
big_df = pd.DataFrame({'SimName': ['Name 1', 'Name 2'],
                       'Date': [123, 456],
                       'Parameter 1': ['XYZ', 'XYZ'],
                       'Parameter 2': ['XYZ', 'XYZ'],
                       'Signal 1': [signal_df, signal_df],
                       'Signal 2': [signal_df, signal_df]})
big_df.loc[0, 'Signal 1']
big_df.loc[0, 'Signal 1']['X']
This results in:
out1:
   X   Y
0  1  10
1  2  20
2  3  30
out2:
0    1
1    2
2    3
Name: X, dtype: int64
In case nested dataframes are not properly working, you may implement some sort of pointers that you store in big_df that allow you to access the signal dataframes stored elsewhere.
Instead of big_DataFrame['Signal 1'].loc['Name 1'] you should use
big_DataFrame.loc['Name 1','Signal 1']
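As a brief note on why: the chained form first extracts the 'Signal 1' column and then indexes into the result as a second, separate operation, while .loc['Name 1', 'Signal 1'] addresses the single cell with both labels in one call. Single-step label access is the generally recommended pattern and, per this answer, avoids the indexer error here.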

How can I concatenate 3 columns into 1 using Excel and Python?

I am working on a project in GIS software where I need to have a column containing dates in the format YYYY-MM-DD. Currently, in Excel I have 3 columns: 1 with the year, 1 with the month and 1 with the day. Looks like this:
| A | B | C |
| 2012 | 1 | 1 |
| 2012 | 2 | 1 |
| 2012 | 3 | 1 |
...etc...
And I need it to look like this:
| A |
| 2012-01-01|
| 2012-02-01|
| 2012-03-01|
I have several workbooks that need the same format, so I figured Python might be a useful tool so that I don't have to manually concatenate everything in Excel.
So, my question is, is there a simple way to not only concatenate these three columns, but to also add a zero in front of the month and day numbers?
I have been experimenting a little bit with the python library openpyxl, but have not come up with anything useful so far. Any help would be appreciated, thanks.
If you're going to be staying in Excel, you may as well just use a worksheet formula. If your year, month, and day are in columns A, B, and C, you can type this in column D to concatenate them, then format it as a date and adjust the padding:
=$A1 & "-" & $B1 & "-" & $C1
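Note that plain concatenation won't zero-pad the month and day on its own. As an alternative not given in the original answer, Excel's DATE and TEXT functions build the padded string directly:
=TEXT(DATE($A1,$B1,$C1),"yyyy-mm-dd")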
Try this:
import datetime

def DateFromABC(self, ws):
    # Start reading from row 1
    for i, row in enumerate(ws.rows, 1):
        sRow = str(i)
        # datetime goes into column 'D'
        Dcell = ws['D' + sRow]
        # Set datetime from columns 'A', 'B', 'C'
        Dcell.value = datetime.date(year=ws['A' + sRow].value,
                                    month=ws['B' + sRow].value,
                                    day=ws['C' + sRow].value)
        print('i=%s, type=%s, value=%s' % (i, str(type(Dcell.value)), Dcell.value))
    #end for
#end def DateFromABC
Tested with Python:3.4.2 - openpyxl:2.4.1 - LibreOffice: 4.3.3.2
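If the workbooks are only being reshaped on their way into the GIS software, a pandas-based sketch may be simpler (the file name, sheet layout, and column names here are assumptions, not from the question):

import pandas as pd

# Read the three columns; the workbook is assumed to have no header row
df = pd.read_excel('dates.xlsx', header=None, names=['year', 'month', 'day'])

# to_datetime assembles a date from year/month/day columns,
# and strftime produces the zero-padded YYYY-MM-DD string
df['date'] = pd.to_datetime(df[['year', 'month', 'day']]).dt.strftime('%Y-%m-%d')

df[['date']].to_excel('dates_out.xlsx', index=False)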
