Pandas: Nesting Dataframes

Hello, I want to store a dataframe inside a cell of another dataframe.
I have daily data consisting of date, steps, and calories. In addition, I have minute-by-minute heart-rate (HR) data for specific dates. Obviously it would be easy to put the minute-by-minute data in a 2-dimensional list, but I fear that would make it harder to analyze later.
What would be the best practice when I want to have both datasets in one dataframe? Is it even possible to nest dataframes?
Any better ideas? Thanks!

Yes, it is possible to nest dataframes, but I would recommend rethinking how you want to structure your data instead; the right structure depends on your application and on the analyses you want to run afterwards.
How to "nest" dataframes into another dataframe
Your dataframe containing your nested "sub-dataframes" won't be displayed very nicely. However, just to show that it is possible to nest your dataframes, take a look at this mini-example:
Here we have 3 random dataframes:
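For reference, they could have been created with something like this (a minimal sketch; since the values are random, they will differ from the output shown below):

import numpy as np
import pandas as pd

# three small 3x3 dataframes filled with random values
df1, df2, df3 = (pd.DataFrame(np.random.rand(3, 3)) for _ in range(3))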
>>> df1
          0         1         2
0  0.614679  0.401098  0.379667
1  0.459064  0.328259  0.592180
2  0.916509  0.717322  0.319057
>>> df2
          0         1         2
0  0.090917  0.457668  0.598548
1  0.748639  0.729935  0.680409
2  0.301244  0.024004  0.361283
>>> df3
          0         1         2
0  0.200375  0.059798  0.665323
1  0.086708  0.320635  0.594862
2  0.299289  0.014134  0.085295
We can make a main dataframe that includes these dataframes as values in individual "cells":
df = pd.DataFrame({'idx':[1,2,3], 'dfs':[df1, df2, df3]})
We can then access these nested dataframes as we would access any value in any other dataframe:
>>> df['dfs'].iloc[0]
          0         1         2
0  0.614679  0.401098  0.379667
1  0.459064  0.328259  0.592180
2  0.916509  0.717322  0.319057
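That said, for the daily + minute-level case in the question, a flatter layout is usually easier to analyze: keep two ordinary tables and link them by date. A minimal sketch (all column names here are made up for the example):

import pandas as pd

# daily summary table (column names are hypothetical)
daily = pd.DataFrame({
    'date': pd.to_datetime(['2020-03-01', '2020-03-02']),
    'steps': [8000, 9500],
    'calories': [2100, 2300],
})

# minute-by-minute heart-rate table, one row per minute
hr = pd.DataFrame({
    'timestamp': pd.date_range('2020-03-01 08:00', periods=4, freq='min'),
    'hr': [62, 64, 63, 70],
})
hr['date'] = hr['timestamp'].dt.normalize()  # the day each minute belongs to

# join per-day aggregates of the HR data onto the daily table
daily_mean = hr.groupby('date')['hr'].mean().rename('mean_hr').reset_index()
daily = daily.merge(daily_mean, on='date', how='left')

This keeps every table flat (no dataframes inside cells), displays normally, and lets you use merge/groupby instead of iterating over nested objects.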

Related

Splitting column of a really big dataframe in two (or more) new cols

Problem
Hey there! I'm having some trouble trying to split one column of my dataframe into two (or even more) new columns. I think the difficulty comes from the fact that the dataframe I'm working with is loaded from a really big CSV file, almost 10 GB. Once loaded into a pandas dataframe, it has ~60 million rows and 5 columns.
Example
Initially, the dataframe looks something like this:
In [1]: df
Out[1]:
               category  other_col
0            animal.cat          5
1            animal.dog          3
2  clothes.shirt.sports          6
3           shoes.laces          1
4                  None          0
I want to first remove the rows of the df for which the category is not defined (i.e., the last one), and then split the category column into new columns based on where the dots appear: one for the main category, one for the first subcategory, and another for the second subcategory (if it exists). Finally, I want to merge the whole dataframe back together.
In other words, this is what I want to obtain:
In [2]: df_after
Out[2]:
   other_col main_cat sub_category_1 sub_category_2
0          5   animal            cat           None
1          3   animal            dog           None
2          6  clothes          shirt         sports
3          1    shoes          laces           None
My approach
My approach for this was the following:
df = df[df['category'].notnull()]
df_wt_cat = df.drop(columns=['category'])
df_cat_subcat = df['category'].str.split('.', expand=True).rename(
    columns={0: 'main_cat', 1: 'sub_category_1', 2: 'sub_category_2', 3: 'sub_category_3'})
df_after = pd.concat([df_wt_cat, df_cat_subcat], axis=1)
which seems to work just fine with small datasets, but it eats up too much memory when applied to a dataframe that big, and the Jupyter kernel just dies.
I've tried to read the dataframe in chunks, but I'm not really sure how I should proceed after that; I've obviously tried searching for this kind of problem here on Stack Overflow, but I didn't manage to find anything useful.
Any help is appreciated!
The split and join methods do the job:
results = df['category'].str.split('.', expand=True)
df_after = df.join(results)
After doing that, you can freely filter the resulting dataframe.
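If memory is still an issue, one option is to never hold the full frame at once: read the CSV in chunks, transform each chunk, and append it to an output file. A sketch ('big_file.csv' and 'result.csv' are hypothetical paths):

import pandas as pd

chunks = pd.read_csv('big_file.csv', chunksize=1_000_000)
for i, chunk in enumerate(chunks):
    chunk = chunk[chunk['category'].notnull()]
    parts = chunk['category'].str.split('.', expand=True)
    # rename only the columns this chunk actually produced
    parts.columns = ['main_cat', 'sub_category_1', 'sub_category_2'][:parts.shape[1]]
    out = chunk.drop(columns=['category']).join(parts)
    # append to the output file, writing the header only once
    out.to_csv('result.csv', mode='a', header=(i == 0), index=False)

This way peak memory is bounded by the chunk size rather than the full 10 GB file.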

Group by ids, sort by date and get values as list on big data python

I have big data (30 million rows).
The table has columns id, date, and value.
I need to go over each id and, for each id, get a list of its values sorted by date, so that the first value in the list is the one with the oldest date.
Example:
ID  DATE        VALUE
1   02/03/2020  300
1   04/03/2020  200
2   04/03/2020  456
2   01/03/2020  300
2   05/03/2020  78
Desired table:
ID  VALUE_LIST_ORDERED
1   [300, 200]
2   [300, 456, 78]
I can do it with a for loop or with apply, but it's not efficient, and with millions of users it's not feasible.
I thought about using groupby and sorting by date, but I don't know how to collect the values into a list, and if groupby on a pandas df is even the best way to do it.
I would love to get some suggestions on how to do it and which kind of df/technology to use.
Thank you!
What you need to do is order your data using pandas.DataFrame.sort_values and then apply the groupby method.
I don't have a huge data set to test this code on, but I believe this would do the trick:
import numpy as np

sorted_df = data.sort_values('DATE')
result = sorted_df.groupby('ID').VALUE.apply(np.array)
And since it's Python, you can always put everything in one statement:
print(data.sort_values('DATE').groupby('ID').VALUE.apply(np.array))
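One caveat: if DATE is stored as a string like 02/03/2020, sort_values will sort it lexicographically, not chronologically. Convert it to a real datetime first (assuming day-first format, as the example data suggests):

data['DATE'] = pd.to_datetime(data['DATE'], format='%d/%m/%Y')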

Select columns in a DataFrame conditional on row

I am attempting to generate a dataframe (or series) based on another dataframe, selecting a different column from the first frame depending on the row, using another series. In the simplified example below, I want the frame1 values from column 'a' for the first three rows, and from 'b' for the final two (as given by the picked_values series).
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.random.randn(10).reshape(5, 2), index=range(5), columns=['a', 'b'])
picked_values = pd.Series(['a', 'a', 'a', 'b', 'b'])
frame1
          a         b
0  0.283519  1.462209
1 -0.352342  1.254098
2  0.731701  0.236017
3  0.022217 -1.469342
4  0.386000 -0.706614
Trying to get to the series:
0    0.283519
1   -0.352342
2    0.731701
3   -1.469342
4   -0.706614
I was hoping frame1[picked_values] would work, but that ends up with five columns.
In the real-life example, picked_values is a lot larger and calculated.
Thank you for your time.
Use df.lookup
pd.Series(frame1.lookup(picked_values.index,picked_values))
0    0.283519
1   -0.352342
2    0.731701
3   -1.469342
4   -0.706614
dtype: float64
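Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On current versions, a sketch of an equivalent using plain NumPy indexing:

import numpy as np

cols = frame1.columns.get_indexer(picked_values)  # column positions per row
pd.Series(frame1.to_numpy()[np.arange(len(frame1)), cols], index=frame1.index)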
Here's a NumPy-based approach using integer indexing and Index.searchsorted:
frame1.values[frame1.index, frame1.columns.searchsorted(picked_values.values)]
# array([0.22095278, 0.86200616, 1.88047197, 0.49816937, 0.10962954])
(The values differ from the frame shown above because this answer generated its own random data.)

Why does repeating a pd.Series not work as expected?

I have just started working with Python 3.7 and I am trying to create a series, e.g. from 0 to 23, and repeat it. Using
rep1 = pd.Series(range(24))
I figured out how to make the first 24 values, and I wanted to "copy-paste" it many times so that the final series is the original one 5 times, one after the other. However, rep = pd.Series.repeat(rep1, 5) gives me a result that looks like this, which is not what I want:
0 0 0 0 0 1 1 1 1 1 2 2 2 2 2 ...
What I'm looking for is the 0-23 range repeated multiple times. Any advice?
You can try this:
pd.concat([rep1] * 5)
This will repeat your series 5 times. (Series.repeat repeats each individual element n times, which is why you got 0 0 0 0 0 1 1 ... instead of the whole sequence repeated.)
Another solution using numpy.tile:
import numpy as np
rep = pd.Series(np.tile(rep1, 5))
If you want the repeated Series as one data object, then use a pandas DataFrame for this. A DataFrame is multiple pandas Series in one object, sharing an index.
So first I create a Python list of 0-23, repeated 5 times.
Then I put this into a DataFrame and optionally transpose it so that the rows go down rather than across in this example.
import pandas as pd
lst = [list(range(0,24))] * 5
rep = pd.DataFrame(lst).transpose()
You could use a list to generate directly your Series.
rep = pd.Series(list(range(24))*5)

Merges and joins in pandas

I am joining two DataFrame tables that show sums of elements from two different months.
Here is df1:
                                    Query        ValueA0
0  IO1_DerivativeReceivables_ChathamLocal  673437.850000
1          IO2_CollateralCalledforReceipt   60000.000000
2     OO1_DerivativePayables_ChathamLocal   73537.550000
Here is df2:
                                    Query        ValueB0
0  IO1_DerivativeReceivables_ChathamLocal  336705.200000
1          IO2_CollateralCalledforReceipt   20920.000000
2     OO1_DerivativePayables_ChathamLocal   11299.130000
Note that the queries are the same, but the values are different.
I tried to join them with the following code:
import pandas as pd
pd.merge(df1, df2, on='Query')
This was my result:
                                    Query        ValueA0  \
0  IO1_DerivativeReceivables_ChathamLocal  673437.850000
1          IO2_CollateralCalledforReceipt   60000.000000
2     OO1_DerivativePayables_ChathamLocal   73537.550000

         ValueB0
0  336705.200000
1   20920.000000
2   11299.130000
This is what I was expecting:
                                    Query        ValueA0        ValueB0
0  IO1_DerivativeReceivables_ChathamLocal  673437.850000  336705.200000
1          IO2_CollateralCalledforReceipt   60000.000000   20920.000000
2     OO1_DerivativePayables_ChathamLocal   73537.550000   11299.130000
How do I do this? The join seems fairly simple. I have tried several variations of joins and always end up with the tables appearing as though they are separate. Is this correct?
I had a dataframe that displayed like this too, but it was one intact dataframe.
The join is correct; no further information is needed. The output is simply wrapped onto two blocks because the merged frame is too wide for your console.
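If you want to see it on one block, you can widen pandas' display settings (the option values here are just examples):

import pandas as pd

pd.set_option('display.width', 200)                 # allow wider console output
pd.set_option('display.expand_frame_repr', False)   # don't wrap wide frames
print(pd.merge(df1, df2, on='Query'))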
