Save a pandas dataframe inside another dataframe - python

I have a tricky case. Can't wrap my head around it.
I have a pandas dataframe like below:
In [3]: df = pd.DataFrame({'stat_101': [31937667515, 47594388534, 43568256234],
   ...:                    'group_id_101': [1, 1, 1],
   ...:                    'level_101': [1, 2, 2],
   ...:                    'stat_102': ['00005#60-78', '00005#60-78', '00005#60-78'],
   ...:                    'avg_104': [27305.34552, 44783.49401, 22990.77442]})
In [4]: df
Out[4]:
      stat_101  group_id_101  level_101     stat_102      avg_104
0  31937667515             1          1  00005#60-78  27305.34552
1  47594388534             1          2  00005#60-78  44783.49401
2  43568256234             1          2  00005#60-78  22990.77442
I want to group this on the 'group_id_101' and 'stat_102' columns and create another dataframe that stores each group's sub-dataframe inside it.
Expected output:
In [27]: res = pd.DataFrame({'new_stat_101':[1], 'stat_102':['00005#60-78'], 'new_avg':['Dataframe_obj']})
In [28]: res
Out[28]:
new_stat_101 stat_102 new_avg
0 1 00005#60-78 Dataframe_obj
Where the Dataframe_obj will be another dataframe with rows like below:
stat_101 level_101 avg_104
0 31937667515 1 27305.34552
1 47594388534 2 44783.49401
2 43568256234 2 22990.77442
What is the best way to do this? Should I be saving a dataframe inside another dataframe, or is there a cleaner way of doing it?
Hope my question is clear.

Let's try
g = ['group_id_101', 'stat_102']
idx, dfs = zip(*df.groupby(g))
pd.DataFrame({'new_avg': dfs}, index=pd.MultiIndex.from_tuples(idx, names=g))
new_avg
group_id_101 stat_102
1 00005#60-78 stat_101 group_id_101 level_101 st...
"new_avg" is a column of DataFrames accessible by index.
Obligatory disclaimer: this is blatant abuse of DataFrames; you should typically not store objects that cannot take advantage of pandas vectorization.
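For completeness, a small sketch of how a stored frame could be pulled back out afterwards, assuming the result above is assigned to res (the key used below is just the one group in the sample data):
res = pd.DataFrame({'new_avg': dfs}, index=pd.MultiIndex.from_tuples(idx, names=g))
inner = res.loc[(1, '00005#60-78'), 'new_avg']   # the inner DataFrame stored for that group
type(inner)                                      # pandas.core.frame.DataFrame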

Related

How to merge and get the count of non-matching entries efficiently?

I have two data frames, as shown below:
import pandas as pd
import numpy as np
source_df = pd.DataFrame({'source_value':['21ABCDE1','22CDEF2','23DEF3','24FGH4']})
client_df = pd.DataFrame({'source_value':['21ABCDE1','29SB','21ABCDE1','29SB','25FG','25FG','31DE','35DE']})
What I would like to do is find the number of source_values which are present in client_df but not present in the source_df.
Please note that there can be duplicates in the client_df but not in source_df
Basically I have to identify them as they are not valid (because they are missing in the parent dataframe which is source_df)
I tried the below but it is for matching entries.
pd.merge(
    source_df, client_df,
    how="inner",
    on='source_value').groupby('source_value').size()
How can I elegantly do it for non-matching entries? I have several million records (at least 10 million, and it can go up to 15 million).
I expect my output to be a count per unmatched source_value, like the one shown below:
source_value
29SB    2
25FG    2
35DE    1
31DE    1
One option is to get the source_value from source_df as a set and then filter and count values in client_df:
source_values = set(source_df.source_value.to_list())
client_df.source_value[lambda x: ~x.isin(source_values)].value_counts()
#29SB 2
#25FG 2
#35DE 1
#31DE 1
#Name: source_value, dtype: int64
In []: client_df[~client_df.source_value.isin(source_df.source_value)].value_counts()
Out[]:
source_value
29SB 2
25FG 2
35DE 1
31DE 1
dtype: int64
Using Left Merge with Indicator to identify ones in client only
In[]: merge_df = client_df.merge(source_df, how='left', indicator=True)
merge_df[merge_df['_merge']=='left_only'].groupby('source_value').size()
Out[]:
source_value
25FG 2
29SB 2
31DE 1
35DE 1
Another option: take the difference of the values as an Index and use it to slice the per-value counts:
counts = client_df.groupby('source_value')['source_value'].count()
counts.loc[pd.Index(client_df.source_value).difference(source_df.source_value)]

python for loop using index to create values in dataframe

I have a very simple for loop problem and I haven't found a solution in any of the similar questions on Stack. I want to use a for loop to create values in a pandas dataframe. I want the values to be strings that contain a numerical index. I can make the correct value print, but I can't make this value get saved in the dataframe. I'm new to python.
# reproducible example
import pandas as pd
df1 = pd.DataFrame({'x':range(5)})
# for loop to add a row with an index
for i in range(5):
    print("data_{i}.txt".format(i=i))  # this prints the value that I want
    df1['file'] = "data_{i}.txt".format(i=i)
This loop prints the exact value that I want to put into the 'file' column of df1, but when I look at df1, every row just gets the last value.
x file
0 0 data_4.txt
1 1 data_4.txt
2 2 data_4.txt
3 3 data_4.txt
4 4 data_4.txt
I have tried using enumerate, but couldn't find a solution with it. I assume everyone will yell at me for posting a duplicate question, but I have not found anything that works; if someone points me to a solution that solves this problem, I'll happily remove this question.
There are better ways to create a DataFrame, but to answer your question:
Replace the last line in your code:
df1['file'] = "data_{i}.txt".format(i=i)
with:
df1.loc[i, 'file'] = "data_{0}.txt".format(i)
For more information, read about the .loc here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
On the same page, you can read about accessors like .at and .iloc as well.
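Putting that together, a minimal runnable version of the loop might look like this (a sketch, using the same names as in the question):
import pandas as pd

df1 = pd.DataFrame({'x': range(5)})
for i in range(5):
    # assign one cell per iteration instead of overwriting the whole column
    df1.loc[i, 'file'] = "data_{0}.txt".format(i)
print(df1)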
You can use a list comprehension:
df1['file'] = ["data_{i}.txt".format(i=i) for i in range(5)]
print(df1)
Prints:
x file
0 0 data_0.txt
1 1 data_1.txt
2 2 data_2.txt
3 3 data_3.txt
4 4 data_4.txt
OR at the creation of the DataFrame:
df1 = pd.DataFrame({'x':range(5), 'file': ["data_{i}.txt".format(i=i) for i in range(5)]})
print(df1)
OR:
df1 = pd.DataFrame([{'x':i, 'file': "data_{i}.txt".format(i=i)} for i in range(5)])
print(df1)
I've found success with the .at method
for i in range(5):
    print("data_{i}.txt".format(i=i))  # this prints the value that I want
    df1.at[i, 'file'] = "data_{i}.txt".format(i=i)
Returns:
x file
0 0 data_0.txt
1 1 data_1.txt
2 2 data_2.txt
3 3 data_3.txt
4 4 data_4.txt
When you assign a value to a dataframe column the way you do, using df['colname'] = 'val', it assigns that value across all rows.
That is why you are seeing only the last value.
Change your code to:
import pandas as pd

df1 = pd.DataFrame({'x': range(5)})

# for loop to build the values with an index
to_assign = []
for i in range(5):
    print("data_{i}.txt".format(i=i))  # this prints the value that I want
    to_assign.append("data_{i}.txt".format(i=i))

# outside of the loop - only once - assign to all dataframe rows
df1['file'] = to_assign
As a thought, pandas has a great API for performing these types of actions without for loops.
You should start practicing those.
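For instance, one possible vectorized version (an illustrative sketch, building the strings directly from the 'x' column instead of looping over rows):
df1['file'] = "data_" + df1['x'].astype(str) + ".txt"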

Why is there an extra index when using apply in Pandas

When I use apply with a user-defined function in Pandas, it looks like Python is creating an additional index. How could I get rid of it? Here is my code:
import numpy as np
import pandas as pd

def fnc(group):
    x = group.C.values
    out = x[np.where(x < 0)]
    return pd.DataFrame(out)

data = pd.DataFrame({'A': np.random.randint(1, 3, 10),
                     'B': 3,
                     'C': np.random.normal(0, 1, 10)})

data.groupby(by=['A', 'B']).apply(fnc).reset_index()
There is this weird Level_2 index created. Is there a way to avoid creating it when running my function?
   A  B  level_2             0
0  1  3        0  -1.054134802
1  1  3        1  -0.691996447
2  2  3        0  -1.068693768
3  2  3        1  -0.080342046
4  2  3        2  -0.181869799
You will have no way to avoid level_2 appearing. This is because the result of your function is a dataframe with several rows in it: pandas is cool enough to understand that your wish is to broadcast these rows across the grouped keys, yet it keeps the index of the returned dataframe as an additional level to guarantee coherent output data. So explicitly dropping level=-1 at the end of your processing is expected.
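For example, something along these lines (a sketch based on the code above) drops that last level explicitly and brings A and B back as columns:
out = data.groupby(by=['A', 'B']).apply(fnc)
out = out.reset_index(level=-1, drop=True).reset_index()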
If you want to avoid resetting that extra index but still have some post-processing, another way would be to call transform instead of apply, and have fnc return the entire group vector with np.nan in place of the results to exclude. Then your dataframe will not have an extra level, but you'll need to call dropna() afterwards.
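A rough sketch of that transform route (for this particular filter the groupby is not strictly needed, but it shows the shape-preserving pattern described above):
neg_c = data.groupby(by=['A', 'B'])['C'].transform(lambda x: x.where(x < 0)).dropna()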

Creating a New Pandas Grouped Object

In some transformations, I seem to be forced to break from the Pandas dataframe grouped object, and I would like a way to return to that object.
Given a dataframe of time series data, if one groups by one of the values in the dataframe, we are given an underlying dictionary from key to dataframe.
Being forced to make a Python dict from this, the structure cannot be converted back into a DataFrame using .from_dict(), because the mapping is from key to dataframe rather than key to values.
The only way to go back to Pandas without some hacky column renaming is, to my knowledge, by converting it back to a grouped object.
Is there any way to do this?
If not, how would I convert a dictionary mapping keys to dataframes back into a Pandas data structure?
EDIT, ADDING SAMPLE:
import pandas as pd
from numpy.random import randn

rng = pd.date_range('1/1/2000', periods=10, freq='10m')
df = pd.DataFrame({'a': pd.Series(randn(len(rng)), index=rng),
                   'b': pd.Series(randn(len(rng)), index=rng)})
# now have dataframe with 'a's and 'b's in time series

df_dict = {}
for k, v in df.groupby('a'):
    df_dict[k] = v

# now we apply some transformation that cannot be applied via aggregate, transform, or apply
# how do we get this back into a grouped-by object?
If I understand OP's question correctly, you want to group a dataframe by some key(s), do different operations on each group (possibly generating new columns, etc.) and then go back to the original dataframe.
Modifying your example (group by random integers instead of floats, which are usually unique):
np.random.seed(200)
rng = pd.date_range('1/1/2000', periods=10, freq='10m')
df = pd.DataFrame({'a':pd.Series(np.random.randn(len(rng)), index=rng), 'b':pd.Series(np.random.randn(len(rng)), index=rng)})
df['group'] = np.random.randint(3,size=(len(df)))
Usually, if I need a single value for each column per group, I'll do this (for example, the sum of 'a' and the mean of 'b'):
In [10]: df.groupby('group').aggregate({'a':np.sum, 'b':np.mean})
Out[10]:
              a         b
group
0     -0.214635 -0.319007
1      0.711879  0.213481
2      1.111395  1.042313

[3 rows x 2 columns]
However, if I need a series for each group,
In [19]: def func(sub_df):
   ....:     sub_df['c'] = sub_df['a'] * sub_df['b'].shift(1)
   ....:     return sub_df
   ....:
In [20]: df.groupby('group').apply(func)
Out[20]:
                   a         b  group         c
2000-01-31 -1.450948  0.073249      0       NaN
2000-11-30  1.910953  1.303286      2       NaN
2001-09-30  0.711879  0.213481      1       NaN
2002-07-31 -0.247738  1.017349      2 -0.322874
2003-05-31  0.361466  1.911712      2  0.367737
2004-03-31 -0.032950 -0.529672      0 -0.002414
2005-01-31 -0.221347  1.842135      2 -0.423151
2005-11-30  0.477257 -1.057235      0 -0.252789
2006-09-30 -0.691939 -0.862916      2 -1.274646
2007-07-31  0.792006  0.237631      0 -0.837336

[10 rows x 4 columns]
I'm guessing you want something like the second example, but the original question wasn't very clear even with your sample.
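If the goal is just to get from the dict of sub-frames back into a single pandas structure, one possibility (a sketch, assuming df_dict maps group keys to sub-dataframes as in the question) is to concatenate and then re-group:
combined = pd.concat(list(df_dict.values()))   # stack the pieces back into one dataframe
regrouped = combined.groupby('a')              # re-create a GroupBy object from it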

return default if pandas dataframe.loc location doesn't exist

I find myself often having to check whether a column or row exists in a dataframe before trying to reference it. For example I end up adding a lot of code like:
if 'mycol' in df.columns and 'myindex' in df.index:
    x = df.loc['myindex', 'mycol']
else:
    x = mydefault
Is there any way to do this more nicely? For example on an arbitrary object I can do x = getattr(anobject, 'id', default) - is there anything similar to this in pandas? Really any way to achieve what I'm doing more gracefully?
There is a get method for Series, so you could do:
df.mycol.get(myindex, np.nan)
Example:
In [117]:
df = pd.DataFrame({'mycol': np.arange(5), 'dummy': np.arange(5)})
df
Out[117]:
dummy mycol
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
[5 rows x 2 columns]
In [118]:
print(df.mycol.get(2, np.nan))
print(df.mycol.get(5, np.nan))
2
nan
Python has a mentality of asking for forgiveness instead of permission. You'll find a lot of posts on this matter, such as this one.
In Python, catching exceptions is relatively inexpensive, so you're encouraged to use them. This is called the EAFP approach ("easier to ask forgiveness than permission").
For example:
try:
    x = df.loc['myindex', 'mycol']
except KeyError:
    x = mydefault
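If you'd rather hide the try/except, a tiny helper in the spirit of getattr's default could work (a hypothetical function, not part of pandas):
def loc_or_default(frame, index, col, default=None):
    # hypothetical helper: behaves like getattr(obj, name, default), but for .loc lookups
    try:
        return frame.loc[index, col]
    except KeyError:
        return default

x = loc_or_default(df, 'myindex', 'mycol', mydefault)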
