I'm trying to perform a number of operations on a list of dataframes. I've opted to use a dictionary to help me with this process, but I was wondering if it's possible to reference the originally created dataframe with the changes applied.
So, using the code below as an example: is it possible to call the dfA object with the columns ['a', 'b', 'c'] that were added while it was nested within the dictionary?
dfA = pd.DataFrame(data=[1], columns=['x'])
dfB = pd.DataFrame(data=[1], columns=['y'])
dfC = pd.DataFrame(data=[1], columns=['z'])
dfdict = {'A':dfA,
'B':dfB,
'C':dfC}
df_dummy = pd.DataFrame(data=[[1,2,3]], columns=['a', 'b', 'c'])
for key in dfdict:
    dfdict[str(key)] = pd.concat([dfdict[str(key)], df_dummy], axis=1)
The initial dfA that you created and the dfA DataFrame inside your dictionary are two different objects. (You can confirm this by running dfA is dfdict['A'] or id(dfA) == id(dfdict['A']), both of which return False.)
To access the second (newly created) object, you need to retrieve it from the dictionary:
dfdict['A']
Or:
dfdict.get('A')
The returned DataFrame will have the new columns you added.
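If you want the name dfA to refer to the updated frame again, you can simply rebind it. A minimal sketch of the whole round trip, using the same setup as the question:
import pandas as pd

dfA = pd.DataFrame(data=[1], columns=['x'])
df_dummy = pd.DataFrame(data=[[1, 2, 3]], columns=['a', 'b', 'c'])

dfdict = {'A': dfA}
dfdict['A'] = pd.concat([dfdict['A'], df_dummy], axis=1)

print(dfA is dfdict['A'])            # False: concat created a new object
print(dfdict['A'].columns.tolist())  # ['x', 'a', 'b', 'c']

dfA = dfdict['A']  # rebind the original name to the updated frame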
I have a dataframe with several columns that I am filtering, in the hope of using the output to filter another dataframe further. Ultimately I'd like to convert the groupby output to a list of values, but I am having a hard time converting the SeriesGroupBy object to a list. I am using:
id_list = df[df['date_diff'] == pd.Timedelta('0 days')].groupby('id')['id'].tolist()
I've tried reset_index(), to_frame(), and .values before tolist(), with no luck.
The error is:
AttributeError: 'SeriesGroupBy' object has no attribute 'tolist'
Expected output: simply a list of ids.
Try:
id_list = df[df['date_diff'] == pd.Timedelta('0 days')].groupby('id')['id'].apply(list)
Also, I am a bit skeptical about how you are comparing df['date_diff'] with the groupby output.
EDIT: This might be useful for your intended purpose (s might be the output of your groupby):
s = pd.Series([['a','a','b'],['b','b','c','d'],['a','b','e']])
s.explode().unique().tolist()
['a', 'b', 'c', 'd', 'e']
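If the goal is just a flat, de-duplicated list of ids, a minimal end-to-end sketch can skip the groupby entirely; the id and date_diff column names come from the question, but the data here is made up:
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 3],
    'date_diff': pd.to_timedelta(['0 days', '1 days', '0 days', '0 days']),
})

# Filter the rows first, then take the unique ids directly
id_list = df.loc[df['date_diff'] == pd.Timedelta('0 days'), 'id'].unique().tolist()
print(id_list)  # [1, 2, 3]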
I have a for loop in which I want to call a different pd.DataFrame on each iteration and add a certain column ('feedin') to another dataframe. The variable names consist of 'feedin_' plus a suffix, let's say a, b and c. So in the first loop I want to call the variable feedin_a and add its 'feedin' column to the new dataframe, in the next loop feedin_b, and so on.
I pass a list ['a', 'b', 'c'] and try to combine 'feedin_' + a, but since the list consists of strings, it won't resolve to the variable:
feedin_a = pd.DataFrame()
feedin_b = pd.DataFrame()
feedin_c = pd.DataFrame()
list = ['a', 'b', 'c']
for name in list:
    df_new['feedin_'+name] = 'feedin_'+name['feedin']
This doesn't work because 'feedin_'+name is just a string, not the variable it names. I hope you get my problem.
This is one of the reasons it's not a great idea to use variables named feedin_a, feedin_b, ... when you really need a data structure like a list or a dict.
If you can't change this, you can look up the names in the locals() (or globals()) dictionary:
df_new['feedin_'+name] = locals()['feedin_'+name]['feedin']
You can use locals(), so it should be something like this:
feedin_a = pd.DataFrame()
feedin_b = pd.DataFrame()
feedin_c = pd.DataFrame()
name_list = ['a', 'b', 'c']
for name in name_list:
    key = 'feedin_{}'.format(name)
    df_new[key] = locals()[key]['feedin']
Also check this: Python, using two variables in getattr?
P.S. Do not use list as a variable name, because it shadows the built-in Python type.
Others have answered your exact question using locals(). In the comments, someone pointed out another way to achieve what you're after: using a dictionary that holds your "variable names" as strings, which can then be looked up with string manipulation.
import pandas as pd
dataframes = dict()
letter_list = ['a', 'b', 'c']
for letter in letter_list:
    dataframes['feedin_'+letter] = pd.DataFrame()
for name in letter_list:
    dataframes['feedin_'+name].insert(0, 'feedin_'+name, None)
print(dataframes)
{'feedin_a': Empty DataFrame
Columns: [feedin_a]
Index: [], 'feedin_b': Empty DataFrame
Columns: [feedin_b]
Index: [], 'feedin_c': Empty DataFrame
Columns: [feedin_c]
Index: []}
If you intend to do a lot of looking up your DataFrames by string manipulation, this is a good way to go.
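Applied to the original goal of copying each frame's 'feedin' column into a target frame, a minimal sketch might look like this (df_new and the sample values are hypothetical):
import pandas as pd

feedin = {name: pd.DataFrame({'feedin': [1.0, 2.0]}) for name in ['a', 'b', 'c']}

df_new = pd.DataFrame(index=[0, 1])
for name in ['a', 'b', 'c']:
    # Dictionary lookup replaces the locals() trick entirely
    df_new['feedin_' + name] = feedin[name]['feedin']

print(df_new.columns.tolist())  # ['feedin_a', 'feedin_b', 'feedin_c']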
From the reindex docs:
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.
Therefore, I thought that by setting copy=False I would get the DataFrame reordered in place (!). It appears, however, that I do get a copy and need to assign it back to the original object. I want to avoid assigning it back if I can (the reason comes from this other question).
This is what I am doing:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5, 5))
df.columns = [ 'a', 'b', 'c', 'd', 'e' ]
df.head()
Out:
a b c d e
0 0.234296 0.011235 0.664617 0.983243 0.177639
1 0.378308 0.659315 0.949093 0.872945 0.383024
2 0.976728 0.419274 0.993282 0.668539 0.970228
3 0.322936 0.555642 0.862659 0.134570 0.675897
4 0.167638 0.578831 0.141339 0.232592 0.976057
Reindex gives me the correct output, but I'd need to assign it back to the original object, which is what I wanted to avoid by using copy=False:
df.reindex( columns=['e', 'd', 'c', 'b', 'a'], copy=False )
The desired output after that line is:
e d c b a
0 0.177639 0.983243 0.664617 0.011235 0.234296
1 0.383024 0.872945 0.949093 0.659315 0.378308
2 0.970228 0.668539 0.993282 0.419274 0.976728
3 0.675897 0.134570 0.862659 0.555642 0.322936
4 0.976057 0.232592 0.141339 0.578831 0.167638
Why is copy=False not working in place?
Is it possible to do that at all?
Working with python 3.5.3, pandas 0.23.3
reindex is a structural change, not a cosmetic or transformative one. A copy is always returned because the operation cannot be done in place: it would require allocating new memory for the underlying arrays, among other things. This means you have to assign the result back; there is no other choice.
df = df.reindex(['e', 'd', 'c', 'b', 'a'], axis=1)
Also see the discussion on GH21598.
The one corner case where copy=False is actually of any use is when the indices used to reindex df are identical to the ones it already has. You can check by comparing the ids:
id(df)
# 4839372504
id(df.reindex(df.index, copy=False)) # same object returned
# 4839372504
id(df.reindex(df.index, copy=True)) # new object created - ids are different
# 4839371608
A bit off topic, but I believe this would rearrange the columns in place:
for i, colname in enumerate(list_of_columns_in_desired_order):
    col = dataset.pop(colname)
    dataset.insert(i, colname, col)
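For example, applied to the df from the question (pop removes each column and insert puts it back at the target position, mutating df directly):
for i, colname in enumerate(['e', 'd', 'c', 'b', 'a']):
    col = df.pop(colname)
    df.insert(i, colname, col)

print(df.columns.tolist())  # ['e', 'd', 'c', 'b', 'a']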
data set:
df = pd.DataFrame(np.random.randn(5, 4), columns=['A', 'B', 'C', 'D'],
                  index=['abcd', 'efgh', 'abcd', 'abc123', 'efgh']).reset_index()
s = pd.Series(data=[True, True, False], index=['abcd', 'efgh', 'abc123'],
              name='availability').reset_index()
(Feel free to remove the reset_index bits above; they are simply there to easily provide a different approach to the problem. However, the resulting datasets from the queries I'm running resemble the above most accurately.)
I have two separate queries that return data similar to the above. One query pulls a column of information that does not exist in the other. The 'index' column is the common key across both tables.
My result set needs to have the second query's series injected into the first query's dataframe at a specific column position.
I know that I can simply run:
df = df.merge(s, how='left', on='index')
Then to enforce column order:
df = df[['index', 'A', 'B', 'availability', 'C', 'D']]
I saw that you can do df.insert, but that requires that the series be the same length as the df.
I'm wondering if there is a way to do this without having to run merge and then enforce column order. With my actual dataset, the number of columns is significantly larger. I'd imagine the best solution likely relies on list manipulation, but I'd much rather do something clever with how the dataframe is created in the first place.
df.set_index(['index','id']).index.map(s['availability'])
is returning:
TypeError: 'Series' object is not callable
Here s is a dataframe with a multi-index and one boolean column, and df is a dataframe whose columns make up s's multi-index.
IIUC:
In [260]: df.insert(3, 'availability',
df['index'].map(s.set_index('index')['availability']))
In [261]: df
Out[261]:
index A B availability C D
0 abcd 1.867270 0.517894 True 0.584115 -0.162361
1 efgh -0.036696 1.155110 True -1.112075 2.005678
2 abcd 0.693795 -0.843335 True -1.003202 1.001791
3 abc123 -1.466148 -0.848055 False -0.373293 0.360091
4 efgh -0.436618 -0.625454 True -0.285795 -0.220717
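As a self-contained version of that answer (same logic without the interactive prompts; the random values will differ from the output above):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=['A', 'B', 'C', 'D'],
                  index=['abcd', 'efgh', 'abcd', 'abc123', 'efgh']).reset_index()
s = pd.Series(data=[True, True, False], index=['abcd', 'efgh', 'abc123'],
              name='availability').reset_index()

# Map each key in df['index'] to its availability flag, then insert
# the resulting column at position 3 in one step.
df.insert(3, 'availability', df['index'].map(s.set_index('index')['availability']))
print(df)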
I am using Pandas to select columns from a dataframe, olddf. Let's say the column names are 'a', 'b', 'c', 'startswith1', 'startswith2', 'startswith3', ..., 'startswith10'.
My approach was to create a list of all columns with a common starting value:
filter_col = [col for col in list(olddf) if col.startswith('startswith')]
I'd like to then select the columns in that list, as well as others, by name, so I don't have to type them all out. However, this doesn't work:
newdf = olddf['a','b',filter_col]
And this doesn't either:
newdf = olddf[['a','b'],filter_col]
I'm a newbie, so this is probably pretty simple. Is the reason this doesn't work that I'm combining the list improperly?
Thanks.
Use:
newdf = olddf[['a','b']+filter_col]
since adding lists concatenates them:
In [264]: ['a', 'b'] + ['startswith1']
Out[264]: ['a', 'b', 'startswith1']
Alternatively, you could use the filter method (the regex selects column names that start with 'startswith', 'a', or 'b'):
newdf = olddf.filter(regex=r'^(startswith|[ab])')
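A quick sketch showing both approaches on a toy frame (hypothetical columns, no data needed):
import pandas as pd

olddf = pd.DataFrame(columns=['a', 'b', 'c', 'startswith1', 'startswith2'])

filter_col = [col for col in list(olddf) if col.startswith('startswith')]
newdf = olddf[['a', 'b'] + filter_col]
print(newdf.columns.tolist())   # ['a', 'b', 'startswith1', 'startswith2']

newdf2 = olddf.filter(regex=r'^(startswith|[ab])')
print(newdf2.columns.tolist())  # ['a', 'b', 'startswith1', 'startswith2']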