I have 10 dataframes with the same structure (same number of rows and columns) and I am trying to find an efficient way to perform several actions on all of them, such as renaming columns, with a for loop. I have tried putting them in a list:
dfs = [df1, df2, df3]
for i in dfs:
    i.rename(columns={'A': 'a1'}, inplace=True)
but it doesn't work. Another issue occurs if I try to use a function and then loop over the list, such as:
def groupdfs(anydf):
    anydf = anydf.groupby("A").sum()

for i in dfs:
    groupdfs(i)
No changes are happening to the dataframes. I have searched similar old questions but nothing has worked. What is the best way to loop through many dataframes when you want to perform the same changes to each of them?
The first piece of code you wrote should work fine.
data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
dff = pd.DataFrame(data, columns=['c', 'a', 'b'])
dff
   c   a   b
0  1   2   3
1  4   5   6
2  7   8   9
dato = np.array([(11, 12, 13), (41, 15, 16), (17, 18, 9)])
dfg = pd.DataFrame(dato, columns=['c', 'a', 'b'])
dfg
    c   a   b
0  11  12  13
1  41  15  16
2  17  18   9
dffs = [dff, dfg]
for i in dffs:
    i.rename(columns={'a': 'a1'}, inplace=True)
dff
   c  a1   b
0  1   2   3
1  4   5   6
2  7   8   9
The only thing I can think of is that you may need to add a line at the end to save the changes to files.
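For example, a minimal sketch of that last step (the file names here are just illustrative):
for idx, frame in enumerate(dffs):
    frame.to_csv(f'renamed_df_{idx}.csv', index=False)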
For the first part
Since everything is the same, you can create a list with new column names and assign it to all of them like this:
column_names = ['a1', 'a2', 'a3']
for df in [df1, df2, df3]:
    df.columns = column_names
Or, if you want to use a dictionary to rename only some of the columns:
for df in [df1, df2, df3]:
    df.rename({'A': 'a1'}, axis=1, inplace=True)
Note that axis=1 refers to the column level.
For the second part
There are two issues:
The groupby creates a new DataFrame that has to be assigned to a variable if you want to use it again
Since you are in a function, you have to return the new DataFrame so it can be assigned to a variable outside the function, as below:
def groupdfs(anydf):
    return anydf.groupby("A").sum()

for i, df in enumerate(dfs):
    dfs[i] = groupdfs(df)
This replaces each old DataFrame in the list with its grouped version. Note that simply rebinding a plain loop variable (for i in dfs: i = groupdfs(i)) would not change anything outside the loop, which is why the result is assigned back through the list index. It's often clearer to create new variables for the grouped dataframes instead.
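If you would rather keep the originals untouched, a small sketch of collecting the grouped results separately (the dict name is illustrative):
grouped = {f'grouped_df{i}': groupdfs(df) for i, df in enumerate(dfs)}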
I have a dataframe such as
multiindex1 = pd.MultiIndex.from_product([['a'], np.arange(3, 8)])
df1 = pd.DataFrame(np.random.randn(5, 3), index=multiindex1)
multiindex2 = pd.MultiIndex.from_product([['s'], np.arange(1, 6)])
df2 = pd.DataFrame(np.random.randn(5, 3), index=multiindex2)
multiindex3 = pd.MultiIndex.from_product([['d'], np.arange(2, 7)])
df3 = pd.DataFrame(np.random.randn(5, 3), index=multiindex3)
df = pd.concat([df1, df2, df3])
df.index.names = ['contract', 'index']
df.columns = ['z', 'x', 'c']
>>> df
                       z         x         c
contract index
a        3      0.354879  0.206557  0.308081
         4      0.822102 -0.425685  1.973288
         5     -0.801313 -2.101411 -0.707400
         6     -0.740651 -0.564597 -0.975532
         7     -0.310679  0.515918 -1.213565
s        1     -0.175135  0.777495  0.100466
         2      2.295485  0.381226 -0.242292
         3     -0.753414  1.172924  0.679314
         4     -0.029526 -0.020714  1.546317
         5      0.250066 -1.673020 -0.773842
d        2     -0.602578 -0.761066 -1.117238
         3     -0.935758  0.448322 -2.135439
         4      0.808704 -0.604837 -0.319351
         5      0.321139  0.584896 -0.055951
         6      0.041849 -1.660013 -2.157992
Now I want to replace the 'index' level of the index with the column c. That is to say, I want this result:
                            z         x
contract c
a         0.308081  0.354879  0.206557
          1.973288  0.822102 -0.425685
         -0.707400 -0.801313 -2.101411
         -0.975532 -0.740651 -0.564597
         -1.213565 -0.310679  0.515918
s         0.100466 -0.175135  0.777495
         -0.242292  2.295485  0.381226
          0.679314 -0.753414  1.172924
          1.546317 -0.029526 -0.020714
         -0.773842  0.250066 -1.673020
d        -1.117238 -0.602578 -0.761066
         -2.135439 -0.935758  0.448322
         -0.319351  0.808704 -0.604837
         -0.055951  0.321139  0.584896
         -2.157992  0.041849 -1.660013
I implemented it one way:
df.reset_index().set_index(['contract', 'c']).drop(['index'], axis=1)
But it seems there are some duplicate steps, because I manipulate the index three times. Is there a more elegant way to achieve this?
Try this
# convert column "c" into an index and remove "index" from index
df.set_index('c', append=True).droplevel('index')
Explanation:
Pandas' set_index method has an append argument that controls whether to append the columns to the existing index; setting it to True appends column "c" as a new index level. The droplevel method then removes an index level (it can remove a column level too, but it operates on the index by default).
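A self-contained sketch of the two steps on a toy frame with the same index structure as the question:
import numpy as np
import pandas as pd

midx = pd.MultiIndex.from_product([['a'], range(2)], names=['contract', 'index'])
toy = pd.DataFrame(np.random.randn(2, 3), index=midx, columns=['z', 'x', 'c'])
out = toy.set_index('c', append=True).droplevel('index')
print(out.index.names)  # ['contract', 'c']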
I just asked a similar question, rename columns according to list, which has a correct answer for how to add suffixes to column names. But I have a new issue: I want to rename the actual columns index name per dataframe. I have three lists of dataframes, and some of the dataframes share the same columns index name (some share actual dataframe names as well, but that's not the issue; the issue is the duplicated original columns.name). I simply want to append a suffix to each dataframe's columns.name within each list, using the entry of the suffix list that matches the list's position.
Here is an example of the data and the output I would like:
# add string to end of x in list of dfs
df1, df2, df3, df4 = (pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('a', 'b')),
                      pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('c', 'd')),
                      pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('e', 'f')),
                      pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('g', 'h')))
df1.columns.name = 'abc'
df2.columns.name = 'abc'
df3.columns.name = 'efg'
df4.columns.name = 'abc'
cat_a = [df2, df1]
cat_b = [df3, df2, df1]
cat_c = [df1]
dfs = [cat_a, cat_b, cat_c]
suffix = ['group1', 'group2', 'group3']
# expected output =
#for df in cat_a: df.columns.name = df.columns.name + 'group1'
#for df in cat_b: df.columns.name = df.columns.name + 'group2'
#for df in cat_c: df.columns.name = df.columns.name + 'group3'
And here is some code that I have written that doesn't work: where columns.name values are duplicated across dataframes, multiple suffixes get appended.
for x, df in enumerate(dfs):
    for i in df:
        n = ([(i.columns.name + '_' + str(suffix[x])) for out in i.columns.name])
        i.columns.name = n[x]
Thank you for looking, I really appreciate it.
Your current code is not working because you have multiple references to the same df in your lists, so only the last change matters. You need to make copies.
Assuming you want to change the columns index name for each df in dfs, you can use a list comprehension:
dfs = [[d.rename_axis(suffix[i], axis=1) for d in group]
for i,group in enumerate(dfs)]
output:
>>> dfs[0][0]
group1 c d
0 5 0
1 9 3
2 3 9
3 4 2
4 1 0
5 7 6
6 5 2
7 8 0
8 1 2
9 7 2
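If you then want the renamed groups back under their original names, you can unpack the new list (a small sketch):
cat_a, cat_b, cat_c = dfs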
Basically, I have 5 pandas DataFrames, named df0, df1, df2, df3 and df4. What I would like to do is use a for loop to add data to these 5 dataframes, something like:
for i, dataset in enumerate([df0, df1, df2, df3, df4]):
    dataset = pd.concat([dataset, NEW_DATA])
However, when you do it like this (or when you iterate over a plain list instead of using enumerate), dataset is bound to the DataFrame itself rather than to the variable name (i.e. df0), so the original variables are never updated. How can I solve this? For example, the second iteration should effectively do:
for i, dataset in enumerate([df0, df1, df2, df3, df4]):
    df1 = pd.concat([df1, NEW_DATA])
Edit: I have also tried dictionaries, such as {'df0': df0, ... etc.}; however, it again gives me the DataFrame itself rather than the variable name.
You can re-assign the new df into your list:
# setup example
df0 = pd.DataFrame(np.random.randint(0, 10, (3, 2)))
df1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)))
df2 = pd.DataFrame(np.random.randint(0, 10, (3, 2)))
# then
lst = [df0, df1, df2]
for i, df in enumerate(lst):
    newdata = pd.DataFrame([[0, 0], [0, 0]])  # (say)
    lst[i] = pd.concat([df, newdata])  # DataFrame.append was removed in pandas 2.0
df0, df1, df2 = lst
>>> df0
0 1
0 8 7
1 9 1
2 5 6
0 0 0
1 0 0
But, BTW, it might be better to store your DataFrames collection in a dict instead of a list, if you want to refer to them by name instead of by index.
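A minimal sketch of that dict approach (the keys here are illustrative):
dfs = {'df0': df0, 'df1': df1, 'df2': df2}
newdata = pd.DataFrame([[0, 0], [0, 0]])
for name, frame in dfs.items():
    dfs[name] = pd.concat([frame, newdata])
dfs['df0']  # refer to each frame by name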
Edit: rewriting the solution to show some better practice.
So the problem is that you have a bunch of values that need to be updated through reassignment. There's also a stylistic point: if you have df1, df2, ..., you'd probably rather have them in a list in the first place.
Using a list in any case is also how I'd address the issue.
dfs = [df0, df1, df2, ...]
dfs = [pd.concat([df, NEW_DATA]) for df in dfs]
[df0, df1, df2, ...] = dfs
See how, if you'd just use dfs in general and refer to dfs[0] instead of df0, this solution could almost come for free?
Suppose a dataframe like this one:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]], columns = ['A', 'B', 'A1', 'B1'])
I would like to have a dataframe where the A1 and B1 columns are stacked underneath the A and B columns, giving a long frame with just two columns, A and B.
What does not work:
new_rows = int(df.shape[1]/2) * df.shape[0]
new_cols = 2
df.values.reshape(new_rows, new_cols, order='F')
Of course I could loop over the data and build a new list of lists, but there must be a better way. Any ideas?
The pd.wide_to_long function is built almost exactly for this situation, where you have many of the same variable prefixes that end in a different digit suffix. The only difference here is that your first set of variables don't have a suffix, so you will need to rename your columns first.
The only issue with pd.wide_to_long is that, unlike melt, it must have an identification variable, i. reset_index is used to create this uniquely identifying column, which is dropped later. I think this might get corrected in the future.
df1 = df.rename(columns={'A':'A1', 'B':'B1', 'A1':'A2', 'B1':'B2'}).reset_index()
pd.wide_to_long(df1, stubnames=['A', 'B'], i='index', j='id')\
.reset_index()[['A', 'B', 'id']]
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
You can use lreshape; for the id column, use numpy.repeat:
a = [col for col in df.columns if 'A' in col]
b = [col for col in df.columns if 'B' in col]
df1 = pd.lreshape(df, {'A' : a, 'B' : b})
df1['id'] = np.repeat(np.arange(len(df.columns) // 2), len(df.index)) + 1
print(df1)
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
EDIT:
lreshape is currently undocumented, and it is possible it will be removed in the future (along with pd.wide_to_long). One possible direction is merging all three reshaping functions into one, maybe melt, but that is not implemented yet. If that happens in some new version of pandas, this answer will be updated.
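In the meantime, a concat-based alternative avoids the undocumented function entirely (a sketch using plain pd.concat, not melt):
parts = [df[['A', 'B']],
         df[['A1', 'B1']].rename(columns={'A1': 'A', 'B1': 'B'})]
df1 = (pd.concat(parts, keys=[1, 2], names=['id'])
         .reset_index('id')
         .reset_index(drop=True))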
I solved this in 3 steps:
Make a new dataframe df2 holding only the data you want to be added to the initial dataframe df.
Delete the data from df that will be added below (and that was used to make df2).
Append df2 to df.
Like so:
# step 1: create new dataframe
df2 = df[['A1', 'B1']].copy()
df2.columns = ['A', 'B']
# step 2: delete that data from the original
df = df.drop(['A1', 'B1'], axis=1)
# step 3: append (DataFrame.append was removed in pandas 2.0, so use pd.concat)
df = pd.concat([df, df2], ignore_index=True)
Note that you need to specify ignore_index=True so the appended rows get a fresh index rather than keeping their old one.
Your end result should be your original dataframe with the data rearranged like you wanted:
In [16]: df
Out[16]:
A B
0 1 2
1 5 6
2 9 10
3 3 4
4 7 8
5 11 12
Use pd.concat() like so:
#Split into separate tables
df_1 = df[['A', 'B']]
df_2 = df[['A1', 'B1']].copy()
df_2.columns = ['A', 'B']  # Make column names line up
# Add the ID column
df_1 = df_1.assign(id=1)
df_2 = df_2.assign(id=2)
# Concatenate
pd.concat([df_1, df_2])
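If you want a fresh 0..n-1 row index instead of the repeated 0..2, pass ignore_index (a small sketch):
pd.concat([df_1, df_2], ignore_index=True)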
Data:
Multiple dataframes of the same format (same columns, an equal number of rows, and no points missing).
How do I create a "summary" dataframe that contains an element-wise mean for every element? How about a dataframe that contains an element-wise standard deviation?
First dataframe:
          A         B         C
0 -1.624722 -1.160731  0.016726
1 -1.565694  0.989333  1.040820
2 -0.484945  0.718596 -0.180779
3  0.388798 -0.997036  1.211787
4 -0.249211  1.604280 -1.100980
5  0.062425  0.925813 -1.810696
6  0.793244 -1.860442 -1.196797
Second dataframe:
          A         B         C
0  1.016386  1.766780  0.648333
1 -1.101329 -1.021171  0.830281
2 -1.133889 -2.793579  0.839298
3  1.134425  0.611480 -1.482724
4 -0.066601 -2.123353  1.136564
5 -0.167580 -0.991550  0.660508
6  0.528789 -0.483008  1.472787
You can create a Panel of your DataFrames and then compute the mean and SD along the items axis. Note that pd.Panel was deprecated in pandas 0.20 and removed in 0.25, so on current pandas see the concat-based sketch at the end of this answer:
df1 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df3 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
p = pd.Panel({n: df for n, df in enumerate([df1, df2, df3])})
>>> p.mean(axis=0)
A B C
0 -0.024284 -0.622337 0.581292
1 0.186271 0.596634 -0.498755
2 0.084591 -0.760567 -0.334429
3 -0.833688 0.403628 0.013497
4 0.402502 -0.017670 -0.369559
5 0.733305 -1.311827 0.463770
6 -0.941334 0.843020 -1.366963
7 0.134700 0.626846 0.994085
8 -0.783517 0.703030 -1.187082
9 -0.954325 0.514671 -0.370741
>>> p.std(axis=0)
A B C
0 0.196526 1.870115 0.503855
1 0.719534 0.264991 1.232129
2 0.315741 0.773699 1.328869
3 1.169213 1.488852 1.149105
4 1.416236 1.157386 0.414532
5 0.554604 1.022169 1.324711
6 0.178940 1.107710 0.885941
7 1.270448 1.023748 1.102772
8 0.957550 0.355523 1.284814
9 0.582288 0.997909 1.566383
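Since pd.Panel no longer exists in current pandas, here is a rough equivalent using pd.concat with keys plus a groupby on the original row index (a sketch, not the Panel API above):
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df2 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])
df3 = pd.DataFrame(np.random.randn(10, 3), columns=['A', 'B', 'C'])

# Stack the frames with an outer key level, then aggregate across it:
stacked = pd.concat([df1, df2, df3], keys=range(3))
mean_df = stacked.groupby(level=1).mean()  # element-wise mean
std_df = stacked.groupby(level=1).std()    # element-wise sample SD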
One simple solution here is to concatenate the existing dataframes into a single dataframe while adding an id variable to track the original source:
dfa = pd.DataFrame( np.random.randn(2,2), columns=['a','b'] ).assign(id='a')
dfb = pd.DataFrame( np.random.randn(2,2), columns=['a','b'] ).assign(id='b')
df = pd.concat([dfa, dfb])
a b id
0 -0.542652 1.609213 a
1 -0.192136 0.458564 a
0 -0.231949 -0.000573 b
1 0.245715 -0.083786 b
So now you have two 2x2 dataframes combined into a single 4x2 dataframe. The 'id' column identifies the source dataframe, so you haven't lost any generality, and you can select on 'id' to do the same thing you would to any single dataframe, e.g. df[df['id'] == 'a'].
But now you can also group on the original row index to apply any pandas method, such as mean() or std(), on an element-by-element basis across the frames:
df.reset_index().groupby('index').mean(numeric_only=True)
a b
index
0 0.198164 -0.811475
1 0.639529 0.812810
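The element-wise standard deviation works the same way (a sketch):
df.reset_index().groupby('index').std(numeric_only=True)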
The following solution worked for me.
average_data_frame = (dataframe1 + dataframe2 ) / 2
Or, if you have more than two dataframes, say n, then
n = len(dataframes)  # dataframes is a list holding your n frames
average_data_frame = dataframes[0]
for df in dataframes[1:]:
    average_data_frame = average_data_frame + df
average_data_frame = average_data_frame / n
Once you have the average, you can compute the standard deviation from it. If you are looking for a truly "Pythonic" approach, you should follow the other answers, but if you want a quick working solution, this is it.
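For completeness, a sketch of the element-wise sample standard deviation built from that mean (assuming the dataframes list and average_data_frame from above):
squared_diffs = sum((frame - average_data_frame) ** 2 for frame in dataframes)
std_data_frame = (squared_diffs / (n - 1)) ** 0.5  # element-wise sample SD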