How to iterate over a list of dataframes? - Python

Basically, I have 5 pd.DataFrames, named df0, df1, df2, df3, df4. What I would like to do is use a for loop to add data to these 5 dataframes. Something like:
for i, dataset in enumerate([df0, df1, df2, df3, df4]):
    dataset = pd.concat([dataset, NEW_DATA])
However, when you do it like this (or when you use just a list instead of enumerate), 'dataset' is bound to the DataFrame object itself, not to the variable name (i.e. df0), so reassigning it never updates df0. How can I solve this? For example, the second iteration should effectively do:
for i, dataset in enumerate([df0, df1, df2, df3, df4]):
    df1 = pd.concat([df1, NEW_DATA])
edit: I have also tried dictionaries, such as {'df0': df0, ...}; however, it again gives me the dataset rather than the dataset's variable name.

You can re-assign the new df into your list:
import numpy as np
import pandas as pd

# setup example
df0 = pd.DataFrame(np.random.randint(0, 10, (3, 2)))
df1 = pd.DataFrame(np.random.randint(0, 10, (3, 2)))
df2 = pd.DataFrame(np.random.randint(0, 10, (3, 2)))

# then
lst = [df0, df1, df2]
for i, df in enumerate(lst):
    newdata = pd.DataFrame([[0, 0], [0, 0]])  # (say)
    lst[i] = pd.concat([df, newdata])  # DataFrame.append is deprecated; pd.concat does the same job here
df0, df1, df2 = lst
>>> df0
0 1
0 8 7
1 9 1
2 5 6
0 0 0
1 0 0
But, BTW, it might be better to store your DataFrames collection in a dict instead of a list, if you want to refer to them by name instead of by index.
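For completeness, a minimal sketch of the dict approach (the frames and new data here are made-up stand-ins):
import pandas as pd

dfs = {f'df{i}': pd.DataFrame({'x': [i]}) for i in range(5)}
new_data = pd.DataFrame({'x': [99]})

for name in dfs:
    dfs[name] = pd.concat([dfs[name], new_data], ignore_index=True)

print(dfs['df1'])  # refer to each frame by name instead of by position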

Edit: Rewriting the solution to provide some proper practice.
So the problem is that you have a bunch of variables that need to be updated through reassignment. There is also a stylistic issue: if you have df1, df2, ..., you would probably rather have them in a list in the first place.
Using a list in any case is also how I'd address the issue.
dfs = [df0, df1, df2, ...]
dfs = [pd.concat([df, NEW_DATA]) for df in dfs]
[df0, df1, df2, ...] = dfs
See how, if you'd just use dfs in general and refer to dfs[0] instead of df0, this solution could almost come for free?
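For reference, a small runnable version of that pattern (the frames and NEW_DATA are hypothetical stand-ins):
import pandas as pd

dfs = [pd.DataFrame({'x': [i]}) for i in range(3)]  # df0, df1, df2
NEW_DATA = pd.DataFrame({'x': [99]})

dfs = [pd.concat([df, NEW_DATA], ignore_index=True) for df in dfs]
df0, df1, df2 = dfs  # unpack only if you really need the individual names
print(df1)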

How to replace one of the levels of a MultiIndex dataframe with one of its columns

I have a dataframe such as
import numpy as np
import pandas as pd

multiindex1 = pd.MultiIndex.from_product([['a'], np.arange(3, 8)])
df1 = pd.DataFrame(np.random.randn(5, 3), index=multiindex1)
multiindex2 = pd.MultiIndex.from_product([['s'], np.arange(1, 6)])
df2 = pd.DataFrame(np.random.randn(5, 3), index=multiindex2)
multiindex3 = pd.MultiIndex.from_product([['d'], np.arange(2, 7)])
df3 = pd.DataFrame(np.random.randn(5, 3), index=multiindex3)
df = pd.concat([df1, df2, df3])
df.index.names = ['contract', 'index']
df.columns = ['z', 'x', 'c']
>>>
z x c
contract index
a 3 0.354879 0.206557 0.308081
4 0.822102 -0.425685 1.973288
5 -0.801313 -2.101411 -0.707400
6 -0.740651 -0.564597 -0.975532
7 -0.310679 0.515918 -1.213565
s 1 -0.175135 0.777495 0.100466
2 2.295485 0.381226 -0.242292
3 -0.753414 1.172924 0.679314
4 -0.029526 -0.020714 1.546317
5 0.250066 -1.673020 -0.773842
d 2 -0.602578 -0.761066 -1.117238
3 -0.935758 0.448322 -2.135439
4 0.808704 -0.604837 -0.319351
5 0.321139 0.584896 -0.055951
6 0.041849 -1.660013 -2.157992
Now I want to replace the index level 'index' with the column c. That is to say, I want the result to be:
z x
contract c
a 0.308081 0.354879 0.206557
1.973288 0.822102 -0.425685
-0.707400 -0.801313 -2.101411
-0.975532 -0.740651 -0.564597
-1.213565 -0.310679 0.515918
s 0.100466 -0.175135 0.777495
-0.242292 2.295485 0.381226
0.679314 -0.753414 1.172924
1.546317 -0.029526 -0.020714
-0.773842 0.250066 -1.673020
d -1.117238 -0.602578 -0.761066
-2.135439 -0.935758 0.448322
-0.319351 0.808704 -0.604837
-0.055951 0.321139 0.584896
-2.157992 0.041849 -1.660013
I implemented it one way:
df.reset_index().set_index(['contract', 'c']).drop(['index'], axis=1)
But it seems there are some duplicate steps, because I manipulate the indexes three times. Is there a more elegant way to achieve that?
Try this:
# convert column "c" into an index and remove "index" from index
df.set_index('c', append=True).droplevel('index')
Explanation:
Pandas' set_index method has an append argument that controls whether to append columns to the existing index; setting it to True appends column "c" as a new index level. The droplevel method removes an index level (it can remove a column level too, but works on the index by default).
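For reference, a minimal runnable sketch of those two steps on a small made-up frame:
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['a'], [3, 4]], names=['contract', 'index'])
df = pd.DataFrame(np.arange(6).reshape(2, 3), index=idx, columns=['z', 'x', 'c'])

out = df.set_index('c', append=True).droplevel('index')
print(out.index.names)  # ['contract', 'c']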

Loop through different dataframes and perform actions using a function

I have 10 dataframes that have the same structure (same number of rows and columns) and I am trying to find an efficient way of performing several actions such as renaming columns with a for loop. I have tried putting them in a list such as
dfs = [df1, df2, df3]
for i in dfs:
    i.rename(columns={'A': 'a1'}, inplace=True)
but it doesn't work. Another issue occurs if I try to use a function and then loop such as:
def groupdfs(anydf):
    anydf = anydf.groupby("A").sum

for i in dfs:
    groupdfs(i)
No changes are happening to the dataframes. I have searched similar old questions but nothing has worked. What is the best way to loop through many dataframes when you want to perform the same changes to each of them?
The first piece of code you wrote should work fine.
import numpy as np
import pandas as pd

data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)])
dff = pd.DataFrame(data, columns=['c', 'a', 'b'])
dff
   c  a  b
0  1  2  3
1  4  5  6
2  7  8  9
dato = np.array([(11, 12, 13), (41, 15, 16), (17, 18, 9)])
dfg = pd.DataFrame(dato, columns=['c', 'a', 'b'])
dfg
    c   a   b
0  11  12  13
1  41  15  16
2  17  18   9
dffs = [dff, dfg]
for i in dffs:
    i.rename(columns={'a': 'a1'}, inplace=True)
dff
   c  a1  b
0  1   2  3
1  4   5  6
2  7   8  9
The only thing I can think of is that you may need to add a line at the end to save the changes back to your files.
For the first part
Since everything is the same, you can create a list with the new column names and assign it to all of them like this:
column_names = ['a1', 'a2', 'a3']
for df in [df1, df2, df3]:
    df.columns = column_names
Or, if you want to use a dictionary to change only some columns, you can:
for df in [df1, df2, df3]:
    df.rename({'A': 'a1'}, axis=1, inplace=True)
Note that axis=1 refers to the column level.
For the second part
There are two issues:
The groupby creates a new DataFrame that has to be assigned to a variable if you want to use it again.
Since you are in a function, you have to return that new DataFrame so it can be assigned to a variable outside the function, as below:
def groupdfs(anydf):
    return anydf.groupby("A").sum()

dfs = [groupdfs(df) for df in dfs]
This replaces each old DataFrame in the list with its grouped version. Note that writing i = groupdfs(i) in a plain for loop only rebinds the loop variable and leaves dfs unchanged; collect the results as above, or assign them to new variables if you want to keep the originals.
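For reference, a self-contained run of that pattern (the frames and column names are made up here):
import pandas as pd

def groupdfs(anydf):
    return anydf.groupby('A').sum()

dfs = [pd.DataFrame({'A': [1, 1, 2], 'B': [10, 20, 30]}),
       pd.DataFrame({'A': [3, 3, 4], 'B': [1, 2, 3]})]
dfs = [groupdfs(df) for df in dfs]
print(dfs[0])  # B summed within each value of A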

Rename column name index according to list placement with multiple duplicate names

I just asked a similar question, rename columns according to list, which has a correct answer for how to add suffixes to column names. But I have a new issue: I want to rename the actual index name for the columns (columns.name) per dataframe. I have three lists of data frames; some of the data frames have duplicate column index names (and some DataFrame objects appear in more than one list as well, but that's not the issue; the issue is the duplicated original columns.name values). I simply want to append a suffix from the suffix list, chosen by position, to each dataframe's columns.name within each list.
Here is an example of the data and the output I would like:
# add string to end of x in list of dfs
df1, df2, df3, df4 = (pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('a', 'b')),
                      pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('c', 'd')),
                      pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('e', 'f')),
                      pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=('g', 'h')))
df1.columns.name = 'abc'
df2.columns.name = 'abc'
df3.columns.name = 'efg'
df4.columns.name = 'abc'
cat_a = [df2, df1]
cat_b = [df3, df2, df1]
cat_c = [df1]
dfs = [cat_a, cat_b, cat_c]
suffix = ['group1', 'group2', 'group3']
# expected output =
#for df in cat_a: df.columns.name = df.columns.name + 'group1'
#for df in cat_b: df.columns.name = df.columns.name + 'group2'
#for df in cat_c: df.columns.name = df.columns.name + 'group3'
And here is some code I have written that doesn't work; where columns.name values are duplicated across data frames, multiple suffixes get appended:
for x, df in enumerate(dfs):
    for i in df:
        n = [(i.columns.name + '_' + str(suffix[x])) for out in i.columns.name]
        i.columns.name = n[x]
Thank you for looking, I really appreciate it.
Your current code is not working because you have multiple references to the same df across your lists, so only the last change sticks. You need to make copies.
Assuming you want to change the columns index name for each df in dfs, you can use a list comprehension:
dfs = [[d.rename_axis(suffix[i], axis=1) for d in group]
for i,group in enumerate(dfs)]
output:
>>> dfs[0][0]
group1 c d
0 5 0
1 9 3
2 3 9
3 4 2
4 1 0
5 7 6
6 5 2
7 8 0
8 1 2
9 7 2
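To see why the copies matter, a quick sketch of the shared-reference problem (using a made-up frame):
import pandas as pd

df1 = pd.DataFrame({'a': [1]})
cat_a = [df1]
cat_b = [df1]  # same object, not a copy
cat_a[0].columns.name = 'abc_group1'
print(cat_b[0].columns.name)  # also 'abc_group1': the in-place change shows up in both lists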

Can you merge elements of Pandas dataframes into tuples?

If you have two Pandas dataframes in Python with identical axes, is there a function to merge the elements as tuples so that they maintain their positions? If there is a better way to combine these dataframes without duplicating the number of indices or columns, that works as well.
You can do this in pure pandas:
(pd.concat([df1, df2])   # stack the two frames; row labels now repeat
 .stack()                # long format, indexed by (row, column)
 .groupby(level=[0, 1])  # group the pair of values sharing a (row, column) position
 .apply(tuple)           # merge each pair into a tuple
 .unstack()              # back to the original wide shape
)
Output:
A B
0 (1, 7) (4, 10)
1 (2, 8) (5, 11)
2 (3, 9) (6, 12)
Input:
import pandas as pd
df1 = pd.DataFrame({"A":[1,2,3],"B":[4,5,6]})
df2 = pd.DataFrame({"A":[7,8,9],"B":[10,11,12]})
The operation you're looking for seems like "zip". That is, match elements of two sequences together into a sequence of tuples. If you look at each column in your dataframes and zip them together you will have a result that is a list of lists of tuples - what you want to be in your result dataframe. You can then construct a dataframe with the same columns and index out of that data. In code, that looks like this:
data = [list(zip(df1[col], df2[col])) for col in df1]  # one list of tuples per column
pd.DataFrame(list(zip(*data)), index=df1.index, columns=df1.columns)  # transpose back to rows
You can maybe use something like this to achieve what you want (note the list() call: in Python 3, zip returns an iterator):
df3 = pd.DataFrame({x: list(zip(df1[x], df2[x])) for x in df1.columns})
df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df2 = pd.DataFrame({"A": [7, 8, 9], "B": [10, 11, 12]})

def add_dfs(df1, df2):
    # wrap every element in a 1-tuple; adding the frames then concatenates the tuples elementwise
    for col in df1.columns:
        df1[col] = df1[col].apply(lambda x: (x,))
    for col in df2.columns:
        df2[col] = df2[col].apply(lambda x: (x,))
    df = df1 + df2  # using + operator, satisfies answer technically
    return df

df = add_dfs(df1, df2)
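Note that add_dfs mutates its arguments in place: after the call, df1 and df2 themselves hold 1-tuples, so pass copies (add_dfs(df1.copy(), df2.copy())) if you still need the originals. With the sample frames above, the result is:
        A        B
0  (1, 7)  (4, 10)
1  (2, 8)  (5, 11)
2  (3, 9)  (6, 12)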

Python pandas: search for a value in a df from another df

I've got two data frames:
Df1
Time V1 V2
02:00 D3F3 0041
02:01 DD34 0040
Df2
FileName V1 V2
1111.txt D3F3 0041
2222.txt 0000 0040
Basically I want to compare the V1 and V2 columns, and if they match, print the Time from the df1 row and the FileName from the df2 row. So far all I can find is isin(), which simply gives you a boolean output.
So the output would be :
1111.txt 02:00
I started using dataframes because I thought I could query the two df's on the V1/V2 values, but I can't see a way. Any pointers would be much appreciated.
Use merge on the dataframe columns that you want to have the same values. You can then drop the rows with NaN values, as those will not have matching values. From there, you can print the merged dataframes values however you see fit.
df1 = pd.DataFrame({'Time': ['8a', '10p'], 'V1': [1, 2], 'V2': [3, 4]})
df2 = pd.DataFrame({'fn': ['8.txt', '10.txt'], 'V1': [3, 2], 'V2': [3, 4]})
df1.merge(df2, on=['V1', 'V2'], how='outer').dropna()
=== Output: ===
Time V1 V2 fn
1 10p 2 4 10.txt
The most intuitive solution is:
1) iterate over the V1 column in DF1;
2) for each item in this column, check whether it exists in the V1 column of DF2;
3) if it does, find the index of that item in DF2, from which you can get the file name (see the sketch below).
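A minimal sketch of those three steps, using the Df1/Df2 layouts from the question:
import pandas as pd

df1 = pd.DataFrame({'Time': ['02:00', '02:01'], 'V1': ['D3F3', 'DD34'], 'V2': ['0041', '0040']})
df2 = pd.DataFrame({'FileName': ['1111.txt', '2222.txt'], 'V1': ['D3F3', '0000'], 'V2': ['0041', '0040']})

for _, row in df1.iterrows():
    matches = df2[(df2['V1'] == row['V1']) & (df2['V2'] == row['V2'])]
    for fname in matches['FileName']:
        print(fname, row['Time'])  # -> 1111.txt 02:00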
You can try using pd.concat.
In this case it would be:
pd.concat([df1, df2.reindex(df1.index)], axis=1)
It will create a new dataframe with all the values, but where values don't match across the two dataframes it will produce NaN. If you don't want that to happen, use an inner join:
pd.concat([df1, df2], axis=1, join='inner')
If you want to learn more, see the pandas merging guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
You can use merge with an inner join:
df2.merge(df1, how="inner", on=["V1", "V2"])[["FileName", "Time"]]
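With the frames from the question, this returns exactly the requested pairing:
   FileName   Time
0  1111.txt  02:00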
While I think Eric's solution is more pythonic, if your only aim is to print the rows on which df1 and df2 have the same V1 and V2 values, and the two dataframes are of the same length, you can do the following:
for row in range(len(df1)):
    if (df1.iloc[row, 1:] == df2.iloc[row, 1:]).all():
        print(df1.iloc[row], df2.iloc[row])
Try this:
import io

import boto3
import pandas as pd

client = boto3.client('s3')
obj = client.get_object(Bucket='', Key='')
data = obj['Body'].read()
df1 = pd.read_excel(io.BytesIO(data), sheet_name='0')
df2 = pd.read_excel(io.BytesIO(data), sheet_name='1')
head = df2.columns[0]
print(head)
data = df1.iloc[[8],[0]].values[0]
print(data)
print(df2)
df2.columns = df2.iloc[0]
df2 = df2.drop(labels=0, axis=0)
df2['Head'] = head
df2['ID'] = pd.Series([data,data])
print(df2)
df2.to_csv('test.csv',index=False)
