How can I concat multiple dataframes in Python? [duplicate]

This question already has answers here:
Append multiple pandas data frames at once
(5 answers)
How do I create variable variables?
(17 answers)
Closed 4 years ago.
I have multiple (more than 100) dataframes. How can I concat all of them?
The problem is that I have too many dataframes to write them out manually in a list, like this:
>>> cluster_1 = pd.DataFrame([['a', 1], ['b', 2]],
...                          columns=['letter', 'number'])
>>> cluster_1
  letter  number
0      a       1
1      b       2
>>> cluster_2 = pd.DataFrame([['c', 3], ['d', 4]],
...                          columns=['letter', 'number'])
>>> cluster_2
  letter  number
0      c       3
1      d       4
>>> pd.concat([cluster_1, cluster_2])
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4
The names of my N dataframes are cluster_1, cluster_2, cluster_3,..., cluster_N. The number N can be very high.
How can I concat N dataframes?

I think you can just put the dataframes into a list and then concat the list. I do this myself, for example, when reading a large file in chunks with pandas (read_csv with a chunksize) and collecting the chunks before concatenating them.
pdList = [df1, df2, ...] # List of your dataframes
new_df = pd.concat(pdList)
To create pdList automatically, assuming your dataframe names always start with "cluster_":
pdList = []
# grab every variable in the current scope whose name starts with 'cluster_'
pdList.extend(value for name, value in locals().items() if name.startswith('cluster_'))
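If you are building the dataframes yourself, it is usually cleaner to collect them in a list as you create them instead of naming them cluster_1, cluster_2, ... and fishing them out of locals(). A minimal sketch (the loop body here is just made-up example data):
import pandas as pd

cluster_dfs = []
for i in range(1, 101):  # pretend there are 100 clusters
    # each iteration produces one dataframe; append it to the list right away
    cluster_dfs.append(pd.DataFrame({'letter': ['a', 'b'], 'number': [i, i + 1]}))

combined = pd.concat(cluster_dfs, ignore_index=True)
print(combined.shape)  # (200, 2)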

Generally it goes like:
frames = [df1, df2, df3]
result = pd.concat(frames)
Note: pd.concat keeps the original indexes by default; pass ignore_index=True if you want the result renumbered from 0.
Read more details on the different types of merging in the pandas merging documentation.
For a large number of data frames:
If you have hundreds of dataframes, you can still build the list ("frames" in the code snippet) with a for loop, whether the frames live on disk or in memory. If they are on disk, the easiest route is to save them all in a single folder and then read every file from that folder. If you are generating the frames in memory, maybe try saving each one to a .pkl file first.
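For example, a minimal sketch of the on-disk route, assuming the dataframes were saved as CSV files in a folder called clusters/ (folder name and file format are assumptions):
import glob
import pandas as pd

files = sorted(glob.glob('clusters/*.csv'))    # every CSV saved in that folder
frames = [pd.read_csv(f) for f in files]       # read each file into a dataframe
result = pd.concat(frames, ignore_index=True)  # one concat at the end
If you went the .pkl route instead, swap pd.read_csv for pd.read_pickle and the *.csv pattern for *.pkl.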

Use:
pd.concat(list_of_dataframes)
And if you want a regular 0-based index:
pd.concat(list_of_dataframes, ignore_index=True)
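For instance, continuing the example from the question:
>>> pd.concat([cluster_1, cluster_2], ignore_index=True)
  letter  number
0      a       1
1      b       2
2      c       3
3      d       4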

Related

Modifying one dataframe appears to change another [duplicate]

This question already has answers here:
Why can pandas DataFrames change each other?
(3 answers)
How do I clone a list so that it doesn't change unexpectedly after assignment?
(24 answers)
Closed 1 year ago.
I am new to loops in Python and just came across a weird issue. I was doing some calculations on multiple dataframes, and to simplify the question, here is an illustration.
Suppose I have 3 dataframes filled with NaN:
import numpy as np
import pandas as pd

# generate NaN entries
data = np.empty((15, 10))
data[:] = np.nan
# create dataframe
dfnan = pd.DataFrame(data)
df1 = dfnan
df2 = dfnan
df3 = dfnan
After this step, all three dataframes are filled with NaN as expected.
But then, if I add two for loops in one block like below:
for i in range(0, 15, 1):
    df1.iloc[i] = 0
for j in range(0, 15, 1):
    df2.iloc[j] = df1.iloc[j].transform(lambda x: x+1)
Then df1, df2, and df3 all end up filled with 1. But shouldn't it be that
df1 is filled with 0, df2 is filled with 1, and df3 is still filled with NaN (since I didn't change it)?
Why is that, and how can I change my code to get the wanted result?
Assignment never copies in Python. df1, df2, df3 and dfnan are all references to the same object (the DataFrame created by pd.DataFrame(data)). This means that changes to one are reflected in the others, because they all point to the same object.
This is a great read on the topic: https://nedbatchelder.com/text/names.html
To create independent copies, use the copy method:
dfnan = pd.DataFrame(data)
df1 = dfnan.copy()
df2 = dfnan.copy()
df3 = dfnan.copy()
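A quick check, continuing the snippet above, that the copies really are independent:
df1.iloc[0] = 0
print(df1.iloc[0, 0])  # 0.0
print(df2.iloc[0, 0])  # nan -- df2 is no longer affected
print(df3.iloc[0, 0])  # nan -- neither is df3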

Reindex dataframe inside loop [duplicate]

This question already has answers here:
How to change variables fed into a for loop in list form
(4 answers)
Closed 5 months ago.
I'm trying to reindex the columns in a set of dataframes inside a loop. This only seems to work outside the loop. See the sample code below:
import pandas as pd
data1 = [[1,2,3],[4,5,6],[7,8,9]]
data2 = [[10,11,12],[13,14,15],[16,17,18]]
data3 = [[19,20,21],[22,23,24],[25,26,27]]
index = ['a','b','c']
columns = ['d','e','f']
df1 = pd.DataFrame(data=data1,index=index,columns=columns)
df2 = pd.DataFrame(data=data2,index=index,columns=columns)
df3 = pd.DataFrame(data=data3,index=index,columns=columns)
columns2 = ['f','e','d']
for i in [df1,df2,df3]:
    i = i.reindex(columns=columns2)
print(df1)
df2 = df2.reindex(columns=columns2)
print(df2)
df1 is not reindexed as desired; however, if I reindex df2 outside of the loop, it works. Why is that?
Thanks
Andrew
That happens for the same reason this happens:
a = 5
b = 6
for i in [a, b]:
    i = 4
>>> a
5
Why? See this accepted answer.
Concerning your problem, one way to go about it is to create a list of reindexed dataframes like so:
reindexed_dfs = [df.reindex(columns=columns2) for df in [df1, df2, df3]]
and then reassign df1, df2 and df3, but it's usually better to just keep using your newly created list anyway.
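If you do want to keep addressing the frames by name, one possible sketch is to hold them in a dict and rebind each entry inside the loop (rebinding a dict value sticks, unlike rebinding the loop variable i):
dfs = {'df1': df1, 'df2': df2, 'df3': df3}
for name in dfs:
    dfs[name] = dfs[name].reindex(columns=columns2)
print(dfs['df1'])  # columns are now f, e, d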

How to sort a dataFrame in python pandas by two or more columns based on a list of values? [duplicate]

So I have a pandas DataFrame, df, with columns that represent taxonomical classification (i.e. Kingdom, Phylum, Class etc...) I also have a list of taxonomic labels that correspond to the order I would like the DataFrame to be ordered by.
The list looks something like this:
class_list=['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes', 'Clostridia', 'Bacilli', 'Actinobacteria', 'Betaproteobacteria', 'delta/epsilon subdivisions', 'Synergistia', 'Mollicutes', 'Nitrospira', 'Spirochaetia', 'Thermotogae', 'Aquificae', 'Fimbriimonas', 'Gemmatimonadetes', 'Dehalococcoidia', 'Oscillatoriophycideae', 'Chlamydiae', 'Nostocales', 'Thermodesulfobacteria', 'Erysipelotrichia', 'Chlorobi', 'Deinococci']
This list would correspond to the Dataframe column df['Class']. I would like to sort all the rows for the whole dataframe based on the order of the list as df['Class'] is in a different order currently. What would be the best way to do this?
You could make the Class column your index column
df = df.set_index('Class')
and then use df.loc to reindex the DataFrame with class_list:
df.loc[class_list]
Minimal example:
>>> df = pd.DataFrame({'Class': ['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes'], 'Number': [3, 5, 6]})
>>> df
                 Class  Number
0  Gammaproteobacteria       3
1        Bacteroidetes       5
2        Negativicutes       6
>>> df = df.set_index('Class')
>>> df.loc[['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']]
                     Number
Bacteroidetes             5
Negativicutes             6
Gammaproteobacteria       3
Alex's solution doesn't work if your original dataframe does not contain all of the elements in the ordered list; i.e., if your input data at some point does not contain "Negativicutes", that approach will fail. One way to get past this is to append the matching subsets of your df to a list and concatenate them at the end. For example:
ordered_classes = ['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']
df_list = []
for i in ordered_classes:
    df_list.append(df[df['Class']==i])
ordered_df = pd.concat(df_list)
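Another way to handle the same problem, sketched here rather than taken from the answers above (and assuming Class is still a regular column, as in the original dataframe), is to turn Class into an ordered categorical and sort on it; rows whose class is missing from class_list become NaN and sort last:
# sort df by the custom order given in class_list
df['Class'] = pd.Categorical(df['Class'], categories=class_list, ordered=True)
df = df.sort_values('Class')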

how to efficiently decode arrays to columns in pandas dataframe

I have a function that produces results for every month of a year. In my dataframe I collect these results for different data columns. After that, I have a dataframe containing multiple columns with arrays as values. Now I want to "pivot" those columns to have each value in its own column.
For example, if a row contains values [1,2,3,4,5,6,7,8,9,10,11,12] in column 'A', I want to have twelve columns 'A_01', 'A_02', ..., 'A_12' that each contain one value from the array.
My current code is this:
# create new columns
columns_to_add = []
column_count = len(columns_to_process)
for _, row in df[columns_to_process].iterrows():
    columns_to_add += [[row[name][offset] if type(row[name]) == list else row[name]
                        for offset in range(array_len) for name in range(column_count)]]
new_df = pd.DataFrame(columns_to_add,
                      columns=[name+'_'+str(offset+1) for offset in range(array_len)
                               for name in columns_to_process],
                      index=df.index)  # make dataframe addendum
(note: some rows don't have any values, so I had to put the condition if type() == list into the iteration)
But this code is awfully slow. I believe there must be a much more elegant solution. Can you show me such a solution?
IIUC, use Series.tolist with the pandas.DataFrame constructor.
We'll use DataFrame.rename as well to fix your column name format.
# Setup
df = pd.DataFrame({'A': [ [1,2,3,4,5,6,7,8,9,10,11,12] ]})
pd.DataFrame(df['A'].tolist()).rename(columns=lambda x: f'A_{x+1:0>2d}')
[out]
   A_01  A_02  A_03  A_04  A_05  A_06  A_07  A_08  A_09  A_10  A_11  A_12
0     1     2     3     4     5     6     7     8     9    10    11    12
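If several columns hold arrays, the same idea can be applied per column and the pieces concatenated side by side. A sketch with made-up two-column data (columns_to_process is the list of array columns from the question):
import pandas as pd

df = pd.DataFrame({'A': [list(range(1, 13))], 'B': [list(range(13, 25))]})
columns_to_process = ['A', 'B']

# expand each array column into its own block of 12 columns, then join the blocks
expanded = [
    pd.DataFrame(df[name].tolist(), index=df.index)
      .rename(columns=lambda x, name=name: f'{name}_{x + 1:0>2d}')
    for name in columns_to_process
]
new_df = pd.concat(expanded, axis=1)  # columns A_01..A_12, B_01..B_12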

Retrieving data from a dataframe by indexing using a list of tuples [duplicate]

This question already has an answer here:
Access entries in pandas data frame using a list of indices
(1 answer)
Closed 5 years ago.
I would like to know if there is a direct way to pass a list of tuples (row_index, column_index) to a dataframe method in order to retrieve the data at those row and column indexes. I did think about using a list comprehension, but I want to know whether pandas has something built in for this.
Passing the ordered list of row indexes and the ordered list of column indexes to loc merely retrieves all the intersections of those rows and columns; so zipping the two lists is useless and the result is not what I want.
For instance, with
df = pd.DataFrame([[[0,1,2,3],[0,9,8,7]],[[0,1,8,3],[0,4,8,7]]],
                  index=["r0","r1"], columns=["c0","c1"])
if I have the list l= [("r0","c0"),("r0","c1"),("r1","c1")]
I can indeed use the list comprehension
[df.loc[r, c] for r, c in l]
but I think I once saw a way of passing such a list to df or to arrays so as to retrieve the same result.
Was I mistaken?
Thanks in advance.
Perhaps you are looking for lookup:
import pandas as pd
df = pd.DataFrame([[0,1,2,3],[0,9,8,7]], index=["r0","r1"], columns=["c0","c1","c2","c3"])
l= [("r0","c0"),("r0","c1"),("r1","c1")]
print(df.lookup(*zip(*l)))
yields
[0 1 9]
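Note that DataFrame.lookup was deprecated in later pandas versions and removed in pandas 2.0. An equivalent that keeps working, sketched with get_indexer and plain NumPy indexing:
# translate the row/column labels to positions, then fancy-index the underlying array
rows, cols = zip(*l)
values = df.to_numpy()[df.index.get_indexer(rows), df.columns.get_indexer(cols)]
print(values)  # [0 1 9]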
Using melt
# reshape to long format: one row per (index, column) pair
DF = df.reset_index().melt('index')
# tag each long-format row with its (row label, column label) pair
DF['Match'] = list(zip(DF['index'], DF['variable']))
DF.value[DF.Match.isin(l)]
Out[249]:
0 0
2 1
5 8
Name: value, dtype: int64
Data Input
df = pd.DataFrame([[0, 1, 2, 3], [0, 9, 8, 7]], index = ["r0", "r1"])
l= [("r0",0),("r0",1),("r1",2)]
