Function to concat an undefined number of dataframes - python

I'd like to create a function that takes an undefined number of arrays, turns them into data frames, concatenates them along their columns, and outputs the merged dataframe.
Example:
# Suppose we have 3 arrays:
data1 = {
    'A': ['A1', 'A2', 'A3', 'A4', 'A5'],
    'B': ['B1', 'B2', 'B3', 'B4', 'B5'],
    'C': ['C1', 'C2', 'C3', 'C4', 'C5'],
}
data2 = {
    'D': ['D1', 'D2', 'D3', 'D4', 'D5'],
    'E': ['E1', 'E2', 'E3', 'E4', 'E5'],
    'F': ['F1', 'F2', 'F3', 'F4', 'F5'],
}
data3 = {
    'G': ['G1', 'G2', 'G3', 'G4', 'G5'],
    'H': ['H1', 'H2', 'H3', 'H4', 'H5'],
    'I': ['I1', 'I2', 'I3', 'I4', 'I5'],
}
# We could convert them into data frames using:
import pandas as pd

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
# And finally join them with:
df4 = pd.concat([df1, df2, df3], axis=1)
The output dataframe looks like this:
    A   B   C   D   E   F   G   H   I
0  A1  B1  C1  D1  E1  F1  G1  H1  I1
1  A2  B2  C2  D2  E2  F2  G2  H2  I2
2  A3  B3  C3  D3  E3  F3  G3  H3  I3
3  A4  B4  C4  D4  E4  F4  G4  H4  I4
4  A5  B5  C5  D5  E5  F5  G5  H5  I5
I would like to create a function that can do this with an unspecified number of arrays, for example:
func(data1, data2)
func(data1, data2, data3)
func(data1, data2, data...n)

This is a short answer using a list comprehension, provided by Ch3steR.
It works and is very compact:
def func(*args):
    dfs = [pd.DataFrame(dc) for dc in args]
    return pd.concat(dfs, axis=1)
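Calling it with the dicts defined above reproduces df4 from the example:
df4 = func(data1, data2, data3)
print(df4)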
In the end I went for a longer and slower solution, but one that I will easily understand when I look at my code in the future:
def add_df(*args):
    """Concatenate the columns of an unlimited number of dataframes."""
    frames = []  # renamed from 'list' to avoid shadowing the built-in
    for file in args:
        df = pd.read_csv(file)  # each argument is a path to a CSV file
        frames.append(df)
    return pd.concat(frames, axis=1)
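Called with CSV paths (the file names here are placeholders, not from the original post), it merges the columns of every file:
merged = add_df('data1.csv', 'data2.csv', 'data3.csv')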

Related

Iterate over different length df & export to csv

When my df2 column Group2 has a third item, 'B3', the groupby gives me what I want. How can I get that output when the arrays have different lengths?
I also struggle to get all of the data into the CSV, not just the last iteration. I tried creating the df before the loop and then merging it inside, but something doesn't work.
import pandas as pd

df1 = pd.DataFrame({'Title': ['A1', 'A2', 'A3', 'B1', 'B2', 'C13'],
                    'Whole': ['full', 'full', 'full', 'semi', 'semi', 'semi']})
df2 = pd.DataFrame({'Group1': ['A1', 'A2', 'A3'],
                    'Group2': ['B1', 'B2']})
for column in df2.columns:
    d_group = df1[df1.Title.isin(df2[column])]
    df = d_group.groupby('Whole')['Whole'].count()\
        .rename('Column Name from df2')\
        .reindex(['part', 'full', 'semi'], fill_value='-')\
        .reset_index()
    df.T.to_csv('all_groups2.csv', header=False, index=True)
    print(df.T)
Desired output:
Whole   | part | full | semi
--------+------+------+------
Group1  |  -   |  3   |  -
Group2  |  -   |  -   |  2
A pandas DataFrame expects all of its columns (and rows) to have the same length, so the df2 in your code cannot even be constructed.
I recommend using Series instead, something like this:
df1 = pd.DataFrame({'Title': ['A1', 'A2', 'A3', 'B1', 'B2', 'C13'],
                    'Whole': ['full', 'full', 'full', 'semi', 'part', 'semi']})
group1 = pd.Series(['A1', 'A2', 'A3'])
group2 = pd.Series(['B1', 'B2'])
Then you can filter df1 with the isin function and group by 'Whole':
dfg1 = df1[df1['Title'].isin(group1)].groupby('Whole').count()
dfg2 = df1[df1['Title'].isin(group2)].groupby('Whole').count()
And finally join them by concat on axis=1:
res = pd.concat([dfg1, dfg2], axis=1)
res.columns = ['Group1','Group2']
finaldf = res.T
The result is the following:
        full  part  semi
Group1   3.0   NaN   NaN
Group2   NaN   1.0   1.0
And finally, you can write it to a CSV with the same code that you had:
finaldf.to_csv('result.csv', header=False, index=True)
I recommend not writing row by row to a file unless the file is so large that you cannot hold it in memory; in that case, consider partitioning the data or using Dask.
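For the Dask route, a minimal sketch (the file names and glob pattern here are placeholders): dask.dataframe reads the pieces lazily and writes partitioned output without ever holding everything in memory.
import dask.dataframe as dd

ddf = dd.read_csv('big-*.csv')        # placeholder glob; reads all matching files lazily
ddf.to_csv('out-*.csv', index=False)  # writes one CSV per partition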
I just realized I could load my df2 as a pd.Series and iterate over its index, rather than its columns, to get where I wanted to be.
import pandas as pd

df1 = pd.DataFrame({'Title': ['A1', 'A2', 'A3', 'C1', 'C2', 'C3'],
                    'ID': ['B1', 'B2', 'B3', 'A1', 'D2', 'D3'],
                    'Whole': ['full', 'full', 'full', 'semi', 'semi', 'semi']})
df2 = pd.Series({'Group1': ['A1', 'A2', 'A3'],
                 'Group2': ['B1', 'B2']})
df = pd.DataFrame()
for index in df2.index:
    d_group = df1[df1.ID.isin(df2[index])]
    # rename without inplace=True so the renamed Series is returned for chaining
    df3 = d_group.groupby('Whole')['Whole'].count()\
        .rename(index)\
        .reindex(['part', 'full', 'semi'], fill_value='-')
    # DataFrame.append was removed in pandas 2.0; concat the row instead
    df = pd.concat([df, df3.to_frame().T])
print(df)

Concat DataFrame under specific condition

For the following dataframes, which are stored in a list of lists, I want to concat them when there is something to concatenate:
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
fr_list = [[] for x in range(2)]
fr_list[0].append(df1)
fr_list[0].append(df1)
fr_list[1].append(df1)
for x in range(2):
    df = pd.concat(fr_list[x] if len(fr_list[x]) > 1)  # <-- here is the problem
The syntax you want is probably:
...
df = pd.concat((fr for fr in fr_list[x] if len(fr) > 1))
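If the goal was instead to concatenate a sub-list only when it contains more than one frame (one reading of the question), a plain guard works; this is a sketch, not the original answer:
for x in range(2):
    if len(fr_list[x]) > 1:
        df = pd.concat(fr_list[x])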

Will output change if right_index is used instead of left_index with both left_on and right_on defined in pandas dataframe

I have two dataframes:
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']}, index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A0', 'A5', 'A6', 'A7'],
                    'B': ['B1', 'B5', 'B6', 'B7'],
                    'C': ['A1', 'C5', 'C6', 'C7'],
                    'D': ['B1', 'D5', 'D6', 'D7']}, index=[4, 5, 6, 7])
Output I
pd.merge(df1, df2, how='outer', left_index=True, left_on='A', right_on='A')
Output II
pd.merge(df1, df2, how='outer', right_index=True, left_on='A', right_on='A')
Why do these two outputs differ? Basically, I'm asking for clarification on how right_index and left_index work.
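As a general note on what these flags mean (a minimal sketch with made-up data, not from the original post): left_index=True tells merge to use the left frame's index as its join key instead of a column, and right_index=True does the same for the right frame. For example, joining a column of the left frame against the index of the right frame:
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b']}, index=[10, 20])
right = pd.DataFrame({'val': [1, 2]}, index=['a', 'b'])
# 'key' values in left are matched against the index labels of right
print(pd.merge(left, right, left_on='key', right_index=True))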

column concat by specific column pandas

I have two DataFrames:
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['j0', 'j1', 'j2'])
right = pd.DataFrame({'A': ['A1', 'A0', 'A2'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])
I want to column-bind them by the 'A' column. How can I achieve this?
And I want to do this with pd.concat, not pd.merge.
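One possible approach (a sketch, not an accepted answer from the thread): since concat aligns on the index along axis=1, move 'A' into the index of both frames. Using the left and right frames defined above:
# concat aligns on the index, so binding by column 'A' means indexing by it
result = pd.concat([left.set_index('A'), right.set_index('A')], axis=1)
print(result)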

Pandas dataframe add columns with automatic adding missing indices

I have two simple dataframes, df1 and df2 (shown as tables in the original post).
I want to add df2 to df1 using something like:
df1["CF 0.3"] = df2
However, this only adds values where the indexes of df1 and df2 match. I would like a way to add a column so that missing indexes are added automatically, and any index without an associated value is filled with NaN.
The way I did this was by writing:
df1 = df1.add(df2)
This automatically adds the missing indexes, but all of the values are NaN. I then populated the values manually by writing:
df1["CF 0.1"] = dummyDF1
df1["CF 0.3"] = dummyDF2
Is there an easier way to do this? I have a feeling I am missing something.
I hope you understand my question :)
Use concat; refer to the pandas merging documentation for detailed help.
Here is an example based on that documentation:
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'X': ['A4', 'A5', 'A6', 'A7'],
                    'XB': ['B4', 'B5', 'B6', 'B7'],
                    'XC': ['C4', 'C5', 'C6', 'C7'],
                    'XD': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df3 = pd.DataFrame({'YA': ['A8', 'A9', 'A10', 'A11'],
                    'YB': ['B8', 'B9', 'B10', 'B11'],
                    'YC': ['C8', 'C9', 'C10', 'C11'],
                    'YD': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])
# To get the desired result you need to reset the index.
# A merge would not work with these dataframes either,
# since merge needs a common index or column.
frames = [df1.reset_index(drop=True), df2.reset_index(drop=True), df3.reset_index(drop=True)]
df4 = pd.concat(frames, axis=1)
print(df4)
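Closer to the original question (a sketch with made-up values, since the asker's frames were only shown as images): concatenating along axis=1 without resetting the index takes the union of the indexes and fills the holes with NaN.
s1 = pd.Series([1.0, 2.0], index=[0, 1], name='CF 0.1')
s2 = pd.Series([3.0, 4.0], index=[1, 2], name='CF 0.3')
# axis=1 aligns on the index; index 0 has no 'CF 0.3' value and index 2 no 'CF 0.1'
print(pd.concat([s1, s2], axis=1))
#    CF 0.1  CF 0.3
# 0     1.0     NaN
# 1     2.0     3.0
# 2     NaN     4.0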
Please read the docs and use concat, merge, or join:
http://pandas.pydata.org/pandas-docs/stable/merging.html#brief-primer-on-merge-methods-relational-algebra
Have a look at the concat function, which does what you're looking for.
