When Group2 in my df2 has a third item 'B3', the groupby gives me what I want. How can I get the same output when the arrays have different lengths?
I also struggle to get all the data into the CSV, not just the last iteration. I tried creating the df before the loop and merging into it inside the loop, but that doesn't work.
import pandas as pd

df1 = pd.DataFrame({'Title': ['A1', 'A2', 'A3', 'B1', 'B2', 'C13'],
                    'Whole': ['full', 'full', 'full', 'semi', 'semi', 'semi']})
df2 = pd.DataFrame({'Group1': ['A1', 'A2', 'A3'],
                    'Group2': ['B1', 'B2']})  # fails: the columns have different lengths

for column in df2.columns:
    d_group = df1[df1.Title.isin(df2[column])]
    df = d_group.groupby('Whole')['Whole'].count()\
                .rename('Column Name from df2')\
                .reindex(['part', 'full', 'semi'], fill_value='-')\
                .reset_index()
    df.T.to_csv('all_groups2.csv', header=False, index=True)  # overwritten every iteration
    print(df.T)
Desired output:
Whole | part | full | semi
--------+---------+----------+----------
Group1 | - | 3 | -
Group2 | - | - | 2
In a pandas DataFrame, all columns (or rows) must have the same length, so the df2 in your code cannot be constructed; pandas raises a ValueError.
I recommend using Series instead, something like this:
df1 = pd.DataFrame({'Title': ['A1', 'A2', 'A3', 'B1', 'B2', 'C13'],
                    'Whole': ['full', 'full', 'full', 'semi', 'part', 'semi']})
group1 = pd.Series(['A1', 'A2', 'A3'])
group2 = pd.Series(['B1', 'B2'])
Then you can filter df1 with the isin function and group by 'Whole':
dfg1 = df1[df1['Title'].isin(group1)].groupby('Whole').count()
dfg2 = df1[df1['Title'].isin(group2)].groupby('Whole').count()
And finally join them with pd.concat on axis=1:
res = pd.concat([dfg1, dfg2], axis=1)
res.columns = ['Group1','Group2']
finaldf = res.T
The result is the following:
        full  part  semi
Group1   3.0   NaN   NaN
Group2   NaN   1.0   1.0
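If you want the '-' placeholders from your desired output, one option (not part of the answer above) is to fill the gaps before writing:
finaldf = res.T
# Replace the NaN gaps with '-' to match the desired layout;
# note the remaining counts stay floats (3.0) unless cast explicitly
finaldf = finaldf.fillna('-')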
And finally, you can write it to a CSV with the same code that you had:
finaldf.to_csv('result.csv', header=False, index=True)
I recommend not writing to the file row by row, unless the data is so large that you cannot hold it in memory; in that case, partition it or use Dask.
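For example, a minimal sketch of partitioned output in plain pandas (big_df and the chunk size are hypothetical; pick a size that fits your memory):
import pandas as pd

# Hypothetical large frame; in practice this comes from your own pipeline
big_df = pd.DataFrame({'x': range(1_000_000)})

chunk_size = 100_000  # assumption: small enough to hold comfortably in memory
for i, start in enumerate(range(0, len(big_df), chunk_size)):
    # Each slice goes to its own part file instead of row-by-row appends
    big_df.iloc[start:start + chunk_size].to_csv(f'part_{i:04d}.csv', index=False)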
I just realized I could load my df2 as a pd.Series and iterate over its index instead of over columns to get where I wanted to be.
import pandas as pd

df1 = pd.DataFrame({'Title': ['A1', 'A2', 'A3', 'C1', 'C2', 'C3'],
                    'ID': ['B1', 'B2', 'B3', 'A1', 'D2', 'D3'],
                    'Whole': ['full', 'full', 'full', 'semi', 'semi', 'semi']})
# Unlike DataFrame columns, the values of a Series may be lists of different lengths
df2 = pd.Series({'Group1': ['A1', 'A2', 'A3'],
                 'Group2': ['B1', 'B2']})

rows = []
for index in df2.index:
    d_group = df1[df1.ID.isin(df2[index])]
    # rename(index, inplace=True) would return None and break the chain
    row = (d_group.groupby('Whole')['Whole'].count()
                  .rename(index)
                  .reindex(['part', 'full', 'semi'], fill_value='-'))
    rows.append(row)

# DataFrame.append was removed in pandas 2.0; build the frame with concat instead
df = pd.concat(rows, axis=1).T
print(df)
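This prints one row per group:
Whole  part full semi
Group1    -    -    1
Group2    -    2    -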
I'd like to create a function that takes an undefined number of arrays, turns them into dataframes, concatenates them column-wise, and returns the merged dataframe.
Example:
# Suppose we have 3 arrays:
data1 = {
    'A': ['A1', 'A2', 'A3', 'A4', 'A5'],
    'B': ['B1', 'B2', 'B3', 'B4', 'B5'],
    'C': ['C1', 'C2', 'C3', 'C4', 'C5'],
}
data2 = {
    'D': ['D1', 'D2', 'D3', 'D4', 'D5'],
    'E': ['E1', 'E2', 'E3', 'E4', 'E5'],
    'F': ['F1', 'F2', 'F3', 'F4', 'F5'],
}
data3 = {
    'G': ['G1', 'G2', 'G3', 'G4', 'G5'],
    'H': ['H1', 'H2', 'H3', 'H4', 'H5'],
    'I': ['I1', 'I2', 'I3', 'I4', 'I5'],
}
# We could convert them into data frames using:
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df3 = pd.DataFrame(data3)
# And finally join them with:
df4 = pd.concat([df1, df2, df3], axis=1)
The output dataframe would look like this:
    A   B   C   D   E   F   G   H   I
0  A1  B1  C1  D1  E1  F1  G1  H1  I1
1  A2  B2  C2  D2  E2  F2  G2  H2  I2
2  A3  B3  C3  D3  E3  F3  G3  H3  I3
3  A4  B4  C4  D4  E4  F4  G4  H4  I4
4  A5  B5  C5  D5  E5  F5  G5  H5  I5
I would like to create a function that can do this with an unspecified number of arrays, for example:
func(data1, data2)
func(data1, data2, data3)
func(data1, data2, data...n)
Here is a short answer using a list comprehension, provided by Ch3steR; it works and is very compact:
def func(*args): d = [pd.DataFrame(dc) for dc in args]; return pd.concat(d, axis=1)
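Called as in the question, for example func(data1, data2, data3), it returns the same merged dataframe as df4 above.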
In the end I went for a longer and slower solution, but one that I will easily understand when looking at my code in the future:
def add_df(*args):
    """Concatenate the columns of an unlimited number of dataframes,
    each read from a CSV file path."""
    frames = []  # avoid shadowing the built-in `list`
    for file in args:
        frames.append(pd.read_csv(file))
    return pd.concat(frames, axis=1)
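Used with hypothetical CSV paths, a call looks like:
merged = add_df('data1.csv', 'data2.csv', 'data3.csv')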
I'd like to remove, from each list in column B, the value found in column A, and I'm wondering how.
Given:
df = pd.DataFrame({
    'A': ['a1', 'a2', 'a3', 'a4'],
    'B': [['a1', 'a2'], ['a1', 'a2', 'a3'], ['a1', 'a3'], []]
})
I want:
result = pd.DataFrame({
    'A': ['a1', 'a2', 'a3', 'a4'],
    'B': [['a1', 'a2'], ['a1', 'a2', 'a3'], ['a1', 'a3'], []],
    'Output': [['a2'], ['a1', 'a3'], ['a1'], []]
})
One way of doing that is applying a filtering function to each row via DataFrame.apply:
df['Output'] = df.apply(lambda x: [i for i in x.B if i != x.A], axis=1)
Another solution uses iterrows(); note that, unlike the apply approach, it removes the value from column B's lists in place instead of creating a separate Output column:
for i, value in df.iterrows():
    try:
        # list.remove drops the first matching item; ValueError means
        # the value from column A was not present in the list
        value['B'].remove(value['A'])
    except ValueError:
        pass

print(df)
Output:
    A         B
0  a1      [a2]
1  a2  [a1, a3]
2  a3      [a1]
3  a4        []
I have two DataFrames like:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['j0', 'j1', 'j2'])
right = pd.DataFrame({'A': ['A1', 'A0', 'A2'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])
I want to column-bind them by the 'A' column. How can I achieve this? And I want to do it with pd.concat, not pd.merge.
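A minimal sketch of one way to do this (assuming 'A' is the key on both sides): pd.concat aligns on the index, so move 'A' into the index first.
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['j0', 'j1', 'j2'])
right = pd.DataFrame({'A': ['A1', 'A0', 'A2'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

# concat joins on the index, so make 'A' the index of both frames first
out = pd.concat([left.set_index('A'), right.set_index('A')], axis=1).reset_index()
print(out)
#     A   B   D
# 0  A0  B0  D2
# 1  A1  B1  D0
# 2  A2  B2  D3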
When I merge two simple dataframes, everything works fine, but when I apply the same code to my real dataframes, the merge does not work correctly.
I want to merge df1 and df2 on column A using a left join.
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5'],
                    'C': ['C0', 'C1', 'C2', 'C3', 'C4', 'C5'],
                    'D': ['D0', 'D1', 'D2', 'D3', 'D4', 'A5']})
result = pd.merge(df1, df2[["A","C"]], how='left', on='A')
In this case the result is correct (result has the same number of rows as df1).
However, when I run the same code on my real data, result has many more rows than df1, closer to the size of df2.
result = pd.merge(df1, df2[["ID","EVENT"]], how='left', on='ID')
The field ID is of type String (astype(str)).
What might be the reason for this? I cannot post the real dataset here, but maybe some indications can still be given based on my explanation. Thanks.
UPDATE:
I checked the result dataframe and I see many duplicated rows with the same ID. Why?
See this slightly modified example (I modified the last two values in column A in df2):
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']})
df2 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3', 'A0', 'A0'],
                    'C': ['C0', 'C1', 'C2', 'C3', 'C4', 'C5'],
                    'D': ['D0', 'D1', 'D2', 'D3', 'D4', 'A5']})
result = pd.merge(df1, df2[["A","C"]], how='left', on='A')
Output:
    A   B   C
0  A0  B0  C0
1  A0  B0  C4
2  A0  B0  C5
3  A1  B1  C1
4  A2  B2  C2
5  A3  B3  C3
There is one result row for each occurrence of A0 in df2. This is also what is happening with your data: duplicated IDs in df2 multiply the matching rows.
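If each key should contribute at most one row, one option is to drop duplicate keys on the right side before merging (a sketch; it assumes keeping the first occurrence per key is acceptable):
# Keep only the first row per key in df2 before the left join
result = pd.merge(df1,
                  df2[['A', 'C']].drop_duplicates(subset='A', keep='first'),
                  how='left', on='A')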
I have the following two simple dataframes, df1 and df2 (tables not shown here).
I want to add df2 to df1 by using something like:
df1["CF 0.3"] = df2
However, this only adds values where the indexes of df1 and df2 match. I would like a way to add the column so that missing indexes are added automatically, with NaN filled in where an index has no associated value.
The way I did this is by writing
df1 = df1.add(df2)
This automatically adds the missing indexes, but all values become NaN. Then I populated the values manually by writing:
df1["CF 0.1"] = dummyDF1
df1["CF 0.3"] = dummyDF2
Is there an easier way to do this? I have a feeling I am missing something.
I hope you understand my question :)
Use concat; refer to the pandas merging documentation for detailed help.
Here is an example based on the documentation:
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'X': ['A4', 'A5', 'A6', 'A7'],
                    'XB': ['B4', 'B5', 'B6', 'B7'],
                    'XC': ['C4', 'C5', 'C6', 'C7'],
                    'XD': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df3 = pd.DataFrame({'YA': ['A8', 'A9', 'A10', 'A11'],
                    'YB': ['B8', 'B9', 'B10', 'B11'],
                    'YC': ['C8', 'C9', 'C10', 'C11'],
                    'YD': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])

# To get the desired result you need to reset the index.
# With these dataframes you may not be able to merge,
# since merge needs a common index or column.
frames = [df1.reset_index(drop=True), df2.reset_index(drop=True), df3.reset_index(drop=True)]
df4 = pd.concat(frames, axis=1)
print(df4)
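Note that for the behavior the question actually describes (align on the existing indexes and fill non-matching positions with NaN), concatenating without resetting the index already does that:
# Without reset_index, concat aligns on the indexes and fills gaps with NaN
df5 = pd.concat([df1, df2, df3], axis=1)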
Please read the docs and use concat, merge, or join:
http://pandas.pydata.org/pandas-docs/stable/merging.html#brief-primer-on-merge-methods-relational-algebra
Have a look at the concat function, which does what you're looking for here.