I have a dataframe (df1) with both numerical and categorical data. I am creating a second dataframe, df2, that resembles df1, meaning it has the same column names and dtypes. Besides the names and dtypes, I would also like df2 to keep the categories of df1's categorical variables, even the ones that don't appear in df2 when I create it.
So far the easiest solution I have found is to loop over all categorical variables in df2, adding the categories of the corresponding variable in df1. However, I believe there must be a faster/more efficient solution than the one I am proposing.
import pandas as pd

df1 = pd.DataFrame({
    'A': pd.Categorical(list('bbeebbaa'), categories=['e', 'a', 'b'], ordered=True),
    'B': [1, 2, 1, 2, 2, 1, 2, 1],
    'C': pd.Categorical(list('ddeeccaa'), categories=['e', 'a', 'd', 'c'], ordered=True)})
df2 = pd.DataFrame({
    'A': pd.Categorical(list('bbeebbbb'), categories=['e', 'b'], ordered=True),
    'B': [1, 2, 1, 2, 2, 1, 2, 1],
    'C': pd.Categorical(list('cccccccc'), categories=['c'], ordered=True)})
categorical = ['A', 'C']  # the categorical columns ('B' is numeric)
for var in categorical:
    # add only the categories df2 is missing; add_categories()
    # raises a ValueError if a category already exists
    missing = df1[var].cat.categories.difference(df2[var].cat.categories)
    df2[var] = df2[var].cat.add_categories(missing)
If all categories of df2 are in df1, you can use the set_categories() function.
l = list(df1['A'].cat.categories)
df2['A'] = df2['A'].cat.set_categories(l)
Or in one line:
df2['A'] = df2['A'].cat.set_categories(list(df1['A'].cat.categories))
If both df1 and df2 contain categories unique to them, I am not sure how I would handle it; probably similarly to the approach you present here.
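For that case, one possibility (just a sketch, assuming you want the union of both category sets and that a sorted union is an acceptable ordering) is to union the two category indexes and pass the result to set_categories():
# Index.union() returns a sorted union, so the original ordered ranking may change
union_cats = df1['A'].cat.categories.union(df2['A'].cat.categories)
df2['A'] = df2['A'].cat.set_categories(union_cats, ordered=True)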
Below are my two dataframes:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 10, (5, 3)), columns=['a', 'b', 'c'])
dd = pd.DataFrame(np.repeat(3, 2), columns=['a'])
I want to replace column 'a' of df with the values in column 'a' of dd. Any rows left empty should be filled with zero, but only for column 'a'; all other columns of df remain unchanged.
So column 'a' should contain 3, 3, 0, 0, 0.
This is probably not the cleanest way, but it works.
df['a'] = dd['a']            # aligns on the index; rows 2-4 of 'a' become NaN
df['a'] = df['a'].fillna(0)  # then fill those NaNs with zero
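An equivalent one-liner (same idea, still relying on index alignment) is reindex(), which fills the missing rows in a single step:
# align dd's 'a' to df's index; rows absent from dd become 0
df['a'] = dd['a'].reindex(df.index, fill_value=0)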
I am trying to get the rows where the values in a column differ between two data frames.
For example, let's say we have the two dataframes below:
import pandas as pd

data1 = {'date': [20210701, 20210704, 20210703, 20210705, 20210705],
         'name': ['Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
         'a': [1, 0, 1, 1, 0]}
data2 = {'date': [20210701, 20210702, 20210704, 20210703, 20210705, 20210705],
         'name': ['Dave', 'Dave', 'Dave', 'Sue', 'Sue', 'Ann'],
         'a': [1, 0, 1, 1, 0, 0]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
As you can see, Dave has a different value in column 'a' on 20210704, and Sue has a different value in column 'a' on 20210705. Therefore, my desired output should be something like:
output = {'date': [20210704, 20210705],
          'name': ['Dave', 'Sue'],
          'a_from_old': [0, 1]}
df_output = pd.DataFrame(output)
If I am not mistaken, what I am asking for is pretty much the same as the MINUS (EXCEPT) statement in SQL, unless I am missing some edge cases.
How can I find those rows with the same date and name but different values in a column?
Update
I found an edge case where some data is not even in the other data frame; I want to find the rows that are in both data frames but whose value in column 'a' is different.
I edited the sample data set to take this edge case into account.
(Notice that Dave on 20210702 does not appear in the final output because that row is not in the first data frame.)
Another option, using an inner merge and keeping only the rows where the a from df1 does not equal the a from df2:
df3 = (
    df1.merge(df2, on=['date', 'name'], suffixes=('_from_old', '_df2'))
       .query('a_from_old != a_df2')  # keep rows where df1's `a` != df2's `a`
       .drop('a_df2', axis=1)         # remove the column holding df2's `a` values
)
df3:
date name a_from_old
2 20210704 Dave 0
4 20210705 Sue 1
Try a left merge() with indicator=True, then filter the result with query(), drop the extra column with drop(), and rename 'a' to 'a_from_old' using rename():
out = (df1.merge(df2, on=['date', 'name', 'a'], how='left', indicator=True)
          .query("_merge == 'left_only'")
          .drop(columns='_merge')  # drop('_merge', 1) is deprecated; name the axis
          .rename(columns={'a': 'a_from_old'}))
output of out:
date name a_from_old
2 20210704 Dave 0
4 20210705 Sue 1
Note: if there are many more columns that you want to rename, pass suffixes=('_from_old', '') to merge() instead of calling rename() per column. Also be aware that 'left_only' additionally catches rows of df1 whose (date, name) pair is absent from df2 entirely, so for the edge case described in the update the inner-merge approach above is the safer choice.
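For example, a rough sketch of that variant (not the exact code above; here the merge is on the key columns only, so every shared value column gets the '_from_old' suffix automatically and no rename() is needed):
out = (df1.merge(df2, on=['date', 'name'], suffixes=('_from_old', ''))
          .query('a_from_old != a')  # repeat for each value column to compare
          .drop(columns='a'))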
My question is similar to this one; however, I do need to rename columns, because I aggregate my data using functions:
def series(x):
    return ','.join(str(item) for item in x)

agg = {
    'revenue': ['sum', series],
    'roi': ['sum', series],
}
df.groupby('name').agg(agg)
As a result I have groups of identically named columns (a 'sum' and a 'series' column under each of 'revenue' and 'roi'), which become completely indistinguishable after I drop the upper column level:
df.columns = df.columns.droplevel(0)
So, how do I go about keeping unique names for my columns?
Use map to flatten the column names:
df.columns = df.columns.map('_'.join)
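If you'd rather avoid the MultiIndex altogether, named aggregation (available since pandas 0.25) is another way to get unique single-level names directly; a sketch assuming the series helper defined in the question:
out = df.groupby('name').agg(revenue_sum=('revenue', 'sum'),
                             revenue_series=('revenue', series),
                             roi_sum=('roi', 'sum'),
                             roi_series=('roi', series))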
data set:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=['A', 'B', 'C', 'D'], index=['abcd', 'efgh', 'abcd', 'abc123', 'efgh']).reset_index()
s = pd.Series(data=[True, True, False], index=['abcd', 'efgh', 'abc123'], name='availability').reset_index()
(Feel free to remove the reset_index bits above; they are simply there to easily provide a different approach to the problem. However, the result sets from the queries I'm running most closely resemble the above.)
I have two separate queries that return data similar to the above. One query pulls a field from the DB with one column of information that does not exist in the other; the 'index' column is the common key across both tables.
My result set needs the second query's result series injected into the first query's resulting dataframe at a specific column index.
I know that I can simply run:
df = df.merge(s, how='left', on='index')
Then to enforce column order:
df = df[['index', 'A', 'B', 'availability', 'C', 'D']]
I saw that you can do df.insert, but that requires the series to be the same length as the df.
I'm wondering if there is a way to do this without having to run merge and then enforce column order. With my actual dataset, the number of columns is significantly longer. I'd imagine the best solution likely relies on list manipulation, but I'd much rather do something clever with how the dataframe is created in the first place.
df.set_index(['index','id']).index.map(s['availability'])
is returning:
TypeError: 'Series' object is not callable
s is a dataframe with a multi-index and one column, which is a boolean.
df is a dataframe with columns in it that make up s's multi-index.
IIUC:
In [260]: df.insert(3, 'availability',
     ...:           df['index'].map(s.set_index('index')['availability']))
In [261]: df
Out[261]:
index A B availability C D
0 abcd 1.867270 0.517894 True 0.584115 -0.162361
1 efgh -0.036696 1.155110 True -1.112075 2.005678
2 abcd 0.693795 -0.843335 True -1.003202 1.001791
3 abc123 -1.466148 -0.848055 False -0.373293 0.360091
4 efgh -0.436618 -0.625454 True -0.285795 -0.220717
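As for the multi-index follow-up above: in older pandas versions Index.map() only accepted callables, which is consistent with the TypeError shown. A possible workaround (a sketch, assuming s['availability'] is indexed by the same ('index', 'id') pairs that df carries as columns) is to align with reindex() instead of map():
# build the lookup key from df's columns, then align s on it
key = pd.MultiIndex.from_arrays([df['index'], df['id']])
df['availability'] = s['availability'].reindex(key).to_numpy()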
I am using Pandas to select columns from a dataframe, olddf. Let's say the variable names are 'a', 'b', 'c', 'startswith1', 'startswith2', 'startswith3', ..., 'startswith10'.
My approach was to create a list of all variables with a common starting value.
filter_col = [col for col in list(olddf) if col.startswith('startswith')]
I'd like to then select columns within that list as well as others, by name, so I don't have to type them all out. However, this doesn't work:
newdf = olddf['a','b',filter_col]
And this doesn't either:
newdf = olddf[['a','b'],filter_col]
I'm a newbie, so this is probably pretty simple. Is this failing because I'm combining the list improperly?
Thanks.
Use
newdf = olddf[['a','b']+filter_col]
since adding lists concatenates them:
In [264]: ['a', 'b'] + ['startswith1']
Out[264]: ['a', 'b', 'startswith1']
Alternatively, you could use the filter method:
newdf = olddf.filter(regex=r'^(startswith|[ab])')
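As a quick sanity check, both approaches should pick the same columns; a toy sketch with hypothetical column names:
import pandas as pd

olddf = pd.DataFrame(columns=['a', 'b', 'c', 'startswith1', 'startswith2'])
filter_col = [col for col in list(olddf) if col.startswith('startswith')]
list(olddf[['a', 'b'] + filter_col])             # ['a', 'b', 'startswith1', 'startswith2']
list(olddf.filter(regex=r'^(startswith|[ab])'))  # same columns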