I'm working on a small tool that does some calculations on a dataframe, let's say something like this:
df['column_c'] = df['column_a'] + df['column_b']
For this to work, the dataframe needs to have the columns 'column_a' and 'column_b'. I would like this code to also work if the columns are named slightly differently in the import file (csv or xlsx), for example 'columnA', 'Col_a', etc.
The easiest way would be renaming the columns inside the imported file, but let's assume this is not possible. Therefore I would like to do something like this:
if column name is in list ['columnA', 'Col_A', 'col_a', 'a'... ] rename it to 'column_a'
I was thinking about having a dictionary of possible column names: when a column name is in this dictionary, it gets renamed to 'column_a'. An additional complication is that the columns can be in arbitrary order.
How would one solve this problem?
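The dictionary idea described above can be sketched like this (the alias lists here are hypothetical placeholders; extend them to match your actual import files):

```python
import pandas as pd

# Hypothetical alias lists mapping each canonical name to its known variants.
aliases = {
    'column_a': ['columnA', 'Col_A', 'col_a', 'a'],
    'column_b': ['columnB', 'Col_B', 'col_b', 'b'],
}

# Invert to a flat alias -> canonical-name lookup.
rename_map = {alias: canon for canon, variants in aliases.items() for alias in variants}

df = pd.DataFrame({'Col_A': [1, 2], 'columnB': [3, 4]})
df = df.rename(columns=rename_map)  # column order is irrelevant
print(list(df.columns))  # ['column_a', 'column_b']
```

Because rename works on a mapping, the original column order does not matter; each column is looked up independently.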
I recommend you formulate the conversion logic and write a function accordingly:
lst = ['columnA', 'Col_A', 'col_a', 'a']

def converter(x):
    return 'column_' + x[-1].lower()

res = list(map(converter, lst))
# ['column_a', 'column_a', 'column_a', 'column_a']
You can then use this directly in pd.DataFrame.rename:
df = df.rename(columns=converter)
Example usage:
df = pd.DataFrame(columns=['columnA', 'col_B', 'c'])
df = df.rename(columns=converter)
print(df.columns)
Index(['column_a', 'column_b', 'column_c'], dtype='object')
Simply build a new list of names and assign it back (a pandas Index does not support item assignment):
new_columns = list(df.columns)
for index, column_name in enumerate(df.columns):
    if column_name in ['columnA', 'Col_A', 'col_a']:
        new_columns[index] = 'column_a'
df.columns = new_columns
With a dictionary:
dico = {'column_a': ['columnA', 'Col_A', 'col_a'], 'column_b': ['columnB', 'Col_B', 'col_b']}
new_columns = list(df.columns)
for index, column_name in enumerate(df.columns):
    for name, ex_names in dico.items():
        if column_name in ex_names:
            new_columns[index] = name
df.columns = new_columns
This should solve it:
df = pd.DataFrame({'colA': [1, 2], 'columnB': [3, 4]})

def rename_df(col):
    if col in ['columnA', 'Col_A', 'colA']:
        return 'column_a'
    if col in ['columnB', 'Col_B', 'colB']:
        return 'column_b'
    return col

df = df.rename(rename_df, axis=1)
if you have the list of other names like list_othername_A or list_othername_B, you can do:
for col_name in df.columns:
    if col_name in list_othername_A:
        df = df.rename(columns={col_name: 'column_a'})
    elif col_name in list_othername_B:
        df = df.rename(columns={col_name: 'column_b'})
    elif ...
EDIT: using the dictionary from #djangoliv, you can make it even shorter:
dico = {'column_a':['columnA', 'Col_A', 'col_a' ], 'column_b':['columnB', 'Col_B', 'col_b' ]}
#create a dict to rename, kind of reverse dico:
dict_rename = {col:key for key in dico.keys() for col in dico[key]}
# then just rename:
df = df.rename(columns = dict_rename )
Note that this method does not work if df contains both 'columnA' and 'Col_A' (the rename would create duplicate columns), but otherwise it should work, since rename does not care if a key in dict_rename is missing from df.columns.
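To illustrate that last point with a small sketch (hypothetical data): rename silently ignores mapping keys that are absent from df.columns:

```python
import pandas as pd

dico = {'column_a': ['columnA', 'Col_A', 'col_a'],
        'column_b': ['columnB', 'Col_B', 'col_b']}
dict_rename = {col: key for key in dico.keys() for col in dico[key]}

# Only 'columnA' is present in df; the other keys in dict_rename are ignored.
df = pd.DataFrame({'columnA': [1], 'other': [2]})
df = df.rename(columns=dict_rename)
print(list(df.columns))  # ['column_a', 'other']
```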
Related
I'm trying to write a function that will backfill columns in a dataframe adhering to a condition. The upfill should only be done within groups. I am however having a hard time getting the group object to ungroup. I have tried reset_index as in the example below, but that gets an AttributeError.
Accessing the original df through result.obj doesn't lead to the updated value because there is no inplace for the groupby bfill.
def upfill(df: DataFrameGroupBy) -> DataFrameGroupBy:
    for column in df.obj.columns:
        if column.startswith("x"):
            df[column].bfill(axis="rows", inplace=True)
    return df
Assigning the dataframe column in the function doesn't work because a groupby object doesn't support item assignment.
def upfill(df: DataFrameGroupBy) -> DataFrameGroupBy:
    for column in df.obj.columns:
        if column.startswith("x"):
            df[column] = df[column].bfill()
    return df
The test I'm trying to get to pass:
def test_upfill():
    df = DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": [1, 2, 2, 3, 3],
        "x_value": [4, 4, None, None, 5],
    })
    grouped_df = df.groupby("group")
    result = upfill(grouped_df)
    result.reset_index()
    assert result["x_value"].equals(Series([4, 4, None, 5, 5]))
You should use the 'transform' method on the grouped DataFrame, like this:
import pandas as pd

def test_upfill():
    df = pd.DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": [1, 2, 2, 3, 3],
        "x_value": [4, 4, None, None, 5],
    })
    result = df.groupby("group").transform(lambda x: x.bfill())
    assert result["x_value"].equals(pd.Series([4, 4, None, 5, 5]))

test_upfill()
Here you can find more information about the transform method on GroupBy objects.
Based on the accepted answer, this is the full solution I arrived at, although I have read elsewhere that there are issues with using the obj attribute.
def upfill(df: DataFrameGroupBy) -> DataFrameGroupBy:
    columns = [column for column in df.obj.columns if column.startswith("x")]
    df.obj[columns] = df[columns].transform(lambda x: x.bfill())
    return df

def test_upfill():
    df = DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": [1, 2, 2, 3, 3],
        "x_value": [4, 4, None, None, 5],
    })
    grouped_df = df.groupby("group")
    result = upfill(grouped_df)
    assert df["x_value"].equals(Series([4, 4, None, 5, 5]))
I have 5 different dataframes that are the output of different conditions or tables.
I want an output telling whether these dataframes are empty or not. Basically I will check each dataframe with len(df) and produce a string if it has anything in it.
def(df1, df2, df3, df4, df5)
    if len(df1) > 0:
        "df1 not empty"
    else: ""
    if len(df2) > 0:
        "df2 not empty"
    else: ""
Then I want to append these strings to each other and end up with a string like
**df1 not empty, df3 not empty**
Try this:
import pandas as pd
dfs = {'milk': pd.DataFrame(['a']), 'bread': pd.DataFrame(['b']), 'potato': pd.DataFrame()}
print(''.join(
    [f'{name} not empty. ' for name, df in dfs.items() if not df.empty]
))
output:
milk not empty. bread not empty.
data = [1, 2, 3]
df = pd.DataFrame(data, columns=['col1'])  # create a non-empty df
data1 = []
df1 = pd.DataFrame(data1)  # create an empty df
dfs = [df, df1]  # list them
# The "for loop" is replaced here by a list comprehension.
# enumerate() attributes an index to each df in the list of dfs, because
# otherwise printing the df directly would print the entire dataframe, not its name.
print(' '.join([f'df{i} is not empty.' for i, df in enumerate(dfs) if not df.empty]))
Result:
df0 is not empty.
With a one-liner:
dfs = [df1,df2,df3,df4,df5]
output = ["your string here" for df in dfs if not df.empty]
You can then concatenate strings together, if you want:
final_string = "; ".join(output)
I have a list of filepaths in the first column of a dataframe. My goal is to create a second column that represents file categories, with categories reflecting the words in the filepath.
import pandas as pd
import numpy as np
data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']
}
df = pd.DataFrame(data)
df["Animal"] =(df['filepath'].str.contains("dog|cat",case=False,regex=True))
df["Fish"] =(df['filepath'].str.contains("barracuda",case=False))
df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)
def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)
df = df.apply(squeeze_nan, axis=1)
print(df)
This code works. The problem arises when I have 200 statements beginning with df['columnName'] =. Because I have so many, I get the error:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
To fix this I have tried:
dfAnimal = df.copy
dfAnimal['Animal'] = dfAnimal['filepath'].str.contains("dog|cat",case=False,regex=True)
dfFish = df.copy
dfFish["Fish"] =dfFish['filepath'].str.contains("barracuda",case=False)
df = pd.concat(dfAnimal,dfFish)
The above gives me errors such as method object is not iterable and method object is not subscriptable. I then tried df = df.loc[df['filepath'].isin(['cat','dog'])] but this only works when 'cat' or 'dog' is the only word in the column. How do I avoid the performance error?
Try creating all your new columns in a dict, and then convert that dict into a dataframe, and then use pd.concat to add the resulting dataframe (containing the new columns) to the original dataframe:
new_columns = {
    'Animal': df['filepath'].str.contains("dog|cat", case=False, regex=True),
    'Fish': df['filepath'].str.contains("barracuda", case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
Added to your original code, it would be something like this:
import pandas as pd
import numpy as np
data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']
}
df = pd.DataFrame(data)
##### These are the new lines #####
new_columns = {
    'Animal': df['filepath'].str.contains("dog|cat", case=False, regex=True),
    'Fish': df['filepath'].str.contains("barracuda", case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
##### End of new lines #####
df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)
def squeeze_nan(x):
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)
df = df.apply(squeeze_nan, axis=1)
print(df)
def Transformation_To_UpdateNex(df):
    s = 'TERM-ID,NAME,QUALIFIER,FACET1_ID,FACET2_ID,FACET3_ID,FACET4_ID,GROUP1_ID,GROUP2_ID,GROUP3_ID,GROUP4_ID,IS_VALID,IS_SELLABLE,IS_PRIMARY,IS_BRANCHABLE,HAS_RULES,FOR_SUGGESTION,IS_SAVED,S_NEG,SCORE,GOOGLE_SV,CPC,SINGULARTEXT,SING_PLU_VORGABE'
    df_Import = pd.DataFrame(columns=s.split(','))
    d = {'TERMID': 'TERM-ID', 'NAMECHANGE': 'NAME', 'TYP': 'QUALIFIER'}
    df_Import = df.rename(columns=d).reindex(columns=df_Import.columns)
    df_Import.to_csv("Update.csv", sep=";", index=False, encoding="ISO-8859-1")
ValueError: cannot reindex from a duplicate axis
I am trying to take values from a filled Dataframe and transfer these values keeping the same structure to my new Dataframe (empty one described first in the code).
Any ideas how to solve the value error?
So the error:
ValueError: cannot reindex from a duplicate axis
means there are duplicate column names.
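A minimal reproduction of that situation (hypothetical column names): reindexing a frame whose columns already contain duplicates raises this ValueError (the exact message wording varies between pandas versions):

```python
import pandas as pd

# Columns already contain a duplicate 'NAME':
df = pd.DataFrame([[1, 2, 3]], columns=['NAME', 'NAME', 'QUALIFIER'])

try:
    df.reindex(columns=['NAME', 'QUALIFIER'])
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```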
I think the problem is with rename, because it creates duplicate columns:
s = 'TERM-ID,NAME,QUALIFIER,FACET1_ID,NAMECHANGE,TYP'
df = pd.DataFrame(columns = s.split(','))
print (df)
Empty DataFrame
Columns: [TERM-ID, NAME, QUALIFIER, FACET1_ID, NAMECHANGE, TYP]
Index: []
Here, after the rename you get duplicated NAME and QUALIFIER columns, because the original columns contain both NAME and NAMECHANGE, and likewise the QUALIFIER and TYP pair:
d = {'TERMID':'TERM-ID', 'NAMECHANGE':'NAME', 'TYP':'QUALIFIER'}
df1 = df.rename(columns = d)
print (df1)
Empty DataFrame
Columns: [TERM-ID, NAME, QUALIFIER, FACET1_ID, NAME, QUALIFIER]
Index: []
A possible solution is to test whether the column already exists and filter the dictionary:
d = {'TERMID':'TERM-ID', 'NAMECHANGE':'NAME', 'TYP':'QUALIFIER'}
d1 = {k: v for k, v in d.items() if v not in df.columns}
print (d1)
{}
df1 = df.rename(columns = d1)
print (df1)
Empty DataFrame
Columns: [TERM-ID, NAME, QUALIFIER, FACET1_ID, NAMECHANGE, TYP]
Index: []
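Applied to a frame that actually holds data (hypothetical values), the filtered dictionary renames only the columns whose target names don't already exist:

```python
import pandas as pd

# Hypothetical frame that already contains the target column 'NAME':
df = pd.DataFrame({'TERMID': [1], 'NAME': ['x'], 'TYP': ['q']})

d = {'TERMID': 'TERM-ID', 'NAMECHANGE': 'NAME', 'TYP': 'QUALIFIER'}
# Drop mappings whose target name already exists in df:
d1 = {k: v for k, v in d.items() if v not in df.columns}

df1 = df.rename(columns=d1)
print(list(df1.columns))  # ['TERM-ID', 'NAME', 'QUALIFIER']
```

Here 'NAMECHANGE' -> 'NAME' is filtered out because 'NAME' already exists, so no duplicate column is created.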
I created this dataframe:
import pandas as pd
columns = pd.MultiIndex.from_tuples([("x", "", ""), ("values", "a", "a.b"), ("values", "c", "")])
df0 = pd.DataFrame([(0,10,20),(1,100,200)], columns=columns)
df0
I unload df0 to excel:
df0.to_excel("test.xlsx")
and load it again:
df1 = pd.read_excel("test.xlsx", header=[0,1,2])
df1
And I get 'Unnamed: ...' column names.
To make df1 look like inital df0 I run:
def rename_unnamed(df, label=""):
    for i, columns in enumerate(df.columns.levels):
        columns = columns.tolist()
        for j, row in enumerate(columns):
            if "Unnamed: " in row:
                columns[j] = ""
        df.columns.set_levels(columns, level=i, inplace=True)
    return df
rename_unnamed(df1)
Well done. But is there any out-of-the-box pandas way to do this?
Since pandas 0.21.0 the code should be like this
def rename_unnamed(df):
    """Rename unnamed column names for a pandas DataFrame.

    See https://stackoverflow.com/questions/41221079/rename-multiindex-columns-in-pandas

    Parameters
    ----------
    df : pd.DataFrame object
        Input dataframe

    Returns
    -------
    pd.DataFrame
        Output dataframe
    """
    for i, columns in enumerate(df.columns.levels):
        columns_new = columns.tolist()
        for j, row in enumerate(columns_new):
            if "Unnamed: " in row:
                columns_new[j] = ""
        if pd.__version__ < "0.21.0":  # https://stackoverflow.com/a/48186976/716469
            df.columns.set_levels(columns_new, level=i, inplace=True)
        else:
            df = df.rename(columns=dict(zip(columns.tolist(), columns_new)),
                           level=i)
    return df
Mixing the answers from #jezrael and #dinya, and limited to pandas above 0.21.0 (after 2017), an option to solve this would be:
for i, columns_old in enumerate(df.columns.levels):
    columns_new = np.where(columns_old.str.contains('Unnamed'), '-', columns_old)
    df.rename(columns=dict(zip(columns_old, columns_new)), level=i, inplace=True)
You can use numpy.where with a contains condition:
for i, col in enumerate(df1.columns.levels):
    columns = np.where(col.str.contains('Unnamed'), '', col)
    df1.columns.set_levels(columns, level=i, inplace=True)

print(df1)
   x values
          a    c
        a.b
0  0     10   20
1  1    100  200