def Transformation_To_UpdateNex(df):
s = 'TERM-ID,NAME,QUALIFIER,FACET1_ID,FACET2_ID,FACET3_ID,FACET4_ID,GROUP1_ID,GROUP2_ID,GROUP3_ID,GROUP4_ID,IS_VALID,IS_SELLABLE,IS_PRIMARY,IS_BRANCHABLE,HAS_RULES,FOR_SUGGESTION,IS_SAVED,S_NEG,SCORE,GOOGLE_SV,CPC,SINGULARTEXT,SING_PLU_VORGABE'
df_Import = pd.DataFrame(columns = s.split(','))
d = {'TERMID':'TERM-ID', 'NAMECHANGE':'NAME', 'TYP':'QUALIFIER'}
df_Import = df.rename(columns = d).reindex(columns=df_Import.columns)
df_Import.to_csv("Update.csv", sep=";", index = False, encoding = "ISO-8859-1")
ValueError: cannot reindex from a duplicate axis
I am trying to take the values from a filled DataFrame and transfer them, keeping the same structure, into my new DataFrame (the empty one defined first in the code).
Any ideas how to solve the ValueError?
So the error:
ValueError: cannot reindex from a duplicate axis
means there are duplicated column names.
The problem is with rename, because it creates duplicated columns:
s = 'TERM-ID,NAME,QUALIFIER,FACET1_ID,NAMECHANGE,TYP'
df = pd.DataFrame(columns = s.split(','))
print (df)
Empty DataFrame
Columns: [TERM-ID, NAME, QUALIFIER, FACET1_ID, NAMECHANGE, TYP]
Index: []
Here, after rename, you get duplicated NAME and QUALIFIER columns, because the original columns already contain the pairs NAME/NAMECHANGE and QUALIFIER/TYP:
d = {'TERMID':'TERM-ID', 'NAMECHANGE':'NAME', 'TYP':'QUALIFIER'}
df1 = df.rename(columns = d)
print (df1)
Empty DataFrame
Columns: [TERM-ID, NAME, QUALIFIER, FACET1_ID, NAME, QUALIFIER]
Index: []
A possible solution is to test whether the target column already exists and filter the dictionary:
d = {'TERMID':'TERM-ID', 'NAMECHANGE':'NAME', 'TYP':'QUALIFIER'}
d1 = {k: v for k, v in d.items() if v not in df.columns}
print (d1)
{}
df1 = df.rename(columns = d1)
print (df1)
Empty DataFrame
Columns: [TERM-ID, NAME, QUALIFIER, FACET1_ID, NAMECHANGE, TYP]
Index: []
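Putting that filter into the original function gives a version that no longer trips over the duplicated axis (a sketch; only the rename/reindex part changes):

import pandas as pd

def Transformation_To_UpdateNex(df):
    s = 'TERM-ID,NAME,QUALIFIER,FACET1_ID,FACET2_ID,FACET3_ID,FACET4_ID,GROUP1_ID,GROUP2_ID,GROUP3_ID,GROUP4_ID,IS_VALID,IS_SELLABLE,IS_PRIMARY,IS_BRANCHABLE,HAS_RULES,FOR_SUGGESTION,IS_SAVED,S_NEG,SCORE,GOOGLE_SV,CPC,SINGULARTEXT,SING_PLU_VORGABE'
    df_Import = pd.DataFrame(columns=s.split(','))
    d = {'TERMID': 'TERM-ID', 'NAMECHANGE': 'NAME', 'TYP': 'QUALIFIER'}
    # keep only renames whose target column does not already exist in df,
    # so reindex never sees a duplicated axis
    d1 = {k: v for k, v in d.items() if v not in df.columns}
    df_Import = df.rename(columns=d1).reindex(columns=df_Import.columns)
    df_Import.to_csv("Update.csv", sep=";", index=False, encoding="ISO-8859-1")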
I have a file with 136 columns. I was trying to find the unique values of each column, and from there I need to find the number of rows for those unique values.
I tried using a df and a dict for the unique values. However, when I export back to a csv file, the unique values are exported as a list in one cell for each column.
Is there any way to simplify the process of counting the unique values in each column?
df = pd.read_excel(filename)
column_headers = list(df.columns.values)
df_unique = {}
df_count = {}

def approach_1(data):
    count = 0
    for entry in data:
        if not entry == 'nan' and not entry == 'NaN':
            count += 1
    return count

for unique in column_headers:
    new = df.drop_duplicates(subset=unique, keep='first')
    df_unique[unique] = new[unique].tolist()

csv_unique = pd.DataFrame(df_unique.items(), columns=['Data Source Field', 'First Row'])
csv_unique.to_csv('Unique.csv', index=False)

for count in df_unique:
    not_nan = approach_1(df_unique[count])
    df_count[count] = not_nan

csv_count = pd.DataFrame(df_count.items(), columns=['Data Source Field', 'Count'])
.unique() is simpler: len(df[col].unique()) gives the count.
import pandas as pd

# a list of records (renamed from dict to avoid shadowing the built-in)
records = [
    {"col1": "0", "col2": "a"},
    {"col1": "1", "col2": "a"},
    {"col1": "2", "col2": "a"},
    {"col1": "3", "col2": "a"},
    {"col1": "4", "col2": "a"},
    {"col2": "a"}
]
df = pd.DataFrame.from_dict(records)
result_dict = {}
for col in df.columns:
    result_dict[col] = len(df[col].dropna().unique())
print(result_dict)
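If all you need are the counts, pandas also ships DataFrame.nunique(), which counts distinct values per column and skips NaN by default, so the loop collapses to one line:

result_dict = df.nunique().to_dict()
print(result_dict)
# {'col1': 5, 'col2': 1}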
I would like to select from a pd.DataFrame only the columns that appear as keys of a dictionary. I have multiple Excel template files and their column names vary, which creates the need to remove certain column names. For reproducibility, I attached a sample below.
import pandas as pd

filename = 'template'
data = [['Auto', '', '', '']]
df = pd.DataFrame(data, columns=['industry', 'System_Type__c', 'AccountType', 'email'])
valid = {'industry': ['Automotive'],
         'SME Vertical': ['Agriculture'],
         'System_Type__c': ['Access'],
         'AccountType': ['Commercial']}
valid = {k: v for k, v in valid.items() if k in df.columns.values}
errors = {}
errors[filename] = {}
df1 = df[['industry', 'System_Type__c', 'AccountType']]
mask = df1.apply(lambda c: c.isin(valid[c.name]))
df1.mask(mask | df1.eq(' ')).stack()
# Series.iteritems was removed in pandas 2.0; items() behaves the same
for err_i, (r, v) in enumerate(df1.mask(mask | df1.eq(' ')).stack().items()):
    errors[filename][err_i] = {"row": r[0],
                               "column": r[1],
                               "message": v + " is invalid check column " + r[1] + ' and replace with a standard value'}
I would like df1 to come from a more dynamic list of DataFrame columns.
How would I replace this piece of code to make it more dynamic?
df1= df[['industry','System_Type__c','AccountType', 'SME Vertical']]
#desired output would drop SME Vertical since it is not a df column
df1= df[['industry','System_Type__c','AccountType']]
# calling list() on the dictionary returns its keys;
# you then filter the DataFrame based on them and assign to df1
df1 = df[list(valid)]
df1
industry System_Type__c AccountType
0 Auto
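An alternative that skips the manual dictionary filtering entirely: DataFrame.filter(items=...) keeps only the labels that actually exist in the frame, so a key like 'SME Vertical' is dropped silently even from the unfiltered valid dictionary:

df1 = df.filter(items=list(valid))  # missing labels are ignored, no KeyError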
I have the following data that I want to convert into a pandas DataFrame.
Input
my_dict = {'table_1': [{'columns_1': 148989, 'columns_2': 437643}], 'table_2': [{'columns_1': 3344343, 'columns_2': 9897833}]}
Expected Output
table_name columns_1 columns_2
table_1 148989 437643
table_2 3344343 9897833
I tried the way below, but due to the loop I can only get the last value.
def convert_to_df():
    for key, value in my_dict.items():
        df = pd.DataFrame.from_dict(value, orient='columns')
        df['table_name'] = key
    return df
What am I missing?
Just get rid of those lists and you can feed directly to the DataFrame constructor:
pd.DataFrame({k: v[0] for k,v in my_dict.items()}).T
output:
columns_1 columns_2
table_1 148989 437643
table_2 3344343 9897833
With the index as a column:
(pd.DataFrame({k: v[0] for k,v in my_dict.items()})
.T
.rename_axis('table_name')
.reset_index()
)
output:
table_name columns_1 columns_2
0 table_1 148989 437643
1 table_2 3344343 9897833
Not the nicest way imho (mozway's method is nicer), but to continue down the road you tried: you need to append the output of every loop iteration to a list and then concat that list into one DataFrame.
def convert_to_df():
    df_list = []  # a list where the output of every loop iteration is collected
    for key, value in my_dict.items():
        df = pd.DataFrame.from_dict(value, orient='columns')
        df['table_name'] = key
        df_list.append(df)  # append to the list
    df = pd.concat(df_list)  # concat the list into one dataframe
    return df
df = convert_to_df()
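For completeness, pd.concat can also take a dict of frames directly and turn the keys into an index level, which shortens this further (a sketch):

df = (pd.concat({k: pd.DataFrame(v) for k, v in my_dict.items()})
        .droplevel(1)               # drop the per-table 0 index
        .rename_axis('table_name')
        .reset_index())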
Situation
Take in a .csv file
Replace all non-ascii characters with a ?
Find the rows and columns that have a ? and display their locations (in a df or list)
Sample Code
df = pd.read_csv('../test.csv', sep='|', skiprows=1)
find_non_ascii = df.select_dtypes(object)
df[find_non_ascii.columns] = find_non_ascii.apply(
    lambda x: x.str.encode("ascii", "replace").str.decode("ascii")
)
df2 = df[find_non_ascii.columns]
cols = df2.columns  # the string columns to scan (was undefined in the snippet)
quest = '\\?'
lster = []
try:
    for col in cols:
        df3 = df2.loc[df2[f'{col}'].str.contains(quest, na=False)]
        # note: df3.items() is a generator and therefore always truthy,
        # so empty frames are appended too (visible in the output below)
        if df3.items():
            lster.append(df3)
except Exception as e:
    print(e)
print(lster)
Output
[ NAME EARPHONES MODEL_NUMBER ID CAR
0 d?fgh ?g?s s-s d?d asd,
NAME EARPHONES MODEL_NUMBER ID CAR
0 d?fgh ?g?s s-s d?d asd
1 dfg A? NaN af a,
Empty DataFrame
Columns: [NAME, EARPHONES, MODEL_NUMBER, ID, CAR]
Index: [],
NAME EARPHONES MODEL_NUMBER ID CAR
0 d?fgh ?g?s s-s d?d asd,
Empty DataFrame
Columns: [NAME, EARPHONES, MODEL_NUMBER, ID, CAR]
Index: []]
You can create a mask (True/False) for each row using df.apply, where df is most likely df2 in your situation but I left it as df for simplicity.
# str() makes this safe for non-string cells such as NaN; '?' in ... is the
# idiomatic spelling of the __contains__ call
df['mask'] = df.apply(lambda x: any('?' in str(x[col]) for col in df.columns), axis=1)
You can then filter the dataframe using this mask to show only the rows where the mask is True (the row contains any '?').
df.loc[df['mask']]
And to remove the mask column showing in the result
df.loc[df['mask'],df.columns[:-1]]
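If you need the exact cell locations rather than whole rows, one way is numpy.where on a boolean mask (a sketch, with df2 being the string-column frame from the question):

import numpy as np

# True wherever a cell contains a literal '?'
mask = df2.apply(lambda c: c.str.contains('\\?', na=False))
rows, cols = np.where(mask)
locations = list(zip(df2.index[rows], df2.columns[cols]))
print(locations)  # e.g. [(0, 'NAME'), (0, 'EARPHONES'), ...]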
I have a dataframe where one of the columns has a dictionary in it
import pandas as pd
import numpy as np
def generate_dict():
    return {'var1': np.random.rand(), 'var2': np.random.rand()}

data = {}
data[0] = {}
data[1] = {}
data[0]['A'] = generate_dict()
data[1]['A'] = generate_dict()
df = pd.DataFrame.from_dict(data, orient='index')
I would like to unpack the key/value pairs in the dictionary into a new DataFrame, where each entry has its own row. I can do that by iterating over the rows and appending to a new DataFrame:
def expand_row(row):
    df_t = pd.DataFrame.from_dict({'value': row.A})
    df_t.index.rename('row', inplace=True)
    df_t.reset_index(inplace=True)
    df_t['column'] = 'A'
    return df_t

df_expanded = pd.DataFrame([])
for _, row in df.iterrows():
    T = expand_row(row)
    # DataFrame.append was removed in pandas 2.0; concat does the same here
    df_expanded = pd.concat([df_expanded, T], ignore_index=True)
This is rather slow, and my application is performance critical. I think this is possible with df.apply. However, as my function returns a DataFrame instead of a Series, simply doing
df_expanded = df.apply(expand_row)
doesn't quite work. What would be the most performant way to do this?
Thanks in advance.
You can use a nested list comprehension and then replace column 0 with the constant A (the column name):
d = df.A.to_dict()
df1 = pd.DataFrame([(key,key1,val1) for key,val in d.items() for key1,val1 in val.items()])
df1[0] = 'A'
df1.columns = ['columns','row','value']
print (df1)
columns row value
0 A var1 0.013872
1 A var2 0.192230
2 A var1 0.176413
3 A var2 0.253600
Another solution:
df1 = pd.DataFrame.from_records(df.A.values.tolist()).stack().reset_index()
df1['level_0'] = 'A'
df1.columns = ['columns','row','value']
print (df1)
columns row value
0 A var1 0.332594
1 A var2 0.118967
2 A var1 0.374482
3 A var2 0.263910
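The same reshape also works via pd.json_normalize, which expands the dicts into columns before stacking (values differ on each run because the data is random):

df1 = pd.json_normalize(df['A'].tolist()).stack().reset_index()
df1['level_0'] = 'A'
df1.columns = ['columns', 'row', 'value']
print(df1)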