How to convert a Python dictionary to a pandas DataFrame - python

I have the following data that I want to convert into a pandas DataFrame
Input
my_dict = {'table_1': [{'columns_1': 148989, 'columns_2': 437643}], 'table_2': [{'columns_1': 3344343, 'columns_2': 9897833}]}
Expected Output
table_name columns_1 columns_2
table_1 148989 437643
table_2 3344343 9897833
I tried the approach below, but because of the loop I only get the last value:
def convert_to_df():
    for key, value in my_dict.items():
        df = pd.DataFrame.from_dict(value, orient='columns')
        df['table_name'] = key
    return df
What am I missing?

Just get rid of those lists and you can feed directly to the DataFrame constructor:
pd.DataFrame({k: v[0] for k,v in my_dict.items()}).T
output:
columns_1 columns_2
table_1 148989 437643
table_2 3344343 9897833
With the index as column:
(pd.DataFrame({k: v[0] for k,v in my_dict.items()})
.T
.rename_axis('table_name')
.reset_index()
)
output:
table_name columns_1 columns_2
0 table_1 148989 437643
1 table_2 3344343 9897833
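Alternatively (a sketch, not part of the original answer), pd.concat can consume a dict of DataFrames directly, using its keys as an index level:
import pandas as pd

df = (pd.concat({k: pd.DataFrame(v) for k, v in my_dict.items()})
        .droplevel(1)                # drop the per-table 0 index level
        .rename_axis('table_name')
        .reset_index())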

Not the nicest way imho (mozway's method is nicer), but to continue down the road you tried: you need to append the output of each loop iteration to a list and then concat that list into one DataFrame.
def convert_to_df():
    df_list = []  # list that collects the output of every loop iteration
    for key, value in my_dict.items():
        df = pd.DataFrame.from_dict(value, orient='columns')
        df['table_name'] = key
        df_list.append(df)  # append to the list
    df = pd.concat(df_list)  # concat the list into one dataframe
    return df

df = convert_to_df()
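A small optional tweak (not in the original answer): passing ignore_index=True to concat yields a clean 0..n-1 index instead of the repeated per-table indices:
df = pd.concat(df_list, ignore_index=True)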

Related

If duplicate row, update rows to 0 in PySpark

I need to update values in the DF.EMAIL column to 0 if there are duplicate values in that column.
generate DF
data = [('2345', 'leo#gmai.com'),
('2398', 'leo#hotmai.com'),
('2398', 'leo#hotmai.com'),
('2328', 'leo#yahoo.con'),
('3983', 'leo#yahoo.com.ar')]
serialize DF
df = sc.parallelize(data).toDF(['ID', 'EMAIL'])
# show DF
df.show()
Partial Solution
from pyspark.sql.functions import count, when

# create a duplicate_indicator column: 0 if the row has no duplicates,
# 1 if it does
df_join = df.join(
    df.groupBy(df.columns).agg((count("*") > 1).cast("int").alias("duplicate_indicator")),
    on=df.columns,
    how="inner"
)
# blank out EMAIL where duplicates exist
df1 = df_join.withColumn(
    "EMAIL",
    when(df_join.duplicate_indicator == 1, "")
    .otherwise(df_join.EMAIL)
)
Syntax-wise, this looks more compact but yours might perform better.
from pyspark.sql import Window
from pyspark.sql.functions import col, count, when
df = (df.withColumn('count', count('*').over(Window.partitionBy('ID')))
        .withColumn('EMAIL', when(col('count') > 1, '').otherwise(col('EMAIL'))))
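If duplicates should instead be detected on EMAIL, as the question's description suggests, the same window trick works with a different partition key; a sketch, reusing the imports above:
w = Window.partitionBy('EMAIL')
df = (df.withColumn('dup_count', count('*').over(w))
        .withColumn('EMAIL', when(col('dup_count') > 1, '').otherwise(col('EMAIL')))
        .drop('dup_count'))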

Memory problem using python pandas to join stock DataFrames in loop

I am trying to join a lot of dataframes in order to build a correlation matrix in pandas.
It seems that I have to keep adding columns on the right-hand side, with "Date" as the index.
But when I run this function with just 50 dataframes, it ends with a memory error.
Does anyone know what is happening?
def taking_and_combining_data_from_mysql_to_excel(root):
    saved_path = root + "\main_df.xlsx"
    main_df = pd.DataFrame()
    mycursor = mydb.cursor(buffered=True)
    for key, value in stock_dic.items():
        mycursor.execute("""SELECT date, Adj_close
                            FROM hk_stock
                            Where date >= '2020-03-13 00:00:00' and stock_number = '{}'""".format(key))
        row_result = mycursor.fetchall()
        df = pd.DataFrame(row_result)
        df.columns = ['Date', value]
        df.set_index('Date', inplace=True)
        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how="outer")
    with pd.ExcelWriter(saved_path) as writer:
        main_df.to_excel(writer, sheet_name="raw_data")
        main_df.corr().to_excel(writer, sheet_name="correlation")
    return main_df
Pandas is not designed for such dynamic concatenations. You could just append things into a list, and convert that list into a DataFrame. Like so:
join = []
for key, value in stock_dic.items():
    join.append({'Date': value})
df_join = pd.DataFrame(join)
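Applied to the original loop, the same idea is to collect each per-stock frame in a list and do a single concat at the end; a sketch, assuming the mydb connection and stock_dic from the question (and mysql-connector's %s parameter style):
import pandas as pd

frames = []
mycursor = mydb.cursor(buffered=True)
for key, value in stock_dic.items():
    mycursor.execute("""SELECT date, Adj_close
                        FROM hk_stock
                        WHERE date >= '2020-03-13 00:00:00' AND stock_number = %s""", (key,))
    df = pd.DataFrame(mycursor.fetchall(), columns=['Date', value]).set_index('Date')
    frames.append(df)
main_df = pd.concat(frames, axis=1)  # one outer join instead of dozens of incremental ones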

Pandas rename & reindex: ValueError

def Transformation_To_UpdateNex(df):
    s = 'TERM-ID,NAME,QUALIFIER,FACET1_ID,FACET2_ID,FACET3_ID,FACET4_ID,GROUP1_ID,GROUP2_ID,GROUP3_ID,GROUP4_ID,IS_VALID,IS_SELLABLE,IS_PRIMARY,IS_BRANCHABLE,HAS_RULES,FOR_SUGGESTION,IS_SAVED,S_NEG,SCORE,GOOGLE_SV,CPC,SINGULARTEXT,SING_PLU_VORGABE'
    df_Import = pd.DataFrame(columns=s.split(','))
    d = {'TERMID': 'TERM-ID', 'NAMECHANGE': 'NAME', 'TYP': 'QUALIFIER'}
    df_Import = df.rename(columns=d).reindex(columns=df_Import.columns)
    df_Import.to_csv("Update.csv", sep=";", index=False, encoding="ISO-8859-1")
ValueError: cannot reindex from a duplicate axis
I am trying to take values from a filled Dataframe and transfer these values keeping the same structure to my new Dataframe (empty one described first in the code).
Any ideas how to solve the value error?
So the error:
ValueError: cannot reindex from a duplicate axis
means there are duplicated column names.
I think the problem is with rename, because it creates duplicated columns:
s = 'TERM-ID,NAME,QUALIFIER,FACET1_ID,NAMECHANGE,TYP'
df = pd.DataFrame(columns = s.split(','))
print (df)
Empty DataFrame
Columns: [TERM-ID, NAME, QUALIFIER, FACET1_ID, NAMECHANGE, TYP]
Index: []
Here, after rename, we get duplicated NAME and QUALIFIER columns, because the original columns already contain NAME and QUALIFIER alongside the NAMECHANGE and TYP pairs:
d = {'TERMID':'TERM-ID', 'NAMECHANGE':'NAME', 'TYP':'QUALIFIER'}
df1 = df.rename(columns = d)
print (df1)
Empty DataFrame
Columns: [TERM-ID, NAME, QUALIFIER, FACET1_ID, NAME, QUALIFIER]
Index: []
A possible solution is to test whether the target column already exists and filter the dictionary:
d = {'TERMID':'TERM-ID', 'NAMECHANGE':'NAME', 'TYP':'QUALIFIER'}
d1 = {k: v for k, v in d.items() if v not in df.columns}
print (d1)
{}
df1 = df.rename(columns = d1)
print (df1)
Empty DataFrame
Columns: [TERM-ID, NAME, QUALIFIER, FACET1_ID, NAMECHANGE, TYP]
Index: []
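An alternative sketch (not in the original answer): keep the rename as-is and drop the duplicated columns afterwards, keeping the first occurrence of each name:
df1 = df.rename(columns=d)
df1 = df1.loc[:, ~df1.columns.duplicated()]  # keep the first occurrence of each column name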

Rename variously formatted column headers in pandas

I'm working on a small tool that does some calculations on a dataframe, let's say something like this:
df['column_c'] = df['column_a'] + df['column_b']
for this to work the dataframe needs to have the columns 'column_a' and 'column_b'. I would like this code to work even if the columns are named slightly differently in the import file (csv or xlsx), for example 'columnA', 'Col_a', etc.
The easiest way would be renaming the columns inside the imported file, but let's assume this is not possible. Therefore I would like to do something like this:
if column name is in list ['columnA', 'Col_A', 'col_a', 'a'... ] rename it to 'column_a'
I was thinking about having a dictionary with possible column names; when a column name is in this dictionary, it is renamed to 'column_a'. An additional complication is that the columns can be in arbitrary order.
How would one solve this problem?
I recommend you formulate the conversion logic and write a function accordingly:
lst = ['columnA', 'Col_A', 'col_a', 'a']
def converter(x):
    return 'column_' + x[-1].lower()
res = list(map(converter, lst))
['column_a', 'column_a', 'column_a', 'column_a']
You can then use this directly in pd.DataFrame.rename:
df = df.rename(columns=converter)
Example usage:
df = pd.DataFrame(columns=['columnA', 'col_B', 'c'])
df = df.rename(columns=converter)
print(df.columns)
Index(['column_a', 'column_b', 'column_c'], dtype='object')
Simply
# a pandas Index is immutable, so build a new list and assign it back
columns = list(df.columns)
for index, column_name in enumerate(columns):
    if column_name in ['columnA', 'Col_A', 'col_a']:
        columns[index] = 'column_a'
df.columns = columns
with dictionary
dico = {'column_a': ['columnA', 'Col_A', 'col_a'], 'column_b': ['columnB', 'Col_B', 'col_b']}
columns = list(df.columns)
for index, column_name in enumerate(columns):
    for name, ex_names in dico.items():  # .items() is needed to iterate key/value pairs
        if column_name in ex_names:
            columns[index] = name
df.columns = columns
This should solve it:
df = pd.DataFrame({'colA': [1, 2], 'columnB': [3, 4]})
def rename_df(col):
    if col in ['columnA', 'Col_A', 'colA']:
        return 'column_a'
    if col in ['columnB', 'Col_B', 'colB']:
        return 'column_b'
    return col
df = df.rename(rename_df, axis=1)
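With the toy frame above, the rename produces the normalized names:
print(df.columns)
Index(['column_a', 'column_b'], dtype='object')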
if you have lists of alternative names like list_othername_A or list_othername_B, you can do:
for col_name in df.columns:
    if col_name in list_othername_A:
        df = df.rename(columns={col_name: 'column_a'})
    elif col_name in list_othername_B:
        df = df.rename(columns={col_name: 'column_b'})
    elif ...
EDIT: using the dictionary of @djangoliv, you can do it even shorter:
dico = {'column_a': ['columnA', 'Col_A', 'col_a'], 'column_b': ['columnB', 'Col_B', 'col_b']}
# create a dict to rename, a kind of reversed dico:
dict_rename = {col: key for key in dico.keys() for col in dico[key]}
# then just rename:
df = df.rename(columns=dict_rename)
Note that this method does not work if df has both 'columnA' and 'Col_A' as columns, but otherwise it should work, since rename does not care if a key in dict_rename is missing from df.columns.
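For clarity, the reversed mapping built above looks like this:
print(dict_rename)
{'columnA': 'column_a', 'Col_A': 'column_a', 'col_a': 'column_a', 'columnB': 'column_b', 'Col_B': 'column_b', 'col_b': 'column_b'}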

unpack dictionary entries in pandas into dataframe

I have a dataframe where one of the columns has a dictionary in it
import pandas as pd
import numpy as np
def generate_dict():
    return {'var1': np.random.rand(), 'var2': np.random.rand()}
data = {}
data[0] = {}
data[1] = {}
data[0]['A'] = generate_dict()
data[1]['A'] = generate_dict()
df = pd.DataFrame.from_dict(data, orient='index')
I would like to unpack the key/value pairs in the dictionary into a new dataframe, where each entry has its own row. I can do that by iterating over the rows and appending to a new DataFrame:
def expand_row(row):
    df_t = pd.DataFrame.from_dict({'value': row.A})
    df_t.index.rename('row', inplace=True)
    df_t.reset_index(inplace=True)
    df_t['column'] = 'A'
    return df_t

df_expanded = pd.DataFrame([])
for _, row in df.iterrows():
    T = expand_row(row)
    # note: DataFrame.append was deprecated and removed in pandas 2.0; pd.concat replaces it
    df_expanded = df_expanded.append(T, ignore_index=True)
This is rather slow, and my application is performance critical. I think this is possible with df.apply. However, as my function returns a DataFrame instead of a Series, simply doing
df_expanded = df.apply(expand_row)
doesn't quite work. What would be the most performant way to do this?
Thanks in advance.
You can use a nested list comprehension and then replace column 0 with the constant 'A' (the column name):
d = df.A.to_dict()
df1 = pd.DataFrame([(key,key1,val1) for key,val in d.items() for key1,val1 in val.items()])
df1[0] = 'A'
df1.columns = ['columns','row','value']
print (df1)
columns row value
0 A var1 0.013872
1 A var2 0.192230
2 A var1 0.176413
3 A var2 0.253600
Another solution:
df1 = pd.DataFrame.from_records(df.A.values.tolist()).stack().reset_index()
df1['level_0'] = 'A'
df1.columns = ['columns','row','value']
print (df1)
columns row value
0 A var1 0.332594
1 A var2 0.118967
2 A var1 0.374482
3 A var2 0.263910
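A more recent alternative (a sketch, not from the original answers): flatten the dicts with json_normalize and reshape with melt; note that the rows come out grouped by variable rather than interleaved by original row:
df1 = (pd.json_normalize(df['A'].tolist())
         .melt(var_name='row', value_name='value')
         .assign(columns='A')[['columns', 'row', 'value']])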
