I have data formatted as dict[tuple[str, str], list[float]]
and I want to convert it into a pandas DataFrame.
Example data:
{('A','B'): [-0.008035100996494293,0.008541940711438656]}
I tried some data manipulation using split functions.
Expected result:
import pandas as pd
data = {('A','B'): [-0.008035100996494293,0.008541940711438656], ('C','D'): [-0.008035100996494293,0.008541940711438656]}
title = []
heading = []
num_col1 = []
num_col2 = []
for key, val in data.items():
    title.append(key[0])
    heading.append(key[1])
    num_col1.append(val[0])
    num_col2.append(val[1])
data_ = {'title':title, 'heading':heading, 'num_col1':num_col1, 'num_col2':num_col2}
pd.DataFrame(data_)
Your best bet will be to construct your Index manually. For this we can use pandas.MultiIndex.from_tuples since your dictionary keys are stored as tuples. From there we just need to store the values of the dictionary into the body of a DataFrame.
import pandas as pd
data = {('A','B'): [-0.008035100996494293,0.008541940711438656]}
index = pd.MultiIndex.from_tuples(data.keys(), names=['title', 'heading'])
df = pd.DataFrame(data.values(), index=index).reset_index()
print(df)
title heading 0 1
0 A B -0.008035 0.008542
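To get the column names from the question's expected output rather than the positional 0 and 1, the value columns can be named when the frame is built. This is a minimal sketch, assuming every list holds exactly two values; `value_cols` is an illustrative name, not something from the original answer.

```python
import pandas as pd

data = {
    ('A', 'B'): [-0.008035100996494293, 0.008541940711438656],
    ('C', 'D'): [-0.008035100996494293, 0.008541940711438656],
}

# Assumed names for the value columns, taken from the question's expected output.
value_cols = ['num_col1', 'num_col2']

# The tuple keys become a two-level index, which reset_index turns into columns.
index = pd.MultiIndex.from_tuples(list(data.keys()), names=['title', 'heading'])
df = pd.DataFrame(list(data.values()), index=index, columns=value_cols).reset_index()
print(df)
```

If the lists vary in length, passing a fixed `columns` list will not fit, and the positional column names are the safer choice.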
If you prefer a single chained operation, you can do:
import pandas as pd
data = {('A','B'): [-0.008035100996494293,0.008541940711438656]}
df = (
    pd.DataFrame.from_dict(data, orient='index')
    .pipe(lambda d:
        d.set_axis(pd.MultiIndex.from_tuples(d.index, names=['title', 'heading']))
    )
    .reset_index()
)
print(df)
title heading 0 1
0 A B -0.008035 0.008542
Another possible solution, which works also if the tuples and lists vary in length:
pd.concat([pd.DataFrame.from_records(list(d.keys()),
                                     columns=['title', 'h1', 'h2']),
           pd.DataFrame.from_records(list(d.values()))], axis=1)
Output:
title h1 h2 0 1 2
0 A B None -0.008035 0.008542 NaN
1 C B D -0.010351 1.008542 5.0
Data input:
d = {('A','B'): [-0.008035100996494293,0.008541940711438656],
('C','B', 'D'): [-0.01035100996494293,1.008541940711438656, 5]}
You can expand the keys and values as you iterate the dictionary items. Pandas will see 4 values which it will make into a row.
>>> import pandas as pd
>>> data = {('A','B'): [-0.008035100996494293,0.008541940711438656]}
>>> pd.DataFrame(((*d[0], *d[1]) for d in data.items()), columns=("Title", "Heading", "Foo", "Bar"))
Title Heading Foo Bar
0 A B -0.008035 0.008542
Related
import pandas as pd
D = {"a":[(1.0411070751196425, 1.048179051450828),(0.8020630165032718, 0.8884133074952416)],
"b":[(1.0411070751196425, 1.048179051450828),(0.8020630165032718, 0.8884133074952416)],
"c":[(1.0411070751196425, 1.048179051450828),(0.8020630165032718, 0.8884133074952416)],
"d":[(1.0411070751196425, 1.048179051450828),(0.8020630165032718, 0.8884133074952416)]}
D = pd.DataFrame(D)
Suppose I have a pandas DataFrame like this, whose entries are tuples. When I print it, how can I display each number with only 4 decimals? For example, the complete entry is (1.0411070751196425, 1.048179051450828), and I want to display (1.0411, 1.0482).
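Use DataFrame.applymap for elementwise processing, with a generator expression passed to tuple (in pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map). Note that this rounds the stored values rather than only changing how they are displayed: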
D = D.applymap(lambda x: tuple(round(y,4) for y in x))
print (D)
a b c d
0 (1.0411, 1.0482) (1.0411, 1.0482) (1.0411, 1.0482) (1.0411, 1.0482)
1 (0.8021, 0.8884) (0.8021, 0.8884) (0.8021, 0.8884) (0.8021, 0.8884)
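If the stored values should stay untouched and only the printed form should change, one option is to pass per-column formatters to DataFrame.to_string. This is a sketch rather than part of the original answer; `fmt_tuple` is a hypothetical helper, and the frame is a two-column subset of the question's data.

```python
import pandas as pd

D = pd.DataFrame({
    "a": [(1.0411070751196425, 1.048179051450828), (0.8020630165032718, 0.8884133074952416)],
    "b": [(1.0411070751196425, 1.048179051450828), (0.8020630165032718, 0.8884133074952416)],
})

def fmt_tuple(t):
    """Render a tuple of floats with 4 decimals each (hypothetical helper)."""
    return "(" + ", ".join(f"{v:.4f}" for v in t) + ")"

# formatters maps each column name to a one-argument function used only when rendering.
text = D.to_string(formatters={col: fmt_tuple for col in D.columns})
print(text)
```

The DataFrame itself still holds the full-precision tuples after printing.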
I have a very large list, so I will use the one below as a reproducible example. I would like to unpack it so that I can use the keys of the dictionaries as the columns of a DataFrame.
[{'message':'Today is a sunny day.','comments_count':'45','id':
'1401305690071546_11252160039985938','created_time': '2020-02-29T13:43:46+0000'},
{'message':'Today is a cloudy day.','comments_count':'47','id':
'1401305690073586_11252160039985938','created_time': '2020-03-29T13:43:46+0000'}]
The desired output is a pandas DataFrame with the following columns:
message comments_count id created_time
If what you have is a list of dictionaries that you want to turn into a DataFrame, you can just do the following (here l is your list):
df1 = pd.DataFrame(l)
# or
df2 = pd.DataFrame.from_dict(l)
The output is the same in both cases:
print(df2)
print(df2.columns)
message ... created_time
0 Today is a sunny day. ... 2020-02-29T13:43:46+0000
1 Today is a cloudy day. ... 2020-03-29T13:43:46+0000
[2 rows x 4 columns]
Index(['message', 'comments_count', 'id', 'created_time'], dtype='object')
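For a self-contained version of the above, here is a sketch using the question's two records; `records` stands in for the list called l in the answer. The numeric conversion at the end is an assumption about what might be wanted downstream, since the counts arrive as strings.

```python
import pandas as pd

# `records` stands in for the list called `l` in the answer above.
records = [
    {'message': 'Today is a sunny day.', 'comments_count': '45',
     'id': '1401305690071546_11252160039985938',
     'created_time': '2020-02-29T13:43:46+0000'},
    {'message': 'Today is a cloudy day.', 'comments_count': '47',
     'id': '1401305690073586_11252160039985938',
     'created_time': '2020-03-29T13:43:46+0000'},
]

# Each dictionary becomes one row; the keys become the columns.
df = pd.DataFrame(records)

# Optional (an assumption, not requested in the question): convert the string counts to numbers.
df['comments_count'] = pd.to_numeric(df['comments_count'])
print(df.columns.tolist())
```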
If you want to put all of the data into the dataframe:
import pandas as pd
my_container = [{'message':'Today is a sunny day.','comments_count':'45','id': '1401305690071546_11252160039985938','created_time': '2020-02-29T13:43:46+0000'}, {'message':'Today is a cloudy day.','comments_count':'47','id': '1401305690073586_11252160039985938','created_time': '2020-03-29T13:43:46+0000'}]
df = pd.DataFrame(my_container)
If you want an empty dataframe with the correct columns:
# collect the keys in first-seen order (a set would not preserve column order)
columns = {}
for d in my_container:
    columns.update(dict.fromkeys(d))
df = pd.DataFrame(columns=list(columns))
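You can iterate through the list and check the type of each item. Build new lists instead of removing items from myList while looping over it, since removing during iteration skips elements:
dictList = [i for i in myList if isinstance(i, dict)]
myList = [i for i in myList if not isinstance(i, dict)]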
How can I create a single row that holds the data type, maximum column length and count for each column of a DataFrame, as shown in the desired output section at the bottom?
import pandas as pd
table = 'sample_data'
idx=0
#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,'NULL',40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65]),
'new_column':pd.Series([])
}
#Create a DataFrame using above data
sdf = pd.DataFrame(d)
#Create a summary description
desired_data = sdf.describe(include='all').T
desired_data = desired_data.rename(columns={'index':'Variable'})
#print(summary)
#Get Data Type
dtype = sdf.dtypes
#print(data_type)
#Get total count of records (need to work on)
counts = sdf.shape[0] # gives number of row count
#Get maximum length of values
maxcollen = []
for col in range(len(sdf.columns)):
    maxcollen.append(max(sdf.iloc[:,col].astype(str).apply(len)))
#print('Max Column Lengths ', maxColumnLenghts)
#Constructing final data frame
desired_data = desired_data.assign(data_type = dtype.values)
desired_data = desired_data.assign(total_count = counts)
desired_data = desired_data.assign(max_col_length = maxcollen)
final_df = desired_data
final_df = final_df.reindex(columns=['data_type','max_col_length','total_count'])
final_df.insert(loc=idx, column='table_name', value=table)
final_df.to_csv('desired_data.csv')
#print(final_df)
Output of above code:
The desired output I am looking for is:
In : sdf
Out:
table_name Name_data_type Name_total_count Name_max_col_length Age_data_type Age_total_count Age_max_col_length Rating_data_type Rating_total_count Rating_max_col_length
sample_data object 12 6 object 12 4 float64 12 4
As you can see, I want a single row in which each source column gets three columns, column_name_data_type, column_name_total_count and column_name_max_col_length, holding the respective values.
Here's a solution:
df = final_df
df = df.drop("new_column").drop("table_name", axis=1)
df = df.reset_index()
df.melt(id_vars=["index"]).set_index(["index", "variable"]).sort_index().transpose()
The result is:
index Age Name \
variable data_type max_col_length total_count data_type max_col_length ...
value object 4 12 object 6 ...
Can you try this:
The code below iterates over the entire DataFrame row by row, so it can be slow on large inputs. It is not the optimal solution, but it is a working one for the problem above.
from collections import OrderedDict
## storing key-value pair
result_dic = OrderedDict()
unique_table_names = final_df["table_name"].unique()
# remove unwanted rows
final_df.drop("new_column", inplace=True)
cols_name = final_df.columns
## for every unique table name, generating row
for table_name in unique_table_names:
    result_dic["table_name"] = table_name
    filtered_df = final_df[final_df["table_name"] == table_name]
    # row_label is the original column name (the index of final_df)
    for row_label, row in filtered_df.iterrows():
        for col in cols_name:
            if col != "table_name":
                result_dic[row_label + "_" + col] = row[col]
Convert dict to dataframe
## convert dataframe from dict
result_df = pd.DataFrame([result_dic])
result_df
The resulting output is:
table_name Name_data_type Name_max_col_length Name_total_count Age_data_type Age_max_col_length Age_total_count Rating_data_type Rating_max_col_length Rating_total_count
0 sample_data object 6 12 object 4 12 float64 4 12
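The same single row can also be built directly from the source frame, without going through describe(). The following is a sketch under stated assumptions: it uses a small illustrative frame rather than the question's full data, takes total_count to be the number of rows, and measures length on the string form of each value, as the question's code does.

```python
import pandas as pd

# Small illustrative frame; not the question's full data.
sdf = pd.DataFrame({
    'Name': ['Tom', 'Ricky', 'Steve'],
    'Age': [25, 'NULL', 30],
    'Rating': [4.23, 3.24, 3.98],
})

# One dictionary entry per (column, statistic) pair gives one wide row.
row = {'table_name': 'sample_data'}
for col in sdf.columns:
    row[f'{col}_data_type'] = str(sdf[col].dtype)
    row[f'{col}_total_count'] = len(sdf)
    row[f'{col}_max_col_length'] = int(sdf[col].astype(str).str.len().max())

result = pd.DataFrame([row])
print(result.T)  # transposed only to keep the printout narrow
```

Because the dictionary is filled column by column, the three statistics for each column end up next to each other, in the order the question asks for.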
I have a dataframe where one of the columns has a dictionary in it
import pandas as pd
import numpy as np
def generate_dict():
    return {'var1': np.random.rand(), 'var2': np.random.rand()}
data = {}
data[0] = {}
data[1] = {}
data[0]['A'] = generate_dict()
data[1]['A'] = generate_dict()
df = pd.DataFrame.from_dict(data, orient='index')
I would like to unpack the key/value pairs in the dictionary into a new DataFrame, where each entry has its own row. I can do that by iterating over the rows and appending to a new DataFrame:
def expand_row(row):
    df_t = pd.DataFrame.from_dict({'value': row.A})
    df_t.index.rename('row', inplace=True)
    df_t.reset_index(inplace=True)
    df_t['column'] = 'A'
    return df_t
df_expanded = pd.DataFrame([])
for _, row in df.iterrows():
    T = expand_row(row)
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job here
    df_expanded = pd.concat([df_expanded, T], ignore_index=True)
This is rather slow, and my application is performance critical. I think this is possible with df.apply. However, as my function returns a DataFrame instead of a Series, simply doing
df_expanded = df.apply(expand_row)
doesn't quite work. What would be the most performant way to do this?
Thanks in advance.
You can use nested list comprehension and then replace column 0 with constant A (column name):
d = df.A.to_dict()
df1 = pd.DataFrame([(key,key1,val1) for key,val in d.items() for key1,val1 in val.items()])
df1[0] = 'A'
df1.columns = ['columns','row','value']
print (df1)
columns row value
0 A var1 0.013872
1 A var2 0.192230
2 A var1 0.176413
3 A var2 0.253600
Another solution:
df1 = pd.DataFrame.from_records(df.A.values.tolist()).stack().reset_index()
df1['level_0'] = 'A'
df1.columns = ['columns','row','value']
print (df1)
columns row value
0 A var1 0.332594
1 A var2 0.118967
2 A var1 0.374482
3 A var2 0.263910
I am currently using this code:
import numpy as np
import pandas as pd
AllDays = ['a','b','c','d']
TempDay = pd.DataFrame(np.random.randn(4, 2))
TempDay['Dates'] = AllDays
# raw string so the backslash in the Windows path is not read as an escape
TempDay.to_csv(r'H:\MyFile.csv', index=False, header=False)
But when it writes the file, the array comes before the dates, and there is a header row. I would like the dates to come before the TemperatureArray, with no header row.
Edit:
The file currently has the TemperatureArray followed by the dates: [ TemperatureArray, Date].
-0.27724356949570034,-0.3096554106726788,a
-0.10619546908708237,0.07430127684522048,b
-0.07619665345406437,0.8474460146082116,c
0.19668718143436803,-0.8072994364484335,d
I am looking to print: [ Date TemperatureArray]
a,-0.27724356949570034,-0.3096554106726788
b,-0.10619546908708237,0.07430127684522048
c,-0.07619665345406437,0.8474460146082116
d,0.19668718143436803,-0.8072994364484335
The pandas.DataFrame.to_csv method has a keyword argument, header, which defaults to True and can be set to False to omit the header row.
In my experience, however, it does not always work as expected on its own.
Using it in conjunction with index=False should solve your issue.
For example, this snippet should fix your issue:
TempDay.to_csv(r'C:\MyFile.csv', index=False, header=False)
Here is a full example showing how it disables the header row:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(6,4))
>>> df
0 1 2 3
0 1.295908 1.127376 -0.211655 0.406262
1 0.152243 0.175974 -0.777358 -1.369432
2 1.727280 -0.556463 -0.220311 0.474878
3 -1.163965 1.131644 -1.084495 0.334077
4 0.769649 0.589308 0.900430 -1.378006
5 -2.663476 1.010663 -0.839597 -1.195599
>>> # just assigns sequential letters to the column
>>> df[4] = [chr(i+ord('A')) for i in range(6)]
>>> df
0 1 2 3 4
0 1.295908 1.127376 -0.211655 0.406262 A
1 0.152243 0.175974 -0.777358 -1.369432 B
2 1.727280 -0.556463 -0.220311 0.474878 C
3 -1.163965 1.131644 -1.084495 0.334077 D
4 0.769649 0.589308 0.900430 -1.378006 E
5 -2.663476 1.010663 -0.839597 -1.195599 F
>>> # here we reindex the headers and return a copy
>>> # using this form of indexing just requires you to provide
>>> # a list with all the columns you desire and in the order desired
>>> df2 = df[[4, 1, 2, 3]]
>>> df2
4 1 2 3
0 A 1.127376 -0.211655 0.406262
1 B 0.175974 -0.777358 -1.369432
2 C -0.556463 -0.220311 0.474878
3 D 1.131644 -1.084495 0.334077
4 E 0.589308 0.900430 -1.378006
5 F 1.010663 -0.839597 -1.195599
>>> df2.to_csv('a.txt', index=False, header=False)
>>> with open('a.txt') as f:
... print(f.read())
...
A,1.1273756275298716,-0.21165535441591588,0.4062624848191157
B,0.17597366083826546,-0.7773584823122313,-1.3694320591723093
C,-0.556463084618883,-0.22031139982996412,0.4748783498361957
D,1.131643603259825,-1.084494967896866,0.334077296863368
E,0.5893080536600523,0.9004299653290818,-1.3780062860066293
F,1.0106633581546611,-0.839597332636998,-1.1955992812601897
If you need to dynamically adjust the columns, and move the last column to the first, you can do as follows:
# this returns the columns as a list
columns = df.columns.tolist()
# removes the last column, the newest one you added
tofirst_column = columns.pop(-1)
# just move it to the start
new_columns = [tofirst_column] + columns
# then select the columns in the new order
df2 = df[new_columns]
This takes the current columns as a Python list, reorders that list, and uses it to select the columns in the new order, without needing to know the column names in advance.
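Putting the reordering and the header-free export together, here is a short sketch that writes to an in-memory buffer so the result can be inspected without touching the disk. The values are made up for illustration, and `buffer` is only a stand-in for a real file path.

```python
import io
import pandas as pd

# Two numeric columns followed by the date labels, as in the question.
df = pd.DataFrame({0: [-0.277, -0.106], 1: [-0.31, 0.074]})
df['Dates'] = ['a', 'b']

# Move the last column to the front without naming any column explicitly.
columns = df.columns.tolist()
new_columns = [columns[-1]] + columns[:-1]

buffer = io.StringIO()  # in-memory stand-in for a file path
df[new_columns].to_csv(buffer, index=False, header=False)
print(buffer.getvalue())
```

Each line of the output starts with the date label, followed by the numeric values, and there is no header or index column.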