I have a dataframe and want to convert it to a list of dictionaries. I use read_csv() to create this dataframe. The dataframe looks like the following:
AccountName AccountType StockName Allocation
0 MN001 #1 ABC 0.4
1 MN001 #1 ABD 0.6
2 MN002 #2 EFG 0.5
3 MN002 #2 HIJ 0.4
4 MN002 #2 LMN 0.1
The desired output:
[{'ABC':0.4, 'ABD':0.6}, {'EFG':0.5, 'HIJ':0.4,'LMN':0.1}]
I have tried to research on similar topics and used the Dataframe.to_dict() function. I look forward to getting this done. Many thanks for your help!
import pandas as pd
import numpy as np
d = np.array([['MN001','#1','ABC', 0.4],
['MN001','#1','ABD', 0.6],
['MN002', '#2', 'EFG', 0.5],
['MN002', '#2', 'HIJ', 0.4],
['MN002', '#2', 'LMN', 0.1]])
df = pd.DataFrame(data=d, columns = ['AccountName','AccountType','StockName', 'Allocation'])
by_account_df = df.groupby('AccountName').apply(lambda x : dict(zip(x['StockName'],x['Allocation']))).reset_index(name='dic')
by_account_lst = by_account_df['dic'].values.tolist()
And the result should be:
print(by_account_lst)
[{'ABC': '0.4', 'ABD': '0.6'}, {'EFG': '0.5', 'HIJ': '0.4', 'LMN': '0.1'}]
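A likely reason the allocations above print as strings rather than floats is that np.array coerces a list mixing strings and numbers into a single string dtype. Here is a minimal sketch, assuming the frame can be built from plain Python lists instead (the names rows and by_account are only illustrative):

```python
import pandas as pd

# Plain nested lists keep Allocation as floats; np.array would turn
# every cell into a string because the rows mix strings and numbers.
rows = [
    ['MN001', '#1', 'ABC', 0.4],
    ['MN001', '#1', 'ABD', 0.6],
    ['MN002', '#2', 'EFG', 0.5],
    ['MN002', '#2', 'HIJ', 0.4],
    ['MN002', '#2', 'LMN', 0.1],
]
df = pd.DataFrame(
    rows, columns=['AccountName', 'AccountType', 'StockName', 'Allocation']
)

# One dict per account, mapping StockName -> Allocation.
by_account = [
    dict(zip(group['StockName'], group['Allocation']))
    for _, group in df.groupby('AccountName')
]
print(by_account)
```

With read_csv() the Allocation column would normally be parsed as floats already, so this coercion mostly affects the hand-built example.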
This should do it:
portfolios = []
for _, account in df.groupby('AccountName'):
    portfolio = {stock['StockName']: stock['Allocation']
                 for _, stock in account.iterrows()}
    portfolios.append(portfolio)
First use the groupby() function to group the rows of the dataframe by AccountName. To access the individual rows (stocks) for each account, you use the iterrows() method. As user #ebb-earl-co explained in the comments, the _ is there as a placeholder variable, because iterrows() returns (index, Series) tuples, and we only need the Series (the rows themselves). From there, use a dict comprehension to create a dictionary mapping StockName -> Allocation for each stock. Finally, append that dictionary to the list of portfolios, resulting in the expected output:
[{'ABC': 0.4, 'ABD': 0.6}, {'EFG': 0.5, 'HIJ': 0.4, 'LMN': 0.1}]
One more thing: if you decide later that you want to label each dict in the portfolios with the account name, you could do it like this:
portfolios = []
for acct_name, account in df.groupby('AccountName'):
    portfolio = {stock['StockName']: stock['Allocation']
                 for _, stock in account.iterrows()}
    portfolios.append({acct_name: portfolio})
This will return a list of nested dicts like this:
[{'MN001': {'ABC': 0.4, 'ABD': 0.6}},
{'MN002': {'EFG': 0.5, 'HIJ': 0.4, 'LMN': 0.1}}]
Note that in this case, I used the variable acct_name instead of assigning to _ because we actually will use the index to "label" the dicts in the portfolios list.
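If the account labels matter, a single dict keyed by account name may be easier to query than a list of one-key dicts. This is only a sketch of that variation, not part of the original answer; it rebuilds a small frame with the question's columns, and the name portfolios_by_account is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'AccountName': ['MN001', 'MN001', 'MN002', 'MN002', 'MN002'],
    'StockName': ['ABC', 'ABD', 'EFG', 'HIJ', 'LMN'],
    'Allocation': [0.4, 0.6, 0.5, 0.4, 0.1],
})

# Account name -> {StockName: Allocation}, built in one comprehension.
portfolios_by_account = {
    acct_name: dict(zip(account['StockName'], account['Allocation']))
    for acct_name, account in df.groupby('AccountName')
}
print(portfolios_by_account['MN002'])
```

Lookups such as portfolios_by_account['MN002'] then need no scan through a list.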
Related
I have a dictionary with unique ID and [sample distribution of scores] pairs, e.g.: '100': [0.5, 0.6, 0.2, 0.7, 0.3]. The arrays are not all the same length.
For each item/'scores' array in my dictionary, I want to fit a beta distribution like scipy.stats.beta.fit() over the distribution of scores and get the alpha/beta parameters for each sample. And then I want this in a new dictionary — so it'd be like, '101': (1.5, 1.8).
I know I could do this by iterating over my dictionary with a for-loop, but the dictionary is pretty massive/I'd like to know if there's a more computationally efficient way of doing it.
For context, the way I get this dictionary is from a pandas dataframe, where I do:
my_dictionary = df.groupby('unique_id')['score'].apply(list).to_dict()
The df looks like this:
For example:
df = pd.DataFrame({
'id': ['100', '100', '100', '101', '101', '102'],
'score' : [0.5, 0.3, 0.2, 1, 0.2, 0.9]
})
And then the resulting dictionary looks like:
{'100': [0.5, 0.3, 0.2], '101': [0.2, 0.1], '102': [0.9]}
Is there maybe also a way of fitting the beta distribution straight from the df.groupby level/without having to convert it into a dictionary first and then looping over the dictionary with scipy? Like is there something where I could do:
df.groupby('unique_id')['score'].apply(stats.beta.fit()).to_dict()
...or something like that?
Try this:
from scipy.stats import beta
df=df.groupby('id').apply(lambda x: list(beta.fit(x.score)))
dc=df.to_dict()
Output:
df
id
100 [0.2626434905176847, 0.37866242902872393, 0.18...
101 [1.253982875508286, 0.8832540117966552, -0.093...
102 [1.044551187075241, 1.0167687597781938, 0.8999...
dtype: object
dc
{'100': [0.2626434905176847, 0.37866242902872393, 0.18487097639113187, 0.3151290236088682],
'101': [1.253982875508286, 0.8832540117966552, -0.09383386122371801, 1.0938338612237182],
'102': [1.044551187075241, 1.0167687597781938, 0.8999999999999999, 1.1272504901983386e-16]}
As I understand it, you need to run beta.fit on each row of the dataframe df:
df['beta_fit'] = df['score'].apply( lambda x: stats.beta.fit(x))
Now result is stored in df['beta_fit']:
0 (0.5158954356434775, 0.4824876600627905, 0.154...
1 (0.18219650169013427, 0.18228236200252418, 0.1...
2 (2.874609362944296, 0.8497751096020354, -0.341...
3 (1.313976940871222, 0.5956397575363881, -0.093...
Name: beta_fit, dtype: object
If you want to keep the location (loc) and scale (scale) fixed, you need to indicate this in scipy.stats.beta.fit. You can use functools.partial for this:
>>> import pandas as pd
>>> import scipy.stats
>>> from functools import partial
>>> df = pd.DataFrame({
... 'id': ['100', '100', '100', '101', '101', '102'],
... 'score' : [0.5, 0.3, 0.2, 0.1, 0.2, 0.9]
... })
>>> beta = partial(scipy.stats.beta.fit, floc=0, fscale=1)
>>> df.groupby('id')['score'].apply(beta)
id
100 (4.82261025047374, 9.616623800842953, 0, 1)
101 (0.7079910251948778, 0.910200073771759, 0, 1)
Name: score, dtype: object
Note that I have adjusted your input example, since it contains an incorrect value (1.0), and too few values for the fit to succeed in some cases.
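To get from the fitted tuples to the {id: (alpha, beta)} dictionary the question asks for, something along these lines might work. It is a rough sketch under two assumptions that are not in the original answer: groups with fewer than two scores are skipped, and loc and scale are fixed at 0 and 1 as above. The name params is illustrative.

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    'id': ['100', '100', '100', '101', '101', '102'],
    'score': [0.5, 0.3, 0.2, 0.1, 0.2, 0.9],
})

params = {}
for key, scores in df.groupby('id')['score']:
    if len(scores) < 2:
        continue  # assumption: a single score is too little data to fit
    # With floc/fscale fixed, only the two shape parameters are estimated.
    a, b, _loc, _scale = stats.beta.fit(scores.to_numpy(), floc=0, fscale=1)
    params[key] = (float(a), float(b))
print(params)
```

This still calls the fit once per group; the loop only avoids materialising the intermediate dictionary of lists.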
I want to convert a Pandas DataFrame into separate dicts, where the names of the dicts are the column names and all dicts have the same index.
The dataframe looks like this:
cBmsExp cCncC cDnsWd
PlantName
A.gre 2.5 0.45 896.8
A.rig 2.5 0.40 974.9
A.tex 3.5 0.45 863.1
the result should be:
cBmsExp = {"A.gre":2.5, "A.rig": 2.5, "A.tex": 3.5}
cCncC = {"A.gre":0.45, "A.rig": 0.4, "A.tex": 0.45}
cDnsWd = {"A.gre":896.8, "A.rig": 974.9, "A.tex": 863.1}
I can't figure out how a column name can become the name of a variable in my Python code.
I went through piles of stack overflow questions and answers, but I didn't find this type of problem among them.
Suggestions for code are very much appreciated!
Creating variables from column names is not recommended; it is better to create a dict of dicts and select by key:
d = df.to_dict()
print (d)
{'cBmsExp': {'A.gre': 2.5, 'A.rig': 2.5, 'A.tex': 3.5},
'cCncC': {'A.gre': 0.45, 'A.rig': 0.4, 'A.tex': 0.45},
'cDnsWd': {'A.gre': 896.8, 'A.rig': 974.9, 'A.tex': 863.1}}
print (d['cBmsExp'])
{'A.gre': 2.5, 'A.rig': 2.5, 'A.tex': 3.5}
But it is possible, e.g. with globals():
for k, v in d.items():
    globals()[k] = v
print (cBmsExp)
{'A.gre': 2.5, 'A.rig': 2.5, 'A.tex': 3.5}
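If attribute-style access is the real goal, one middle ground between d['cBmsExp'] and writing into globals() could be types.SimpleNamespace from the standard library. This is a sketch rather than something the answer proposed, and it assumes every column name is a valid Python identifier; the name cols is illustrative.

```python
from types import SimpleNamespace

# Same nested dict that df.to_dict() produced above.
d = {
    'cBmsExp': {'A.gre': 2.5, 'A.rig': 2.5, 'A.tex': 3.5},
    'cCncC': {'A.gre': 0.45, 'A.rig': 0.4, 'A.tex': 0.45},
    'cDnsWd': {'A.gre': 896.8, 'A.rig': 974.9, 'A.tex': 863.1},
}

# Each column becomes an attribute; nothing is added to the module namespace.
cols = SimpleNamespace(**d)
print(cols.cBmsExp)
print(cols.cDnsWd['A.rig'])
```

The column dicts stay grouped in one object, so they cannot silently overwrite an existing variable.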
I want to write a specific key with its tuple of values to a CSV file using Python. I cannot currently use NumPy or any other external Python library. I am using "zip" to achieve this, but only the first value associated with the key is written, whereas I want to write all the values in the tuple.
A sample dictionary and code are provided below:
data = {
"Pakistan": (0.57, 0.05, 0.79),
"India": (0.47, 0.12, 0.54),
"Bangladesh": (0.49, 0.17, 0.81)
}
import csv
import re
con_name = input("Write up to three comma-separated countries for which you want to extract data: ")
count = len(re.findall(r'\w+', con_name))
if count == 1:
    con_check1 = con_name.split()[0]
    if con_check1.lower() in map(str.lower, data.keys()):
        con_check1 = con_check1.capitalize()
        x = list(data.keys()).index(con_check1)
        y = [key for key in data.keys()][x]
        csv_columns = ['Country Name','1997','1998','1999']
        with open('Emissions_subset.csv','w') as out:
            csv_out=csv.writer(out)
            csv_out.writerow(csv_columns)
            z = [y]
            csv_out.writerows(zip(z, data[con_check1]))
The current output in the CSV file:
Country Name, 1997, 1998, 1999
Pakistan 0.57
The desired output:
Country Name, 1997, 1998, 1999
Pakistan 0.57, 0.05, 0.79
Can you please help me with this issue? I have been asking some questions lately and nobody is answering me. I am really stuck here and only ask a question after I am exhausted of trying.
Try this:
[In] kv_list = [[key,*val] for key, val in data.items()]
[In] print(kv_list)
[Out] [['Pakistan', 0.57, 0.05, 0.79], ['India', 0.47, 0.12, 0.54], ['Bangladesh', 0.49, 0.17, 0.81]]
Then just use csv_out.writerows(kv_list).
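Putting that together for only the requested countries, a standard-library sketch could look like the following. The input string and the names requested, lookup and names are made up for illustration, and io.StringIO stands in for the real output file so the result can be printed.

```python
import csv
import io

data = {
    "Pakistan": (0.57, 0.05, 0.79),
    "India": (0.47, 0.12, 0.54),
    "Bangladesh": (0.49, 0.17, 0.81),
}
requested = "pakistan, Bangladesh, Narnia"  # hypothetical user input

# Case-insensitive lookup back to the dictionary's own spelling.
lookup = {name.lower(): name for name in data}
names = [
    lookup[part.strip().lower()]
    for part in requested.split(',')
    if part.strip().lower() in lookup  # unknown countries are skipped
]

out = io.StringIO()  # swap for open('Emissions_subset.csv', 'w', newline='')
writer = csv.writer(out)
writer.writerow(['Country Name', '1997', '1998', '1999'])
# [name, *values] flattens the key and its tuple into one CSV row.
writer.writerows([name, *data[name]] for name in names)
print(out.getvalue())
```

The original zip(z, data[con_check1]) stops after one pair because z has a single element, which is why only the first value was written.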
I am new to Pythonland and I have a question. I have a list as below and want to convert it into a dataframe.
I read on Stack Overflow that it is better to create a dictionary than a list, so I created one as follows.
column_names = ["name", "height" , "weight", "grade"] # Actual list has 10 entries
row_names = ["jack", "mick", "nick","pick"]
data = ['100','50','A','107','62','B'] # The actual list has 1640 entries
dic = {key:[] for key in column_names}
dic['name'] = row_names
t = 0
while t< len(data):
    dic['height'].append(data[t])
    t = t+3
t = 1
while t< len(data):
    dic['weight'].append(data[t])
    t = t+3
And so on: I have 10 columns, so I wrote the above code 10 times to complete the full dictionary. Then I convert it to a dataframe. It works perfectly fine, but there has to be a shorter way to do this. I don't know how to refer to a key of a dictionary by number. Should it be wrapped in a function? Also, how can I automate adding one to the value of t before executing the next loop? Please help me.
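You can iterate through column_names like this, skipping 'name' because it is already filled from row_names:
dic = {key:[] for key in column_names}
dic['name'] = row_names
data_columns = column_names[1:]  # every column except 'name'
for t, column_name in enumerate(data_columns):
    i = t
    while i< len(data):
        dic[column_name].append(data[i])
        i += len(data_columns)  # jump to the same column in the next row
enumerate automatically counts t up from 0 to len(data_columns)-1, so each column starts at its own offset in data.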
i = 0
while True:
    try:
        for j in column_names:
            d[j].append(data[i])
            i += 1
    except Exception as er:  # once i runs past the end of data, the error lands here and the loop stops
        print(er, "################")
        break
The first issue is that you have the data for all columns concatenated into a single list. You should first investigate how to prevent that and instead get a list of lists, with each column's values in a separate list, like [['100', '107'], ['50', '62'], ['A', 'B']]. Either way, you need this data structure to proceed efficiently:
cl_count = len(column_names)
d_count = len(data)
spl_data = [[data[j] for j in range(i, d_count, cl_count)] for i in range(cl_count)]
Then you can use a dict comprehension (available in Python 2.7 and in 3.x):
df = pd.DataFrame({j: spl_data[i] for i, j in enumerate(column_names)})
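The same column split can also be written with extended slicing, where data[i::n] takes every n-th item starting at offset i. This is a small sketch assuming, as in the list-of-lists example above, that column_names holds only the columns actually present in data (so 'name' is handled separately); the names n and columns are illustrative.

```python
column_names = ["height", "weight", "grade"]  # assumption: 'name' is not in data
data = ['100', '50', 'A', '107', '62', 'B']

n = len(column_names)
# Column i is every n-th value of the flat list, starting at position i.
columns = {name: data[i::n] for i, name in enumerate(column_names)}
print(columns)
# {'height': ['100', '107'], 'weight': ['50', '62'], 'grade': ['A', 'B']}
```

The resulting dict can be passed straight to pd.DataFrame in place of the comprehension above.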
First, we should understand what an ideal dictionary for a dataframe looks like.
A DataFrame can be thought of in two different ways:
One is a traditional collection of rows:
'row 0': ['jack', 100, 50, 'A'],
'row 1': ['mick', 107, 62, 'B']
However, there is a second representation that is more useful, though perhaps not as intuitive at first.
A collection of columns:
'name': ['jack', 'mick'],
'height': ['100', '107'],
'weight': ['50', '62'],
'grade': ['A', 'B']
Now, here is the key thing to realise: the second representation is more useful because it is the representation internally supported and used in dataframes.
It does not run into datatype conflicts within a single grouping (each column needs to have one fixed datatype). Across a row, however, datatypes can vary.
Also, operations can be performed easily and consistently on an entire column because of this consistency, which can't be guaranteed in a row.
So, tl;dr: DataFrames are essentially collections of equal-length columns.
So, a dictionary in that representation can be easily converted into a DataFrame.
column_names = ["name", "height" , "weight", "grade"] # Actual list has 10 entries
row_names = ["jack", "mick"]
data = [100, 50,'A', 107, 62,'B'] # The actual list has 1640 entries
So, with that in mind, the first thing to realize is that, in its current format, data is a very poor representation.
It is a collection of rows merged into a single list.
The first thing to do, if you're the one in control of how data is formed, is to not prepare it this way.
The goal is a list for each column, so ideally, prepare the data in that format.
Now, however, if it is given in this format, you need to iterate and collect the values accordingly. Here's a way to do it:
column_names = ["name", "height" , "weight", "grade"] # Actual list has 10 entries
row_names = ["jack", "mick"]
data = [100, 50,'A', 107, 62,'B'] # The actual list has 1640 entries
dic = {key:[] for key in column_names}
dic['name'] = row_names
print(dic)
Output so far:
{'height': [],
'weight': [],
'grade': [],
'name': ['jack', 'mick']} #so, now, names are a column representation with all correct values.
remaining_cols = column_names[1:]
#Explanations for the following part given at the end
data_it = iter(data)
for row in zip(*([data_it] * len(remaining_cols))):
    for i, val in enumerate(row):
        dic[remaining_cols[i]].append(val)
print(dic)
Output:
{'name': ['jack', 'mick'],
'height': [100, 107],
'weight': [50, 62],
'grade': ['A', 'B']}
And we are done with the representation
Finally:
import pandas as pd
df = pd.DataFrame(dic, columns = column_names)
print(df)
name height weight grade
0 jack 100 50 A
1 mick 107 62 B
Edit:
Some explanation for the zip part:
zip takes any number of iterables and allows us to iterate through them together.
data_it = iter(data) #prepares an iterator.
[data_it] * len(remaining_cols) #creates references to the same iterator
Here, this is similar to [data_it, data_it, data_it]
The * in *[data_it, data_it, data_it] allows us to unpack the list into 3 arguments for the zip function instead
so, f(*[data_it, data_it, data_it]) is equivalent to f(data_it, data_it, data_it) for any function f.
the magic here is that traversing through an iterator/advancing an iterator will now reflect the change across all references
Putting it all together:
zip(*([data_it] * len(remaining_cols))) will actually allow us to take 3 items from data at a time, and assign it to row
So, row = (100, 50, 'A') in first iteration of zip
for i, val in enumerate(row): #just iterate through the row, keeping index too using enumerate
    dic[remaining_cols[i]].append(val) #use indexes to access the correct list in the dictionary
Hope that helps.
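To see why the shared iterator matters, the zip step can be tried in isolation. This short sketch uses the same data list; the names refs, rows and independent are only for illustration.

```python
data = [100, 50, 'A', 107, 62, 'B']

data_it = iter(data)
refs = [data_it] * 3  # three references to ONE iterator object

# zip pulls one item from each argument per row; since all three
# arguments are the same iterator, each row consumes three items.
rows = list(zip(*refs))
print(rows)  # [(100, 50, 'A'), (107, 62, 'B')]

# With three separate iterators every argument starts from the
# beginning, so each item is simply repeated instead of chunked.
independent = list(zip(iter(data), iter(data), iter(data)))
print(independent[:2])  # [(100, 100, 100), (50, 50, 50)]
```

One thing to keep in mind: zip stops at the shortest argument, so if len(data) is not a multiple of the chunk size, the leftover items are silently dropped.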
If you are using Python 3.x, as suggested by l159, you can use a dict comprehension and then create a Pandas DataFrame out of it, using the names as row indexes:
data = ['100', '50', 'A', '107', '62', 'B', '103', '64', 'C', '105', '78', 'D']
column_names = ["height", "weight", "grade"]
row_names = ["jack", "mick", "nick", "pick"]
df = pd.DataFrame.from_dict(
    {
        row_label: {
            column_label: data[i * len(column_names) + j]
            for j, column_label in enumerate(column_names)
        } for i, row_label in enumerate(row_names)
    },
    orient='index'
)
The intermediate dictionary is a nested dictionary: the keys of the outer dictionary are the row labels (in this case the items of the row_names list); the value associated with each key is a dictionary whose keys are the column labels (i.e., the items in column_names) and whose values are the corresponding elements in the data list.
The function from_dict is used to create the DataFrame instance.
So, the previous code produces the following result:
height weight grade
jack 100 50 A
mick 107 62 B
nick 103 64 C
pick 105 78 D
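A variation that skips the nested dictionary is to cut the flat list into row-sized chunks and hand them to the DataFrame constructor together with the labels. This is a sketch of an alternative, not the approach described above; the names n and rows are illustrative.

```python
import pandas as pd

data = ['100', '50', 'A', '107', '62', 'B', '103', '64', 'C', '105', '78', 'D']
column_names = ["height", "weight", "grade"]
row_names = ["jack", "mick", "nick", "pick"]

n = len(column_names)
# Slice the flat list into consecutive rows of n values each.
rows = [data[i:i + n] for i in range(0, len(data), n)]

df = pd.DataFrame(rows, index=row_names, columns=column_names)
print(df)
```

Note that the values remain strings here, as in the original data list; converting height and weight to numbers would be a separate step.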
I am currently working on an assignment where I need to convert a nested list to a dictionary, and I have to separate the codes from the nested list below.
data = [['ABC', "Tel", "12/07/2017", 1.5, 1000],['ACE', "S&P", "12/08/2017", 3.2, 2000],['AEB', "ENG", "04/03/2017", 1.4, 3000]]
to get this
Code Name Purchase Date Price Volume
ABC Tel 12/07/2017 1.5 1000
ACE S&P 12/08/2017 3.2 2000
AEB ENG 04/03/2017 1.4 3000
so the remaining values are still in a list, but tagged to codes as keys.
Could anyone advise on this, please? Thank you!
You can use a dictcomp:
keys = ['Code','Name','Purchase Date','Price','Volume']
{k: v for k, *v in zip(keys, *data)}
Result:
{'Code': ['ABC', 'ACE', 'AEB'],
'Name': ['Tel', 'S&P', 'ENG'],
'Purchase Date': ['12/07/2017', '12/08/2017', '04/03/2017'],
'Price': [1.5, 3.2, 1.4],
'Volume': [1000, 2000, 3000]}
You can use pandas dataframe for that:
import pandas as pd
data = [['ABC', "Tel", "12/07/2017", 1.5, 1000],['ACE', "S&P", "12/08/2017", 3.2, 2000],['AEB', "ENG", "04/03/2017", 1.4, 3000]]
columns = ["Code","Name","Purchase Date","Price","Volume"]
df = pd.DataFrame(data, columns=columns)
print(df)
I assume that by dictionaries you mean a list of dictionaries, each representing a row with the header as its keys.
You can do that like this:
keys = ['Code','Name','Purchase Date','Price','Volume']
dictionaries = [ dict(zip(keys,row)) for row in data ]
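If the question is read literally (the remaining values stay in a list, tagged to the code as key), a third reading is a dict keyed by the code. This is a sketch under that interpretation, assuming codes are unique; the name by_code is illustrative.

```python
data = [
    ['ABC', "Tel", "12/07/2017", 1.5, 1000],
    ['ACE', "S&P", "12/08/2017", 3.2, 2000],
    ['AEB', "ENG", "04/03/2017", 1.4, 3000],
]

# Star-unpacking peels off the first item as the key and keeps
# the rest of each row as a list.
by_code = {code: rest for code, *rest in data}
print(by_code['ACE'])  # ['S&P', '12/08/2017', 3.2, 2000]
```

If two rows shared a code, the later row would overwrite the earlier one.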