I have a dictionary with unique ID and [sample distribution of scores] pairs, e.g.: '100': [0.5, 0.6, 0.2, 0.7, 0.3]. The arrays are not all the same length.
For each item/'scores' array in my dictionary, I want to fit a beta distribution with scipy.stats.beta.fit() over the distribution of scores and get the alpha/beta parameters for each sample. I then want these in a new dictionary, so it'd be like '101': (1.5, 1.8).
I know I could do this by iterating over my dictionary with a for-loop, but the dictionary is pretty massive, so I'd like to know if there's a more computationally efficient way of doing it.
For context, the way I get this dictionary is from a pandas dataframe, where I do:
my_dictionary = df.groupby('unique_id')['score'].apply(list).to_dict()
The df looks like this, for example:
df = pd.DataFrame({
    'id': ['100', '100', '100', '101', '101', '102'],
    'score': [0.5, 0.3, 0.2, 1, 0.2, 0.9]
})
And then the resulting dictionary looks like:
{'100': [0.5, 0.3, 0.2], '101': [1, 0.2], '102': [0.9]}
Is there maybe also a way of fitting the beta distribution straight from the df.groupby level, without having to convert it into a dictionary first and then loop over the dictionary with scipy? Like, is there something where I could do:
df.groupby('unique_id')['score'].apply(stats.beta.fit()).to_dict()
...or something like that?
Try this:
from scipy.stats import beta

df = df.groupby('id').apply(lambda x: list(beta.fit(x.score)))
dc = df.to_dict()
Output:
df
id
100 [0.2626434905176847, 0.37866242902872393, 0.18...
101 [1.253982875508286, 0.8832540117966552, -0.093...
102 [1.044551187075241, 1.0167687597781938, 0.8999...
dtype: object
dc
{'100': [0.2626434905176847, 0.37866242902872393, 0.18487097639113187, 0.3151290236088682],
'101': [1.253982875508286, 0.8832540117966552, -0.09383386122371801, 1.0938338612237182],
'102': [1.044551187075241, 1.0167687597781938, 0.8999999999999999, 1.1272504901983386e-16]}
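Note that beta.fit returns four values (a, b, loc, scale), which is why each entry above is a list of four numbers. If you only want the (alpha, beta) pair from the question, a small variation works; a sketch, assuming the original df before it is overwritten above:
from scipy.stats import beta

# fit per group, keeping only the first two fitted parameters (alpha, beta)
dc = df.groupby('id')['score'].apply(lambda s: beta.fit(s)[:2]).to_dict()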
As I understand it, you need a separate beta.fit per row of the dataframe df (this assumes each cell in 'score' holds a list of scores):
from scipy import stats

df['beta_fit'] = df['score'].apply(stats.beta.fit)
The result is now stored in df['beta_fit']:
0 (0.5158954356434775, 0.4824876600627905, 0.154...
1 (0.18219650169013427, 0.18228236200252418, 0.1...
2 (2.874609362944296, 0.8497751096020354, -0.341...
3 (1.313976940871222, 0.5956397575363881, -0.093...
Name: beta_fit, dtype: object
If you want to keep the location (loc) and scale (scale) fixed, you need to indicate this in scipy.stats.beta.fit. You can use functools.partial for this:
>>> import pandas as pd
>>> import scipy.stats
>>> from functools import partial
>>> df = pd.DataFrame({
... 'id': ['100', '100', '100', '101', '101', '102'],
... 'score' : [0.5, 0.3, 0.2, 0.1, 0.2, 0.9]
... })
>>> beta = partial(scipy.stats.beta.fit, floc=0, fscale=1)
>>> df.groupby('id')['score'].apply(beta)
id
100 (4.82261025047374, 9.616623800842953, 0, 1)
101 (0.7079910251948778, 0.910200073771759, 0, 1)
Name: score, dtype: object
Note that I have adjusted your input example, since it contained an out-of-range value (1.0), and in some cases too few values for the fit to succeed.
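To get the dictionary of (alpha, beta) pairs the question asks for, you can drop the fixed loc/scale from each tuple and call to_dict(). A minimal sketch, reusing the beta partial defined above:
# keep only (alpha, beta); loc and scale were fixed at 0 and 1 by the partial
params = df.groupby('id')['score'].apply(beta).apply(lambda t: t[:2]).to_dict()
# e.g. {'100': (4.82..., 9.62...), '101': (0.71..., 0.91...)}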
Related
I want to convert a Pandas DataFrame into separate dicts, where the names of the dicts are the column names and all dicts have the same index.
the dataframe looks like this:
cBmsExp cCncC cDnsWd
PlantName
A.gre 2.5 0.45 896.8
A.rig 2.5 0.40 974.9
A.tex 3.5 0.45 863.1
the result should be:
cBmsExp = {"A.gre":2.5, "A.rig": 2.5, "A.tex": 3.5}
cCncC = {"A.gre":0.45, "A.rig": 0.4, "A.tex": 0.45}
cDnsWd = {"A.gre":896.8, "A.rig": 974.9, "A.tex": 863.1}
I can't figure out how a column name can become the name of a variable in my Python code.
I went through piles of stack overflow questions and answers, but I didn't find this type of problem among them.
Suggestions for code are very much appreciated!
Creating variables from column names is not recommended; it is better to create a dict of dicts and select by keys:
d = df.to_dict()
print (d)
{'cBmsExp': {'A.gre': 2.5, 'A.rig': 2.5, 'A.tex': 3.5},
'cCncC': {'A.gre': 0.45, 'A.rig': 0.4, 'A.tex': 0.45},
'cDnsWd': {'A.gre': 896.8, 'A.rig': 974.9, 'A.tex': 863.1}}
print (d['cBmsExp'])
{'A.gre': 2.5, 'A.rig': 2.5, 'A.tex': 3.5}
But it is possible, e.g. with globals():
for k, v in d.items():
    globals()[k] = v
print (cBmsExp)
{'A.gre': 2.5, 'A.rig': 2.5, 'A.tex': 3.5}
I want to write a specific key with tuple values to a CSV file using Python. I cannot currently use numpy or any other external Python library. I am using zip to achieve this, but only the first value associated with the key is getting written, whereas I want to write all the values in the tuple.
A sample dictionary and code are provided below:
import csv
import re

data = {
    "Pakistan": (0.57, 0.05, 0.79),
    "India": (0.47, 0.12, 0.54),
    "Bangladesh": (0.49, 0.17, 0.81)
}

con_name = input("Write up to three comma-separated countries for which you want to extract data: ")
count = len(re.findall(r'\w+', con_name))

if count == 1:
    con_check1 = con_name.split()[0]
    if con_check1.lower() in map(str.lower, data.keys()):
        con_check1 = con_check1.capitalize()
        x = list(data.keys()).index(con_check1)
        y = [key for key in data.keys()][x]
        csv_columns = ['Country Name', '1997', '1998', '1999']
        with open('Emissions_subset.csv', 'w') as out:
            csv_out = csv.writer(out)
            csv_out.writerow(csv_columns)
            z = [y]
            csv_out.writerows(zip(z, data[con_check1]))
The current output in the CSV file:
Country Name, 1997, 1998, 1999
Pakistan 0.57
The desired output:
Country Name, 1997, 1998, 1999
Pakistan 0.57, 0.05, 0.79
Can you please help me with this issue? I have been asking some questions lately and nobody is answering me. I am really stuck here and only ask a question after I have exhausted my own attempts.
Try this:
[In] kv_list = [[key,*val] for key, val in data.items()]
[In] print(kv_list)
[Out] [['Pakistan', 0.57, 0.05, 0.79], ['India', 0.47, 0.12, 0.54], ['Bangladesh', 0.49, 0.17, 0.81]]
Then just use csv_out.writerows(kv_list).
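Putting it together with the question's setup (a sketch; the filename and header row are taken from the question):
import csv

data = {
    "Pakistan": (0.57, 0.05, 0.79),
    "India": (0.47, 0.12, 0.54),
    "Bangladesh": (0.49, 0.17, 0.81)
}

# one row per country: the key followed by the unpacked tuple values
kv_list = [[key, *val] for key, val in data.items()]

with open('Emissions_subset.csv', 'w', newline='') as out:
    csv_out = csv.writer(out)
    csv_out.writerow(['Country Name', '1997', '1998', '1999'])
    csv_out.writerows(kv_list)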
I have a dataframe and want to convert it to a list of dictionaries. I use read_csv() to create this dataframe. The dataframe looks like the following:
AccountName AccountType StockName Allocation
0 MN001 #1 ABC 0.4
1 MN001 #1 ABD 0.6
2 MN002 #2 EFG 0.5
3 MN002 #2 HIJ 0.4
4 MN002 #2 LMN 0.1
The desired output:
[{'ABC':0.4, 'ABD':0.6}, {'EFG':0.5, 'HIJ':0.4,'LMN':0.1}]
I have tried researching similar topics and used the DataFrame.to_dict() function. I look forward to getting this done. Many thanks for your help!
import pandas as pd
import numpy as np
d = np.array([['MN001','#1','ABC', 0.4],
['MN001','#1','ABD', 0.6],
['MN002', '#2', 'EFG', 0.5],
['MN002', '#2', 'HIJ', 0.4],
['MN002', '#2', 'LMN', 0.1]])
df = pd.DataFrame(data=d, columns = ['AccountName','AccountType','StockName', 'Allocation'])
by_account_df = df.groupby('AccountName').apply(lambda x: dict(zip(x['StockName'], x['Allocation']))).reset_index(name='dic')
by_account_lst = by_account_df['dic'].values.tolist()
And the result should be:
print(by_account_lst)
[{'ABC': '0.4', 'ABD': '0.6'}, {'EFG': '0.5', 'HIJ': '0.4', 'LMN': '0.1'}]
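Note that the dict values here come out as strings ('0.4' rather than 0.4), because building the data with np.array coerces every column to strings. If you want floats, a minimal fix is to convert the column before grouping:
# cast Allocation back to float so the resulting dicts hold numbers, not strings
df['Allocation'] = df['Allocation'].astype(float)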
This should do it:
portfolios = []
for _, account in df.groupby('AccountName'):
portfolio = {stock['StockName']: stock['Allocation']
for _, stock in account.iterrows()}
portfolios.append(portfolio)
First use the groupby() function to group the rows of the dataframe by AccountName. To access the individual rows (stocks) for each account, you use the iterrows() method. As user @ebb-earl-co explained in the comments, the _ is there as a placeholder variable, because iterrows() returns (index, Series) tuples, and we only need the Series (the rows themselves). From there, use a dict comprehension to create a dictionary mapping StockName -> Allocation for each stock. Finally, append that dictionary to the list of portfolios, resulting in the expected output:
[{'ABC': 0.4, 'ABD': 0.6}, {'EFG': 0.5, 'HIJ': 0.4, 'LMN': 0.1}]
One more thing: if you decide later that you want to label each dict in the portfolios with the account name, you could do it like this:
portfolios = []
for acct_name, account in df.groupby('AccountName'):
portfolio = {stock['StockName']: stock['Allocation']
for _, stock in account.iterrows()}
portfolios.append({acct_name: portfolio})
This will return a list of nested dicts like this:
[{'MN001': {'ABC': 0.4, 'ABD': 0.6}},
{'MN002': {'EFG': 0.5, 'HIJ': 0.4, 'LMN': 0.1}}]
Note that in this case, I used the variable acct_name instead of assigning to _ because we actually will use the index to "label" the dicts in the portfolios list.
I am currently working on an assignment where I need to convert a nested list to a dictionary, separating out the codes from the nested list below.
data = [['ABC', "Tel", "12/07/2017", 1.5, 1000],['ACE', "S&P", "12/08/2017", 3.2, 2000],['AEB', "ENG", "04/03/2017", 1.4, 3000]]
to get this
Code Name Purchase Date Price Volume
ABC Tel 12/07/2017 1.5 1000
ACE S&P 12/08/2017 3.2 2000
AEB ENG 04/03/2017 1.4 3000
so the remaining values are still in a list, but tagged to codes as keys.
Could anyone advise on this please? Thank you!
You can use a dict comprehension:
keys = ['Code','Name','Purchase Date','Price','Volume']
{k: v for k, *v in zip(keys, *data)}
Result:
{'Code': ['ABC', 'ACE', 'AEB'],
'Name': ['Tel', 'S&P', 'ENG'],
'Purchase Date': ['12/07/2017', '12/08/2017', '04/03/2017'],
'Price': [1.5, 3.2, 1.4],
'Volume': [1000, 2000, 3000]}
You can use a pandas DataFrame for that:
import pandas as pd
data = [['ABC', "Tel", "12/07/2017", 1.5, 1000],['ACE', "S&P", "12/08/2017", 3.2, 2000],['AEB', "ENG", "04/03/2017", 1.4, 3000]]
columns = ["Code","Name","Purchase Date","Price","Volume"]
df = pd.DataFrame(data, columns=columns)
print(df)
I assume that by dictionaries you mean a list of dictionaries, each representing a row with the header as its keys.
You can do that like this:
keys = ['Code','Name','Purchase Date','Price','Volume']
dictionaries = [ dict(zip(keys,row)) for row in data ]
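If you instead want the codes themselves as keys, with the remaining values of each row kept in a list (the output the question describes), a similar sketch works:
# code -> list of the remaining row values
by_code = {code: rest for code, *rest in data}
# {'ABC': ['Tel', '12/07/2017', 1.5, 1000], ...}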
I am trying to get a vector of specific dictionary values which are in a numpy array. Here is what the array looks like:
import numpy as np
edge_array = np.array(
[[1001, 7005, {'lanes': 9, 'length': 0.35, 'type': '99', 'modes': 'cw'}],
[1001, 8259, {'lanes': 10, 'length': 0.46, 'type': '99', 'modes': 'cw'}],
[1001, 14007, {'lanes': 7, 'length': 0.49, 'type': '99', 'modes': 'cw'}]])
I have a vector for the first two values of each row (i.e. 1001 and 7005), but I need another vector for the values associated with 'lanes'.
Here is my code so far:
row_idx = edge_array[:, 0]
col_idx = edge_array[:, 1]
lane_values = edge_array[:, 2['lanes']]
The error I get is as follows:
lane_values = edge_array[:, 2['lanes']]
TypeError: 'int' object has no attribute '__getitem__'
Please let me know if you need any further clarification, thanks!
The subexpression 2['lanes'] does not make sense: you are indexing into the number 2.
Instead, try:
[rec['lanes'] for rec in edge_array[:, 2]]
Or:
import operator
list(map(operator.itemgetter('lanes'), edge_array[:, 2]))
Either way you get a regular Python list; if you want a NumPy array, you'll have to call np.array() on the list.
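For instance, a minimal sketch using the list-comprehension form:
import numpy as np

# pull the 'lanes' entry out of each row's dict and pack them into an array
lane_values = np.array([rec['lanes'] for rec in edge_array[:, 2]])
# -> array([ 9, 10,  7])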
But the better solution here is to transform your data into a "structured array" which has named columns and then you can index efficiently by name. If your array has many rows, this will have a big impact on efficiency.
This is not a fully working example, which makes it hard to work with, and the types are unclear; I suspect you are working with numpy, but it is hard to tell.
In any case, indexing with 2['something'] is incorrect, and the error tells you why: you are trying to index an integer with a key. Look up how indexing is done in Python/NumPy.
But this is how you could extract your 'lanes':
list(map(lambda x: x['lanes'], edge_array[:, 2]))
# OR (if you want a vector/np-array)
vec_of_lanes = np.array(list(map(lambda x: x['lanes'], edge_array[:, 2])))
More in numpy-style:
vec_of_lanes = np.apply_along_axis(lambda x: x[2]['lanes'], 1, edge_array)
@Zwinck suggested a structured array. Here's one way of doing that.
Define a dtype for the dictionary part. It has fields with different dtypes:
dt1 = np.dtype([('lanes', int), ('length', float), ('type', 'S2'), ('modes', 'S2')])
Embed that dtype in a larger one. I used a sub-array format for the first 2 values:
dt = np.dtype([('f0', int, (2,)), ('f1', dt1)])
Now create the array. I edited your expression to fit dt; the mix of tuples and lists is important. I could have transferred the data from your object array instead; see the sketch at the end.
edge_array1 = np.array(
    [([1001, 7005], (9, 0.35, '99', 'cw')),
     ([1001, 8259], (10, 0.46, '99', 'cw')),
     ([1001, 14007], (7, 0.49, '99', 'cw'))], dtype=dt)
Now the 2 int values can be accessed by the 'f0' field name:
In [513]: edge_array1['f0']
Out[513]:
array([[ 1001, 7005],
[ 1001, 8259],
[ 1001, 14007]])
while 'lanes' are accessed by a double application of field name indexing (since they are a field within the field):
In [514]: edge_array1['f1']['lanes']
Out[514]: array([ 9, 10, 7])
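And for the transfer from the original object array that was left as a todo above, something like this sketch should work (assuming edge_array, dt1 and dt as defined in this answer):
# build one (sub-array, nested-tuple) record per row of the object array
records = [(row[:2].tolist(), tuple(row[2][name] for name in dt1.names))
           for row in edge_array]
edge_array1 = np.array(records, dtype=dt)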