I want to get values from a dict that looks like this:
pair_devices_count =
{('tWAAAA.jg', 'ttNggB.jg'): 1,
('tWAAAM.jg', 'ttWVsM.jg'): 2,
('tWAAAN.CV', 'ttNggB.AS'): 1,
('tWAAAN.CV', 'ttNggB.CV'): 2,
('tWAAAN.CV', 'ttNggB.QG'): 1}
(the keys are pairs of domains). But when I use
train_data[['domain', 'target_domain']].apply(lambda x: pair_devices_count.get((x), 0))
it raises an error, because pandas Series are not hashable. How can I look up the dict values to generate the column train['pair_devices_count']?
By default, .apply on a DataFrame works column by column, so each x is an entire Series. Apply row-wise with axis=1 instead:
train_data.apply(lambda x: pair_devices_count[(x.domain, x.target_domain)], axis=1)
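Note that indexing with [] raises a KeyError for pairs missing from the dict; to keep the default of 0 from the question, a variant using .get (a sketch) is:
train_data['pair_devices_count'] = train_data.apply(
    lambda x: pair_devices_count.get((x.domain, x.target_domain), 0), axis=1
)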
Since pandas Series are not hashable, convert the pd.Series to a tuple before calling .get. Consider the following simple example:
import pandas as pd
d = {('A','A'):1,('A','B'):2,('A','C'):3}
df = pd.DataFrame({'X':['A','A','A'],'Y':['C','B','A'],'Z':['X','Y','Z']})
df['d'] = df[['X','Y']].apply(lambda x: d.get(tuple(x)), axis=1)
print(df)
Output:
X Y Z d
0 A C X 3
1 A B Y 2
2 A A Z 1
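For large frames, a vectorized alternative (a sketch, assuming the column names from the question, and pandas 0.24+ for MultiIndex.from_frame) is to turn the dict into a Series with a MultiIndex and reindex it against the pairs:
import pandas as pd
s = pd.Series(pair_devices_count)  # tuple keys become a MultiIndex
idx = pd.MultiIndex.from_frame(train_data[['domain', 'target_domain']])
train_data['pair_devices_count'] = s.reindex(idx).fillna(0).to_numpy()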
Related
I have a function which does a bunch of stuff and returns pandas dataframes. The dataframes are extracted from a dynamic list, hence I'm using the method below to return them.
As soon as I call the function (code in the 2nd block), my Jupyter notebook runs the cell endlessly, like an infinite loop. Any idea how I can do this more efficiently?
def funct(x):
    # some code which creates multiple dataframes df1, df2, ...
    i = 0
    for k in range(len(dynamic_list)):
        i += 1
    return globals()["df" + str(i)]
The next thing I do is call the function and try to assign it dynamically,
i = 0
for k in range(len(dynamic_list)):
    i += 1
    globals()["new_df" + str(i)] = funct(x)
I have tried returning selected dataframes from the first function, and that works just fine, e.g.:
def funct(x):
    # some code creating df1, df2, df3, ..., df_n
    return df1, df2
new_df1, new_df2 = funct(x)
For each dataframe object your code creates, you can simply add it to a dictionary, using a key from your dynamic list.
Here is a simple example:
import pandas as pd
test_data = {"key1":[1, 2, 3], "key2":[1, 2, 3], "key3":[1, 2, 3]}
df = pd.DataFrame.from_dict(test_data)
The dataframe looks like this:
key1 key2 key3
0 1 1 1
1 2 2 2
2 3 3 3
I have used a fixed list of values to focus on, but this can be dynamic, depending on how you are creating them.
values_of_interest_list = [1, 3]
Now we can do whatever we want with the dataframe. In this instance, I want to keep only the rows whose value appears in our list.
data_dict = {}
for value_of_interest in values_of_interest_list:
    x_df = df[df["key1"] == value_of_interest]
    data_dict[value_of_interest] = x_df
To see what we have, we can print out the created dictionary that contains the key we have assigned and the associated dataframe object.
for key, value in data_dict.items():
    print(type(key))
    print(type(value))
Which returns
<class 'int'>
<class 'pandas.core.frame.DataFrame'>
<class 'int'>
<class 'pandas.core.frame.DataFrame'>
Full sample code is below:
import pandas as pd

test_data = {"key1": [1, 2, 3], "key2": [1, 2, 3], "key3": [1, 2, 3]}
df = pd.DataFrame.from_dict(test_data)

values_of_interest_list = [1, 3]

# Dictionary for data
data_dict = {}

# Loop through the values of interest
for value_of_interest in values_of_interest_list:
    x_df = df[df["key1"] == value_of_interest]
    data_dict[value_of_interest] = x_df

for key, value in data_dict.items():
    print(type(key))
    print(type(value))
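To retrieve an individual dataframe later, just index the dictionary by its key:
print(data_dict[1])  # rows where key1 == 1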
In one column, I have 4 possible (non-sequential) values: A, 2, +, ?, and I want to order the rows according to the custom sequence 2, ?, A, +. I followed some code I found online:
order_by_custom = pd.CategoricalDtype(['2', '?', 'A', '+'], ordered=True)
df['column_name'].astype(order_by_custom)
df.sort_values('column_name', ignore_index=True)
But for some reason, although it does sort, it still sorts by alphabetical (or binary) position rather than by the order I entered in the order_by_custom object.
Any ideas?
.astype returns a new Series after conversion, but you did not do anything with the result. Assign it back to your df. Consider the following example:
import pandas as pd
df = pd.DataFrame({'orderno':[1,2,3],'custom':['X','Y','Z']})
order_by_custom = pd.CategoricalDtype(['Z', 'Y', 'X'], ordered=True)
df['custom'] = df['custom'].astype(order_by_custom)
print(df.sort_values('custom'))
Output:
orderno custom
2 3 Z
1 2 Y
0 1 X
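Applied to the values from the question (a sketch with made-up sample data):
import pandas as pd
df = pd.DataFrame({'column_name': ['A', '+', '2', '?']})
order_by_custom = pd.CategoricalDtype(['2', '?', 'A', '+'], ordered=True)
df['column_name'] = df['column_name'].astype(order_by_custom)
print(df.sort_values('column_name', ignore_index=True))
This prints the rows in the order 2, ?, A, +.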
You can use a custom dictionary to sort it. For example, the dictionary would be:
my_custom_dict = {'2': 0, '?': 1, 'A': 2, '+' : 3}
If your column name is "my_column_name" then,
df.sort_values(by=['my_column_name'], key=lambda x: x.map(my_custom_dict))
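A quick self-contained check of this approach (a sketch, with the column name assumed from the question):
import pandas as pd
df = pd.DataFrame({'my_column_name': ['A', '+', '2', '?']})
my_custom_dict = {'2': 0, '?': 1, 'A': 2, '+': 3}
print(df.sort_values(by=['my_column_name'], key=lambda x: x.map(my_custom_dict)))
Note that the key argument to sort_values requires pandas 1.1.0 or later.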
I have the following dataframe (shown as an image in the original post), and I made a dictionary of dataframes, one per unique appId, with this command:
dfs = dict(tuple(timeseries.groupby('appId')))
After that I want to remove all groups with fewer than 30 rows from my dataframe. I removed those entries from my dictionary (dfs), and then I tried this code:
pd.concat([dfs]).drop_duplicates(keep=False)
but it doesn't work.
I believe you need GroupBy.transform with 'size' and then filter by boolean indexing:
df = pd.concat(dfs)
df = df[df.groupby('appId')['appId'].transform('size') >= 30]
# alternative 1
# df = df[df.groupby('appId')['appId'].transform('size').ge(30)]
# alternative 2 (slower on large data)
# df = df.groupby('appId').filter(lambda x: len(x) >= 30)
Another approach is to filter the dictionary:
dfs = {k: v for k, v in dfs.items() if len(v) >= 30}
EDIT: better to filter the original dataframe first, and only then build the dictionary:
timeseries = timeseries[timeseries.groupby('appId')['appId'].transform('size') >= 30]
dfs = dict(tuple(timeseries.groupby('appId')))
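A minimal check of this pattern on toy data (a sketch; the threshold is lowered to 2 for the demo):
import pandas as pd
timeseries = pd.DataFrame({'appId': ['a', 'a', 'b'], 'value': [1, 2, 3]})
timeseries = timeseries[timeseries.groupby('appId')['appId'].transform('size') >= 2]
dfs = dict(tuple(timeseries.groupby('appId')))
print(list(dfs))  # ['a'] -- 'b' has only one row and is dropped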
In a simplified dataframe:
import pandas as pd
df1 = pd.DataFrame({'350': [7.898167, 6.912074, 6.049002, 5.000357, 4.072320],
'351': [8.094912, 7.090584, 6.221289, 5.154516, 4.211746],
'352': [8.291657, 7.269095, 6.393576, 5.308674, 4.351173],
'353': [8.421007, 7.374317, 6.496641, 5.403691, 4.439815],
'354': [8.535562, 7.463452, 6.584512, 5.485725, 4.517310],
'355': [8.650118, 7.552586, 6.672383, 4.517310, 4.594806]},
index=[1, 2, 3, 4, 5])
int_range = df1.columns.astype(float)
a = 0.005
b = 0.837
I would like to solve an equation which is attached as an image in the original post; judging from the code below, it amounts to INT = (1/(a*b)) * ∫ I(x)·x dx.
I is equal to the values in the dataframe. x is the int_range values, so in this case from 350 to 355, with dx=1.
a and b are optional constants.
I need a dataframe with one value per row as output.
For now I do something like this, but I'm not sure it's correct:
from scipy import integrate

dict_INT = {}
for index, row in df1.iterrows():
    func = df1.loc[index] * df1.loc[index].index.astype('float')
    x = df1.loc[index].index.astype('float')
    dict_INT[index] = integrate.trapz(func, x)

df_out = pd.DataFrame(dict_INT, index=['INT']).T
df_fin = df_out / (a * b)
This is the final sum I get per row:
1 3.505796e+06
2 3.068796e+06
3 2.700446e+06
4 2.199336e+06
5 1.840992e+06
I solved this by first converting the dataframe to a dict, then applying your equation to each item in each row, and writing the values to a dict of lists using collections.defaultdict. I will break it down:
import pandas as pd
from collections import defaultdict
df1 = pd.DataFrame({'350': [7.898167, 6.912074, 6.049002, 5.000357, 4.072320],
'351': [8.094912, 7.090584, 6.221289, 5.154516, 4.211746],
'352': [8.291657, 7.269095, 6.393576, 5.308674, 4.351173],
'353': [8.421007, 7.374317, 6.496641, 5.403691, 4.439815],
'354': [8.535562, 7.463452, 6.584512, 5.485725, 4.517310],
'355': [8.650118, 7.552586, 6.672383, 4.517310, 4.594806]},
index=[1, 2, 3, 4, 5]
)
int_range = df1.columns.astype(float)
a = 0.005
b = 0.837
dx = 1
df_dict = df1.to_dict() # convert df to dict for easier operations
integrated_dict = {} # initialize empty dict
d = defaultdict(list) # initialize empty dict of lists for tuples later
integrated_list = []
for k, v in df_dict.items():  # unpack df dict of dicts
    for x, y in v.items():  # x is the index, y is the cell value for column k
        integrated_list.append((k, (float(k) * float(y) * float(dx)) / (a * b)))  # store a list of tuples

for x, y in integrated_list:  # column header as key, calculated value as value
    d[x].append(y)

d = {k: tuple(v) for k, v in d.items()}  # unpack to multiple values
integrated_df = pd.DataFrame.from_dict(d)  # to df
integrated_df['Sum'] = integrated_df.iloc[:, :].sum(axis=1)
output (updated to include sum):
350 351 352 353 354 \
0 660539.653524 678928.103226 697410.576822 710302.382557 722004.527599
1 578070.704898 594694.141935 611402.972521 622015.269056 631317.086738
2 505890.250896 521785.529032 537763.142652 547984.294624 556969.473835
3 418189.952210 432314.245161 446512.126165 455795.202628 464025.483871
4 340576.344086 353243.212903 365976.797133 374493.356033 382109.376344
355 Sum
0 733761.502987 4.202947e+06
1 640661.416965 3.678162e+06
2 565996.646356 3.236389e+06
3 383188.781362 2.600026e+06
4 389762.516129 2.206162e+06
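For reference, the trapezoidal loop from the question can also be vectorized without iterrows (a sketch, assuming the same df1, a and b as above):
import numpy as np
x = df1.columns.astype(float).to_numpy()
integrals = np.trapz(df1.to_numpy() * x, x, axis=1)  # trapezoidal rule per row
df_fin = pd.Series(integrals, index=df1.index) / (a * b)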
How can you combine multiple columns from a dataframe into a list?
Input:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 7), columns=list('ABCDEFG'))
If I wanted to create a list from column A I would perform:
df1 = df['A'].tolist()
But if I wanted to combine numerous columns into this list, it wouldn't be efficient to write df['A','B','C'...'Z'].tolist().
I have tried the following, but it just adds the column headers to a list:
df1 = list(df.columns)[0:8]
Sample input:
A B C D E F G
0 0.787576 0.646178 -0.561192 -0.910522 0.647124 -1.388992 0.728360
1 0.265409 -1.919283 -0.419196 -1.443241 -2.833812 -1.066249 0.553379
2 0.343384 0.659273 -0.759768 0.355124 -1.974534 0.399317 -0.200278
Intended Output:
[0.787576, 0.646178, -0.561192, -0.910522, 0.647124, -1.388992, 0.728360,
0.265409, -1.919283, -0.419196, -1.443241, -2.833812, -1.066249, 0.553379,
0.343384, 0.659273, -0.759768, 0.355124, -1.974534, 0.399317, -0.200278]
Is this what you are looking for?
lst = df.values.tolist()
flat_list = [item for x in lst for item in x]
print(flat_list)
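A vectorized alternative (a sketch) flattens the underlying NumPy array directly and avoids the Python-level loop:
flat_list = df.to_numpy().ravel().tolist()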
You can use to_dict:
df = pd.DataFrame(np.random.randn(10, 10), columns=list('ABCDEFGHIJ'))
df.to_dict('l')
Out[1036]:
{'A': [-0.5611441440595607,
-0.3785906500723589,
-0.19480328695097676,
-0.7472526275034221,
-2.4232786057647457,
0.10506614562827334,
0.4968179288412277,
1.635737019365132,
-1.4286421753281746,
0.4973223222844811],
'B': [-1.0550082961139444,
-0.1420067090193365,
0.30130476834580633,
1.1271866812852227,
0.38587456174846285,
-0.531163142682951,
-1.1335754634118729,
0.5975963084356348,
-0.7361022807495443,
1.4329395663140427],
...}
Or add .values.tolist():
df[list('ABC')].values.tolist()
Out[1041]:
[[0.09552771302434987, 0.18551596484768904, -0.5902249875268607],
[-1.5285190712746388, 1.2922627021799646, -0.8347422966138306],
[-0.4092028716404067, -0.5669107267579823, 0.3627970727410332],
[-1.3546346273319263, -0.9352316948439341, 1.3568726575880614],
[-1.3509518030469496, 0.10487182694997808, -0.6902134363370515]]
Edit: np.concatenate(df[list('ABC')].T.values.tolist()) flattens everything into one list; note the .T makes the result column-ordered (all of A, then B, then C). Drop the .T to match the row order of the intended output.