I need to iterate over a very large dataframe (10 million x 70); a vectorized operation is not possible. Both df.iterrows and direct access with df.loc[i, col] are far too slow. In the past I would first convert the dataframe to a dictionary of dictionaries, which allows me to iterate very quickly. However, this method takes up a lot of memory and is no longer feasible for my current data.
I need to sacrifice some lookup speed to save memory. What is the best way to do this? Would turning my dataframe into a dictionary of row Series, {index: Series}, work?
Do you mean something like this:
In [1112]: pd.DataFrame(df.reset_index().to_dict(orient='records'))
Out[1112]:
index id block check
0 0 6 25 yes
1 1 6 32 no
2 2 9 18 yes
3 3 12 17 no
4 4 15 23 yes
5 5 15 11 yes
6 6 15 15 yes
In [1113]: df.reset_index().to_dict(orient='records')
Out[1113]:
[{'index': 0, 'id': 6, 'block': 25, 'check': 'yes'},
{'index': 1, 'id': 6, 'block': 32, 'check': 'no'},
{'index': 2, 'id': 9, 'block': 18, 'check': 'yes'},
{'index': 3, 'id': 12, 'block': 17, 'check': 'no'},
{'index': 4, 'id': 15, 'block': 23, 'check': 'yes'},
{'index': 5, 'id': 15, 'block': 11, 'check': 'yes'},
{'index': 6, 'id': 15, 'block': 15, 'check': 'yes'}]
You could just do this (thanks @oppressionslayer for the example df):
df
id block check
0 6 25 yes
1 6 32 no
2 9 18 yes
3 12 17 no
4 15 23 yes
5 15 11 yes
6 15 15 yes
df.to_dict('index')
output:
{0: {'id': 6, 'block': 25, 'check': 'yes'}, 1: {'id': 6, 'block': 32, 'check': 'no'}, 2: {'id': 9, 'block': 18, 'check': 'yes'}, 3: {'id': 12, 'block': 17, 'check': 'no'}, 4: {'id': 15, 'block': 23, 'check': 'yes'}, 5: {'id': 15, 'block': 11, 'check': 'yes'}, 6: {'id': 15, 'block': 15, 'check': 'yes'}}
If you specifically (for some reason) want it to be {index: Series}, you could do this, which can be accessed the same way (i.e. df_name[i][col]):
df.T.to_dict('series')
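The two layouts above can be compared side by side in a minimal sketch. The frame is rebuilt from the example df. The itertuples loop at the end is an alternative that the answers do not mention; it is included on the assumption that rows only need to be read once, in order, in which case no dictionary copy of the whole frame has to be built.

```python
import pandas as pd

# Example frame copied from the answer above.
df = pd.DataFrame({
    "id": [6, 6, 9, 12, 15, 15, 15],
    "block": [25, 32, 18, 17, 23, 11, 15],
    "check": ["yes", "no", "yes", "no", "yes", "yes", "yes"],
})

# {index: {column: value}}: one plain dict per row.
by_index = df.to_dict("index")

# {index: Series}: one pandas Series per row.
by_series = df.T.to_dict("series")

# Both are accessed as mapping[row_label][column].
block_from_dict = by_index[1]["block"]
block_from_series = by_series[1]["block"]

# Alternative (assumption: sequential, read-only access is enough):
# itertuples yields one namedtuple per row lazily instead of
# materializing a dictionary for the whole frame.
total_yes_block = 0
for row in df.itertuples(index=True):
    if row.check == "yes":
        total_yes_block += row.block
```

Whether either dictionary layout actually saves memory on a 10 million row frame is worth measuring on a sample before committing to it.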
I have two lists of dictionaries, namely
bandits = [{'health': 15, 'damage': 2, 'id': 0}, {'health': 10, 'damage': 2, 'id': 0}, {'health': 12, 'damage': 2, 'id': 0}]
hero = [{'name': "Arthur", 'health': 50, 'damage': 5, 'id': 0}]
What I would like to do is simulate a hero strike on each member of the bandits list, which consists of subtracting the hero's damage value from the health value of each bandits entry. As an illustration, with the values given above, after the hero has dealt its blow, the bandits list should read
bandits = [{'health': 10, 'damage': 2, 'id': 0}, {'health': 5, 'damage': 2, 'id': 0}, {'health': 7, 'damage': 2, 'id': 0}]
I have tried several things, amongst which
for i, v in enumerate(bandits):
    bandits[i] = {k: (bandits[i][k] - hero[0].get('damage')) for k in bandits[i] if k=='health'}
which yields
bandits = [{'health': 10}, {'health': 5}, {'health': 7}]
i.e. the results for the health are good, but all other key:val pairs in the dictionaries contained in the bandits list are deleted. How can I correct my code?
Depending on the goals/use case, you can iterate over the collection and update the values in place (variable names are taken from the "I have tried several things" code):
bandits = [{'health': 15, 'damage': 2, 'id': 0}, {'health': 10, 'damage': 2, 'id': 0}, {'health': 12, 'damage': 2, 'id': 0}]
hero = [{'name': "Arthur", 'health': 50, 'damage': 5, 'id': 0}]
for b in bandits:
    for h in hero:
        b['health'] -= h['damage']
Or:
for b in bandits:
    b['health'] -= hero[0]['damage']
Don't create new dictionaries, just subtract from the values in the existing dictionaries.
for bandit in bandits:
    bandit['health'] -= hero[0]['damage']
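If the original list must stay untouched, the comprehension from the question can also be repaired instead of replaced. The sketch below is one possible variant, not what either answer proposes: it copies every key of each bandit with `**` unpacking and overrides only 'health'. The name `struck` is made up for illustration.

```python
bandits = [
    {"health": 15, "damage": 2, "id": 0},
    {"health": 10, "damage": 2, "id": 0},
    {"health": 12, "damage": 2, "id": 0},
]
hero = [{"name": "Arthur", "health": 50, "damage": 5, "id": 0}]

# {**b, ...} copies all key/value pairs of b, then the explicit
# 'health' entry overrides the copied one. The question's version
# filtered with `if k == 'health'`, which is why the other keys vanished.
struck = [{**b, "health": b["health"] - hero[0]["damage"]} for b in bandits]
```

The in-place loops above are simpler when mutation is acceptable; this form is mainly useful when the pre-strike state is still needed afterwards.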
I am trying to convert a dataframe to a dict in the format below:
name age country state pincode
user1 10 in tn 1
user2 11 in tx 2
user3 12 eu gh 3
user4 13 eu io 4
user5 14 us pi 5
user6 15 us ew 6
The output groups users by country; each country maps to a dictionary of users, and each user maps to a dictionary of that user's details:
{
'in': {
'user1': {'age': 10, 'state': 'tn', 'pincode': 1},
'user2': {'age': 11, 'state': 'tx', 'pincode': 2}
},
'eu': {
'user3': {'age': 12, 'state': 'gh', 'pincode': 3},
'user4': {'age': 13, 'state': 'io', 'pincode': 4},
},
'us': {
'user5': {'age': 14, 'state': 'pi', 'pincode': 5},
'user6': {'age': 15, 'state': 'ew', 'pincode': 6},
}
}
I am currently doing this with the statement below (this is not completely correct, as I am using a list inside the loop where it should have been a dict):
op2 = {}
for i, row in sample2.iterrows():
    if row['country'] not in op2:
        op2[row['country']] = []
    op2[row['country']] = {row['name'] : {'age':row['age'],'state':row['state'],'pincode':row['pincode']}}
I want the solution to keep working if additional columns, for example a telephone number, are added to the df. Since the statement I have written is static, it won't include the additional columns in my output. Is there a built-in method in pandas that does this?
You can combine to_dict with groupby:
{k: v.drop('country', axis=1).to_dict('index')
 for k, v in df.set_index('name').groupby('country')}
Output:
{'eu': {'user3': {'age': 12, 'state': 'gh', 'pincode': 3},
'user4': {'age': 13, 'state': 'io', 'pincode': 4}},
'in': {'user1': {'age': 10, 'state': 'tn', 'pincode': 1},
'user2': {'age': 11, 'state': 'tx', 'pincode': 2}},
'us': {'user5': {'age': 14, 'state': 'pi', 'pincode': 5},
'user6': {'age': 15, 'state': 'ew', 'pincode': 6}}}
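To check that this approach picks up new columns automatically, here is a self-contained sketch that rebuilds the sample frame and adds a `phone` column. The `phone` column and its values are invented for illustration; the grouping and `to_dict` calls are the same as in the answer.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["user1", "user2", "user3", "user4", "user5", "user6"],
    "age": [10, 11, 12, 13, 14, 15],
    "country": ["in", "in", "eu", "eu", "us", "us"],
    "state": ["tn", "tx", "gh", "io", "pi", "ew"],
    "pincode": [1, 2, 3, 4, 5, 6],
    # Hypothetical extra column, not part of the original data.
    "phone": ["555-0101", "555-0102", "555-0103",
              "555-0104", "555-0105", "555-0106"],
})

# Index by user name, split by country, then turn each group into
# {name: {column: value}} after dropping the grouping column.
result = {
    country: group.drop(columns="country").to_dict("index")
    for country, group in df.set_index("name").groupby("country")
}
```

Because no column names other than 'name' and 'country' are hard-coded, any further columns end up in the inner dictionaries without code changes.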
One of the columns of my pandas dataframe looks like this
>> df
Item
0 [{"id":A,"value":20},{"id":B,"value":30}]
1 [{"id":A,"value":20},{"id":C,"value":50}]
2 [{"id":A,"value":20},{"id":B,"value":30},{"id":C,"value":40}]
I want to expand it as
A B C
0 20 30 NaN
1 20 NaN 50
2 20 30 40
I tried
dfx = pd.DataFrame()
for i in range(df.shape[0]):
    df1 = pd.DataFrame(df.item[i]).T
    header = df1.iloc[0]
    df1 = df1[1:]
    df1 = df1.rename(columns = header)
    dfx = dfx.append(df1)
But this takes a lot of time as my data is huge. What is the best way to do this?
My original json data looks like this:
{
{
'_id': '5b1284e0b840a768f5545ef6',
'device': '0035sdf121',
'customerId': '38',
'variantId': '31',
'timeStamp': datetime.datetime(2018, 6, 2, 11, 50, 11),
'item': [{'id': A, 'value': 20},
{'id': B, 'value': 30},
{'id': C, 'value': 50}
},
{
'_id': '5b1284e0b840a768f5545ef6',
'device': '0035sdf121',
'customerId': '38',
'variantId': '31',
'timeStamp': datetime.datetime(2018, 6, 2, 11, 50, 11),
'item': [{'id': A, 'value': 20},
{'id': B, 'value': 30},
{'id': C, 'value': 50}
},
.............
}
I agree with @JeffH, you should really look at how you are constructing the DataFrame.
Assuming you are getting this from somewhere out of your control, you can convert it to your desired DataFrame with:
In []:
pd.DataFrame(df['Item'].apply(lambda r: {d['id']: d['value'] for d in r}).values.tolist())
Out[]:
A B C
0 20 30.0 NaN
1 20 NaN 50.0
2 20 30.0 40.0
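The same conversion can be written with a list comprehension instead of apply, which some readers find easier to follow. This is a sketch under the assumption that the ids are strings ("A", "B", "C"); the question shows them unquoted. The names `records` and `wide` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Item": [
        [{"id": "A", "value": 20}, {"id": "B", "value": 30}],
        [{"id": "A", "value": 20}, {"id": "C", "value": 50}],
        [{"id": "A", "value": 20}, {"id": "B", "value": 30},
         {"id": "C", "value": 40}],
    ]
})

# Collapse each row's list of {'id', 'value'} dicts into one {id: value} dict.
records = [{d["id"]: d["value"] for d in items} for items in df["Item"]]

# Building the frame once from all records avoids appending row by row;
# ids missing from a row become NaN.
wide = pd.DataFrame(records, index=df.index)
```

Constructing the frame in a single call is the main difference from the loop in the question, which grows a DataFrame one row at a time.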
In my DataFrame I have a column whose values are lists of dicts. When I do
data.stations.apply(lambda x: x)[5]
the output is:
[{'id': 245855,
'outlets': [{'connector': 13, 'id': 514162, 'power': 0},
{'connector': 3, 'id': 514161, 'power': 0},
{'connector': 7, 'id': 514160, 'power': 0}]},
{'id': 245856,
'outlets': [{'connector': 13, 'id': 514165, 'power': 0},
{'connector': 3, 'id': 514164, 'power': 0},
{'connector': 7, 'id': 514163, 'power': 0}]},
{'id': 245857,
'outlets': [{'connector': 13, 'id': 514168, 'power': 0},
{'connector': 3, 'id': 514167, 'power': 0},
{'connector': 7, 'id': 514166, 'power': 0}]}]
So it looks like 3 dicts in a list.
When I do
data.stations.apply(lambda x: x[0] )[5]
It does what it should:
{'id': 245855,
'outlets': [{'connector': 13, 'id': 514162, 'power': 0},
{'connector': 3, 'id': 514161, 'power': 0},
{'connector': 7, 'id': 514160, 'power': 0}]}
However, when I choose the second or third element, it doesn't work:
data.stations.apply(lambda x: x[1])[5]
This gives an error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-118-1210ba659690> in <module>()
----> 1 data.stations.apply(lambda x: x[1])[5]
~\AppData\Local\Continuum\Anaconda3\envs\geo2\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2549 else:
2550 values = self.asobject
-> 2551 mapped = lib.map_infer(values, f, convert=convert_dtype)
2552
2553 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-118-1210ba659690> in <lambda>(x)
----> 1 data.stations.apply(lambda x: x[1])[5]
IndexError: list index out of range
Why? It should just give me the second element.
The reason might simply be that the lists in the rows are not all the same length. Let's consider an example:
data = pd.DataFrame({'stations':[[{'1':2,'3':4},{'1':2,'3':4},{'1':2,'3':4}],
[{'1':2,'3':4},{'1':2,'3':4}],
[{'1':2,'3':4}],
[{'1':2,'3':4},{'1':2,'3':4},{'1':2,'3':4}]]
})
stations
0 [{'1': 2, '3': 4}, {'1': 2, '3': 4}, {'1': 2, ...
1 [{'1': 2, '3': 4}, {'1': 2, '3': 4}]
2 [{'1': 2, '3': 4}]
3 [{'1': 2, '3': 4}, {'1': 2, '3': 4}, {'1': 2, ...
If you do :
data['stations'].apply(lambda x: x[0])[3]
You will get :
{'1': 2, '3': 4}
But if you do:
data['stations'].apply(lambda x: x[1])[3]
You will get IndexError: list index out of range. apply calls the lambda on every row before the [3] lookup happens, and the third row (index 2) has only one element in its list, so x[1] fails there. Hope this clears up your doubt.
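One way to avoid the error is to guard the index with a length check. The helper below is a hypothetical sketch (the name `nth_or_default` is made up) and assumes a non-negative position; it is shown on a plain list of lists so it runs without pandas.

```python
def nth_or_default(items, n, default=None):
    """Return items[n], or default when the list is too short (n >= 0)."""
    return items[n] if len(items) > n else default

# Same ragged shape as the example above: 3, 2, 1 and 3 dicts per row.
stations = [
    [{"1": 2, "3": 4}, {"1": 2, "3": 4}, {"1": 2, "3": 4}],
    [{"1": 2, "3": 4}, {"1": 2, "3": 4}],
    [{"1": 2, "3": 4}],
    [{"1": 2, "3": 4}, {"1": 2, "3": 4}, {"1": 2, "3": 4}],
]

# Rows that have no second element yield None instead of raising.
second = [nth_or_default(row, 1) for row in stations]

# With the DataFrame from the answer, the helper could be used as:
# data['stations'].apply(lambda x: nth_or_default(x, 1))
```

Rows that come back as None (or NaN in a Series) mark exactly the entries whose lists were too short.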
I have the list of dictionaries below (key/value pairs):
inp1=[{'id': 0, 'name': 98, 'value': 9}, {'id': 1, 'name': 66, 'value': 8}, {'id': 2, 'name': 29, 'value': 5}, {'id': 3, 'name': 99, 'value': 3}, {'id': 4, 'name': 15, 'value': 9}]
I am trying to replace 'name' with 'wid' and 'value' with 'wrt'. How can I do this in the same list?
My output should be like
inp1=[{'id': 0, 'wid': 98, 'wrt': 9}, {'id': 1, 'wid': 66, 'wrt': 8}, {'id': 2, 'wid': 29, 'wrt': 5}, {'id': 3, 'wid': 99, 'wrt': 3}, {'id': 4, 'wid': 15, 'wrt': 9}]
I tried the following, but it doesn't work because a list can only be indexed with an integer, not a string:
inp1['name'] = inp1['wid']
inp1['value'] = inp1['wrt']
I looked for examples, but mostly found this only for a dictionary and not for a list.
You need to iterate over each item, remove the old entry (dict.pop is handy for this: it removes an entry and returns its value) and assign the value to the new key:
>>> inp1 = [
... {'id': 0, 'name': 98, 'value': 9},
... {'id': 1, 'name': 66, 'value': 8},
... {'id': 2, 'name': 29, 'value': 5},
... {'id': 3, 'name': 99, 'value': 3},
... {'id': 4, 'name': 15, 'value': 9}
... ]
>>>
>>> for d in inp1:
... d['wid'] = d.pop('name')
... d['wrt'] = d.pop('value')
...
>>> inp1
[{'wid': 98, 'id': 0, 'wrt': 9},
{'wid': 66, 'id': 1, 'wrt': 8},
{'wid': 29, 'id': 2, 'wrt': 5},
{'wid': 99, 'id': 3, 'wrt': 3},
{'wid': 15, 'id': 4, 'wrt': 9}]
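def f(item):
    # Rename only when the old key exists and the new key is not already set.
    if 'name' in item and 'wid' not in item:
        item['wid'] = item.pop('name')
    if 'value' in item and 'wrt' not in item:
        item['wrt'] = item.pop('value')

# dict.has_key was removed in Python 3 and map() is lazy there,
# so use the `in` operator and an explicit loop.
for item in inp1:
    f(item)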
Output:
[{'wrt': 9, 'wid': 98, 'id': 0}, {'wrt': 8, 'wid': 66, 'id': 1}, {'wrt': 5, 'wid': 29, 'id': 2}, {'wrt': 3, 'wid': 99, 'id': 3}, {'wrt': 9, 'wid': 15, 'id': 4}]
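When more keys need renaming, a lookup table keeps the loop generic. The following is a sketch rather than either answer's method: `renames` is an illustrative name, and the slice assignment is used on the assumption that "on the same list" means the list object itself should be kept.

```python
inp1 = [
    {"id": 0, "name": 98, "value": 9},
    {"id": 1, "name": 66, "value": 8},
    {"id": 2, "name": 29, "value": 5},
    {"id": 3, "name": 99, "value": 3},
    {"id": 4, "name": 15, "value": 9},
]

# Old key -> new key; keys not listed here are kept unchanged.
renames = {"name": "wid", "value": "wrt"}

# Rebuild each dict with translated keys. Assigning to inp1[:] replaces
# the contents of the existing list object rather than rebinding the name.
inp1[:] = [{renames.get(k, k): v for k, v in d.items()} for d in inp1]
```

Unlike the pop-based loop, this creates new dictionaries, so the key order of each dictionary is preserved (on Python 3.7+) and other references to the old dictionaries are not updated.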