How can I change the original DataFrame from a group? - python

Let's suppose I have the following DataFrame:
import pandas as pd
df = pd.DataFrame({'label': ['a', 'a', 'b', 'b', 'a', 'b', 'c', 'c', 'a', 'a'],
                   'numbers': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'arbitrarydata': [False] * 10})
I want to assign a value to the arbitrarydata column according to the values in both of the other columns. A naive approach would be as follows:
import numpy as np
for _, grp in df.groupby(['label', 'numbers']):
    grp.arbitrarydata = np.random.rand()
Naturally, this doesn't propagate changes back to df, since each grp is a copy. Is there a way to modify a group such that changes are reflected in the original DataFrame?

Try using transform, e.g.:
import numpy as np
df['arbitrarydata'] = df.groupby(['label', 'numbers'])['arbitrarydata'].transform(lambda x: np.random.rand())
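For reference, a minimal end-to-end version of the transform approach (with the group keys passed as a list, which current pandas expects), showing that every row of a group receives the same value:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'label': ['a', 'a', 'b', 'b', 'a', 'b', 'c', 'c', 'a', 'a'],
                   'numbers': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
                   'arbitrarydata': [False] * 10})

# transform calls the lambda once per group and broadcasts the scalar
# result back to every row of that group
df['arbitrarydata'] = (df.groupby(['label', 'numbers'])['arbitrarydata']
                         .transform(lambda x: np.random.rand()))

# all rows sharing the same (label, numbers) now carry the same value
print(df.groupby(['label', 'numbers'])['arbitrarydata'].nunique())
```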

Related

how to behave all data as a single group in pandas groupby

I have a dataset and I need to groupby my dataset based on column group:
import numpy as np
import pandas as pd
arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'group': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
                   'target': arr})
for g_name, g_df in df.groupby('group'):
    print("GROUP: {}".format(g_name))
    print(g_df)
However, sometimes group might not exist as a column, and in this case I am trying to treat the whole data as a single group:
for g_name, g_df in df.groupby(SOMEPARAMETERS):
    print(g_df)
output:
target
1
2
4
7
11
16
22
29
37
46
Is it possible to change the parameter of groupby to get whole data as a single group?
Assuming you mean something like this, where you have two columns on which you want to group:
import numpy as np
import pandas as pd
arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'group1': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
                   'group2': ['C', 'D', 'D', 'C', 'D', 'D', 'C', 'D', 'D', 'C'],
                   'target': arr})
Then you can easily extend your example with:
for g_name, g_df in df.groupby(["group1", "group2"]):
    print("GROUP: {}".format(g_name))
    print(g_df)
Is this what you meant?
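If the goal is the literal single-group case from the question, one option (a sketch, not the only way) is to group by a constant: a callable applied to the index, or an array of identical labels:

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'target': arr})

# a callable passed to groupby is applied to each index label;
# a constant result puts every row into one group
for g_name, g_df in df.groupby(lambda _: 0):
    print(g_df)

# equivalently, group by an array of identical labels
single = df.groupby(np.zeros(len(df))).sum()
print(single)
```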

Creating variable number of lists from pandas dataframes

I have a pandas dataframe being generated by some other piece of code - the dataframe may have different number of columns each time it is generated: let's call them col1,col2,...,coln where n is not fixed. Please note that col1,col2,... are just placeholders, the actual names of columns can be arbitrary like TimeStamp or PrevState.
From this, I want to convert each column into a list, with the name of the list being the same as the column. So, I want a list named col1 with the entries in the first column of the dataframe and so on till coln.
How do I do this?
Thanks
Creating separate named variables dynamically is not recommended; it is better to create a dictionary:
d = df.to_dict('list')
And then select a list by key, using the column name:
print(d['col'])
Sample:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
})
d = df.to_dict('list')
print(d)
{'A': ['a', 'b', 'c', 'd', 'e', 'f'], 'B': [4, 5, 4, 5, 5, 4], 'C': [7, 8, 9, 4, 2, 3]}
print(d['A'])
['a', 'b', 'c', 'd', 'e', 'f']
import pandas as pd
df = pd.DataFrame()
df["col1"] = [1, 2, 3, 4, 5]
df["colTWO"] = [6, 7, 8, 9, 10]
# caution: exec creates variables dynamically, which is fragile and unsafe
# if column names are not valid Python identifiers
for col_name in df.columns:
    exec(col_name + " = " + repr(df[col_name].tolist()))

Populate values in a dataframe based on matching row and column of another dataframe

I have two data-frames and I want to populate new column values in data-frame1 based on matching Zipcode and date from another data-frame2.
The sample input and desired output are given below. The date formats are not the same. Dataframe 1 has more than 100k records and data-frame2 has columns for every month.
Any suggestions would be of great help since I am a newbie to python.
You are looking for pd.merge. Here is an example which shows how you can use it.
df1 = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6],
                    'y': ['a', 'b', 'c', 'd', 'e', 'f']})
df2 = pd.DataFrame({'x2': [1, 2, 3, 4, 5, 6],
                    'y': ['h', 'i', 'j', 'k', 'l', 'm']})
pd.merge(df1, df2, left_on='x1', right_on='x2')
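Closer to the question, a sketch of matching on zip code and month (the column names Zipcode, date, month, and value are assumptions, since the sample data is not reproduced here): normalize the two date formats to a common key, then left-merge so every row of the first frame is kept.

```python
import pandas as pd

# hypothetical frames; the real column names and formats may differ
df1 = pd.DataFrame({'Zipcode': [10001, 10002, 10001],
                    'date': ['2020-01-15', '2020-02-10', '2020-02-20']})
df2 = pd.DataFrame({'Zipcode': [10001, 10001, 10002],
                    'month': ['01/2020', '02/2020', '02/2020'],
                    'value': [1.0, 2.0, 3.0]})

# normalize both date columns to a common year-month key
df1['month_key'] = pd.to_datetime(df1['date']).dt.to_period('M')
df2['month_key'] = pd.to_datetime(df2['month'], format='%m/%Y').dt.to_period('M')

# left merge keeps every row of df1 and pulls in the matching value
out = df1.merge(df2[['Zipcode', 'month_key', 'value']],
                on=['Zipcode', 'month_key'], how='left')
print(out)
```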

How to groupby the keys in dictionary and sum up the values in python?

How to groupby two keys in dictionary and get the sum of the values of the other key val.
Input:
data = {'key1': ['a', 'a', 'b', 'b'], 'key2': ['m', 'n', 'm', 'm'],
        'val': [1, 2, 3, 4]}
In this example, I want to groupby the key1 and the key2, and then sum up the value in val.
Expected:
data = {'key1': ['a', 'a', 'b', 'b'], 'key2': ['m', 'n', 'm', 'm'],
        'val': [1, 2, 3, 4], 'val_sum': [1, 2, 7, 7]}
Actually, I don't want to convert the dictionary data into pandas.DataFrame then convert back to dictionary to achieve it, because my data is actually very big.
Update:
To help understand the generating val_sum, I post my code using pandas.DataFrame.
df = pd.DataFrame(data)
tmp = df.groupby(['key1', 'key2'])['val'].agg(val_sum='sum')
df['val_sum'] = df.set_index(['key1', 'key2']).index.map(tmp.to_dict()['val_sum'])
And the result is shown as follows:
key1 key2 val val_sum
0 a m 1 1
1 a n 2 2
2 b m 3 7
3 b m 4 7
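For what it's worth, the two pandas lines above can be collapsed into a single groupby-transform, which broadcasts each group's sum back to the original rows:

```python
import pandas as pd

data = {'key1': ['a', 'a', 'b', 'b'], 'key2': ['m', 'n', 'm', 'm'],
        'val': [1, 2, 3, 4]}
df = pd.DataFrame(data)

# transform('sum') returns a Series aligned with df's original index
df['val_sum'] = df.groupby(['key1', 'key2'])['val'].transform('sum')
print(df['val_sum'].tolist())  # [1, 2, 7, 7]
```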
You can build your own summing solution using a defaultdict, say as follows.
from collections import defaultdict
data = {'key1': ['a', 'a', 'b', 'b'], 'key2': ['m', 'n', 'm', 'm'],
        'val': [1, 2, 3, 4]}
keys_to_group = ['key1', 'key2']
temp = defaultdict(int)  # initializes each sum to zero
for i, *key_group in zip(data['val'], *[data[key] for key in keys_to_group]):
    # key_group now looks like ['a', 'm'] or ['b', 'm'] and so on
    temp[tuple(key_group)] += i
val_sum = [temp[key_group] for key_group in zip(*[data[key] for key in keys_to_group])]
data['val_sum'] = val_sum
print(data)
{'key1': ['a', 'a', 'b', 'b'],
'key2': ['m', 'n', 'm', 'm'],
'val': [1, 2, 3, 4],
'val_sum': [1, 2, 7, 7]}
Having said that, your data does seem better suited to a tabular structure, and if you plan to do more than just this one operation, it might make sense to load it into a DataFrame anyway.

Pymnet - creating a multilayered network visualisation

I have the code below to load the data:
from pymnet import *
import pandas as pd
nodes_id = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 1, 2, 3, 'aa', 'bb', 'cc']
layers = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
nodes = {'nodes': nodes_id, 'layers': layers}
df_nodes = pd.DataFrame(nodes)
to = ['b', 'c', 'd', 'f', 1, 2, 3, 'bb', 'cc', 2, 3, 'a', 'g']
from_edges = ['a', 'a', 'b', 'e', 'a', 'b', 'e', 'aa', 'aa', 'aa', 1, 2, 3]
edges = {'to': to, 'from': from_edges}
df_edges = pd.DataFrame(edges)
I am attempting to use pymnet as a package to create a multi-layered network. (http://www.mkivela.com/pymnet/)
Does anybody know how to create a three-layered network visualisation from this data? The tutorials seem to add nodes one at a time, and it is unclear how to use nodes and edges dataframes for this purpose. The layer groups are provided in df_nodes.
Thanks
I've wondered the same; have a look at this post:
https://qiita.com/malimo1024/items/499a4ebddd14d29fd320
Use the format mnet[from_node, to_node, layer_1, layer_2] = 1 to add edges (both inter- and intra-layer).
For example:
from pymnet import *
import matplotlib.pyplot as plt
%matplotlib inline
mnet = MultilayerNetwork(aspects=1)
mnet['sato','tanaka','work','work'] = 1
mnet['sato','suzuki','friendship','friendship'] = 1
mnet['sato','yamada','friendship','friendship'] = 1
mnet['sato','yamada','work','work'] = 1
mnet['sato','sato','work','friendship'] = 1
mnet['tanaka','tanaka','work','friendship'] = 1
mnet['suzuki','suzuki','work','friendship'] = 1
mnet['yamada','yamada','work','friendship'] = 1
fig=draw(mnet)
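To build the network directly from df_nodes and df_edges, a sketch (assuming each edge's layers can be looked up from the node-to-layer mapping in df_nodes, and that the mnet[u, v, l1, l2] = 1 indexing shown above is the way to add edges) is to expand every edge into a (from, to, layer_of_from, layer_of_to) tuple first:

```python
import pandas as pd

nodes_id = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 1, 2, 3, 'aa', 'bb', 'cc']
layers = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
df_nodes = pd.DataFrame({'nodes': nodes_id, 'layers': layers})
to = ['b', 'c', 'd', 'f', 1, 2, 3, 'bb', 'cc', 2, 3, 'a', 'g']
from_edges = ['a', 'a', 'b', 'e', 'a', 'b', 'e', 'aa', 'aa', 'aa', 1, 2, 3]
df_edges = pd.DataFrame({'to': to, 'from': from_edges})

# map each node to its layer, then expand every edge to
# (from, to, layer_of_from, layer_of_to)
node_layer = dict(zip(df_nodes['nodes'], df_nodes['layers']))
edge_tuples = [(u, v, node_layer[u], node_layer[v])
               for u, v in zip(df_edges['from'], df_edges['to'])]
print(edge_tuples[:3])

# feeding the tuples into pymnet would then look like:
# from pymnet import MultilayerNetwork, draw
# mnet = MultilayerNetwork(aspects=1)
# for u, v, lu, lv in edge_tuples:
#     mnet[u, v, lu, lv] = 1
# fig = draw(mnet)
```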
