how to behave all data as a single group in pandas groupby

how to behave all data as a single group in pandas groupby - python

I have a dataset and I need to groupby my dataset based on column group:
import numpy as np
import pandas as pd
arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'group': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
"target": arr})
for g_name, g_df in df.groupby("group"):
print("GROUP: {}".format(g_name))
print(g_df)
However, sometimes group might not exist as a column and in this case, I am trying to whole data as a single group.
for g_name, g_df in df.groupby(SOMEPARAMETERS):
print(g_df)
output:
target
1
2
4
7
11
16
22
29
37
46
Is it possible to change the parameter of groupby to get whole data as a single group?

Assuming you mean something like this where you have two columns on which you want to group:
import numpy as np
import pandas as pd
arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'group1': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'group2': ['C', 'D', 'D', 'C', 'D', 'D', 'C', 'D', 'D', 'C'],
'target': arr})
Then you can easily extend your example with:
for g_name, g_df in df.groupby(["group1", "group2"]):
print("GROUP: {}".format(g_name))
print(g_df)
Is this what you meant?

Related

Merging dataframes where the common column has repeating values

I would like to merge several sensor files which have a common column as "date" whose value is the time the sensor data was logged in. These sensors log the data every second. My task is to join these sensor data into one big dataframe. Since there could be a millisecond difference between the exact time the sensor data is logged in, we have created a window of 30 seconds using pandas pd.DatetimeIndex.floor method. Now I want to merge these files using the "date" column. The following is an example I was working on:
import pandas as pd
data1 = {
'date': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
'value1': list(range(1, 20))
}
data2 = {
'date': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'D'],
'value2': list(range(1, 21))
}
It is not necessary that the different sensor files will have a same amount of data. The sensor data looks like the below. The vertical axis could relate to the time (increasing downward). The second (B) and second last window (C) should overlap as they belong to the same time window.
The resultant dataframe should look something like that:
The A, B, C, and D values represent 30 sec window (for example, 'A' could be 07:00:00, 'B' could be 07:00:30, 'C' could be 07:01:00, and D could be 07:01:30). Now as we can see, the starting and ending window could be less than 30 (since sensor logs data every second, each window should have 30 values. In the example the number of rows of B and C window should be 30 each, not 6 as shown in the example). The reason is if the sensor has started reporting the values at 07:00:27, then it falls in the window of 'A' but could report only 3 values. Similarly, if the sensors has stopped reporting the values at 07:01:04, then it falls in the window of C but could report only 4 values. However, B and C windows will always have 30 values (In the example I have shown only 6 for ease of understanding).
I would like to merge the dataframes such that the values from the same window overlap as shown in figure (B and C) while the start and end windows, should show NaN values where there is no data. (In the above example, Value1 from sensor1 started reporting data 1 second earlier while Value2 from sensor 2 stopped reporting data 2 seconds after sensor1 stopped reporting).
How to achieve such joins in the pandas?

You can build your DataFrame with the following solution that requires only built-in Python structures. I don't see a particular interest in trying to use pandas methods. I'm not even sure that we can achieve this result only with pandas methods because you handle each value column differently, but I'm curious if you find a way.
from collections import defaultdict
import pandas as pd
data1 = {
'date': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
'value1': list(range(1, 20))
}
data2 = {
'date': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'D'],
'value2': list(range(1, 21))
}
# Part 1
datas = [data1, data2]
## Compute where to fill dicts with NaNs
dates = sorted(set(data1["date"] + data2["date"]))
dds = [{} for i in range(2)]
for d in dates:
for i in range(2):
dds[i][d] = [v for k, v in zip(datas[i]["date"], datas[i]["value%i" % (i + 1)]) if k == d]
## Fill dicts
nan = float("nan")
for d in dates:
n1, n2 = map(len, [dd[d] for dd in dds])
if n1 < n2:
dds[0][d] += (n2 - n1) * [nan]
elif n1 > n2:
dds[1][d] = (n1 - n2) * [nan] + dds[1][d]
# Part 2: Build the filled data columns
data = defaultdict(list)
for d in dates:
n = len(dds[0][d])
data["date"] += d * n
for i in range(2):
data["value%i" % (i + 1)] += dds[i][d]
data = pd.DataFrame(data)

if I understand the question correctly, you're might be looking for something like this:
data1 = pandas.DataFrame({
'date': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
'value1': list(range(1, 20))
})
data2 = pandas.DataFrame({
'date': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'D'],
'value2': list(range(1, 21))
})
b = pandas.concat([data1, data2]).sort_values(by='date', ascending=True)

Looking to drop first 5 rows of dataframe after ever new value occurs

I am looking to drop the first 5 rows each time a new value occurs in a dataframe
data = {
'col1': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
'col2': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
}
df = pd.DataFrame(data)
I am looking to drop the first 5 rows after each new value. Ex: 'A' value is new... delete first 5 rows. Now encounter 'B' value... delete its first 5 rows...

You need to do the following:
mask = df.groupby('col1').cumcount() >= 5
df = df.loc[mask]

You can use a negative tail:
df.groupby('col1').tail(-5)
To group by consecutive values:
group = df['col1'].ne(df['col1'].shift()).cumsum()
df.groupby(group).tail(-5)
Output:
col1 col2
5 A 6
6 A 7
12 B 13
13 B 14
19 C 20
20 C 21
NB. As pointed out by #Mark, there is an issue for older pandas versions (<1.4), in which case the cumcount approach can be used.

Count sequence within a column in pandas

I have a following problem. Suppose I have this dataframe:
import pandas as pd
d = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'], 'Project': ['aa','ab','bc', 'aa', 'ab','aa', 'ab','ca', 'cb'],
'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3]}
df = pd.DataFrame(data=d)
I need to add a new column that add a number to each project per name. Desired output is:
import pandas as pd
dnew = {'Name': ['c', 'c', 'c', 'a', 'a', 'b', 'b', 'd', 'd'], 'Project': ['aa','ab','bc', 'aa', 'ab','aa', 'ab','ca', 'cb'],
'col2': [3, 4, 0, 6, 45, 6, -3, 8, -3], 'New_column': ['1', '1','1','2', '2','2','2','3','3']}
NEWdf = pd.DataFrame(data=dnew)
In other words: 'aa','ab','bc' in Project occurs in the first rows, so I add 1 to the new column. 'aa', 'ab' is the second Project from the beginning. It occurs for Name 'a' and 'b', so I add 2 to the both new column. 'ca', 'cb' is the third project and it occurs only for name 'd', so I add 3 only to the name 'd'.
I tried to combine groupby with a for loop, but it did not worked to me. Thanks a lot for a help!

Looks like networkx since Name and Project are related , you can use:
import networkx as nx
G=nx.from_pandas_edgelist(df, 'Name', 'Project')
l = list(nx.connected_components(G))
s = pd.Series(map(list,l)).explode()
df['new'] = df['Project'].map({v:k for k,v in s.items()}).add(1)
print(df)
Name Project col2 new
0 a aa 3 1
1 a ab 4 1
2 b bb 6 2
3 b bc 6 2
4 c aa 6 1
5 c ab 6 1

Pymnet - creating a multilayered network visualisation

I have the code below to load the data:
from pymnet import *
import pandas as pd
nodes_id = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 1, 2, 3, 'aa', 'bb', 'cc']
layers = [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
nodes = {'nodes': nodes_id, 'layers': layers}
df_nodes = pd.DataFrame(nodes)
to = ['b', 'c', 'd', 'f', 1, 2, 3, 'bb', 'cc', 2, 3, 'a', 'g']
from_edges = ['a', 'a', 'b', 'e', 'a', 'b', 'e', 'aa', 'aa', 'aa', 1, 2, 3]
edges = {'to': to, 'from': from_edges}
df_edges = pd.DataFrame(edges)
I am attempting to use pymnet as a package to create a multi-layered network. (http://www.mkivela.com/pymnet/)
Does anybody know how to create a 3 layered network visualisation using this diagram? The tutorials seem to add nodes one at a time and it is unclear how to use a nodes and edges dataframe for this purpose. The layer groups are provided in the df_nodes.
Thanks

I've wondered the same, have a look at this post:
https://qiita.com/malimo1024/items/499a4ebddd14d29fd320
Use the format of this: mnet[from_node,to_node_2,layer_1,layer_2] = 1 to add edges (inter/intra).
For example:
from pymnet import *
import matplotlib.pyplot as plt
%matplotlib inline
mnet = MultilayerNetwork(aspects=1)
mnet['sato','tanaka','work','work'] = 1
mnet['sato','suzuki','friendship','friendship'] = 1
mnet['sato','yamada','friendship','friendship'] = 1
mnet['sato','yamada','work','work'] = 1
mnet['sato','sato','work','friendship'] = 1
mnet['tanaka','tanaka','work','friendship'] = 1
mnet['suzuki','suzuki','work','friendship'] = 1
mnet['yamada','yamada','work','friendship'] = 1
fig=draw(mnet)

"Balancing" a list of symbols

Consider a list with elements drawn from a set of symbols, e.g. {A, B, C}:
List --> A, A, B, B, A, A, A, A, A, B, C, C, B, B
Indexing indices --> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13
How I can re-order this list so that, for any symbol, we have approximately half of the symbols in the first half of the list i.e. [0, [N/2]] of the list and half on the second half? i.e. [[N/2, N]]
Note that there could be multiple solutions to this problem. We also want to compute the resulting list of indices of the permutation, so that we can apply the new ordering to any list associated with the original one.
Is there a name for this problem? Any efficient algorithms for it? Most of the solutions I can think of are very brute-force.

You can use a dictionary here, this will take O(N) time:
from collections import defaultdict
lst = ['A', 'A', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'B', ' C', ' C', ' B', 'B']
d = defaultdict(list)
for i, x in enumerate(lst):
d[x].append(i)
items = []
indices = []
for k, v in d.items():
n = len(v)//2
items.extend([k]*n)
indices.extend(v[:n])
for k, v in d.items():
n = len(v)//2
items.extend([k]*(len(v)-n))
indices.extend(v[n:])
print items
print indices
Output:
['A', 'A', 'A', ' C', 'B', 'B', 'A', 'A', 'A', 'A', ' C', 'B', 'B', ' B']
[0, 1, 4, 10, 2, 3, 5, 6, 7, 8, 11, 9, 13, 12]

You can do this by getting the rank order of the symbols, then picking alternate ranks for each half of the output array:
x = np.array(['A', 'A', 'B', 'B', 'A', 'A', 'A',
'A', 'A', 'B', 'C', 'C', 'B', 'B'])
order = np.argsort(x)
idx = np.r_[order[0::2], order[1::2]]
print(x[idx])
# ['A' 'A' 'A' 'A' 'B' 'B' 'C' 'A' 'A' 'A' 'B' 'B' 'B' 'C']
print(idx)
# [ 0 4 6 8 3 12 10 1 5 7 2 9 13 11]
By default np.argsort uses the quicksort algorithm, with average time complexity O(N log N). The indexing step would be O(1).

You can use collections.Counter which is even better than just a defaultdict -- and you can place items into the first half and second half separately. That way, if you prefer, you can shuffle the first half and second half as much as you want (and just keep track of the shuffling permutation, with e.g. NumPy's argsort).
import collections
L = ['A', 'A', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'B', 'C', 'C', 'B', 'B']
idx_L = list(enumerate(L))
ctr = collections.Counter(L)
fh = []
fh_idx = []
sh = []
sh_idx = []
for k, v in ctr.iteritems():
idxs = [i for i,e in idx_L if e == k]
fh = fh + [k for i in range(v//2)]
fh_idx = fh_idx + idxs[:v//2]
sh = sh + [k for i in range(v // 2, v)]
sh_idx = sh_idx + idxs[v//2:]
shuffled = fh + sh
idx_to_shuffled = fh_idx + sh_idx

print shuffled
print idx_to_shuffled
which gives
['A', 'A', 'A', 'C', 'B', 'B', 'A', 'A', 'A', 'A', 'C', 'B', 'B', 'B']
[0, 1, 4, 10, 2, 3, 5, 6, 7, 8, 11, 9, 12, 13]

Shuffle the list with the indices, then split it in half. This method won't perfectly split the symbols every time, but as the number of repeats of each symbol gets larger, it will approach a perfect split.
import random
symbols = ['A', 'A', 'B', 'B', 'A', 'A', 'A', 'A', 'A', 'B', 'C', 'C', 'B', 'B']
indices = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
both = zip(symbols, indices)
random.shuffle(both)
symbols2, indices2 = zip(*both)
print symbols2
print indices2
Some sample outputs:
Trial #1:
('A', 'C', 'B', 'A', 'A', 'B', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C')
( 7, 10, 2, 4, 1, 13, 8, 0, 5, 6, 9, 3, 12, 11)
# |
Trial #2
('A', 'A', 'B', 'B', 'C', 'A', 'A', 'A', 'B', 'C', 'A', 'A', 'B', 'B')
( 6, 0, 9, 3, 11, 1, 8, 4, 13, 10, 7, 5, 2, 12)
# |
Trial #3
('A', 'A', 'C', 'C', 'B', 'B', 'A', 'B', 'B', 'A', 'A', 'A', 'A', 'B')
( 4, 5, 11, 10, 2, 3, 0, 13, 12, 6, 7, 8, 1, 9)
# |

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.