Merging dataframes where the common column has repeating values

I would like to merge several sensor files that share a common "date" column, whose value is the time at which the sensor data was logged. The sensors log data every second. My task is to join the sensor data into one big dataframe. Since the exact logging times can differ by a few milliseconds between sensors, we have binned the timestamps into 30-second windows using the pandas pd.DatetimeIndex.floor method. Now I want to merge these files on the "date" column. The following is an example I was working on:
import pandas as pd
data1 = {
'date': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
'value1': list(range(1, 20))
}
data2 = {
'date': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'D'],
'value2': list(range(1, 21))
}
The different sensor files do not necessarily contain the same amount of data. The sensor data looks like the figure below, where the vertical axis relates to time (increasing downward). The second window (B) and the second-to-last window (C) should overlap, as they belong to the same time window.
The resulting dataframe should look something like this:
The A, B, C, and D values represent 30-second windows (for example, 'A' could be 07:00:00, 'B' could be 07:00:30, 'C' could be 07:01:00, and 'D' could be 07:01:30). As we can see, the first and last windows can hold fewer than 30 values. Since the sensor logs data every second, each full window should have 30 values, so in real data the B and C windows would have 30 rows each, not 6 as in the example. The reason is that if the sensor started reporting values at 07:00:27, those values fall in window 'A', but the sensor could report only 3 values there. Similarly, if the sensor stopped reporting values at 07:01:04, those values fall in window 'C', but the sensor could report only 4 values there. The B and C windows, however, will always have 30 values (the example shows only 6 for ease of understanding).
I would like to merge the dataframes such that the values from the same window overlap as shown in the figure (B and C), while the first and last windows show NaN values where there is no data. (In the above example, value1 from sensor 1 started reporting data 1 second earlier, while value2 from sensor 2 stopped reporting data 2 seconds after sensor 1 stopped reporting.)
How can I achieve such a join in pandas?

You can build your DataFrame with the following solution, which requires only built-in Python structures. I don't see a particular benefit in trying to use pandas methods. I'm not even sure that this result can be achieved with pandas methods alone, because each value column is handled differently, but I'm curious whether you find a way.
from collections import defaultdict
import pandas as pd
data1 = {
'date': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
'value1': list(range(1, 20))
}
data2 = {
'date': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'D'],
'value2': list(range(1, 21))
}
# Part 1
datas = [data1, data2]
## Group the values of each sensor by window
dates = sorted(set(data1["date"] + data2["date"]))
dds = [{} for i in range(2)]
for d in dates:
    for i in range(2):
        dds[i][d] = [v for k, v in zip(datas[i]["date"], datas[i]["value%i" % (i + 1)]) if k == d]
## Pad the shorter list of each window with NaNs
nan = float("nan")
for d in dates:
    n1, n2 = map(len, [dd[d] for dd in dds])
    if n1 < n2:
        dds[0][d] += (n2 - n1) * [nan]
    elif n1 > n2:
        dds[1][d] = (n1 - n2) * [nan] + dds[1][d]
# Part 2: Build the filled data columns
data = defaultdict(list)
for d in dates:
    n = len(dds[0][d])
    data["date"] += [d] * n  # repeat the label as a list so multi-character dates stay intact
    for i in range(2):
        data["value%i" % (i + 1)] += dds[i][d]
data = pd.DataFrame(data)

If I understand the question correctly, you might be looking for something like this:
import pandas
data1 = pandas.DataFrame({
'date': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D'],
'value1': list(range(1, 20))
})
data2 = pandas.DataFrame({
'date': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'D', 'D', 'D', 'D', 'D'],
'value2': list(range(1, 21))
})
b = pandas.concat([data1, data2]).sort_values(by='date', ascending=True)
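Note that concat followed by sort_values stacks the rows of both frames, so each row carries only one of value1 or value2. If the values should sit side by side, one possible pandas-only sketch is to number the rows inside each window with groupby(...).cumcount() and then do an outer merge on the window plus that counter. The helper column name pos is an assumption made for illustration. This sketch always pads the shorter side at the end of a window, so it does not reproduce the front-padding that the first window may need.

```python
import pandas as pd

data1 = pd.DataFrame({
    'date': list('A' * 4 + 'B' * 6 + 'C' * 6 + 'D' * 3),
    'value1': list(range(1, 20)),
})
data2 = pd.DataFrame({
    'date': list('A' * 3 + 'B' * 6 + 'C' * 6 + 'D' * 5),
    'value2': list(range(1, 21)),
})

# 'pos' (hypothetical helper column) numbers the rows within each window: 0, 1, 2, ...
left = data1.assign(pos=data1.groupby('date').cumcount())
right = data2.assign(pos=data2.groupby('date').cumcount())

# The outer merge keeps every (window, position) pair found in either frame;
# positions missing on one side become NaN in that side's value column.
merged = (
    left.merge(right, on=['date', 'pos'], how='outer')
        .sort_values(['date', 'pos'])
        .drop(columns='pos')
        .reset_index(drop=True)
)
print(merged)
```

With real timestamps, merging on the timestamp floored to the second instead of a per-window counter might align the partial first window more faithfully.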

Related

keep random list elements and remove the others

I have a large list of elements:
list= [a1, b, c, b, c, d, a2, c,...a3........]
I want to remove specific elements from it: a1, a2, a3, and so on.
Suppose that I can get the indexes of the elements that start with a:
a_indexes = [0,6, ...]
Now I want to remove most of the elements that start with a, but not all of them: I want to keep 20 of them, chosen arbitrarily. How can I do so?
I know that to remove an element from a list list_ I can use:
list_.remove(list[element position])
But I am not sure how to handle the list of a elements.

Here's an approach that will work if I understand the question correctly.
We have a list containing numerous items. We want to remove some elements that match a certain criterion - but not all.
So:
from random import sample
li = ['a','b','a','b','a','b','a']
dc = 'a'
keep = 1 # this is how many we want to keep in the list
if (k := li.count(dc) - keep) > 0: # sanity check
    il = [i for i, v in enumerate(li) if v == dc]
    for i in sorted(sample(il, k), reverse=True):
        li.pop(i)
print(li)
Note how the sample is sorted. This is important because we're popping elements. If we do that in no particular order then we could end up removing the wrong elements.
An example of output might be:
['b', 'b', 'a', 'b']

Suppose you have this list:
li=['d', 'a', 'c', 'a', 'g', 'b', 'f', 'a', 'c', 'g', 'e', 'f', 'e', 'g', 'b', 'b', 'c', 'e', 'a', 'd', 'g', 'd', 'd', 'a', 'c', 'e', 'a', 'c', 'f', 'a', 'b', 'a', 'a', 'f', 'b', 'd', 'd', 'b', 'f', 'a', 'd', 'g', 'd', 'b', 'e']
You can define a character to delete and a count k of how many to delete:
delete='a'
k=3
Then use random.shuffle to generate a random group of k indices to delete:
import random
idx=[i for i,c in enumerate(li) if c==delete]
random.shuffle(idx)
idx=idx[:k]
>>> idx
[3, 7, 31]
Then delete those indices:
new_li=[e for i,e in enumerate(li) if i not in idx]
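The question asks how many matching elements to keep rather than how many to delete. A minimal sketch of that variant, assuming a hypothetical helper named keep_random: it samples the positions to keep and stores them in a set so the membership test stays cheap on large lists.

```python
import random

def keep_random(items, target, keep):
    """Return a copy of items in which only `keep` randomly chosen
    occurrences of `target` survive; all other elements are untouched."""
    positions = [i for i, value in enumerate(items) if value == target]
    # Never ask for more positions than exist.
    kept = set(random.sample(positions, min(keep, len(positions))))
    return [value for i, value in enumerate(items)
            if value != target or i in kept]

li = ['a', 'b', 'a', 'c', 'a', 'b', 'a']
print(keep_random(li, 'a', 2))  # two of the four 'a' entries remain
```

The original list is left unmodified, and the relative order of the surviving elements is preserved.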

How to combine adjacent same elements of a list in python? [duplicate]

This is my list:
nab = ['b', 'b', 'a', 'b', 'b', 'b', 'a', 'a', 'a', 'a']
I want to combine the same elements which are adjacent into another list, and if they are not the same, just return the element itself.
The output that I am looking for is:
['b', 'a', 'b', 'a']
I mean:
two 'b' ---> 'b', one 'a' ---> 'a', three 'b' ---> 'b', four 'a' ---> 'a'
I want to know the length of the new list.

Thank you so much @tdelaney, I did it as below:
import itertools
nab = ['B', 'B', 'A', 'B', 'B', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'A', 'B', 'B', 'B', 'B', 'A']
U = []
key_func = lambda x: x[0]
for key, group in itertools.groupby(nab, key_func):
    U.append(list(group))
print(U)
print(len(U))
Output:
[['B', 'B'], ['A'], ['B', 'B'], ['A', 'A', 'A', 'A'], ['B', 'B', 'B'], ['A', 'A'], ['B', 'B'], ['A', 'A'], ['B'], ['A'], ['B', 'B', 'B', 'B'], ['A']]
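If only the collapsed list from the question is needed, rather than the groups themselves, a shorter sketch is to keep just the key that groupby yields for each run:

```python
from itertools import groupby

nab = ['b', 'b', 'a', 'b', 'b', 'b', 'a', 'a', 'a', 'a']

# groupby yields one (key, group) pair per run of equal adjacent items;
# keeping only the key collapses each run to a single element.
collapsed = [key for key, _group in groupby(nab)]
print(collapsed)       # ['b', 'a', 'b', 'a']
print(len(collapsed))  # 4
```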

How to restrict the longest sequence of the same letter using python

How can I determine the longest sequence of the same letter using python?
For example, I use the following code to print a shuffled list with three conditions A, B and C:
from random import shuffle
condition = ["A"]*20
condition_B = ["B"]*20
condition_C = ["C"]*20
condition.extend(condition_B)
condition.extend(condition_C)
shuffle(condition)
print(condition)
Now I want to make sure that the same condition does not occur more than three times in a row.
E.g., allowed: [A, B, C, A, B, B, C, C, C, A, B….]
Not allowed: [A, A, B, B, B, B, C, A, B...] (because of four B’s in a row)
How can I solve this problem?
Thank you in advance.

Maybe you should build the list sequentially, rather than shuffling:
import random
condition = ["A"] * 20 + ["B"] * 20 + ["C"] * 20
result = []
while condition: # until every item has been placed
    idx = random.randrange(len(condition)) # choose a candidate
    c = condition[idx]
    if result[-3:] == [c] * 3: # would be the fourth identical item in a row
        if len(set(condition)) == 1: # dead end: only this value is left
            condition += result # put everything back and start over
            result = []
        continue # choose again
    result.append(c) # add to result
    del condition[idx] # remove from list
print(result)
Warning: this is just a conceptual sketch.

# Validate with this function: it returns False if more than three consecutive characters are the same, else True.
def isValidShuffle(test_condition):
    # range(len - 3) so that the last window of four items is checked as well
    for i in range(len(test_condition) - 3):
        if len(set(test_condition[i:i + 4])) == 1:
            # the set has size 1 when all four consecutive characters are the same
            return False
    return True
The simplest way to create a shuffled sequence of A, B, C for which isValidShuffle returns True:
from random import shuffle
# condition list contains 20 A's 20 B's 20 C's
seq = ['A','B','C']
condition = []
for seq_i in range(20):
    shuffle(seq)
    condition += seq
print(condition) # at most two consecutive characters will be the same
print(isValidShuffle(condition))
Output:
['A', 'B', 'C', 'B', 'C', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C', 'A', 'B', 'C', 'B', 'A', 'B', 'A', 'C', 'B', 'C', 'A', 'B', 'C', 'A', 'C', 'A', 'B', 'B', 'C', 'A', 'B', 'A', 'C', 'A', 'B', 'C', 'C', 'A', 'B', 'A', 'B', 'C', 'B', 'A', 'C', 'C', 'A', 'B', 'B', 'C', 'A', 'B', 'A', 'C', 'A', 'B', 'C']
The following does not impose your restriction while creating the shuffled sequence; instead it keeps reshuffling until it finds a sequence that meets your consecutive-character restriction.
validshuffle = False
condition = ['A']*20 + ['B']*20 + ['C']*20
while not validshuffle:
    shuffle(condition)
    if isValidShuffle(condition):
        validshuffle = True
        print(condition)
    else:
        print('try') # a rejected shuffle; shuffle again
Output:
try
try
['A', 'C', 'A', 'B', 'B', 'C', 'B', 'C', 'A', 'C', 'A', 'C', 'B', 'B', 'B', 'C', 'A', 'A', 'B', 'C', 'A', 'A', 'B', 'B', 'C', 'B', 'B', 'C', 'B', 'C', 'C', 'B', 'A', 'B', 'B', 'A', 'C', 'A', 'A', 'C', 'A', 'C', 'B', 'C', 'A', 'A', 'C', 'A', 'C', 'A', 'C', 'B', 'B', 'B', 'A', 'B', 'C', 'A', 'C', 'A']

If you just want to know how long the longest run of the same letter is, you could do this.
It iterates over the sequence, records the length of each run of the same character, takes the maximum run length for each character, and then takes the maximum over all characters.
This is not exactly the problem you mention, but it could be useful.
from random import shuffle
sequence = ['A']*20 + ['B']*20 + ['C']*20
sequences = {'A': [], 'B':[], 'C':[]}
shuffle(sequence)
current = sequence[0]
acc = 0
for elem in sequence:
    if elem == current:
        acc += 1
    else:
        sequences[current].append(acc)
        current = elem
        acc = 1
else:
    sequences[current].append(acc)
for key, seqs in sequences.items():
    sequences[key] = max(seqs)
print(max(sequences.items(), key=lambda i: i[1]))
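The same bookkeeping can be sketched more compactly with itertools.groupby, which already splits a sequence into runs of equal adjacent items. The function name longest_run is an assumption made for illustration; on a tie it returns the first of the longest runs, and it raises ValueError on an empty sequence.

```python
from itertools import groupby

def longest_run(sequence):
    """Return (item, length) for the longest run of equal adjacent items."""
    runs = ((key, sum(1 for _ in group)) for key, group in groupby(sequence))
    return max(runs, key=lambda run: run[1])

sample = ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'A', 'B']
print(longest_run(sample))  # ('B', 4)
```

A shuffle satisfies the "no more than three in a row" rule exactly when the returned length is at most 3.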

Find viable combination from a list of preferences

I have an object that looks like this:
a - ['A', 'B', 'C']
b - ['A', 'B', 'C']
c - ['A', 'B', 'C', 'D']
d - ['A', 'B', 'C', 'D']
Each of the keys has a number of available options, as denoted by the list (e.g. a can choose between A, B, C and so on). I want to find a combination of pairs that will satisfy everyone. This could be:
# Chosen Remaining Available Options
------------------------------------------
a - B - ['A', 'B', 'C'] - ['A', 'B', 'C']
b - A - ['A', 'C'] - ['A', 'B', 'C']
c - D - ['C', 'D'] - ['A', 'B', 'C', 'D']
d - C - ['C'] - ['A', 'B', 'C', 'D']
So in the example above a chose item B, reducing the pool of available options for the remaining participants. b then chose item A, and so on.
I do this by looping through all participants in order of the size of their pool of available choices, the idea being that if a participant wants only one item, then there is no choice but to give him that item and remove it from the pool.
import random
team_choices = {'a': ['A', 'B', 'C'],
                'b': ['A', 'B', 'C'],
                'c': ['A', 'B', 'C', 'D'],
                'd': ['A', 'B', 'C', 'D']}
teams_already_created = []
# visit the participants with the fewest available options first
for team_b in sorted(team_choices, key=lambda team: len(team_choices[team])):
    available_opponents = [opponent for opponent in team_choices[team_b] if opponent not in teams_already_created]
    chosen_opponent = random.choice(available_opponents)
    teams_already_created.append(chosen_opponent)
The way I do it will not always work, though, since there is no guarantee that an earlier choice will not choke some other player later on, leaving him with no available options. And if the list of available opponents is empty, then obviously this will fail.
Is there a better way of doing this that will work every time?

This is the problem of finding a maximum matching in a bipartite graph (participants on one side, options on the other). There are polynomial-time algorithms (e.g., Hopcroft–Karp).
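As a rough illustration of the matching idea, here is a minimal sketch that uses the simple augmenting-path method rather than Hopcroft–Karp itself (it is slower in the worst case but much shorter). The function name find_assignment is an assumption made for illustration. When a participant's options are all taken, the current holder of an option is asked to move to another option, recursively.

```python
def find_assignment(choices):
    """Return {participant: option} for a maximum matching.

    choices maps each participant to the list of options it accepts.
    """
    owner = {}  # option -> participant currently holding it

    def try_assign(person, seen):
        for option in choices[person]:
            if option in seen:
                continue
            seen.add(option)
            # Take a free option, or ask its current owner to move elsewhere.
            if option not in owner or try_assign(owner[option], seen):
                owner[option] = person
                return True
        return False

    for person in choices:
        try_assign(person, set())
    return {person: option for option, person in owner.items()}

team_choices = {'a': ['A', 'B', 'C'],
                'b': ['A', 'B', 'C'],
                'c': ['A', 'B', 'C', 'D'],
                'd': ['A', 'B', 'C', 'D']}
assignment = find_assignment(team_choices)
print(assignment)
```

If the returned dictionary has fewer entries than there are participants, no assignment can satisfy everyone.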

Create combinations from a list and 3 ranges using itertools

I have the following:
a_list = [A,B,C]
r1 = range(1,5)
r2 = range(1,5)
r3 = range(1,5)
I would like to be able to find the various combinations of the elements in this list against the ranges. For example:
combi1 = [A, B, C, C, C]
combi2 = [A, A, B, C, C]
combi3 = [A, A, A, B, C]
combi4 = [A, B, B, C, C]
Etc.
I am able to do so if there are only 2 ranges, but I'm not sure how to fit a third range in.
inc = range(1, 5)
desc = range(5, 1, -1)
combis = [list(itertools.chain(*(itertools.repeat(elem, n) for elem, n in zip(a_list, [i,j])))) for i,j in zip(inc,desc)]
SOLUTION:
import itertools
a_list = ['A', 'B', 'C']
def all_exist(avalue, bvalue):
    # True if every element of avalue occurs somewhere in bvalue
    return all(any(x in y for y in bvalue) for x in avalue)
combins = itertools.combinations_with_replacement(a_list, 5)
combins_list = [list(i) for i in combins]
for c in combins_list:
    if all_exist(a_list, c):
        print(c)
output:
['A', 'A', 'A', 'B', 'C']
['A', 'A', 'B', 'B', 'C']
['A', 'A', 'B', 'C', 'C']
['A', 'B', 'B', 'B', 'C']
['A', 'B', 'B', 'C', 'C']
['A', 'B', 'C', 'C', 'C']
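An alternative sketch that stays closer to the original idea of one range per element: let each range supply the repeat count of one element and keep only the count triples that add up to the target length. The variable name total and the fixed length 5 are assumptions taken from the example lists.

```python
from itertools import product

a_list = ['A', 'B', 'C']
total = 5  # desired length of every combination (assumed from the examples)

combos = []
# Each element gets a repeat count of at least 1; keep triples summing to total.
for counts in product(range(1, total + 1), repeat=len(a_list)):
    if sum(counts) == total:
        combos.append([elem for elem, n in zip(a_list, counts) for _ in range(n)])

for combo in combos:
    print(combo)
```

This produces the same six lists as the filter above, only in a different order.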

@doyz, I think this may be what you are looking for:
From a list abc = ['A','B','C'], you want to obtain its various combinations with replacement. Python's built-in itertools module can do this.
import itertools
abc = ['A', 'B', 'C']
combins = itertools.combinations_with_replacement(abc, 5)
combins_list = [list(i) for i in combins]
print(combins_list[0:10])
These are the first 10 combinations with replacement:
[['A', 'A', 'A', 'A', 'A'], ['A', 'A', 'A', 'A', 'B'], ['A', 'A', 'A', 'A', 'C'], \
['A', 'A', 'A', 'B', 'B'], ['A', 'A', 'A', 'B', 'C'], ['A', 'A', 'A', 'C', 'C'], \
['A', 'A', 'B', 'B', 'B'], ['A', 'A', 'B', 'B', 'C'], ['A', 'A', 'B', 'C', 'C'], ['A', 'A', 'C', 'C', 'C']]
If you want every combination to include all elements of abc, here is one way, which also generates the permutations:
import itertools
abc = ['A', 'B', 'C']
combins = itertools.combinations_with_replacement(abc, 5)
combins_list = list(combins)
combins_all = []
for i in combins_list:
    # keep only the combinations that contain every element of abc
    if len(set(i)) == len(abc):
        combins_all.append(i)
print(combins_all)
include_permutations = []
for i in combins_all:
    permut = list(itertools.permutations(i))
    include_permutations.append(permut)
print(include_permutations)
Is this okay?
*Note: itertools.combinations_with_replacement and itertools.permutations do not return a list or a tuple but an iterator object, so you can't treat the result as a list; wrap it in list() if you need indexing or want to reuse it.
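A small sketch of what that note means in practice: the object returned by itertools can be walked only once, so a second pass over it yields nothing unless the results were stored in a list first.

```python
import itertools

combos = itertools.combinations_with_replacement('ABC', 2)

first_pass = list(combos)   # consumes the iterator
second_pass = list(combos)  # nothing left to yield

print(len(first_pass))   # 6
print(second_pass)       # []
```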
