I would like to merge two dataframes. Both have the same column names, but different numbers of rows.
The values from the smaller dataframe should then replace the values from the other dataframe.
So far I have tried using pd.merge:
pd.merge(df1, df2, how='left', on='NodeID')
But I do not know how to tell the merge command to use the values from the right dataframe for the columns 'X' and 'Y'.
df1 = pd.DataFrame(data={'NodeID': [1, 2, 3, 4, 5], 'X': [0, 0, 0, 0, 0], 'Y': [0, 0, 0, 0, 0]})
df2 = pd.DataFrame(data={'NodeID': [2, 4], 'X': [1, 1], 'Y': [1, 1]})
The result should then look like this:
df3 = pd.DataFrame(data={'NodeID': [1, 2, 3, 4, 5], 'X': [0, 1, 0, 1, 0], 'Y':[0, 1, 0, 1, 0]})
This can be done with concat and drop_duplicates: because df2 comes first in the concat, drop_duplicates keeps its rows for the NodeIDs that appear in both frames.
pd.concat([df2,df1]).drop_duplicates('NodeID').sort_values('NodeID')
Out[763]:
NodeID X Y
0 1 0 0
0 2 1 1
2 3 0 0
1 4 1 1
4 5 0 0
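For completeness, a minimal sketch of the same replacement with DataFrame.update, using the df1/df2 defined above (this is an alternative to the concat approach, not what the answer above uses):
df3 = df1.set_index('NodeID')
df3.update(df2.set_index('NodeID'))  # overwrites X and Y only where a NodeID exists in df2
df3 = df3.reset_index()
# depending on the pandas version, update may upcast X and Y to float;
# df3[['X', 'Y']] = df3[['X', 'Y']].astype(int) restores integer columns if needed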
I have a data frame (df) with these columns: user, vector, and group.
df = pd.DataFrame({'user': ['user_1', 'user_2', 'user_3', 'user_4', 'user_5', 'user_6'], 'vector': [[1, 0, 2, 0], [1, 8, 0, 2],[6, 2, 0, 0], [5, 0, 2, 2], [3, 8, 0, 0],[6, 0, 0, 2]], 'group': ['A', 'B', 'C', 'B', 'A', 'A']})
I want to calculate aggregated variance for each group.
I tried this code, but it returns an error:
aggregated_variance = (df.groupby('group', as_index=False)['vector'].agg(["var"]))
ValueError: no results
You can use .explode to clean up your data and then perform a .groupby operation:
out = (
    df.explode('vector')
      .groupby('group')['vector'].var(ddof=1)
)
print(out)
group
A 7.060606
B 7.428571
C 8.000000
Name: vector, dtype: float64
The trick here lies in the use of .explode:
>>> df.head()
user vector group
0 user_1 [1, 0, 2, 0] A
1 user_2 [1, 8, 0, 2] B
2 user_3 [6, 2, 0, 0] C
3 user_4 [5, 0, 2, 2] B
4 user_5 [3, 8, 0, 0] A
>>> df.explode('vector').head()
user vector group
0 user_1 1 A
0 user_1 0 A
0 user_1 2 A
0 user_1 0 A
1 user_2 1 B
...
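One caveat, hedged because it depends on the pandas version: after .explode the vector column can keep an object dtype, which some versions refuse to pass to .var(). A minimal sketch with an explicit cast (same df as above):
out = (
    df.explode('vector')
      .astype({'vector': float})  # cast in case the exploded column stays object dtype
      .groupby('group')['vector'].var(ddof=1)
)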
If you take the sum() after grouping df, you get a Series holding the concatenated list of all vector values for each group. Then apply a lambda that calculates the variance of each list of vector values.
import numpy as np
aggregated = df.groupby("group")['vector'].sum()  # summing lists concatenates them per group
aggregated_variance = aggregated.apply(lambda x: np.var(x)).reset_index()  # np.var defaults to ddof=0 (population variance)
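Note that np.var's default gives the population variance; to reproduce the sample variance from the .explode answer above, a one-line sketch with ddof=1:
aggregated_variance = aggregated.apply(lambda x: np.var(x, ddof=1)).reset_index()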
I am no data scientist. I do know Python, and I currently have to manage time series data that comes in at a regular interval. Much of this data is all zeros or values that stay the same for a long time, and to save memory I'd like to filter them out. Is there some standard method for this (which I am obviously unaware of), or should I implement my own algorithm?
What I want to achieve is the following:
interval  value  result (summed)
1         0      0
2         0      # removed
3         0      0
4         1      1
5         2      2
6         2      # removed
7         2      # removed
8         2      2
9         0      0
10        0      0
Any help appreciated!
You could use pandas query on dataframes to achieve this:
import pandas as pd
matrix = [[1, 0, 0],
          [2, 0, 0],
          [3, 0, 0],
          [4, 1, 1],
          [5, 2, 2],
          [6, 2, 0],
          [7, 2, 0],
          [8, 2, 2],
          [9, 0, 0],
          [10, 0, 0]]
df = pd.DataFrame(matrix, columns=list('abc'))
print(df.query("c != 0"))
There is no quick function call that does exactly what you need. The following is one way:
import pandas as pd
df = pd.DataFrame({'interval': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'value':    [0, 0, 0, 1, 2, 2, 2, 2, 0, 0]})  # example dataframe
df['group'] = df['value'].ne(df['value'].shift()).cumsum()  # column that increments every time the value changes
df['key'] = 1  # create a column of ones
df['key'] = df.groupby('group')['key'].transform('cumsum')  # get the cumulative sum within each group
df['key'] = df.groupby('group')['key'].transform(lambda x: x.isin([x.min(), x.max()]))  # check which key is minimum and which is maximum by group
df = df[df['key'] == True].drop(columns=['group', 'key'])  # keep only the relevant cases
df
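A more compact sketch of the same neighbour-comparison idea, assuming df is the freshly built example frame above (before the helper columns are added): keep a row whenever its value differs from either the previous or the next value.
mask = df['value'].ne(df['value'].shift()) | df['value'].ne(df['value'].shift(-1))
df[mask]  # drops intervals 2, 6 and 7 for the example data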
Here is the code:
l = [0, 0, 0, 1, 2, 2, 2, 2, 0, 0]
for i, ll in enumerate(l):
    if i != 0 and ll == l[i-1] and i < len(l) - 1 and ll == l[i+1]:
        continue
    print(i + 1, ll)
It produces what you want. You haven't specified the format of your input data, so I assumed it comes as a list. The conditions ll == l[i-1] and ll == l[i+1] are the key to skipping the repeated values.
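For what it's worth, running the loop on the example list prints the following, matching the intervals you wanted to keep:
1 0
3 0
4 1
5 2
8 2
9 0
10 0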
Thanks all!
Looking at the answers I guess I can conclude I'll need to roll my own. I'll be using your input as inspiration.
Thanks again!
I'm attempting to append a binary numpy array to another numpy array to feed into a neural network. The binary list is dependent on the column that the array is coming from.
For example, an array that comes from the third column is [0 0 1 0 0 0 0 0 0].
Here is an example:
Data (list of arrays):
[[0, 1, 1, 1, 0], [0, 1, 0, 0, 1], [1, 0, 0, 0, 0]]
Let's say that the first two elements came from the first column of a dataframe and the third element came from the second column. After appending the binary array the data would look something like this:
[([0, 1, 1, 1, 0],
[1 0 0 0 0 0 0 0 0]),
([0, 1, 0, 0, 1],
[1 0 0 0 0 0 0 0 0]),
([1, 0, 0, 0, 0],
[0 1 0 0 0 0 0 0 0])]
For context, I was originally training on just a single column of a dataframe; however, I now want to be able to train over the entire dataframe.
Is there a way to automatically append this array to my data depending on the column the data is coming from so that the neural network can train on the whole data set rather than just going column by column?
Additionally, would this require two input layers or just one?
Maybe you could add a more concrete example to your question. But anyway, is this what you're expecting?
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'col1': [[0,0,1], [1,1,1]], 'col2': [[1,1,0],[0,0,0]]})
In [3]: df
Out[3]:
col1 col2
0 [0, 0, 1] [1, 1, 0]
1 [1, 1, 1] [0, 0, 0]
In [4]: for col_index, col_name in enumerate(df.columns):
   ...:     array_to_append = [0] * len(df.columns)
   ...:     array_to_append[col_index] = 1
   ...:     df[col_name] = df[col_name].map(lambda x: (x, array_to_append))
   ...:
In [5]: df
Out[5]:
col1 col2
0 ([0, 0, 1], [1, 0]) ([1, 1, 0], [0, 1])
1 ([1, 1, 1], [1, 0]) ([0, 0, 0], [0, 1])
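For the "append automatically depending on the column" part, here is a hedged NumPy sketch (the df is the small example from the session above, not your real data; np.eye builds one one-hot tag per column). Since each sample ends up as a single flat vector, one input layer should be enough.
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': [[0, 0, 1], [1, 1, 1]], 'col2': [[1, 1, 0], [0, 0, 0]]})
one_hot = np.eye(len(df.columns), dtype=int)  # row i is the tag for column i
samples = [np.concatenate([np.asarray(x), one_hot[i]])
           for i, col in enumerate(df.columns)
           for x in df[col]]
print(samples[0])  # [0 0 1 1 0] -> the original vector followed by the col1 tag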
I have a pandas dataframe. When I call df.values, the result looks like this:
np.array([
[1, 2, [0, 0, 0], 3],
[1, 2, [0, 0, 0], 3]
])
and I want it to look like this when calling df.values:
np.array([
[1, 2, 0, 0, 0, 3],
[1, 2, 0, 0, 0, 3]
])
Please help me out.
Assuming your dataframe is:
df = pd.DataFrame([
[1, 2, [0, 0, 0], 3],
[1, 2, [0, 0, 0], 3]
])
I'd use the insight from this post by @wim, with the modified function presented below.
This flattens arbitrarily nested collections.
from collections.abc import Iterable  # collections.Iterable was removed in Python 3.10

def flatten(collection):
    for element in collection:
        if isinstance(element, Iterable) and not isinstance(element, str):
            yield from flatten(element)
        else:
            yield element
I can then use this to flatten each row of the dataframe:
pd.DataFrame([*map(list, map(flatten, df.values))])
0 1 2 3 4 5
0 1 2 0 0 0 3
1 1 2 0 0 0 3
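If you only need the flat ndarray rather than a new DataFrame, a small sketch that builds it straight from the generator above:
import numpy as np
arr = np.array([[*flatten(row)] for row in df.values])
arr  # array([[1, 2, 0, 0, 0, 3], [1, 2, 0, 0, 0, 3]])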
How do I filter duplicated rows with overlapping labels? I need a subset of the dataframe where each group of duplicated rows is replaced by a single row carrying the most frequent label.
Consider a dataframe df:
df = pd.DataFrame({
'X' : [1, -1, 1, 1, 3, -2, -1, -1],
'Y' : [2, 3, 2, 2, 2, -1, 3, 3],
'label' : [0, 1, 1, 0, 2, 1, 2, 2]
})
After filtering, the following subset df_output is expected
df_output = pd.DataFrame({
'X' : [1, -1, 3, -2],
'Y' : [2, 3, 2, -1],
'label' : [0, 2, 2, 1]
})
I think you are looking for groupby + mode, i.e.
df.groupby(['X', 'Y'])['label'].apply(lambda x: x.mode().values[0]).reset_index()
Output:
X Y label
0 -2 -1 1
1 -1 3 2
2 1 2 0
3 3 2 2
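An equivalent sketch, if you prefer spelling out "the most frequent label" with value_counts instead of mode:
(df.groupby(['X', 'Y'])['label']
   .agg(lambda s: s.value_counts().idxmax())
   .reset_index())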