I am no data scientist. I do know Python, and I currently have to manage time series data that comes in at a regular interval. Much of this data is all zeros or values that stay the same for a long time, and to save memory I'd like to filter them out. Is there a standard method for this (which I am obviously unaware of), or should I implement my own algorithm?
What I want to achieve is the following:
interval  value (summed)  result
1         0               0
2         0               # removed
3         0               0
4         1               1
5         2               2
6         2               # removed
7         2               # removed
8         2               2
9         0               0
10        0               0
Any help appreciated !
You could use pandas query on dataframes to achieve this:
import pandas as pd
matrix = [[1,0, 0],
[2, 0, 0],
[3, 0, 0],
[4, 1, 1],
[5, 2, 2],
[6, 2, 0],
[7, 2, 0],
[8, 2, 2],
[9, 0, 0],
[10,0, 0]]
df = pd.DataFrame(matrix, columns=list('abc'))
print(df.query("c != 0"))
There is no quick function call to do what you need. The following is one way:
import pandas as pd
df = pd.DataFrame({'interval':[1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'value':[0, 0, 0, 1, 2, 2, 2, 2, 0, 0]}) # example dataframe
df['group'] = df['value'].ne(df['value'].shift()).cumsum() # column that increments every time the value changes
df['key'] = 1 # create column of ones
df['key'] = df.groupby('group')['key'].transform('cumsum') # get the cumulative sum
df['key'] = df.groupby('group')['key'].transform(lambda x: x.isin( [x.min(), x.max()])) # check which key is minimum and which is maximum by group
df = df[df['key']==True].drop(columns=['group', 'key']) # keep only relevant cases
df
Here is the code:
l = [0, 0, 0, 1, 2, 2, 2, 2, 0, 0]
for (i, ll) in enumerate(l):
    if i != 0 and ll == l[i-1] and i<len(l)-1 and ll == l[i+1]:
        continue
    print(i+1, ll)
It produces what you want. You haven't specified format of your input data, so I assumed they're in a list. The conditions ll == l[i-1] and ll == l[i+1] are key to skipping the repeated values.
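The neighbor comparison in the loop above can also be written as a vectorized pandas mask. The following is a minimal sketch, assuming the data sits in a DataFrame with a `value` column; the helper name `drop_run_interiors` is hypothetical and not part of pandas.

```python
import pandas as pd

def drop_run_interiors(df, col="value"):
    """Keep only the first and last row of every run of equal values."""
    s = df[col]
    starts_run = s.ne(s.shift())    # True where the value differs from the previous row
    ends_run = s.ne(s.shift(-1))    # True where the value differs from the next row
    return df[starts_run | ends_run]

df = pd.DataFrame({"interval": range(1, 11),
                   "value": [0, 0, 0, 1, 2, 2, 2, 2, 0, 0]})
filtered = drop_run_interiors(df)
print(filtered["interval"].tolist())  # [1, 3, 4, 5, 8, 9, 10]
```

One caveat of this sketch: NaN never compares equal to NaN, so runs of missing values are kept in full rather than collapsed.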
Thanks all!
Looking at the answers I guess I can conclude I'll need to roll my own. I'll be using your input as inspiration.
Thanks again !
Related
I am trying to "map a list of elements to a range of an element from another list to create unique matrices." Let me explain with a drawing.
Kickstart-inspired question
I hope that it makes sense.
This is inspired by Google Kickstart competition, which means that it is not a question exactly required by the contest.
But I thought of this question and I think that it is worth exploring.
But I am stuck with myself and not being able to move on much.
Here is the code I have, which obviously is not a correct solution.
values = input("please enter your input: ")
values = values.split()
values = [int(i) for i in values]
>>> please enter your input: 2 4 3 1 0 0 1 0 1 1 0 0 1 1 0 6 4 1 0 0 0 1 0 0 1 1 1 1 1 1 0 1 0 1 0 1 0 1 1 1 0
rows_columns = []
matrix = []
for i in values:
    if i > 1:
        rows_columns[:1].append(i) # The "2" at the very beginning indicates how many matrices should be formed
    elif i <= 1:
        matrix.append(i)
rows_columns[:1]
>>> [4, 3, 6, 4]
matrix_all = []
for i in range(1, len(rows_columns)):
    matrix_sub = []
    for j in range(rows_columns[i]):
        matrix_sub.append(matrix[j])
    if matrix_sub not in matrix_all:
        matrix_all.append(matrix_sub)
>>> [[1, 0, 0, 1], [1, 0, 0], [1, 0, 0, 1, 0, 1], [1, 0, 0, 1]]
I really wonder if the nested loop is a good idea to solve this question. This is the best way I could think of for the last couple of hours. What I want to get as a final result looks like below.
Final expected output
Given one list that says how many rows and columns each matrix should have, and another list with just enough elements to fill those matrices, how can I build the matrices from the second list using the dimensions in the first?
I hope that it is clear, let me know when it is not.
Thanks!
Without using numpy, here is one working solution, based on the input found in your code snippet, and the expected result listed in your final expected result link:
values = [2, 4, 3, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 6, 4, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,
0, 1, 0, 1, 0, 1, 1, 1, 0]
v_idx = 1
"""
As per example, the number of matrices desired is found in the first input list element.
In the above values list, we want 2 matrices. The for loop below therefore executes exactly 2 times
"""
for matrix_nr in range(values[0]):
    # The nr of rows and nr of columns are the next two elements in the values list
    nr_rows = values[v_idx]
    nr_cols = values[v_idx + 1]
    # Calculate the start index for the next matrix specifications
    new_idx = v_idx+2+(nr_rows*nr_cols)
    # Slice the values list to extract the values for the current matrix to format
    sub_elements = values[v_idx+2: new_idx]
    matrix = []
    # Append elements to the matrix by slicing values according to nr_rows and nr_cols
    for r in range(nr_rows):
        start_idx = r*nr_cols
        end_idx = (r+1)*nr_cols
        matrix.append(sub_elements[start_idx:end_idx])
    print(matrix)
    v_idx = new_idx
This gives the expected result:
[[1, 0, 0], [1, 0, 1], [1, 0, 0], [1, 1, 0]]
[[1, 0, 0, 0], [1, 0, 0, 1], [1, 1, 1, 1], [1, 0, 1, 0], [1, 0, 1, 0], [1, 1, 1, 0]]
As said, numpy could very likely be used to be a lot more efficient.
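For reference, here is a rough sketch of what a NumPy variant could look like, assuming the same flat input layout (matrix count first, then rows, columns, and cells for each matrix). The function name `parse_matrices` is made up for illustration; `reshape` simply replaces the inner row-slicing loop.

```python
import numpy as np

values = [2, 4, 3, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 6, 4,
          1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,
          0, 1, 0, 1, 0, 1, 1, 1, 0]

def parse_matrices(flat):
    """Split [count, rows, cols, cells..., rows, cols, cells...] into 2-D arrays."""
    matrices = []
    idx = 1  # position of the first "rows" entry
    for _ in range(flat[0]):
        rows, cols = flat[idx], flat[idx + 1]
        start, end = idx + 2, idx + 2 + rows * cols
        # reshape turns the flat slice into a rows x cols matrix in one step
        matrices.append(np.array(flat[start:end]).reshape(rows, cols))
        idx = end
    return matrices

matrices = parse_matrices(values)
for m in matrices:
    print(m.shape)
```

Note that `reshape` raises a `ValueError` if the slice does not contain exactly `rows * cols` elements, which doubles as a sanity check on the input.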
I'm trying to solve the following Python interview question using pandas:
Given an m x n matrix, if an element is 0, set its entire row and column to 0. Do it in-place.
The solution must not use enumerate.
Here are some examples:
Example 1
[[1, 1, 1], [1, 0, 1], [1, 1, 1]] # input
[[1, 0, 1], [0, 0, 0], [1, 0, 1]] # output
Example 2
[[0, 1, 2, 0], [3, 4, 5, 2], [1, 3, 1, 5]] # input
[[0, 0, 0, 0], [0, 4, 5, 0], [0, 3, 1, 0]] # output
You can try this:
import pandas as pd
lst = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
df = pd.DataFrame(lst)
df_result = df.copy(deep=True)
df_result.loc[df.eq(0).any(axis=1)] = 0
df_result.loc[:, df.eq(0).any(axis=0)] = 0
result = df_result.values.tolist()
output:
[[1, 0, 1], [0, 0, 0], [1, 0, 1]]
Using only built-in Python functions:
# Example data (list)
lst = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
# For each row, if any of the values in the row is 0, replace all the values with 0
# Obs: I'm using a `list comprehension` to make the code shorter
for row in lst:
    if any([value==0 for value in row]):
        row[:] = [0] * len(row)
Using numpy:
# Import and create the array from the list
import numpy as np
a = np.array(lst)
# Set zeros in-place
a[(a==0).any(1), :] = 0
Using pandas:
# Import and create the dataframe from the list
import pandas as pd
df = pd.DataFrame(lst)
# Set zeros in-place
df.loc[df.eq(0).any(axis=1), :] = 0
The output is the same for all of them: every row that originally contained at least one zero becomes all zeros. Note that this logic handles rows only, not columns. As you're still learning nested lists in Python, I would recommend continuing your studies with Python's built-in classes, methods, and functions. Afterwards you may want to look at how indexing works in numpy and pandas to get a better understanding of the code here.
Output:
print(lst)
[[1, 1, 1], [0, 0, 0], [1, 1, 1]]
print(a)
[[1 1 1]
 [0 0 0]
 [1 1 1]]
# ignore the first line and column,
# as they indicate the row and column names, respectively:
print(df)
   0  1  2
0  1  1  1
1  0  0  0
2  1  1  1
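The snippets above zero only the rows, while the question also asks for the columns. Below is a minimal pure-Python sketch that handles both, works in place, and avoids `enumerate`. The function name `set_zeroes` is an assumption for illustration.

```python
def set_zeroes(matrix):
    """Zero every row and column of `matrix` that contains a 0, in place."""
    if not matrix:
        return
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Record the affected rows and columns first, so that zeros written
    # later do not cause further rows or columns to be cleared.
    zero_rows = {r for r in range(n_rows) if 0 in matrix[r]}
    zero_cols = {c for c in range(n_cols)
                 if any(matrix[r][c] == 0 for r in range(n_rows))}
    for r in range(n_rows):
        for c in range(n_cols):
            if r in zero_rows or c in zero_cols:
                matrix[r][c] = 0

m1 = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
set_zeroes(m1)
print(m1)  # [[1, 0, 1], [0, 0, 0], [1, 0, 1]]

m2 = [[0, 1, 2, 0], [3, 4, 5, 2], [1, 3, 1, 5]]
set_zeroes(m2)
print(m2)  # [[0, 0, 0, 0], [0, 4, 5, 0], [0, 3, 1, 0]]
```

Collecting the zero positions before writing is the key design choice here; zeroing while scanning would spread zeros to rows and columns that never contained one.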
I have a numpy array like this:
a = np.array([[1, 0, 1, 1, 1],
              [1, 1, 1, 1, 0],
              [1, 0, 0, 1, 1],
              [1, 0, 1, 0, 1]])
Question 1:
As shown in the title, I want to replace all elements after the first zero in each row with zero. The result should look like this:
a = np.array([[1, 0, 0, 0, 0],
              [1, 1, 1, 1, 0],
              [1, 0, 0, 0, 0],
              [1, 0, 0, 0, 0]])
Question 2: how to slice different columns for each row like this example?
I am dealing with a large array, so I would appreciate an efficient way to solve this. Thank you very much.
One way to accomplish question 1 is to use numpy.cumprod
>>> np.cumprod(a, axis=1)
array([[1, 0, 0, 0, 0],
[1, 1, 1, 1, 0],
[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0]])
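`np.cumprod` returns the desired array directly here because every entry is 0 or 1. For arrays with other nonzero values, one possible adaptation is to take the cumulative product of a boolean "is nonzero" mask and multiply it back in. The sample array below is made up for illustration.

```python
import numpy as np

# Hypothetical data with values other than 0 and 1
a = np.array([[3, 0, 7, 2, 5],
              [4, 9, 1, 6, 0],
              [2, 2, 2, 2, 2]])

# 1 while no zero has been seen in the row, 0 from the first zero onward
keep = np.cumprod(a != 0, axis=1)
result = a * keep
print(result)
```

Rows without any zero, like the last one, pass through unchanged.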
Question 1:
You could iterate over the array like so:
for i in range(a.shape[0]):
    j = 0
    row = a[i]
    # The bound check avoids an IndexError on rows that contain no zero
    while j < len(row) and row[j] > 0:
        j += 1
    row[j+1:] = 0
This will change the array in-place. If you are interested in very high performance, the answers to this question could be of use to find the first zero faster. np.where scans the entire array for this and therefore is not optimal for the task.
Actually, the fastest solution depends somewhat on the distribution of your array entries. If the array holds many floats and zeros are rare, the while loops in the code above terminate late on average, so only a few zeros have to be written. If, however, there are only two possible entries, as in your sample array, and they occur with similar probability (about 50%), many zeros would have to be written to a, and the following will be faster:
b = np.zeros(a.shape)
for i in range(a.shape[0]):
    j = 0
    a_row = a[i]
    b_row = b[i]
    # Copy entries until the first zero; the bound check covers rows without one
    while j < len(a_row) and a_row[j] > 0:
        b_row[j] = a_row[j]
        j += 1
Question 2:
If you mean to slice each row individually on a similar criterion dealing with a first occurrence of some kind, you could simply adapt this iteration pattern. If the criterion is more global (like finding the maximum of the row, for example), built-in methods like np.where exist that will be more efficient, but which choice is best probably depends on the criterion itself.
Question 1: An efficient way to do this would be the following.
import numpy as np
a = np.array([[1, 0, 1, 1, 1],
              [1, 1, 1, 1, 0],
              [1, 0, 0, 1, 1],
              [1, 0, 1, 0, 1]])
for row in a:
    zeros = np.where(row == 0)[0]
    if (len(zeros)):# Check if zero exists
        row[zeros[0]:] = 0
print(a)
Output:
[[1 0 0 0 0]
 [1 1 1 1 0]
 [1 0 0 0 0]
 [1 0 0 0 0]]
Question 2: Using the same array, for each row rowIdx, you can have an array of columns colIdxs that you want to extract from.
rowIdx = 2
colIdxs = [1, 3, 4]
print(a[rowIdx, colIdxs])
Output:
[0 1 1]
I prefer Ayrat's creative answer for the first question, but if you need to slice different columns for different rows of a large array, this could help you:
indexer = tuple(np.s_[i:a.shape[1]] for i in (a==0).argmax(axis=1))
for i,j in enumerate(indexer):
    a[i,j]=0
indexer:
(slice(1, 5, None), slice(4, 5, None), slice(1, 5, None), slice(1, 5, None))
or:
indexer = (a==0).argmax(axis=1)
for i in range(a.shape[0]):
    a[i,indexer[i]:]=0
indexer:
[1 4 1 1]
output:
[[1 0 0 0 0]
[1 1 1 1 0]
[1 0 0 0 0]
[1 0 0 0 0]]
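One thing to watch with `argmax` on a boolean array: for a row that contains no zero it returns 0, so the loops above would blank that whole row. The sketch below guards against that case and builds the mask without a Python loop. The third row is changed from the question's array to show the case; this is one possible approach, not the only one.

```python
import numpy as np

a = np.array([[1, 0, 1, 1, 1],
              [1, 1, 1, 1, 0],
              [1, 1, 1, 1, 1],   # no zero (changed for illustration): must stay unchanged
              [1, 0, 1, 0, 1]])

is_zero = (a == 0)
# argmax returns 0 for an all-False row, so use "past the end" for rows without a zero
first_zero = np.where(is_zero.any(axis=1), is_zero.argmax(axis=1), a.shape[1])
# True at every position at or after the first zero of its row
mask = np.arange(a.shape[1]) >= first_zero[:, None]
a[mask] = 0
print(a)
```

The comparison broadcasts a row of column indices against a column of per-row start positions, which gives a different cut-off for each row.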
Given a data frame df:
Column A: [0, 1, 3, 4, 6]
Column B: [0, 0, 0, 0, 0]
The goal is to conditionally replace values in column B. If a value in column A is in the set assignedToA, we replace the corresponding value in column B with a constant b.
For example: if b=1 and assignedToA={1,4}, the result would be
Column A: [0, 1, 3, 4, 6]
Column B: [0, 1, 0, 1, 0]
My code for finding the matching A values and writing b into column B looks like this:
df.loc[df['A'].isin(assignedToA),'B']=b
This code works, but it is really slow for a huge dataframe.
Do you have any advice on how to speed this up?
The dataframe df has around 5 Million rows and assignedToA has a maximum of 7 values.
You may find a performance improvement by dropping down to numpy:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 3, 4, 6],
                   'B': [0, 0, 0, 0, 0]})
def jp(df, vals, k):
    B = df['B'].values
    B[np.in1d(df['A'], list(vals))] = k
    df['B'] = B
    return df
def original(df, vals, k):
    df.loc[df['A'].isin(vals),'B'] = k
    return df
df = pd.concat([df]*100000)
%timeit jp(df, {1, 4}, 1) # 8.55ms
%timeit original(df, {1, 4}, 1) # 16.6ms
I have a pandas.DataFrame, and I'm only interested in the values of its last column.
np.shape(dataframe.iloc[:,:]) # the output is (2190,460)
# Now here is the shape of one cell in the last column
np.shape(dataframe.iloc[0,-1]) # the output is ( 20,)
dataframe.iloc[0,-1] # the output [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
My question is: how can I get this column saved with the shape (2190, 20)?
At the moment, running:
np.shape(dataframe.iloc[:,-1]) # the output (2190,)
And this shape is causing me huge problems.
Solution using a loop:
Test_Labels = []
for i in range(len(dataframe)):
    Test_Labels.append(dataframe.iloc[i,-1])
np.shape(Test_Labels)
If someone can solve it using a pandas function, I will be glad to see it.
You can get the last column from a pandas.DataFrame like:
Code:
df[df.columns[-1]]
Test Code:
import datetime as dt
import pandas as pd

df = pd.DataFrame({"a": [dt.datetime(2017, 1, 3),
                         dt.datetime(2017, 2, 4),
                         dt.datetime(2017, 3, 5)],
                   "b": [[2, 4], [6, 8], [10, 12]]})
print(df)
print(df[df.columns[-1]])
Results:
           a         b
0 2017-01-03    [2, 4]
1 2017-02-04    [6, 8]
2 2017-03-05  [10, 12]
0      [2, 4]
1      [6, 8]
2    [10, 12]
Name: b, dtype: object
But I need the result to be an array of arrays not lists
If you need to convert the array of lists to an array of arrays, cast the whole column to a numpy.array; numpy will reach in and convert the inner lists to arrays.
import numpy as np
last_col = np.array(list(df[df.columns[-1]]))
print(last_col)
print(last_col.shape)
Results:
[[ 2  4]
 [ 6  8]
 [10 12]]
(3, 2)
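Applied to the shape from the question, the same idea might look like the sketch below. It uses `np.stack` as an alternative to `np.array(list(...))`, and a small made-up frame stands in for the (2190, 460) one, with each cell of the last column holding a 20-element list.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the asker's frame: 5 rows, last column holds 20-element lists
df = pd.DataFrame({"x": range(5),
                   "labels": [[1] + [0] * 19 for _ in range(5)]})

last = df.iloc[:, -1]
print(np.shape(last))               # (5,)
stacked = np.stack(last.to_list())  # one row per cell
print(stacked.shape)                # (5, 20)
```

`np.stack` raises a `ValueError` if the cells do not all have the same length, so ragged data is reported as an error rather than silently becoming an object array.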