How do you index outliers in Python?

I am trying to remove outliers from a list in python. I want to get the index values of each outlier from an original list so I can remove it from (another) corresponding list.
Simple example:
My list with outliers:
y = [1, 2, 3, 4, 500]  # 500 is the outlier; it has an index of 4
My corresponding list:
x = [1, 2, 3, 4, 5]  # I want to remove 5, which has the same index of 4
My result/goal:
y = [1, 2, 3, 4]
x = [1, 2, 3, 4]
This is my code, and I want to achieve the same with klist and avglatlist
import numpy as np

klist = ['1', '2', '3', '4', '5', '6', '7', '8', '4000']
avglatlist = ['1', '2', '3', '4', '5', '6', '7', '8', '9']
klist = np.array(klist).astype(float)
klist = klist[(abs(klist - np.mean(klist))) < (2 * np.std(klist))]
indices = []
for k in klist:
    if (k - np.mean(klist)) > (2 * np.std(klist)):
        i = klist.index(k)
        indices.append(i)
print('indices' + str(indices))
avglatlist = np.array(avglatlist).astype(float)
for index in sorted(indices, reverse=True):
    del avglatlist[index]
print(len(klist))
print(len(avglatlist))

How to get the index values of each outlier in a list?
Say an outlier is defined as 2 standard deviations from a mean. This means you'd want to know the indices of values in a list where zscores have absolute values greater than 2.
I would use np.where:
import numpy as np
from scipy.stats import zscore
klist = np.array([1, 2, 3, 4, 5, 6, 7, 8, 4000])
avglatlist = np.arange(1, klist.shape[0] + 1)
indices = np.where(np.absolute(zscore(klist)) > 2)[0]
indices_filter = [i for i in range(len(klist)) if i not in indices]
print(avglatlist[indices_filter])
If you don't actually need to know the indices, use a boolean mask instead:
import numpy as np
from scipy.stats import zscore
klist = np.array([1, 2, 3, 4, 5, 6, 7, 8, 4000])
avglatlist = np.arange(1, klist.shape[0] + 1)
mask = np.absolute(zscore(klist)) > 2
print(avglatlist[~mask])
Both solutions print:
[1 2 3 4 5 6 7 8]

You are really close. All you need to do is apply the same filtering regime to a numpy version of avglatlist. I've changed a few variable names for clarity.
import numpy as np
klist = ['1', '2', '3', '4', '5', '6', '7', '8', '4000']
avglatlist = ['1', '2', '3', '4', '5', '6', '7', '8', '9']
klist_np = np.array(klist).astype(float)
avglatlist_np = np.array(avglatlist).astype(float)
klist_filtered = klist_np[(abs(klist_np - np.mean(klist_np))) < (2 * np.std(klist_np))]
avglatlist_filtered = avglatlist_np[(abs(klist_np - np.mean(klist_np))) < (2 * np.std(klist_np))]
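To avoid evaluating the filter expression twice, the comparison can also be computed once as a boolean mask and applied to both arrays:

```python
import numpy as np

klist = np.array(['1', '2', '3', '4', '5', '6', '7', '8', '4000'], dtype=float)
avglatlist = np.array(['1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype=float)

# Build the inlier mask once, then index both arrays with it.
mask = np.abs(klist - np.mean(klist)) < 2 * np.std(klist)
klist_filtered = klist[mask]
avglatlist_filtered = avglatlist[mask]

print(klist_filtered)       # [1. 2. 3. 4. 5. 6. 7. 8.]
print(avglatlist_filtered)  # [1. 2. 3. 4. 5. 6. 7. 8.]
```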

Related

How to make a list of the lowest number among each month in .csv without python module?

I am trying to list the minimum number of tickets sold per month.
csv_file = open(csvfile, "r")
date_and_ticket = []
for data in csv_file:
    data = data.replace('\n', '').split(',')
    date_and_ticket.append(data[0:2])
print(date_and_ticket)
From here, I get this result:
[['16/3/20', '9'], ['17/3/20', '4'], ['18/4/20', '4'], ['19/1/20', '5'], ['17/6/20', '89'], ['18/6/20', '104'], ['19/6/20', '128'], ['20/1/20', '79']]
However, I would like to sort the data according to their months in chronological order and add the value zero if the month is not in the list.
This is what I hope to do:
[5,0,4,4,0,89,0,0,0,0,0,0]
and here is the small portion of the .csv
https://drive.google.com/file/d/1aqMwZcSzbY8WpeyTzXP76Sl46acO23bI/view?usp=sharing
Any advice would be greatly appreciated thank you! :)
One way using dict.setdefault to create monthly values:
l = [['16/3/20', '9'], ['17/3/20', '4'], ['18/4/20', '4'],
['19/1/20', '5'], ['17/6/20', '89'], ['18/6/20', '104'],
['19/6/20', '128'], ['20/1/20', '79']]
res = {}
for d, v in l:
    month = int(d.split("/")[1])
    res.setdefault(month, []).append(int(v))
Output:
{1: [5, 79], 3: [9, 4], 4: [4], 6: [89, 104, 128]}
Then use dict.get with a default of [0] for absent months:
[min(res.get(i, [0])) for i in range(1, 13)]
Output:
[5, 0, 4, 4, 0, 89, 0, 0, 0, 0, 0, 0]
One way is to use pandas.
Read your csv in a Pandas Dataframe:
import pandas as pd
df = pd.read_csv(csvfile)
Your df will look like:
In [1190]: df
Out[1190]:
date ticket_sold
0 16/3/20 9
1 17/3/20 4
2 18/4/20 4
3 19/1/20 5
4 17/6/20 89
5 18/6/20 104
6 19/6/20 128
7 20/1/20 79
# Convert `date` column to datetime and extract month
In [1196]: df['date'] = pd.to_datetime(df['date']).dt.month
# Groupby `month` and pick minimum tickets_sold per month
In [1203]: x = df.groupby('date')['ticket_sold'].min()
In [1208]: import numpy as np
# Fill data for missing months with 0
In [1207]: output = x.reindex(np.arange(1,13)).fillna(0).astype(int).values.tolist()
In [1209]: output
Out[1209]: [5, 0, 4, 4, 0, 89, 0, 0, 0, 0, 0, 0]

Structured 2D Numpy Array: setting column and row names

I'm trying to find a nice way to take a 2d numpy array and attach column and row names as a structured array. For example:
import numpy as np
column_names = ['a', 'b', 'c']
row_names = ['1', '2', '3']
matrix = np.reshape((1, 2, 3, 4, 5, 6, 7, 8, 9), (3, 3))
# TODO: insert magic here
matrix['3']['a'] # 7
I've been able to set the columns like this:
matrix.dtype = [(n, matrix.dtype) for n in column_names]
This lets me do matrix[2]['a'] but now I want to rename the rows so I can do matrix['3']['a'].
As far as I know it's not possible to "name" the rows with pure structured NumPy arrays.
But if you have pandas it's possible to provide an "index" (which essentially acts like a "row name"):
>>> import pandas as pd
>>> import numpy as np
>>> column_names = ['a', 'b', 'c']
>>> row_names = ['1', '2', '3']
>>> matrix = np.reshape((1, 2, 3, 4, 5, 6, 7, 8, 9), (3, 3))
>>> df = pd.DataFrame(matrix, columns=column_names, index=row_names)
>>> df
a b c
1 1 2 3
2 4 5 6
3 7 8 9
>>> df['a']['3'] # first "column" then "row"
7
>>> df.loc['3', 'a'] # another way to index "row" and "column"
7
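If pandas is not an option, a common workaround is to keep a plain dict mapping row names to row indices alongside the structured array. A minimal sketch (the `row_index` dict is my own addition, and `view` is used instead of assigning to `dtype` so the original array is untouched):

```python
import numpy as np

column_names = ['a', 'b', 'c']
row_names = ['1', '2', '3']
matrix = np.reshape((1, 2, 3, 4, 5, 6, 7, 8, 9), (3, 3))

# Reinterpret each row as one structured record (one field per column),
# then flatten the resulting (3, 1) view to shape (3,).
structured = matrix.view([(n, matrix.dtype) for n in column_names]).reshape(-1)

# Map each row name to its positional index.
row_index = {name: i for i, name in enumerate(row_names)}

print(structured[row_index['3']]['a'])  # 7
```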

Appropriate Time to Dynamically Set Variable Names?

Edit: Turns out the answer is an emphatic "no". However, I'm still struggling to populate the lists with the right amount of entries.
I've been searching StackOverflow all over for this, and I keep seeing that dynamically setting variable names is not a good solution. However, I can't think of another way to do this.
I have a DataFrame created with pandas (read in from Excel) that has columns with string headers and integer entries, plus one column (call it week) containing the numbers 1 through 52 in increasing order. What I want is a separate list, named after each column header, in which each week number appears as many times as that week's integer entry.
This is simple for a few columns, just manually create lists names, but as the number of columns grows, this could get a little out of hand.
Atrocious explanation, it was the best I could come up with. Hopefully a simplified example will clarify.
week str1 str2 str3
1 8 2 5
2 1 0 3
3 2 1 1
Desired output:
str1_count = [1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3] # eight 1's, one 2, and two 3's
str2_count = [1, 1, 3] # two 1's, one 3
str3_count = [1, 1, 1, 1, 1, 2, 2, 2, 3] # five 1's, three 2's, one 3
What I have so far:
results = {}
df = pd.DataFrame(from_csv(...., sep=","))
for key in df:
    for i in df[key]:
        results[key] = i  # this only keeps the most recent value of i
So, like this?
import collections
import csv
import io
from pprint import pprint

reader = csv.DictReader(io.StringIO('''
week,str1,str2,str3
1,8,2,5
2,1,0,3
3,2,1,1
'''.strip()))

data = collections.defaultdict(list)
for row in reader:
    for key in ('str1', 'str2', 'str3'):
        data[key].extend([row['week']] * int(row[key]))

pprint(dict(data))
# Output:
{'str1': ['1', '1', '1', '1', '1', '1', '1', '1', '2', '3', '3'],
'str2': ['1', '1', '3'],
'str3': ['1', '1', '1', '1', '1', '2', '2', '2', '3']}
Note: Pandas is good for crunching data and doing some interesting operations on it, but if you just need something simple you don't need it. This is one of those cases.
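That said, if the data already lives in a pandas DataFrame as the question describes, the same lists can be built without explicit loops. A sketch assuming the example's column names, using `Series.repeat` to repeat each week number by its count:

```python
import pandas as pd

df = pd.DataFrame({'week': [1, 2, 3],
                   'str1': [8, 1, 2],
                   'str2': [2, 0, 1],
                   'str3': [5, 3, 1]})

# For each strN column, repeat each week number by that week's count.
results = {col: df['week'].repeat(df[col]).tolist()
           for col in ('str1', 'str2', 'str3')}

print(results['str1'])  # [1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3]
print(results['str2'])  # [1, 1, 3]
```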

Sorting a scipy.stats.itemfreq result containing strings

The Problem
I'm attempting to count the frequency of a list of strings and sort it in descending order. scipy.stats.itemfreq generates the frequency results which are output as a numpy array of string elements. This is where I'm stumped. How do I sort it?
So far I have tried operator.itemgetter, which appeared to work for a small list until I realised that it sorts by the first character of the string rather than converting the string to an integer, so '5' > '11' because it compares '5' with '1', not 5 with 11.
I'm using python 2.7, numpy 1.8.1, scipy 0.14.0.
Example Code:
from scipy.stats import itemfreq
import operator as op
items = ['platypus duck','platypus duck','platypus duck','platypus duck','cat','dog','platypus duck','elephant','cat','cat','dog','bird','','','cat','dog','bird','cat','cat','cat','cat','cat','cat','cat']
items = itemfreq(items)
items = sorted(items, key=op.itemgetter(1), reverse=True)
print items
print items[0]
Output:
[array(['platypus duck', '5'],
dtype='|S13'), array(['dog', '3'],
dtype='|S13'), array(['', '2'],
dtype='|S13'), array(['bird', '2'],
dtype='|S13'), array(['cat', '11'],
dtype='|S13'), array(['elephant', '1'],
dtype='|S13')]
['platypus duck' '5']
Expected Output:
I'm after the ordering so something like:
[array(['cat', '11'],
dtype='|S13'), array(['platypus duck', '5'],
dtype='|S13'), array(['dog', '3'],
dtype='|S13'), array(['', '2'],
dtype='|S13'), array(['bird', '2'],
dtype='|S13'), array(['elephant', '1'],
dtype='|S13')]
['cat', '11']
Summary
My question is: how do I sort the array (which in this case is a string array) in descending order of counts? Please feel free to suggest alternative and faster/improved methods to my code sample above.
It is unfortunate that itemfreq returns the unique items and their counts in the same array. For your case, it means the counts are converted to strings, which is just dumb.
If you can upgrade numpy to version 1.9, then instead of using itemfreq, you can use numpy.unique with the argument return_counts=True (see below for how to accomplish this in older numpy):
In [29]: items = ['platypus duck','platypus duck','platypus duck','platypus duck','cat','dog','platypus duck','elephant','cat','cat','dog','bird','','','cat','dog','bird','cat','cat','cat','cat','cat','cat','cat']
In [30]: values, counts = np.unique(items, return_counts=True)
In [31]: values
Out[31]:
array(['', 'bird', 'cat', 'dog', 'elephant', 'platypus duck'],
dtype='|S13')
In [32]: counts
Out[32]: array([ 2, 2, 11, 3, 1, 5])
Get indices that puts counts in decreasing order:
In [38]: idx = np.argsort(counts)[::-1]
In [39]: values[idx]
Out[39]:
array(['cat', 'platypus duck', 'dog', 'bird', '', 'elephant'],
dtype='|S13')
In [40]: counts[idx]
Out[40]: array([11, 5, 3, 2, 2, 1])
For older versions of numpy, you can combine np.unique and np.bincount, as follows:
In [46]: values, inv = np.unique(items, return_inverse=True)
In [47]: counts = np.bincount(inv)
In [48]: values
Out[48]:
array(['', 'bird', 'cat', 'dog', 'elephant', 'platypus duck'],
dtype='|S13')
In [49]: counts
Out[49]: array([ 2, 2, 11, 3, 1, 5])
In [50]: idx = np.argsort(counts)[::-1]
In [51]: values[idx]
Out[51]:
array(['cat', 'platypus duck', 'dog', 'bird', '', 'elephant'],
dtype='|S13')
In [52]: counts[idx]
Out[52]: array([11, 5, 3, 2, 2, 1])
In fact, the above is exactly what itemfreq does. Here's the definition of itemfreq in the scipy source code (without the docstring):
def itemfreq(a):
    items, inv = np.unique(a, return_inverse=True)
    freq = np.bincount(inv)
    return np.array([items, freq]).T
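Since the items start out as a plain Python list anyway, the standard library's collections.Counter is another alternative worth mentioning: its most_common method returns (item, count) pairs already sorted in descending count order, with the counts kept as real integers:

```python
from collections import Counter

items = ['platypus duck', 'platypus duck', 'platypus duck', 'platypus duck',
         'cat', 'dog', 'platypus duck', 'elephant', 'cat', 'cat', 'dog',
         'bird', '', '', 'cat', 'dog', 'bird', 'cat', 'cat', 'cat', 'cat',
         'cat', 'cat', 'cat']

# most_common() yields (item, count) pairs in descending count order.
print(Counter(items).most_common(2))  # [('cat', 11), ('platypus duck', 5)]
```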
A much simpler way of achieving your task - obtaining the frequency of an item and having the items sorted by frequency - is to use the pandas function value_counts (for the original post and more suggestions see here):
import pandas as pd
import numpy as np
x = np.array(["bird","cat","dog","dog","cat","cat"])
pd.value_counts(x)
cat 3
dog 2
bird 1
dtype: int64
Getting only the number of occurrences, sorted:
y = pd.value_counts(x).values
array([3, 2, 1])
Getting only the unique names of the items you want to count, sorted:
z = pd.value_counts(x).index
Index(['cat', 'dog', 'bird'], dtype='object')

Array update anomaly in Python [duplicate]

This question already has answers here:
List of lists changes reflected across sublists unexpectedly
(17 answers)
Closed 9 years ago.
I wrote the following code in Python. Within checkd,
when I update d[ii][jj], it seems as if the interpreter takes its own liberties and sets the same column entry to 1 in all the following rows.
Code:
def checkd(d, level, a, b):
    i = len(b)
    j = len(a)
    print ['0'] + list(a)
    for ii in range(i):
        for jj in range(j):
            if a[jj] == b[ii]:
                #print a[jj] + " " + b[ii] + " Matched."
                d[ii][jj] = 1
            print b[ii] + "\t" + a[jj] + "\t" + str(d[ii][jj])
        print [b[ii]] + [str(m) for m in d[ii]]
    return d

a = raw_input("First word:")
b = raw_input("Second word:")
w = input("Size of words to check:")
d = [[0] * len(a)] * len(b)
d = checkd(d, w, a, b)
print d
for x in d: print x
Output:
First word:ascend
Second word:nd
Size of words to check:2
['0', 'a', 's', 'c', 'e', 'n', 'd']
n a 0
n s 0
n c 0
n e 0
n n 1
n d 0
['n', '0', '0', '0', '0', '1', '0']
d a 0
d s 0
d c 0
d e 0
d n 1
d d 1
['d', '0', '0', '0', '0', '1', '1']
[[0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 1, 1]]
[0, 0, 0, 0, 1, 1]
[0, 0, 0, 0, 1, 1]
As you'll notice, not only does this lead to a seemingly random match (d, n, 1?!) in the "d" row,
but the returned 2D array is just the last row repeated.
I have some experience with Python. I am not looking for a workaround (not that I'd mind) as much as an explanation for this behaviour, if possible?
Thanks!
The culprit is this line: it makes a list of len(b) references to one and the same inner list of len(a) zeros, so only a single row object exists, shared by every row.
d = [[0] * len(a)] * len(b)
What you want to do is:
d = [[0] * len(a) for _ in b]
ints are immutable, so they are safe to duplicate like that.
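You can verify the aliasing directly with the `is` operator, which tests object identity:

```python
# Multiplication copies the reference, not the list.
shared = [[0] * 3] * 2
assert shared[0] is shared[1]   # both rows are the same object
shared[0][1] = 1
print(shared)                   # [[0, 1, 0], [0, 1, 0]]

# A comprehension builds a new inner list on every iteration.
independent = [[0] * 3 for _ in range(2)]
assert independent[0] is not independent[1]
independent[0][1] = 1
print(independent)              # [[0, 1, 0], [0, 0, 0]]
```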
