Edit: Turns out the answer is an emphatic "no". However, I'm still struggling to populate the lists with the right number of entries.
I've been searching StackOverflow all over for this, and I keep seeing that dynamically setting variable names is not a good solution. However, I can't think of another way to do this.
I have a DataFrame created from pandas (read in from Excel) that has columns with string headers and integer entries, and one column (call it Week) with the numbers 1 through 52 increasing sequentially. What I want to do is create a separate list named for each column header, where each entry is the week number repeated as many times as the integer in that column.
This is simple for a few columns: just manually create the list names. But as the number of columns grows, this could get a little out of hand.
Atrocious explanation, it was the best I could come up with. Hopefully a simplified example will clarify.
week  str1  str2  str3
   1     8     2     5
   2     1     0     3
   3     2     1     1
Desired output:
str1_count = [1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3] # eight 1's, one 2, and two 3's
str2_count = [1, 1, 3] # two 1's, one 3
str3_count = [1, 1, 1, 1, 1, 2, 2, 2, 3] # five 1's, three 2's, one 3
What I have so far:
results = {}
df = pd.read_csv(...., sep=",")
for key in df:
    for i in df[key]:
        results[key] = i  # this only stores the int value of the most recent i, not a list
So, like this?
import collections
import csv
import io
reader = csv.DictReader(io.StringIO('''
week,str1,str2,str3
1,8,2,5
2,1,0,3
3,2,1,1
'''.strip()))
data = collections.defaultdict(list)
for row in reader:
    for key in ('str1', 'str2', 'str3'):
        data[key].extend([row['week']] * int(row[key]))
from pprint import pprint
pprint(dict(data))
# Output:
{'str1': ['1', '1', '1', '1', '1', '1', '1', '1', '2', '3', '3'],
'str2': ['1', '1', '3'],
'str3': ['1', '1', '1', '1', '1', '2', '2', '2', '3']}
Note: Pandas is good for crunching data and doing some interesting operations on it, but if you just need something simple you don't need it. This is one of those cases.
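That said, since the question starts from a DataFrame, the same expansion can also be written in pandas. A sketch using Series.repeat, with column names mirroring the example above:

```python
import pandas as pd

df = pd.DataFrame({"week": [1, 2, 3],
                   "str1": [8, 1, 2],
                   "str2": [2, 0, 1],
                   "str3": [5, 3, 1]})

# Repeat each week number by the count found in each column;
# a count of 0 simply drops that week.
counts = {col: df["week"].repeat(df[col]).tolist()
          for col in df.columns if col != "week"}

print(counts["str1"])  # [1, 1, 1, 1, 1, 1, 1, 1, 2, 3, 3]
print(counts["str2"])  # [1, 1, 3]
```

The result is a dict keyed by column header, which avoids dynamically created variable names entirely.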
I am trying to list the minimum number of tickets sold, sorted by month.
csv_file = open(csvfile, "r")
date_and_ticket = []
for data in csv_file:
    data = data.replace('\n', '').split(',')
    date_and_ticket.append(data[0:2])
print(date_and_ticket)
From here, I am able to get the result like this:
[['16/3/20', '9'], ['17/3/20', '4'], ['18/4/20', '4'], ['19/1/20', '5'], ['17/6/20', '89'], ['18/6/20', '104'], ['19/6/20', '128'], ['20/1/20', '79']]
However, I would like to sort the data according to their months in chronological order and add the value zero if the month is not in the list.
This is what I hope to do:
[5,0,4,4,0,89,0,0,0,0,0,0]
and here is the small portion of the .csv
https://drive.google.com/file/d/1aqMwZcSzbY8WpeyTzXP76Sl46acO23bI/view?usp=sharing
Any advice would be greatly appreciated thank you! :)
One way using dict.setdefault to create monthly values:
l = [['16/3/20', '9'], ['17/3/20', '4'], ['18/4/20', '4'],
     ['19/1/20', '5'], ['17/6/20', '89'], ['18/6/20', '104'],
     ['19/6/20', '128'], ['20/1/20', '79']]

res = {}
for d, v in l:
    month = int(d.split("/")[1])
    res.setdefault(month, []).append(int(v))
Output:
{1: [5, 79], 3: [9, 4], 4: [4], 6: [89, 104, 128]}
Then dict.get to make 0 for absent months:
[min(res.get(i, [0])) for i in range(1, 13)]
Output:
[5, 0, 4, 4, 0, 89, 0, 0, 0, 0, 0, 0]
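The same pattern can also be written with collections.defaultdict, which makes the setdefault call implicit; a sketch combining both steps:

```python
from collections import defaultdict

l = [['16/3/20', '9'], ['17/3/20', '4'], ['18/4/20', '4'],
     ['19/1/20', '5'], ['17/6/20', '89'], ['18/6/20', '104'],
     ['19/6/20', '128'], ['20/1/20', '79']]

# defaultdict(list) creates the empty list on first access
res = defaultdict(list)
for d, v in l:
    month = int(d.split("/")[1])
    res[month].append(int(v))

# Guard with `in` so the defaultdict doesn't create entries for absent months
out = [min(res[i]) if i in res else 0 for i in range(1, 13)]
print(out)  # [5, 0, 4, 4, 0, 89, 0, 0, 0, 0, 0, 0]
```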
One way is to use pandas.
Read your csv in a Pandas Dataframe:
import pandas as pd
df = pd.read_csv(csvfile)
Your df will look like:
In [1190]: df
Out[1190]:
      date  ticket_sold
0  16/3/20            9
1  17/3/20            4
2  18/4/20            4
3  19/1/20            5
4  17/6/20           89
5  18/6/20          104
6  19/6/20          128
7  20/1/20           79
# Convert `date` column to datetime and extract month
In [1196]: df['date'] = pd.to_datetime(df['date']).dt.month
# Groupby `month` and pick minimum tickets_sold per month
In [1203]: x = df.groupby('date')['ticket_sold'].min()
In [1208]: import numpy as np
# Fill data for missing months with 0
In [1207]: output = x.reindex(np.arange(1,13)).fillna(0).astype(int).values.tolist()
In [1209]: output
Out[1209]: [5, 0, 4, 4, 0, 89, 0, 0, 0, 0, 0, 0]
I am trying to remove outliers from a list in python. I want to get the index values of each outlier from an original list so I can remove it from (another) corresponding list.
Simple example:
my list with outliers:
y = [1, 2, 3, 4, 500]  # 500 is the outlier; it has an index of 4
my corresponding list:
x = [1, 2, 3, 4, 5]  # I want to remove 5, which has the same index of 4
MY RESULT/GOAL:
y=[1,2,3,4]
x=[1,2,3,4]
This is my code, and I want to achieve the same with klist and avglatlist
import numpy as np

klist = ['1','2','3','4','5','6','7','8','4000']
avglatlist = ['1','2','3','4','5','6','7','8','9']

klist = np.array(klist).astype(np.float)
klist = klist[(abs(klist - np.mean(klist))) < (2 * np.std(klist))]

indices = []
for k in klist:
    if (k - np.mean(klist)) > ((2 * np.std(klist))):
        i = klist.index(k)
        indices.append(i)
print('indices' + str(indices))

avglatlist = np.array(avglatlist).astype(np.float)
for index in sorted(indices, reverse=True):
    del avglatlist[index]

print(len(klist))
print(len(avglatlist))
How to get the index values of each outlier in a list?
Say an outlier is defined as 2 standard deviations from a mean. This means you'd want to know the indices of values in a list where zscores have absolute values greater than 2.
I would use np.where:
import numpy as np
from scipy.stats import zscore
klist = np.array([1, 2, 3, 4, 5, 6, 7, 8, 4000])
avglatlist = np.arange(1, klist.shape[0] + 1)
indices = np.where(np.absolute(zscore(klist)) > 2)[0]
indices_filter = [i for i,n in enumerate(klist) if i not in indices]
print(avglatlist[indices_filter])
If you don't actually need to know the indices, use a boolean mask instead:
import numpy as np
from scipy.stats import zscore
klist = np.array([1, 2, 3, 4, 5, 6, 7, 8, 4000])
avglatlist = np.arange(1, klist.shape[0] + 1)
mask = np.absolute(zscore(klist)) > 2
print(avglatlist[~mask])
Both solutions print:
[1 2 3 4 5 6 7 8]
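If scipy isn't available: zscore with its default ddof=0 is just (x - mean) / std, so the same indices can be computed with plain numpy; a minimal sketch:

```python
import numpy as np

klist = np.array([1, 2, 3, 4, 5, 6, 7, 8, 4000])

# Hand-rolled z-scores: deviation from the mean in units of the std
z = (klist - klist.mean()) / klist.std()
indices = np.where(np.abs(z) > 2)[0]
print(indices)  # [8]
```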
You are really close. All you need to do is apply the same filtering regime to a numpy version of avglatlist. I've changed a few variable names for clarity.
import numpy as np
klist = ['1', '2', '3', '4', '5', '6', '7', '8', '4000']
avglatlist = ['1', '2', '3', '4', '5', '6', '7', '8', '9']
klist_np = np.array(klist).astype(float)
avglatlist_np = np.array(avglatlist).astype(float)
klist_filtered = klist_np[(abs(klist_np - np.mean(klist_np))) < (2 * np.std(klist_np))]
avglatlist_filtered = avglatlist_np[(abs(klist_np - np.mean(klist_np))) < (2 * np.std(klist_np))]
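Since the same filter expression appears twice, a variation is to compute the boolean mask once and index both arrays with it; a sketch of that idea:

```python
import numpy as np

klist = np.array(['1', '2', '3', '4', '5', '6', '7', '8', '4000'], dtype=float)
avglatlist = np.array(['1', '2', '3', '4', '5', '6', '7', '8', '9'], dtype=float)

# One boolean mask, applied to both arrays so they stay aligned
mask = np.abs(klist - np.mean(klist)) < 2 * np.std(klist)
klist_filtered = klist[mask]
avglatlist_filtered = avglatlist[mask]
print(avglatlist_filtered)  # [1. 2. 3. 4. 5. 6. 7. 8.]
```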
I need to extract all information from example.csv. The file has three parts of information and is formatted as below:
Date,2017/07/15,Time,20:00,
ColA, ColB, ColC,
1, 2, 3,
4, 5, 6,
ColD, ColE
7, 8,
I use df=pd.read_csv('example.csv', header=None) to read all the information from the csv, but I'm only getting an error message. My goal is to have a table like:
Date Time ColA_1 ColB_1 ColC_1 ColA_2 ColB_2 ColC_2 ColD ColE
2017/07/15 20:00 1 2 3 4 5 6 7 8
Please help. Thanks.
Your formatting wishes are very specific so I don't really see anything simpler than the following:
# In practice, load this string from the csv file with open()
s = "Date,2017/07/15,Time,20:00\nColA, ColB, ColC\n1, 2, 3\n4, 5, 6\nColD, ColE\n7, 8"
s = s.replace(" ", "")
s_arr = s.split('\n')
s_arr = [x.split(',') for x in s_arr]
columns = [s_arr[0][0], s_arr[0][2]] + s_arr[1][0:3] + s_arr[4][0:2]
row = [s_arr[0][1], s_arr[0][3],
       [s_arr[2][0], s_arr[3][0]],
       [s_arr[2][1], s_arr[3][1]],
       [s_arr[2][2], s_arr[3][2]]] + s_arr[5][0:2]
This gives:
columns = ['Date', 'Time', 'ColA', 'ColB', 'ColC', 'ColD', 'ColE']
row = ['2017/07/15', '20:00', ['1', '4'], ['2', '5'], ['3', '6'], '7', '8']
The lists can be used to initialize your pandas table. Depending on how the rows are organized in the csv you may need to split at another level (e.g. if there are two blank lines between rows, you can use split('\n\n')).
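As a sketch of that last step (the flattening loop and the ColX_1/ColX_2 naming are my own assumptions, and the columns come out grouped per name rather than interleaved exactly as in the question):

```python
import pandas as pd

columns = ['Date', 'Time', 'ColA', 'ColB', 'ColC', 'ColD', 'ColE']
row = ['2017/07/15', '20:00', ['1', '4'], ['2', '5'], ['3', '6'], '7', '8']

# Flatten the per-column lists into suffixed column names (ColA_1, ColA_2, ...)
flat_cols, flat_row = [], []
for name, value in zip(columns, row):
    if isinstance(value, list):
        for i, v in enumerate(value, 1):
            flat_cols.append("%s_%d" % (name, i))
            flat_row.append(v)
    else:
        flat_cols.append(name)
        flat_row.append(value)

df = pd.DataFrame([flat_row], columns=flat_cols)
print(df)
```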
I am trying to use raw_input in the python code to get user input of lists as below.
input_array.append(list(raw_input()));
User input as:
1 2 3 5 100
But the code is interpreting input as
[['1', ' ', '2', ' ', '3', ' ', '5', ' ', '1', '0', '0']]
What I tried: if I use plain input() instead of raw_input(), I get this error in the console:
"SyntaxError: ('invalid syntax', ('<string>', 1, 3, '1 2 3 4 100'))"
Note: I am not allowed to give the input in list format like
[1,2,3,5,100]
Could somebody please tell me how to proceed further.
>>> [int(x) for x in raw_input().split()]
1 2 3 5 100
[1, 2, 3, 5, 100]
>>> raw_input().split()
1 2 3 5 100
['1', '2', '3', '5', '100']
Creates a new list split by whitespace and then
[int(x) for x in raw_input().split()]
Converts each string in this new list into an integer.
list()
is a function that constructs a list from an iterable such as
>>> list({1, 2, 3}) # constructs list from a set {1, 2, 3}
[1, 2, 3]
>>> list('123') # constructs list from a string
['1', '2', '3']
>>> list((1, 2, 3)) # constructs list from a tuple
[1, 2, 3]
so
>>> list('1 2 3 5 100')
['1', ' ', '2', ' ', '3', ' ', '5', ' ', '1', '0', '0']
also works: the list function iterates through the string and appends each character to a new list. However, you need to separate on spaces, so the list function is not suitable here.
input takes a string and evaluates it as a Python object
'1 2 3 5 100'
is not a valid python object, it is 5 numbers separated by spaces.
To make this clear, consider typing
>>> 1 2 3 5 100
SyntaxError: invalid syntax
into a Python Shell. It is just invalid syntax. So input raises this error as well.
On an important side note:
input is not a safe function to use, so even if your string were '[1,2,3,5,100]' as you mentioned, you should not use input, because harmful Python code can be executed through it.
If this case ever arises, use ast.literal_eval:
>>> import ast
>>> ast.literal_eval('[1,2,3,5,100]')
[1, 2, 3, 5, 100]
I'm new to Python. So I want to get this done with loops without using some fancy stuff like generators. I have two 2D arrays, one integer array and the other string array like this:
Integer 2D list:
Here, dataset2d[0][0] is the number of rows in the table, and dataset2d[0][1] is the number of columns. So the 2D list below has 6 rows and 4 columns.
dataset2d = [
[6, 4],
[0, 0, 0, 1],
[1, 0, 2, 0],
[2, 2, 0, 1],
[1, 1, 1, 0],
[0, 0, 1, 1],
[1, 0, 2, 1]
]
String 2D list:
partition2d = [
['A', '1', '2', '4'],
['B', '3', '5'],
['C', '6']
]
partition2d[*][0], i.e. the first column, is a label. For group A, 1, 2 and 4 are the row numbers that I need to pick up from dataset2d and apply a formula to. So I will read 1, go to row 1 in dataset2d and read the first column value, i.e. dataset2d[1][0]; then I will read 2 from partition2d, go to row 2 of dataset2d and read the first column, i.e. dataset2d[2][0]. Similarly, next I'll read dataset2d[4][0].
Then I will do some calculations, get a value and store it in a 2D list, then go to the next column in dataset2d for those rows. So in this example, next column values read would be dataset2d[1][1], dataset2d[2][1], dataset2d[4][1]. And again do some calculation and get one value for that column, store it. I'll do this until I reach the last column of dataset2d.
The next row in partition2d is [B, 3, 5]. So I'll start with dataset2d[3][0], dataset2d[5][0] and get a value for that column by a formula. Then read dataset2d[3][1], dataset2d[5][1], etc. until I reach the last column. I do this until all rows in partition2d are read.
What I tried:
for partitionRow in partition2d:
    for partitionCol in partitionRow:
        for colDataset in dataset2d:
            print dataset2d[partitionCol][colDataset]
What problem I'm facing:
partition2d is a string array where I need to skip the first column which has characters like A,B,C.
I want to iterate in dataset2d column wise only over the row numbers given in partition2d. So the colDataset should increment only after I'm done with that column.
Update1:
I'm reading the contents from a text file, and the data in 2D lists can vary, depending on file content and size, but the structure of file1 i.e dataset2d and file2 i.e partition2d will be the same.
Update2: Since Eric asked about how the output should look like.
0.842322 0.94322 0.34232 0.900009 (For A)
0.642322 0.44322 0.24232 0.800009 (For B)
This is just an example and the numbers are randomly typed by me.
So the first number, 0.842322, is the result of applying the formula to column 0 of dataset2d, i.e. dataset2d[partitionCol][0], for group A having considered rows 1, 2, 4.
The second number, 0.94322, is the result of applying the formula to column 1 of dataset2d, i.e. dataset2d[partitionCol][1], for group A having considered rows 1, 2, 4.
The third number, 0.34232, is the result of applying the formula to column 2 of dataset2d, i.e. dataset2d[partitionCol][2], for group A having considered rows 1, 2, 4. Similarly we get 0.900009.
The first number in the second row, 0.642322, is the result of applying the formula to column 0 of dataset2d, i.e. dataset2d[partitionCol][0], for group B having considered rows 3, 5. And so on.
You can use Numpy (I hope this is not fancy for you):
import numpy

dataset2D = [[6, 4], [0, 0, 0, 1], [1, 0, 2, 0], [2, 2, 0, 1],
             [1, 1, 1, 0], [0, 0, 1, 1], [1, 0, 2, 1]]
dataset2D_size = dataset2D[0]
dataset2D = numpy.array(dataset2D)

partition2D = [['A', '1', '2', '4'], ['B', '3', '5'], ['C', '6']]

for partition in partition2D:
    label = partition[0]
    row_indices = [int(i) for i in partition[1:]]
    # Take the specified rows
    rows = dataset2D[row_indices]
    # Iterate the columns (this is the power of Python!)
    for column in zip(*rows):
        # Now, column will contain one column of data from specified row indices
        print column,  # Apply your formula here
    print
or if you don't want to install Numpy, here is what you can do (this is what you want, actually):
dataset2D = [[6, 4], [0, 0, 0, 1], [1, 0, 2, 0], [2, 2, 0, 1],
             [1, 1, 1, 0], [0, 0, 1, 1], [1, 0, 2, 1]]
partition2D = [['A', '1', '2', '4'], ['B', '3', '5'], ['C', '6']]
dataset2D_size = dataset2D[0]

for partition in partition2D:
    label = partition[0]
    row_indices = [int(i) for i in partition[1:]]
    rows = [dataset2D[row_idx] for row_idx in row_indices]
    for column in zip(*rows):
        print column,
    print
both will print:
(0, 1, 1) (0, 0, 1) (0, 2, 1) (1, 0, 0)
(2, 0) (2, 0) (0, 1) (1, 1)
(1,) (0,) (2,) (1,)
Explanation of second code (without Numpy):
[dataset2D[row_idx] for row_idx in row_indices]
This basically takes each row (dataset2D[row_idx]) and collates them together as a list, so the result of this expression is a list of lists (built from the specified row indices).
for column in zip(*rows):
Then zip(*rows) will iterate column-wise (the one you want). This works by taking the first element of each row, then combine them together to form a tuple. In each iteration, the result is stored in variable column.
Then inside the for column in zip(*rows): you already have your intended column-wise iterated elements from specified rows!
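As a standalone illustration of that transposition, using partition A's rows 1, 2 and 4 from dataset2D:

```python
rows = [[0, 0, 0, 1],   # dataset2D[1]
        [1, 0, 2, 0],   # dataset2D[2]
        [1, 1, 1, 0]]   # dataset2D[4]

# zip(*rows) transposes: each tuple holds one column across all rows
columns = list(zip(*rows))
print(columns)  # [(0, 1, 1), (0, 0, 1), (0, 2, 1), (1, 0, 0)]
```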
To apply your formula, just change the print column, into the stuff you wanna do. For example I modify the code to include row and column number:
print 'Processing partition %s' % label
for (col_num, column) in enumerate(zip(*rows)):
    print 'Column number: %d' % col_num
    for (row_num, element) in enumerate(column):
        print '[%d,%d]: %d' % (row_indices[row_num], col_num, element)
which will result in:
Processing partition A
Column number: 0
[1,0]: 0
[2,0]: 1
[4,0]: 1
Column number: 1
[1,1]: 0
[2,1]: 0
[4,1]: 1
Column number: 2
[1,2]: 0
[2,2]: 2
[4,2]: 1
Column number: 3
[1,3]: 1
[2,3]: 0
[4,3]: 0
Processing partition B
Column number: 0
[3,0]: 2
[5,0]: 0
Column number: 1
[3,1]: 2
[5,1]: 0
Column number: 2
[3,2]: 0
[5,2]: 1
Column number: 3
[3,3]: 1
[5,3]: 1
Processing partition C
Column number: 0
[6,0]: 1
Column number: 1
[6,1]: 0
Column number: 2
[6,2]: 2
Column number: 3
[6,3]: 1
I hope this helps.
Here's an extensible solution using an iterator:
def partitions(data, p):
    for partition in p:
        label = partition[0]
        row_indices = [int(i) for i in partition[1:]]
        rows = [data[row_idx] for row_idx in row_indices]
        columns = zip(*rows)
        yield label, columns

for label, columns in partitions(dataset2D, partition2D):
    print "Processing", label
    for column in columns:
        print column
to address your problems:
What problem I'm facing:
partition2d is a string array where I need to skip the first column which has characters like A,B,C.
I want to iterate in dataset2d column wise only over the row numbers given in partition2d. So the colDataset should increment only after I'm done with that column.
Problem 1 can be solved using slicing: if you want to iterate over partition2d from the second element only, you can do something like for partitionCol in partitionRow[1:]. This slices the row from the second element to the end.
So something like:
for partitionRow in partition2d:
    for partitionCol in partitionRow[1:]:
        for colDataset in dataset2d:
            print dataset2d[partitionCol][colDataset]
Problem 2 I didn't understand what you want :)
partition2d is a string array where I need to skip the first column
which has characters like A,B,C.
This is called slicing:
for partitionCol in partitionRow[1:]:
the above snippet will skip the first column.
for colDataset in dataset2d:
Already does what you want. There is no structure here like in C++ loops. Although you could do stuff in a very Unpythonic way:
i = 0
for i in range(len(dataset2d)):
    print dataset2d[partitionCol][i]
    i += 1
This is a very bad way of doing stuff. For arrays and matrices, I suggest you don't re-invent the wheel (reusing existing libraries is the Pythonic way); look at Numpy, and especially at:
numpy.loadtxt
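For example, a minimal loadtxt sketch, using io.StringIO as a stand-in for a real data file:

```python
import io
import numpy as np

# A stand-in for a whitespace-separated text file on disk
f = io.StringIO("0 0 0 1\n1 0 2 0\n2 2 0 1")
matrix = np.loadtxt(f, dtype=int)

print(matrix.shape)   # (3, 4)
print(matrix[:, 0])   # first column: [0 1 2]
```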
Setup:
d = [[6,4],[0,0,0,1],[1,0,2,0],[2,2,0,1],[1,1,1,0],[0,0,1,1],[1,0,2,1]]
s = [['A',1,2,4],['B',3,5],['C',6]]
The results are put into a list l
l = []
for r in s:  # go over each [character, index0, index1, ...]
    new_r = [r[0]]  # create a new list for the values given by each indexN; add the character by default
    for i, c in enumerate(r[1:]):  # go over each indexN, using enumerate to keep track of what N is
        new_r.append(d[c][i])  # c is the row index into d; i (the N in indexN) is the column
    l.append(new_r)  # add that new list to l
Resulting in
>>> l
[['A', 0, 0, 1], ['B', 2, 0], ['C', 1]]
The execution of the first iteration would look like:
for r in s:
    # -> r = ['A', 1, 2, 4]
    new_r = [r[0]]  # = ['A']
    for i, c in enumerate(r[1:]):  # r[1:] = [1, 2, 4]
        # -> i = 0, c = 1
        new_r.append(d[c][i])  # = d[1][0]
        # -> i = 1, c = 2
        # ...