I have binary values in a CSV file and a list of real-number values, and I would like to multiply the two element-wise. How can I discard the products that come from a 0 in the CSV file? Can anyone help me with the algorithm part?
Binary.csv
Three lines of binary values:
0 1 0 0 1 0 1 0 0
1 0 0 0 0 1 0 1 0
0 0 1 0 1 0 1 0 0
Real.csv
One line of real-number values:
0.1 0.2 0.4 0.1 0.5 0.5 0.3 0.6 0.3
Intermediate product (before discarding zeros)
0.0 0.2 0.0 0.0 0.5 0.0 0.3 0.0 0.0
0.1 0.0 0.0 0.0 0.0 0.5 0.0 0.6 0.0
0.0 0.0 0.4 0.0 0.5 0.0 0.3 0.0 0.0
Desired output
0.2 0.5 0.3
0.1 0.5 0.6
0.4 0.5 0.3
Code
import numpy as np
import itertools

a = np.array([[0,1,0,0,1,0,1,0,0],
              [1,0,0,0,0,1,0,1,0],
              [0,0,1,0,1,0,1,0,0]])
b = np.array([0.1,0.2,0.4,0.1,0.5,0.5,0.3,0.6,0.3])
c = a * b
# compress needs one selector per element, so apply it row by row
d = [list(itertools.compress(row, mask)) for row, mask in zip(c, a)]
print d
The above code is just another alternative that I tried at the same time. Sorry for the inconvenience, and thank you all very much for your help.
Several ways to do this, mine is simplistic:
import csv

with open('real.csv', 'rb') as csvfile:
    for row in csv.reader(csvfile, delimiter=' '):
        reals = row

with open('binary.csv', 'rb') as csvfile:
    pwreader = csv.reader(csvfile, delimiter=' ')
    for row in pwreader:
        result = []
        for i, b in enumerate(row):
            if b == '1':
                result.append(reals[i])
        print " ".join(result)
You will notice that there is no multiplication here. When you read from a CSV file the values are strings. You could convert each field to a numeric, construct a bit-mask, then work it out from there, but is it worth it? I have just used a simple string comparison. The output is a string anyway.
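If you do go the numeric route, here is a minimal NumPy sketch (arrays copied from the question) that treats the binary rows as a bit-mask and keeps only the selected reals:

```python
import numpy as np

# binary rows from the question; dtype=bool turns the 0/1 values into a mask
binary = np.array([[0, 1, 0, 0, 1, 0, 1, 0, 0],
                   [1, 0, 0, 0, 0, 1, 0, 1, 0],
                   [0, 0, 1, 0, 1, 0, 1, 0, 0]], dtype=bool)
reals = np.array([0.1, 0.2, 0.4, 0.1, 0.5, 0.5, 0.3, 0.6, 0.3])

# boolean-mask each row; rows can end up with different lengths,
# so the result is a list of 1-D arrays rather than one 2-D array
rows = [reals[mask] for mask in binary]
```

Whether it is worth it over the string comparison depends on the data size; for three short rows it makes no difference.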
Edit: now I find you have numpy arrays in your code, ignoring the csv files. Please stop changing the goalposts!
Related
If I have a dataframe
0 1 2
0 0.1 0.2 0.3
1 0.2 0.3 0.4
2 0.3 0.4 0.5
3 0.4 0.5 0.6
and I want to append it to my existing Excel file, I then write this code:
from openpyxl import load_workbook

wb = load_workbook('test1.xlsx')
ws = wb['Sheet1']
for index, row in newdf.iterrows():
    cell = 'D%d' % (index + 2)
    ws[cell] = row[0]
wb.save('test1.xlsx')
The problem with that code is that I only get the first column, since it iterates over the index. For example, this is the Excel sheet:
A B C D
1
2 0.1
3 0.2
4 0.3
5 0.4
How do I make the iteration cover all the columns, starting at cell D2? I've tried iteritems as well, but it did not work.
The expected result should be:
A B C D E F
1
2 0.1 0.2 0.3
3 0.2 0.3 0.4
4 0.3 0.4 0.5
5 0.4 0.5 0.6
Here is a proposition based on your code :
from openpyxl import load_workbook
from openpyxl.utils import get_column_letter
wb = load_workbook("test1.xlsx")
ws = wb["Sheet1"]
for i, row in df.iterrows():
    for j, value in enumerate(row):
        r, c = i + 2, get_column_letter(j + 4)
        cell_coordinates = f"{c}{r}"
        ws[cell_coordinates] = value
wb.save("test1.xlsx")
There is also this answer from @Charlie Clark that you can adapt to your needs.
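One way to adapt that, as a sketch: openpyxl's dataframe_to_rows helper walks the frame row by row, and ws.cell takes numeric row/column indices, so no column-letter arithmetic is needed. A fresh Workbook stands in for test1.xlsx here:

```python
import pandas as pd
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows

# the sample frame from the question
df = pd.DataFrame({0: [0.1, 0.2, 0.3, 0.4],
                   1: [0.2, 0.3, 0.4, 0.5],
                   2: [0.3, 0.4, 0.5, 0.6]})

wb = Workbook()   # stand-in for load_workbook('test1.xlsx')
ws = wb.active

# write every row of the frame starting at D2 (row 2, column 4)
for r_idx, row in enumerate(dataframe_to_rows(df, index=False, header=False), start=2):
    for c_idx, value in enumerate(row, start=4):
        ws.cell(row=r_idx, column=c_idx, value=value)
```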
I have two time-based datasets. One is the accelerometer's measurement data; the other is label data.
For example,
accelerometer.csv
timestamp,X,Y,Z
1.0,0.5,0.2,0.0
1.1,0.2,0.3,0.0
1.2,-0.1,0.5,0.0
...
2.0,0.9,0.8,0.5
2.1,0.4,0.1,0.0
2.2,0.3,0.2,0.3
...
label.csv
start,end,label
1.0,2.0,"running"
2.0,3.0,"exercising"
These data may be unrealistic; they are just examples.
In this case, I want to merge these data to below:
merged.csv
timestamp,X,Y,Z,label
1.0,0.5,0.2,0.0,"running"
1.1,0.2,0.3,0.0,"running"
1.2,-0.1,0.5,0.0,"running"
...
2.0,0.9,0.8,0.5,"exercising"
2.1,0.4,0.1,0.0,"exercising"
2.2,0.3,0.2,0.3,"exercising"
...
I'm using pandas' iterrows. However, the real data has more than 10,000 rows, so the program runs for a long time. I think there must be a way to do this without iteration.
My code looks like this:
import pandas as pd
acc = pd.read_csv("./accelerometer.csv")
labeled = pd.read_csv("./label.csv")
for index, row in labeled.iterrows():
    start = row["start"]
    end = row["end"]
    acc.loc[(start <= acc["timestamp"]) & (acc["timestamp"] < end), "label"] = row["label"]
How can I modify my code to get rid of "for" iteration?
If the times in accelerometer don't go outside the boundaries of the times in label, you could use merge_asof:
accmerged = pd.merge_asof(acc, labeled, left_on='timestamp', right_on='start', direction='backward')
Output (for the sample data in your question):
timestamp X Y Z start end label
0 1.0 0.5 0.2 0.0 1.0 2.0 running
1 1.1 0.2 0.3 0.0 1.0 2.0 running
2 1.2 -0.1 0.5 0.0 1.0 2.0 running
3 2.0 0.9 0.8 0.5 2.0 3.0 exercising
4 2.1 0.4 0.1 0.0 2.0 3.0 exercising
5 2.2 0.3 0.2 0.3 2.0 3.0 exercising
Note you can remove the start and end columns with drop if you want to:
accmerged = accmerged.drop(['start', 'end'], axis=1)
Output:
timestamp X Y Z label
0 1.0 0.5 0.2 0.0 running
1 1.1 0.2 0.3 0.0 running
2 1.2 -0.1 0.5 0.0 running
3 2.0 0.9 0.8 0.5 exercising
4 2.1 0.4 0.1 0.0 exercising
5 2.2 0.3 0.2 0.3 exercising
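For reference, here is the whole pipeline as one runnable sketch, with the sample frames built in memory instead of read from CSV:

```python
import pandas as pd

acc = pd.DataFrame({'timestamp': [1.0, 1.1, 1.2, 2.0, 2.1, 2.2],
                    'X': [0.5, 0.2, -0.1, 0.9, 0.4, 0.3],
                    'Y': [0.2, 0.3, 0.5, 0.8, 0.1, 0.2],
                    'Z': [0.0, 0.0, 0.0, 0.5, 0.0, 0.3]})
labeled = pd.DataFrame({'start': [1.0, 2.0],
                        'end': [2.0, 3.0],
                        'label': ['running', 'exercising']})

# merge_asof requires both keys to be sorted; 'backward' picks the
# last label row whose start is <= each timestamp
accmerged = pd.merge_asof(acc, labeled, left_on='timestamp',
                          right_on='start', direction='backward')
accmerged = accmerged.drop(['start', 'end'], axis=1)
```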
I have ~1.2k files that when converted into dataframes look like this:
df1
A B C D
0 0.1 0.5 0.2 C
1 0.0 0.0 0.8 C
2 0.5 0.1 0.1 H
3 0.4 0.5 0.1 H
4 0.0 0.0 0.8 C
5 0.1 0.5 0.2 C
6 0.1 0.5 0.2 C
Now, I have to subset each dataframe with a fixed-size window along the rows, and add its contents to a second dataframe whose values are all initialized to 0.
df_sum
A B C
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
For example, let's set the window size to 3. The first subset therefore will be
window = df.loc[start:end, 'A':'C']
window
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
window.index = correct_index
df_sum = df_sum.add(window, fill_value=0)
df_sum
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
After that, the window will be the subset of df1 from rows 1-4, then rows 2-5, and finally rows 3-6. Once the first file has been scanned, the second file begins, and so on until all files have been processed. As you can see, this approach relies on df.loc for the subsetting and df.add for the addition. However, despite being easy to code, it is very inefficient: on my machine it takes about 5 minutes to process the whole batch of 1.2k files of 200 lines each. I know that an implementation based on numpy arrays is orders of magnitude faster (about 10 seconds), but a bit more complicated in terms of subsetting and adding. Is there any way to improve the performance of this method while still using dataframes, for example by substituting loc with a faster slicing method?
Example:
def generate_index_list(window_size):
    before_offset = -(window_size - 1) // 2
    after_offset = (window_size - 1) // 2
    index_list = list()
    for n in range(before_offset, after_offset + 1):
        index_list.append(str(n))
    return index_list

window_size = 3
for file in os.listdir('.'):
    df1 = pd.read_csv(file, sep='\t')
    starting_index = (window_size - 1) // 2
    before_offset = (window_size - 1) // 2
    after_offset = (window_size - 1) // 2
    for index, row in df1.iterrows():
        if index < starting_index or index + before_offset + 1 > len(df1.index):
            continue
        indexes = generate_index_list(window_size)
        window = df1.loc[index - before_offset:index + after_offset, 'A':'C']
        window.index = indexes
        df_sum = df_sum.add(window, fill_value=0)
Expected output:
df_sum
A B C
0 1.0 1.1 2.0
1 1.0 1.1 2.0
2 1.1 1.6 1.4
Consider building a list of subsetted data frames with .loc and .head, then run a groupby aggregation after the individual pieces are concatenated.
window_size = 3

def window_process(file):
    csv_df = pd.read_csv(file, sep='\t')
    window_dfs = [(csv_df.loc[i:, ['A', 'B', 'C']]   # ROW AND COLUMN SLICE
                   .head(window_size)                # SELECT FIRST WINDOW ROWS
                   .reset_index(drop=True)           # RESET INDEX TO 0, 1, 2, ...
                   ) for i in range(csv_df.shape[0])]
    sum_df = (pd.concat(window_dfs)                  # COMBINE WINDOW DFS
              .groupby(level=0).sum())               # AGGREGATE BY INDEX
    return sum_df

# BUILD LONG DF FROM ALL FILES
long_df = pd.concat([window_process(f) for f in os.listdir('.')])

# FINAL AGGREGATION
df_sum = long_df.groupby(level=0).sum()
Using the posted data sample, below are the outputs of each of the window_dfs:
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
A B C
0 0.0 0.0 0.8
1 0.5 0.1 0.1
2 0.4 0.5 0.1
A B C
0 0.5 0.1 0.1
1 0.4 0.5 0.1
2 0.0 0.0 0.8
A B C
0 0.4 0.5 0.1
1 0.0 0.0 0.8
2 0.1 0.5 0.2
A B C
0 0.0 0.0 0.8
1 0.1 0.5 0.2
2 0.1 0.5 0.2
A B C
0 0.1 0.5 0.2
1 0.1 0.5 0.2
A B C
0 0.1 0.5 0.2
With the final df_sum, to show the accuracy of the DataFrame.add() approach:
df_sum
A B C
0 1.2 2.1 2.4
1 1.1 1.6 2.2
2 1.1 1.6 1.4
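As the question notes, a numpy-array version is much faster. A minimal sketch of the windowed sum over full windows only (data and names hard-coded from the question's df1):

```python
import numpy as np
import pandas as pd

# df1 from the question, numeric columns only
df1 = pd.DataFrame({'A': [0.1, 0.0, 0.5, 0.4, 0.0, 0.1, 0.1],
                    'B': [0.5, 0.0, 0.1, 0.5, 0.0, 0.5, 0.5],
                    'C': [0.2, 0.8, 0.1, 0.1, 0.8, 0.2, 0.2]})
arr = df1[['A', 'B', 'C']].to_numpy()

window_size = 3
n = arr.shape[0]
# stack every full window (rows i..i+window_size-1) and sum position-wise
stack = np.stack([arr[i:i + window_size] for i in range(n - window_size + 1)])
df_sum = pd.DataFrame(stack.sum(axis=0), columns=['A', 'B', 'C'])
```

This reproduces the question's expected output, since only full windows are included.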
I have some timeseries data that basically contains information on price change period by period. For example, let's say:
df = pd.DataFrame(columns = ['TimeStamp','PercPriceChange'])
df.loc[:,'TimeStamp']=[1457280,1457281,1457282,1457283,1457284,1457285,1457286]
df.loc[:,'PercPriceChange']=[0.1,0.2,-0.1,0.1,0.2,0.1,-0.1]
so that df looks like
TimeStamp PercPriceChange
0 1457280 0.1
1 1457281 0.2
2 1457282 -0.1
3 1457283 0.1
4 1457284 0.2
5 1457285 0.1
6 1457286 -0.1
What I want to achieve is to calculate the overall price change before an increase/decrease streak ends, and store the value in the row where the streak started. That is, I want a column 'TotalPriceChange':
TimeStamp PercPriceChange TotalPriceChange
0 1457280 0.1 1.1 * 1.2 - 1 = 0.32
1 1457281 0.2 0
2 1457282 -0.1 -0.1
3 1457283 0.1 1.1 * 1.2 * 1.1 - 1 = 0.452
4 1457284 0.2 0
5 1457285 0.1 0
6 1457286 -0.1 -0.1
I can identify the starting points using something like:
df['turn'] = 0
df['PriceChange_L1'] = df['PercPriceChange'].shift(periods=1, freq=None, axis=0)
df.loc[ df['PercPriceChange'] * df['PriceChange_L1'] < 0, 'turn' ] = 1
to get
TimeStamp PercPriceChange turn
0 1457280 0.1 NaN or 1?
1 1457281 0.2 0
2 1457282 -0.1 1
3 1457283 0.1 1
4 1457284 0.2 0
5 1457285 0.1 0
6 1457286 -0.1 1
Given this column "turn", I need help proceeding with my quest (or perhaps we don't need "turn" at all). I am pretty sure I could write a nested for-loop going through the entire DataFrame row by row, calculating what I need and populating the column 'TotalPriceChange'. But given that I plan to do this on a fairly large data set (think minute or hourly data for a couple of years), I imagine nested for-loops would be really slow.
Therefore, I just wanted to check with you experts to see if there is any efficient solution to my problem that I am not aware of. Any help would be much appreciated!
Thanks!
The calculation you are looking for looks like a groupby/product operation.
To set up the groupby operation, we need to assign a group value to each row. Taking the cumulative sum of the turn column gives the desired result:
df['group'] = df['turn'].cumsum()
# 0 0
# 1 0
# 2 1
# 3 2
# 4 2
# 5 2
# 6 3
# Name: group, dtype: int64
Now we can define the TotalPriceChange column (modulo a little cleanup work) as
df['PercPriceChange_plus_one'] = df['PercPriceChange']+1
df['TotalPriceChange'] = df.groupby('group')['PercPriceChange_plus_one'].transform('prod') - 1
import pandas as pd
df = pd.DataFrame({'PercPriceChange': [0.1, 0.2, -0.1, 0.1, 0.2, 0.1, -0.1],
'TimeStamp': [1457280, 1457281, 1457282, 1457283, 1457284, 1457285, 1457286]})
df['turn'] = 0
df['PriceChange_L1'] = df['PercPriceChange'].shift(periods=1, freq=None, axis=0)
df.loc[ df['PercPriceChange'] * df['PriceChange_L1'] < 0, 'turn' ] = 1
df['group'] = df['turn'].cumsum()
df['PercPriceChange_plus_one'] = df['PercPriceChange']+1
df['TotalPriceChange'] = df.groupby('group')['PercPriceChange_plus_one'].transform('prod') - 1
mask = (df['group'].diff() != 0)
df.loc[~mask, 'TotalPriceChange'] = 0
df = df[['TimeStamp', 'PercPriceChange', 'TotalPriceChange']]
print(df)
yields
TimeStamp PercPriceChange TotalPriceChange
0 1457280 0.1 0.320
1 1457281 0.2 0.000
2 1457282 -0.1 -0.100
3 1457283 0.1 0.452
4 1457284 0.2 0.000
5 1457285 0.1 0.000
6 1457286 -0.1 -0.100
I am stuck (and in a bit of a time crunch) and was hoping for some help. This is probably a simple task, but I can't seem to solve it.
I have a matrix, say 5 by 5, with an additional starting column of row names, and the same names for the columns, in a text file like this:
b e a d c
b 0.0 0.1 0.3 0.2 0.5
e 0.1 0.0 0.4 0.9 0.3
a 0.3 0.4 0.0 0.7 0.6
d 0.2 0.9 0.7 0.0 0.1
c 0.5 0.3 0.6 0.1 0.0
I have multiple files with the same matrix format and size, but the order of the names differs. I need a way to rearrange them so they are all in the same order while maintaining the 0.0 diagonal, so any swap I apply to the columns I must also apply to the rows.
I have been searching a bit and it seems like NumPy might do what I want but I have never worked with it or arrays in general. Any help is greatly appreciated!
In short: How do I get a text file into an array which I can then swap around rows and columns to a desired order?
I suggest you use pandas:
from StringIO import StringIO
import pandas as pd
data = StringIO("""b e a d c
b 0.0 0.1 0.3 0.2 0.5
e 0.1 0.0 0.4 0.9 0.3
a 0.3 0.4 0.0 0.7 0.6
d 0.2 0.9 0.7 0.0 0.1
c 0.5 0.3 0.6 0.1 0.0
""")
df = pd.read_csv(data, sep=" ")
print df.sort_index().sort_index(axis=1)
output:
a b c d e
a 0.0 0.3 0.6 0.7 0.4
b 0.3 0.0 0.5 0.2 0.1
c 0.6 0.5 0.0 0.1 0.3
d 0.7 0.2 0.1 0.0 0.9
e 0.4 0.1 0.3 0.9 0.0
Here's the start of a horrific Numpy version (use HYRY's answer...)
import numpy as np

with open("myfile", "r") as myfile:
    lines = myfile.read().split("\n")

floats = [[float(item) for item in line.split()[1:]] for line in lines[1:]]
floats_transposed = np.array(floats).transpose().tolist()
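A tidier NumPy route (labels and values hard-coded here from the sample; parse them from the file as above) reorders both axes with the same argsort permutation, which is what keeps the diagonal at 0.0:

```python
import numpy as np

labels = np.array(['b', 'e', 'a', 'd', 'c'])
matrix = np.array([[0.0, 0.1, 0.3, 0.2, 0.5],
                   [0.1, 0.0, 0.4, 0.9, 0.3],
                   [0.3, 0.4, 0.0, 0.7, 0.6],
                   [0.2, 0.9, 0.7, 0.0, 0.1],
                   [0.5, 0.3, 0.6, 0.1, 0.0]])

order = np.argsort(labels)                  # permutation that sorts the labels
reordered = matrix[np.ix_(order, order)]    # same permutation on rows AND columns
sorted_labels = labels[order]
```

np.ix_ builds the open mesh so the row and column permutations are applied together in one indexing step.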
from copy import copy

f = open('input', 'r')
data = []
for line in f:
    row = line.rstrip().split(' ')
    data.append(row)

# collect labels, strip empty spaces
r = data.pop(0)
c = [row.pop(0) for row in data]
r.pop(0)
origrow, origcol = copy(r), copy(c)
r.sort()
c.sort()

newgrid = []
for row, rowtitle in enumerate(r):
    fromrow = origrow.index(rowtitle)
    newgrid.append(range(len(c)))
    for col, coltitle in enumerate(c):
        # We ask this len(row) times, so memoization
        # might matter on a large matrix
        fromcol = origcol.index(coltitle)
        newgrid[row][col] = data[fromrow][fromcol]

print "\t".join([''] + r)
clabel = c.__iter__()
for line in newgrid:
    print "\t".join([clabel.next()] + line)