If I have a dataframe
0 1 2
0 0.1 0.2 0.3
1 0.2 0.3 0.4
2 0.3 0.4 0.5
3 0.4 0.5 0.6
and I want to append it to my existing Excel file, I write this code:
from openpyxl import load_workbook
wb = load_workbook('test1.xlsx')
ws = wb['Sheet1']
for index, row in newdf.iterrows():
    cell = 'D%d' % (index + 2)
    ws[cell] = row[0]  # only the first value of each row is written
wb.save('test1.xlsx')
The problem with this code is that I only get the first column, since it is iterating over the index. For example, this is the resulting Excel sheet:
     A    B    C    D
1
2                   0.1
3                   0.2
4                   0.3
5                   0.4
How can I iterate over all the columns, starting at cell D2? I have also tried iteritems, but it did not work.
The expected result should be:
     A    B    C    D    E    F
1
2                   0.1  0.2  0.3
3                   0.2  0.3  0.4
4                   0.3  0.4  0.5
5                   0.4  0.5  0.6
Here is a proposal based on your code:
from openpyxl import load_workbook
from openpyxl.utils import get_column_letter
wb = load_workbook("test1.xlsx")
ws = wb["Sheet1"]
for i, row in df.iterrows():
    for j, value in enumerate(row):
        # row i goes to Excel row i + 2; column j goes to D, E, F, ...
        r, c = i + 2, get_column_letter(j + 4)
        cell_coordinates = f"{c}{r}"
        ws[cell_coordinates] = value
wb.save("test1.xlsx")
There is also this answer from @Charlie Clark that you can adapt to your needs.
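The same idea can be sketched with numeric coordinates instead of cell strings. This is only a sketch: it builds a fresh in-memory workbook rather than loading test1.xlsx, and the names start_row and start_col are made up for illustration.

```python
import pandas as pd
from openpyxl import Workbook

# Sample data from the question
df = pd.DataFrame([[0.1, 0.2, 0.3],
                   [0.2, 0.3, 0.4],
                   [0.3, 0.4, 0.5],
                   [0.4, 0.5, 0.6]])

wb = Workbook()          # in-memory workbook standing in for test1.xlsx
ws = wb.active

start_row, start_col = 2, 4   # hypothetical names: D2 is row 2, column 4

# enumerate() gives positions, so this does not rely on the
# DataFrame index being 0, 1, 2, ...
for i, row in enumerate(df.to_numpy().tolist()):
    for j, value in enumerate(row):
        ws.cell(row=start_row + i, column=start_col + j, value=value)
```

Using positions rather than index labels means the block still lands at D2 when the DataFrame has a non-default index.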
I am trying to construct a column without using a for loop. However, each column uses the value that I calculated at the previous step.
Example:
f-
0    0.3
1    0.4
2    0.45
3    0.6
f+
0    0.21
1    0.78
2    0.54
3    0.9
Matrix P
     0    1
0    0.9  0.1
1    0.1  0.9
DataFrame df to be filled
     0    1    2
0    0.5  ...  ...
1    0.5  ...  ...
Then column 1 of DataFrame df becomes:
prob = f-[0,0] * df[0,0] / (f-[0,0] * df[0,0] + f+[0,0] * df[0,1])
df[0] = P.dot([prob, 1 - prob])
Here is the code with a for loop:
for i in np.arange(0, n):
    p = f_m[i+1] * df.iloc[0, i] / (f_m[i+1] * df.iloc[0, i] + f_p[i+1] * df.iloc[1, i])
    xi[i + 1] = p_matrix.dot(np.array([p, 1 - p]))
Does anyone have a way to build this without a for loop? Each subsequent column is calculated the same way.
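For reference, here is a self-contained version of the loop above using the sample values. It is a sketch under assumptions: it treats xi and df as the same 2-row array and starts column 0 at 0.5, as in the table. Because each column depends on the previous one, the loop over columns is a recurrence, so this version keeps the loop and only moves the work to plain NumPy arrays.

```python
import numpy as np

f_m = np.array([0.3, 0.4, 0.45, 0.6])     # f- from the example
f_p = np.array([0.21, 0.78, 0.54, 0.9])   # f+ from the example
p_matrix = np.array([[0.9, 0.1],
                     [0.1, 0.9]])

n = len(f_m) - 1
xi = np.zeros((2, n + 1))   # assumption: xi plays the role of df
xi[:, 0] = 0.5              # first column as given in the table

for i in range(n):
    num = f_m[i + 1] * xi[0, i]
    p = num / (num + f_p[i + 1] * xi[1, i])
    # column i + 1 is built from column i only
    xi[:, i + 1] = p_matrix.dot(np.array([p, 1 - p]))
```

Since the rows of p_matrix sum to 1 column-wise, every column of xi sums to 1, which is a quick sanity check on the result.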
Consider a dataframe df of the following structure:-
Name Slide Height Weight Status General
A X 3 0.1 0.5 0.2
B Y 10 0.2 0.7 0.8
...
I would like to create duplicates for each row in this dataframe (specific to the Name and Slide) for the following combinations of Height and Weight shown by this list:-
list_combinations = [[3,0.1],[10,0.2],[5,1.3]]
The desired output:-
Name Slide Height Weight Status General
A X 3 0.1 0.5 0.2 #original
A X 10 0.2 0.5 0.2 # modified duplicate
A X 5 1.3 0.5 0.2 # modified duplicate
B Y 10 0.2 0.7 0.8 #original
B Y 3 0.1 0.7 0.8 # modified duplicate
B Y 5 1.3 0.7 0.8 # modified duplicate
etc. ...
Any suggestions and help would be much appreciated.
We can do a cross merge (how='cross' requires pandas 1.2 or later):
out = pd.DataFrame(list_combinations, columns=['Height', 'Weight']).\
    merge(df, how='cross', suffixes=('', '_')).\
    reindex(columns=df.columns).sort_values('Name')
Name Slide Height Weight Status General
0 A X 3 0.1 0.5 0.2
2 A X 10 0.2 0.5 0.2
4 A X 5 1.3 0.5 0.2
1 B Y 3 0.1 0.7 0.8
3 B Y 10 0.2 0.7 0.8
5 B Y 5 1.3 0.7 0.8
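A self-contained sketch of the cross merge on the two sample rows, for anyone who wants to run it. The name combos and the stable sort are additions for illustration; the stable sort keeps the combinations in list order within each Name.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["A", "B"],
    "Slide": ["X", "Y"],
    "Height": [3, 10],
    "Weight": [0.1, 0.2],
    "Status": [0.5, 0.7],
    "General": [0.2, 0.8],
})
list_combinations = [[3, 0.1], [10, 0.2], [5, 1.3]]

# Left frame holds the new Height/Weight pairs; its columns keep their
# names (suffix ''), while the originals from df become Height_/Weight_.
combos = pd.DataFrame(list_combinations, columns=["Height", "Weight"])

out = (
    combos.merge(df, how="cross", suffixes=("", "_"))
    .reindex(columns=df.columns)          # drops Height_/Weight_
    .sort_values("Name", kind="stable")   # keep list order within each Name
    .reset_index(drop=True)
)
```

Note that the rows come out in the order of list_combinations for every Name, so the original row is not necessarily first within its group, unlike in the desired output of the question.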
I have ~1.2k files that when converted into dataframes look like this:
df1
A B C D
0 0.1 0.5 0.2 C
1 0.0 0.0 0.8 C
2 0.5 0.1 0.1 H
3 0.4 0.5 0.1 H
4 0.0 0.0 0.8 C
5 0.1 0.5 0.2 C
6 0.1 0.5 0.2 C
Now, I have to subset each dataframe with a window of fixed size along the rows, and add its contents to a second dataframe, with all its values originally initialized to 0.
df_sum
A B C
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
For example, let's set the window size to 3. The first subset therefore will be
window = df.loc[start:end, 'A':'C']
window
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
window.index = correct_index
df_sum = df_sum.add(window, fill_value=0)
df_sum
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
After that, the window will be the subset of df1 from rows 1-3, then rows 2-4, and so on until rows 4-6. Once the first file has been scanned, the second file will begin, until all files have been processed.
As you can see, this approach relies on df.loc for the subset and df.add for the addition. However, despite the ease of coding, it is very inefficient: on my machine it takes about 5 minutes to process the whole batch of 1.2k files of 200 lines each. I know that an implementation based on numpy arrays is orders of magnitude faster (about 10 seconds), but it is a bit more complicated in terms of subsetting and adding.
Is there any way to increase the performance of this method while still using dataframes? For example, by replacing loc with a faster slicing method.
Example:
import os
import pandas as pd
def generate_index_list(window_size):
    before_offset = -(window_size - 1) // 2
    after_offset = (window_size - 1) // 2
    index_list = list()
    for n in range(before_offset, after_offset + 1):
        index_list.append(str(n))
    return index_list
window_size = 3
for file in os.listdir('.'):
    df1 = pd.read_csv(file, sep='\t')
    starting_index = (window_size - 1) // 2
    before_offset = (window_size - 1) // 2
    after_offset = (window_size - 1) // 2
    for index in df1.index:  # iterate over row labels, not (label, row) tuples
        # skip positions where a full window does not fit
        if index < starting_index or index + before_offset + 1 > len(df1.index):
            continue
        indexes = generate_index_list(window_size)
        window = df1.loc[index - before_offset:index + after_offset, 'A':'C']
        window.index = indexes
        df_sum = df_sum.add(window, fill_value=0)
Expected output:
df_sum
A B C
0 1.0 1.1 2.0
1 1.0 1.1 2.0
2 1.1 1.6 1.4
Consider building a list of subsetted data frames with .loc and .head. Then run a groupby aggregation after the individual elements are concatenated.
import os
import pandas as pd
window_size = 3
def window_process(file):
    csv_df = pd.read_csv(file, sep='\t')
    window_dfs = [(csv_df.loc[i:, ['A', 'B', 'C']]   # ROW AND COLUMN SLICE
                         .head(window_size)          # SELECT FIRST WINDOW ROWS
                         .reset_index(drop=True)     # RESET INDEX TO 0, 1, 2, ...
                  ) for i in range(csv_df.shape[0])]
    sum_df = (pd.concat(window_dfs)                  # COMBINE WINDOW DFS
                .groupby(level=0).sum())             # AGGREGATE BY INDEX
    return sum_df
# BUILD LONG DF FROM ALL FILES
long_df = pd.concat([window_process(file) for file in os.listdir('.')])
# FINAL AGGREGATION
df_sum = long_df.groupby(level=0).sum()
Using posted data sample, below are the outputs of each window_dfs:
A B C
0 0.1 0.5 0.2
1 0.0 0.0 0.8
2 0.5 0.1 0.1
A B C
0 0.0 0.0 0.8
1 0.5 0.1 0.1
2 0.4 0.5 0.1
A B C
0 0.5 0.1 0.1
1 0.4 0.5 0.1
2 0.0 0.0 0.8
A B C
0 0.4 0.5 0.1
1 0.0 0.0 0.8
2 0.1 0.5 0.2
A B C
0 0.0 0.0 0.8
1 0.1 0.5 0.2
2 0.1 0.5 0.2
A B C
0 0.1 0.5 0.2
1 0.1 0.5 0.2
A B C
0 0.1 0.5 0.2
With final df_sum to show accuracy of DataFrame.add():
df_sum
A B C
0 1.2 2.1 2.4
1 1.1 1.6 2.2
2 1.1 1.6 1.4
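Since the question mentions that a NumPy version is much faster, here is a rough sketch of one way to do it with numpy.lib.stride_tricks.sliding_window_view (NumPy 1.20 or later). It is an assumption about what such a version could look like, not the asker's implementation. It only uses full windows, so it yields the expected output from the question rather than the table just above, which also counts the shorter windows at the end.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Sample frame from the question
df1 = pd.DataFrame({
    "A": [0.1, 0.0, 0.5, 0.4, 0.0, 0.1, 0.1],
    "B": [0.5, 0.0, 0.1, 0.5, 0.0, 0.5, 0.5],
    "C": [0.2, 0.8, 0.1, 0.1, 0.8, 0.2, 0.2],
    "D": list("CCHHCCC"),
})
window_size = 3

values = df1[["A", "B", "C"]].to_numpy()

# Shape (n_windows, n_columns, window_size): the window axis is appended last
windows = sliding_window_view(values, window_size, axis=0)

# Sum over all windows, then transpose to (window_size, n_columns)
summed = windows.sum(axis=0).T

df_sum = pd.DataFrame(summed, columns=["A", "B", "C"])
```

For many files, the per-file arrays can be added together with plain + before wrapping the total in a DataFrame once at the end.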
I have some timeseries data that basically contains information on price change period by period. For example, let's say:
df = pd.DataFrame(columns = ['TimeStamp','PercPriceChange'])
df.loc[:,'TimeStamp']=[1457280,1457281,1457282,1457283,1457284,1457285,1457286]
df.loc[:,'PercPriceChange']=[0.1,0.2,-0.1,0.1,0.2,0.1,-0.1]
so that df looks like
TimeStamp PercPriceChange
0 1457280 0.1
1 1457281 0.2
2 1457282 -0.1
3 1457283 0.1
4 1457284 0.2
5 1457285 0.1
6 1457286 -0.1
What I want to achieve is to calculate the overall price change before an increase/decrease streak ends, and store the value in the row where the streak started. That is, what I want is a column 'TotalPriceChange':
TimeStamp PercPriceChange TotalPriceChange
0 1457280 0.1 1.1 * 1.2 - 1 = 0.32
1 1457281 0.2 0
2 1457282 -0.1 -0.1
3 1457283 0.1 1.1 * 1.2 * 1.1 - 1 = 0.452
4 1457284 0.2 0
5 1457285 0.1 0
6 1457286 -0.1 -0.1
I can identify the starting points using something like:
df['turn'] = 0
df['PriceChange_L1'] = df['PercPriceChange'].shift(periods=1, freq=None, axis=0)
df.loc[ df['PercPriceChange'] * df['PriceChange_L1'] < 0, 'turn' ] = 1
to get
TimeStamp PercPriceChange turn
0 1457280 0.1 NaN or 1?
1 1457281 0.2 0
2 1457282 -0.1 1
3 1457283 0.1 1
4 1457284 0.2 0
5 1457285 0.1 0
6 1457286 -0.1 1
Given this column "turn", I need help proceeding with my quest (or perhaps we don't need this "turn" at all). I am pretty sure I can write a nested for-loop going through the entire DataFrame row by row, calculating what I need and populating the column 'TotalPriceChange', but given that I plan on doing this on a fairly large data set (think minute or hour data for couple of years), I imagine nested for-loops will be really slow.
Therefore, I just wanted to check with you experts to see if there is any efficient solution to my problem that I am not aware of. Any help would be much appreciated!
Thanks!
The calculation you are looking for looks like a groupby/product operation.
To set up the groupby operation, we need to assign a group value to each row. Taking the cumulative sum of the turn column gives the desired result:
df['group'] = df['turn'].cumsum()
# 0 0
# 1 0
# 2 1
# 3 2
# 4 2
# 5 2
# 6 3
# Name: group, dtype: int64
Now we can define the TotalPriceChange column (modulo a little cleanup work) as
df['PercPriceChange_plus_one'] = df['PercPriceChange']+1
df['TotalPriceChange'] = df.groupby('group')['PercPriceChange_plus_one'].transform('prod') - 1
import pandas as pd
df = pd.DataFrame({'PercPriceChange': [0.1, 0.2, -0.1, 0.1, 0.2, 0.1, -0.1],
'TimeStamp': [1457280, 1457281, 1457282, 1457283, 1457284, 1457285, 1457286]})
df['turn'] = 0
df['PriceChange_L1'] = df['PercPriceChange'].shift(periods=1, freq=None, axis=0)
df.loc[ df['PercPriceChange'] * df['PriceChange_L1'] < 0, 'turn' ] = 1
df['group'] = df['turn'].cumsum()
df['PercPriceChange_plus_one'] = df['PercPriceChange']+1
df['TotalPriceChange'] = df.groupby('group')['PercPriceChange_plus_one'].transform('prod') - 1
mask = (df['group'].diff() != 0)
df.loc[~mask, 'TotalPriceChange'] = 0
df = df[['TimeStamp', 'PercPriceChange', 'TotalPriceChange']]
print(df)
yields
TimeStamp PercPriceChange TotalPriceChange
0 1457280 0.1 0.320
1 1457281 0.2 0.000
2 1457282 -0.1 -0.100
3 1457283 0.1 0.452
4 1457284 0.2 0.000
5 1457285 0.1 0.000
6 1457286 -0.1 -0.100
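The same groupby/product idea can be written without the helper columns. The sketch below is a variation, not the code above: it detects streak starts by comparing the sign of each change with the previous one, so a change of exactly 0 is treated as its own streak, which is an assumption the question does not cover. The names starts and total_price_change are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.1, 0.2, -0.1, 0.1, 0.2, 0.1, -0.1], name="PercPriceChange")

sign = np.sign(s)
# True where the sign differs from the previous row (always True for row 0,
# because comparing against the shifted-in NaN is never equal)
starts = sign.ne(sign.shift())

# Consecutive rows with the same sign share a group id
group = starts.cumsum()

# Compound the changes within each streak, broadcast back to every row
total = (1 + s).groupby(group).transform("prod") - 1

# Keep the value only on the row where the streak starts
total_price_change = total.where(starts, 0.0)
```

On the sample data this assigns the first row to a streak start, which settles the "NaN or 1?" question from the turn column in favor of 1.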
I am stuck (and in a bit of a time crunch) and was hoping for some help. This is probably a simple task but I can't seem to solve it..
I have a matrix, say 5 by 5, with an additional starting column of names for the rows and the same names for the columns in a text file like this:
b e a d c
b 0.0 0.1 0.3 0.2 0.5
e 0.1 0.0 0.4 0.9 0.3
a 0.3 0.4 0.0 0.7 0.6
d 0.2 0.9 0.7 0.0 0.1
c 0.5 0.3 0.6 0.1 0.0
I have multiple files that have the same format and size of matrix but the order of the names are different. I need a way to change these around so they are all the same and maintain the 0.0 diagonal. So any swapping I do to the columns I must do to the rows.
I have been searching a bit and it seems like NumPy might do what I want but I have never worked with it or arrays in general. Any help is greatly appreciated!
In short: How do I get a text file into an array which I can then swap around rows and columns to a desired order?
I suggest you use pandas:
from io import StringIO
import pandas as pd
data = StringIO("""b e a d c
b 0.0 0.1 0.3 0.2 0.5
e 0.1 0.0 0.4 0.9 0.3
a 0.3 0.4 0.0 0.7 0.6
d 0.2 0.9 0.7 0.0 0.1
c 0.5 0.3 0.6 0.1 0.0
""")
# the header has one field fewer than the rows, so the first column becomes the index
df = pd.read_csv(data, sep=" ")
# sort the rows, then the columns, by label
print(df.sort_index().sort_index(axis=1))
output:
a b c d e
a 0.0 0.3 0.6 0.7 0.4
b 0.3 0.0 0.5 0.2 0.1
c 0.6 0.5 0.0 0.1 0.3
d 0.7 0.2 0.1 0.0 0.9
e 0.4 0.1 0.3 0.9 0.0
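If the target order is something other than alphabetical, the same DataFrame can be reordered with an explicit list of labels. A small sketch, where the order list is a hypothetical target chosen for illustration:

```python
from io import StringIO
import pandas as pd

text = """b e a d c
b 0.0 0.1 0.3 0.2 0.5
e 0.1 0.0 0.4 0.9 0.3
a 0.3 0.4 0.0 0.7 0.6
d 0.2 0.9 0.7 0.0 0.1
c 0.5 0.3 0.6 0.1 0.0
"""

# Whitespace-separated; the first column becomes the index because
# the header has one field fewer than the data rows.
df = pd.read_csv(StringIO(text), sep=r"\s+")

order = ["a", "b", "c", "d", "e"]   # hypothetical desired order

# Applying the same label list to rows and columns keeps the 0.0 diagonal
reordered = df.loc[order, order]
```

Using one list for both axes is what guarantees that every swap made to the columns is also made to the rows.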
Here's the start of a horrific Numpy version (use HYRY's answer...)
import numpy as np
with open("myfile", "r") as myfile:
    lines = myfile.read().split("\n")
# skip the header line and any blank lines, drop the row label in each line
floats = [[float(item) for item in line.split()[1:]] for line in lines[1:] if line.strip()]
floats_transposed = np.array(floats).transpose().tolist()
from copy import copy
with open('input', 'r') as f:
    data = []
    for line in f:
        row = line.rstrip().split(' ')
        data.append(row)
# collect labels, strip empty spaces
r = data.pop(0)
c = [row.pop(0) for row in data]
r.pop(0)  # drops the header's first field (assumed to be an empty corner cell)
origrow, origcol = copy(r), copy(c)
r.sort()
c.sort()
newgrid = []
for row, rowtitle in enumerate(r):
    fromrow = origrow.index(rowtitle)
    newgrid.append(list(range(len(c))))  # placeholder row, overwritten below
    for col, coltitle in enumerate(c):
        # We ask this len(row) times, so memoization
        # might matter on a large matrix
        fromcol = origcol.index(coltitle)
        newgrid[row][col] = data[fromrow][fromcol]
print("\t".join([''] + r))
clabel = iter(c)
for line in newgrid:
    print("\t".join([next(clabel)] + line))