Recently I have been working with an Excel sheet for work, and I need to format it in a certain way (shown below). The following is the Excel sheet I'm working with (apologies for the REDACTED values, as some of the information is sensitive, and for posting an image; I am fairly new to Stack Overflow and do not know how to add Excel data):
Above is the format I am currently using, but I need to convert the data to the following format:
As you can see, I need the data to go from 10 lines down to 1 line per unique LBREFID. I have already tried different pandas functions such as .tolist() and .pivot(), but the result did not resemble the desired format. This is an interesting problem that I, unfortunately, do not have the time to solve. Thank you in advance for your help.
from openpyxl import load_workbook

tests = ["BMCELL", "PLASMA", "NEOPLASMABM", "NEOPLASMATBM", "CD138",
         "CD56", "CYCLIND1", "KAPPA", "LAMBDA", "NEOPLASMA"]

wb = load_workbook(filename='GregFileComparison\\NovemberData.xlsx')
sheet = wb['Sheet1']

# Blank out the repeated identifier columns in each 10-row block
i = 0
a = 3
e = 11
while i <= 227:
    for row in sheet['A' + str(a) + ':E' + str(e)]:
        for cell in row:
            cell.value = None
    for row in sheet['I' + str(a) + ':J' + str(e)]:
        for cell in row:
            cell.value = None
    for row in sheet['M' + str(a) + ':N' + str(e)]:
        for cell in row:
            cell.value = None
    a += 10
    e += 10
    i += 1

sheet.delete_cols(12)
sheet.delete_cols(7)

# Insert one column per test and write the test names as headers
i = 11
while i <= 19:
    sheet.insert_cols(i)
    i += 1

counter = 10
i = 0
while i <= 9:
    sheet.cell(row=1, column=counter).value = tests[i]
    counter += 1
    i += 1

# Move each result up and across into its LBREFID's single row
j = 0
i = 3
counter = 1
while j <= 250:
    while counter <= 9:
        sheet.move_range("J" + str(i), rows=-counter, cols=counter)
        i += 1
        counter += 1
    j += 1
    counter = 0

sheet.delete_cols(6)
sheet.delete_cols(6)
wb.save('output.xlsx')
I found that hardcoding the transformations on the Excel sheet worked best.
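For reference, a pandas reshape may also get there, depending on the actual column layout; this is a minimal sketch assuming hypothetical column names LBREFID, TEST and RESULT (one row per test result), so adjust to the real headers:

import pandas as pd

# hypothetical column names; adjust to the real headers
df = pd.read_excel('NovemberData.xlsx', sheet_name='Sheet1')

# one row per LBREFID, one column per test
wide = df.pivot(index='LBREFID', columns='TEST', values='RESULT').reset_index()
wide.to_excel('output.xlsx', index=False)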
I am trying to create a list of conditions to use the numpy select statement to create a 'Week #' column depending on the date a record was created. However, it doesn't quite seem to work. Any suggestions?
# Creating list for start dates
weekStartDay = []
weekValues = []
weekConditions = []
counter = 1
demoStartDate = min(demographic['Date'])
demoEndDate = max(demographic['Date'])
while demoStartDate <= demoEndDate:
    weekStartDay.append(demoStartDate)
    demoStartDate += timedelta(days=7)
weekStartDay.append(demoStartDate)
while counter <= len(weekConditions):
    weekValues.append(counter + 1)
    counter += 1
# Assigning condition statement for numpy conditions
for i in range(len(weekStartDay)):
    weekConditions.append((demographic['Date'] >= weekStartDay[i]) & (demographic['Date'] < weekStartDay[i+1]))
# Creating week value assignment column
demographic['Week'] = np.select(weekConditions, weekValues)
I believe I've found a solution to the problem.
# Creating the list of week start dates (the last entry is the end boundary)
weekStartDay = []
weekValues = []
weekConditions = []
demoStartDate = min(demographic['Date'])
demoEndDate = max(demographic['Date'])
while demoStartDate <= demoEndDate:
    weekStartDay.append(demoStartDate)
    demoStartDate += timedelta(days=7)
weekStartDay.append(demoStartDate)

# One week number per interval between consecutive start dates
counter = 1
while counter < len(weekStartDay):
    weekValues.append(counter)
    counter += 1

# One condition per interval [weekStartDay[i-1], weekStartDay[i])
for i in range(1, len(weekStartDay)):
    weekConditions.append(
        (demographic['Date'] >= weekStartDay[i-1]) & (demographic['Date'] < weekStartDay[i])
    )

# Creating week value assignment column
demographic['Week'] = np.select(weekConditions, weekValues)
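For what it's worth, a fully vectorized sketch (not from the original post) derives the week number directly from the day offset, assuming demographic['Date'] is already a datetime column:

# sketch: week number straight from the day offset;
# assumes demographic['Date'] is a datetime64 column
start = demographic['Date'].min()
demographic['Week'] = (demographic['Date'] - start).dt.days // 7 + 1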
I would like to convert my dataframe from one format of values (X:XX:XX:XX) to another (X.X seconds).
Here is what my dataframe looks like:
        Start         End
0  0:00:00:00
1  0:00:00:00  0:07:37:80
2  0:08:08:56  0:08:10:08
3  0:08:13:40
4  0:08:14:00  0:08:14:84
And I would like to transform it into seconds, something like this:
    Start     End
0     0.0
1     0.0  457.80
2  488.56  490.08
3  493.40
4   494.0  494.84
To do that I did:
i = 0
j = 0
while j < 10:
    while i < 10:
        if data.iloc[i, j] != "":
            Value = (int(data.iloc[i, j][0]) * 3600) + (int(data.iloc[i, j][2:4]) * 60) + int(data.iloc[i, j][5:7]) + (int(data.iloc[i, j][8:10]) / 100)
            NewValue = data.iloc[:, j].replace([data.iloc[i, j]], Value)
            i += 1
        else:
            NewValue = data.iloc[:, j].replace([data.iloc[i, j]], "")
            i += 1
    data.update(NewValue)
    i = 0
    j += 1
But I failed to replace the new values in my original dataframe in a permanent way. When I do:
print(data)
I still get my old dataframe in the wrong format.
Could someone help me? I have tried so hard!
Thank you so much!
You are using pandas.DataFrame.update, which requires a pandas DataFrame as an argument. See the Examples section of the update documentation to really understand what update does: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
If I may suggest a more idiomatic solution: you can directly map a function to all values of a pandas Series.
def parse_timestring(s):
    if s == "":
        return s
    else:
        # weird to use centiseconds and not milliseconds
        # l is a list with [hour, minute, second, cs]
        l = [int(nbr) for nbr in s.split(":")]
        return sum([a * b for a, b in zip(l, (3600, 60, 1, 0.01))])

df["Start"] = df["Start"].map(parse_timestring)
You can remove the if ... else ... from parse_timestring if you first replace all empty strings in your dataframe with NaN values (df = df.replace("", numpy.nan)) and then use df["Start"] = df["Start"].map(parse_timestring, na_action='ignore').
see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
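As a quick check, a minimal sketch using the question's example values (and the parse_timestring defined above):

import numpy as np
import pandas as pd

# hypothetical two-row frame built from the question's example values
df = pd.DataFrame({"Start": ["0:08:08:56", ""], "End": ["0:08:10:08", ""]})
df = df.replace("", np.nan)
df["Start"] = df["Start"].map(parse_timestring, na_action='ignore')
print(df["Start"].tolist())  # [488.56, nan]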
The datetime library is made to deal with such data. You should also use pandas' apply function instead of iterating over the dataframe like that.
You could proceed as follows:
from datetime import datetime, timedelta

def to_seconds(date):
    # the format is H:MM:SS:cc, where the last field is centiseconds
    if date == "":
        return date
    comp = date.split(':')
    delta = (datetime.strptime(':'.join(comp[:3]), "%H:%M:%S") - datetime(1900, 1, 1)
             + timedelta(milliseconds=10 * int(comp[3])))
    return delta.total_seconds()

data['Start'] = data['Start'].apply(to_seconds)
data['End'] = data['End'].apply(to_seconds)
Thank you so much for your help.
Your method worked. I also found a method using loops:
To summarize, my general problem was that I had an ugly csv file that I wanted to transform into a csv usable for statistics, and I wanted to use Python to do it.
My csv file was like:
MiceID = 1   Beginning     End          Type of behavior
0            0:00:00:00                 Video start
1            0:00:01:36                 grooming type 1
2            0:00:03:18                 grooming type 2
3            0:00:06:73    0:00:08:16   grooming type 1
So in my ugly csv file I wrote only the start time of a behavior, without the end time, whenever behavior types directly followed each other; I wrote the end time only when the mouse stopped grooming altogether, which allowed me to separate sequences of grooming. But this kind of csv was not usable for easily computing statistics.
So I wanted to 1) transform all my values into seconds to have a correct format, 2) fill the gaps in the End column (a gap has to be filled with the following Beginning value, since the end of one behavior in a sequence is the beginning of the next), 3) create a column for the duration of each behavior, and 4) fill this new column with the durations.
My question was about the first step, but I put here the code for each step separately:
step 1: transform the values into a good format
import pandas as pd
import numpy as np

data = pd.read_csv("D:/Python/TestPythonTraitementDonnéesExcel/RawDataBatch2et3.csv", engine="python")
data.replace(np.nan, "", inplace=True)

i = 0
j = 0
while j < len(data.columns):
    while i < len(data.index):
        if ":" in data.iloc[i, j]:
            # H:MM:SS:cc -> seconds, kept as a string
            Value = str((int(data.iloc[i, j][0]) * 3600) + (int(data.iloc[i, j][2:4]) * 60) + int(data.iloc[i, j][5:7]) + (int(data.iloc[i, j][8:10]) / 100))
            data = data.replace([data.iloc[i, j]], Value)
        i += 1
    i = 0
    j += 1
print(data)
step 2: fill the gaps
i = 0
j = 2
while j < len(data.columns):
    while i < len(data.index) - 1:
        if data.iloc[i, j] == "":
            # an empty End is filled with the next row's Beginning
            data.iloc[i, j] = data.iloc[i + 1, j - 1]
            i += 1
        elif np.all(data.iloc[i:len(data.index), j] == ""):
            # the rest of the column is empty: nothing left to fill
            break
        else:
            i += 1
    i = 0
    j += 4
print(data)
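Step 2 can also be done without the nested loops; a sketch under the assumption that the columns are named Beginning and End (the real file repeats these per mouse):

import numpy as np

# an empty End is the next row's Beginning
data['End'] = data['End'].replace("", np.nan)
data['End'] = data['End'].fillna(data['Beginning'].shift(-1))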
step 3: create a new column for each mouse:
j = 1
k = 0
while k < len(data.columns) - 1:
    k = (j * 4) + (j - 1)
    data.insert(k, "Duree{}".format(k), "")
    j += 1
print(data)
step 4: fill the duration column
j = 4
i = 0
while j < len(data.columns):
    while i < len(data.index):
        if data.iloc[i, j - 2] != "":
            # duration = End - Beginning
            data.iloc[i, j] = str(float(data.iloc[i, j - 2]) - float(data.iloc[i, j - 3]))
            i += 1
        else:
            break
    i = 0
    j += 5
print(data)
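The duration step can likewise be vectorized; same hypothetical column names, with pd.to_numeric turning the remaining empty strings into NaN:

import pandas as pd

end = pd.to_numeric(data['End'], errors='coerce')
beginning = pd.to_numeric(data['Beginning'], errors='coerce')
data['Duree'] = end - beginning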
And of course, export the new usable dataframe:
data.to_csv(r"D:/Python/TestPythonTraitementDonnéesExcel/FichierPropre.csv", index=False, header=True)
Here are the transformations (click on the links for the pictures): before step 1, after step 1, after step 2, after step 3, after step 4.
I have an Excel pivot table in this format:
Names  2/1/2010  3/1/2010  4/1/2010
A      8
B      4         5         7
C      5         3
D      6         6
I need to get the names and dates of the cells which are empty. How can I do it?
I want the output as a list: [A: 3/1/2010, 4/1/2010].
Assuming the format is the same as above, check this code snippet; you can use various Python modules to read an Excel sheet:
import xlrd

def get_list_vals():
    res = []
    path = "C:/File_PATH.xlsx"
    wb = xlrd.open_workbook(path)
    sheet = wb.sheet_by_index(0)
    # iterate over the data rows (the first row holds the dates)
    for row in range(1, sheet.nrows):
        temp = []
        for column in range(sheet.ncols):
            val = sheet.cell_value(row, column)
            # the first column holds the names (A, B, C, ...)
            if column == 0:
                temp.append(val)
                continue
            # for an empty cell, record the date from the header row
            elif val == "":
                date_val = sheet.cell_value(0, column)
                temp.append(date_val)
        res.append(temp)
    return res
If you want a specific format like [A: date1, date2], then instead of building a list you can concatenate onto a string:
temp = []             -->  temp = ""
temp.append(val)      -->  temp += str(val) + ":"
temp.append(date_val) -->  temp += str(date_val) + ","
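For comparison, a pandas-based sketch (not part of the answer above) that reads the table and collects, for each name, the dates whose cells are empty; the file path and the layout (names in the first column, dates as the remaining headers) are assumptions:

import pandas as pd

# names in the first column become the index; empty cells read as NaN
df = pd.read_excel("C:/File_PATH.xlsx", index_col=0)

missing = {name: [col for col in df.columns if pd.isna(row[col])]
           for name, row in df.iterrows()}
print(missing)  # e.g. {'A': ['3/1/2010', '4/1/2010'], ...}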
I'm working on forex data like this:
         0                       1       2       3
1  AUD/JPY  20040101 00:01:00.000  80.598  80.598
2  AUD/JPY  20040101 00:02:00.000  80.595  80.595
3  AUD/JPY  20040101 00:03:00.000  80.562  80.562
4  AUD/JPY  20040101 00:04:00.000  80.585  80.585
5  AUD/JPY  20040101 00:05:00.000  80.585  80.585
I want to go through columns 2 and 3 and remove the rows in which the value is repeated more than 15 times in a row. So far I have managed to produce this piece of code:
price = 0
drop_start = 0
counter = 0
df_new = df
for i, r in df.iterrows():
    if r.iloc[2] != price:
        if counter >= 15:
            df_new = df_new.drop(df_new.index[drop_start:i])
        price = r.iloc[2]
        counter = 1
        drop_start = i
    if r.iloc[2] == price:
        counter = counter + 1

price = 0
drop_start = 0
counter = 0
df = df_new
for i, r in df.iterrows():
    if r.iloc[3] != price:
        if counter >= 15:
            df_new = df_new.drop(df_new.index[drop_start:i])
        price = r.iloc[3]
        counter = 1
        drop_start = i
    if r.iloc[3] == price:
        counter = counter + 1

print(df_new.info())
df_new.to_csv('df_new.csv', index=False, header=None)
Unfortunately, when I check the output file there are some mistakes: some weekends have not been removed by the program. How should I build my algorithm so that it removes the duplicated values correctly?
The first 250k rows of my initial dataset are available here: https://ufile.io/omg5h
The output of this program for that sample data is available here: https://ufile.io/2gc3d
You can see that in the output file the rows from 6931 on were not successfully removed.
The problem with your algorithm is that you are not keeping a separate counter per value; you increment a single counter throughout the loop, which I believe produces the wrong result. The comparison r.iloc[2] != price also does not serve its purpose, because you change the value of price on every iteration, so any elements lying between two duplicates defeat the check. I wrote a small piece of code to reproduce the behavior you asked for:
import pandas as pd

df = pd.DataFrame([[0, 0.5, 2.5], [0, 1, 2], [0, 1.5, 2.5], [0, 2, 3],
                   [0, 2, 3], [0, 3, 4], [0, 4, 5]],
                  columns=['A', 'B', 'C'])
df_new = df
counts = {}
print('Initial DF')
print(df)
print()

# count occurrences of each value in column B; drop values seen twice or more
for i, r in df.iterrows():
    counter = counts.get(r.iloc[1])
    if counter is None:
        counter = 0
    counts[r.iloc[1]] = counter + 1
    if counts[r.iloc[1]] >= 2:
        df_new = df_new[df_new.B != r.iloc[1]]
print('2nd col. deleted DF')
print(df_new)
print()

# repeat for column C on the already-filtered frame
df_fin = df_new
counts2 = {}
for i, r in df_new.iterrows():
    counter = counts2.get(r.iloc[2])
    if counter is None:
        counter = 0
    counts2[r.iloc[2]] = counter + 1
    if counts2[r.iloc[2]] >= 2:
        df_fin = df_fin[df_fin.C != r.iloc[2]]
print('3rd col. deleted DF')
print(df_fin)
Here, I hold a counter for each unique value in columns 2 and 3. Then, according to the threshold (which is 2 in this case), I remove the rows whose values reach it. I first eliminate values according to the 2nd column, then forward the modified frame to the next loop, eliminate values according to the 3rd column, and finish the process.
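Note that the original question asks about runs of consecutive repeats (more than 15 in a row), which per-value counters do not capture. A vectorized shift/cumsum sketch for that case, assuming the integer column labels shown in the question:

import pandas as pd

def drop_long_runs(df, col, max_run=15):
    # label each run of consecutive equal values ...
    run_id = (df[col] != df[col].shift()).cumsum()
    # ... compute each run's length ...
    run_len = df.groupby(run_id)[col].transform('size')
    # ... and keep only rows belonging to runs no longer than max_run
    return df[run_len <= max_run]

df = drop_long_runs(df, 2)  # prices in column 2
df = drop_long_runs(df, 3)  # prices in column 3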
How can I make the cell reference increase by one every time the loop moves to the next sheet? I got it to loop through the different sheets, but I'm not sure how to add 1 to the cell's row.
for sheet in sheetlist:
    wsX = wb.get_sheet_by_name('{}'.format(sheet))
    ws2['D4'] = wsX['P6'].value
I'm trying to get just the ['D4'] to change to D5, D6, D7, and so on up to D25 automatically.
No need for counters or clumsy string manipulation: openpyxl provides an API for programmatic cell access.
for idx, sheet in enumerate(sheetlist, start=4):
    wsX = wb[sheet]
    cell = ws2.cell(row=idx, column=4)  # column D
    cell.value = wsX['P6'].value
for i, sheet in enumerate(sheetlist):
    wsX = wb.get_sheet_by_name('{}'.format(sheet))
    cell_no = 'D' + str(i + 4)
    ws2[cell_no] = wsX['P6'].value
Write this outside of the loop:
x = 'D4'
Write this inside the loop:
x = x[0] + str(int(x[1:]) + 1)
Try this one... it's commented so you can understand what it's doing.
# counter
i = 4
for sheet in sheetlist:
    # stop after filling D4 through D25
    if i > 25:
        break
    wsX = wb.get_sheet_by_name('{}'.format(sheet))
    # dynamic way to get the cell
    cell1 = 'D' + str(i)
    ws2[cell1] = wsX['P6'].value
    # incrementing counter
    i += 1