Hello, I need to get insights from my data. The desired_result below comes from VBA code that compares two sheets; it has been checked and is 100% accurate. If someone can assist me in getting the desired output, the conditions are:
(err['p'] == scr['p']) & (err['errd'] >= scr['scrd']) & (err['errq'] - scr['scrq'] >= 0)
It's all about checking how many of scr['n'] went through err; when one passes through err, then err['errq'] -= scr['scrq'] and we jump to the next scr item. The values in scr['n'] are unique. Please see the sample code below:
import pandas as pd
err = pd.DataFrame({
    'p': ['10100.A', '10101.A', '10101.A', '10101.A', '10102.A', '10102.A', '10102.A', '10103.A', '10103.A', '10147.A', '10147.A'],
    'errd': ['18-5-2022', '16-5-2022', '4-5-2022', '13-5-2022', '9-5-2022', '2-5-2022', '29-5-2022', '6-5-2022', '11-5-2022', '25-5-2022', '6-5-2022'],
    'errq': [1, 1, 1, 1, 1, 2, 46, 1, 4, 1, 5]})
err = err.sort_values('errd')

scr = pd.DataFrame({
    'p': ['10101.A', '10101.A', '10101.A', '10102.A', '10102.A', '10102.A', '10103.A', '10147.A', '10147.A', '10147.A', '10147.A', '10147.A'],
    'scrd': ['10-5-2022', '10-5-2022', '9-5-2022', '13-5-2022', '9-5-2022', '9-5-2022', '25-5-2022', '6-5-2022', '6-5-2022', '6-5-2022', '6-5-2022', '11-5-2022'],
    'scrq': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'n': ['7000000051481339', '7000000051481342', '7000000051722237', '7000000052018581', '7000000051721987', '7000000051721990', '7000000052725251', '7000000051530150', '7000000051530152', '7000000051530157', '7000000051546193', '7000000051761150']})

desired_result = pd.DataFrame({
    'report': ['7000000051722237', '7000000051481339', '7000000051721987', '7000000051721990', '7000000052018581', '7000000051530150', '7000000051530152', '7000000051530157', '7000000051546193', '7000000051761150'],
    'match_err_scr': ['10101.A', '10101.A', '10102.A', '10102.A', '10102.A', '10147.A', '10147.A', '10147.A', '10147.A', '10147.A']})
What I have tried so far:
match = []
# Iterate over the scr rows
for i, row in scr.iterrows():
    # Check for a match; row is a full row of scr
    if row['scrq'] <= err[(err['p'] == row['p']) & (err['errd'] >= row['scrd'])]['errq'].sum():
        r = row.to_dict()
        match.append(r)
# Create a new data frame
report = pd.DataFrame(match)
report
Merge left, then filter:
report1 = scr.merge(err, how = 'left', on = 'p')
flt = (report1['errd'] >= report1['scrd']) & (report1['errq'] - report1['scrq'] >= 0)
report1 = report1.loc[flt]
report1 = report1.drop_duplicates(subset = ['n'])
report1
Nested loop: way too slow, and again not correct:
match = []
for i, row in scr.iterrows():
    for e, erow in err.iterrows():
        if (row['p'] == erow['p']) & (erow['errd'] >= row['scrd']) & (erow['errq'] - row['scrq'] >= 0):
            err.loc[e, 'errq'] -= row['scrq']  # .loc avoids chained-assignment issues
            row_to_dict = row.to_dict()
            match.append(row_to_dict)
            break
report2 = pd.DataFrame(match)
report2
Not an answer, but required to help understand the question.
@B02T, this is what I am seeing as a slice of the data.
So am I correct in that you are only comparing scr.loc[0] to err.loc[3], scr.loc[1] to err.loc[1], and scr.loc[2] to err.loc[2]? Or are you comparing each row in scr to each row in err?
Looking at the desired_result, I don't understand how scr.loc[2] could be in the desired_result since, using err.loc[2], (err['errd'] >= scr['scrd']) evaluates to False. And, following the same methodology, scr.loc[1] should be in desired_result.
>>> err[err['p'] == '10101.A']
p errd errq
3 10101.A 13-5-2022 1
1 10101.A 16-5-2022 1
2 10101.A 4-5-2022 1
>>> scr[scr['p'] == '10101.A']
p scrd scrq n
0 10101.A 10-5-2022 1 7000000051481339
1 10101.A 10-5-2022 1 7000000051481342
2 10101.A 9-5-2022 1 7000000051722237
>>> desired_result
report match_err_scr
0 7000000051722237 10101.A
1 7000000051481339 10101.A
2 7000000051721987 10102.A
3 7000000051721990 10102.A
4 7000000052018581 10102.A
5 7000000051530150 10147.A
6 7000000051530152 10147.A
7 7000000051530157 10147.A
8 7000000051546193 10147.A
9 7000000051761150 10147.A
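For what it's worth, here is a greedy allocation sketch that reproduces desired_result on this sample. Two assumptions are worth flagging: the date columns are plain strings, so >= compares them lexicographically ('13-5-2022' < '9-5-2022'), and parsing them with pd.to_datetime(dayfirst=True) makes the comparisons chronological; the sketch also assumes each scr row should consume quantity from the earliest eligible err row.

import pandas as pd

# Parse the day-first date strings so >= is chronological, not lexicographic.
err['errd'] = pd.to_datetime(err['errd'], dayfirst=True)
scr['scrd'] = pd.to_datetime(scr['scrd'], dayfirst=True)

remaining = err.sort_values('errd').copy()  # consume earliest err dates first
matches = []
for _, row in scr.sort_values(['p', 'scrd']).iterrows():
    # Eligible err rows: same p, errd >= scrd, enough quantity left.
    elig = remaining[(remaining['p'] == row['p'])
                     & (remaining['errd'] >= row['scrd'])
                     & (remaining['errq'] >= row['scrq'])]
    if not elig.empty:
        e = elig.index[0]                    # earliest eligible err row
        remaining.loc[e, 'errq'] -= row['scrq']
        matches.append({'report': row['n'], 'match_err_scr': row['p']})

report = pd.DataFrame(matches)

On the sample above this yields the ten rows of desired_result; whether earliest-first consumption is the right tie-break for the real data is something only the original VBA logic can confirm.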
Related
I have a subset dataframe from a much larger dataframe. I need to be able to create a for loop that searches through a dataframe and pulls out the data corresponding to the correct name.
import pandas as pd
import numpy as np
import re

data = {'Name': ['CH_1', 'CH_2', 'CH_3', 'FV_1', 'FV_2', 'FV_3'],
        'Value': [1, 2, 3, 4, 5, 6]
        }
df = pd.DataFrame(data)

FL = [17.7, 60.0]
CH = [20, 81.4]
tol = 8
time1 = FL[0] + tol
time2 = FL[1] + tol
time3 = CH[0] + tol
time4 = CH[1] + tol
FH_mon = df['Values'] * 5
workpercent = [.7, .92, .94]
mhpy = [2087, 2503, 3128.75]
list1 = list()
list2 = list()
for x in df['Name']:
    if x == [(re.search('FV_', s)) for s in df['Name'].values]:
        y = np.select([FH_mon < time1, (FH_mon >= time1) and (FH_mon < time2), FH_mon > time2], [workpercent[0], workpercent[1], workpercent[2]])
        z = np.select([FH_mon < time1, (FH_mon >= time1) and (FH_mon < time2), FH_mon > time2], [mhpy[0], mhpy[1], mhpy[2]])
    if x == [(re.search('CH_', s)) for s in df['Name'].values]:
        y = np.select([FH_mon < time3, (FH_mon >= time3) and (FH_mon < time4)], [workpercent[0], workpercent[1]])
        z = np.select([FH_mon < time3, (FH_mon >= time3) and (FH_mon < time4)], [mhpy[0], mhpy[1]])
    list1.append(y)
    list2.append(z)
I had a simpler version earlier where I was just adding a couple of numbers, and I was getting really helpful answers to how I asked my question, but here is the more complex version. I need to search through the Name column, and any time there is an FV in the name, the if branch should run using the data from that row; the same goes for CH. I have the lists to keep track of each value as the loop moves through the Name column. If there is a simpler way I would really appreciate seeing it, but right now this seems like the cleanest way, yet I am receiving errors or the loop does not function properly.
This should be what you want:
for index, row in df.iterrows():
    if re.search("FV_", row["Name"]):
        df.loc[index, "Value"] += 2
    elif re.search("CH_", row["Name"]):
        df.loc[index, "Value"] += 4
If the "Name" column only has values starting with "FV_" or "CH_", use where:
df["Value"] = df["Value"].add(2).where(df["Name"].str.startswith("FV_"), df["Value"].add(4))
If you might have other values in "Name", use numpy.select:
import numpy as np
df["Value"] = np.select([df["Name"].str.startswith("FV_"), df["Name"].str.startswith("CH_")], [df["Value"].add(2), df["Value"].add(4)])
Output:
>>> df
Name Value
0 CH_1 5
1 CH_2 6
2 CH_3 7
3 FV_1 6
4 FV_2 7
5 FV_3 8
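Applied back to the question's original thresholds, the same np.select idea might look like the sketch below. Two caveats: df['Values'] in the question appears to be a typo for df['Value'], and Python's and has to become & for element-wise logic on Series.

import numpy as np

# Sketch reusing the question's own time1/time2, workpercent and mhpy.
FH_mon = df['Value'] * 5
y = np.select([FH_mon < time1, (FH_mon >= time1) & (FH_mon < time2), FH_mon > time2],
              [workpercent[0], workpercent[1], workpercent[2]])
z = np.select([FH_mon < time1, (FH_mon >= time1) & (FH_mon < time2), FH_mon > time2],
              [mhpy[0], mhpy[1], mhpy[2]])

This produces one value per row in a single pass, so no outer loop over df['Name'] is needed for the FV_ case; the CH_ thresholds (time3/time4) would get their own np.select in the same way.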
I need to write code for a pandas DataFrame. The values in the ID column are checked sequentially for equality. Three situations arise here. Case 1: if the ID differs from the one in the next row, write "unique" in the Comment column. Case 2: if the ID matches the next row but differs from the one after that, write "ring" in the Comment column. Case 3: if the ID is repeated over multiple following rows, write "multi" in the Comment column. Case 4: do this until all rows in the ID column have been processed.
import pandas as pd

df = pd.read_csv('History-s.csv')
a = len(df['ID'])
c = 0
while a != 0:
    c += 1
    while df['ID'][i] == df['ID'][i + 1]:
        if c == 2:
            if df['Nod 1'][i] == df['Nod 2'][i + 1]:
                df['Comment'][i] = "Ring"
                df['Comment'][i + 1] = "Ring"
            else:
                df['Comment'][i] = "Multi"
                df['Comment'][i + 1] = "Multi"
        elif c > 2:
            df['Comment'][i] = "Multi"
            df['Comment'][i + 1] = "Multi"
        i += 1
    else:
        df['Comment'][i] = "Unique"
    a = a - 1
print(df, '\n')
Data is like this:
[Data screenshot]
After coding, the data frame should be like this:
[Result screenshot]
From the input dataframe you have provided, my first impression was that, since you are checking the next line in a while loop, you are strictly considering just the next coming line, for example:
ID  value  comment
1   2      MULTI
1   3      RING
3   4      UNIQUE
But if that is not the case, you can simply use the pandas groupby function:
def func(df):
    if len(df) > 2:
        df['comment'] = 'MULTI'
    elif len(df) == 2:
        df['comment'] = 'RING'
    else:
        df['comment'] = 'UNIQUE'
    return df

df = df.groupby(['ID']).apply(func)
Output:
ID value comment
0 1 2 RING
1 1 3 RING
2 3 4 UNIQUE
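If the group size alone decides the label, a transform-based sketch (an alternative to the apply above, on the same assumption) avoids calling a Python function per group:

import numpy as np

# Size of each ID group, broadcast back to every row of that group.
sizes = df.groupby('ID')['ID'].transform('size')
df['comment'] = np.select([sizes > 2, sizes == 2], ['MULTI', 'RING'], 'UNIQUE')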
I would like to convert my dataframe from one format (X:XX:XX:XX) of values to another (X.X seconds).
Here is what my dataframe looks like:
Start End
0 0:00:00:00
1 0:00:00:00 0:07:37:80
2 0:08:08:56 0:08:10:08
3 0:08:13:40
4 0:08:14:00 0:08:14:84
And I would like to transform it into seconds, something like this:
Start End
0 0.0
1 0.0 457.80
2 488.56 490.80
3 493.40
4 494.0 494.84
To do that I did:
i = 0
j = 0
while j < 10:
    while i < 10:
        if data.iloc[i, j] != "":
            Value = (int(data.iloc[i, j][0]) * 3600) + (int(data.iloc[i, j][2:4]) * 60) + int(data.iloc[i, j][5:7]) + (int(data.iloc[i, j][8:10]) / 100)
            NewValue = data.iloc[:, j].replace([data.iloc[i, j]], Value)
            i += 1
        else:
            NewValue = data.iloc[:, j].replace([data.iloc[i, j]], "")
            i += 1
    data.update(NewValue)
    i = 0
    j += 1
But I failed to replace the new values in my original dataframe in a permanent way; when I do
print(data)
I still get my old data frame in the wrong format.
Could someone help me? I have tried so hard!
Thank you so much!
You are using pandas.DataFrame.update, which requires a pandas dataframe as an argument. See the Example part of the update function documentation to really understand what update does: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
If I may suggest a more idiomatic solution: you can directly map a function to all values of a pandas Series.
def parse_timestring(s):
    if s == "":
        return s
    else:
        # weird to use centiseconds and not milliseconds
        # l is a list with [hour, minute, second, cs]
        l = [int(nbr) for nbr in s.split(":")]
        return sum([a * b for a, b in zip(l, (3600, 60, 1, 0.01))])
df["Start"] = df["Start"].map(parse_timestring)
You can remove the if ... else ... from parse_timestring if you first replace all empty strings in your dataframe with NaN values (df = df.replace("", numpy.nan)) and then use df["Start"] = df["Start"].map(parse_timestring, na_action='ignore');
see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
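Put together, that variant might look like this short sketch (numpy imported for the NaN sentinel):

import numpy as np

# Empty cells become NaN, which map() then skips entirely.
df = df.replace("", np.nan)
df["Start"] = df["Start"].map(parse_timestring, na_action='ignore')
df["End"] = df["End"].map(parse_timestring, na_action='ignore')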
The datetime library is made to deal with such data. You should also use the apply function of pandas to avoid iterating over the dataframe like that.
You should proceed as follows:
from datetime import datetime, timedelta

def to_seconds(date):
    comp = date.split(':')
    delta = (datetime.strptime(':'.join(comp[1:]), "%H:%M:%S") - datetime(1900, 1, 1)) + timedelta(days=int(comp[0]))
    return delta.total_seconds()

data['Start'] = data['Start'].apply(to_seconds)
data['End'] = data['End'].apply(to_seconds)
Thank you so much for your help.
Your method was working. I also found a method using a loop.
To summarize, my general problem was that I had an ugly csv file that I wanted to transform into a csv usable for doing statistics, and I wanted to use python to do that.
my csv file was like:
MiceID = 1 Beginning End Type of behavior
0 0:00:00:00 Video start
1 0:00:01:36 grooming type 1
2 0:00:03:18 grooming type 2
3 0:00:06:73 0:00:08:16 grooming type 1
So in my ugly csv file, when different types of behaviors directly followed each other, I was writing only the moment of the beginning of each behavior type without its end; I was writing the moment of the end of a behavior only when the mouse stopped grooming entirely, which allowed me to separate sequences of grooming. But this type of csv was not usable for easily making statistics.
So I wanted to 1) transform all my values into seconds to get a correct format, 2) fill the gaps in the End column (a gap has to be filled with the following Beginning value, as the end of a specific behavior in a sequence is the beginning of the following one), 3) create columns for the duration of each behavior, and finally 4) fill this new column with the durations.
My question was about the first step, but I put here the code for each step separately:
step 1: transform the values into a good format
import pandas as pd
import numpy as np

data = pd.read_csv("D:/Python/TestPythonTraitementDonnéesExcel/RawDataBatch2et3.csv", engine="python")
data.replace(np.nan, "", inplace=True)

i = 0
j = 0
while j < len(data.columns):
    while i < len(data.index):
        if (":" in data.iloc[i, j]) == True:
            Value = str((int(data.iloc[i, j][0]) * 3600) + (int(data.iloc[i, j][2:4]) * 60) + int(data.iloc[i, j][5:7]) + (int(data.iloc[i, j][8:10]) / 100))
            data = data.replace([data.iloc[i, j]], Value)
            data.update(data)
            i += 1
        else:
            i += 1
    i = 0
    j += 1
print(data)
step 2: fill the gaps
i = 0
j = 2
while j < len(data.columns):
    while i < len(data.index) - 1:
        if data.iloc[i, j] == "":
            data.iloc[i, j] = data.iloc[i + 1, j - 1]
            data.update(data)
            i += 1
        elif np.all(data.iloc[i:len(data.index), j] == ""):
            break
        else:
            i += 1
    i = 0
    j += 4
print(data)
step 3: create a new column for each mouse:
j = 1
k = 0
while k < len(data.columns) - 1:
    k = (j * 4) + (j - 1)
    data.insert(k, "Duree{}".format(k), "")
    data.update(data)
    j += 1
print(data)
step 4: fill the duration columns
j = 4
i = 0
while j < len(data.columns):
    while i < len(data.index):
        if data.iloc[i, j - 2] != "":
            data.iloc[i, j] = str(float(data.iloc[i, j - 2]) - float(data.iloc[i, j - 3]))
            data.update(data)
            i += 1
        else:
            break
    i = 0
    j += 5
print(data)
And of course, export my new usable dataframe:
data.to_csv(r"D:/Python/TestPythonTraitementDonnéesExcel/FichierPropre.csv", index = False, header = True)
Here are the transformations (click on the links for the pictures):
before step 1
after step 1
after step 2
after step 3
after step 4
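For reference, a more pandas-idiomatic sketch of steps 1, 2 and 4 for a single mouse; the column names 'Begin' and 'End' are hypothetical stand-ins for each per-mouse column pair:

import pandas as pd

def to_sec(s):
    # 'H:MM:SS:CC' string to seconds; anything else becomes <NA>
    if not isinstance(s, str) or ":" not in s:
        return pd.NA
    h, m, sec, cs = (int(p) for p in s.split(":"))
    return h * 3600 + m * 60 + sec + cs / 100

data['Begin'] = data['Begin'].map(to_sec)                   # step 1
data['End'] = data['End'].map(to_sec)
data['End'] = data['End'].fillna(data['Begin'].shift(-1))   # step 2
data['Duree'] = data['End'] - data['Begin']                 # step 4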
I have a piece of code that takes forever to run. Does anybody know how to optimize it?
The purpose of the formula is to make a column that does the following: when 'action' != 0, if 'PX_LAST' < 'ma', populate 'buy_sell' with -1; if 'PX_LAST' > 'ma', populate 'buy_sell' with 1; in the other cases, do not populate 'buy_sell' with new values.
Fyi, the 'action' column is populated with either 0 or 1.
# create column
df_zinc['buy_sell'] = 0

index = 0
while index < df_zinc.shape[0]:
    if df_zinc['action'][index] != 0:
        continue
    if df_zinc['PX_LAST'][index] < df_zinc['ma'][index]:
        df_zinc.loc[index, 'buy_sell'] = -1
    elif df_zinc['PX_LAST'][index] > df_zinc['ma'][index]:
        df_zinc.loc[index, 'buy_sell'] = 1
    else:
        index = index + 1
I think you need:
import numpy as np
mask1 = df_zinc['action'] != 0
mask2 = df_zinc['PX_LAST'] < df_zinc['ma']
mask3 = df_zinc['PX_LAST'] > df_zinc['ma']
df_zinc['buy_sell'] = np.select([mask1 & mask2, mask1 & mask3], [-1,1], 0)
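If "do not populate with new values" means leaving the existing 'buy_sell' entries untouched rather than writing the 0 default, a masked-assignment variant of the same idea (reusing mask1/mask2/mask3 from above) would be:

# Rows matching neither mask keep whatever 'buy_sell' already holds.
df_zinc.loc[mask1 & mask2, 'buy_sell'] = -1
df_zinc.loc[mask1 & mask3, 'buy_sell'] = 1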
I'm working on forex data like this:
0 1 2 3
1 AUD/JPY 20040101 00:01:00.000 80.598 80.598
2 AUD/JPY 20040101 00:02:00.000 80.595 80.595
3 AUD/JPY 20040101 00:03:00.000 80.562 80.562
4 AUD/JPY 20040101 00:04:00.000 80.585 80.585
5 AUD/JPY 20040101 00:05:00.000 80.585 80.585
I want to go through columns 2 and 3 and remove the rows in which the value is repeated more than 15 times in a row. So far I have managed to produce this piece of code:
price = 0
drop_start = 0
counter = 0
df_new = df

for i, r in df.iterrows():
    if r.iloc[2] != price:
        if counter >= 15:
            df_new = df_new.drop(df_new.index[drop_start:i])
        price = r.iloc[2]
        counter = 1
        drop_start = i
    if r.iloc[2] == price:
        counter = counter + 1

price = 0
drop_start = 0
counter = 0
df = df_new

for i, r in df.iterrows():
    if r.iloc[3] != price:
        if counter >= 15:
            df_new = df_new.drop(df_new.index[drop_start:i])
        price = r.iloc[3]
        counter = 1
        drop_start = i
    if r.iloc[3] == price:
        counter = counter + 1

print(df_new.info())
df_new.to_csv('df_new.csv', index=False, header=None)
Unfortunately, when I check the output file there are some mistakes: some weekends have not been removed by the program. How should I build my algorithm so that it removes the duplicated values correctly?
The first 250k rows of my initial dataset are available here: https://ufile.io/omg5h
The output of this program for that sample data is available here: https://ufile.io/2gc3d
You can see that in the output file the rows 6931+ were not successfully removed.
The problem with your algorithm is that you are not holding a separate counter for each row value, but rather incrementing one counter through the loop; I believe this is what makes the result false. Also, the comparison r.iloc[2] != price does not make sense, because you change the value of price every iteration, so if there are elements between the duplicates this check does not serve a proper function. I wrote a small piece of code to copy the behavior you asked for:
df = pd.DataFrame([[0, 0.5, 2.5], [0, 1, 2], [0, 1.5, 2.5], [0, 2, 3], [0, 2, 3], [0, 3, 4],
                   [0, 4, 5]], columns=['A', 'B', 'C'])
df_new = df
dict = {}

print('Initial DF')
print(df)
print()

for i, r in df.iterrows():
    counter = dict.get(r.iloc[1])
    if counter == None:
        counter = 0
    dict[r.iloc[1]] = counter + 1
    if dict[r.iloc[1]] >= 2:
        df_new = df_new[df_new.B != r.iloc[1]]

print('2nd col. deleted DF')
print(df_new)
print()

df_fin = df_new
dict2 = {}

for i, r in df_new.iterrows():
    counter = dict2.get(r.iloc[2])
    if counter == None:
        counter = 0
    dict2[r.iloc[2]] = counter + 1
    if dict2[r.iloc[2]] >= 2:
        df_fin = df_fin[df_fin.C != r.iloc[2]]

print('3rd col. deleted DF')
print(df_fin)
Here, I hold a counter value for each unique value in the rows of columns 2 and 3. Then, according to the threshold (which is 2 in this case), I remove the rows which exceed the threshold. I first eliminate values according to the 2nd column, then forward this modified frame to the next loop and eliminate values according to the 3rd column, finishing the process.
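Note that the counters above drop every value that repeats anywhere, not only values repeated 15+ times consecutively. If "repeated in a row" means consecutive rows (an assumption on my part), a run-length sketch using the question's integer column labels would target runs directly:

# Label each run of consecutive equal values in column 2, then keep
# only the rows belonging to runs shorter than 15.
run_id = (df[2] != df[2].shift()).cumsum()
run_len = df.groupby(run_id)[2].transform('size')
df_new = df[run_len < 15]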