Query data frame in python pandas, can't save query

I have a list of data frames that I'm opening in a for loop. For each data frame I want to query a portion of it and find the average.
This is what I have so far:
import pandas as pd

k = 0
for i in open('list.txt', 'r'):
    k = k + 1
    i_name = i.strip()
    df = pd.read_csv(i_name, sep='\t')
    # Create queries
    A = df.query('location == 1' and '1000 >= start <= 120000000')
    B = df.query('location == 10' and '2000000 >= start <= 60000000')
    print(A)
    print(B)
    # Find average
    avgA = (sum(A['height']) / len(A['height']))
    print(avgA)
    avgB = (sum(B['height']) / len(B['height']))
    print(avgB)
The problem is that I'm not getting the average values I expect (compared with doing it manually in Excel). Printing the query prints the entire data frame, so I'm not sure whether there's a problem with how I'm querying the data.
Am I correctly assigning the values A and B to the queries? Is there another way to do this that doesn't involve saving every data frame as a CSV? I have many queries to create and don't want to save each intermediate query for hundreds of samples, as I'm only interested in the average.

This does not do what you expect:
A = df.query('location == 1' and '1000 >= start <= 120000000')
B = df.query('location == 10' and '2000000 >= start <= 60000000')
You are doing the Python "and" of two strings. Since the first string is truthy, the result of that expression is just the second string, "1000 >= start <= 120000000", so that is all that gets passed to query().
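You can check this directly in a Python shell:
# the "and" of two non-empty strings evaluates to the second string
expr = 'location == 1' and '1000 >= start <= 120000000'
print(expr)  # 1000 >= start <= 120000000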
You want the "and" to be inside the query:
A = df.query('location == 1 and 1000 >= start <= 120000000')
B = df.query('location == 10 and 2000000 >= start <= 60000000')
Secondly, you have the inequality operators backwards. The first one is only going to get values less than or equal to 1000. What you really want is:
A = df.query('location == 1 and 1000 <= start <= 120000000')
B = df.query('location == 10 and 2000000 <= start <= 60000000')
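As a side note, pandas Series have a mean() method, so the manual sum/len is not needed (a small sketch reusing the column name from the question):
avgA = A['height'].mean()
avgB = B['height'].mean()
print(avgA)
print(avgB)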

How can this for loop be written to process faster in Python?

I'm not familiar enough with Python to understand how I can make a for loop go faster. Here's what I'm trying to do.
Let's say we have the following dataframe of prices.
import pandas as pd
df = pd.DataFrame.from_dict({'price': {0: 98, 1: 99, 2: 101, 3: 99, 4: 97, 5: 100, 6: 100, 7: 98}})
The goal is to create a new column called updown, which classifies each row as "up" or "down", signifying what comes first when looking at each subsequent row - up by 2, or down by 2.
df['updown'] = 0
for i in range(df.shape[0]):
    j = 0
    while df.price.iloc[i+j] < (df.price.iloc[i] + 2) and df.price.iloc[i+j] > (df.price.iloc[i] - 2):
        j = j + 1
    if df.price.iloc[i+j] >= (df.price.iloc[i] + 2):
        df.updown.iloc[i] = "Up"
    if df.price.iloc[i+j] <= (df.price.iloc[i] - 2):
        df.updown.iloc[i] = "Down"
This works just fine, but simply runs too slow when running on millions of rows. Note that I am aware the code throws an error once it gets to the last row, which is fine with me.
Where can I learn how to make something like this run much faster (ideally in seconds, or at least minutes, as opposed to the 10+ hours it takes right now)?
Running through a bunch of different examples, the second method in the following code is approximately 75x faster for the example dataset:
import pandas as pd, numpy as np
from random import randint
import time

data = [randint(90, 120) for i in range(10000)]

df1 = pd.DataFrame({'price': data})
t0 = time.time()
df1['updown'] = np.nan
count = df1.shape[0]
for i in range(count):
    j = 1
    up = df1.price.iloc[i] + 2
    down = up - 4
    while (pos := i + j) < count:
        if (value := df1.price.iloc[pos]) >= up:
            df1.loc[i, 'updown'] = "Up"
            break
        elif value <= down:
            df1.loc[i, 'updown'] = "Down"
            break
        else:
            j = j + 1
t1 = time.time()
print(f'Method 1: {t1 - t0}')
res1 = df1.head()

df2 = pd.DataFrame({'price': data})
t2 = time.time()
count = len(df2)
df2['updown'] = np.nan
up = df2.price + 2
down = df2.price - 2
# increase shift range until updown is set for all rows
# or there is insufficient data to change remaining rows
i = 1
while (i < count) and (not (isna := df2.updown.isna()) is None and ((i == 1) or (isna[:-(i - 1)].any()))):
    shift = df2.price.shift(-i)
    df2.loc[isna & (shift >= up), 'updown'] = 'Up'
    df2.loc[isna & (shift <= down), 'updown'] = 'Down'
    i += 1
t3 = time.time()
print(f'Method 2: {t3 - t2}')

s1 = df1.updown
s2 = df2.updown
match = (s1.isnull() == s2.isnull()).all() and (s1[s1.notnull()] == s2[s2.notnull()]).all()
print(f'Series match: {match}')
The main reason for the speed improvement is that instead of iterating across the rows in Python, we are doing operations on whole arrays of data, which all happen in C code. While a Python call into pandas or numpy (which are C libraries) is quite quick, there is some overhead, and if you make such calls many times the overhead very quickly becomes the limiting factor.
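As a rough illustration of that per-call overhead (a toy benchmark, not part of the original answer), compare processing a Series element by element with a single vectorized call:
import time
import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(90, 120, 1_000_000))

t0 = time.time()
total = 0
for v in s:              # one Python-level operation per element
    total += v + 2
print(f'python loop: {time.time() - t0:.3f}s')

t0 = time.time()
total = (s + 2).sum()    # the loop happens inside numpy's C code
print(f'vectorized : {time.time() - t0:.3f}s')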
The performance increase is dependent on input data, but scales with the number of rows in the dataframe: the more rows the slower it is to iterate:
     rows     method1   method2     increase
0     100    0.056002  0.018267     3.065689
1    1000    0.209895  0.005000    41.982070
2   10000    2.625701  0.009001   291.727054
3  100000  108.080149  0.042001  2573.260448
There are various errors stopping the example code from working, at least for me. Could you please confirm this is what you want the algorithm to do?
import pandas as pd

df = pd.DataFrame.from_dict({'price': {0: 98, 1: 99, 2: 101, 3: 99, 4: 97, 5: 100, 6: 100, 7: 98}})
df['updown'] = 0
count = df.shape[0]
for i in range(count):
    j = 1
    up = df.price.iloc[i] + 2
    down = up - 4
    while (pos := i + j) < count:
        if (value := df.price.iloc[pos]) >= up:
            df.loc[i, 'updown'] = "Up"
            break
        elif value <= down:
            df.loc[i, 'updown'] = "Down"
            break
        else:
            j = j + 1
print(df)

Fastest way to count event occurences in a Pandas dataframe?

I have a Pandas dataframe with ~100,000,000 rows and 3 columns (Names str, Time int, and Values float), which I compiled from ~500 CSV files using glob.glob(path + '/*.csv').
Two different names alternate in the data. The job is to go through the data and count the number of times a value associated with a specific name, ABC, deviates from its preceding value by ±100, given that the previous 50 values for that name did not deviate by more than ±10.
I initially solved it with a for loop function that iterates through each row, as shown below. It checks for the correct name, then checks the stability of the previous values of that name, and finally adds one to the count if there is a large enough deviation.
import numpy as np

# names and values hold the Names and Values columns of the dataframe
count = 0
stabilityTime = 0
i = 0
if names[0] == "ABC":
    j = values[0]
    stability = np.full(50, values[0])
else:
    j = values[1]
    stability = np.full(50, values[1])
for name in names:
    value = values[i]
    if name == "ABC":
        if j - 10 < value < j + 10:
            stabilityTime += 1
        if stabilityTime >= 50 and np.std(stability) < 10:
            if value > j + 100 or value < j - 100:
                stabilityTime = 0
                count += 1
        stability = np.roll(stability, -1)
        stability[-1] = value
        j = value
    i += 1
Naturally, this process takes a very long computing time. I have looked at NumPy vectorization, but do not see how I can apply it in this case. Is there some way I can optimize this?
Thank you in advance for any advice!
Bonus points if you can give me a way to concatenate all the data from every CSV file in the directory that is faster than glob.glob(path + '/*.csv').
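For the bonus part, a common pattern (a sketch, assuming the files share the three columns described above) is to list the files with glob and concatenate them in a single call:
import glob
import pandas as pd

files = glob.glob(path + '/*.csv')   # path is assumed to be defined as in the question
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)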

Making permanent change in a dataframe using python pandas

I would like to convert my dataframe's values from one format (X:XX:XX:XX) to another (X.X seconds).
Here is what my dataframe looks like:
        Start         End
0  0:00:00:00
1  0:00:00:00  0:07:37:80
2  0:08:08:56  0:08:10:08
3  0:08:13:40
4  0:08:14:00  0:08:14:84
And I would like to transform it into seconds, something like this:
    Start     End
0     0.0
1     0.0  457.80
2  488.56  490.80
3  493.40
4   494.0  494.84
To do that I did:
i = 0
j = 0
while j < 10:
    while i < 10:
        if data.iloc[i, j] != "":
            Value = (int(data.iloc[i, j][0]) * 3600) + (int(data.iloc[i, j][2:4]) * 60) + int(data.iloc[i, j][5:7]) + (int(data.iloc[i, j][8:10]) / 100)
            NewValue = data.iloc[:, j].replace([data.iloc[i, j]], Value)
            i += 1
        else:
            NewValue = data.iloc[:, j].replace([data.iloc[i, j]], "")
            i += 1
    data.update(NewValue)
    i = 0
    j += 1
But I failed to replace the values in my original dataframe in a permanent way; when I do:
print(data)
I still get my old data frame in the wrong format.
Could someone help me? I have tried so hard!
Thank you so so much!
You are using pandas.DataFrame.update, which requires a pandas dataframe as an argument. See the Examples section of the update documentation to really understand what update does: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
If I may suggest a more idiomatic solution: you can directly map a function over all values of a pandas Series.
def parse_timestring(s):
    if s == "":
        return s
    else:
        # weird to use centiseconds and not milliseconds
        # l is a list with [hour, minute, second, cs]
        l = [int(nbr) for nbr in s.split(":")]
        return sum([a * b for a, b in zip(l, (3600, 60, 1, 0.01))])

df["Start"] = df["Start"].map(parse_timestring)
You can remove the if ... else ... from parse_timestring if you replace all empty strings with nan values in your dataframe with df = df.replace("", numpy.nan) and then use df["Start"] = df["Start"].map(parse_timestring, na_action='ignore').
see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html
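That variant would look roughly like this (a sketch of the same approach):
import numpy as np

def parse_timestring(s):
    # l is a list with [hour, minute, second, centisecond]
    l = [int(nbr) for nbr in s.split(":")]
    return sum([a * b for a, b in zip(l, (3600, 60, 1, 0.01))])

df = df.replace("", np.nan)
df["Start"] = df["Start"].map(parse_timestring, na_action='ignore')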
The datetime library is made to deal with such data. You should also use the apply function of pandas to avoid iterating over the dataframe like that.
You should proceed as follows:
from datetime import datetime, timedelta

def to_seconds(date):
    comp = date.split(':')
    delta = (datetime.strptime(':'.join(comp[1:]), "%H:%M:%S") - datetime(1900, 1, 1)) + timedelta(days=int(comp[0]))
    return delta.total_seconds()

data['Start'] = data['Start'].apply(to_seconds)
data['End'] = data['End'].apply(to_seconds)
Thank you so much for your help.
Your method worked. I also found a method using loops.
To summarize, my general problem was that I had an ugly CSV file that I wanted to transform into a CSV usable for doing statistics, and I wanted to use Python to do that.
My CSV file looked like this:
MiceID = 1   Beginning    End          Type of behavior
0            0:00:00:00                Video start
1            0:00:01:36                grooming type 1
2            0:00:03:18                grooming type 2
3            0:00:06:73   0:00:08:16   grooming type 1
So in my ugly CSV file I recorded only the moment a behavior began, without an end, when the different types of behavior directly followed each other, and I recorded the moment the behavior ended only when the mouse stopped grooming altogether; that allowed me to separate sequences of grooming. But this type of CSV was not usable for easily doing statistics.
So I wanted to 1) transform all my values into seconds to have a correct format, 2) fill the gaps in the End column (a gap has to be filled with the following Beginning value, as the end of a specific behavior in a sequence is the beginning of the following one), 3) create columns corresponding to the duration of each behavior, and finally 4) fill this new column with the durations.
My question was about the first step, but here is the code for each step separately.
Step 1: transform the values into the right format
import pandas as pd
import numpy as np

data = pd.read_csv("D:/Python/TestPythonTraitementDonnéesExcel/RawDataBatch2et3.csv", engine="python")
data.replace(np.nan, "", inplace=True)
i = 0
j = 0
while j < len(data.columns):
    while i < len(data.index):
        if (":" in data.iloc[i, j]) == True:
            Value = str((int(data.iloc[i, j][0]) * 3600) + (int(data.iloc[i, j][2:4]) * 60) + int(data.iloc[i, j][5:7]) + (int(data.iloc[i, j][8:10]) / 100))
            data = data.replace([data.iloc[i, j]], Value)
            data.update(data)
            i += 1
        else:
            i += 1
    i = 0
    j += 1
print(data)
Step 2: fill the gaps in the End columns
i = 0
j = 2
while j < len(data.columns):
    while i < len(data.index) - 1:
        if data.iloc[i, j] == "":
            data.iloc[i, j] = data.iloc[i + 1, j - 1]
            data.update(data)
            i += 1
        elif np.all(data.iloc[i:len(data.index), j] == ""):
            break
        else:
            i += 1
    i = 0
    j += 4
print(data)
Step 3: create a new duration column for each mouse:
j = 1
k = 0
while k < len(data.columns) - 1:
    k = (j * 4) + (j - 1)
    data.insert(k, "Duree{}".format(k), "")
    data.update(data)
    j += 1
print(data)
Step 4: fill the new duration columns
j = 4
i = 0
while j < len(data.columns):
    while i < len(data.index):
        if data.iloc[i, j - 2] != "":
            data.iloc[i, j] = str(float(data.iloc[i, j - 2]) - float(data.iloc[i, j - 3]))
            data.update(data)
            i += 1
        else:
            break
    i = 0
    j += 5
print(data)
And of course, export my new usable dataframe
data.to_csv(r"D:/Python/TestPythonTraitementDonnéesExcel/FichierPropre.csv", index = False, header = True)
The original post linked screenshots of the dataframe before step 1 and after each of the four steps.

Want to optimize my code for finding overlapping times in a large number of records (pandas)

I have a data table consisting of 100,000 records with 50 columns. Each record has a start time, an end time, and an equipment key for the node it belongs to. Records are stored when nodes are down: the start time is when the node goes down, and the end time is when it comes back up. If there are multiple records with the same equipment key whose start and end times fall inside a previous record's start and end times, we say the new record has overlapping time and we need to ignore it. To find these overlapping records, I have written a function and apply it to the dataframe, but it is taking a long time. I am not very experienced with optimization, so I am seeking any suggestions.
sitecode_info = []

def check_overlapping_sitecode(it):
    sitecode = it['equipmentkey']
    fo = it['firstoccurrence']
    ct = it['cleartimestamp']
    if len(sitecode_info) == 0:
        sitecode_info.append({
            'sc': sitecode,
            'fo': fo,
            'ct': ct
        })
        return 0
    else:
        for list_item in sitecode_info:
            for item in list_item.keys():
                if item == 'sc':
                    if list_item[item] == sitecode:
                        # print("matched")
                        if fo >= list_item['fo'] and ct <= list_item['ct'] or \
                                fo >= list_item['fo'] and fo <= list_item['ct'] and ct >= list_item['ct'] or \
                                fo <= list_item['fo'] and ct >= list_item['ct'] or \
                                fo <= list_item['fo'] and ct >= list_item['fo'] and ct <= list_item['ct']:
                            return 1
                        else:
                            sitecode_info.append({
                                'sc': sitecode,
                                'fo': fo,
                                'ct': ct
                            })
                            return 0
                    else:
                        sitecode_info.append({
                            'sc': sitecode,
                            'fo': fo,
                            'ct': ct
                        })
                        return 0
I am calling it as follows:
temp_df['false_alarms'] = temp_df.apply(check_overlapping_sitecode, axis=1)
I think you were just iterating over that list of dictionaries a touch too much.
EDIT: Added appending the fo's and ct's even if the method returns 1, for better accuracy.
import time

'''
setting an empty dictionary.
this will look like: {sc1: [[fo, ct], [fo, ct]],
                      sc2: [[fo, ct], [fo, ct]]}
the keys are just the site_code,
this way we don't have to iterate over all of the fo's and ct's,
just the ones related to that site code.
'''
sitecode_info = {}

# i set up a dataframe with 200000 rows x 50 columns
def check_overlapping_sitecode(site_code, fo, ct):
    try:
        # try to grab the existing site_code information from the sitecode_info dict.
        # if that fails, go ahead and create it while also returning 0 for that site_code
        my_list = sitecode_info[site_code]
        # if it works, go through that site's list.
        for fo_old, ct_old in my_list:
            # if the first occurrence is >= the old first occurrence and <= the old cleartimestamp
            if fo >= fo_old and fo <= ct_old:
                sitecode_info[site_code].append([fo, ct])
                return 1
            # same but for the cleartimestamp instead
            elif ct <= ct_old and ct >= fo_old:
                sitecode_info[site_code].append([fo, ct])
                return 1
            else:
                # if it doesn't overlap at all, append [fo, ct] to that site's list
                sitecode_info[site_code].append([fo, ct])
                return 0
    except KeyError:
        # set the key to a list in a list if the site_code isn't known yet
        sitecode_info[site_code] = [[fo, ct]]
        return 0

t = time.time()
"""Here's the real meat and potatoes.
using a lambda function to call the method "check_overlapping_sitecode".
lambda: x where x is the row
return the output of check_overlapping_sitecode
"""
temp_df['false_alarms'] = temp_df.apply(lambda x: check_overlapping_sitecode(x['equipmentkey'], x['firstoccurrence'], x['cleartimestamp']), axis=1)
print(time.time() - t)
# this code runs in nearly 6 seconds for me.
# then you can do whatever you want with your DF.

How to form another column in a pd.DataFrame out of different variables

I'm trying to make a new boolean variable with an if-statement that combines conditions on several other variables. But so far my many attempts do not work, even with a single variable as the parameter.
(Screenshot in the original post: head of the used columns in the data frame.)
I would really appreciate it if any of you can see the problem; I have already searched the whole World Wide Web for two days, but as a beginner I couldn't find the solution yet.
amount = df4['AnzZahlungIDAD']
time = df4['DLZ_SCHDATSCHL']
Erstr = df4['Schadenwert']
Zahlges = df4['zahlgesbrut']
timequantil = time.quantile(.2)
diff = (Erstr-Zahlges)/Erstr*100
diffrange = [(diff <=15) & (diff >= -15)]
special = df4[['Taxatoreneinsatz', 'Belegpruefereinsatz_rel', 'IntSVKZ', 'ExtTechSVKZ']]
First Method with list comprehension
label = []
label = [True if (amount[i] <= 1) & (time[i] <= timequantil) & (diff == diffrange) & (special == 'N') else False for i in label]
label
Second Method with iterrows()
df4['label'] = pd.Series([])
df4['label'] = [True if (row[amount] <= 1) & (row[time] <= timequantil) & (row[diff] == diffrange) & (row[special] == 'N') else False for row in df4.iterrows()]
df4['label']
3rd Method with Lambda function
df4.loc[:,'label'] = '1'
df4['label'] = df4['label'].apply([lambda c: True if (c[amount] <= 1) & (c[time] <= timequantil) & (c[diff] == diffrange) & (c[special]) == 'N' else False for c in df4['label']], axis = 0)
df4['label'].value_counts()
I expected to get a variable "label" in my dataframe df4 that is either True or False.
Some attempts gave me only all values = False or all values = True, even when I used only a single parameter, which is impossible given the data.
The first method runs fine but outputs: []
The second method gives me the following error: TypeError: tuple indices must be integers or slices, not Series
The third method does not run at all.
IIUC, try this
time = df4['DLZ_SCHDATSCHL']
Erstr = df4['Schadenwert']
Zahlges = df4['zahlgesbrut']
# timequantil = time.quantile(.2)
diff = (Erstr - Zahlges) / Erstr * 100

df4['label'] = ((df4['AnzZahlungIDAD'] <= 1) & (time <= time.quantile(.2))
                & (diff <= 15) & (diff >= -15)
                & (df4['Belegpruefereinsatz_rel'] == 'N') & (df4['Taxatoreneinsatz'] == 'N')
                & (df4['ExtTechSVKZ'] == 'N') & (df4['IntSVKZ'] == 'N'))
Given your dataset, I got the following output:
   Anz  dlz  sch      zal taxa bel int ext  label
0    2   82  200   253.80    N   N   N   J  False
1    2   82  200   253.80    N   N   N   J  False
2    1  153  200   323.68    N   J   N   N  False
3    1  153  200   323.68    N   J   N   N  False
4    1  191  500  1252.12    N   J   N   N  False
Note: Don't mind the abbreviated column names.
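If you prefer, the range check on diff can also be written with Series.between (inclusive on both ends), and the four 'N' checks can be collapsed into one (an equivalent sketch):
df4['label'] = ((df4['AnzZahlungIDAD'] <= 1)
                & (time <= time.quantile(.2))
                & diff.between(-15, 15)
                & (df4[['Taxatoreneinsatz', 'Belegpruefereinsatz_rel', 'IntSVKZ', 'ExtTechSVKZ']] == 'N').all(axis=1))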
