How can I create a stream of data from a pandas dataframe? - python

I am looking for a way to produce a stream of data from static data, e.g. I want to create a source where each row of data arrives every 10 ms. Is there a way to do it?

You could just iterate with a timer to wait, using yield to create a generator. I used itertuples, but you can change how you iterate over the data:
import time
import pandas as pd

def yield_wait(frame, ms):
    for v in frame.itertuples():
        yield v
        time.sleep(ms / 1000)

if __name__ == '__main__':
    inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
    df = pd.DataFrame(inp)
    for v in yield_wait(df, 1000):  # print every 1 sec
        print(v)
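If the consumer should not block while waiting, the same idea also works with an async generator. This is only a sketch of that variant, not part of the original answer:
import asyncio
import pandas as pd

async def stream(frame, ms):
    # yield one row at a time, pausing without blocking the event loop
    for row in frame.itertuples():
        yield row
        await asyncio.sleep(ms / 1000)

async def main():
    inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}]
    df = pd.DataFrame(inp)
    async for row in stream(df, 10):  # one row every 10 ms
        print(row)

if __name__ == '__main__':
    asyncio.run(main())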

Related

Pandas .loc[] is extremely slow compared to dict

I have a DataFrame that has around 10 columns and 100K rows. I want to get every row in a loop using .loc[] on the index. However, .loc[] is extremely slow compared to a Python dict.
Here is the code to reproduce:
import pandas as pd
import random
import time

data = {}
for i in range(100000):
    data[i] = {
        'id': i,
        'a': random.randint(1, 40000),
        'b': random.randint(1, 40000),
        'c': random.randint(1, 40000),
        'd': random.randint(1, 40000),
        'e': random.randint(1, 40000),
        'f': random.randint(1, 40000),
    }

df = pd.DataFrame.from_dict(
    data=data,
    orient="index",
    dtype=int,
)
df.set_index('id', inplace=True)
dict_objs = df.to_dict('index')

start_time_dataframe = time.time()
for i in range(100000):
    obj = df.loc[i]
end_time_dataframe = time.time() - start_time_dataframe

start_time_dict = time.time()
for i in range(100000):
    obj = dict_objs[i]
end_time_dict = time.time() - start_time_dict

print(f"Time needed for DataFrame: {end_time_dataframe}")  # 12.08s
print(f"Time needed for Dict: {end_time_dict}")  # 0.01s
Why is DataFrame's .loc[] running so slow?
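Not an answer from the original page, but for context: each df.loc[i] call performs an index lookup and builds a new Series object, so per-row access pays a large fixed overhead, which is why converting once with to_dict (as in the benchmark above) or pulling rows out in bulk is so much faster. A rough, hypothetical sketch of that comparison:
import time
import numpy as np
import pandas as pd

# smaller than the question's 100K rows, just to illustrate the per-call overhead
df = pd.DataFrame(np.random.randint(1, 40000, size=(10_000, 6)),
                  columns=list('abcdef'))

start = time.time()
rows = [df.loc[i] for i in range(len(df))]   # one Series constructed per call
print('loc per row :', time.time() - start)

start = time.time()
rows = list(df.itertuples(index=False))      # rows materialised in bulk
print('itertuples  :', time.time() - start)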

Get closest time from data

Below is the sample data:
{
    "a": "05:32",
    "b": "12:15",
    "c": "15:42",
    "d": "18:23"
}
Using this data, I want to get the closest next value to the current time.
I.e. if it is 15:30 right now, the query should return c.
I tried to do this with a for loop and it didn't seem very efficient.
You can use min with a custom key function:
d = {'a': '05:32', 'b': '12:15', 'c': '15:42', 'd': '18:23'}

def closest(c, t2=[15, 30]):
    a, b = map(int, d[c].split(':'))
    return abs((t2[-1] + 60 * t2[0]) - (b + 60 * a))

new_d = min(d, key=closest)
Output:
c
In general, you can replace t2 = [15, 30] (used only for demo purposes) with results from datetime.datetime.now:
from datetime import datetime

def closest(c, t2=[(n := datetime.now()).hour, n.minute]):
    ...
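Put together, a self-contained version of this answer might look like the sketch below (dictionary and times taken from the question; the helper keeps the answer's "closest in absolute minutes" behaviour):
from datetime import datetime

d = {'a': '05:32', 'b': '12:15', 'c': '15:42', 'd': '18:23'}

def closest(key, now=datetime.now()):
    # now is captured once at definition time, like the walrus version above;
    # compare minutes-since-midnight of the entry against the current time
    h, m = map(int, d[key].split(':'))
    return abs((now.hour * 60 + now.minute) - (h * 60 + m))

print(min(d, key=closest))  # e.g. 'c' when run around 15:30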

Can dictionary data be split into test and training sets randomly?

I want to understand: if I have a set of dictionary data in JSON such as the example below:
data = {'a': '120120121',
        'b': '12301101',
        'c': '120120121',
        'd': '12301101',
        'e': '120120121',
        'f': '12301101',
        'g': '120120121',
        'h': '12301101',
        'i': '120120121',
        'j': '12301101'}
Is it possible to split the dictionary 70:30 randomly using Python?
The output should be like:
training_data = {'a': '120120121',
                 'b': '12301101',
                 'c': '120120121',
                 'e': '120120121',
                 'g': '120120121',
                 'i': '120120121',
                 'j': '12301101'}
test_data = {'d': '12301101',
             'f': '12301101',
             'h': '12301101'}
The easiest way would be to just use sklearn.model_selection.train_test_split here, and turn the results back into dictionaries if that is the structure you want:
import pandas as pd
from sklearn.model_selection import train_test_split

s = pd.Series(data)
training_data, test_data = [i.to_dict() for i in train_test_split(s, train_size=0.7)]
print(training_data)
# {'b': '12301101', 'j': '12301101', 'a': '120120121', 'f': '12301101',
# 'e': '120120121', 'c': '120120121', 'h': '12301101'}
print(test_data)
# {'i': '120120121', 'd': '12301101', 'g': '120120121'}
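If pulling in sklearn and pandas feels heavy for a small dict, a plain random.sample over the keys gives a similar 70:30 split. A minimal sketch, not from the original answer:
import random

data = {'a': '120120121', 'b': '12301101', 'c': '120120121', 'd': '12301101',
        'e': '120120121', 'f': '12301101', 'g': '120120121', 'h': '12301101',
        'i': '120120121', 'j': '12301101'}

# pick 70% of the keys at random for training, the rest for testing
train_keys = set(random.sample(list(data), k=round(len(data) * 0.7)))
training_data = {k: v for k, v in data.items() if k in train_keys}
test_data = {k: v for k, v in data.items() if k not in train_keys}

print(training_data)
print(test_data)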

How to print individual rows of a pandas dataframe using Python?

newbie to Python.
I'm trying to extract data from a dataframe and put it into a string to print to a docx file.
This is my current code:
add_run("Placeholder A").italic= True
for i in range(0, list(df.shape)[0]):
A = df.iloc[findings][0]
B = df.iloc[findings][1]
C =df.iloc[findings][2]
output = ('The value of A: {}, B: {}, C: {}').format(A,B,C)
doc.add_paragraph(output)
The output I am after is:
Placeholder A
print output for row 1 of DF
Placeholder A
print output for row 2 of DF
Currently it is printing all the outputs of the dataframe under Placeholder A.
Any ideas where I am going wrong?
Here (Stack Overflow - How to iterate over rows in a DataFrame in Pandas?) you can find help with iterating over pandas DataFrame rows. All that is left to do is print(row) :)
Edit:
Here is an example (based on the answer from the link) of code that prints the rows of a previously created DataFrame:
import pandas as pd

inp = [{'c1': 10, 'c2': 100, 'c3': 100}, {'c1': 11, 'c2': 110, 'c3': 100}, {'c1': 12, 'c2': 120, 'c3': 100}]
df = pd.DataFrame(inp)

for index, row in df.iterrows():
    A = row["c1"]
    B = row["c2"]
    C = row["c3"]
    print('The value of A: {}, B: {}, C: {}'.format(A, B, C))
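Applied to the docx case in the question, moving the heading inside the loop gives one "Placeholder A" per row. A sketch, assuming doc is an existing python-docx Document and df is the frame above:
# assumes: from docx import Document; doc = Document(); df as defined above
for index, row in df.iterrows():
    heading = doc.add_paragraph()
    heading.add_run("Placeholder A").italic = True  # heading before each row
    output = 'The value of A: {}, B: {}, C: {}'.format(row["c1"], row["c2"], row["c3"])
    doc.add_paragraph(output)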

How to use tqdm with map for Dataframes

Can I use a tqdm progress bar with the map function to loop through DataFrame/Series rows?
Specifically, for the following case:
import pandas as pd

def example(x):
    x = x + 2
    return x

if __name__ == '__main__':
    dframe = pd.DataFrame([{'a': 1, 'b': 1}, {'a': 2, 'b': 2}, {'a': 3, 'b': 3}])
    dframe['b'] = dframe['b'].map(example)
Due to the integration of tqdm with pandas, you can use the progress_map function instead of the map function.
Note: for this to work you should add a tqdm.pandas() line to your code.
So try this:
import pandas as pd
from tqdm import tqdm

def example(x):
    x = x + 2
    return x

tqdm.pandas()  # <- added this line

if __name__ == '__main__':
    dframe = pd.DataFrame([{'a': 1, 'b': 1}, {'a': 2, 'b': 2}, {'a': 3, 'b': 3}])
    dframe['b'] = dframe['b'].progress_map(example)  # <- progress_map here
Here is the documentation reference:
(after adding tqdm.pandas()) ... you can use progress_apply instead of apply and progress_map instead of map
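The same pattern also works row-wise across the whole frame with progress_apply. A small sketch of that usage (the 'total' column is just an illustrative name):
import pandas as pd
from tqdm import tqdm

tqdm.pandas()

dframe = pd.DataFrame([{'a': 1, 'b': 1}, {'a': 2, 'b': 2}, {'a': 3, 'b': 3}])

# progress_apply shows a progress bar while summing each row
dframe['total'] = dframe.progress_apply(lambda row: row['a'] + row['b'], axis=1)
print(dframe)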
