Streaming data for pandas df - python

I'm attempting to simulate the use of pandas to access a constantly changing file.
I have one script reading a csv file, adding a line to it, then sleeping for a random time to simulate bulk input.
import pandas as pd
from time import sleep
import random
df2 = pd.DataFrame(data=[['test', 'trial']], index=None)

while True:
    df = pd.read_csv('data.csv', header=None)
    df = df.append(df2)  # append returns a new frame; it must be reassigned
    df.to_csv('data.csv', index=False)
    sleep(random.uniform(0.025, 0.3))
The second script checks for changes in the data by printing the shape of the dataframe:
import pandas as pd
while True:
    df = pd.read_csv('data.csv', header=None, names=['Name', 'DATE'])
    print(df.shape)
The problem is that while I'm getting the correct shape of the df most of the time, at certain times it outputs (0x2).
i.e.:
...
(10x2)
(10x2)
...
(10x2)
(0x2)
(11x2)
(11x2)
...
This occurs at some, but not all, of the changes in shape (i.e. when the first script adds to the dataframe).
Knowing this happens when the first script opens the file to add data and the second script is unable to access it, hence (0x2), will this cause any data loss?
I cannot directly access the stream, only the output file. Or are there any other possible solutions?
Edit
The purpose of this is to load only the new data (I have code that does that) and do analysis "on the fly". Some of the analysis will include output/sec, graphing (similar to a stream plot), and a few other numerical calculations.
The biggest issue is that I have access to the csv file only, and I need to be able to analyze the data as it comes without loss or delay.

One of the scripts is reading the file while the other is trying to write to it. The two scripts cannot access the file at the same time. As Padraic Cunningham says in the comments, you can implement a lock file to solve this problem.
There is a Python package that does just that, called lockfile, with documentation here.
Here is your first script with the lockfile package implemented:
import pandas as pd
from time import sleep
import random
from lockfile import FileLock
df2 = pd.DataFrame(data=[['test', 'trial']], index=None)
lock = FileLock('data.lock')

while True:
    with lock:
        df = pd.read_csv('data.csv', header=None)
        df = df.append(df2)  # reassign: append does not modify in place
        df.to_csv('data.csv', index=False)
    sleep(random.uniform(0.025, 0.3))
Here is your second script with the lockfile package implemented:
import pandas as pd
from time import sleep
from lockfile import FileLock
lock = FileLock('data.lock')

while True:
    with lock:
        df = pd.read_csv('data.csv', header=None, names=['Name', 'DATE'])
        print(df.shape)
    sleep(0.100)
I added a wait of 100ms so that I could slow down the output to the console.
These scripts will create a file called "data.lock" before accessing the "data.csv" file and delete the file "data.lock" after accessing the "data.csv" file. In either script, if the "data.lock" exists, the script will wait until the "data.lock" file no longer exists.
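If you would rather skip a poll than block indefinitely while the other script holds the lock, lockfile also supports acquiring with a timeout. A minimal sketch of the reader side (the one-second timeout is an arbitrary choice):
import pandas as pd
from lockfile import FileLock, LockTimeout

lock = FileLock('data.lock')
try:
    # Wait at most one second for the other script to release the lock
    lock.acquire(timeout=1)
    try:
        df = pd.read_csv('data.csv', header=None)
        print(df.shape)
    finally:
        lock.release()
except LockTimeout:
    print('data.csv is busy, skipping this poll')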

Your simulation script reads and writes to the data.csv file. You can read and write concurrently if one script opens the file as write only and the other opens the file as read only.
With this in mind, I changed your simulation script for writing the file to the following:
from time import sleep
import random
while True:
    with open("data.csv", 'a') as fp:
        fp.write(','.join(['0', '1']))
        fp.write('\n')
    sleep(0.010)
In Python, opening a file with 'a' means appending in write-only mode; 'a+' appends with read and write access. You must make sure that the code writing the file only ever opens it write-only, and that the script reading the file never attempts to write to it. Otherwise, you will need to implement another solution.
Now you should be able to read using your second script without the issue that you mention.
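Since your edit mentions loading only the new data, here is a minimal polling reader (a sketch, assuming the writer only ever appends whole lines): it remembers the byte offset after each read and parses just the newly appended rows.
import io
import pandas as pd
from time import sleep

offset = 0
while True:
    with open('data.csv', 'r') as fp:
        fp.seek(offset)     # jump past everything already seen
        new_text = fp.read()
        offset = fp.tell()  # remember where we stopped
    if new_text:
        new_rows = pd.read_csv(io.StringIO(new_text), header=None, names=['Name', 'DATE'])
        print(new_rows.shape)
    sleep(0.100)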

How do I make a table for historic data?

How do I make a dataset that shows historic data from snapshots?
I have a csv-file that is updated and overwritten with new snapshot data once a day. I would like to make a python-script that regularly updates the snapshot data with the current snapshots.
One way I thought of was the following:
import pandas as pd

# Read csv-file
snapshot = pd.read_csv('C:/source/snapshot_data.csv')

# Try to read potential trend-data
try:
    historic = pd.read_csv('C:/merged/historic_data.csv')
    # Merge the two dfs and write back to historic file-path
    historic.merge(snapshot).to_csv('C:/merged/historic_data.csv')
except FileNotFoundError:
    snapshot.to_csv('C:/merged/historic_data.csv')
However, I don't like using a try/except block to read the historic data if the file exists, or else write the snapshot data to the historic path if it doesn't.
Is there anyone that knows a better way of creating a trend dataset?
You can use the os module to check whether the file exists, and the mode argument of the to_csv function to append data to the file.
The code below will:
Read from snapshot.csv.
Check whether the historic.csv file exists.
Write the header only if the file does not exist yet.
Save the file; if the file already exists, new data is appended to it instead of overwriting it.
import os
import pandas as pd
# Read snapshot file
snapshot = pd.read_csv("snapshot.csv")
# Check if historic data file exists
file_path = "historic.csv"
header = not os.path.exists(file_path)  # whether the header needs to be written
# Create or append to the historic data file
snapshot.to_csv(file_path, header=header, index=False, mode="a")
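One caveat: if the script happens to run more than once on the same day, the same snapshot rows will be appended twice. A sketch of a cleanup pass, assuming rows are unique across all columns:
import pandas as pd

# Drop exact duplicate rows that repeated appends may have introduced
pd.read_csv("historic.csv").drop_duplicates().to_csv("historic.csv", index=False)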
You could easily one-line it by utilising the mode parameter in to_csv.
pandas.read_csv('snapshot.csv').to_csv('historic.csv', mode='a')
It will create the file if it doesn't already exist, or append to it if it does (note that without header=False and index=False, each append also writes the header row and the index column).
What happens if you don't have a new snapshot file? You might want to wrap that in a try...except block; the Pythonic way is typically to ask for forgiveness rather than permission.
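A minimal sketch of that forgiveness-first style, assuming a missing snapshot simply means there is nothing to append today:
import pandas as pd

try:
    snapshot = pd.read_csv('snapshot.csv')
except FileNotFoundError:
    print('no new snapshot file today, nothing to append')
else:
    # header=False assumes historic.csv already exists with a header row
    snapshot.to_csv('historic.csv', mode='a', header=False, index=False)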
I wouldn't even bother with an external library like pandas, as the standard library has all you need to 'append' to a file.
with open('snapshot.csv', 'r') as snapshot:
    with open('historic.csv', 'a') as historic:
        for line in snapshot:
            historic.write(line)

Issue with saving CSV file, how can I fix this in python pandas?

I'm having trouble dropping columns and saving the new data frame as a CSV file.
Code:
import pandas as pd
file_path = 'Downloads/editor_events.csv'
df = pd.read_csv(file_path, index_col=False, nrows=1000)
df.to_csv(file_path, index = False)
df.to_csv(file_path)
The code executes and doesn't give any error. I've looked in my root directory but can't see any new csv file.
Check the folder in which you are running the Python script; you are saving with the same name, so you can check the modified time to confirm it. Also, as per the posted code you are not dropping any columns, you are just taking 1000 rows and saving them.
First: you are saving the same file that you are reading, so you won't see any new csv files. All you are doing right now is rewriting the same file.
But since I guess you just show it as a simple example of what you want to do, I will move on to the second point:
Make sure that your path is correct. Try writing the full path, like r'C:\Users\AwesomeUser\Downloads\editor_events.csv', instead of just 'Downloads/editor_events.csv' (note the r prefix: backslashes in an ordinary Python string are escape characters).
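One way to sidestep the working-directory question entirely is to build an absolute path in code. A sketch (the Downloads location is an assumption):
import os
import pandas as pd

# Expand ~ to the user's home directory so the path doesn't depend on
# where the script happens to be launched from
file_path = os.path.expanduser('~/Downloads/editor_events.csv')
df = pd.read_csv(file_path, index_col=False, nrows=1000)
df.to_csv(file_path, index=False)
print('written to', os.path.abspath(file_path))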

Pickle encoding utf-8 issue

I'm trying to pickle a pandas dataframe to my local directory so I can work on it in another Jupyter notebook. The write appears to succeed at first, but when I try to read it in a new Jupyter notebook the read is unsuccessful.
When I open the pickle file I appear to have written, its only contents are:
Error! /Users/.../income.pickle is not UTF-8 encoded
Saving disabled.
See console for more details.
I also checked and the pickle file itself is only a few kilobytes.
Here's my code for writing the pickle:
import pickle

with open('income.pickle', 'wb', encoding='UTF-8') as to_write:
    pickle.dump(new_income_df, to_write)
And here's my code for reading it:
with open('income.pickle', 'rb') as read_file:
    income_df = pickle.load(read_file)
Also when I return income_df I get this output:
Series([], dtype: float64)
It's an empty series that errors when I try to call most Series methods on it.
If anyone knows a fix for this I'm all ears. Thanks in advance!
EDIT:
This is the solution I arrived at:
with open('cleaned_df', 'wb') as to_write:
    pickle.dump(df, to_write)

with open('cleaned_df', 'rb') as read_file:
    df = pickle.load(read_file)
Which was much simpler than I expected.
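For what it's worth, pandas also ships its own pickle helpers, which wrap those two steps in one call each:
import pandas as pd

# assuming df is the DataFrame to persist
df.to_pickle('cleaned_df.pickle')
df = pd.read_pickle('cleaned_df.pickle')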
Pickling is generally used to store raw data, not to pass a pandas DataFrame object. When you try to pickle it, it will just store the top-level module name (Series, in this case).
1) You can write only the data from the DataFrame to a csv file.
# Write/read csv file using DataFrame object's "to_csv" method.
import pandas as pd
new_income_df.to_csv("mydata.csv")
new_income_df2 = pd.read_csv("mydata.csv")
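Note that this round trip writes the DataFrame index as an extra "Unnamed: 0" column by default; passing index=False avoids that:
# Avoid the extra "Unnamed: 0" column on the csv round trip
new_income_df.to_csv("mydata.csv", index=False)
new_income_df2 = pd.read_csv("mydata.csv")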
2) If your data can be saved as a function in a regular python module with a *.py name, you can call it from a Jupyter notebook. You can also reload the function after you have changed the values inside. See autoreload ipynb documentation: https://ipython.org/ipython-doc/3/config/extensions/autoreload.html
# Saved as "mymodule1.py" (from notebook1.ipynb).
import pandas as pd

def funcdata():
    new_income_df = pd.DataFrame(data=[100, 101])
    return new_income_df

# notebook2.ipynb
%load_ext autoreload
%autoreload 2

import pandas as pd
import mymodule1

df2 = mymodule1.funcdata()
print(df2)
# Change the data inside funcdata() in mymodule1.py and see if it changes here.
3) You can share data between Jupyter notebooks using %store command.
Source: https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/
And: Share data between IPython Notebooks
# %store example, first Jupyter notebook.
from sklearn import datasets
dataset = datasets.load_iris()
%store dataset
# from a new Jupyter notebook read.
%store -r dataset

using pandas read_excel to read from stdin

Note: I have solved this problem, as shown below.
I can use to_csv to write to stdout in python / pandas. Something like this works fine:
final_df.to_csv(sys.stdout, index=False)
I would like to read in an actual Excel file (not a csv): I want to output CSV, but input xlsx. I tried this:
bls_df = pd.read_excel(sys.stdin, sheet_name="MSA_dl", index_col=None)
But that doesn't seem to work. Is it possible to do what I'm trying and, if so, how does one do it?
Notes:
The actual input file is "MSA_M2018_dl.xlsx" which is in the zip file https://www.bls.gov/oes/special.requests/oesm18ma.zip.
I download and extract the datafile like this:
curl -o oesm18ma.zip 'https://www.bls.gov/oes/special.requests/oesm18ma.zip'
7z x oesm18ma.zip
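If you'd rather do the download and extraction from Python, a rough equivalent of those two commands (an untested sketch) is:
import io
import urllib.request
import zipfile

url = 'https://www.bls.gov/oes/special.requests/oesm18ma.zip'
with urllib.request.urlopen(url) as resp:
    # Read the whole archive into memory, then unpack it
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))
archive.extractall()  # extracts MSA_M2018_dl.xlsx among the other files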
I have solved the problem as follows, with script test01.py that reads from stdin and writes to stdout. NOTE the use of sys.stdin.buffer in the read_excel() call.
import sys
import os
import pandas as pd
BLS_DF = pd.read_excel(sys.stdin.buffer, sheet_name="MSA_dl", index_col=None)
BLS_DF.to_csv(sys.stdout, index=False)
I invoke this as:
cat MSA_M2018_dl.xlsx | python3 test01.py
This is a small test program to illustrate the idea while removing complexity. It's not the actual program I'm working on.
Based on this answer, a possibility would be:
import sys
import pandas as pd
import io
csv = ""
for line in sys.stdin:
    csv += line

df = pd.read_csv(io.StringIO(csv))
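Note that read_csv accepts any file-like object directly, so for plain CSV on stdin the intermediate string buffer may be unnecessary:
import sys
import pandas as pd

# read_csv can consume stdin directly for text input such as CSV
df = pd.read_csv(sys.stdin)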

Closing file after using to_csv()

I am new to Python and so far I am loving the IPython notebook for learning. I am using the to_csv() function to write a pandas dataframe out to a file. I wanted to open the csv to see how it would look in Excel, but it would only open in read-only mode because it was still in use by another process. How do I close the file?
import pandas as pd
import numpy as np
import statsmodels.api as sm
import csv
df = pd.DataFrame(file)
path = "File_location"
df.to_csv(path+'filename.csv', mode='wb')
This will write out the file no problem but when I "check" it in excel I get the read only warning. This also brought up a larger question for me. Is there a way to see what files python is currently using/touching?
This is the better way of doing it. With a context manager, you don't have to handle the file resource yourself.
with open("thefile.csv", "w") as f:
df.to_csv(f)
try opening and closing the file yourself:
outfile = open(path + 'filename.csv', 'w')  # text mode: to_csv writes str
df.to_csv(outfile)
outfile.close()
The newest pandas to_csv closes the file automatically when it's done.
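As for the side question of seeing which files Python currently has open: one option (assuming the third-party psutil package is available) is to inspect the current process:
import psutil

# List the files the current Python process has open right now
for f in psutil.Process().open_files():
    print(f.path)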
