Note: I have solved this problem as shown below.
I can use to_csv to write to stdout in Python/pandas. Something like this works fine:
final_df.to_csv(sys.stdout, index=False)
I would like to read in an actual Excel file (not a CSV). I want to output CSV, but input xlsx. I tried this:
bls_df = pd.read_excel(sys.stdin, sheet_name="MSA_dl", index_col=None)
But that doesn't seem to work. Is it possible to do what I'm trying and, if so, how does one do it?
Notes:
The actual input file is "MSA_M2018_dl.xlsx" which is in the zip file https://www.bls.gov/oes/special.requests/oesm18ma.zip.
I download and extract the datafile like this:
curl -o oesm18ma.zip 'https://www.bls.gov/oes/special.requests/oesm18ma.zip'
7z x oesm18ma.zip
I have solved the problem as follows, with script test01.py that reads from stdin and writes to stdout. NOTE the use of sys.stdin.buffer in the read_excel() call.
import sys
import os
import pandas as pd
BLS_DF = pd.read_excel(sys.stdin.buffer, sheet_name="MSA_dl", index_col=None)
BLS_DF.to_csv(sys.stdout, index=False)
I invoke this as:
cat MSA_M2018_dl.xlsx | python3 test01.py
This is a small test program to illustrate the idea while removing complexity. It's not the actual program I'm working on.
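A minimal sketch of why sys.stdin.buffer matters (the byte values here are illustrative): sys.stdin is a text wrapper that decodes bytes to str, while its .buffer attribute exposes the underlying raw byte stream, which is what read_excel needs for a binary xlsx.

```python
import io

# Simulated stdin: a text wrapper over a raw byte buffer, mirroring how
# sys.stdin wraps sys.stdin.buffer. xlsx files begin with the zip magic
# bytes "PK", and decoding them as text would corrupt the data.
raw = io.BytesIO(b"PK\x03\x04 fake xlsx bytes")
text = io.TextIOWrapper(raw, encoding="utf-8")

# The .buffer attribute hands back the underlying binary stream.
assert text.buffer.read(2) == b"PK"
```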
Based on this answer, a possibility would be:
import sys
import pandas as pd
import io
csv = ""
for line in sys.stdin:
    csv += line
df = pd.read_csv(io.StringIO(csv))
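A more compact equivalent, assuming the whole input fits in memory, is to slurp stdin in one call instead of concatenating line by line (repeated string concatenation in a loop is quadratic). A StringIO stands in for sys.stdin here to keep the sketch self-contained:

```python
import io
import pandas as pd

# Stand-in for sys.stdin; in the real script this would be sys.stdin.read().
fake_stdin = io.StringIO("a,b\n1,2\n3,4\n")

df = pd.read_csv(io.StringIO(fake_stdin.read()))
assert df.shape == (2, 2)
```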
Related
I'm trying to create a dataframe from a csv file for an assignment; however, every time I run my program it shows an error that the file couldn't be found. My code is the following:
import pandas as pd
df = pd.read_csv('thefile')
The code returns an error no matter where I place my file. When I checked for the path using the code below:
import os
print(os.getcwd())
It showed me that the path is correct and that it is looking inside the folder where my csv file is located, but it still returns the same error.
When reading in files, 'thefile' must include its .csv extension, i.e. 'thefile.csv'.
You need to add .csv after thefile; without it, pandas doesn't know which file to look for: it could be thefile.txt, thefile.conf, thefile.csv, and so on.
So your code should look like this.
import pandas as pd
df = pd.read_csv('thefile.csv')
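When in doubt, you can list the directory from Python to see the exact filenames, extensions included (OS file explorers often hide them). A small helper, written here as a sketch:

```python
import os

def find_csv_files(directory="."):
    """Return the names of .csv files pandas could open in `directory`."""
    return sorted(name for name in os.listdir(directory)
                  if name.lower().endswith(".csv"))
```

Printing find_csv_files() next to os.getcwd() usually reveals whether the file is really 'thefile.csv' or something like 'thefile.csv.txt'.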
Task:
My task is to compare the strings in the first column of sha1_vsdt.csv with the strings in trendx.log: when there is a match, it should take the description from the log file and put it in the third column of the CSV; otherwise it should put undetected.
But trendx.log can't be read. What I did is copy the contents of trendx.log, paste them into Notepad, and save; after that, the file is readable.
Here is the readable log file: trend2.log. I think the Unicode format is the problem.
How can I read this log file? Is there any way to convert it? I already tried decoding it as utf-16le, but it only prints 3 lines.
Here is my code
import numpy as np
import pandas as pd
import csv
import io
import shutil
pd.set_option('display.max_rows', 1000)
logtext = "trendx.log"
#Log data into dataframe using genfromtxt
logdata = np.genfromtxt(logtext,invalid_raise = False,dtype=str, comments=None,usecols=np.arange(16))
logframe = pd.DataFrame(logdata)
#print (logframe.head())
#Dataframe trimmed to use only SHA1, PRG and IP
#Dataframe trimmed to use only SHA-1 and DESC
df2=(logframe[[10,11]]).rename(columns={10:'SHA-1', 11: 'DESC'})
#print (df2.head())
#sha1_vsdt data into dataframe using read_csv
df1=pd.read_csv("sha1_vsdt.csv",delimiter=",",error_bad_lines=False,engine = 'python',quoting=3)
#Using merge to compare the two CSV
df = pd.merge(df1, df2, on='SHA-1', how='left').fillna('undetected')
df1['DESC'] = df['DESC'].values
df1.to_csv("sha1_vsdt.csv",index=False)
Output in the CSV when using trendx.log: everything is undetected from rows 1-584.
Correct output in the CSV when using trend2.log.
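For reference, the left-merge plus fillna('undetected') pattern in the script above behaves like this on toy data (the hashes and descriptions are made up):

```python
import pandas as pd

# Every SHA-1 in df1 keeps its row; hashes absent from the log come back
# as NaN from the left merge, which fillna turns into 'undetected'.
df1 = pd.DataFrame({"SHA-1": ["aa", "bb", "cc"]})
df2 = pd.DataFrame({"SHA-1": ["aa", "cc"], "DESC": ["TROJ_X", "WORM_Y"]})

merged = pd.merge(df1, df2, on="SHA-1", how="left").fillna("undetected")
assert merged["DESC"].tolist() == ["TROJ_X", "undetected", "WORM_Y"]
```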
This file is encoded as UTF-16-LE. Pass in the encoding flag when you read the file, like this:
logdata = np.genfromtxt(logtext, invalid_raise=False, dtype=str, comments=None, usecols=np.arange(16), encoding='utf-16-le')
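One way to confirm the encoding before guessing is to inspect the file's first bytes for a byte-order mark. This is a heuristic sketch (BOMs are optional, so their absence proves nothing):

```python
def encoding_from_bom(header):
    """Guess an encoding from a file's leading bytes (its BOM), if any."""
    if header.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"
    if header.startswith(b"\xff\xfe"):
        return "utf-16-le"   # also matches UTF-32-LE; fine as a first guess
    if header.startswith(b"\xfe\xff"):
        return "utf-16-be"
    return None

# Read just the first bytes of the log:
# with open("trendx.log", "rb") as f:
#     print(encoding_from_bom(f.read(4)))
```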
I'm having some trouble reading a csv file
import pandas as pd
df = pd.read_csv('Data_Matches_tekha.csv', skiprows=2)
I get
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 526, saw 5
and when I add sep=None to the read_csv call I get another error
Error: line contains NULL byte
I tried adding unicode='utf-8'; I even tried the csv reader, and nothing works with this file.
The CSV file is totally fine; I checked it and I see nothing wrong with it.
In your actual code, the line is:
>>> pandas.read_csv("Data_Matches_tekha.xlsx", sep=None)
You are trying to read an Excel file, and not a plain text CSV which is why things are not working.
Excel files (xlsx) are in a special binary format which cannot be read as simple text files (like CSV files).
You need to either convert the Excel file to a CSV file (note - if you have multiple sheets, each sheet should be converted to its own csv file), and then read those.
You can use read_excel, or you can use a library like xlrd, which is designed to read the binary format of Excel files; see Reading/parsing Excel (xls) files with Python for more information on that.
Use read_excel instead of read_csv for Excel files:
import pandas as pd
df = pd.read_excel("Data_Matches_tekha.xlsx")
I have encountered the same error when I used to_csv to write some data and then read it in another script. I found an easy solution that bypasses pandas' read functions: the pickle module.
Note that pickle is part of Python's standard library, so there is nothing to install.
Then you can use for writing your data (first) the code below
import pickle
with open(path, 'wb') as output:
    pickle.dump(variable_to_save, output)
And finally import your data in another script using
import pickle
with open(path, 'rb') as input:
    data = pickle.load(input)
Note that if you want to read your saved data with a different Python version than the one you saved it with, you can specify that at the writing step with protocol=x, where x is a pickle protocol the reading version supports (protocol 2 is the highest one Python 2 can read).
I hope this can be of any use.
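A self-contained round-trip of that idea, using protocol=2 so the file stays readable by Python 2 as well (the path here is just a temp-file placeholder):

```python
import os
import pickle
import tempfile

data = {"rows": [1, 2, 3]}
path = os.path.join(tempfile.mkdtemp(), "data.pkl")

# Write: protocol=2 is the highest protocol Python 2 can still read.
with open(path, "wb") as output:
    pickle.dump(data, output, protocol=2)

# Read it back, as the second script would.
with open(path, "rb") as infile:
    assert pickle.load(infile) == data
```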
I'm attempting to simulate the use of pandas to access a constantly changing file.
I have one script reading a csv file, adding a line to it, then sleeping for a random time to simulate bulk input.
import pandas as pd
from time import sleep
import random
df2 = pd.DataFrame(data=[['test', 'trial']], index=None)
while True:
    df = pd.read_csv('data.csv', header=None)
    df.append(df2)
    df.to_csv('data.csv', index=False)
    sleep(random.uniform(0.025, 0.3))
The second file is checking for change in data by outputting the shape of the dataframe:
import pandas as pd
while True:
    df = pd.read_csv('data.csv', header=None, names=['Name', 'DATE'])
    print(df.shape)
The problem with that is while I'm getting the correct shape of the DF, there are certain times where it's outputting (0x2).
i.e.:
...
(10x2)
(10x2)
...
(10x2)
(0x2)
(11x2)
(11x2)
...
This occurs at some, but not all, of the shape changes (when the first script adds to the dataframe).
I know this happens when the first script has the file open to add data and the second script can't access it, hence (0x2). Will this cause any data loss?
I cannot directly access the stream, only the output file. Are there any other possible solutions?
Edit
The purpose of this is to load the new data only (I have a code that does that) and do analysis "on the fly". Some of the analysis will include output/sec, graphing (similar to stream plot), and few other numerical calculations.
The biggest issue is that I have access to the csv file only, and I need to be able to analyze the data as it comes without loss or delay.
One of the scripts is reading the file while the other is trying to write to the file. Both scripts cannot access the file at the same time. Like Padraic Cunningham says in the comments you can implement a lock file to solve this problem.
There is a python package that will do just that called lockfile with documentation here.
Here is your first script with the lockfile package implemented:
import pandas as pd
from time import sleep
import random
from lockfile import FileLock
df2 = pd.DataFrame(data = [['test','trial']], index=None)
lock = FileLock('data.lock')
while True:
    with lock:
        df = pd.read_csv('data.csv', header=None)
        df.append(df2)
        df.to_csv('data.csv', index=False)
    sleep(random.uniform(0.025, 0.3))
Here is your second script with the lockfile package implemented:
import pandas as pd
from time import sleep
from lockfile import FileLock
lock = FileLock('data.lock')
while True:
    with lock:
        df = pd.read_csv('data.csv', header=None, names=['Name', 'DATE'])
        print(df.shape)
    sleep(0.100)
I added a wait of 100ms so that I could slow down the output to the console.
These scripts create a file called "data.lock" before accessing "data.csv" and delete "data.lock" after accessing it. In either script, if "data.lock" exists, the script waits until it no longer does.
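If you'd rather not add a dependency, the same idea can be sketched with only the standard library: os.open with O_CREAT | O_EXCL creates the lock file atomically, failing if it already exists (the names below are illustrative, not part of any package):

```python
import os
import time
from contextlib import contextmanager

@contextmanager
def file_lock(lock_path, poll_seconds=0.01):
    """Advisory lock: atomically create lock_path, spinning until we can."""
    while True:
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL)
            break  # we created the file, so we hold the lock
        except FileExistsError:
            time.sleep(poll_seconds)  # another process holds it; retry
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)
```

One caveat: a crash between create and remove leaves a stale lock file behind, which is exactly the bookkeeping the lockfile package handles for you.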
Your simulation script reads and writes to the data.csv file. You can read and write concurrently if one script opens the file as write only and the other opens the file as read only.
With this in mind, I changed your simulation script for writing the file to the following:
from time import sleep
import random
while True:
    with open("data.csv", 'a') as fp:
        fp.write(','.join(['0', '1']))
        fp.write('\n')
    sleep(0.010)
In Python, opening a file with 'a' means append, write-only; 'a+' appends with read and write access. You must make sure that the code writing the file only opens it as write-only, and that the script reading the file never attempts to write to it. Otherwise, you will need to implement another solution.
Now you should be able to read using your second script without the issue that you mention.
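To "load the new data only", the reading side can remember its byte offset and fetch just the lines appended since the last poll; a sketch (the function name is made up):

```python
def read_new_lines(path, offset):
    """Return (new_lines, new_offset) for lines appended after `offset`."""
    with open(path, "r") as fp:
        fp.seek(offset)
        new_lines = fp.readlines()
        return new_lines, fp.tell()
```

Each poll passes in the offset returned by the previous call, so rows are processed exactly once even while the writer keeps appending.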
I am new to Python and so far I am loving the IPython notebook for learning. I am using the to_csv() function to write a pandas dataframe out to a file. I wanted to open the CSV to see how it would look in Excel, and it would only open in read-only mode because it was still in use by another process. How do I close the file?
import pandas as pd
import numpy as np
import statsmodels.api as sm
import csv
df = pd.DataFrame(file)
path = "File_location"
df.to_csv(path+'filename.csv', mode='wb')
This will write out the file no problem but when I "check" it in excel I get the read only warning. This also brought up a larger question for me. Is there a way to see what files python is currently using/touching?
This is a better way of doing it: with a context manager, you don't have to handle the file resource yourself.
with open("thefile.csv", "w") as f:
    df.to_csv(f)
try opening and closing the file yourself:
outfile = open(path+'filename.csv', 'wb')
df.to_csv(outfile)
outfile.close()
The newest pandas to_csv closes the file automatically when it's done.
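In other words, when to_csv is given a path rather than an open handle, pandas opens and closes the file itself, so nothing is left holding it open; a quick self-contained check (the path is a temp-file placeholder):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
path = os.path.join(tempfile.mkdtemp(), "filename.csv")

# Passing a path string: pandas manages the open/close internally.
df.to_csv(path, index=False)
assert pd.read_csv(path).shape == (2, 1)
```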