I can't read the data from a CSV file on an FTPS server into a pandas DataFrame. I can change into the directory and list all of the files there, but I haven't been able to read the file itself.
Here is my code:
from ftplib import FTP_TLS
import socket
import pandas as pd
server = FTP_TLS('server', certfile=r'C:/')
server.login(user,pw)
# get into respective directory
server.cwd('Banana')
server.prot_p()
# This piece here is needed in order to see what is in my directory, I don't understand why.
# Something about the server not being set up correctly?
server.af = socket.AF_INET6
# check location
server.pwd()
# check files
server.dir()
# Get CSV file data
import io
download_file = io.BytesIO()
download_file.seek(0)
server.retrbinary('RETR ' + str('file.csv'), download_file.write)
download_file.seek(0)
file_to_process = pd.read_csv(download_file, engine='python')
The problem is that the last block, from import io down to file_to_process, just hangs and does nothing. Maybe it times out? I'm not sure what the issue is.
New error is this:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3376: character maps to <undefined>
Edit: Now I'm trying to save the file to disk instead, but this code deletes the contents of the file. Am I misunderstanding how write works?
filematch = 'Try20.csv'
target_dir = r'\\server'
import os
for filename in server.nlst(filematch):
    target_file_name = os.path.join(target_dir, os.path.basename(filename))
    with open(target_file_name, 'wb') as fhandle:
        server.retrbinary('RETR %s' % filename, fhandle.write)
Secondly, I don't understand how to load the contents of fhandle into a DataFrame now.
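A hedged sketch of one way to approach the 'charmap' error above. The wording suggests the bytes are being decoded with a Windows code page such as cp1252, where byte 0x8d has no character assigned, so the download itself may be fine and only the decoding step fails. The helper below tries a few encodings in order on bytes such as download_file.getvalue(). decode_with_fallback is a made-up name, and the list of candidate encodings is an assumption about what the file might contain.

```python
import csv
import io


def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes raw cleanly."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Stand-in for download_file.getvalue(): the byte 0x8d is invalid UTF-8 here
# and undefined in cp1252, so only latin-1 (which accepts any byte) succeeds.
sample = b"id,name\n1,a\x8db\n"
text, used = decode_with_fallback(sample)
rows = list(csv.reader(io.StringIO(text)))
```

Because latin-1 never fails, it only makes sense as the last candidate. Once the right encoding is known, it can be passed to pd.read_csv through its encoding parameter, either with the in-memory buffer or with the path of the file saved to disk.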
I'm trying to speed up the processing of several large log files in order to extract some metrics and store them in a Postgres database. For now I'm only working on the first step: keeping just the relevant lines of the log once they have been read.
This is the sample code that currently works in regular Pandas:
import os
import regex as re
import pandas as pd
fp = "server.log"
data_lines = []
with open(fp, "rt", encoding="utf8") as file:
    lines = file.readlines()
    # data_lines += [
    #     line for line in lines
    #     if "POST" in line
    # ]
    data_lines += lines
# Processing
df = pd.DataFrame({"src": data_lines})
df.src = df.src.astype("string")
df = df[df.src.str.contains("POST")]
But, when I try to replace import pandas as pd with import modin.pandas as pd, I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xee in position 67: invalid continuation byte
As shown, the text file is being opened with the correct encoding, and no error is thrown when using the same code with Pandas. Please advise if this is not the intended way to use Modin.
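Why Modin and pandas differ here isn't clear from the traceback alone. As a hedged workaround, assuming the log contains a few stray bytes that are not valid UTF-8 (such as the 0xee in the error), the file can be opened with errors="replace" and filtered line by line before any DataFrame is built. iter_matching_lines is a made-up name for this sketch, and the demo log is a stand-in for server.log.

```python
import os
import tempfile


def iter_matching_lines(path, needle, encoding="utf8"):
    """Yield lines containing needle; undecodable bytes become U+FFFD."""
    with open(path, "rt", encoding=encoding, errors="replace") as file:
        for line in file:
            if needle in line:
                yield line


# Small demo log containing one byte sequence that is not valid UTF-8
with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as tmp:
    tmp.write(b"POST /a \xee\x20ok\nGET /b\nPOST /c\n")
    demo_path = tmp.name

data_lines = list(iter_matching_lines(demo_path, "POST"))
os.remove(demo_path)
```

The resulting list can be passed to pd.DataFrame({"src": data_lines}) as in the code above. Filtering while reading also avoids holding the unfiltered log in memory.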
Right now my final output is in Excel format. I want to compress the Excel file using gzip. Is there a way to do it?
import pandas as pd
import gzip
import re
def renaming_ad_unit():
    with gzip.open('weekly_direct_house.xlsx.gz') as f:
        df = pd.read_excel(f)
    result = df['Ad unit'].to_list()
    for index, a_string in enumerate(result):
        modified_string = re.sub(r"\([^()]*\)", "", a_string)
        df.at[index, 'Ad unit'] = modified_string
    return df.to_excel('weekly_direct_house.xlsx', index=False)
Yes, this is possible.
To create a gzip file, you can open the file like this:
with gzip.open('filename.xlsx.gz', 'wb') as f:
    ...
Unfortunately, when I tried this, I found that I get the error OSError: Negative seek in write mode. This is because the Pandas excel writer moves backwards in the file when writing, and uses multiple passes to write the file. This is not allowed by the gzip module.
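The restriction can be reproduced without pandas. This is only a small illustration using an in-memory gzip stream: once some bytes have been written, seeking back to an earlier offset raises an OSError.

```python
import gzip
import io

raw = io.BytesIO()
error_message = None
with gzip.GzipFile(fileobj=raw, mode="wb") as gz:
    gz.write(b"some bytes")
    try:
        gz.seek(0)  # moving backwards is not possible in write mode
    except OSError as exc:
        error_message = str(exc)

print(error_message)
```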
To fix this, I created a temporary file and wrote the Excel file there. Then I read the file back and wrote it to the compressed archive.
I wrote a short program to demonstrate this. It reads an excel file from a gzip archive, prints it out, and writes it back to another gzip file.
import pandas as pd
import gzip
import tempfile
def main():
    with gzip.open('apportionment-2020-table02.xlsx.gz') as f:
        df = pd.read_excel(f)
        print(df)
    with tempfile.TemporaryFile() as excel_f:
        df.to_excel(excel_f, index=False)
        with gzip.open('output.xlsx.gz', 'wb') as gzip_f:
            excel_f.seek(0)
            gzip_f.write(excel_f.read())
if __name__ == '__main__':
    main()
Here's the file I'm using to demonstrate this: Link
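For large workbooks, the copy step can be streamed in chunks rather than read into memory in one go. A minimal sketch using shutil.copyfileobj from the standard library; gzip_stream is a made-up helper name, and the bytes in the demo merely stand in for what df.to_excel would write to the temporary file.

```python
import gzip
import os
import shutil
import tempfile


def gzip_stream(src_file, dest_path):
    """Compress an open binary file object into dest_path, chunk by chunk."""
    src_file.seek(0)  # rewind, since the writer leaves the position at the end
    with gzip.open(dest_path, "wb") as gzip_f:
        shutil.copyfileobj(src_file, gzip_f)


payload = b"placeholder for xlsx bytes" * 1000
out_path = os.path.join(tempfile.mkdtemp(), "output.xlsx.gz")
with tempfile.TemporaryFile() as excel_f:
    excel_f.write(payload)  # in the program above: df.to_excel(excel_f, index=False)
    gzip_stream(excel_f, out_path)

# Read the archive back to confirm the round trip
with gzip.open(out_path, "rb") as check:
    restored = check.read()
```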
You could also use io.BytesIO to create a file in memory, write the Excel data to it, and then write that buffer to disk as gzip.
I used the link to the Excel file from Nick ODell's answer.
import pandas as pd
import gzip
import io
df = pd.read_excel('https://www2.census.gov/programs-surveys/decennial/2020/data/apportionment/apportionment-2020-table02.xlsx')
buf = io.BytesIO()
df.to_excel(buf)
buf.seek(0) # move to the beginning of file
with gzip.open('output.xlsx.gz', 'wb') as f:
f.write(buf.read())
Similar to Nick ODell's answer:
import pandas as pd
import gzip
import io
df = pd.read_excel('https://www2.census.gov/programs-surveys/decennial/2020/data/apportionment/apportionment-2020-table02.xlsx')
with io.BytesIO() as buf:
    df.to_excel(buf)
    buf.seek(0) # move to the beginning of file
    with gzip.open('output.xlsx.gz', 'wb') as f:
        f.write(buf.read())
Tested on Linux
I just converted something I did in JS (Node) to Python (Flask web server): connecting to a secured FTP service and reading and parsing CSV files from there, because I know it is faster with Python.
I managed to do almost everything, but I'm having a hard time parsing the CSV file properly.
So this is my code:
import urllib.request
import csv
import json
import pysftp
import pandas as pd
cnopts = pysftp.CnOpts()
cnopts.hostkeys = None
name = 'username'
password = 'pass'
host = 'hostURL'
path = ""
with pysftp.Connection(host=host, username=name, password=password, cnopts=cnopts) as sftp:
    for filename in sftp.listdir():
        if filename.endswith('.csv'):
            file = sftp.open(filename)
            csvFile = file.read()
I got to the point where I can see the content of the CSV file, but I can't parse it into the format I need (an array of objects).
I tried to parse it with:
with open(csvFile, 'rb') as csv_file:
    print(csv_file)
    cr = csv.reader(csv_file, delimiter=",") # , is default
    rows = list(cr)
and with this:
Past=pd.read_csv(csvFile,encoding='cp1252')
print(Past)
but I got errors like:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 748: invalid start byte
and
OSError: Expected file path name or file-like object, got <class 'bytes'> type
I'm really kinda stuck right now.
(One more question - not important but just wanted to know if I can retrieve a file from ftp based on the latest date - because sometimes there can be more than 1 file in a repository.)
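On the side question about picking the latest file: pysftp's Connection.listdir_attr() returns one attribute object per entry, including filename and st_mtime (the modification time as a Unix timestamp), so the newest CSV can be chosen with max. The sketch below is hedged: newest_csv is a made-up helper, and the demo uses stand-in objects instead of a live connection.

```python
from types import SimpleNamespace


def newest_csv(entries):
    """Return the filename of the most recently modified .csv entry, or None."""
    csv_entries = [e for e in entries if e.filename.endswith(".csv")]
    if not csv_entries:
        return None
    return max(csv_entries, key=lambda e: e.st_mtime).filename


# Stand-ins for the objects returned by sftp.listdir_attr()
fake_entries = [
    SimpleNamespace(filename="old.csv", st_mtime=1_600_000_000),
    SimpleNamespace(filename="new.csv", st_mtime=1_700_000_000),
    SimpleNamespace(filename="notes.txt", st_mtime=1_800_000_000),
]
latest = newest_csv(fake_entries)
```

With a real connection this would be newest_csv(sftp.listdir_attr()) inside the with block, followed by sftp.open on the returned name.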
If you don't mind using Pandas (and Numpy)
Pandas' read_csv accepts a file path or a file object (docs). More specifically, it mentions:
By file-like object, we refer to objects with a read() method, such as a file handler (e.g. via builtin open function) or StringIO.
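In that sense, the file object from your example (the result of sftp.open(filename)) should work. The bare filename will not, because pandas would look for it on the local disk rather than on the SFTP server, and the raw bytes in csvFile are neither a path nor a file-like object, which is what the OSError is reporting.
Given this, if you go with pandas, try replacing your code with:
df = pd.read_csv(file, encoding='cp1252') # assuming this is the correct encoding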
print(df.head()) # optional, prints top 5 entries
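df is now a Pandas DataFrame. To transform a DataFrame into an array, try the to_numpy method (docs); if you need one dict per row, which is closer to a JS array of objects, use to_dict instead:
arr = df.to_numpy() # returns numpy array from DataFrame
records = df.to_dict(orient='records') # returns a list of dicts, one per row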
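The OSError in the question comes from passing the raw bytes returned by file.read() where a path or file-like object is expected. A minimal sketch without pandas, assuming cp1252 really is the file's encoding: decode the bytes, wrap them in io.StringIO, and let csv.DictReader build one dict per row. rows_from_csv_bytes is a made-up name, and the sample bytes are invented for the demo.

```python
import csv
import io


def rows_from_csv_bytes(raw, encoding="cp1252"):
    """Parse CSV bytes into a list of dicts keyed by the header row."""
    text = raw.decode(encoding)
    # newline="" lets the csv module handle line endings itself
    return [dict(row) for row in csv.DictReader(io.StringIO(text, newline=""))]


# Stand-in for csvFile = file.read(); 0xb0 is the degree sign in cp1252
sample = b"city,temp\r\nOslo,12\xb0\r\n"
records = rows_from_csv_bytes(sample)
```

For pandas, the same bytes can be wrapped as io.BytesIO(raw) and passed to pd.read_csv together with the encoding argument.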
I'm trying to achieve the following:
input: xls file
output: csv file
I want to read the xls and do some manipulations (rewriting the headers (original: customernumer, csv needs Customer_Number__c), removing some columns, etc.).
Right now I'm already reading the xls and trying to write it as csv (without any manipulations), but I'm struggling because of the encoding.
The original file contains some "special" characters like "/", "\", and most important "ä, ü, ö, ß".
I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 8: ordinal not in range(128)
I have no clue which special characters can be in a file, this changes from time to time.
here is my current sandbox code:
# -*- coding: utf-8 -*-
__author__ = 'adieball'
import xlrd
import csv
from os import sys
import argparse
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("inname", type=str,
                        help="Names of the Input File in single quotes")
    parser.add_argument("--outname", type=str,
                        help="Optional enter the name of the output (csv) file. if nothing is given, "
                             "we use the name of the input file and add .csv to it")
    args = parser.parse_args()
    if args.outname is None:
        outname = args.inname + ".csv"
    else:
        outname = args.outname
    wb = xlrd.open_workbook(args.inname)
    xl_sheet = wb.sheet_by_index(0)
    print args.inname
    print ('Retrieved worksheet: %s' % xl_sheet.name)
    print outname
    output = open(outname, 'wb')
    wr = csv.writer(output, quoting=csv.QUOTE_ALL)
    for rownum in xrange(wb.sheet_by_index(0).nrows):
        wr.writerow(wb.sheet_by_index(0).row_values(rownum))
    output.close()
Is there anything I can do here to make sure these special characters get written to the csv in the same way as they appear in the original xls?
thanks
andre
a simple
from os import sys
reload(sys)
sys.setdefaultencoding("utf-8")
did the trick
Andre
You could convert the script to Python 3 and then open the output file in text mode ("w" instead of "wb") so that it writes Unicode. Not trying to evangelize, but Python 3 makes this sort of thing easier. If you want to stay with Python 2, check out this guide: https://docs.python.org/2/howto/unicode.html
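A minimal sketch of what the write step might look like after such a conversion. Note that this is Python 3 code, unlike the question's script. write_rows is a made-up helper, and the sample rows stand in for what row_values(rownum) would return for each row of the sheet.

```python
import csv
import os
import tempfile


def write_rows(path, rows):
    """Write rows to a UTF-8 encoded CSV file, quoting every field."""
    # Text mode with an explicit encoding; newline="" is what the csv docs recommend
    with open(path, "w", encoding="utf-8", newline="") as output:
        wr = csv.writer(output, quoting=csv.QUOTE_ALL)
        wr.writerows(rows)


demo_rows = [["Customer_Number__c", "name"], ["42", "Müller & Söhne, Straße 1"]]
demo_path = os.path.join(tempfile.mkdtemp(), "out.csv")
write_rows(demo_path, demo_rows)

# Read the file back with the same encoding to check the round trip
with open(demo_path, encoding="utf-8", newline="") as check:
    round_trip = list(csv.reader(check))
```

Naming the encoding explicitly avoids depending on the platform default, which is where errors like the 'ascii' one in the question tend to come from.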
In Python 2, if you want to write a UTF-8 encoded file, you can use codecs.open. Try this small example:
o1 = open('/tmp/o1.txt', 'wb')
try:
    o1.write(u'\u20ac')
except Exception, exc:
    print exc
o1.close()
import codecs
o2 = codecs.open('/tmp/o2.txt', 'w', 'utf-8')
o2.write(u'\u20ac')
o2.close()
Why not use the UnicodeWriter class from the examples in the csv docs: https://docs.python.org/2/library/csv.html#examples ? I think it should solve your problem.
If not, here is a different approach, provided you have Excel: use win32com to dispatch Excel and work with the Excel object model. You can use built-in Excel functions to rename or delete columns, etc., and then save the result as csv.
E.g.
import win32com.client
excelInstance = win32com.client.gencache.EnsureDispatch('Excel.Application')
workbook = excelInstance.Workbooks.Open(filepath)
worksheet = workbook.Worksheets('WorksheetName')
#### do what you like
worksheet.UsedRange.Find('customernumer').Value2 = 'Customer_Number__c'
####
workbook.SaveAs('Filename.csv', 6) #6 means csv in XlFileFormat enumeration
I would like to automate the download of CSV files from the World Bank's dataset.
My problem is that the URL corresponding to a specific dataset does not lead directly to the desired CSV file but is instead a query to the World Bank's API. As an example, this is the URL to get the GDP per capita data: http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv.
If you paste this URL in your browser, it will automatically start the download of the corresponding file. As a consequence, the code I usually use to collect and save CSV files in Python is not working in the present situation:
baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen("%s" %(baseUrl))
myData = csv.reader(remoteCSV)
How should I modify my code in order to download the file coming from the query to the API?
This will get the zip downloaded, open it and get you a csv object with whatever file you want.
import urllib2
import StringIO
from zipfile import ZipFile
import csv
baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen(baseUrl)
sio = StringIO.StringIO()
sio.write(remoteCSV.read())
# We create a StringIO object so that we can work on the results of the request (a string) as though it is a file.
z = ZipFile(sio, 'r')
# We now create a ZipFile object pointed to by 'z' and we can do a few things here:
print z.namelist()
# A list with the names of all the files in the zip you just downloaded
# We can use z.namelist()[1] to refer to 'ny.gdp.pcap.cd_Indicator_en_csv_v2.csv'
with z.open(z.namelist()[1]) as f:
    # Opens the 2nd file in the zip
    csvr = csv.reader(f)
    for row in csvr:
        print row
For more information see ZipFile Docs and StringIO Docs
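The code above is Python 2. A hedged Python 3 sketch of the same idea follows: the downloaded bytes go into io.BytesIO, and the chosen member is wrapped in io.TextIOWrapper so that csv.reader receives text. read_csv_from_zip is a made-up name, the utf-8-sig encoding is an assumption about the files, and the demo builds a small zip in memory instead of calling the API.

```python
import csv
import io
from zipfile import ZipFile


def read_csv_from_zip(zip_bytes, member, encoding="utf-8-sig"):
    """Return the rows of one CSV member of a zip archive held in memory."""
    with ZipFile(io.BytesIO(zip_bytes)) as z:
        with z.open(member) as f:
            # z.open yields bytes; csv.reader needs text in Python 3
            text = io.TextIOWrapper(f, encoding=encoding, newline="")
            return list(csv.reader(text))


# Build a tiny archive to stand in for the API response
buf = io.BytesIO()
with ZipFile(buf, "w") as z:
    z.writestr("meta.csv", "note\nexample\n")
    z.writestr("data.csv", "Country,2020\nNorway,1\n")

rows = read_csv_from_zip(buf.getvalue(), "data.csv")
```

With a live request, zip_bytes would be urllib.request.urlopen(baseUrl).read(), and z.namelist() can be used as above to find the member name.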
import os
import urllib
import zipfile
from StringIO import StringIO
package = StringIO(urllib.urlopen("http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv").read())
zip = zipfile.ZipFile(package, 'r')
pwd = os.path.abspath(os.curdir)
for filename in zip.namelist():
    csv = os.path.join(pwd, filename)
    with open(csv, 'w') as fp:
        fp.write(zip.read(filename))
    print filename, 'downloaded successfully'
From here you can use your approach to handle CSV files.
We have a script to automate access and data extraction for World Bank World Development Indicators like: https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
The script does the following:
Downloading the metadata data
Extracting metadata and data
Converting to a Data Package
The script is python based and uses python 3.0. It has no dependencies outside of the standard library. Try it:
python scripts/get.py
python scripts/get.py https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
You also can read our analysis about data from World Bank:
https://datahub.io/awesome/world-bank
More a suggestion than a solution: you can use pd.read_csv to read a CSV file directly from a URL.
import pandas as pd
data = pd.read_csv('http://url_to_the_csv_file')