I'm trying to modify a CSV that is uploaded to my Flask application. I have logic that works just fine when I don't upload the file through Flask.
import pandas as pd
import StringIO

with open('example.csv') as f:
    data = f.read()

# swap the double quotes around fields for single quotes
data = data.replace(',"', ",'")
data = data.replace('",', "',")

df = pd.read_csv(StringIO.StringIO(data), header=None, sep=',', quotechar="'")
print df.head(10)
I upload it to Flask and access it using
f = request.files['data_file']
When I run it through the code above, replacing open('example.csv') with open(f), I get the following error
coercing to Unicode: need string or buffer, FileStorage found
I have figured out that the problem is the file type. I can't call open on my file, because open expects a file name, and when the file is uploaded to Flask it is a FileStorage instance that gets passed to open. However, I don't know how to make this work. I've tried skipping the open call and just using data = f.read(), but that doesn't work. Any suggestions?
Thanks
FileStorage is a file-like wrapper around the incoming data. You can pass it directly to read_csv.
pd.read_csv(request.files['data_file'])
You most likely should not be performing those replace calls on the data, since the CSV parser should handle quoting for you and the naive replacement can corrupt data inside quoted columns. However, if you still need to, you can read the data out just like you were before.
data = request.files['data_file'].read()
If your data has a mix of quoting styles, you should fix the source of your data.
Answering my own question in case someone else needs this.
FileStorage objects have a .stream attribute which will be an io.BytesIO
f = request.files['data_file']
df = pandas.read_csv(f.stream)
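For completeness, here is a minimal sketch of how this might look inside a full view function. The /upload route, the data_file field name, and the response are placeholders, and it assumes the single-quote quoting from the snippet above:

import pandas as pd
from flask import Flask, request

app = Flask(__name__)

@app.route('/upload', methods=['POST'])
def upload():
    f = request.files['data_file']  # werkzeug FileStorage, file-like
    # pandas can read straight from the file-like object
    df = pd.read_csv(f, header=None, sep=',', quotechar="'")
    return str(df.shape)  # placeholder response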
I am using a resume-parsing Python library that accepts a PDF file and returns JSON. The code is as simple as below:
parsed_data = ResumeParser("file.pdf").get_extracted_data()
I wanted to expose an API around this, and in the API the PDF data is sent as a base64 string. So, I first write the data to a file and then run the above code. My current code looks like this:
from base64 import b64decode

def parse(b64data):
    bytes = b64decode(b64data, validate=True)
    f = open('tmp_file.pdf', 'wb')
    f.write(bytes)
    f.close()
    parsed_data = ResumeParser("tmp_file.pdf").get_extracted_data()
    return parsed_data
Is there any better approach that would let me avoid writing the data to a file? I am exposing this API as a serverless function, and I think I can save time by skipping the write.
References:
https://github.com/OmkarPathak/pyresparser (Library Used)
The library that you are using appears to accept a BytesIO object as an alternative to passing it a string that contains a filename. However, it also appears to expect that this BytesIO object has a name attribute from which it extracts an extension so it can determine the filetype. So, we will add a bogus name attribute that contains the string .pdf to our BytesIO object.
So, we should be able to use something like this:
import io, base64

def parse(b64data):
    bytes = base64.b64decode(b64data, validate=True)
    bytesio = io.BytesIO(bytes)
    bytesio.name = '.pdf'
    parsed_data = ResumeParser(bytesio).get_extracted_data()
    return parsed_data
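To show how this might be called, here is a hedged usage sketch; resume.pdf is a hypothetical local file used only to build a base64 payload like the one the API would receive:

import base64

# hypothetical sample file, only used to produce a base64 string
with open('resume.pdf', 'rb') as fh:
    b64data = base64.b64encode(fh.read()).decode('ascii')

parsed = parse(b64data)  # the parse() defined above
print(parsed)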
I have old code below that gzips a file and stores it as JSON in S3, using the io library (so a file does not get saved locally). I am having trouble converting this same approach (i.e. using the io library for a buffer) to create a .txt file, push it into S3, and later retrieve it. I know how to create txt files and push them into S3, but not how to use io in the process.
The value I want stored in the text file is just a variable holding the string 'test'.
Goal: use the io library to save a string variable as a text file in S3 and be able to pull it down again.
x = 'test'

inmemory = io.BytesIO()
with gzip.GzipFile(fileobj=inmemory, mode='wb') as fh:
    with io.TextIOWrapper(fh, encoding='utf-8', errors='replace') as wrapper:
        wrapper.write(json.dumps(x, ensure_ascii=False, indent=2))

inmemory.seek(0)
s3_resource.Object(s3bucket, s3path + '.json.gz').upload_fileobj(inmemory)
inmemory.close()
Also, can anyone recommend documentation on the io library and writing to files? The actual documentation (e.g. f = io.StringIO("some initial text data"), etc. at https://docs.python.org/3/library/io.html) just did not give me enough at my current level.
This looks like a duplicate. For the sake of brevity: it turns out there is a way to override the putObject call so that it takes a string of text instead of a file. The original post is answered in Java, but that additional thread should be sufficient for a Python-specific answer.
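In boto3 terms that usually means either passing the string straight to put(Body=...), or wrapping it in an in-memory buffer exactly like the gzip example. A minimal sketch, assuming s3_resource, s3bucket and s3path are defined as in the original snippet:

import io

x = 'test'

# upload: wrap the encoded string in an in-memory buffer
buf = io.BytesIO(x.encode('utf-8'))
s3_resource.Object(s3bucket, s3path + '.txt').upload_fileobj(buf)

# download: pull it back into another buffer and decode
out = io.BytesIO()
s3_resource.Object(s3bucket, s3path + '.txt').download_fileobj(out)
print(out.getvalue().decode('utf-8'))  # 'test'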
I am breaking my head over this right now. I am new to parquet files, and I am running into a LOT of issues with them.
I get an error that reads OSError: Passed non-file path: \datasets\proj\train\train.parquet each time I try to create a DataFrame from the file.
I've tried this:
pq.read_pandas(r'E:\datasets\proj\train\train.parquet').to_pandas()
AND
od = pd.read_parquet(r'E:\datasets\proj\train\train.parquet', engine='pyarrow')
I also changed the drive letter of the drive the dataset resides on, and it's the SAME THING!
It's the same with all engines.
PLEASE HELP!
This might be a problem with Arrow's file path handling. You could instead pass in an already opened file:
import pandas as pd

with open(r'E:\datasets\proj\train\train.parquet', 'rb') as f:
    df = pd.read_parquet(f, engine='pyarrow')
Try using fastparquet as the engine; it worked for me.
engine = "fastparquet"
I collected some tweets from the Twitter API and stored them in MongoDB. I exported the data to a JSON file without any issues, but when I tried to write a Python script to read the JSON and convert it to a CSV, I got this traceback error:
json.decoder.JSONDecodeError: Extra data: line 367 column 1 (char 9745)
So, after digging around the internet I was pointed to check the actual JSON data in an online validator, which I did. This gave me the error of:
Multiple JSON root elements
from the site https://jsonformatter.curiousconcept.com/
Now, the problem is, I haven't found anything on the internet of how to handle that error. I'm not sure if it's an error with the data I've collected, exported, or if I just don't know how to work with it.
My end game with these tweets is to make a network graph. I was looking at either Networkx or Gephi, which is why I'd like to get a csv file.
Robert Moskal is right. If you can address the issue at the source and use the --jsonArray flag when you run mongoexport, that will make the problem easier. If you can't address it at the source, then read the points below.
The code below will extract you the individual json objects from the given file and convert them to python dictionaries.
You can then apply your CSV logic to each individual dictionary.
If you are using the csv module, I would suggest the unicodecsv module instead, as it handles the Unicode data in your JSON objects.
import json

with open('path_to_your_json_file', 'rb') as infile:
    json_block = []
    for line in infile:
        json_block.append(line)
        if line.startswith('}'):
            # a closing brace at the start of a line ends one JSON document
            json_dict = json.loads(''.join(json_block))
            json_block = []
            print json_dict
If you want to convert it to CSV using pandas you can use the below code:
import json, pandas as pd

with open('path_to_your_json_file', 'rb') as infile:
    json_block = []
    dictlist = []
    for line in infile:
        json_block.append(line)
        if line.startswith('}'):
            json_dict = json.loads(''.join(json_block))
            dictlist.append(json_dict)
            json_block = []

df = pd.DataFrame(dictlist)
df.to_csv('out.csv', encoding='utf-8')
If you want to flatten out the json object you can use pandas.io.json.json_normalize() method.
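For example, a small sketch continuing from the dictlist built above (in newer pandas releases the same function is exposed as pd.json_normalize, so the import path depends on your version):

from pandas.io.json import json_normalize

# one row per tweet, nested fields expanded into dotted column names
flat_df = json_normalize(dictlist)
flat_df.to_csv('out_flat.csv', encoding='utf-8', index=False)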
Elaborating on @MYGz's suggestion to use --jsonArray:
Your post doesn't show how you exported the data from mongo. If you use the following via the terminal, you will get valid json from mongodb:
mongoexport --collection=somecollection --db=somedb --jsonArray --out=validfile.json
Replace somecollection, somedb and validfile.json with your target collection, target database, and desired output filename respectively.
The following will NOT give you the results you are looking for: mongoexport --collection=somecollection --db=somedb --out=validfile.json. This is because, by default, mongoexport writes data using one JSON document for every MongoDB document.
A bit of a late reply, and I am not sure this was available at the time the question was posted. Anyway, there is now a simple way to import the mongoexport JSON data as follows:
df = pd.read_json(filename, lines=True)
mongoexport writes each line as a JSON object itself, instead of the whole file as a single JSON document.
I am trying to create a Python script that can take a JSON object and insert it into a headless Couchbase server. I have been able to successfully connect to the server and insert some data. I'd like to be able to specify the path of a JSON object and upsert that.
So far I have this:
from couchbase.bucket import Bucket
from couchbase.exceptions import CouchbaseError
import json
cb = Bucket('couchbase://XXX.XXX.XXX?password=XXXX')
print cb.server_nodes
#tempJson = json.loads(open("myData.json","r"))
try:
    result = cb.upsert('healthRec', {'record': 'bob'})
    # result = cb.upsert('healthRec', {'record': tempJson})
except CouchbaseError as e:
    print "Couldn't upsert", e
    raise
print(cb.get('healthRec').value)
I know that the first commented-out line that loads the JSON is incorrect, because json.loads is expecting a string, not a file object... Can anyone help?
Thanks!
Figured it out:
with open('myData.json', 'r') as f:
    data = json.load(f)

try:
    result = cb.upsert('healthRec', {'record': data})
I am looking into using cbdocloader, but this was my first step getting this to work. Thanks!
I know that you've found a solution that works for you in this instance but I thought I'd correct the issue that you experienced in your initial code snippet.
json.loads() takes a string as input and decodes the JSON string into a dictionary (or whatever custom object you use based on the object_hook), which is why you were seeing the issue: you were passing it a file handle.
There is actually a method json.load() which works as expected, as you have used in your eventual answer.
You would have been able to use it as follows (if you wanted something slightly less verbose than the with statement):
tempJson = json.load(open("myData.json","r"))
As Kirk mentioned, though, if you have a large number of JSON documents to insert, it might be worth taking a look at cbdocloader, as it handles all of this boilerplate code for you (with appropriate error handling and other functionality).
This readme covers the uses of cbdocloader and how to format your data correctly to allow it to load your documents into Couchbase Server.
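As a rough middle ground before switching to cbdocloader, a simple loop over a directory of JSON files works with the Bucket API already shown; the records/ directory and the filename-as-key scheme below are just placeholders:

import glob, json, os

# hypothetical directory of *.json documents; the filename (minus the
# extension) is used as the document key
for path in glob.glob('records/*.json'):
    with open(path, 'r') as f:
        doc = json.load(f)
    key = os.path.splitext(os.path.basename(path))[0]
    cb.upsert(key, doc)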