I have a custom Python library function that takes a CSV flat file as input, read using data = open('file.csv', 'r').read(). But currently I have the data in Python as a pandas DataFrame. How can I pass this DataFrame as the flat-file object my custom library function accepts?
As a workaround I'm writing the DataFrame to disk and reading it back with the read function, which adds a second or two to each iteration. I want to avoid this round trip.
If you call the to_csv method of a pandas DataFrame without a path argument, the CSV output is returned as a string. So calling to_csv on your DataFrame produces exactly the same text you were getting by writing the file to disk and reading it back.
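For example, a minimal sketch (my_library_function is a hypothetical stand-in for your custom function):

import io
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})   # stand-in for your DataFrame

# With no path argument, to_csv() returns the CSV text as a string,
# the same text you'd get from open('file.csv', 'r').read()
csv_text = df.to_csv(index=False)
# my_library_function(csv_text)

# If the library expects a file-like object rather than a string,
# wrap the text in io.StringIO instead of touching the disk
csv_buffer = io.StringIO(csv_text)
# my_library_function(csv_buffer)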
Is there a method I can use to output the inferred schema on a large CSV using pandas?
In addition, is there any way to have it tell me, along with the type, whether a column is nullable/blank based on the CSV?
File is about 500k rows with 250 columns.
With my new job, I'm constantly being handed CSV files with zero format documentation.
Is it necessary to load the whole CSV file? You could use the read_csv function if you know the separator (or cat the file first to find it), then use .info():
import pandas as pd

df = pd.read_csv(path_to_file, ...)
df.info()
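If loading all 500k rows is too slow, a rough sketch is to read only a sample and inspect the inferred dtypes and blanks (this assumes the first rows are representative; the nrows value is arbitrary):

import pandas as pd

sample = pd.read_csv(path_to_file, nrows=50000)   # read a sample only
print(sample.dtypes)         # inferred type per column
print(sample.isna().any())   # True where a column has blanks/NaN in the sample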
I made a program to save two arrays into a CSV file using a pandas DataFrame in Python so that I could record all the data.
I tried the code listed below.
import time
import pandas as pd

U_8 = []
start = []
U_8.append(d)                      # d is defined elsewhere in the program
start.append(str(time.time()))
x = pd.DataFrame({'1st': U_8, 'Time Stamp': start})
export_csv = x.to_csv(r'/home/pi/Frames/q8.csv', index=None, header=True)
Every time the program is closed and run again, it overwrites the previous values stored in the CSV file. I expected it to save the new values along with the previous ones. How can I keep both the past and present values in this CSV file?
In order to append to a CSV instead of overwriting it, pass mode='a' to df.to_csv. The default mode is 'w', which overwrites any existing CSV with the same filename. Plain appending, however, also writes the column headers again, and they will appear as data rows in your CSV. To avoid that, pass header=False on your subsequent runs: x.to_csv(r'/home/pi/Frames/q8.csv', mode='a', header=False, index=False).
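A small sketch of that idea, writing the header only when the file does not exist yet (the os.path.exists check and the example row are my additions, not part of your program):

import os
import time
import pandas as pd

path = r'/home/pi/Frames/q8.csv'
x = pd.DataFrame({'1st': [0], 'Time Stamp': [str(time.time())]})  # 0 stands in for d

# Append, and write the header only on the first run so the column
# names don't reappear as data rows on later runs
x.to_csv(path, mode='a', header=not os.path.exists(path), index=False)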
Another way is to read your original CSV back into a DataFrame and use pd.concat to join it with your new results. The workflow is then as follows (sketched after the list):
Read the original csv into a DataFrame.
Create a DataFrame with new results.
Concatenate the two DataFrames.
Write the resulting DataFrame to csv.
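A sketch of that workflow (the example row is made up, and the first run is handled separately since there is nothing to read yet):

import os
import time
import pandas as pd

path = r'/home/pi/Frames/q8.csv'
new_results = pd.DataFrame({'1st': [0], 'Time Stamp': [str(time.time())]})

if os.path.exists(path):
    previous = pd.read_csv(path)                                      # read the original csv
    combined = pd.concat([previous, new_results], ignore_index=True)  # concatenate old + new
else:
    combined = new_results                                            # first run: nothing to read
combined.to_csv(path, index=False)                                    # write everything back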
Assigning the return value of .to_csv to a variable is not necessary; a plain x.to_csv(...) call still writes the CSV (when given a path, to_csv returns None).
I have certain computations performed on a dataset and I need the result to be stored in an external file.
Had I exported it to CSV, I'd have to convert it back to a DataFrame/SFrame to process it further, which again adds lines of code.
Here's the snippet:
import graphlab

train_data = graphlab.SFrame(ratings_base)   # ratings_base is defined earlier
Clearly, it is an SFrame and can be converted to a DataFrame using
df_train = train_data.to_dataframe()
Now that it is a DataFrame, I need it exported to a file without changing its structure, since the exported file will be used as an argument to another Python script. That script must accept a DataFrame, not a CSV.
I have already checked place1, place2, place3, place4 and place5
P.S. - I'm still digging into Python serialization; if anyone can simplify it in this context, that would be helpful.
I'd use the HDF5 format, as it's supported by pandas and by graphlab.SFrame, and besides that HDF5 is very fast.
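A minimal sketch of the HDF5 route with pandas (requires the PyTables package; the path and the 'train' key name are placeholders):

import pandas as pd

df = train_data.to_dataframe()   # the SFrame from the question, converted to pandas
df.to_hdf(r'/path/to/train.h5', key='train', mode='w')

# in the other script, read it straight back into a DataFrame:
df = pd.read_hdf(r'/path/to/train.h5', key='train')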
Alternatively you can export the pandas DataFrame to a pickle file and read it from another script:
sf.to_dataframe().to_pickle(r'/path/to/pd_frame.pickle')
to read it back (from the same or from another script):
df = pd.read_pickle(r'/path/to/pd_frame.pickle')
I'm trying to copy these files over from S3 to Redshift, and they are all in the format of Row(column1=value, column2=value,...), which obviously causes issues. How do I get a dataframe to write out in normal csv?
I'm calling it like this:
final_data.rdd.saveAsTextFile(
    path=r's3n://inst-analytics-staging-us-standard/spark/output',
    compressionCodecClass='org.apache.hadoop.io.compress.GzipCodec'
)
I've also tried writing out with the spark-csv module, and it seems like it ignores any of the computations I did, and just formats the original parquet file as a csv and dumps it out.
I'm calling that like this:
df.write.format('com.databricks.spark.csv').save('results')
The spark-csv approach is a good one and should work. Looking at your code, it seems you are calling df.write on the original DataFrame df, which is why your transformations are being ignored. To get the transformed output, you should do something like:
final_data = ...  # do your logic on df and return a new DataFrame
final_data.write.format('com.databricks.spark.csv').save('results')
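A fuller sketch of the same idea, assuming a Spark 1.x sqlContext is available as in your setup (the bucket paths, column names and filter are made up; the point is that write is called on the transformed DataFrame, and the header option keeps the column names in the CSV):

df = sqlContext.read.parquet('s3n://some-bucket/input')               # original parquet data
final_data = df.filter(df.column1 > 0).select('column1', 'column2')   # your transformations go here

(final_data.write
    .format('com.databricks.spark.csv')
    .option('header', 'true')
    .save('s3n://some-bucket/output'))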
I am using pandas to read a CSV file and convert it into a numpy array. Earlier I was loading the whole file and was getting a memory error, so I went through this link and tried to read the file in chunks.
But now I am getting a different error, which says:
AssertionError: first argument must be a list-like of pandas objects, you passed an object of type "TextFileReader"
This is the code I am using:
>>> X_chunks = pd.read_csv('train_v2.csv', iterator=True, chunksize=1000)
>>> X = pd.concat(X_chunks, ignore_index=True)
The API reference for read_csv says that it returns either a DataFrame or a TextParser. The problem is that the concat function works fine if X_chunks is a DataFrame, but here its type is TextFileReader.
Is there any way I can force the return type of read_csv, or any workaround to load the whole file as a numpy array?
Since iterator=False is the default, and chunksize forces a TextFileReader object, may I suggest:
X_chunks = pd.read_csv('train_v2.csv')
But if you can't materialize the whole file in memory, the final suggestion is to process it in chunks:
X_chunks = pd.read_csv('train_v2.csv', iterator=True, chunksize=1000)
for chunk in X_chunks:
    analyze(chunk)
Where analyze is whatever process you've broken up to analyze the chunks piece by piece, since you apparently can't load the entire dataset into memory.
You can't use concat the way you're trying to: it demands that the data be fully materialized, which makes sense, since you can't concatenate something that isn't there yet.
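If what you ultimately need is numpy data, a rough sketch is to convert each chunk as you go, assuming each chunk (and whatever you accumulate from it) fits in memory:

import pandas as pd

X_chunks = pd.read_csv('train_v2.csv', iterator=True, chunksize=1000)
for chunk in X_chunks:
    arr = chunk.values   # numpy array for this chunk only
    analyze(arr)         # process / accumulate results chunk by chunk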