i am having a huge pickle file which needs to be updated in every 3 hrs from a dailydata file(a csv file.)
there are two field named TRX_DATE and TIME_STAMP in each two having values like 24/11/2015 and 24/11/2015 10:19:02 respectively.(also 50 additionl fields are there)
so what i am doing is first reading the huge pickle to a dataframe. Then dropping any values for today's date by comparing with TRX_DATE field.
Then reading that csv file to another dataframe. then appending both dataframe and again creating new pickle.
my scripts looks like
import pandas as pd
import datetime as dt
import pickle
df = pd.read_pickle('hugedata pickle')
Today = dt.datetime.today()
df = df[(df.TRX_DATE > Today)] #delete any entries for today in main pickle
df1 = pd.read_csv(daily data csv file)
df = df.append(df1,ignore_index=True)
df.to_pickle('same huge data pickle')
problem is as follows
1.it is taking huge memory as well as time reading that huge pickle.
2.i need to append df1 to df and only columns from df should only remain and it should exclude if any new column from df1 getting appended. But i am getting new column values having NUN values at so many places.
So need assistance on these things
1.is there way that i will read the small sized csv only and append to pickle file ...(or reading that pickle is mandatory)
2.can it be done like converting the csv to pickle and merge two pickles. by load ,dump method (actually never used that)
3.how to read time from TIME_STAMP field and getting datas between two timestamp (filtering by TIME_STAMP).and upadting that to main pickle.previously i am filtering by TRX_DATE values.
Is there a better way--- please suggest.
HDF5 is made for what you are trying to do.
import tables
import numpy as np
from pandas import HDFStore,DataFrame
df.to_hdf('test.h5',key='test1') # create an hdf5 file
pd.read_hdf('test.h5',key='test1') # read an hdf5 file
df.to_hdf() defaults to append mode.
Related
I have a big size excel files that I'm organizing the column names into a unique list.
The code below works, but it takes ~9 minutes!
Does anyone have suggestions for speeding it up?
import pandas as pd
import os
get_col = list(pd.read_excel("E:\DATA\dbo.xlsx",nrows=1, engine='openpyxl').columns)
print(get_col)
Using pandas to extract just the column names of a large excel file is very inefficient.
You can use openpyxl for this:
from openpyxl import load_workbook
wb = load_workbook("E:\DATA\dbo.xlsx", read_only=True)
columns = {}
for sheet in worksheets:
for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
columns = value
Assuming you only have one sheet, you will get a tuple of column names here.
If you want faster reading, then I suggest you use other type files. Excel, while convenient and fast are binary files, therefore for pandas to be able to read it and correctly parse it must use the full file. Using nrows or skipfooter to work with less data with only happen after the full data is loaded and therefore shouldn't really affect the waiting time. On the opposite, when working with a .csv() file, given its type and that there is no significant metadata, you can just extract the first rows of it as an interable using the chunksize parameter in pd.read_csv().
Other than that, using list() with a dataframe as value, returns a list of the columns already. So my only suggestion for the code you use is:
get_col = list(pd.read_excel("E:\DATA\dbo.xlsx",nrows=1, engine='openpyxl'))
The stronger suggestion is to change datatype if you specifically want to address this issue.
please see attached photo
here's the image
I only need to import a specific column with conditions(such as specific data found in that column). And also, I only need to remove unnecessary columns. dropping them takes too much code. What specific code or syntax is applicable?
How to get a column from pandas dataframe is answered in Read specific columns from a csv file with csv module?
To quote:
Pandas is spectacular for dealing with csv files, and the following
code would be all you need to read a csv and save an entire column
into a variable:
import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name #you can also use df['column_name']
So in your case, you just save the the filtered data frame in a new variable.
This means you do newdf = data.loc[...... and then use the code snippet from above to extract the column you desire, for example newdf.continent
I have a daily process where I read in a historical parquet dataset and then concatenate that with a new file each day. I'm trying to optimize memory by making better use of arrows dictionary arrays. I want to avoid doing round trip to pandas systematically (and without defining columns) to get categoricals.
I'm wondering how to do this in pyarrow.
I currently do:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv as csv
historical_table = pq.read_table(historical_pq_path)
new_table = (pa.Table.from_pandas(csv.read_csv(new_file_path)
.to_pandas(strings_to_categorical=True,
split_blocks=True,
self_destruct=True))
)
combined_table = pa.concat_tables([historical_table, new_table])
I process many files and would like to avoid having to maintain a schema for each file where I list the dictionary columns of each column and use that as read options to csv. The convenience of going to pandas with no column specification using strings_to_categorical=True is really nice. From what I've seen there isn't a way to do something like strings_to_dict natively in pyarrow.
Is there clean a way to do this in just pyarrow?
I have a dictionary as follows:
my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}
I want to save this dictionary in Databricks in order for me not to obtain it every time I want to start working with it. Furthermore, I would like to know how to retrieve it and have it in its original form again.
I have tried doing the following:
from itertools import zip_longest
column_names, data = zip(*my_dict.items())
spark.createDataFrame(zip_longest(*data), column_names).show()
and
column_names, data = zip(*dict_brands.items())
spark.createDataFrame(zip(*data), column_names).show()
However, I get the following error:
zip_longest argument #10342 must support iteration
I also do not know how to reload it or upload it. I tried with a sample dataframe (not the same one), as follows:
df.write.format("tfrecords").mode("overwrite").save('/data/tmp/my_df')
And the error is:
Attribute name "my_column" contains invalid character(s)
among " ,;{}()\n\t=". Please use alias to rename it.
Finally, in order to obtain it, I thought about:
my_df = spark.table("my_df") # Get table
df = my_df.toPandas() # Make pd dataframe
and then make it a dictionary, but maybe there is an easier way than making it a dataframe and then retrieving as dataframe and converting into dictionary back again.
I would also like to know the computational cost for the solutions, since the actual dataset is very large.
Here is my sample code for realizing your needs step by step.
Convert a dictionary to a Pandas dataframe
my_dict = {'a':[12,15.2,52.1],'b':[2.5,2.4,5.2],'c':[1.2,5.3,12]}
import pandas as pd
pdf = pd.DataFrame(my_dict)
Convert a Pandas dataframe to a PySpark dataframe
df = spark.createDataFrame(pdf)
To save a PySpark dataframe to a file using parquet format. Format tfrecords is not supported at here.
df.write.format("parquet").mode("overwrite").save('/data/tmp/my_df')
To load the saved file above as a PySpark dataframe.
df2 = spark.read.format("parquet").load('/data/tmp/my_df')
To convet a PySpark dataframe to a dictionary.
my_dict2 = df2.toPandas().to_dict()
The computational cost of these code above is depended on the memory usage for your actual dataset.
I have large data-frame in a Csv file sample1 from that i have to generate a new Csv file contain only 100 data-frame.i have generate code for it.but i am getting key Error the label[100] is not in the index?
I have just tried as below,Any help would be appreciated
import pandas as pd
data_frame = pd.read_csv("C:/users/raju/sample1.csv")
data_frame1 = data_frame[:100]
data_frame.to_csv("C:/users/raju/sample.csv")`
`
The correct syntax is with iloc:
data_frame.iloc[:100]
A more efficient way to do it is to use nrows argument who purpose is exactly to extract portions of files. This way you avoid wasting resources and time parsing useless rows:
import pandas as pd
data_frame = pd.read_csv("C:/users/raju/sample1.csv", nrows=101) # 100+1 for header
data_frame.to_csv("C:/users/raju/sample.csv")