How to read pickle files using pyarrow - python

I have a bunch of code for reading multiple pickle files using Pandas:
dfs = []
for filename in glob.glob(os.path.join(path, "../data/simulated-data-raw/", "*.pkl")):
    with open(filename, 'rb') as f:
        temp = pd.read_pickle(f)
        dfs.append(temp)
df = pd.DataFrame()
df = df.append(dfs)
How can I read the files using pyarrow? This way does not work and raises an error:
dfs = []
for filename in glob.glob(os.path.join(path, "../data/simulated-data-raw/", "*.pkl")):
    with open(filename, 'rb') as f:
        temp = pa.read_serialized(f)
        dfs.append(temp)
df = pd.DataFrame()
df = df.append(dfs)

FYI, pyarrow.read_serialized is deprecated; you should use Arrow IPC or the Python standard pickle module when you want to serialize data.
Anyway, I'm not sure what you are trying to achieve: objects saved with pickle are deserialized with exactly the same type they had when saved, so even if you don't use pandas to load the object back, you will still get a pandas DataFrame (as that's what you pickled) and will still need pandas installed to be able to create one.
For example, you can easily get rid of pandas.read_pickle and replace it with plain pickle.load, but what you get back will still be a pandas.DataFrame:
import pandas as pd
original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
pd.to_pickle(original_df, "./dummy.pkl")
import pickle
loaded_back = pickle.load(open("./dummy.pkl", "rb"))
print(loaded_back)
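Applied to the original loop, a minimal sketch (assuming the same directory layout and that every .pkl file holds a DataFrame) replaces both pandas.read_pickle and the deprecated DataFrame.append; pd.concat does the combining here:

import glob
import os
import pickle

import pandas as pd

dfs = []
for filename in glob.glob(os.path.join(path, "../data/simulated-data-raw/", "*.pkl")):
    with open(filename, 'rb') as f:
        dfs.append(pickle.load(f))  # still comes back as a pandas.DataFrame
df = pd.concat(dfs, ignore_index=True)  # pd.concat instead of the deprecated df.append

And if you also control the writing side, the Arrow IPC route mentioned above is available through the Feather format; a small sketch with illustrative file names:

import pyarrow.feather as feather

feather.write_feather(original_df, "./dummy.feather")  # Feather v2 is the Arrow IPC file format
loaded_back = feather.read_feather("./dummy.feather")  # returns a pandas.DataFrame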

Related

CSV dataframe doesn't match with dataframe generated from URL

I have a file that I download from the NREL API. When I try to compare it with an older CSV, I get a difference using the .equals method in pandas, but both files are 100% the same. The only difference is that one data frame is generated from the CSV and the other directly from the API URL.
Below is my code, why is there a difference?
import pandas as pd

NERL_url = "https://developer.nrel.gov/api/alt-fuel-stations/v1.csv?api_key=DEMO_KEY&fuel_type=ELEC&country=all"
outputPath = r"D:\<myPCPath>\nerl.csv"
urlDF = pd.read_csv(NERL_url, low_memory=False)
urlDF.to_csv(outputPath, header=True, index=None, encoding='utf-8-sig')
csv_df = pd.read_csv(outputPath, low_memory=False)
if csv_df.equals(urlDF):
    print("Same")
else:
    print("Different")
My output comes out as Different. How do I fix this, and why is there a difference?
The problem is floating-point precision in read_csv. Set float_precision='round_trip', and then, to compare NaN values, replace them with the same placeholder value on both sides:
NERL_url = "https://developer.nrel.gov/api/alt-fuel-stations/v1.csv?api_key=DEMO_KEY&fuel_type=ELEC&country=all"
outputPath = r"nerl.csv"
urlDF = pd.read_csv(NERL_url, low_memory=False)
urlDF.to_csv(outputPath, header=True, index=None, encoding='utf-8-sig')
csv_df = pd.read_csv(outputPath, low_memory=False, float_precision='round_trip')
if csv_df.fillna('same').equals(urlDF.fillna('same')):
    print("Same")
else:
    print("Different")
Same
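As an alternative sketch (assuming both frames end up with the same columns and dtypes), pandas.testing.assert_frame_equal compares floats with a tolerance and treats NaNs in the same positions as equal, so the fillna trick is not needed for a plain check:

import pandas as pd

# raises an AssertionError describing the first mismatch, otherwise passes silently
pd.testing.assert_frame_equal(urlDF, csv_df, check_exact=False)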

Avoid pandas converting 0,1 to True and False

I am fairly new to pandas. I am reading a list of SQL files from a folder, writing the output to a text file with df.to_csv, and then using those files to upload to Redshift via the COPY command.
One issue I am having is that some of the boolean columns (1, 0) are converted to True/False, which I do not want, because the Redshift COPY then throws an error.
Here is my code
for filename in glob.glob('*.sql'):
    with open(filename, 'r') as f:
        df = pd.read_sql_query(f.read(), conn)
        df['source_file_name'] = output_file_name
        df.to_csv(output_file, sep='\t', index=False, float_format="%.11g")
    f.close()
I do not want to give specific column names in the logic for .astype(int), as I am processing around 100 files with different output columns and different data types.
Also, df * 1 did not work, as it raised an error for a datetime column. Is there a solution for this? I am even okay with manipulating things at the df.to_csv step.
I'm not sure if this is the most efficient solution, but you can check the type of each column and, if it's a boolean type, encode the labels using sklearn's LabelEncoder.
For example:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for i, type_ in enumerate(df.dtypes):
    if type_ == 'bool':
        df.iloc[:, i] = le.fit_transform(df.iloc[:, i])
Just add this code snippet within your for loop, right before saving it as csv.
I found that this works. Gusto's answer made me realize I could play with iloc, and I came up with this solution.
for filename in glob.glob('*.sql'):
    with open(filename, 'r') as f:
        df = pd.read_sql_query(f.read(), conn)
        df['source_file_name'] = output_file_name
        for i, type_ in enumerate(df.dtypes):
            if type_ == 'bool':
                df = df.convert_dtypes(convert_boolean=False)
        df.to_csv(output_file, sep='\t', index=False, float_format="%.11g")
    f.close()
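A lighter-weight alternative sketch (not tested against the original SQL outputs) is to cast only the boolean columns to integers with select_dtypes, which avoids sklearn and leaves datetime columns untouched:

import pandas as pd

# hypothetical frame standing in for the result of pd.read_sql_query
df = pd.DataFrame({"flag": [True, False, True], "value": [1.5, 2.5, 3.5]})

bool_cols = df.select_dtypes(include='bool').columns  # only the bool columns
df[bool_cols] = df[bool_cols].astype(int)  # True/False -> 1/0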

Store (df.info) method output in DataFrame or CSV

I have a giant DataFrame (df) whose dimensions are (42,--- x 135). I'm running df.info on it, but the output is unreadable. I'm wondering if there is any way to dump it into a DataFrame or CSV? I think it has something to do with:
buf : writable buffer, defaults to sys.stdout
Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.
But when I add buf=buffer, the output is just each word of the output followed by a new line, which is very hard to read and work with. My goal is to be able to better understand what columns are in the dataframe and to be able to sort them by type.
You need to open a file then pass the file handle to df.info:
with open('info_output.txt', 'w') as file_out:
    df.info(buf=file_out)
You could try avoiding pandas.DataFrame.info() and instead create the information you need as a pandas.DataFrame:
import pandas as pd

def get_info(df: pd.DataFrame):
    info = df.dtypes.to_frame('dtypes')
    info['non_null'] = df.count()
    info['unique_values'] = df.apply(lambda srs: len(srs.unique()))
    info['first_row'] = df.iloc[0]
    info['last_row'] = df.iloc[-1]
    return info
And write it to CSV with get_info(df).to_csv('info_output.csv').
The memory usage information may also be useful, so you could do:
df.memory_usage().sum()
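Per-column memory usage can also be folded in as an extra column on the result of get_info; a small sketch, assuming the column names line up:

info = get_info(df)
info['memory_bytes'] = df.memory_usage(deep=True, index=False)  # bytes used by each column
info.to_csv('info_output.csv')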
import io

import pandas as pd

df = pd.read_csv('/content/house_price.csv')

buffer = io.StringIO()
df.info(buf=buffer)  # capture the info output in memory
s = buffer.getvalue()

# keep only the per-column table between the "-----" separator and the "dtypes" summary
with open("df_info.csv", "w", encoding="utf-8") as f:
    f.write(s.split(" ----- ")[1].split("dtypes")[0])

di = pd.read_csv('df_info.csv', sep=r"\s+", header=None)
di
Just to build on mechanical_meat's and Adam Safi's combined solution, the following code will convert the info output into a dataframe with no manual intervention:
with open('info_output.txt', 'w') as file_out:
    df.info(buf=file_out)

info_output_df = pd.read_csv('info_output.txt', sep=r"\s+", header=None, index_col=0,
                             engine='python', skiprows=5, skipfooter=2)
Note that according to the docs, the 'skipfooter' option is only compatible with the python engine.

How to read multiple json files into pandas dataframe?

I'm having a hard time loading multiple line delimited JSON files into a single pandas dataframe. This is the code I'm using:
import os, json
import pandas as pd
import numpy as np
import glob

pd.set_option('display.max_columns', None)

temp = pd.DataFrame()
path_to_json = '/Users/XXX/Desktop/Facebook Data/*'
json_pattern = os.path.join(path_to_json, '*.json')
file_list = glob.glob(json_pattern)

for file in file_list:
    data = pd.read_json(file, lines=True)
    temp.append(data, ignore_index=True)
It looks like all the files are loading when I look through file_list, but I cannot figure out how to get each file into a dataframe. There are about 50 files with a couple of lines in each file.
Change the last line to:
temp = temp.append(data, ignore_index = True)
The reason we have to do this is that the append does not happen in place: the append method does not modify the data frame, it just returns a new data frame with the result of the append operation.
Edit:
Since writing this answer I have learned that you should never use DataFrame.append inside a loop because it leads to quadratic copying (see this answer).
What you should do instead is first create a list of data frames and then use pd.concat to concatenate them all in a single operation. Like this:
dfs = []  # an empty list to store the data frames
for file in file_list:
    data = pd.read_json(file, lines=True)  # read data frame from json file
    dfs.append(data)  # append the data frame to the list
temp = pd.concat(dfs, ignore_index=True)  # concatenate all the data frames in the list
This alternative should be considerably faster.
If you need to flatten the JSON, Juan Estevez's approach won't work as is. Here is an alternative:
import json

import pandas as pd

dfs = []
for file in file_list:
    with open(file) as f:
        json_data = pd.json_normalize(json.loads(f.read()))
    dfs.append(json_data)
df = pd.concat(dfs, sort=False)  # or sort=True depending on your needs
Or, if your JSON files are line-delimited (not tested):
import json

import pandas as pd

dfs = []
for file in file_list:
    with open(file) as f:
        for line in f.readlines():
            json_data = pd.json_normalize(json.loads(line))
            dfs.append(json_data)
df = pd.concat(dfs, sort=False)  # or sort=True depending on your needs
from pathlib import Path
import pandas as pd
paths = Path("/home/data").glob("*.json")
df = pd.DataFrame([pd.read_json(p, typ="series") for p in paths])
I combined Juan Estevez's answer with glob. Thanks a lot.
import pandas as pd
import glob

def readFiles(path):
    files = glob.glob(path)
    dfs = []  # an empty list to store the data frames
    for file in files:
        data = pd.read_json(file, lines=True)  # read data frame from json file
        dfs.append(data)  # append the data frame to the list
    df = pd.concat(dfs, ignore_index=True)  # concatenate all the data frames in the list
    return df
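Usage could then look like this (the glob pattern is illustrative):

df = readFiles('/Users/XXX/Desktop/Facebook Data/*.json')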
Maybe you should state whether the JSON files were created with pandas' pd.to_json() or in some other way.
I used data which was not created with pd.to_json(), and I think it is not possible to use pd.read_json() in my case. Instead, I programmed a customized for-each loop approach to write everything into the DataFrames.

how to handle error in reading file containing multiple languages

[screenshot: data trying to read]
I have tried various ways and am still getting errors of different types.
import codecs

f = codecs.open('sampledata.xlsx', encoding='utf-8')
for line in f:
    print(repr(line))
The other way I tried is:
f = open(fname, encoding="ascii", errors="surrogateescape")
Still no luck. Any help?
Newer versions of Pandas support xlsx.
import pandas as pd

file_name = ...  # path to file + file name
sheet = ...      # sheet name or sheet number or list of sheet numbers and names

df = pd.read_excel(io=file_name, sheet_name=sheet)
print(df.head(5))  # print first 5 rows of the dataframe
Works great, especially if you're working with many sheets.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
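For example, a minimal sketch (the file name is illustrative, and reading .xlsx requires the openpyxl engine to be installed):

import pandas as pd

# sheet_name=None reads every sheet into a dict of {sheet name: DataFrame}
sheets = pd.read_excel('sampledata.xlsx', sheet_name=None)
for name, frame in sheets.items():
    print(name, frame.shape)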
