Avoid pandas converting 0,1 to True and False - python

I am fairly new to pandas. I am reading a list of SQL files from a folder, writing the query output to a text file using df.to_csv, and then uploading those files to Redshift using the COPY command.
One issue I am having is that some of the boolean columns (1, 0) are converted to True/False, which I do not want, as the Redshift COPY then throws an error.
Here is my code:
for filename in glob.glob('*.sql'):
    with open(filename, 'r') as f:
        df = pd.read_sql_query(f.read(), conn)
        df['source_file_name'] = output_file_name
        df.to_csv(output_file, sep='\t', index=False, float_format="%.11g")
I do not want to hard-code specific column names in an .astype(int) step, as I am processing around 100 files with different output columns and different datatypes.
Also, df * 1 did not work, as it raised an error for the datetime columns. Is there a solution for this? I am even okay with handling it at the df.to_csv step.

I'm not sure if this is the most efficient solution, but you can check the type of each column and, if it is boolean, encode the labels using sklearn's LabelEncoder.
For example:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for i, type_ in enumerate(df.dtypes):
    if type_ == 'bool':
        df.iloc[:, i] = le.fit_transform(df.iloc[:, i])
Just add this code snippet within your for loop, right before saving it as csv.
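For what it's worth, a pandas-only sketch of the same idea that avoids the sklearn dependency (assuming plain 0/1 integers are all that is needed):

# Select every bool column by dtype and cast it to int, so True/False
# are written out as 1/0.
bool_cols = df.select_dtypes(include='bool').columns
df[bool_cols] = df[bool_cols].astype(int)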

I found that this works. Gusto's answer made me realize I could play with iloc, and I came up with this solution.
for filename in glob.glob('*.sql'):
    with open(filename, 'r') as f:
        df = pd.read_sql_query(f.read(), conn)
        df['source_file_name'] = output_file_name
        # With convert_boolean=False, bool columns fall through to the
        # nullable-integer conversion and are written out as 1/0, while
        # datetime columns are left untouched.
        if (df.dtypes == 'bool').any():
            df = df.convert_dtypes(convert_boolean=False)
        df.to_csv(output_file, sep='\t', index=False, float_format="%.11g")
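A quick way to sanity-check the conversion before writing, as a standalone sketch (the column names here are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "flag": [True, False],
    "when": pd.to_datetime(["2021-01-01", "2021-01-02"]),
})
out = df.convert_dtypes(convert_boolean=False)
print(out.dtypes)                          # flag becomes Int64, when stays datetime64[ns]
print(out.to_csv(sep="\t", index=False))   # flag serializes as 1/0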

How to create a new CSV from a CSV with separated cells

I created a function to convert the CSV.
The main topic is: I get a CSV file like:
,features,corr_dropped,var_dropped,uv_dropped
0,AghEnt,False,False,False
and I want to convert it to another CSV file:
features
corr_dropped
var_dropped
uv_dropped
0
AghEnt
False
False
False
I created a function for that, but it is not working. The output is the same as the input file.
The function:
import os

import pandas as pd

def convert_file():
    input_file = "../input.csv"
    output_file = os.path.splitext(input_file)[0] + "_converted.csv"
    df = pd.read_table(input_file, sep=',')
    df.to_csv(output_file, index=False, header=True, sep=',')
You could use
df = pd.read_csv(input_file)
This works with your data. There is not much difference, though. The only thing that changes is that the empty space before the first delimiter now shows up as Unnamed: 0.
Is that what you wanted? (I'm still not entirely sure what you are trying to achieve, as you are importing a CSV and exporting the same data as a CSV without really doing anything with it. The output example you showed is just a reformatted version of your initial data, but formatting is not something CSV itself can do.)
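If the goal really is the output shown above, with every header and cell on its own line, here is a minimal sketch (assuming the first, unnamed column is the row index):

import os

import pandas as pd

def convert_file(input_file="../input.csv"):
    output_file = os.path.splitext(input_file)[0] + "_converted.csv"
    df = pd.read_csv(input_file, index_col=0)
    # One value per line: the header names first, then, for each row,
    # the index value followed by that row's cell values.
    values = list(df.columns)
    for idx, row in df.iterrows():
        values.append(idx)
        values.extend(row.tolist())
    pd.Series(values).to_csv(output_file, index=False, header=False)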

CSV dataframe doesn't match dataframe generated from URL

I have a file that I download from the NREL API. When I try to compare it with an older CSV, I get a difference using the .equals method in pandas, even though both files are 100% the same. The only difference is that one dataframe is generated from the CSV and the other directly from the API URL.
Below is my code. Why is there a difference?
import pandas as pd

NERL_url = "https://developer.nrel.gov/api/alt-fuel-stations/v1.csv?api_key=DEMO_KEY&fuel_type=ELEC&country=all"
outputPath = r"D:\<myPCPath>\nerl.csv"

urlDF = pd.read_csv(NERL_url, low_memory=False)
urlDF.to_csv(outputPath, header=True, index=None, encoding='utf-8-sig')
csv_df = pd.read_csv(outputPath, low_memory=False)

if csv_df.equals(urlDF):
    print("Same")
else:
    print("Different")
My output comes out as Different. How do I fix this, and why is there a difference?
The problem is float precision in read_csv: set float_precision='round_trip' when reading the file back. NaN values also never compare equal, so replace them with the same placeholder value on both sides:
NERL_url = "https://developer.nrel.gov/api/alt-fuel-stations/v1.csv?api_key=DEMO_KEY&fuel_type=ELEC&country=all"
outputPath = r"nerl.csv"

urlDF = pd.read_csv(NERL_url, low_memory=False)
urlDF.to_csv(outputPath, header=True, index=None, encoding='utf-8-sig')
csv_df = pd.read_csv(outputPath, low_memory=False, float_precision='round_trip')

if csv_df.fillna('same').equals(urlDF.fillna('same')):
    print("Same")
else:
    print("Different")

Same
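An alternative sketch using pandas' own testing helper, which treats NaNs in the same positions as equal and reports where the frames differ (shown with the variables from the snippet above):

import pandas as pd

try:
    # check_exact=False compares floats with a tolerance instead of bit-for-bit
    pd.testing.assert_frame_equal(csv_df, urlDF, check_exact=False)
    print("Same")
except AssertionError as err:
    print("Different:", err)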

How to read pickle files using pyarrow

I have a bunch of code for reading multiple pickle files using Pandas:
dfs = []
for filename in glob.glob(os.path.join(path, "../data/simulated-data-raw/", "*.pkl")):
    with open(filename, 'rb') as f:
        temp = pd.read_pickle(f)
        dfs.append(temp)

df = pd.DataFrame()
df = df.append(dfs)
How can I read the files using pyarrow? The approach below does not work and raises an error:
dfs = []
for filename in glob.glob(os.path.join(path, "../data/simulated-data-raw/", "*.pkl")):
    with open(filename, 'rb') as f:
        temp = pa.read_serialized(f)
        dfs.append(temp)

df = pd.DataFrame()
df = df.append(dfs)
FYI, pyarrow.read_serialized is deprecated; you should use the Arrow IPC format or Python's standard pickle module when you need to serialize data.
Anyway, I'm not sure what you are trying to achieve. Saving objects with pickle means they deserialize with the exact same type they had on save, so even if you don't use pandas to load the object back, you will still get a pandas DataFrame (as that's what you pickled) and will still need pandas installed to be able to create one.
For example, you can easily get rid of pandas.read_pickle and replace it with plain pickle.load, but what you get back will still be a pandas.DataFrame:
import pickle

import pandas as pd

original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
pd.to_pickle(original_df, "./dummy.pkl")

with open("./dummy.pkl", "rb") as f:
    loaded_back = pickle.load(f)
print(loaded_back)
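Since the answer points at Arrow IPC, here is a minimal sketch of that route using the Feather file format, assuming the data can simply be re-saved (pyarrow.feather reads it back natively, without pickle):

import pandas as pd
import pyarrow.feather as feather

df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
# Write using the Arrow IPC (Feather) format instead of pickle ...
feather.write_feather(df, "dummy.feather")
# ... and read it back; the result is a pandas.DataFrame.
loaded = feather.read_feather("dummy.feather")
print(loaded)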

Store df.info() method output in a DataFrame or CSV

I have a giant DataFrame (df) whose dimensions are (42,--- x 135). I'm running df.info on it, but the output is unreadable. I'm wondering if there is any way to dump it into a DataFrame or CSV. I think it has something to do with:
buf : writable buffer, defaults to sys.stdout
    Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.
But when I add buf=buffer, the output is just each word of the output on a new line, which is very hard to read/work with. My goal is to better understand what columns are in the dataframe and to be able to sort them by type.
You need to open a file then pass the file handle to df.info:
with open('info_output.txt', 'w') as file_out:
    df.info(buf=file_out)
You could avoid pandas.DataFrame.info() and instead build the information that you need as a pandas.DataFrame:
import pandas as pd

def get_info(df: pd.DataFrame):
    info = df.dtypes.to_frame('dtypes')
    info['non_null'] = df.count()
    info['unique_values'] = df.apply(lambda srs: len(srs.unique()))
    info['first_row'] = df.iloc[0]
    info['last_row'] = df.iloc[-1]
    return info
And write it to CSV with get_info(df).to_csv('info_output.csv').
The memory usage information may also be useful, so you could do:
df.memory_usage().sum()
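A quick illustration of both on a tiny frame (the column values here are made up):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, None], "b": ["x", "y", "y"]})
info = get_info(df)
info.to_csv('info_output.csv')      # one row per column: dtype, counts, sample values
print(df.memory_usage().sum())      # total bytes, including the index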
import io

import pandas as pd

df = pd.read_csv('/content/house_price.csv')

buffer = io.StringIO()
df.info(buf=buffer)
s = buffer.getvalue()

# Keep only the per-column table: drop everything up to the '-----'
# separator row and everything from the 'dtypes' summary onwards.
with open("df_info.csv", "w", encoding="utf-8") as f:
    f.write(s.split(" ----- ")[1].split("dtypes")[0])

di = pd.read_csv('df_info.csv', sep=r"\s+", header=None)
di
Just to build on mechanical_meat's and Adam Safi's combined solution, the following code will convert the info output into a dataframe with no manual intervention:
with open('info_output.txt', 'w') as file_out:
    df.info(buf=file_out)

info_output_df = pd.read_csv('info_output.txt', sep=r"\s+", header=None, index_col=0, engine='python', skiprows=5, skipfooter=2)
Note that according to the docs, the 'skipfooter' option is only compatible with the python engine.

How to fix data getting loaded into a single column of a pandas dataframe?

I have the following code:
import pandas as pd

file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'
dataset2 = pd.read_csv(file_path, header=None, dtype=str)

v = dataset2.values
f = pd.factorize(v.ravel())[0].reshape(v.shape)
dataset1 = pd.DataFrame(f)
df = dataset1.astype('str')
dataset = df.values.tolist()

print(type(dataset))
print(type(dataset[1]))
print(type(dataset[1][1]))
The goal is to map every distinct value in the dataset to a value from 1..n, and afterwards to transform it into a list of lists where each element is a string.
The above code works great. However, when I change the dataset to:
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/vowel/vowel-context.data'
I get an error. How can I make it work for this dataset as well?
You need to understand the data you're working with. A quick print call would've helped you realise that the delimiters in this one are different.
Furthermore, it appears to be numeric data; you don't need the str conversion anymore.
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/vowel/vowel-context.data'
t = pd.read_csv(file_path, header=None, delim_whitespace=True)
v = t.values
f = pd.factorize(v.ravel())[0].reshape(v.shape)
df = pd.DataFrame(f)
If you want pandas to guess the delimiter, you can pass sep=None, which requires the python engine (it infers the separator with csv.Sniffer):
t = pd.read_csv(file_path, header=None, sep=None, engine='python')
I don't recommend this because it is very easy for pandas to make mistakes when loading your data with an inferred delimiter.
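In keeping with the "quick print call" advice above, a small sketch for peeking at the raw lines before choosing a parser (urllib is only needed because file_path here is a URL):

import urllib.request

# Print the first few raw lines so the delimiter is visible.
with urllib.request.urlopen(file_path) as resp:
    for _ in range(3):
        print(resp.readline())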
