I have a file that I download from the NREL API. When I try to compare it with an older CSV, I get a difference using the .equals method in pandas, even though both files are 100% the same. The only difference is that one DataFrame is generated from the CSV and the other directly from the API URL.
Below is my code; why is there a difference?
import pandas as pd
NERL_url = "https://developer.nrel.gov/api/alt-fuel-stations/v1.csv?api_key=DEMO_KEY&fuel_type=ELEC&country=all"
outputPath = r"D:\<myPCPath>\nerl.csv"
urlDF = pd.read_csv(NERL_url, low_memory=False)
urlDF.to_csv(outputPath, header=True,index=None, encoding='utf-8-sig')
csv_df = pd.read_csv(outputPath, low_memory=False)
if csv_df.equals(urlDF):
    print("Same")
else:
    print("Different")
My output comes out as Different. How do I fix this, and why does this difference occur?
The problem is floating-point precision in read_csv: pass float_precision='round_trip' when reading the file back. NaN values also never compare equal to each other, so replace them with a common placeholder before comparing:
NERL_url = "https://developer.nrel.gov/api/alt-fuel-stations/v1.csv?api_key=DEMO_KEY&fuel_type=ELEC&country=all"
outputPath = r"nerl.csv"
urlDF = pd.read_csv(NERL_url, low_memory=False)
urlDF.to_csv(outputPath, header=True,index=None, encoding='utf-8-sig')
csv_df = pd.read_csv(outputPath, low_memory=False, float_precision='round_trip')
if csv_df.fillna('same').equals(urlDF.fillna('same')):
    print("Same")
else:
    print("Different")
Output: Same
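Alternatively, here is a sketch using pandas' own testing helper, which treats NaNs in matching positions as equal and compares floats with a tolerance:
import pandas as pd

try:
    # NaNs in the same positions compare equal; check_exact=False
    # tolerates tiny float round-trip differences
    pd.testing.assert_frame_equal(csv_df, urlDF, check_exact=False)
    print("Same")
except AssertionError:
    print("Different")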
I have been trying to build a script in Python that pulls the info from a set of CSV files. The CSV has no header and the format is as follows: ['Day','Hour','Seconds','Microsecods','x_accel','y_accel']. Instead of putting the values into the corresponding columns, pandas is pulling the values and making them a single string, like " 9,40,19,65664,-0.527,-0.333", in the first column. I tried using dtype and sep=',' but it did not work. I don't understand why it does not fit them properly into the right columns.
This is my script:
import numpy as np
import os
import pandas as pd
os.chdir('C:/Users/pc/Desktop/41x/Learning_set/Bearing1_1')
path = os.getcwd()
files = os.listdir(path)
df = pd.DataFrame()
columns = ['Day','Hour','Seconds','Microsecods','x_accel','y_accel']
for f in files:
    data = pd.read_csv(f, 'Sheet1', header=None, engine='python', names=columns)
    df = df.append(data)
print(df)
(Screenshots of the DataFrame output and of the raw CSV accompanied the original question.)
You're using the read_csv function, but your second positional argument is telling it that the separator is 'Sheet1':
pd.read_csv(f, 'Sheet1', header=None, engine='python', names=columns)
Is it a CSV, or is it from an Excel file? If it is a CSV, you can most likely just remove that argument and it will work as expected.
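For example, a minimal corrected version of the loop might look like this (a sketch: pd.concat replaces the deprecated DataFrame.append, and the folder and column names are taken from the question):
import os
import pandas as pd

folder = 'C:/Users/pc/Desktop/41x/Learning_set/Bearing1_1'
columns = ['Day', 'Hour', 'Seconds', 'Microsecods', 'x_accel', 'y_accel']

# default comma separator; the stray 'Sheet1' argument is gone
frames = [pd.read_csv(os.path.join(folder, f), header=None, names=columns)
          for f in os.listdir(folder)]
df = pd.concat(frames, ignore_index=True)
print(df)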
I am fairly new to pandas. I am reading a list of SQL files from a folder, writing the output to a text file using df.to_csv, and then uploading those files to Redshift with the COPY command.
One issue I am having is that some of the boolean columns (1/0) are converted to True/False, which I do not want, as the Redshift COPY throws an error.
Here is my code:
for filename in glob.glob('*.sql'):
    with open(filename, 'r') as f:
        df = pd.read_sql_query(f.read(), conn)
        df['source_file_name'] = output_file_name
        df.to_csv(output_file, sep='\t', index=False, float_format="%.11g")
        f.close()
I do not want to name specific columns in the .astype(int) logic, as I am processing around 100 files with different output columns and different datatypes.
Also, df * 1 did not work, as it raised an error for the datetime columns. Is there a solution for this? I am even okay with manipulating it at df.to_csv.
I'm not sure if this is the most efficient solution, but you can check the type of each column and, if it's a boolean type, encode the labels using sklearn's LabelEncoder.
For example:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for i, type_ in enumerate(df.dtypes):
    if type_ == 'bool':
        df.iloc[:, i] = le.fit_transform(df.iloc[:, i])
Just add this code snippet inside your for loop, right before saving the dataframe as a CSV.
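As a pandas-only alternative (a sketch, assuming plain 0/1 integers are the goal and no sklearn dependency is wanted), you could cast the bool columns directly:
# select every bool column and cast it to int in one pass
bool_cols = df.select_dtypes(include='bool').columns
df[bool_cols] = df[bool_cols].astype(int)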
I found that this works. Gusto's answer made me realize I could play with iloc, and I came up with this solution.
for filename in glob.glob('*.sql'):
    with open(filename, 'r') as f:
        df = pd.read_sql_query(f.read(), conn)
        df['source_file_name'] = output_file_name
        for i, type_ in enumerate(df.dtypes):
            if type_ == 'bool':
                df = df.convert_dtypes(convert_boolean=False)
        df.to_csv(output_file, sep='\t', index=False, float_format="%.11g")
        f.close()
I have the following code:
import pandas as pd
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'
dataset2 = pd.read_csv(file_path, header=None, dtype=str)
v = dataset2.values
f = pd.factorize(v.ravel())[0].reshape(v.shape)
dataset1 = pd.DataFrame(f)
df = dataset1.astype('str')
dataset = df.values.tolist()
print (type (dataset))
print (type (dataset[1]))
print (type (dataset[1][1]))
The goal is to transform the dataset so that each distinct value is replaced by an integer code from 1..n, and afterwards to turn it into a list of lists where each element is a string.
The above code works great. However, when I change the dataset to:
file_path ='https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/vowel/vowel-context.data'
I get an error. How can I make it work for this dataset as well?
You need to understand the data you're working with. A quick print call would have helped you realise that the delimiter in this one is different.
Furthermore, it appears to be numeric data; you don't need the str conversion anymore.
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/vowel/vowel-context.data'
t = pd.read_csv(file_path, header=None, delim_whitespace=True)
v = t.values
f = pd.factorize(v.ravel())[0].reshape(v.shape)
df = pd.DataFrame(f)
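Note that on recent pandas versions (2.2+) delim_whitespace is deprecated; an equivalent sketch uses a regex separator instead:
# r'\s+' matches any run of whitespace, same effect as delim_whitespace=True
t = pd.read_csv(file_path, header=None, sep=r'\s+')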
If you want pandas to guess the delimiter, you can pass sep=None (delimiter sniffing requires the python engine):
t = pd.read_csv(file_path, header=None, sep=None, engine='python')
I don't recommend this, because it is very easy for pandas to make mistakes when loading your data with an inferred delimiter.
I'm trying to save specific columns to a CSV using pandas. However, there is only one line in the output file. Is there anything wrong with my code? My desired output is to save every column where d.count() > 1 (as in the code below) to a CSV file.
import pandas as pd
results = pd.read_csv('employee.csv', sep=';', delimiter=';', low_memory=False)
results['index'] = results.groupby('Name').cumcount()
d = results.pivot(index='index', columns='Name', values='Job')
for columns in d:
    if d[columns].count() > 1:
        d[columns].dropna(how='any').to_csv('output.csv')
Each call to to_csv inside the loop overwrites output.csv, so only the last qualifying column survives. An alternative is to populate a new dataframe containing everything you want to save, and then save it one time.
import pandas as pd
results = pd.read_csv('employee.csv', sep=';', delimiter=';', low_memory=False)
results['index'] = results.groupby('Name').cumcount()
d = results.pivot(index='index', columns='Name', values='Job')
keepcols = []
for columns in d:
    if d[columns].count() > 1:
        keepcols.append(columns)
# the kept names are columns of the pivot d, not of results
output_df = d[keepcols]
output_df.to_csv('output.csv')
No doubt you could rationalise the above and reduce the memory footprint by saving the output directly without first creating an object to hold it, but this form helps make clear what's going on in the example.
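For instance, a condensed sketch of the same selection using boolean indexing on the column counts:
# keep only the columns with more than one non-null value, write once
d.loc[:, d.count() > 1].to_csv('output.csv')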
I am trying the Kaggle challenge here, and unfortunately I am stuck at a very basic step.
I am trying to read the datasets into a pandas dataframe by executing following command:
test = pd.DataFrame.from_csv("C:/Name/DataMining/hillary/data/output/emails.csv")
The problem is that this file, as you will find, has over 300,000 records, but I am reading only 7945:
print (test.shape)
(7945, 21)
Now, I have double-checked the file and I cannot find anything special about line number 7945. Any pointers as to why this could be happening?
I think it is better to use the read_csv function with the parameters quoting=csv.QUOTE_NONE and error_bad_lines=False (see the read_csv documentation):
import pandas as pd
import csv
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, error_bad_lines=False)
print (test.shape)
#(381422, 22)
But some (problematic) data will be skipped.
If you want to skip the email body data, you can use:
import pandas as pd
import csv
test = pd.read_csv(
    "output/Emails.csv",
    quoting=csv.QUOTE_NONE,
    sep=',',
    error_bad_lines=False,
    header=None,
    names=[
        "Id", "DocNumber", "MetadataSubject", "MetadataTo", "MetadataFrom",
        "SenderPersonId", "MetadataDateSent", "MetadataDateReleased",
        "MetadataPdfLink", "MetadataCaseNumber", "MetadataDocumentClass",
        "ExtractedSubject", "ExtractedTo", "ExtractedFrom", "ExtractedCc",
        "ExtractedDateSent", "ExtractedCaseNumber", "ExtractedDocNumber",
        "ExtractedDateReleased", "ExtractedReleaseInPartOrFull",
        "ExtractedBodyText", "RawText"])
print (test.shape)
#delete rows with NaN in column MetadataFrom
test = test.dropna(subset=['MetadataFrom'])
#delete the header rows repeated in the data
test = test[test.MetadataFrom != 'MetadataFrom']
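Note that error_bad_lines was deprecated in pandas 1.3 and removed in 2.0; on current pandas the equivalent sketch would be:
import csv
import pandas as pd

# on_bad_lines='skip' is the modern replacement for error_bad_lines=False
test = pd.read_csv("output/Emails.csv", quoting=csv.QUOTE_NONE, on_bad_lines='skip')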