Exported and imported DataFrames differ but should be the same - python

I tried to import some data from an Excel file into a pandas DataFrame, export it to a CSV file, and read it back in (I need to do some further file-based handling on that exported CSV file later on, so this is a necessary step).
For the sake of data integrity, the exported and re-imported data should be the same. So I compared the DataFrames and found that they are not the same, at least according to pandas' .equals() method.
I thought this might be an issue related to string encoding when exporting and re-importing the data, since I had to handle character encoding during file handling. However, I was able to reproduce similar behaviour without any encoding-related issues, as follows:
import pandas as pd
import numpy as np
# https://stackoverflow.com/a/32752318
df1 = pd.DataFrame(np.random.randint(0, 10, size=(10, 4)), columns=list('ABCD'))
df1.to_csv('foo.csv', index=False)
df2 = pd.read_csv('foo.csv')
df1.to_csv('bar.csv', index=True)
df3 = pd.read_csv('bar.csv')
print(df1.equals(df2), df1.equals(df3), df2.equals(df3))
print(all(df1 == df2))
Why does .equals() say that the DataFrames differ, while all(df1 == df2) says they are equal? According to the docs, .equals() even considers NaNs in the same locations to be equal, whereas df1 == df2 does not. In that sense, comparing DataFrames with .equals() is less strict than df1 == df2, yet it does not return the same result in the example I provided.
Which criteria do df1 == df2 and df1.equals(df2) consider that I am not aware of? I assume that the implementation inside pandas is correct (I did not look at the implementation itself, but export and re-import should be a standard interface test case). What am I doing wrong, then?

I think that df1.equals(df2) returns False because it takes the DataFrame dtypes into account. df1 should have int32 columns, while df2 should have int64 columns (you can use the info() method to verify this).
You can specify the dtype for df2 as follows in order to match the dtype of df1:
df2 = pd.read_csv('foo.csv', dtype=np.int32)
If the dtypes are the same, .equals() should return True.
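A minimal sketch of that check, reusing foo.csv from the question (the int32 default typically applies on Windows; on other platforms df1 may already be int64):
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randint(0, 10, size=(10, 4)), columns=list('ABCD'))
df1.to_csv('foo.csv', index=False)
df2 = pd.read_csv('foo.csv')
# Compare the dtypes: df1 is typically int32 on Windows, df2 is inferred as int64
print(df1.dtypes, df2.dtypes, sep='\n')
# Re-read with a dtype matching df1; .equals() should now return True
df2 = pd.read_csv('foo.csv', dtype=df1['A'].dtype)
print(df1.equals(df2))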

When you write a DataFrame to .csv with index=True, re-importing it adds an extra column named Unnamed: 0. That's why both .equals() and all(df1 == df2) report that the DataFrames are different. But if you write the .csv with index=False, no extra column is added and the .csv you read back matches the input DataFrame.
If you don't care about the DataFrame index, you can set index=False while writing the DataFrame to .csv, or use pd.read_csv('bar.csv').drop(['Unnamed: 0'], axis=1) while reading the csv.
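If you do keep index=True when writing, a small sketch of reading it back cleanly (assuming bar.csv from the question):
import pandas as pd
# Treat the first column as the index instead of letting it become 'Unnamed: 0'
df3 = pd.read_csv('bar.csv', index_col=0)
# Or drop the spurious column after reading, as described above
df3_alt = pd.read_csv('bar.csv').drop(['Unnamed: 0'], axis=1)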

Related

Why does numpy change the order of columns in pandas dataframe?

I am reading data from EXCEL to a pandas DataFrame:
df = pd.read_excel(file, sheet_name='FactoidList', ignore_index=False, sort=False)
Applying sort=False preserves the original order of my columns. But when I apply a numpy condition list, which generates a numpy array, the order of the columns changes.
NumPy orders the columns alphabetically from A to Z and I do not know how to prevent it. Is there an equivalent to sort=False?
I searched online but could not find a solution. The problem is that I want to convert the numpy array back to a DataFrame in the original format, re-applying the original column names.
ADDITION: code for condition list used in script:
condlist = [f['pers_name'].str.contains('|'.join(qn)) ^ f['pers_name'].isin(qn),
            f['inst_name'].isin(qi),
            f['pers_title'].isin(qt),
            f['pers_function'].isin(qf),
            f['rel_pers'].str.contains('|'.join(qr)) ^ f['rel_pers'].isin(qr)]
choicelist = [f['pers_name'],
              f['inst_name'],
              f['pers_title'],
              f['pers_function'],
              f['rel_pers']]
output = np.select(condlist, choicelist)
print(output)  # this print output already shows an inversion of columns
rows = np.where(output)
new_array = f.to_numpy()
result_array = new_array[rows]
Reviewing my script, I figured out that the problem isn't numpy but pandas.
Before applying my condition list, I append the DataFrame df (read with the explicit sort=False) to another DataFrame f with the exact same structure, but I wrongly assumed that the new combined DataFrame would inherit sort=False.
Instead, I had to make it explicit:
f = f.append(df, ignore_index=False, sort=False)
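If the goal is then to turn the filtered NumPy array back into a DataFrame with the original column names and order, a small sketch (assuming f and result_array from the snippet above):
import pandas as pd
# result_array was built from f.to_numpy(), so its columns are still in f's original order;
# re-attach the original column labels rather than letting anything re-sort them
result_df = pd.DataFrame(result_array, columns=f.columns)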

Concatenate 2 Rows to be header/column names

I have an excel sheet that is really poorly formatted. The actual column names I would like to use are spread across two rows; for example, if the correct column name should be Labor Percent, cell A1 contains Labor and cell A2 contains Percent.
I try to load the file, here's what I'm doing:
import os
os.getcwd()
os.chdir(r'xxx')
import pandas as pd
file = 'problem.xls'
xl = pd.ExcelFile(file)
print(xl.sheet_names)
df = xl.parse('WEEKLY NUMBERS', skiprows=35)
The remainder of what should be the column name ends up in the second row of the DataFrame. Is there a way to rename the columns by concatenating the two rows? Can this somehow be done with the header= argument in the xl.parse call?
You can rename the columns yourself by setting:
df.columns = ['name1', 'name2', 'name3' ...]
Note that you must specify a name for every column.
Then drop the first row to get rid of the unwanted row of column names.
df = df.drop(0)
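For example, with hypothetical names standing in for the real headers (assuming df from the question's xl.parse call and a two-column sheet):
# Hypothetical names; replace with the actual combined headers from your sheet
df.columns = ['Labor Percent', 'Overtime Hours']
# Drop the leftover second header row (label 0) and reset the index
df = df.drop(0).reset_index(drop=True)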
Here's something you can try. Essentially it reads in the first two rows as your header, but treats it as a hierarchical multi-index. The second line of code below then flattens that multi-index down to a single string. I'm not 100% certain it will work for your data but is worth a try - it worked for the small dummy test data I tried it with:
df = pd.read_excel('problem.xlsx', sheet_name='WEEKLY NUMBERS', header=[0, 1])
df.columns = df.columns.map(' '.join)
The second line was taken from this answer about flattening a multi-index.
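As a rough illustration with a dummy two-level header (made-up data, not the asker's file):
import pandas as pd
# Dummy frame shaped like the one read_excel(..., header=[0, 1]) would produce
cols = pd.MultiIndex.from_tuples([('Labor', 'Percent'), ('Overtime', 'Hours')])
df = pd.DataFrame([[0.25, 10], [0.30, 12]], columns=cols)
# Flatten the two header rows into single strings
df.columns = df.columns.map(' '.join)
print(df.columns.tolist())  # ['Labor Percent', 'Overtime Hours']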

deleting a range of rows in Enthought Canopy

Here is my Pandas DataFrame:
import pandas as pd
dfa = df = pd.read_csv("twitDB3__org.csv")
dfa.drop([7-100], axis=0, inplace=True)
Output
ValueError: labels [-93] not contained in axis
I am new to Canopy and want to delete a range of rows, but it seems to require each row individually. I would appreciate any help.
a) I think you want dfa.drop(range(7,101),... (What you did was just subtract 100 from 7 and pass the result (-93) as the label to drop.)
b) Note that this will also change df, because as you've written it, df and dfa are just two names for the same mutable object. If you want to end up with two different dataframes, then either make an explicit copy, or don't use inplace, and save the result: df2 = df.drop(...
c) This is a pandas question, not a canopy question. Canopy provides 500+ Python packages, and while it's true that pandas is one of the more popular of these, there is a whole pandas community out there.
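A short sketch of (a) and (b) together, assuming the same CSV file as in the question:
import pandas as pd
df = pd.read_csv('twitDB3__org.csv')
# Drop the rows labelled 7 through 100 without mutating df; keep the result under a new name
dfa = df.drop(list(range(7, 101)), axis=0)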

Convert an Object dtype column to Number dtype in a dataframe Pandas

While trying to answer this question, Get List of Unique String per Column, we ran into a different problem with my dataset. When I import this CSV file into a dataframe, every column has OBJECT dtype; we need to convert the columns that contain only numbers to a real (number) dtype, and those that are not numbers to a String dtype.
Is there a way to achieve this?
Download the data sample from here
I have tried the following code from the article Pandas: change data type of columns, but it did not work.
df = pd.DataFrame(a, columns=['col1','col2','col3'])
As always thanks for your help
Option 1
use pd.to_numeric in an apply
df.apply(pd.to_numeric, errors='ignore')
Option 2
use pd.to_numeric on df.values.ravel
cvrtd = pd.to_numeric(df.values.ravel(), errors='coerce').reshape(-1, len(df.columns))
pd.DataFrame(np.where(np.isnan(cvrtd), df.values, cvrtd), df.index, df.columns)
Note
These are not exactly the same. For a column that contains mixed values, option 2 converts what it can, while option 1 leaves everything in that column as object. Looking at your file, I'd choose option 1.
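A tiny illustration of that difference, using made-up data rather than the linked file:
import numpy as np
import pandas as pd
# col1 holds only numeric strings, col2 mixes numbers and text
df = pd.DataFrame({'col1': ['1', '2', '3'], 'col2': ['4', 'x', '6']})
# Option 1: the mixed column stays object because not every value converts
opt1 = df.apply(pd.to_numeric, errors='ignore')
print(opt1.dtypes)
# Option 2: convert value by value, keeping the original where coercion fails
cvrtd = pd.to_numeric(df.values.ravel(), errors='coerce').reshape(-1, len(df.columns))
opt2 = pd.DataFrame(np.where(np.isnan(cvrtd), df.values, cvrtd), df.index, df.columns)
print(opt2)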
Timing
df = pd.read_csv('HistorianDataSample/HistorianDataSample.csv', skiprows=[1, 2])

How to add values from one dataframe into another ignoring the row indices

I have a pandas DataFrame called trg_data to collect data that I am producing in batches. Each batch is produced by a sub-routine as a smaller DataFrame df with the same number of columns but fewer rows, and I want to insert the values from df into trg_data at a new row position each time.
However, when I use the following statement, df is always inserted at the top (i.e. rows 0 to len(df)).
trg_data.iloc[trg_pt:(trg_pt + len(df))] = df
I'm guessing but I think the reason may be that even though the slice indicates the desired rows, it is using the index in df to decide where to put the data.
As a test I found that I can insert an ndarray at the right position no problem:
trg_data.iloc[trg_pt:(trg_pt + len(df))] = np.ones(df.shape)
How do I get it to ignore the index in df and insert the data where I want it? Or is there an entirely different way of achieving this? At the end of the day I just want to create the dataframe trg_data and then save to file at the end. I went down this route because there didn't seem to be a way of easily appending to an existing dataframe.
I've been working at this for over an hour and I can't figure out what to google to find the right answer!
I think I may have the answer (I thought I had already tried this but apparently not):
trg_data.iloc[trg_pt:(trg_pt + len(df))] = df.values
Still, I'm open to other suggestions. There's probably a better way to add data to a dataframe.
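A short illustration of the alignment issue the question describes, with made-up shapes:
import numpy as np
import pandas as pd
trg_data = pd.DataFrame(np.zeros((6, 2)), columns=['a', 'b'])
df = pd.DataFrame(np.ones((2, 2)), columns=['a', 'b'])  # its index is 0, 1
trg_pt = 3
# Assigning df directly aligns on df's index (0 and 1), as reported in the question;
# assigning the bare array writes purely by position, filling rows 3 and 4 as intended
trg_data.iloc[trg_pt:trg_pt + len(df)] = df.values
print(trg_data)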
The way I would do this is to save all the intermediate dataframes in a list, and then concatenate them together:
import pandas as pd
dfs = []
# get all the intermediate dataframes somehow
# combine into one dataframe
trg_data = pd.concat(dfs)
Both
trg_data = pd.concat([df1, df2, ... dfn], ignore_index=True)
and
trg_data = pd.DataFrame()
for ...:  # loop that generates df
    trg_data = trg_data.append(df, ignore_index=True)  # you can reuse the name df
should work for you.
