Why does numpy change the order of columns in a pandas dataframe?

I am reading data from Excel into a pandas DataFrame:
df = pd.read_excel(file, sheet_name='FactoidList', ignore_index=False, sort=False)
Applying sort=False preserves the original order of my columns. But when I apply a numpy condition list, which generates a numpy array, the order of the columns changes.
Numpy orders the columns alphabetically from A to Z and I do not know how I can prevent it. Is there an equivalent to sort=False?
I searched online but could not find a solution. The problem is that I want to re-convert the numpy array to a dataframe in the original format, re-applying the original column names.
ADDITION: code for condition list used in script:
condlist = [f['pers_name'].str.contains('|'.join(qn)) ^ f['pers_name'].isin(qn),
            f['inst_name'].isin(qi),
            f['pers_title'].isin(qt),
            f['pers_function'].isin(qf),
            f['rel_pers'].str.contains('|'.join(qr)) ^ f['rel_pers'].isin(qr)]
choicelist = [f['pers_name'],
              f['inst_name'],
              f['pers_title'],
              f['pers_function'],
              f['rel_pers']]
output = np.select(condlist, choicelist)
print(output)  # this print output already shows an inversion of columns
rows = np.where(output)
new_array = f.to_numpy()
result_array = new_array[rows]

Reviewing my script, I figured out that the problem isn't numpy but pandas.
Before applying my condition list, I append the DataFrame df, which was read with an explicit sort=False, to another DataFrame f with the exact same structure, but I made the wrong assumption that the combined DataFrame would inherit sort=False.
Instead, I had to make it explicit:
f = f.append(df, ignore_index=False, sort=False)
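To get back from the filtered NumPy array to a DataFrame with the original column names (the goal described in the question), a short sketch reusing the variables from the snippet above could be:
result_df = pd.DataFrame(result_array, columns=f.columns)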

Related

Performing a function on dataframe rows

I am trying to winsorize a data set that contains a few hundred columns of data. I'd like to add a new column to the dataframe that contains the winsorized result of its row's data. How can I do this with a pandas dataframe without having to specify each column (I'd like to use all columns)?
Edit: I want to use the function winsorize(list, limits=[0.1, 0.1]), but I'm not sure how to format the dataframe rows so they work as a list.
Some tips:
You may use the pandas function apply with axis=1 to apply a function to every row.
The applied function will receive a pandas Series object, but you can easily convert it to a list using the tolist method.
For example:
df.apply(lambda x: winsorize(x.tolist(), limits=[0.1,0.1]), axis=1)
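If the goal is to store the winsorized values in a new column, as asked above, one option is to keep each row's result as a list (a sketch, assuming df contains only numeric columns):
from scipy.stats.mstats import winsorize
# each row becomes a list of winsorized values stored in a new column
df['winsorized'] = df.apply(lambda row: winsorize(row.tolist(), limits=[0.1, 0.1]).tolist(), axis=1)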
You can also work on the NumPy version of your dataframe, obtained with to_numpy():
from scipy.stats.mstats import winsorize
ma = winsorize(df.to_numpy(), axis=1, limits=[0.1, 0.1])
out = pd.DataFrame(ma.data, index=df.index, columns=df.columns)

Need help converting a merge function from R to Python, shape of resulting df is the same but losing more rows in Python after dropping duplicates

I believe the merge type in R is a left outer join. The merge I implemented in Python returned a dataframe that had the same shape as the resulting merged df in R. However, when I dropped the duplicates (df2.drop_duplicates), 4000 rows were dropped in Python, as opposed to the 50 rows dropped when applying the drop-duplicates function to the post-merge R data frame.
The dataframes I need to merge are df1 and df2.
R:
df2 <- merge(df2[, -which(names(df2) %in% c(column9, column10))], df1[, c(column1, column2, column4, column5)], by.x=c(column1, column2), by.y=c(column2, column4), all.x=T)
Python:
df2 = df2[[column1, column2, column3...column8]].merge(df1[[column1, column2, column4, column5]], how='left', left_on=[column1, column2], right_on=[column2, column4])
df2[column1] and df2[column2] are the columns I want to merge on because their names in df1 are df1[column2] and df1[column4] but have the same row values.
My gut tells me that the issue stems from this portion of the code, which I might be misinterpreting: -which(names(df2) %in% c(column9, column10))
Please feel free to send some tips my way if I'm messing up somewhere
First, subsetting columns with a plain list of labels is no longer recommended in pandas when some of the labels may be missing. Instead, use reindex to subset columns, which handles missing labels.
The pandas translation of R's -which(names(df2) %in% c(column9, column10)) can be ~df2.columns.isin([column9, column10]). Because isin returns a boolean array, use DataFrame.loc to subset:
df2 = (df2.loc[:, ~df2.columns.isin([column9, column10])]
       .merge(df1.reindex([column1, column2, column4, column5], axis='columns'),
              how='left',
              left_on=[column1, column2],
              right_on=[column2, column4])
      )
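To illustrate why reindex is suggested here, a toy sketch with made-up labels: selecting with a plain list raises a KeyError when a label is missing, while reindex keeps the missing column and fills it with NaN.
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# df[['a', 'missing']]  # raises KeyError in recent pandas versions
print(df.reindex(['a', 'missing'], axis='columns'))
#    a  missing
# 0  1      NaN
# 1  2      NaN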

How to find if a value exists in all rows of a dataframe?

I have an array of unique elements and a dataframe.
I want to find out if the elements in the array exist in all the rows of the dataframe.
P.S. I am new to Python.
This is the piece of code I've written.
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            # do something to find out if the element i exists in all rows
Also, this way of iterating is quite expensive, is there any better way to do the same?
Thanks in Advance.
Pandas allows you to filter a whole column as if it were Excel:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2", etc.
df2 = df[ df["Column1"] == "ValueToFind"]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can combine several filters using the logical operators & (and) and | (or).
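For example, two conditions can be combined like this (a sketch with made-up column values; note that each condition needs its own parentheses):
both = df[(df["Column1"] == "ValueToFind") & (df["Column2"] == "OtherValue")]
either = df[(df["Column1"] == "ValueToFind") | (df["Column2"] == "OtherValue")]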
You can try
for i in uniqueArray:
    if newDF['MKT'].str.contains(i).any():  # use .all() instead if i must appear in every row
        # do your task
You can use the isin() method of a pd.Series object.
Assuming you have a data frame named df, you can check whether your column 'MKT' includes any items of your uniqueArray:
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows where the value of MKT is contained in uniqueArray.
Now do your things on new_df, and join/merge/concat to the former df as you wish.

Pandas: after slicing along specific columns, get "values" without returning entire dataframe

Here is what is happening:
df = pd.read_csv('data')
important_region = df[df.columns.get_loc('A'):df.columns.get_loc('C')]
important_region_arr = important_region.values
print(important_region_arr)
Now, here is the issue:
print(important_region.shape)
output: (5,30)
print(important_region_arr.shape)
output: (5,30)
print(important_region)
output: my columns, in the panda way
print(important_region_arr)
output: first 5 rows of the dataframe
How, having indexed my columns, do I transition to the numpy array?
Alternatively, I could just convert to numpy from the get-go and run the slicing operation within numpy. But, how is this done in pandas?
Here is how you can slice the dataframe along specific columns. loc gives you access to a group of rows and columns: the part before the comma selects rows, the part after selects columns, and a : means all the rows.
df.loc[:, 'A':'C']
For more understanding, please look at the documentation.
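To get the NumPy array the question asks for, the column slice can be combined with .to_numpy() (or .values on older pandas); a sketch using the df from the question:
important_region_arr = df.loc[:, 'A':'C'].to_numpy()
print(important_region_arr.shape)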

Exported and imported DataFrames differ but should be the same

I tried to import some data from an Excel file to a pandas DataFrame, convert it into a csv file and read it back in (need to do some further file based handling on that exported csv file later on, so that is a necessary step).
For the sake of data integrity, exported and re-imported data should be the same. So, I compared the different DataFrames and found that they are not the same, at least according to pandas' .equals() function.
I thought this might be an issue related to string encoding when exporting and re-importing the data since I had to transfer char encoding etc. while file handling. However, I was able to reproduce similar behavior without any encoding-related issues as follows:
import pandas as pd
import numpy as np
# https://stackoverflow.com/a/32752318
df1 = pd.DataFrame(np.random.randint(0, 10, size=(10, 4)), columns=list('ABCD'))
df1.to_csv('foo.csv', index=False)
df2 = pd.read_csv('foo.csv')
df1.to_csv('bar.csv', index=True)
df3 = pd.read_csv('bar.csv')
print(df1.equals(df2), df1.equals(df3), df2.equals(df3))
print(all(df1 == df2))
Why does .equals() say that the DataFrames differ, but all(df1 == df2) says they are equal? According to the docs, .equals() even considers NaNs at the same locations to be equal, whereas df1 == df2 does not. Because of this, comparing DataFrames with .equals() should be less strict than df1 == df2, yet it does not return the same result in the example I provided.
Which criteria do df1 == df2 and df1.equals(df2) consider that I am not aware of? I assume that the implementation inside pandas is correct (I did not look into the implementation in the code itself, but export and re-import should be a standard interface test case). What am I doing wrong then?
I think that df1.equals(df2) returns False because it takes the DataFrame dtypes into account. df1 should have int32 columns, while df2 should have int64 columns (you can use the info() method to verify it).
You can specify the df2 dtype as follows in order to match the dtype of df1:
df2 = pd.read_csv('foo.csv', dtype=np.int32)
If the dtypes are the same, .equals() should return True.
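A quick way to see the mismatch is to compare the dtypes directly; a small check based on the df1/df2 from the question (assuming the values themselves round-tripped unchanged):
print(df1.dtypes)
print(df2.dtypes)
print(df1.equals(df2.astype(df1.dtypes.to_dict())))  # True once the dtypes match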
When you write a dataframe to .csv with index=True, reading it back adds an extra column named Unnamed: 0. That's why both .equals() and all(df1 == df2) say the dataframes are different. But if you write the .csv with index=False, it will not add an extra column and the re-imported .csv will match the input dataframe.
If you don't care about the dataframe index, you can set index=False while writing the dataframe to .csv, or use pd.read_csv('bar.csv').drop(['Unnamed: 0'], axis=1) while reading the csv.
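Alternatively, the extra column can be mapped back onto the index while reading, instead of being dropped afterwards (a sketch; the result may still differ from df1 if the column dtypes differ, as noted in the previous answer):
df3 = pd.read_csv('bar.csv', index_col=0)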
