Pandas: how to concisely detrend subset of columns - python

I would like to calculate and subtract the average over a subset of columns. Here is one way to do it:
#!/usr/bin/env python3
import numpy as np
import pandas as pd

def col_avg(df, col_ids):
    '''Calculate and subtract the average over *col_ids*.

    *df* is modified in-place.
    '''
    cols = [df.columns[i] for i in col_ids]
    acc = df[cols[0]].copy()
    for col in cols[1:]:
        acc += df[col]
    acc /= len(cols)
    for col in cols:
        df[col] -= acc
# Create example data
np.random.seed(42)
df = pd.DataFrame(data=np.random.random((433, 80)) + np.arange(433)[:, np.newaxis],
                  columns=['col-%d' % x for x in range(80)])
#df = pd.DataFrame.from_csv('data.csv')

# Calculate average over columns 2, 3 and 6
df_old = df.copy()
col_avg(df, [1, 2, 5])
assert any(df_old.iloc[0] != df.iloc[0])
Now, I don't particularly like the two for loops, so I tried to express the same operation more concisely:
def col_avg(df, col_ids):
    dfT = df.T
    mean = dfT.iloc[col_ids].mean()
    dfT.iloc[col_ids] -= mean
This implementation looks a lot nicer (IMO), but it has one drawback: it only works for some datasets. With the example above, it works. But e.g. when loading this csv file it fails.
The only explanation that I have is that in some cases the dfT.iloc[col_ids] expression must be internally creating a copy of the value array instead of modifying it in-place.
Is this the right explanation?
If so, what is it about the DataFrame that makes pandas decide to copy the data in one case but not the other?
Is there another way to perform this task that always works and does not require explicit iteration?
EDIT: When suggesting alternative implementations, please state why you think your implementation will always work. After all, the above code seems to work for some inputs as well.

The transpose of the DataFrame, dfT = df.T, may return a new DataFrame, not a view.
In that case, modifying dfT does nothing to df.
In your toy example,
df = pd.DataFrame(data=np.random.random((433, 80)) + np.arange(433)[:, np.newaxis],
                  columns=['col-%d' % x for x in range(80)])
all the columns have the same dtype:
In [83]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 433 entries, 0 to 432
Data columns (total 80 columns):
col-0 433 non-null float64
col-1 433 non-null float64
col-2 433 non-null float64
...
dtypes: float64(80)
memory usage: 274.0 KB
whereas in the DataFrame built from CSV, some columns have int64 dtype:
In [55]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 492 entries, 0 to 491
Data columns (total 72 columns):
sample no 492 non-null int64
index 492 non-null int64
plasma-r 492 non-null float64
plasma-z 492 non-null float64
...
Columns of a DataFrame always have a single dtype. So when you transpose this
CSV-based df, the new DataFrame cannot be formed by simply transposing a
single underlying NumPy array. The integers that used to occupy columns of
their own are now spread across rows; since each column of df.T must have a
single dtype, the integers are upcast to floats. As a result, all the columns
of df.T have dtype float64, and data has to be copied when dtypes change.
The bottom line: when df has mixed dtypes, df.T is a copy.
col_avg could be simplified to
def col_avg2(df, col_ids):
    means = df.iloc[:, col_ids].mean(axis=1)
    for i in col_ids:
        df.iloc[:, i] -= means
Note that the expression df.iloc[:, col_ids] will return a copy since col_ids is not a basic slice. But assignment to df.iloc[...] (or df.loc[...]) is guaranteed to modify df.
This is why assigning to df.iloc or df.loc is the recommended way to avoid the assignment-with-chained-indexing pitfall.
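As a usage sketch (small hypothetical data), here is col_avg2 applied to a mixed-dtype frame, which is exactly the case where the transpose-based version would silently operate on a copy:

import pandas as pd

def col_avg2(df, col_ids):
    means = df.iloc[:, col_ids].mean(axis=1)
    for i in col_ids:
        df.iloc[:, i] -= means

# Mixed dtypes: one int64 column plus two float64 columns
df = pd.DataFrame({'n':  [10, 20, 30],
                   'c1': [1.0, 2.0, 4.0],
                   'c2': [3.0, 2.0, 0.0]})

col_avg2(df, [1, 2])   # detrends 'c1' and 'c2' in place; 'n' is untouched
print(df)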

From my understanding of the question, this does what you are asking. I don't understand why you're transposing the dataframe. Note: I got rid of the string column names for simplicity, but you can replace those easily.
np.random.seed(42)
df = pd.DataFrame(data=np.random.random((6, 8)) + np.arange(6)[:, np.newaxis])
                       # columns=['col-%d' % x for x in range(80)]

# Calculate average over columns 2, 3 and 6
df_old = df.copy()
col_ids = [1, 2, 5]
df[col_ids] = df[col_ids] - np.mean(df[col_ids].values)
df_old - df  # to make sure the average is calculated over all three columns
Out[139]:
0 1 2 3 4 5 6 7
0 0 2.950637 2.950637 0 0 2.950637 0 0
1 0 2.950637 2.950637 0 0 2.950637 0 0
2 0 2.950637 2.950637 0 0 2.950637 0 0
3 0 2.950637 2.950637 0 0 2.950637 0 0
4 0 2.950637 2.950637 0 0 2.950637 0 0
5 0 2.950637 2.950637 0 0 2.950637 0 0

OP, say you want the average computed over similar columns and subtracted (say the "Psi at ..." columns). The easiest way is
df = pd.read_csv('data.csv')
psi_cols = [c for c in df.columns if c.startswith('Psi')]
df[psi_cols] -= df[psi_cols].mean().mean()
This computes a single overall mean across all the selected columns. If you want to subtract each column's own mean instead, do
df[psi_cols] -= df[psi_cols].mean()
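A small sketch (with hypothetical "Psi" columns) of the difference between the two: the first removes one overall mean from every value, while the second relies on alignment on column labels, so each column loses its own mean.

import pandas as pd

df = pd.DataFrame({'Psi a': [1.0, 3.0],
                   'Psi b': [10.0, 30.0]})

overall = df - df.mean().mean()   # subtract the single grand mean (11.0)
per_col = df - df.mean()          # df.mean() is a Series indexed by column,
                                  # so 'Psi a' loses 2.0 and 'Psi b' loses 20.0
print(overall)
print(per_col)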

Related

How to build DataFrame row-by-row with correct dtypes?

I have a DataFrame that I build row-by-row (by necessity). My issue is that at the end, all dtypes are object. This is not so if the DataFrame is created with all the data at once.
Let me explain what I mean.
import pandas as pd
from IPython.display import display
# Example data
cols = ['P/N','Date','x','y','z']
PN = ['10a1','10a2','10a3']
dates = pd.to_datetime(['2022-07-01','2022-07-03','2022-07-05'])
xd = [0,1,2]
yd = [1.1,1.2,1.3]
zd = [-0.8,0.,0.8]
# Canonical way to build DataFrame (if you have all the data ready)
dg = pd.DataFrame({'P/N':PN,'Date':dates,'x':xd,'y':yd,'z':zd})
display(dg)
dg.dtypes
Here's what I get. Note the correct dtypes:
OK, now I do the same thing row-by-row:
# Build empty DataFrame
cols = ['P/N','Date','x','y','z']
df = pd.DataFrame(columns=cols)
# Add rows in loop
for i in range(3):
    new_row = {'P/N': PN[i], 'Date': pd.to_datetime(dates[i]), 'x': xd[i], 'y': yd[i], 'z': zd[i]}
    # deprecated
    #df = df.append(new_row, ignore_index=True)
    df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
display(df)
df.dtypes
Note the [] around new_row, otherwise you get a stupid error. (I really don't understand the deprecation of append, BTW. It allows for much more readable code)
But now I get this, and it isn't the same as above: all the dtypes are object!
The only way I found to recover my dtypes is to use infer_objects:
# Recover dtypes by using infer_objects()
dh = df.infer_objects()
dh.dtypes
And dh is now the same as dg.
Note that even if I do
df = pd.concat([df,pd.DataFrame([new_row]).infer_objects()],ignore_index=True)
above, it still does not work. I believe this is due to a bug in concat: when an empty DataFrame is concatenated with a non-empty DataFrame, the result fails to take over the dtypes of the second DataFrame. We can verify this with:
pd.concat([pd.DataFrame(),df],ignore_index=True).dtypes
and all dtypes are still object.
Is there a better way to build a DataFrame row-by-row and have the correct dtypes inferred automatically?
Your initial dataframe df = pd.DataFrame(columns=cols) has all its columns of type object because there is no data to infer dtypes from, so you need to set them with astype. As #mozway commented, it is recommended to collect the rows in a list inside the loop and build the DataFrame from it afterwards.
I expect this to work for you:
import pandas as pd
cols = ['P/N','Date','x','y','z']
dtypes = ['object', 'datetime64[ns]', 'int64', 'float64', 'float64']
PN = ['10a1','10a2','10a3']
dates = pd.to_datetime(['2022-07-01','2022-07-03','2022-07-05'])
xd = [0,1,2]
yd = [1.1,1.2,1.3]
zd = [-0.8,0.,0.8]
df = pd.DataFrame(columns=cols).astype({c:d for c,d in zip(cols,dtypes)})
new_rows = []
for i in range(3):
    new_row = [PN[i], pd.to_datetime(dates[i]), xd[i], yd[i], zd[i]]
    new_rows.append(new_row)
df_new = pd.concat([df, pd.DataFrame(new_rows, columns=cols)], axis=0)
print(df_new)
print(df_new.info())
Output:
P/N Date x y z
0 10a1 2022-07-01 0 1.1 -0.8
1 10a2 2022-07-03 1 1.2 0.0
2 10a3 2022-07-05 2 1.3 0.8
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 P/N 3 non-null object
1 Date 3 non-null datetime64[ns]
2 x 3 non-null int64
3 y 3 non-null float64
4 z 3 non-null float64
dtypes: datetime64[ns](1), float64(2), int64(1), object(1)
memory usage: 144.0+ bytes
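For completeness, a minimal sketch of the list-collecting approach mentioned above: gather the rows first and build the DataFrame in one call, so pandas infers the dtypes itself, without an empty frame or an explicit astype (same toy data as in the question).

import pandas as pd

cols = ['P/N', 'Date', 'x', 'y', 'z']
PN = ['10a1', '10a2', '10a3']
dates = pd.to_datetime(['2022-07-01', '2022-07-03', '2022-07-05'])
xd = [0, 1, 2]
yd = [1.1, 1.2, 1.3]
zd = [-0.8, 0., 0.8]

rows = []
for i in range(3):
    rows.append({'P/N': PN[i], 'Date': dates[i],
                 'x': xd[i], 'y': yd[i], 'z': zd[i]})

df_new = pd.DataFrame(rows, columns=cols)
print(df_new.dtypes)   # object, datetime64[ns], int64, float64, float64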

Best way to read a huge CSV into a dataframe with a column with mixed value types

I'm trying to read a huge CSV file (almost 5GB) into a pandas dataframe.
This CSV only has 3 columns like this:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 STORE_ID 404944 non-null int64
1 SIZE 404944 non-null int64
2 DISTANCE 404944 non-null object
The problem is the column DISTANCE should only have int64 numbers, but somehow it contains some "null" values in the form of \\N. These \\N are causing my code to fail. Unfortunately I have no control over building this CSV, so I have no way of correcting it before hand.
This is a sample of the CSV:
STORE_ID,SIZE,DISTANCE
900072211,1,1000
900072212,1,1000
900072213,1,\\N
900072220,5,4500
I need to have this DISTANCE column with only int64 values.
Since the CSV is huge, I first tried to read it using the following code, assigning dtypes at the start:
df = pd.read_csv("polygons.csv", dtype={"STORE_ID": int, "SIZE": int, "DISTANCE": int})
But with this I got this error:
TypeError: Cannot cast array data from dtype('O') to dtype('int64') according to the rule 'safe'
How would you go about efficiently reading this CSV into a dataframe? Is there a way to assign a dtype to the DISTANCE column while reading?
Use na_values as parameter of pd.read_csv, it should solve your problem:
df = pd.read_csv(..., na_values=r'\\N')
Output:
>>> df
STORE_ID SIZE DISTANCE
0 900072211 1 1000.0
1 900072212 1 1000.0
2 900072213 1 NaN
3 900072220 5 4500.0
>>> df.dtypes
STORE_ID int64
SIZE int64
DISTANCE float64
dtype: object
Update
You can also use converters:
convert_N = lambda x: int(x) if x != r'\\N' else 0
df = pd.read_csv(..., converters={'DISTANCE': convert_N})
Output:
>>> df
STORE_ID SIZE DISTANCE
0 900072211 1 1000
1 900072212 1 1000
2 900072213 1 0
3 900072220 5 4500
>>> df.dtypes
STORE_ID    int64
SIZE        int64
DISTANCE    int64
dtype: object
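If the DISTANCE column really has to stay integer while keeping the missing values (rather than filling them with 0), one option is pandas' nullable integer extension dtype. A sketch, assuming a recent pandas (1.0 or later):

import pandas as pd

# Read the nulls as NaN, then convert to the nullable Int64 dtype,
# which stores integers together with missing values.
df = pd.read_csv('polygons.csv', na_values=r'\\N')
df['DISTANCE'] = df['DISTANCE'].astype('Int64')

print(df.dtypes)   # STORE_ID int64, SIZE int64, DISTANCE Int64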

How to count missing data in each column in python?

I have a large data frame with 85 columns. The missing data has been coded as NaN. My goal is to get the amount of missing data in each column. So I wrote a for loop to create a list to get the amounts. But it does not work.
The followings are my codes:
headers = x.columns.values.tolist()
nans = []
for head in headers:
    nans_col = x[x.head == 'NaN'].shape[0]
    nan.append(nans_col)
When I use the code from the loop body for one specific column, replacing head with that column's name, it works and gives me the amount of missing data in that column.
So I do not know how to fix the for loop. Could somebody help me with this? I highly appreciate your help.
For columns in pandas (python data analysis library) you can use:
In [3]: import numpy as np
In [4]: import pandas as pd
In [5]: df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})
In [6]: df.isnull().sum()
Out[6]:
a 1
b 2
dtype: int64
For a single column or Series you can count the missing values as shown below:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: s = pd.Series([1,2,3, np.nan, np.nan])
In [4]: s.isnull().sum()
Out[4]: 2
This gives you a count (by column name) of the number of values missing (printed as True followed by the count)
missing_data = df.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")
Just use DataFrame.info; the non-null count is probably what you want, and more.
>>> pd.DataFrame({'a':[1,2], 'b':[None, None], 'c':[3, None]}) \
.info(verbose=True, null_counts=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 a 2 non-null int64
1 b 0 non-null object
2 c 1 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 176.0+ bytes
If there are multiple dataframes, below is a function to calculate the number of missing values in each column together with the percentage missing.
Missing Data Analysis
def miss_data(df):
    x = ['column_name', 'missing_data', 'missing_in_percentage']
    missing_data = pd.DataFrame(columns=x)
    columns = df.columns
    for col in columns:
        icolumn_name = col
        imissing_data = df[col].isnull().sum()
        imissing_in_percentage = (df[col].isnull().sum() / df[col].shape[0]) * 100
        missing_data.loc[len(missing_data)] = [icolumn_name, imissing_data, imissing_in_percentage]
    print(missing_data)
# function to show the total number of nulls per column
colum_name = np.array(data.columns.values)

def iter_columns_name(colum_name):
    for k in colum_name:
        print("total nulls {}=".format(k), pd.isnull(data[k]).values.ravel().sum())

# call the function
iter_columns_name(colum_name)

# output
total nulls start_date= 0
total nulls end_date= 0
total nulls created_on= 0
total nulls lat= 9925
.
.
.

Boolean Comparison across multiple dataframes

I have an issue where I want to compare values across multiple dataframes. Here is a snippet example:
import pandas as pd

data0 = [[1,'01-01'],[2,'01-02']]
data1 = [[11,'02-30'],[12,'02-25']]
data2 = [[8,'02-30'],[22,'02-25']]
data3 = [[7,'02-30'],[5,'02-25']]
df0 = pd.DataFrame(data0,columns=['Data',"date"])
df1 = pd.DataFrame(data1,columns=['Data',"date"])
df2 = pd.DataFrame(data2,columns=['Data',"date"])
df3 = pd.DataFrame(data3,columns=['Data',"date"])
result=(df0['Data']| df1['Data'])>(df2['Data'] | df3['Data'])
What I would like to do, as I hope can be seen, is: if a value in row X of df0 or df1 is greater than the corresponding value in df2 or df3, return True, otherwise False. In the code above, 11 in df1 is greater than both 8 and 7 (df2 and df3 respectively), so the result should be True; in the second row, neither 2 nor 12 is greater than 22 (df2), so it should be False. However, result gives me
False,False
instead of
True,False
any thoughts or help?
Problem
For your data:
>>> df0['Data']
0 1
1 2
Name: Data, dtype: int64
>>> df1['Data']
0 11
1 12
Name: Data, dtype: int64
you are doing a bitwise OR with |:
>>> df0['Data']| df1['Data']
0 11
1 14
Name: Data, dtype: int64
>>> df2['Data']| df3['Data']
0 15
1 23
Name: Data, dtype: int64
Do this for the single numbers:
>>> 1 | 11
11
>>> 2 | 12
14
This is not what you want.
Solution
You can use np.maximum to find the element-wise maximum of the two series:
>>> np.maximum(df0['Data'], df1['Data']) > np.maximum(df2['Data'], df3['Data'])
0 True
1 False
Name: Data, dtype: bool
Your existing solution does not work because the | operator performs a bitwise OR operation on the elements.
df0.Data | df1.Data
0 11
1 14
Name: Data, dtype: int64
This results in you comparing values that are different to the values in your dataframe columns. In summary, your approach does not compare values as you'd expect.
You can make this easy by finding -
the max per row of df0 and df1, and
the max per row of df2 and df3
Comparing these two columns to retrieve your result -
i = np.max([df0.Data, df1.Data], axis=0)
j = np.max([df2.Data, df3.Data], axis=0)
i > j
array([ True, False], dtype=bool)
This approach scales easily to any number of dataframes.
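As a sketch of that scalability (reusing the df0 to df3 frames defined in the question), the per-row maxima can be taken over arbitrary lists of dataframes:

import numpy as np

left = [df0, df1]    # any number of frames on each side of the comparison
right = [df2, df3]

i = np.max([d.Data for d in left], axis=0)
j = np.max([d.Data for d in right], axis=0)
print(i > j)   # -> [ True False]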

Pandas get_dummies to output dtype integer/bool instead of float

I would like to know if could ask the get_dummies function in pandas to output the dummies dataframe with a dtype lighter than the default float64.
So, for a sample dataframe with categorical columns:
In []: df = pd.DataFrame([('blue','wood'), ('blue','metal'), ('red','wood')],
                         columns=['C1','C2'])
In []: df
Out[]:
C1 C2
0 blue wood
1 blue metal
2 red wood
after getting the dummies, it looks like:
In []: df = pd.get_dummies(df)
In []: df
Out[]:
C1_blue C1_red C2_metal C2_wood
0 1 0 0 1
1 1 0 1 0
2 0 1 0 1
which is perfectly fine. However, by default the 1's and 0's are float64:
In []: df.dtypes
Out[]:
C1_blue float64
C1_red float64
C2_metal float64
C2_wood float64
dtype: object
I know I can change the dtype afterwards with astype:
In []: df = pd.get_dummies(df).astype(np.int8)
But I don't want to have the dataframe with floats in memory, because I am dealing with a big dataframe (from a csv of about ~5Gb). I would like to have the dummies directly as integers.
There is an open issue w.r.t. this, see here: https://github.com/pydata/pandas/issues/8725
The float issue is now solved: from pandas version 0.19, the pd.get_dummies function returns dummy-encoded columns as small integers.
See: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#get-dummies-now-returns-integer-dtypes
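In newer pandas versions (0.23 and later, if I recall correctly), get_dummies also accepts a dtype argument, so the indicator columns can be requested as a small integer (or bool) type directly instead of converting afterwards. A quick sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame([('blue', 'wood'), ('blue', 'metal'), ('red', 'wood')],
                  columns=['C1', 'C2'])

dummies = pd.get_dummies(df, dtype=np.int8)
print(dummies.dtypes)   # all indicator columns are int8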
