I want to use every 5th row as a reference row (ref_row), then divide that ref_row itself and the following 4 rows by the ref_row.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
len = df.shape[0]
for idx in range(0, len, 5):
    ref_row = df.iloc[idx:idx+1, :]
    for idx_next in range(idx, idx+5):
        df.iloc[idx_next:idx_next+1, :] = df.iloc[idx_next:idx_next+1, :].div(ref_row)
However, I got all NaN except the ref_row.
A B C D
0 1.0 1.0 1.0 1.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
... ... ... ... ...
95 1.0 1.0 1.0 1.0
96 NaN NaN NaN NaN
97 NaN NaN NaN NaN
98 NaN NaN NaN NaN
99 NaN NaN NaN NaN
Any idea what's wrong?
The problem with your code is that df.iloc[idx_next:idx_next+1,:] and df.iloc[idx:idx+1,:] select single rows as DataFrame objects. Dividing two DataFrames aligns them on both row and column labels, and since the row labels differ (e.g. row 1 vs. row 0), every cell becomes NaN. Replace
df.iloc[idx_next:idx_next+1,:]
with
df.iloc[idx_next]
and
df.iloc[idx:idx+1,:]
with
df.iloc[idx]
everywhere, and it will work as expected: both sides are now Series indexed by the column labels, so the division aligns correctly.
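Put together, the fixed loop would look something like this (the astype(float) cast and the .copy() are extra safeguards I'd add, not part of the fix itself):
df = df.astype(float)               # ratios are floats; avoids writing floats into int columns
for idx in range(0, df.shape[0], 5):
    ref_row = df.iloc[idx].copy()   # .copy() so ref_row doesn't change when row idx is overwritten
    for idx_next in range(idx, idx + 5):
        # Series / Series: aligns on the column labels, as intended
        df.iloc[idx_next] = df.iloc[idx_next].div(ref_row)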
You can also take every fifth row, repeat each of those rows 5 times with np.repeat along axis=0, and divide the DataFrame element-wise by the resulting array:
out = df.div(np.repeat(df[::5].to_numpy(), 5, axis=0))
Output:
A B C D
0 1.000000 1.000000 1.000000 1.000000
1 0.726190 0.359375 0.967742 1.644068
2 0.130952 0.046875 0.161290 0.406780
3 0.488095 0.312500 0.919355 0.305085
4 0.857143 0.203125 0.967742 0.525424
.. ... ... ... ...
95 1.000000 1.000000 1.000000 1.000000
96 0.061224 1.400000 0.518519 0.882353
97 1.510204 1.300000 1.740741 5.588235
98 0.224490 2.100000 1.407407 0.294118
99 1.061224 1.400000 1.388889 3.411765
[100 rows x 4 columns]
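One caveat: this assumes the row count is a multiple of 5 (here 100). For other lengths the repeated array would be longer than the frame, so trim it to fit:
out = df.div(np.repeat(df[::5].to_numpy(), 5, axis=0)[:len(df)])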
Related
I have a dataset that looks like below:
Zn Pb Ag Cu Mo Cr Ni Co Ba
87 7 0.02 42 2 57 38 14 393
70 6 0.02 56 2 27 29 20 404
75 5 0.02 69 2 44 23 17 417
70 6 0.02 54 1 20 19 12 377
I want to create a pandas dataframe out of this dataset. I have written the function below:
def correlation_iterated(raw_data, element_concentration):
    columns = element_concentration.split()
    df1 = pd.DataFrame(columns=columns)
    data1 = []
    selected_columns = raw_data.loc[:, element_concentration.split()].columns
    for i in selected_columns:
        for j in selected_columns:
            # another function that takes 'i' and 'j' and returns 'a'
            zipped1 = zip([i], a)
            data1.append(dict(zipped1))
    df1 = df1.append(data1, True)
    print(df1)
This function is supposed to do the calculations for each element and create a 9 by 9 pandas dataframe and store each calculation in each cell. But I get the following:
Zn Pb Ag Cu Mo Cr Ni Co Ba
0 1.000000 NaN NaN NaN NaN NaN NaN NaN NaN
1 0.460611 NaN NaN NaN NaN NaN NaN NaN NaN
2 0.127904 NaN NaN NaN NaN NaN NaN NaN NaN
3 0.276086 NaN NaN NaN NaN NaN NaN NaN NaN
4 -0.164873 NaN NaN NaN NaN NaN NaN NaN NaN
.. ... .. .. .. .. .. .. .. ...
76 NaN NaN NaN NaN NaN NaN NaN NaN 0.113172
77 NaN NaN NaN NaN NaN NaN NaN NaN 0.027251
78 NaN NaN NaN NaN NaN NaN NaN NaN -0.036409
79 NaN NaN NaN NaN NaN NaN NaN NaN 0.041396
80 NaN NaN NaN NaN NaN NaN NaN NaN 1.000000
[81 rows x 9 columns]
which is basically calculating the results for the first column, storing them all in that first column, and then appending each subsequent column's results as new rows. How can I make the code move on to the next column once it has finished with one? I want something like this:
Zn Pb Ag Cu Mo Cr Ni Co Ba
0 1.000000 0.460611 ...
1 0.460611 1.000000 ...
2 0.127904 0.111559 ...
3 0.276086 0.303925 ...
4 -0.164873 -0.190886 ...
5 0.402046 0.338073 ...
6 0.174774 0.096724 ...
7 0.165760 -0.005301 ...
8 -0.043695 0.174193 ...
[9 rows x 9 columns]
Could you not just do something like this?
def correlation_iterated(raw_data, element_concentration):
    columns = element_concentration.split()
    data = {}
    selected_columns = raw_data.loc[:, columns].columns
    for i in selected_columns:
        temp = []
        for j in selected_columns:
            # another function that takes 'i' and 'j' and returns 'a'
            temp.append(a)
        data[i] = temp
    df = pd.DataFrame(data)
    print(df)
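Since your pairwise function isn't shown, here is the same skeleton with a plain Pearson correlation standing in for a, purely to make it runnable:
import numpy as np
import pandas as pd

raw_data = pd.DataFrame(np.random.rand(50, 3), columns=['Zn', 'Pb', 'Ag'])

def correlation_iterated(raw_data, element_concentration):
    columns = element_concentration.split()
    data = {}
    for i in columns:
        temp = []
        for j in columns:
            a = raw_data[i].corr(raw_data[j])  # stand-in for the unseen pairwise function
            temp.append(a)
        data[i] = temp
    # index=columns labels the rows too, giving a symmetric 3 x 3 frame
    return pd.DataFrame(data, index=columns)

print(correlation_iterated(raw_data, 'Zn Pb Ag'))  # ones on the diagonal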
I am using the pandas .qcut() function to divide a column 'AveragePrice' into 4 bins. I would like to assign each bin to a new variable, so I can do a separate analysis on each quartile, i.e. something like:
bin1 = quartile 1
bin2 = quartile 2
bin3 = quartile 3
bin4 = quartile 4
Here is what I'm working with.
pd.qcut(data['AveragePrice'], q=4)
2        (0.439, 1.1]
3        (0.439, 1.1]
             ...
17596     (1.1, 1.38]
17600     (1.1, 1.38]
Name: AveragePrice, Length: 14127, dtype: category
Categories (4, interval[float64]): [(0.439, 1.1] < (1.1, 1.38] < (1.38, 1.69] < (1.69, 3.25]]
If I understand correctly, you can "pivot" your quartile values into columns.
Toy example:
import pandas as pd
import numpy as np
df = pd.DataFrame({ 'AveragePrice': np.random.randint(0, 100, size=10) })
   AveragePrice
0            20
1            29
2            53
3            30
4             3
5             4
6            78
7            62
8            75
9             1
Create the Quartile column, pivot Quartile into columns, and rename the columns to something more reader-friendly:
df['Quartile'] = pd.qcut(df.AveragePrice, q=4)
pivot = df.reset_index().pivot_table(
    index='index',
    columns='Quartile',
    values='AveragePrice')
pivot.columns = ['Q1', 'Q2', 'Q3', 'Q4']
    Q1    Q2    Q3    Q4
0  NaN  20.0   NaN   NaN
1  NaN  29.0   NaN   NaN
2  NaN   NaN  53.0   NaN
3  NaN   NaN  30.0   NaN
4  3.0   NaN   NaN   NaN
5  4.0   NaN   NaN   NaN
6  NaN   NaN   NaN  78.0
7  NaN   NaN   NaN  62.0
8  NaN   NaN   NaN  75.0
9  1.0   NaN   NaN   NaN
Now you can analyze the bins separately, e.g., describe them:
pivot.describe()
             Q1         Q2         Q3         Q4
count  3.000000   2.000000   2.000000   3.000000
mean   2.666667  24.500000  41.500000  71.666667
std    1.527525   6.363961  16.263456   8.504901
min    1.000000  20.000000  30.000000  62.000000
25%    2.000000  22.250000  35.750000  68.500000
50%    3.000000  24.500000  41.500000  75.000000
75%    3.500000  26.750000  47.250000  76.500000
max    4.000000  29.000000  53.000000  78.000000
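If the per-quartile summary is all you need, a groupby gets you the same numbers (transposed) without the pivot:
df.groupby('Quartile')['AveragePrice'].describe()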
I am adding a new function which converts the DataFrame to a lower triangle if it's an upper triangle, and vice versa. The data I am using always has the first two rows filled in the first column only.
I tried using the solution from this problem Pandas: convert upper triangular dataframe by shifting rows to the left
Data :
0 1 2 3
0 1.000000 NaN NaN NaN
1 0.421655 NaN NaN NaN
2 0.747064 5.000000 NaN NaN
3 0.357616 0.631622 8.000000 NaN
which should be turned into:
Data :
0 1 2 3
0 NaN 8.000000 0.631622 0.357616
1 NaN NaN 5.000000 0.747064
2 NaN NaN NaN 0.421655
3 NaN NaN NaN 1.000000
You just need to reverse the order of both the rows and the columns:
yourdf=df.iloc[::-1,::-1]
yourdf
Out[94]:
3 2 1 0
3 NaN 8.0 0.631622 0.357616
2 NaN NaN 5.000000 0.747064
1 NaN NaN NaN 0.421655
0 NaN NaN NaN 1.000000
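If you want the row and column labels to run 0-3 again, as in your expected output, just reassign them afterwards:
yourdf = df.iloc[::-1, ::-1]
yourdf.columns = df.columns
yourdf.index = df.index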
Since pandas itself depends on NumPy, you already have it installed, so numpy.flip is another, more readable, option:
In [722]: df
Out[722]:
0 1 2 3
0 1.000000 NaN NaN NaN
1 0.421655 NaN NaN NaN
2 0.747064 5.000000 NaN NaN
3 0.357616 0.631622 8.0 NaN
In [724]: import numpy as np
In [725]: df_flip = pd.DataFrame(np.flip(df.values))
In [726]: df_flip
Out[726]:
0 1 2 3
0 NaN 8.0 0.631622 0.357616
1 NaN NaN 5.000000 0.747064
2 NaN NaN NaN 0.421655
3 NaN NaN NaN 1.000000
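Note that np.flip without an axis argument (supported since NumPy 1.15) reverses both axes at once, which is what we want here; pass axis=0 or axis=1 to flip only the rows or only the columns.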
What's the simplest code to drop every column of df in which at least 80% of the values are NaN?
You can use isnull with mean to get each column's NaN fraction, then select columns by boolean indexing with loc (loc because we are removing columns). Note the inverted condition: keeping columns with < .8 means removing all columns with >= 0.8:
df = df.loc[:, df.isnull().mean() < .8]
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.random((100,5)), columns=list('ABCDE'))
df.loc[:80, 'A'] = np.nan
df.loc[:5, 'C'] = np.nan
df.loc[20:, 'D'] = np.nan
print (df.isnull().mean())
A 0.81
B 0.00
C 0.06
D 0.80
E 0.00
dtype: float64
df = df.loc[:, df.isnull().mean() < .8]
print (df.head())
B C E
0 0.278369 NaN 0.004719
1 0.670749 NaN 0.575093
2 0.209202 NaN 0.219697
3 0.811683 NaN 0.274074
4 0.940030 NaN 0.175410
If you want to remove columns by a minimum count of non-NaN values, dropna works nicely with the thresh parameter and axis=1:
np.random.seed(1997)
df = pd.DataFrame(np.random.choice([np.nan,1], p=(0.8,0.2),size=(10,10)))
print (df)
0 1 2 3 4 5 6 7 8 9
0 NaN NaN NaN 1.0 1.0 NaN NaN NaN NaN NaN
1 1.0 NaN 1.0 NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN 1.0 1.0 NaN NaN NaN
3 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN 1.0 NaN NaN NaN 1.0
5 NaN NaN NaN 1.0 1.0 NaN NaN 1.0 NaN 1.0
6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN
9 1.0 NaN NaN NaN 1.0 NaN NaN 1.0 NaN NaN
df1 = df.dropna(thresh=2, axis=1)
print (df1)
0 3 4 5 7 9
0 NaN 1.0 1.0 NaN NaN NaN
1 1.0 NaN NaN NaN NaN NaN
2 NaN NaN NaN 1.0 NaN NaN
3 NaN NaN 1.0 NaN NaN NaN
4 NaN NaN NaN 1.0 NaN 1.0
5 NaN 1.0 1.0 NaN 1.0 1.0
6 NaN NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN 1.0 NaN
9 1.0 NaN 1.0 NaN 1.0 NaN
EDIT: Stated in terms of counts rather than fractions:
The total number of NaN entries in a column must be less than 80% of the total entries:
df = df.loc[:, df.isnull().sum() < 0.8*df.shape[0]]
df.dropna(thresh=int((100 - percent_NA_cols_required) * (len(df.columns) / 100)), inplace=True)
Basically, dropna's thresh is the number (int) of non-NaN values a row must contain in order to be kept; note that without axis=1 this variant removes rows, not columns.
You can use pandas dropna. For example:
df.dropna(axis=1, thresh=int(0.2 * df.shape[0]), inplace=True)
Notice that we used 0.2, which is 1 - 0.8, since thresh refers to the required number of non-NaN values.
As suggested in the comments, if you use sum() on a boolean test, you can get the number of occurrences.
Code:
def get_nan_cols(df, nan_percent=0.8):
    threshold = len(df.index) * nan_percent
    return [c for c in df.columns if df[c].isnull().sum() >= threshold]
Used as:
df = df.drop(columns=get_nan_cols(df, 0.8))
(del df[...] only accepts a single column label, so drop is used here to remove the whole list at once.)
I have a python pandas DataFrame that looks like this:
A B C ... ZZ
2008-01-01 00 NaN NaN NaN ... 1
2008-01-02 00 NaN NaN NaN ... NaN
2008-01-03 00 NaN NaN 1 ... NaN
... ... ... ... ... ...
2012-12-31 00 NaN 1 NaN ... NaN
and I can't figure out how to get a subset of the DataFrame where there is one or more '1' in it, so that the final df should be something like this:
B C ... ZZ
2008-01-01 00 NaN NaN ... 1
2008-01-03 00 NaN 1 ... NaN
... ... ... ... ...
2012-12-31 00 1 NaN ... NaN
That is, removing all rows and columns that do not have a 1 in them.
I tried this, which seems to remove the rows with no 1:
df_filtered = df[df.sum(1)>0]
And then tried to remove the columns with:
df_filtered = df_filtered[df.sum(0)>0]
but get this error after the second line:
IndexingError('Unalignable boolean Series key provided')
Do it with loc:
In [90]: df
Out[90]:
0 1 2 3 4 5
0 1 NaN NaN 1 1 NaN
1 NaN NaN NaN NaN NaN NaN
2 1 1 NaN NaN 1 NaN
3 1 NaN 1 1 NaN NaN
4 NaN NaN NaN NaN NaN NaN
In [91]: df.loc[df.sum(1) > 0, df.sum(0) > 0]
Out[91]:
0 1 2 3 4
0 1 NaN NaN 1 1
2 1 1 NaN NaN 1
3 1 NaN 1 1 NaN
Here's why you get that error:
Let's say I have the following frame, df (similar to yours):
In [112]: df
Out[112]:
a b c d e
0 0 1 1 NaN 1
1 NaN NaN NaN NaN NaN
2 0 0 0 NaN 0
3 0 0 1 NaN 1
4 1 1 1 NaN 1
5 0 0 0 NaN 0
6 1 0 1 NaN 0
When I sum down each column (df.sum() defaults to axis=0) and threshold at 0, I get:
In [113]: col_sum = df.sum()
In [114]: col_sum > 0
Out[114]:
a True
b True
c True
d False
e True
dtype: bool
Since the index of col_sum is the columns of df, it doesn't make sense to use the values of col_sum > 0 to boolean-index into the rows of df: their indices are not aligned and cannot be aligned, which is exactly what the IndexingError is telling you.
Alternatively, to remove the all-NaN rows and columns you can use .any():
In [1680]: df
Out[1680]:
0 1 2 3 4 5
0 1.0 NaN NaN 1.0 1.0 NaN
1 NaN NaN NaN NaN NaN NaN
2 1.0 1.0 NaN NaN 1.0 NaN
3 1.0 NaN 1.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN NaN
In [1681]: df.loc[df.any(axis=1), df.any(axis=0)]
Out[1681]:
0 1 2 3 4
0 1.0 NaN NaN 1.0 1.0
2 1.0 1.0 NaN NaN 1.0
3 1.0 NaN 1.0 1.0 NaN
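This works because .any() skips NaN values (skipna=True by default), so an all-NaN row or column yields False and gets dropped. One caveat: a row or column containing only zeros would be dropped too, since 0 is also falsy.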