Clean up this int64 variable in python - python

This is the raw distribution of the variable FREQUENCY:
NaN 22131161
1.0 4182626
7.0 218343
3.0 145863
1 59432
0.0 29906
2.0 28129
4.0 15237
5.0 4553
8.0 3617
3 2754
7 2635
9.0 633
2 584
4 276
0 112
8 51
5 42
6.0 19
A 9
I 7
9 6
Q 3
Y 2
X 2
Z 1
C 1
N 1
G 1
B 1
Name: FREQUENCY, dtype: int64
Group 1.0 should be the same as group 1. I wrote df['x'] = df['x'].replace({'1.0': '1'}), but it does not change anything. 9.0 vs 9 and 3.0 vs 3 show the same symptom.
How could FREQUENCY be rendered as int64 when letters are present?
Desired outcome 1: group all the letter groups plus NaN into one group, and consolidate the remaining numeric value groups (1.0 and 1 = 1, for example). In SAS I would just run y = 1*X and assign a value of 10 to represent the character groups plus NaN. How can I do this in Python, ideally elegantly?
Desired outcome 2: extract a binary variable z = 1 if x is NaN, otherwise z = 0.

The first issue (group 1.0 not collapsing into group 1 via replace) was fixed once I added dtype={'FREQUENCY': 'object'} while reading the CSV file. Group 1.0 collapsed with group 1, and after that replace works just fine.
All the other issues are pretty much resolved, except that the column's dtype is still reported as int64 even though character values are present. My guess is that Python adopts a majority rule to vote on the data type, and numeric values do indeed dominate the count.
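If the listing above comes from value_counts, note that the dtype: int64 at the bottom describes the counts themselves, not the FREQUENCY values. For the two desired outcomes, a minimal sketch (assuming df is the DataFrame containing FREQUENCY, read with dtype='object' so strings, numbers and NaN are all present; FREQUENCY_clean and z are just illustrative column names) could be:

import pandas as pd

# Outcome 1: coerce anything numeric-looking ('1', '1.0', 7.0, ...) to a number;
# letters become NaN under errors='coerce', then letters + original NaN get code 10
num = pd.to_numeric(df['FREQUENCY'], errors='coerce')
df['FREQUENCY_clean'] = num.fillna(10).astype(int)

# Outcome 2: binary indicator, 1 where the original value is missing, else 0
df['z'] = df['FREQUENCY'].isna().astype(int)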

Related

Pandas - conditional row average

I have a dataframe:
x = pd.DataFrame({'1':[1,2,3,2,5,6,7,8,9], '2':[2,5,6,8,10,np.nan,6,np.nan,np.nan],
'3':[10,10,10,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]})
I am trying to generate an average of a row, but only over values greater than 5. For instance, if a row had values of 3, 6, and 10, the average would be 8 ((6+10)/2); the 3 would be ignored as it is below 5.
The equivalent in Excel would be =AVERAGEIF(B2:DX2,">=5").
You can keep only the values greater than 5 (masking out the rest) and then take the row mean:
x.where(x>5).mean(1)
Or:
x.mask(x<=5).mean(1)
You can create a small custom function which, within each row, filters out values smaller than or equal to a certain value, and apply it to each row of your dataframe:
def average_if(s, value=5):
    s = s.loc[s > value]
    return s.mean()

x.apply(average_if, axis=1)
0 10.0
1 10.0
2 8.0
3 8.0
4 10.0
5 6.0
6 6.5
7 8.0
8 9.0
dtype: float64
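Note that the Excel formula uses >= 5 while the answers above use a strict > 5; if you want to match AVERAGEIF(B2:DX2,">=5") exactly, the same idea works with the condition adjusted. A small self-contained sketch of that variant:

import numpy as np
import pandas as pd

x = pd.DataFrame({'1': [1, 2, 3, 2, 5, 6, 7, 8, 9],
                  '2': [2, 5, 6, 8, 10, np.nan, 6, np.nan, np.nan],
                  '3': [10, 10, 10, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]})

# keep values >= 5, mask the rest, then average across each row (NaN are skipped)
row_avg = x.where(x >= 5).mean(axis=1)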

Smoothing Categorical Output

I have a list of outputs obtained from a cow behavior detection model. Even when a cow in a video is lying, it is often identified as standing, and vice versa. In each video frame, the model gives a classification result and we append it to a list. Let's assume that after 20 frames we have a series of outputs as follows:
behavious_cow_1 = ["stand","stand","stand","stand","lying", "stand","stand", "eating", "stand","stand","stand","stand","lying","stand","stand","stand","stand","stand","stand","lying"]
Out of 20 classification results, we have 4 misclassifications: 3 "lying" and 1 "eating". However, the whole time the cow was sitting in one place. If the list contained only numerical values like 1, 2, 3, ..., I would have opted for a moving average to correct the misclassifications. Is there any SciPy, Pandas, or NumPy function that can smooth categorical output? I am thinking about taking the previous 3 and the next 3 values to determine the current category.
I used the following solution -
import scipy.stats
window_length = 7
behave = ["stand","stand","stand","stand","lying","lying", "eating"]
most_freq_val = lambda x: scipy.stats.mode(x)[0][0]
smoothed = [most_freq_val(behave[i:i+window_length]) for i in range(0,len(behave)-window_length+1)]
I tried the solution posted by Hugolmn but it broke at one point. In the rolling mode, the window width is provided by the user (7 here). Within a given window, if more than one value occurs the same number of times, the code does not work; it is as if you tried to find the statistical mode (most common item) of a list but got more than one item tied for the highest frequency.
I am myself very surprised that a function such as mode() does not work with a rolling window in pandas. However, I still found a decent solution to your problem.
First, create a pandas Series with categorical datatype:
df = pd.Series(sample, dtype='category')
Now you can see that df.cat.categories returns the list of categories in your data, and df.cat.codes the codes associated with them. We can use the latter to apply a rolling mode with a width of 7 (the 3 previous values, the value itself, and the next 3):
df.cat.codes
0 3
1 3
2 3
3 3
4 1
5 3
6 3
7 0
8 3
9 3
10 3
11 3
12 2
13 3
14 3
15 3
16 3
17 3
18 1
dtype: int8
df.cat.codes.rolling(7, center=True, min_periods=0).apply(lambda x: x.mode())
0 3.0
1 3.0
2 3.0
3 3.0
4 3.0
5 3.0
6 3.0
7 3.0
8 3.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 3.0
16 3.0
17 3.0
18 3.0
dtype: float64
Finally, you can map the codes to get the strings back:
(df.cat.codes
.rolling(7, center=True, min_periods=0)
.apply(lambda x: x.mode())
.map(dict(enumerate(df.cat.categories)))
)
0 stand
1 stand
2 stand
3 stand
4 stand
5 stand
6 stand
7 stand
8 stand
9 stand
10 stand
11 stand
12 stand
13 stand
14 stand
15 stand
16 stand
17 stand
18 stand
dtype: object
And there you go! You recovered your strings after applying a rolling mode on their codes!
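If ties within a window are a concern (the issue raised above), one option, assuming that breaking ties by simply taking the first of the tied modes is acceptable, is to pick a single value explicitly inside the rolling apply. A sketch:

import pandas as pd

behave = ["stand", "stand", "stand", "stand", "lying", "lying", "eating"]
s = pd.Series(behave, dtype='category')

# mode() may return several tied values; .iloc[0] picks one deterministically
smoothed = (s.cat.codes
             .rolling(7, center=True, min_periods=0)
             .apply(lambda x: x.mode().iloc[0], raw=False)
             .map(dict(enumerate(s.cat.categories))))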

Trying to truncate decimal values in all the cells of dataframe, but not working

The DataFrame consists of a table whose format is shown in the attached image. I apologize for not being able to type the format here; when I tried, it kept getting messed up because of the long decimal values, so I attached a snapshot instead.
Country names are the index of the DataFrame, and the cell values are the corresponding GDP values. The intent is to calculate, for each country, the average across its row. When np.average was applied:
# name of DataFrame - GDP
def function_average():
    GDP['Average'] = np.average(GDP.iloc[:, 0:])
    return GDP

function_average()
The new column got created, but all of its values are NaN. I assumed this was probably due to the inappropriately formatted cell values, so I tried truncating them using the following code:
GDP = np.round(GDP, decimals=2)
And yet there was no change in the values, even though the code ran successfully with no error.
Please advise how to proceed in this case: should I try to change the spreadsheet itself, or attempt to format the cell values in the DataFrame?
I apologize for not being able to provide any other required information at this point; please let me know if any other detail is needed.
The problem is that you need axis=1 to compute the mean per row, and you need to change the function to numpy.nanmean or DataFrame.mean:
Sample:
np.random.seed(100)
GDP = pd.DataFrame(np.random.randint(10, size=(5,5)), columns=list('ABCDE'))
GDP.loc[0, 'A'] = np.nan
GDP['Average1'] = np.average(GDP.iloc[:,0:], axis=1)
GDP['Average2'] = np.nanmean(GDP.iloc[:,0:], axis=1)
GDP['Average3'] = GDP.iloc[:,0:].mean(axis=1)
print (GDP)
A B C D E Average1 Average2 Average3
0 NaN 8 3 7 7 NaN 6.25 6.25
1 0.0 4 2 5 2 2.6 2.60 2.60
2 2.0 2 1 0 8 2.6 2.60 2.60
3 4.0 0 9 6 2 4.2 4.20 4.20
4 4.0 1 5 3 4 3.4 3.40 3.40
You get NaN because there is at least one NaN value (np.average without an axis averages the whole frame and does not skip NaN):
print (np.average(GDP.iloc[:,0:]))
nan
GDP['Average'] = np.average(GDP.iloc[:,0:])
print (GDP)
A B C D E Average
0 NaN 8 3 7 7 NaN
1 0.0 4 2 5 2 NaN
2 2.0 2 1 0 8 NaN
3 4.0 0 9 6 2 NaN
4 4.0 1 5 3 4 NaN
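Putting it together with the original function, a minimal corrected sketch (assuming GDP is the DataFrame described in the question) could be:

def function_average():
    # DataFrame.mean skips NaN by default; axis=1 averages across each row
    GDP['Average'] = GDP.iloc[:, 0:].mean(axis=1)
    return GDP

function_average()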

How to count how many points are "better" than other points in pandas dataframe?

I have a dataframe in pandas which looks something like this:
>>> df[1:3]
0 1 2 3 4 5 6 7 8
1 -0.59 -99.0 924.0 20.1 5.0 4.0 57.0 19.0 8.0
2 -1.30 -279.0 297.0 16.1 30.0 4.4 63.0 19.0 10.0
The number of points in the dataframe is ~1000.
Given a set of columns, I want to find out how many times each point is "better" than the others.
Given a set of n columns, a point is better than another point if it is better in at least one of the columns and equal in the others.
A point which is better in one column but worse in the other n-1 is not considered better, even though it is better than the other point in at least one column.
Edit1: Example:
>>> df
0 1 2
1 -0.59 -99.0 924.0
2 -1.30 -279.0 297.0
3 2.00 -100.0 500.0
4 0.0 0.0 0.0
If we consider only column 0, then the result would be:
1 - 1
2 - 0
3 - 3
4 - 2
because point 1 (-0.59) is better only than point 2 with respect to column 0.
Another example, taking columns 0 and 1:
1 - 1 (only for point 2 are both values, i.e. column 0 and column 1, smaller than point 1's)
2 - 0 (since no point has smaller values than this one in any dimension)
3 - 1 (point 2)
4 - 2 (points 1 and 2)
Edit 2:
Perhaps something like a function which, when given a dataframe, a point (its index), and a set of columns, could give the count, i.e. for each subset of columns, how many times that point is better than the other points.
def f(p, df, c):
    """Returns a list L = [(c1, n), (c2, m), ...]
    where c1 is a proper subset of c and n is the number of times
    that this point was better than the other points."""
rank each column separately
by ranking each column, I can see exactly how many other rows in that column the particular row you're in is greater than.
d1 = df.rank().sub(1)
d1
To solve your problem, it logically has to be the case that, for a particular row, the smallest rank among the row's elements is precisely the number of other rows in which every element of this row is greater.
For the first two columns [0, 1], it can be calculated by taking the min of d1.
I use this as a reference to compare the raw first two columns with the ranks:
pd.concat([df.iloc[:, :2], d1.iloc[:, :2]], axis=1, keys=['raw', 'ranked'])
Take the min as stated above.
d1.iloc[:, :2].min(1)
1 1.0
2 0.0
3 1.0
4 2.0
dtype: float64
put the result next to raw data and ranks so we can see it
pd.concat([df.iloc[:, :2], d1.iloc[:, :2], d1.iloc[:, :2].min(1)],
axis=1, keys=['raw', 'ranked', 'results'])
sure enough, that ties out with your expected results.
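For reference, a self-contained version of the above using the example data from Edit 1 (a sketch; the 1-4 index mirrors the question):

import pandas as pd

df = pd.DataFrame({0: [-0.59, -1.30, 2.00, 0.0],
                   1: [-99.0, -279.0, -100.0, 0.0],
                   2: [924.0, 297.0, 500.0, 0.0]},
                  index=[1, 2, 3, 4])

d1 = df.rank().sub(1)            # 0-based rank within each column
counts = d1.iloc[:, :2].min(1)   # columns 0 and 1: how many points each row beats
print(counts)                    # 1 -> 1.0, 2 -> 0.0, 3 -> 1.0, 4 -> 2.0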

Irregular binning of python pandas dataframe

I am getting to grips with python pandas.
The toy problem below illustrates an issue I am having in a related exercise.
I have sorted a data-frame so that it presents a column's values (in this case students' test scores) in ascending order:
df_sorted =
variable test_score
1 52.0
1 53.0
4 54.0
6 64.0
6 64.0
6 64.0
5 71.0
10 73.0
15 75.0
4 77.0
However, I would now like to bin the data-frame by taking the means of the 2 columns (here "variable" and "test_score") over every X entries from the start to the end of the data-frame. This will allow me to create bins that contain equal numbers of entries (very useful for plotting in my associated exercise).
The output, if I bin every 3 rows, would therefore look like:
df_sorted_binned =
variable test_score
2 53.0
6 64.0
10 73.0
4 77.0
Can anyone see how I can do this easily?
Much obliged!
Just groupby a dummy variable that goes 0, 0, 0, 1, 1, 1, etc. This can be obtained with floor division:
>>> d.groupby(np.arange(len(d))//3).mean()
variable test_score
0 2 53
1 6 64
2 10 73
3 4 77
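To make the grouping key concrete, here is what the floor-division trick produces for, say, 10 rows; rows sharing a label fall into the same group, so each bin holds 3 consecutive rows (the last bin takes whatever is left):
>>> np.arange(10) // 3
array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3])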
