I have a dataframe in the following format
i col1 col2 col3
0 00:00:00,1 10 1.7
1 00:00:00,2 10 1.5
2 00:00:00,3 50 4.6
3 00:00:00,4 30 3.4
4 00:00:00,5 20 5.6
5 00:00:00,6 50 1.8
6 00:00:00,9 20 1.9
...
That I'm trying to sort like this
i col1 col2 col3
0 00:00:00,1 10 1.7
1 00:00:00,2 10 1.5
4 00:00:00,5 20 5.6
6 00:00:00,9 20 1.9
3 00:00:00,4 30 3.4
2 00:00:00,3 50 4.6
5 00:00:00,6 50 1.8
...
I've tried df = df.sort_values(by=['col1', 'col2']), but that sorts on col1 first.
I understand that it may have something to do with the values being 'strings', but I can't seem to find a workaround for it.
df.sort_values(by=['col2', 'col1'])
Gave the desired result
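A self-contained sketch reconstructing the frame from the question (col2/col3 are typed as numbers here for brevity; the same sort works if they are strings):

```python
import pandas as pd

# Reconstruction of the example frame from the question
df = pd.DataFrame({
    'col1': ['00:00:00,1', '00:00:00,2', '00:00:00,3', '00:00:00,4',
             '00:00:00,5', '00:00:00,6', '00:00:00,9'],
    'col2': [10, 10, 50, 30, 20, 50, 20],
    'col3': [1.7, 1.5, 4.6, 3.4, 5.6, 1.8, 1.9],
})

# col2 is the primary sort key; col1 breaks ties within equal col2 values
out = df.sort_values(by=['col2', 'col1'])
print(out)
```

Note that sort_values keeps the original index labels, which is why the index column in the sorted output is no longer sequential.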
If you need to sort each column independently, use Series.sort_values inside DataFrame.apply:
c = ['col1','col2']
df[c] = df[c].apply(lambda x: x.sort_values().to_numpy())
#alternative
df[c] = df[c].apply(lambda x: x.sort_values().tolist())
print (df)
i col1 col2
0 0 00:00:00,1 10
1 1 00:00:01,5 20
2 2 00:00:10,0 30
3 3 00:01:00,1 40
4 5 01:00:00,0 50
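For completeness, a runnable sketch of the independent per-column sort above, using made-up data; note that the original row pairing between columns is lost:

```python
import pandas as pd

# Made-up data to demonstrate the independent per-column sort
df = pd.DataFrame({'col1': ['00:00:10,0', '00:00:00,1', '00:01:00,1',
                            '01:00:00,0', '00:00:01,5'],
                   'col2': [30, 10, 50, 20, 40]})

c = ['col1', 'col2']
# Each column is sorted on its own; .to_numpy() drops the index so the
# sorted values are assigned back positionally
df[c] = df[c].apply(lambda x: x.sort_values().to_numpy())
print(df)
```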
For each nth row I have to check whether rows n+1 to n+3 fall within the range (nth row value) - 0.5 to (nth row value) + 0.5, and then AND (&) the three results.
A result
0 1.1 1 # 1.2, 1.3 and 1.5 are in the range 0.6 to 1.6 (1 & 1 & 1)
1 1.2 0 # 1.3 and 1.5 are in the range 0.7 to 1.7, but 2 is not (1 & 1 & 0)
2 1.3 0 # 1.5 and 1 are in the range 0.8 to 1.8, but 2 is not (1 & 0 & 1)
3 1.5
4 2.0
5 1.0
6 2.5
7 1.8
8 4.0
9 4.2
10 4.5
11 3.9
df = pd.DataFrame({
    'A': [1.1, 1.2, 1.3, 1.5, 2, 1, 2.5, 1.8, 4, 4.2, 4.5, 3.9]
})
I have done some research on the site, but couldn't find the exact syntax. I tried using the rolling function to take 3 rows, the between function to check the range, and then ANDing the results. Could you please help here?
s = pd.Series([1, 2, 3, 4])
s.rolling(2).between(s-1,s+1)
but I'm getting this error:
AttributeError: 'Rolling' object has no attribute 'between'
You can also achieve the result without using rolling() while keeping .between(), as follows:
df['result'] = (
    df['A'].shift(-1).between(df['A'] - 0.5, df['A'] + 0.5) &
    df['A'].shift(-2).between(df['A'] - 0.5, df['A'] + 0.5) &
    df['A'].shift(-3).between(df['A'] - 0.5, df['A'] + 0.5)
).astype(int)
Result:
print(df)
A result
0 1.1 1
1 1.2 0
2 1.3 0
3 1.5 0
4 2.0 0
5 1.0 0
6 2.5 0
7 1.8 0
8 4.0 1
9 4.2 0
10 4.5 0
11 3.9 0
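Rolling itself has no between method, but if you do want a rolling-based version, a backward-looking window plus a shift can express the same check. A sketch under that approach (not taken from the answers above):

```python
import pandas as pd

df = pd.DataFrame({'A': [1.1, 1.2, 1.3, 1.5, 2, 1, 2.5, 1.8, 4, 4.2, 4.5, 3.9]})

# A backward window of 4 ending at row n+3 contains rows n..n+3, with
# w.iloc[0] being row n; shift(-3) realigns the result back to row n.
# Incomplete windows come out NaN and are filled with 0.
df['result'] = (
    df['A'].rolling(4)
           .apply(lambda w: int(abs(w - w.iloc[0]).le(0.5).all()), raw=False)
           .shift(-3)
           .fillna(0)
           .astype(int)
)
print(df)
```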
Rolling windows tend to be quite slow in pandas. One quick alternative is to generate a dataframe holding each row's window (the row itself plus the next three values):
df_temp = pd.concat([df['A'].shift(-i) for i in range(4)], axis=1)
df_temp
      A    A    A    A
0   1.1  1.2  1.3  1.5
1   1.2  1.3  1.5  2.0
2   1.3  1.5  2.0  1.0
3   1.5  2.0  1.0  2.5
4   2.0  1.0  2.5  1.8
5   1.0  2.5  1.8  4.0
6   2.5  1.8  4.0  4.2
7   1.8  4.0  4.2  4.5
8   4.0  4.2  4.5  3.9
9   4.2  4.5  3.9  NaN
10  4.5  3.9  NaN  NaN
11  3.9  NaN  NaN  NaN
Then you can check per row whether every value in the window lies within 0.5 of the first value (comparisons against NaN are False, so incomplete windows give 0):
df['result'] = df_temp.apply(lambda x: (x - x.iloc[0]).between(-0.5, 0.5), axis=1).all(axis=1).astype(int)
      A  result
0   1.1       1
1   1.2       0
2   1.3       0
3   1.5       0
4   2.0       0
5   1.0       0
6   2.5       0
7   1.8       0
8   4.0       1
9   4.2       0
10  4.5       0
11  3.9       0
I tried converting the values in some columns of a DataFrame of floats to integers by using round then astype. However, the values still contained decimal places. What is wrong with my code?
import numpy as np
import pandas as pd

nums = np.arange(1, 11)
arr = nums.reshape((2, 5))
df = pd.DataFrame(arr)
df += 0.1
df
Original df:
0 1 2 3 4
0 1.1 2.1 3.1 4.1 5.1
1 6.1 7.1 8.1 9.1 10.1
Rounding then to int code:
df.iloc[:, 2:] = df.iloc[:, 2:].round()
df.iloc[:, 2:] = df.iloc[:, 2:].astype(int)
df
Output:
0 1 2 3 4
0 1.1 2.1 3.0 4.0 5.0
1 6.1 7.1 8.0 9.0 10.0
Expected output:
0 1 2 3 4
0 1.1 2.1 3 4 5
1 6.1 7.1 8 9 10
The problem is that assignment through .iloc writes the values into the existing columns in place, so the column dtype does not change.
l = df.columns[2:]
df[l] = df[l].astype(int)
df
0 1 2 3 4
0 1.1 2.1 3 4 5
1 6.1 7.1 8 9 10
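A minimal runnable version of the label-based fix (reconstructing the frame from the question; cols is my name for the column selection) confirms that the dtype actually changes:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 11).reshape(2, 5)) + 0.1

# Selecting by column labels and assigning replaces those columns
# wholesale, so the new integer dtype sticks
cols = df.columns[2:]
df[cols] = df[cols].round().astype(int)
print(df.dtypes)
```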
One way to solve that is to use .convert_dtypes()
df.iloc[:, 2:] = df.iloc[:, 2:].round()
df = df.convert_dtypes()
print(df)
output:
0 1 2 3 4
0 1.1 2.1 3 4 5
1 6.1 7.1 8 9 10
It coerces all columns of your dataframe to the best fitting (nullable) dtypes.
I had the same issue and was able to resolve it by converting the numbers to str and applying a lambda to cut off the trailing '.0'.
df['converted'] = df['floats'].astype(str)

def cut_zeros(row):
    if row[-2:] == '.0':
        row = row[:-2]
    return row

df['converted'] = df.apply(lambda row: cut_zeros(row['converted']), axis=1)
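A vectorized sketch of the same idea, stripping the trailing '.0' with a regex instead of a row-wise lambda (floats is a made-up column name):

```python
import pandas as pd

df = pd.DataFrame({'floats': [1.0, 2.5, 10.0]})

# Drop a trailing '.0' from the string form; values with real decimals
# are left untouched
df['converted'] = df['floats'].astype(str).str.replace(r'\.0$', '', regex=True)
print(df)
```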
I have a dataframe as below:
idx col1 col2 col3
0 1.1 A 100
1 1.1 A 100
2 1.1 A 100
3 2.6 B 100
4 2.5 B 100
5 3.4 B 100
6 2.6 B 100
I want to update col3 with percentage values based on the size of each (col1, col2) group; e.g., the rows with (1.1, A) occur three times, so their col3 value should be 33.33.
Desired output:
idx col1 col2 col3
0 1.1 A 33.33
1 1.1 A 33.33
2 1.1 A 33.33
3 2.6 B 50
4 2.5 B 100
5 3.4 B 100
6 2.6 B 50
I think you need groupby with transform size:
df['col3'] = 100 / df.groupby(['col1', 'col2'])['col3'].transform('size')
print(df)
col1 col2 col3
idx
0 1.1 A 33.333333
1 1.1 A 33.333333
2 1.1 A 33.333333
3 2.6 B 50.000000
4 2.5 B 100.000000
5 3.4 B 100.000000
6 2.6 B 50.000000
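As a runnable sketch, reconstructing the frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1.1, 1.1, 1.1, 2.6, 2.5, 3.4, 2.6],
                   'col2': ['A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'col3': [100] * 7})

# transform('size') broadcasts each (col1, col2) group's row count back
# onto every row of that group
df['col3'] = 100 / df.groupby(['col1', 'col2'])['col3'].transform('size')
print(df)
```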
I need to count the instances of two columns in a dataframe by values.
I get the same by using group & size, though I want to spit out:
1. the flat value of each column combination on every row, and
2. a name for the "last count" column (see "What I want" below).
df = pd.DataFrame([[1.1, 1.1, 1.1, 2.6, 2.5, 3.4, 2.6, 2.6, 3.4, 3.4, 2.6, 1.1, 1.1, 3.3],
                   list('AAABBBBABCBDDD'),
                   [1.1, 1.7, 2.5, 2.6, 3.3, 3.8, 4.0, 4.2, 4.3, 4.5, 4.6, 4.7, 4.7, 4.8],
                   ['x/y/z','x/y','x/y/z/n','x/u','x','x/u/v','x/y/z','x','x/u/v/b','-','x/y','x/y/z','x','x/u/v/w'],
                   ['1','3','3','2','4','2','5','3','6','3','5','1','1','1']]).T
df.columns = ['col1','col2','col3','col4','col5']
df.groupby(['col5', 'col2']).size()
# this gives
col5 col2 (note that the counts column is unnamed)
1 A 1
D 3
2 B 2
3 A 3
C 1
4 B 1
5 B 2
6 B 1
dtype: int64
What I want -:
col5 col2 count_instances_of_this_combination
1 A 1
1 D 3
2 B 2
3 A 3
3 C 1
4 B 1
5 B 2
6 B 1
That is, I explicitly want the first columns to print out the complete combination of col5 and col2 on every row.
Related question :
Pandas DataFrame Groupby two columns and get counts
col1 col2 col3 col4 col5
0 1.1 A 1.1 x/y/z 1
1 1.1 A 1.7 x/y 3
2 1.1 A 2.5 x/y/z/n 3
3 2.6 B 2.6 x/u 2
4 2.5 B 3.3 x 4
5 3.4 B 3.8 x/u/v 2
6 2.6 B 4 x/y/z 5
7 2.6 A 4.2 x 3
8 3.4 B 4.3 x/u/v/b 6
9 3.4 C 4.5 - 3
10 2.6 B 4.6 x/y 5
11 1.1 D 4.7 x/y/z 1
12 1.1 D 4.7 x 1
13 3.3 D 4.8 x/u/v/w 1
This means the combination <1, A> occurred once, <2, B> occurred twice, <1, D> occurred thrice, and so on.
Here's how it worked:
Further to the answer below, after setting the multi_sparse display option to False,
I did this to get the named column:
pd.options.display.multi_sparse = False
# rest same as above
s = pd.DataFrame({'count_instances_of_this_combination': df.groupby(['col5', 'col2']).size()}).reset_index()
This gives me a well-formed dataframe with the "3rd" column as a named column.
Set the option:
pd.options.display.multi_sparse = False
Then:
import pandas as pd
pd.options.display.multi_sparse = False
df = pd.DataFrame(
[[1.1, 1.1, 1.1, 2.6, 2.5, 3.4,2.6,2.6,3.4,3.4,2.6,1.1,1.1,3.3],
list('AAABBBBABCBDDD'),
[1.1, 1.7, 2.5, 2.6, 3.3, 3.8,4.0,4.2,4.3,4.5,4.6,4.7,4.7,4.8],
['x/y/z','x/y','x/y/z/n','x/u','x','x/u/v','x/y/z','x','x/u/v/b','-','x/y',
'x/y/z','x','x/u/v/w'],
['1','3','3','2','4','2','5','3','6','3','5','1','1','1']]).T
df.columns = ['col1','col2','col3','col4','col5']
print(df.groupby(['col5', 'col2']).size())
yields
col5 col2
1 A 1
1 D 3
2 B 2
3 A 3
3 C 1
4 B 1
5 B 2
6 B 1
dtype: int64
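Alternatively, a sketch that names the count column directly via reset_index(name=...), which flattens the MultiIndex into plain columns in one step (only the two grouping columns are reconstructed here):

```python
import pandas as pd

df = pd.DataFrame({'col2': list('AAABBBBABCBDDD'),
                   'col5': ['1', '3', '3', '2', '4', '2', '5',
                            '3', '6', '3', '5', '1', '1', '1']})

# Series.reset_index(name=...) turns the sized groups into a frame with
# the count column named explicitly
out = (df.groupby(['col5', 'col2']).size()
         .reset_index(name='count_instances_of_this_combination'))
print(out)
```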