I am an R user and I found myself struggling a bit moving to Python, especially with the indexing capabilities of pandas.
household_id is my second column. I sorted my dataframe on this column and then ran the following instructions, which returned different results (I would expect them to be the same). Are these expressions equivalent? If so, why do I see different results?
In [63]: ground_truth.columns
Out[63]: Index([Timestamp, household_id, ... (continues)
In [59]: ground_truth.ix[1107177,'household_id']
Out[59]: 2
In [60]: ground_truth.ix[1107177,1]
Out[60]: 2.0
In [61]: ground_truth.iloc[1107177,1]
Out[61]: 4.0
In [62]: ground_truth['household_id'][1107177]
Out[62]: 2
PS: I can't post the data, unfortunately (it's too big).
NOTE: When you sort by a column, you rearrange the index, and assuming it wasn't sorted that way to begin with, you'll end up with integer labels that don't equal their linear positions in the array.
First, ix tries integers as labels first, then as positions, so it is immediate that In [59] and In [62] are the same. Second, if the index is not 0:n - 1, then 1107177 is a label, not an integer position, hence the difference between In [60] and In [61]. As far as the float casting goes, it looks like you might be using an older version of pandas; this doesn't happen on git master.
Here are the docs on ix.
Here's an example with a toy DataFrame:
In [1]:
df = pd.DataFrame(np.random.randn(10, 3), columns=list('abc'))
print(df)
print()
print(df.sort_values('a'))  # df.sort('a') in older pandas
a b c
0 -1.80 -0.28 -1.10
1 -0.58 1.00 -0.48
2 -2.50 1.59 -1.42
3 -1.00 -0.12 -0.93
4 -0.65 1.41 1.20
5 0.51 0.96 1.28
6 -0.28 0.13 1.59
7 1.28 -0.84 0.51
8 0.77 -1.26 -0.50
9 -0.59 -1.34 -1.06
a b c
2 -2.50 1.59 -1.42
0 -1.80 -0.28 -1.10
3 -1.00 -0.12 -0.93
4 -0.65 1.41 1.20
9 -0.59 -1.34 -1.06
1 -0.58 1.00 -0.48
6 -0.28 0.13 1.59
5 0.51 0.96 1.28
8 0.77 -1.26 -0.50
7 1.28 -0.84 0.51
Notice that after sorting, the row labels are still the original integers, and they no longer map to their positions.
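To see the label/position split concretely on the toy frame above (shown with loc and iloc, the modern replacements for ix):
sorted_df = df.sort_values('a')
# Label-based lookup: the row labelled 2 is now the *first* row
print(sorted_df.loc[2, 'a'])   # -2.50
# Position-based lookup: the third row from the top, whose label is 3
print(sorted_df.iloc[2, 0])    # -1.00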
I have a data frame and I am trying to create a new data frame consisting of the ratios between the columns of the original. I tried the logic below:
df_new = pd.concat([df[df.columns.difference([col])].div(df[col], axis=0)
                       .add_suffix('/R') for col in df.columns], axis=1)
Output is:
B/R C/R D/R A/R C/R D/R A/R B/R D/R A/R B/R C/R
0 0.46 1.16 0.78 2.16 2.50 1.69 0.86 0.40 0.68 1.28 0.59 1.48
1 1.05 1.25 1.64 0.95 1.19 1.55 0.80 0.84 1.30 0.61 0.64 0.77
2 1.56 2.78 2.78 0.64 1.79 1.79 0.36 0.56 1.00 0.36 0.56 1.00
3 0.54 2.23 0.35 1.86 4.14 0.64 0.45 0.24 0.16 2.89 1.56 6.44
However, I am facing two issues here. First, I get both A/B and B/A, which are not needed and which inflate the number of columns. Is there a way to get only A/B and eliminate B/A?
Second, naming the columns with the add_suffix method does not convey which column is divided by which. Is there a way to create column names like A/B for column A divided by column B?
Use itertools.combinations and divide the column pairs in a dictionary comprehension:
df = pd.DataFrame({
'A':[5,3,6,9,2,4],
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,8],
})
from itertools import combinations
L = {f'{a}/{b}': df[a].div(df[b]) for a, b in combinations(df.columns, 2)}
df = pd.concat(L, axis=1)
print (df)
A/B A/C A/D B/C B/D C/D
0 1.25 0.714286 5.000000 0.571429 4.000000 7.000000
1 0.60 0.375000 1.000000 0.625000 1.666667 2.666667
2 1.50 0.666667 1.200000 0.444444 0.800000 1.800000
3 1.80 2.250000 1.285714 1.250000 0.714286 0.571429
4 0.40 1.000000 2.000000 2.500000 5.000000 2.000000
5 1.00 1.333333 0.500000 1.333333 0.500000 0.375000
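If you ever do want both orderings (A/B as well as B/A), swapping combinations for permutations is a minimal change:
from itertools import permutations
# permutations yields ordered pairs, so both A/B and B/A are produced
L = {f'{a}/{b}': df[a].div(df[b]) for a, b in permutations(df.columns, 2)}
df_all = pd.concat(L, axis=1)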
I have a dataframe with sorted columns, something like this:
df = pd.DataFrame({q: np.sort(np.random.randn(10).round(2)) for q in ['blue', 'green', 'red']})
blue green red
0 -2.15 -0.76 -2.62
1 -0.88 -0.62 -1.65
2 -0.77 -0.55 -1.51
3 -0.73 -0.17 -1.14
4 -0.06 -0.16 -0.75
5 -0.03 0.05 -0.08
6 0.06 0.38 0.37
7 0.41 0.76 1.04
8 0.56 0.89 1.16
9 0.97 2.94 1.79
What I want to know is how many of the n smallest elements in the whole frame are in each column. This is the only thing I came up with:
# np.partition places the n smallest values (in no particular order) first
is_small = df.isin(np.partition(df.values.flatten(), n)[:n])
with n=10 it looks like this:
blue green red
0 True True True
1 True False True
2 True False True
3 True False True
4 False False True
5 False False False
6 False False False
7 False False False
8 False False False
9 False False False
Then applying np.sum (or just is_small.sum()) gives the count for each column.
I'm dissatisfied with this solution because it in no way uses the sortedness of the original data: all the data gets partitioned, and all the data is then checked for membership in the partition. That seems wasteful, and I can't figure out a better way.
You can compare against the largest of the n smallest values and then use idxmin to leverage the sorted nature:
# Largest of the n smallest numbers in the whole frame
N = (np.partition(df.values.flatten(), n)[:n]).max()
# Each sorted column starts with a run of True values, so the label of the
# first False equals the per-column count
out = (df <= N).idxmin(axis=0)
Sample run -
In [152]: np.random.seed(0)
In [153]: df = pd.DataFrame({q: np.sort(np.random.randn(10).round(2)) \
for q in ['blue', 'green', 'red']})
In [154]: df
Out[154]:
blue green red
0 -0.98 -0.85 -2.55
1 -0.15 -0.21 -1.45
2 -0.10 0.12 -0.74
3 0.40 0.14 -0.19
4 0.41 0.31 0.05
5 0.95 0.33 0.65
6 0.98 0.44 0.86
7 1.76 0.76 1.47
8 1.87 1.45 1.53
9 2.24 1.49 2.27
In [198]: n = 5
In [199]: N = (np.partition(df.values.flatten(), n)[:n]).max()
In [200]: (df<=N).idxmin(axis=0)
Out[200]:
blue 1
green 1
red 3
dtype: int64
Let's say you are looking at the 10 smallest: you can stack the frame and take the value_counts of the column labels of those 10 smallest values:
df.stack().nsmallest(10).index.get_level_values(1).value_counts()
You get
red 5
blue 4
green 1
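If you want something that genuinely exploits the per-column sortedness, here is a sketch (counts_of_n_smallest is a hypothetical helper name) that lazily merges the sorted columns with heapq.merge, consuming only about n values instead of partitioning everything:
import heapq
from itertools import islice
import pandas as pd

def counts_of_n_smallest(df, n):
    # Each column is already ascending, so merging the columns yields the
    # global ascending order lazily; (value, column) tuples track where
    # each value came from
    iterators = (((v, col) for v in df[col]) for col in df.columns)
    merged = heapq.merge(*iterators)
    counts = pd.Series(0, index=df.columns)
    for _, col in islice(merged, n):  # only the n smallest are consumed
        counts[col] += 1
    return counts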
round() works on a single element but not on the whole DataFrame; I tried DataFrame.round() but it didn't work... any ideas? Thanks.
I have the code below:
print("Panda Version: ", pd.__version__)
print("['5am'][0]: ", x3['5am'][0])
print("Round element: ", np.round(x3['5am'][0] * 4) / 4)
print("Round Dataframe: \n", np.round(x3 * 4, decimals=2) / 4)
df = np.round(x3 * 4, decimals=2) / 4
print("Round Dataframe Again: \n", df.round(2))
I got this result:
Panda Version: 0.18.0
['5am'][0]: 0.279914529915
Round element: 0.25
Round Dataframe:
5am 6am 7am 8am 9am 10am 11am
Date
2016-07-11 0.279915 0.279915 2.85256 4.52778 6.23291 9.01496 8.53632
2016-07-12 0.339744 0.369658 2.67308 4.52778 5.00641 7.30983 6.98077
2016-07-13 0.399573 0.459402 2.61325 3.83974 5.48504 6.77137 5.24573
2016-07-14 0.339744 0.549145 2.64316 3.36111 5.66453 5.96368 7.87821
2016-07-15 0.309829 0.459402 2.55342 4.64744 4.46795 6.80128 6.17308
2016-07-16 0.25 0.369658 2.46368 2.67308 4.58761 6.35256 5.63462
2016-07-17 0.279915 0.369658 2.58333 2.91239 4.19872 5.51496 6.65171
Round Dataframe Again:
5am 6am 7am 8am 9am 10am 11am
Date
2016-07-11 0.279915 0.279915 2.85256 4.52778 6.23291 9.01496 8.53632
2016-07-12 0.339744 0.369658 2.67308 4.52778 5.00641 7.30983 6.98077
2016-07-13 0.399573 0.459402 2.61325 3.83974 5.48504 6.77137 5.24573
2016-07-14 0.339744 0.549145 2.64316 3.36111 5.66453 5.96368 7.87821
2016-07-15 0.309829 0.459402 2.55342 4.64744 4.46795 6.80128 6.17308
2016-07-16 0.25 0.369658 2.46368 2.67308 4.58761 6.35256 5.63462
2016-07-17 0.279915 0.369658 2.58333 2.91239 4.19872 5.51496 6.65171
Try casting to float type:
x3.astype(float).round(2)
Or, for a single column, it's as simple as this:
df['col_name'] = df['col_name'].astype(float).round(2)
Explanation of your code:
In [166]: np.round(df * 4, decimals=2)
Out[166]:
a b c d
0 0.11 0.45 1.65 3.38
1 3.97 2.90 1.89 3.42
2 1.46 0.79 3.00 1.44
3 3.48 2.33 0.81 1.02
4 1.03 0.65 1.94 2.92
5 1.88 2.21 0.59 0.39
6 0.08 2.09 4.00 1.02
7 2.86 0.71 3.56 0.57
8 1.23 1.38 3.47 0.03
9 3.09 1.10 1.12 3.31
In [167]: np.round(df * 4, decimals=2) / 4
Out[167]:
a b c d
0 0.0275 0.1125 0.4125 0.8450
1 0.9925 0.7250 0.4725 0.8550
2 0.3650 0.1975 0.7500 0.3600
3 0.8700 0.5825 0.2025 0.2550
4 0.2575 0.1625 0.4850 0.7300
5 0.4700 0.5525 0.1475 0.0975
6 0.0200 0.5225 1.0000 0.2550
7 0.7150 0.1775 0.8900 0.1425
8 0.3075 0.3450 0.8675 0.0075
9 0.7725 0.2750 0.2800 0.8275
In [168]: np.round(np.round(df * 4, decimals=2) / 4, 2)
Out[168]:
a b c d
0 0.03 0.11 0.41 0.84
1 0.99 0.72 0.47 0.86
2 0.36 0.20 0.75 0.36
3 0.87 0.58 0.20 0.26
4 0.26 0.16 0.48 0.73
5 0.47 0.55 0.15 0.10
6 0.02 0.52 1.00 0.26
7 0.72 0.18 0.89 0.14
8 0.31 0.34 0.87 0.01
9 0.77 0.28 0.28 0.83
This is working properly for me (pandas 0.18.1)
In [162]: df = pd.DataFrame(np.random.rand(10,4), columns=list('abcd'))
In [163]: df
Out[163]:
a b c d
0 0.028700 0.112959 0.412192 0.845663
1 0.991907 0.725550 0.472020 0.856240
2 0.365117 0.197468 0.750554 0.360272
3 0.870041 0.582081 0.203692 0.255915
4 0.257433 0.161543 0.483978 0.730548
5 0.470767 0.553341 0.146612 0.096358
6 0.020052 0.522482 0.999089 0.254312
7 0.714934 0.178061 0.889703 0.143701
8 0.308284 0.344552 0.868151 0.007825
9 0.771984 0.274245 0.280431 0.827999
In [164]: df.round(2)
Out[164]:
a b c d
0 0.03 0.11 0.41 0.85
1 0.99 0.73 0.47 0.86
2 0.37 0.20 0.75 0.36
3 0.87 0.58 0.20 0.26
4 0.26 0.16 0.48 0.73
5 0.47 0.55 0.15 0.10
6 0.02 0.52 1.00 0.25
7 0.71 0.18 0.89 0.14
8 0.31 0.34 0.87 0.01
9 0.77 0.27 0.28 0.83
Similar issue: df.round(1) didn't round as expected (e.g. it produced .400000000123), but df.astype('float64').round(1) worked. Notably, the dtype of df is float32; apparently round() doesn't work properly on float32. How is this behavior not a bug?
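A quick sketch of what is going on: 0.4 has no exact float32 representation, so rounding can only land on the nearest representable value, while widening to float64 first lands on the (much closer) float64 neighbour of 0.4:
import numpy as np
print(np.float32(0.4))                   # 0.4 (the repr hides the error)
print(f'{np.float32(0.4):.12f}')         # 0.400000005960 (the stored value)
print(round(float(np.float32(0.4)), 1))  # 0.4 after widening to float64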
As I just found here,
"round does not modify in-place. Rather, it returns the dataframe
rounded."
It might be helpful to think of this as follows:
df.round(2) performs the rounding correctly, but you are neither displaying the result nor saving it anywhere.
Thus, df_final = df.round(2) will likely give you the behavior you expect, instead of just df.round(2), because the result of the rounding operation is now saved in the df_final dataframe.
Additionally, it may be safest to use df_final = df.round(2).copy() instead of simply df_final = df.round(2); I find that some operations return unexpected results if I don't assign a copy of the old dataframe to the new one.
I've tried to reproduce your situation, and it seems to work nicely:
import pandas as pd
import numpy as np
from io import StringIO
s = """Date 5am 6am 7am 8am 9am 10am 11am
2016-07-11 0.279915 0.279915 2.85256 4.52778 6.23291 9.01496 8.53632
2016-07-12 0.339744 0.369658 2.67308 4.52778 5.00641 7.30983 6.98077
2016-07-13 0.399573 0.459402 2.61325 3.83974 5.48504 6.77137 5.24573
2016-07-14 0.339744 0.549145 2.64316 3.36111 5.66453 5.96368 7.87821
2016-07-15 0.309829 0.459402 2.55342 4.64744 4.46795 6.80128 6.17308
2016-07-16 0.25 0.369658 2.46368 2.67308 4.58761 6.35256 5.63462
2016-07-17 0.279915 0.369658 2.58333 2.91239 4.19872 5.51496 6.65171
"""
df = pd.read_table(StringIO(s), delim_whitespace=True)
df.set_index('Date').round(2)
I'm trying to reshape a dataframe, but I'm not able to get the results I need.
The dataframe looks like this:
m r s p O W N
1 4 3 1 2.81 3.70 3.03
1 4 4 1 2.14 2.82 2.31
1 4 5 1 1.47 1.94 1.59
1 4 3 2 0.58 0.78 0.60
1 4 4 2 0.67 0.00 0.00
1 4 5 2 1.03 2.45 1.68
1 4 3 3 1.98 1.34 1.81
1 4 4 3 0.00 0.04 0.15
1 4 5 3 0.01 0.00 0.26
I need to reshape the dataframe so it will look like this:
m r s p O W N p O W N p O W N
1 4 3 1 2.81 3.70 3.03 2 0.58 0.78 0.60 3 1.98 1.34 1.81
1 4 4 1 2.14 2.82 2.31 2 0.67 0.00 0.00 3 0.00 0.04 0.15
1 4 5 1 1.47 1.94 1.59 2 1.03 2.45 1.68 3 0.01 0.00 0.26
I tried to use the pivot_table function
df.pivot_table(index=['m','r','s'], columns=['p'], values=['O','W','N'])
but I'm not able to get quite what I want. Does anyone know how to do this?
As someone who fancies himself pretty handy with pandas, I find the pivot_table and melt functions confusing. I prefer to stick with a well-defined and unique index and use the stack and unstack methods of the dataframe itself.
First, I'll ask: do you really need to repeat the p column like that? I can sort of see its value when presenting data, but IMO pandas isn't really set up to work that way. We could shoehorn it in, but let's see if a simpler solution gets you what you need.
Here's what I would do:
from io import StringIO
import pandas
datatable = StringIO("""\
m r s p O W N
1 4 3 1 2.81 3.70 3.03
1 4 4 1 2.14 2.82 2.31
1 4 5 1 1.47 1.94 1.59
1 4 3 2 0.58 0.78 0.60
1 4 4 2 0.67 0.00 0.00
1 4 5 2 1.03 2.45 1.68
1 4 3 3 1.98 1.34 1.81
1 4 4 3 0.00 0.04 0.15
1 4 5 3 0.01 0.00 0.26""")
df = (
    pandas.read_table(datatable, sep=r'\s+')
          .set_index(['m', 'r', 's', 'p'])
          .unstack(level='p')
)
df.columns = df.columns.swaplevel(0, 1)
df = df.sort_index(axis=1)  # df.sort(axis=1, inplace=True) in older pandas
print(df)
Which prints:
p 1 2 3
O W N O W N O W N
m r s
1 4 3 2.81 3.70 3.03 0.58 0.78 0.60 1.98 1.34 1.81
4 2.14 2.82 2.31 0.67 0.00 0.00 0.00 0.04 0.15
5 1.47 1.94 1.59 1.03 2.45 1.68 0.01 0.00 0.26
So now the columns are a MultiIndex and you can access, for example, all of the values where p = 2 with df[2] or df.xs(2, level='p', axis=1), which gives me:
O W N
m r s
1 4 3 0.58 0.78 0.60
4 0.67 0.00 0.00
5 1.03 2.45 1.68
Similarly, you can get all of the W columns with: df.xs('W', level=1, axis=1)
(we say level=1 because that column level does not have a name, so we use its position instead):
p 1 2 3
m r s
1 4 3 3.70 0.78 1.34
4 2.82 0.00 0.04
5 1.94 2.45 0.00
You can similarly query the rows by using axis=0.
If you really need the p values in a column, just add it there manually and reindex your columns:
for p in df.columns.get_level_values('p').unique():
    df[p, 'p'] = p
cols = pandas.MultiIndex.from_product([[1,2,3], list('pOWN')])
df = df.reindex(columns=cols)
print(df)
1 2 3
p O W N p O W N p O W N
m r s
1 4 3 1 2.81 3.70 3.03 2 0.58 0.78 0.60 3 1.98 1.34 1.81
4 1 2.14 2.82 2.31 2 0.67 0.00 0.00 3 0.00 0.04 0.15
5 1 1.47 1.94 1.59 2 1.03 2.45 1.68 3 0.01 0.00 0.26
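If you then need a flat, single-level header exactly like the one in the question (say, for CSV export), one sketch is to keep only the inner level, since the repeated p columns already carry the outer value:
# Collapse the MultiIndex to the inner level: p O W N p O W N p O W N
df.columns = df.columns.get_level_values(1)
print(df)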
import csv

# Collect the first (lower-cased) field of every row in df_col2.csv and
# write them all as one comma-joined field in ss2.csv
with open('df_col2.csv', 'r') as ann, open('ss2.csv', 'w', newline='') as b:
    a = csv.writer(b)
    sk = ''
    for col in ann:
        an = col.lower().strip('\n').split(',')
        sk += an[0] + ','  # the original "suk" was a typo for "sk"
    sk = sk[:-1]           # drop the single trailing comma (not two chars)
    a.writerow([sk])
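The same thing can be done in pandas (a sketch, assuming df_col2.csv has no header row):
import pandas as pd
# Read only the first column, lower-case it, and write it as a single
# comma-joined field
first_col = pd.read_csv('df_col2.csv', header=None)[0].astype(str).str.lower()
pd.DataFrame([[','.join(first_col)]]).to_csv('ss2.csv', header=False, index=False)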
I have the following data with some missing holes. I've looked over the 'how to handle missing data' docs but can't find anything that applies to this situation. Here is the data:
Species GearUsed AverageFishWeight(lbs) NormalRange(lbs) Caught
0 BlackBullhead Gillnet 0.11 0.8-7.7 0.18
1 BlackCrappie Trapnet 6.22 0.7-3.4 0.30
2 NaN Gillnet 1.00 0.6-3.5 0.30
3 Bluegill Trapnet 11.56 6.1-46.6 0.14
4 NaN Gillnet 1.44 NaN 0.21
5 BrownBullhead Trapnet 0.11 0.4-2.1 1.01
6 NorthernPike Trapnet 0.22 NaN 4.32
7 NaN Gillnet 2.22 3.5-10.5 5.63
8 Pumpkinseed Trapnet 0.89 2.0-8.5 0.23
9 RockBass Trapnet 0.22 0.5-1.8 0.04
10 Walleye Trapnet 0.22 0.3-0.7 0.28
11 NaN Gillnet 1.56 1.3-5.0 2.54
12 WhiteSucker Trapnet 0.33 0.3-1.4 2.76
13 NaN Gillnet 1.78 0.5-2.7 1.32
14 YellowPerch Trapnet 1.33 0.5-3.3 0.14
15 NaN Gillnet 27.67 3.4-43.6 0.14
I need the NaNs in the Species column to be filled with the name above them; for example, row 2 would be BlackCrappie. I would like to iterate through the frame and set the species names manually, but I'm not sure how, and other answers recommend against iterating through a dataframe in the first place.
How do I access each cell of the frame individually? Thanks!
PS: the column names are incorrect; there is not a 27 lb yellow perch. :)
Do you want to fill the missing values in the other columns as well? That seems to be what fillna() is for:
In [83]:
print(df.fillna(method='pad'))
Species GearUsed AverageFishWeight(lbs) NormalRange(lbs) Caught
0 BlackBullhead Gillnet 0.11 0.8-7.7 0.18
1 BlackCrappie Trapnet 6.22 0.7-3.4 0.30
2 BlackCrappie Gillnet 1.00 0.6-3.5 0.30
3 Bluegill Trapnet 11.56 6.1-46.6 0.14
4 Bluegill Gillnet 1.44 6.1-46.6 0.21
5 BrownBullhead Trapnet 0.11 0.4-2.1 1.01
6 NorthernPike Trapnet 0.22 0.4-2.1 4.32
7 NorthernPike Gillnet 2.22 3.5-10.5 5.63
8 Pumpkinseed Trapnet 0.89 2.0-8.5 0.23
9 RockBass Trapnet 0.22 0.5-1.8 0.04
10 Walleye Trapnet 0.22 0.3-0.7 0.28
11 Walleye Gillnet 1.56 1.3-5.0 2.54
12 WhiteSucker Trapnet 0.33 0.3-1.4 2.76
13 WhiteSucker Gillnet 1.78 0.5-2.7 1.32
14 YellowPerch Trapnet 1.33 0.5-3.3 0.14
15 YellowPerch Gillnet 27.67 3.4-43.6 0.14
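If you only want to fill the Species column and leave the genuine NaNs elsewhere (e.g. in NormalRange) untouched, a minimal sketch is:
# Forward-fill just the Species column; ffill() is the modern spelling of
# fillna(method='pad')
df['Species'] = df['Species'].ffill()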