I have a DataFrame and would like to get the mean of the values from one of its columns. If I do:
print(df['col_name'][0:1])
print(df['col_name'][0:1].mean())
I get:
0 2
Name: col_name
2.0
If I do:
print(df['col_name'][0:2])
print(df['col_name'][0:2].mean())
I get:
0 2
1 1
Name: col_name
10.5
If I do:
print(df['col_name'][0:3])
print(df['col_name'][0:3].mean())
I get:
0 2
1 1
2 2
Name: col_name
70.6666666667
It looks like you have a column of str values, not ints:
import pandas as pd
df = pd.DataFrame({'col':['2','1','2']})
for i in range(1, 4):
    print(df['col'][0:i].mean())
yields
2.0
10.5
70.6666666667
(Summing a column of strs concatenates them: '2' + '1' == '21' and 21 / 2 == 10.5; likewise '2' + '1' + '2' == '212' and 212 / 3 == 70.67.)
while if the values are ints:
df = pd.DataFrame({'col':[2,1,2]})
for i in range(1, 4):
    print(df['col'][0:i].mean())
yields
2.0
1.5
1.66666666667
You can convert your column of strs to a column of ints with
df['col'] = df['col'].map(int)
But, of course, the best way to handle this is to make sure the DataFrame is constructed with the right (int) values in the first place.
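For example, a minimal sketch using pandas' built-in conversion on the same toy data:
import pandas as pd
df = pd.DataFrame({'col': ['2', '1', '2']})
df['col'] = df['col'].astype(int)  # equivalent to df['col'].map(int) here
print(df['col'].mean())  # 1.6666666666666667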
Related
For the following csv file
ID,Kernel Time,device__attribute_warp_size,cycles_elapsed,time_duration
,,,cycle,msecond
0,2021-Dec-09 23:04:13,32,175013.666667,0.122208
1,2021-Dec-09 23:04:16,32,2988.833333,0.002592
2,2021-Dec-09 23:04:18,32,2911.666667,0.002624
I want to sum the values of a column, cycles_elapsed, but as you can see the first row is not a number. I wrote the following code, but the result is not what I expect.
import pandas as pd
import csv
df = pd.read_csv('test.csv', thousands=',', usecols=['ID', 'cycles_elapsed'])
print(df['cycles_elapsed'])
c_sum = df['cycles_elapsed'].loc[1:].sum()
print(c_sum)
$ python3 test.py
0 cycle
1 175013.666667
2 2988.833333
3 2911.666667
Name: cycles_elapsed, dtype: object
175013.6666672988.8333332911.666667
How can I fix that?
The problem is the second line of the file (the units row); omit it with the skiprows=[1] parameter, so you get a numeric column with the correct sum:
df = pd.read_csv('test.csv', skiprows=[1], usecols=['ID', 'cycles_elapsed'])
print (df)
ID cycles_elapsed
0 0 175013.666667
1 1 2988.833333
2 2 2911.666667
print (df.dtypes)
ID int64
cycles_elapsed float64
dtype: object
c_sum = df['cycles_elapsed'].sum()
print(c_sum)
180914.166667
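If you would rather not change how the file is read, an alternative sketch is to coerce the column after reading; the units row becomes NaN, which sum() skips by default:
import pandas as pd
df = pd.read_csv('test.csv', usecols=['ID', 'cycles_elapsed'])
# 'cycle' cannot be parsed as a number, so coercion turns it into NaN
c_sum = pd.to_numeric(df['cycles_elapsed'], errors='coerce').sum()
print(c_sum)  # 180914.166667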
I have a pandas DataFrame with a multi-index like this:
import pandas as pd
import numpy as np
arr = [1]*3 + [2]*3
arr2 = list(range(3)) + list(range(3))
mux = pd.MultiIndex.from_arrays([
    arr,
    arr2
], names=['one', 'two'])
df = pd.DataFrame({'a': np.arange(len(mux))}, mux)
df
         a
one two
1   0    0
    1    1
    2    2
2   0    3
    1    4
    2    5
I have a function that takes a slice of a DataFrame and needs to assign a new column to the rows that have been sliced:
def work(df):
    b = df.copy()
    # do some work on the slice and create values for a new column of the slice
    b['b'] = b['a'] * 2
    # assign the new values back to the slice in a new column
    df['b'] = b['b']
#pass in a slice of the df with only records that have the last value for 'two'
work(df.loc[df.index.isin(df.index.get_level_values('two')[-1:], level=1)])
However, calling the function results in the error:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
How can I create a new column 'b' in the original DataFrame and assign its values for only the rows that were passed to the function, leaving the rest of the rows NaN?
The desired output is:
         a    b
one two
1   0    0  nan
    1    1  nan
    2    2    4
2   0    3  nan
    1    4  nan
    2    5   10
NOTE: In the work function I'm actually doing a bunch of complex operations, involving calls to other functions, to generate the values for the new column, so I don't think this will work. Multiplying by 2 in my example is just for illustrative purposes.
You don't actually have an error, just a warning. Try this:
def work(df):
    b = df.copy()
    # do some work on the slice and create values for a new column of the slice
    b['b'] = b['a'] * 2
    # assign the new values back to the slice in a new column
    df['b'] = b['b']
    return df
#pass in a slice of the df with only records that have the last value for 'two'
new_df = work(df.loc[df.index.isin(df.index.get_level_values('two')[-1:], level=1)])
Then:
df.reset_index().merge(new_df, how="left").set_index(["one","two"])
Output:
         a     b
one two
1   0    0   NaN
    1    1   NaN
    2    2   4.0
2   0    3   NaN
    1    4   NaN
    2    5  10.0
I don't think you need a separate function at all. Try this...
df['b'] = df['a'].where(df.index.isin(df.index.get_level_values('two')[-1:], level=1))*2
Series.where() called on df['a'] returns a copy of that column with NaN in every row that does not match the condition, so only the selected rows get a value in 'b'.
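If the real work is too complex to inline (per the NOTE in the question), here is a sketch of the same idea using a boolean mask plus .loc assignment, which also avoids the SettingWithCopyWarning:
# build the mask once, compute on a copy of the slice, assign back via .loc
mask = df.index.isin(df.index.get_level_values('two')[-1:], level=1)
sliced = df.loc[mask].copy()
sliced['b'] = sliced['a'] * 2    # stand-in for the complex work
df.loc[mask, 'b'] = sliced['b']  # rows outside the mask get NaN in 'b'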
I am new to Python. I have a .csv dataset with a column called BasePay.
Most of the values in the column are of type int, but some values are "Not Provided".
I am trying to get the mean value of BasePay as:
sal['BasePay'].mean()
But it gives me the error:
TypeError: can only concatenate str (not "int") to str
I want to omit those string values. How can I do that?
Thanks.
Because there are some non-numeric values, use to_numeric with errors='coerce' to convert them to NaNs; mean then works nicely, since it skips NaNs by default:
out = pd.to_numeric(sal['BasePay'], errors='coerce').mean()
Sample:
sal = pd.DataFrame({'BasePay':[1, 'Not Provided', 2, 3, 'Not Provided']})
print (sal)
BasePay
0 1
1 Not Provided
2 2
3 3
4 Not Provided
print (pd.to_numeric(sal['BasePay'], errors='coerce'))
0 1.0
1 NaN
2 2.0
3 3.0
4 NaN
Name: BasePay, dtype: float64
out = pd.to_numeric(sal['BasePay'], errors='coerce').mean()
print (out)
2.0
This problem arises because, when you import the dataset, pandas fills the empty fields with NaN. So you have two options: either convert the NaNs to 0 (fillna(0)) or remove them (dropna()).
This can also be achieved by using np.nanmean().
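A sketch of both options (and np.nanmean) applied to the coerced column from the sample above:
import numpy as np
import pandas as pd
sal = pd.DataFrame({'BasePay': [1, 'Not Provided', 2, 3, 'Not Provided']})
pay = pd.to_numeric(sal['BasePay'], errors='coerce')
print(pay.fillna(0).mean())  # 1.2 -- NaNs counted as 0
print(pay.dropna().mean())   # 2.0 -- NaNs removed
print(np.nanmean(pay))       # 2.0 -- np.nanmean skips NaNs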
If you store the data from the BasePay column in a list, you can do it as follows (entries that are not ints are skipped):
l = sal['BasePay'].tolist()
x = []
for i in l:
    if type(i) == int:
        x.append(i)
mean = sum(x) / len(x)
print(mean)
I have initialized an empty pandas dataframe that I am now trying to fill but I keep running into the same error. This is the (simplified) code I am using
import pandas as pd
cols = list("ABC")
df = pd.DataFrame(columns=cols)
# set the values for the first two rows
df.loc[0:2,:] = [[1,2],[3,4],[5,6]]
On running the above code I get the following error:
ValueError: cannot copy sequence with size 3 to array axis with dimension 0
I am not sure what's causing this. I tried the same using a single row at a time and it works (df.loc[0,:] = [1,2,3]). I thought this would be the logical expansion when I want to handle more than one row, but clearly I am wrong. What's the correct way to do this? I need to enter values for multiple rows and columns at once. I can do it using a loop, but that's not what I am looking for.
Any help would be great. Thanks
Since you already have the columns from the empty dataframe, use them in the dataframe constructor, i.e.
import numpy as np
import pandas as pd
cols = list("ABC")
df = pd.DataFrame(columns=cols)
df = pd.DataFrame(np.array([[1,2],[3,4],[5,6]]).T, columns=df.columns)
A B C
0 1 3 5
1 2 4 6
Well, if you want to use loc specifically, reindex the dataframe first and then assign, i.e.
arr = np.array([[1,2],[3,4],[5,6]]).T
df = df.reindex(np.arange(arr.shape[0]))
df.loc[0:arr.shape[0],:] = arr
A B C
0 1 3 5
1 2 4 6
How about adding data by index, as below? You can call the function externally as and when you receive data.
def add_to_df(index, data):
    for idx, i in zip(index, zip(*data)):
        df.loc[idx] = i
#Set values for first two rows
data1 = [[1,2],[3,4],[5,6]]
index1 = [0,1]
add_to_df(index1, data1)
print(df)
print("")
#Set values for next three rows
data2 = [[7,8,9],[10,11,12],[13,14,15]]
index2 = [2,3,4]
add_to_df(index2, data2)
print(df)
Result
A B C
0 1.0 3.0 5.0
1 2.0 4.0 6.0
A B C
0 1.0 3.0 5.0
1 2.0 4.0 6.0
2 7.0 10.0 13.0
3 8.0 11.0 14.0
4 9.0 12.0 15.0
From the documentation and some experiments, my guess is that loc only allows you to insert one key at a time. However, you can create multiple keys first with reindex, as @Dark shows.
The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.
http://pandas-docs.github.io/pandas-docs-travis/indexing.html#setting-with-enlargement
Also, with loc[0:2, :] you intend to select the first rows, but there is nothing in the empty df to select: it has zero rows while you are trying to insert three. Thus the message reads:
ValueError: cannot copy sequence with size 3 to array axis with dimension 0
BTW, [[1,2],[3,4],[5,6]] will be 3 rows rather than 2.
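A minimal sketch of enlargement one key at a time, which does succeed:
import pandas as pd
df = pd.DataFrame(columns=list("ABC"))
df.loc[0, :] = [1, 2, 3]  # enlarging with a single new key works
df.loc[1, :] = [4, 5, 6]
print(df)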
Does this get the output you're looking for?
import pandas as pd
df=pd.DataFrame({'A':[1,2],'B':[3,4],'C':[5,6]})
Output:
A B C
0 1 3 5
1 2 4 6
Let's say I have the following example DataFrame
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
I would like to convert column A from string to integer. In the case of '<2', I'd like to simply take off the '<' sign and put 1 (the closest integer less than 2) in the second row. What's the most efficient way to do that? This is just an example; the actual data that I'm working on has hundreds of thousands of rows.
Thanks for your help in advance.
You could use Series.apply:
import pandas as pd
df = pd.DataFrame({'A':['1', '<2', '3']})
df['A'] = df['A'].apply(lambda x: int(x[1:])-1 if x.startswith('<') else int(x))
print(df.dtypes)
# A int64
# dtype: object
yields
print(df)
A
0 1
1 1
2 3
[3 rows x 1 columns]
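Since efficiency matters at hundreds of thousands of rows, here is a vectorized sketch of the same logic (assuming '<' only ever appears as a leading character):
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['1', '<2', '3']})
as_int = df['A'].str.lstrip('<').astype(int)
# subtract 1 only where the original string carried a leading '<'
df['A'] = np.where(df['A'].str.startswith('<'), as_int - 1, as_int)
print(df)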
You can use applymap on the DataFrame and remove the "<" character if it appears in the string:
df.applymap(lambda x: x.replace('<',''))
Here is the output:
A
0 1
1 2
2 3
Here are two other ways of doing this which may be helpful going forward!
from pandas import Series, DataFrame
df = DataFrame({'A':['1', '<2', '3']})
df.A.str.strip('<').astype(int)
Out[1]:
0 1
1 2
2 3
And this way would be helpful if you were trying to remove a character in the middle of your number (e.g. if you had a comma or something).
df = DataFrame({'A':['1', '1,002', '3']})
df.A.str.replace(',', '').astype(int)
Out[11]:
0 1
1 1002
2 3
Name: A, dtype: int64
A regex variant strips every character that is not a digit or a decimal point before converting (note that, per the output below, this maps '<2' to 2 rather than the asker's desired 1):
>>> import re
>>> df.applymap(lambda x: int(re.sub(r'[^0-9.]', '', x)))
A
0 1
1 2
2 3