For the following csv file
ID,Kernel Time,device__attribute_warp_size,cycles_elapsed,time_duration
,,,cycle,msecond
0,2021-Dec-09 23:04:13,32,175013.666667,0.122208
1,2021-Dec-09 23:04:16,32,2988.833333,0.002592
2,2021-Dec-09 23:04:18,32,2911.666667,0.002624
I want to sum the values of a column, cycles_elapsed, but as you can see the first row under the header holds units, not numbers. I wrote the following code, but the result is not what I expect.
import pandas as pd
import csv
df = pd.read_csv('test.csv', thousands=',', usecols=['ID', 'cycles_elapsed'])
print(df['cycles_elapsed'])
c_sum = df['cycles_elapsed'].loc[1:].sum()
print(c_sum)
$ python3 test.py
0 cycle
1 175013.666667
2 2988.833333
3 2911.666667
Name: cycles_elapsed, dtype: object
175013.6666672988.8333332911.666667
How can I fix that?
The problem is the units row (the second line of the file). Skip it with the skiprows=[1] parameter, so the column is read as numeric and the sum is correct:
df = pd.read_csv('test.csv', skiprows=[1], usecols=['ID', 'cycles_elapsed'])
print (df)
ID cycles_elapsed
0 0 175013.666667
1 1 2988.833333
2 2 2911.666667
print (df.dtypes)
ID int64
cycles_elapsed float64
dtype: object
c_sum = df['cycles_elapsed'].sum()
print(c_sum)
180914.166667
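As an alternative (not part of the original answer): if you'd rather not hard-code the position of the units row, you can coerce the column to numeric and let any non-numeric entry become NaN, which sum() skips by default. A sketch, using StringIO to stand in for the file:

```python
import pandas as pd
from io import StringIO

csv_data = """ID,Kernel Time,device__attribute_warp_size,cycles_elapsed,time_duration
,,,cycle,msecond
0,2021-Dec-09 23:04:13,32,175013.666667,0.122208
1,2021-Dec-09 23:04:16,32,2988.833333,0.002592
2,2021-Dec-09 23:04:18,32,2911.666667,0.002624"""

df = pd.read_csv(StringIO(csv_data), usecols=['ID', 'cycles_elapsed'])
# 'cycle' becomes NaN; the remaining strings become floats
df['cycles_elapsed'] = pd.to_numeric(df['cycles_elapsed'], errors='coerce')
c_sum = df['cycles_elapsed'].sum()  # NaN values are excluded by default
print(c_sum)
```

This is more robust when you don't know in advance which rows are malformed.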
I have a list and I would like to convert it to a pandas dataframe. In the second column, I want all zeros, but I got an "object of type 'int' has no len()" error. What I did is this:
df = pd.DataFrame([all_equal_timestamps['A'], 0], columns=['data','label'])
How can I add a second column with all zeros to this dataframe in the easiest manner, and why did the code above give me this error?
Not sure what is in all_equal_timestamps, so I'll presume all_equal_timestamps['A'] is a list of elements. Do you mean to get this result?
import pandas as pd
all_equal_timestamps = {'A': ['1234', 'aaa', 'asdf']}
df = pd.DataFrame(all_equal_timestamps['A'], columns=['data']).assign(label=0)
# df['label'] = 0
print(df)
Output:
data label
0 1234 0
1 aaa 0
2 asdf 0
If you're creating a DataFrame from a list of lists, where each inner list is one row, you'd need something like this (note that ['0'] * len(...) builds a row of '0' strings, one per column):
df = pd.DataFrame([ all_equal_timestamps['A'], ['0'] * len(all_equal_timestamps['A']) ], columns=['data', 'label', 'anothercol'])
print(df)
Output:
data label anothercol
0 1234 aaa asdf
1 0 0 0
You can add a column named "new" with all zeros by using:
df['new'] = 0
You can do it all in one line with assign:
timestamps = [1,0,3,5]
pd.DataFrame({"Data":timestamps}).assign(new=0)
Output:
Data new
0 1 0
1 0 0
2 3 0
3 5 0
Given a dataframe like the one below:
df = pd.DataFrame({'date': ['2013-04-19', '2013-04-19', '2013-04-20', '2013-04-20', '2013-04-19'],
'id': [1,2,2,3,1]})
I need to create another dataframe containing only the id and the number of calls made on different days. An example of output is as follows:
Id | Count
1 | 1
2 | 2
3 | 1
What I'm trying so far:
df2 = df.groupby(['id','date']).size().reset_index().rename(columns={0:'COUNT'})
df2
However, the output is far from what I want. Can anyone help?
You can make use of .nunique() [pandas-doc] to count the unique days per id:
df.groupby('id').date.nunique()
This gives us a series:
>>> df.groupby('id').date.nunique()
id
1 1
2 2
3 1
Name: date, dtype: int64
You can make use of .to_frame() [pandas-doc] to convert it to a dataframe:
>>> df.groupby('id').date.nunique().to_frame('count')
count
id
1 1
2 2
3 1
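As a small variant (my sketch, using the same sample data): Series.reset_index accepts a name for the values column, so you can get the two-column frame in one step without to_frame:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2013-04-19', '2013-04-19', '2013-04-20',
                            '2013-04-20', '2013-04-19'],
                   'id': [1, 2, 2, 3, 1]})
# Count distinct dates per id and name the result column directly
out = df.groupby('id')['date'].nunique().reset_index(name='Count')
print(out)
```

This gives a regular dataframe with columns id and Count, matching the desired output.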
You can use the pd.DataFrame constructor to convert the result into a dataframe, and then rename the columns as you like.
import pandas as pd
df = pd.DataFrame({'date': ['2013-04-19', '2013-04-19', '2013-04-20', '2013-04-20', '2013-04-19'],
'id': [1,2,2,3,1]})
x = pd.DataFrame(df.groupby('id').date.nunique().reset_index())
x.columns = ['Id', 'Count']
print(x)
I imported a dataset that looks like this.
Peak, Trough
0 1857-06-01, 1858-12-01
1 1860-10-01, 1861-06-01
2 1865-04-01, 1867-12-01
3 1869-06-01, 1870-12-01
4 1873-10-01, 1879-03-01
5 1882-03-01, 1885-05-01
6 1887-03-01, 1888-04-01
it is a CSV file. But when I check the .shape, it is
(7, 1)
I thought a CSV file would automatically be separated on its commas, but this one doesn't work.
I want to split this column into two, separated by the comma, and split the column names as well. How can I do that?
Use the sep parameter of read_csv:
df = pd.read_csv(path, sep=', ', engine='python')
Note that a separator longer than one character is treated as a regular expression and requires the python engine.
Save the same data to a text or CSV file, then use read_csv with the parameter skipinitialspace=True and parse_dates to convert the values to datetimes:
df = pd.read_csv('data.txt', skipinitialspace=True, parse_dates=[0,1])
print (df.head())
Peak Trough
0 1857-06-01 1858-12-01
1 1860-10-01 1861-06-01
2 1865-04-01 1867-12-01
3 1869-06-01 1870-12-01
4 1873-10-01 1879-03-01
print (df.dtypes)
Peak datetime64[ns]
Trough datetime64[ns]
dtype: object
If the data are in Excel in one column, it is possible to use Series.str.split on the first column, convert the values to datetimes, and finally set the new column names:
df = pd.read_excel('data.xlsx')
df1 = df.iloc[:, 0].str.split(', ', expand=True).apply(pd.to_datetime)
df1.columns = df.columns[0].split(', ')
print (df1.head())
Peak Trough
0 1857-06-01 1858-12-01
1 1860-10-01 1861-06-01
2 1865-04-01 1867-12-01
3 1869-06-01 1870-12-01
4 1873-10-01 1879-03-01
print (df1.dtypes)
Peak datetime64[ns]
Trough datetime64[ns]
dtype: object
So I got a pandas DataFrame with a single column and a lot of data.
I need to access each of the elements, not to change them (with apply()) but to parse them into another function.
When looping through the DataFrame it always stops after the first one.
If I convert it to a list first, my numbers are all wrapped in brackets (e.g. [12] instead of 12), which breaks my code.
Does anyone see what I am doing wrong?
import pandas as pd
def go_trough_list(df):
    for number in df:
        print(number)
df = pd.read_csv("my_ids.csv")
go_trough_list(df)
df looks like:
1
0 2
1 3
2 4
dtype: object
[Finished in 1.1s]
Edit: I found one mistake. My first value is recognized as a header.
So I changed my code to:
df = pd.read_csv("my_ids.csv",header=None)
But with
for ix in df.index:
    print(df.loc[ix])
I get:
0 1
Name: 0, dtype: int64
0 2
Name: 1, dtype: int64
0 3
Name: 2, dtype: int64
0 4
Name: 3, dtype: int64
edit: Here is my Solution thanks to jezrael and Nick!
First I added header=None because my data has no header.
Then I changed my function to:
def go_through_list(df):
    new_list = df[0].apply(my_function, parameter=par1)
    return new_list
And it works perfectly! Thank you again guys, problem solved.
You can use the index as in other answers, and also iterate through the df and access the row like this:
for index, row in df.iterrows():
    print(row['column'])
However, I suggest solving the problem differently if performance is of any concern. Also, if there is only one column, a pandas Series is the more appropriate structure.
What do you mean by parse it into another function? Perhaps take the value, do something to it, and store the result in another column?
I need to access each of the elements, not to change them (with apply()) but to parse them into another function.
Perhaps this example will help:
import pandas as pd
df = pd.DataFrame([20, 21, 12])
def square(x):
    return x**2
df['new_col'] = df[0].apply(square) # can use a lambda here nicely
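As a side note (not from the original answer): if you do need an explicit loop, itertuples() is generally much faster than iterrows() and yields lightweight namedtuples rather than a Series per row. A minimal sketch with a hypothetical single-column frame:

```python
import pandas as pd

df = pd.DataFrame({'ids': [1, 2, 3, 4]})
# itertuples yields namedtuples; index=False drops the index field
for row in df.itertuples(index=False):
    print(row.ids)  # a scalar value, ready to hand to another function
```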
You can convert the column to a list with tolist():
for x in df['Colname'].tolist():
    print(x)
Sample:
import pandas as pd
df = pd.DataFrame({'a': pd.Series([1, 2, 3]),
                   'b': pd.Series([4, 5, 6])})
print(df)
a b
0 1 4
1 2 5
2 3 6
for x in df['a'].tolist():
    print(x)
1
2
3
If you have only one column, use iloc to select the first column:
for x in df.iloc[:,0].tolist():
print x
Sample:
import pandas as pd
df = pd.DataFrame({1: pd.Series([2, 3, 4])})
print(df)
1
0 2
1 3
2 4
for x in df.iloc[:, 0].tolist():
    print(x)
2
3
4
This can work too, but it is not a recommended approach, because the column label 1 could be a number or a string, which can raise a KeyError:
for x in df[1].tolist():
    print(x)
2
3
4
Say you have one column named 'myColumn', and you have an index on the dataframe (which is automatically created with read_csv). Try using the .loc function:
for ix in df.index:
    print(df.loc[ix, 'myColumn'])
I have a data frame and would like to get a mean of the values from one of the columns. If I do:
print df['col_name'][0:1]
print df['col_name'][0:1].mean()
I get:
0 2
Name: col_name
2.0
If I do:
print df['col_name'][0:2]
print df['col_name'][0:2].mean()
I get:
0 2
1 1
Name: col_name
10.5
If I do:
print df['col_name'][0:3]
print df['col_name'][0:3].mean()
I get:
0 2
1 1
2 2
Name: col_name
70.6666666667
It looks like you have a column of str values, not ints. Summing strings concatenates them ('2' + '1' is '21'), and mean() then divides that concatenation, read as a number, by the count (21 / 2 = 10.5), which explains the strange results:
import pandas as pd
df = pd.DataFrame({'col':['2','1','2']})
for i in range(1,4):
print(df['col'][0:i].mean())
yields
2.0
10.5
70.6666666667
while if the values are ints:
df = pd.DataFrame({'col':[2,1,2]})
for i in range(1,4):
print(df['col'][0:i].mean())
yields
2.0
1.5
1.66666666667
You can convert your column of strs to a column of ints with
df['col'] = df['col'].map(int)
But, of course, the best way to handle this is to make sure the DataFrame is constructed with the right (int) values in the first place.
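If the column might also contain entries that map(int) would choke on, pd.to_numeric with errors='coerce' is a more forgiving alternative; bad values become NaN instead of raising. A sketch (the 'oops' entry is my own illustrative bad value):

```python
import pandas as pd

df = pd.DataFrame({'col': ['2', '1', '2', 'oops']})
# Coerce strings to numbers; unparseable entries become NaN
df['col'] = pd.to_numeric(df['col'], errors='coerce')
m = df['col'].mean()  # NaN is excluded: (2 + 1 + 2) / 3
print(m)
```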