Finding duplicates in a column, setting conditions, summing values from another column - python

I have a csv file and I'm currently using the pandas module. I have not found a solution to my problem yet. Here are the sample data, the problem, and the desired output.
Sample csv:
project, id, sec, code
1, 25, 50, 01
1, 25, 50, 12
1, 25, 45, 07
1, 5, 25, 03
1, 25, 20, 06
Problem:
I do not want to get rid of duplicated id rows; instead, when duplicates are found, I want to sum their sec values into the code 01 row, taking the other codes such as 12, 7, and 6 into account. I also need to know how to set conditions: if a code 7 row has sec less than 60, it should not be included in the sum. I have used the following code to sort by columns; the .isin call, however, gets rid of id 5. In a larger file there will be other duplicate ids with similar codes.
df = df.sort_values(by=['id'], ascending=[True])
df2 = df.copy()
sort1 = df2[df2['code'].isin(['01', '07', '06', '12'])]
Desired Output:
project, id, sec, code
1, 5, 25, 03
1, 25, 120, 01
1, 25, 50, 12
1, 25, 45, 07
1, 25, 20, 06
I have thought of parsing through the file but I'm stuck on the logic.
def edit_data(df):
    sum = 0
    with open(df) as file:
        next(file)
        for line in file:
            parts = line.split(',')
            code = float(parts[3])
            id = float(parts[1])
            sec = float(parts[2])
            return ?
I'd appreciate any help, as I'm new to Python with about 3 months of experience. Thanks!

Let's try this:
import numpy as np

df = df.sort_values('id')
# Use boolean indexing to eliminate the unwanted records, then groupby and sum;
# convert the result to a dataframe indexed by the groups.
sumdf = df[~((df.code == 7) & (df.sec < 60))].groupby(['project', 'id'])['sec'].sum().to_frame()
# Find the first record of each group using duplicated, and again with boolean
# indexing set the sec column for those records to NaN.
df.loc[~df.duplicated(subset=['project', 'id']), 'sec'] = np.nan
# Set the index of the original dataframe and use combine_first to replace those
# NaN values with the sums from the grouped dataframe.
df_out = df.set_index(['project', 'id']).combine_first(sumdf).reset_index().astype(int)
df_out
Output:
project id code sec
0 1 5 3 25
1 1 25 1 120
2 1 25 12 50
3 1 25 7 45
4 1 25 6 20
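For comparison, the same idea can be written more compactly with a masked groupby transform. This is only a sketch, not the answer above: it starts from the original, un-mutated frame and assumes (as in the output above) that code has been read as integers and that each duplicated id has exactly one code 1 row to receive the total.
import pandas as pd

# rows eligible for summing: exclude code 7 rows whose sec is below 60
mask = ~((df['code'] == 7) & (df['sec'] < 60))
# broadcast the conditional group total to every row of its (project, id) group
totals = df['sec'].where(mask).groupby([df['project'], df['id']]).transform('sum')
# write the total onto the code 1 row of each group
df.loc[df['code'] == 1, 'sec'] = totals
The result may need .astype(int) afterwards, just as in the answer, since the masked sum comes back as float.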

Pandas Dataframe - How to get a multiline cell separated by carriage return into multiple rows?

Thank you for taking the time to look into this. I'm a beginner programmer and stuck at this.
# the dataframe below is included for reference
import pandas as pd

data = [['\r8', 'tom', 10, '55\r \r \r62\r75'], ['18\r\r9', 'nick', 15, '77\r25\r85'], ['17\r19\r18', 'juli', 14, '55\r75\r85']]
df = pd.DataFrame(data, columns=['Roll No per Class', 'Name', 'Age', 'Highest Scores'])
This is a sample dataframe; the original one spans more than 15,000 rows and 10 columns.
I want each \r-separated value to be placed into a new row, with the other columns repeating.
I have tried the code below:
import numpy as np
from itertools import chain

# return a flat list from a series of '\r'-separated strings
def chainer(s):
    return list(chain.from_iterable(s.str.split('\r')))

# calculate lengths of splits
# (note: this splits on ',' while chainer splits on '\r'; the mismatch is what
# makes the array lengths disagree)
lens = df['Highest Scores'].str.split(',').map(len)

# create new dataframe, repeating or chaining as appropriate
res = pd.DataFrame({'Name': np.repeat(df['Name'], lens),
                    'Age': np.repeat(df['Age'], lens),
                    'Roll No per Class': chainer(df['Roll No per Class']),
                    'Highest Scores': chainer(df['Highest Scores'])})
I'm getting the error:
ValueError: All arrays must be of the same length
I have also tried the code -
df.set_index(['Name', 'Age']).apply(lambda x: x.str.split('\r').explode()).reset_index()
It also gives an error:
ValueError: cannot handle a non-unique multi-index!
I'm guessing this is because the length of Roll number column doesn't match the length of Highest Scores column.
Can someone please help look into this? This is my first post, so do let me know if anything is missing and needs to be added.
You can split the cells at \r first,
>>> cols = ['Roll No per Class', 'Highest Scores']
>>> df[cols] = df[cols].apply(lambda col: col.str.split("\r"))
>>> df
Roll No per Class Name Age Highest Scores
0 [, , 8] tom 10 [55, 62, 75]
1 [18, , 9] nick 15 [77, 25, 85]
2 [17, 19, 18] juli 14 [55, 75, 85]
and explode them after:
>>> df.explode(cols)
Roll No per Class Name Age Highest Scores
0 tom 10 55
0 tom 10 62
0 8 tom 10 75
1 18 nick 15 77
1 nick 15 25
1 9 nick 15 85
2 17 juli 14 55
2 19 juli 14 75
2 18 juli 14 85
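The blank cells in the exploded frame come from consecutive \r characters in the source strings. If those should read as missing values instead, one option (a sketch, assuming the blanks carry no information) is to replace them after exploding:
import numpy as np

out = df.explode(cols)
# turn empty / whitespace-only fragments left by consecutive \r into NaN
out[cols] = out[cols].replace(r'^\s*$', np.nan, regex=True)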

Is there a more efficient or concise way to divide a df according to a list of indexes?

I'm trying to slice/divide the following dataframe
import pandas as pd

df = pd.DataFrame(
    {'time': [4, 10, 15, 6, 0, 20, 40, 11, 9, 12, 11, 25],
     'value': [0, 0, 0, 50, 100, 0, 0, 70, 100, 0, 100, 20]}
)
according to a list of indexes to split on:
[5, 7, 9]
The list of split indexes never includes the first or last index of the dataframe itself. I'm trying to get the following four dataframes as a result (defined by the three given indexes plus the beginning and end of the original df), each assigned to its own variable:
time value
0 4 0
1 10 0
2 15 0
3 6 50
4 0 100
time value
5 20 0
6 40 0
time value
7 11 70
8 9 100
time value
9 12 0
10 11 100
11 25 20
My current solution gives me a list of dataframes that I could then assign to variables manually by list index, but the code is a bit complex, and I'm wondering if there's a simpler/more efficient way to do this.
indexes = [5, 7, 9]
indexes.insert(0, 0)
indexes.append(df.index[-1] + 1)
i = 0
df_list = []
while i + 1 < len(indexes):
    df_list.append(df.iloc[indexes[i]:indexes[i + 1]])
    i += 1
This is all coming off of my attempt to answer this question. I'm sure there's a better approach to that answer, but I did feel like there should be a simpler way to do this kind of slicing than what I thought of.
You can use np.split (note that it expects the original three split points, not the padded list built above):
import numpy as np

df_list = np.split(df, [5, 7, 9])
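np.split returns one piece per interval, so the three split points yield four frames that can be unpacked straight into variables; a small sketch using the sample df above:
df1, df2, df3, df4 = np.split(df, [5, 7, 9])
print(df2)  # the rows with index 5 and 6 of the original frame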

assign column values to variable out of data frame

I read an Excel sheet into a data frame. Is there a way to loop over the columns and assign the values of each column, as a list, to a variable named after that column?
So as a simple example I have the data frame
import pandas as pd

val = [[1, 396, 29], [2, 397, 29], [3, 395, 29], [4, 393, 29], [5, 390, 29], [6, 398, 29]]
df = pd.DataFrame(val, columns=['Hours', 'T1T_in', 'T1p_in'])
df
Hours T1T_in T1p_in
0 1 396 29
1 2 397 29
2 3 395 29
3 4 393 29
4 5 390 29
5 6 398 29
So I would like a loop that creates these lists, using the column names as the variable names:
Hours = [1,2,3,4,5,6]
T1T_in = [396,397,395,393,390,398]
T1p_in = [29,29,29,29,29,29]
I found a solution for getting the names out but cannot assign the values. Thank you for your help.
The simplest way to get a column's elements as a list is to use pandas.Series.to_list():
Hours = df.Hours.to_list()
T1T_in = df.T1T_in.to_list()
T1p_in = df.T1p_in.to_list()
You can also use a for loop (as you mention you want) to get the columns and the rows:
data = {}
for column_name, rows in df.iteritems():  # use df.items() in pandas >= 2.0; iteritems was removed
    data[column_name] = rows.to_list()
print(data)
output:
{'Hours': [1, 2, 3, 4, 5, 6],
'T1T_in': [396, 397, 395, 393, 390, 398],
'T1p_in': [29, 29, 29, 29, 29, 29]}
The above result can also be achieved with:
df.to_dict('list')
as @anky_91 was saying.
It works so far, but I have one question: I still do not understand how I can access the values for each variable now. Next, I want to work with these lists and calculate some items; you can see a simplified example below.
Thank you once again.
new = []
for i in range(len(Hours)):
    new.append(T1T_in[i] + T1p_in[i])
Thank you once again for your help
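Regarding the follow-up: once the columns are plain lists, values are accessed by ordinary indexing (T1T_in[0] is 396), and element-wise work is often cleaner with zip. A minimal sketch using the sample data above (the printed values follow from it):
Hours = df['Hours'].to_list()
T1T_in = df['T1T_in'].to_list()
T1p_in = df['T1p_in'].to_list()

# element-wise sum of the two columns
new = [t + p for t, p in zip(T1T_in, T1p_in)]
print(new)  # [425, 426, 424, 422, 419, 427]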

get partial sum of values in df column once they reach a certain threshold

I need to start adding values in one of the columns in my df and return a row where the sum reaches a certain threshold. What is the easiest way to do it?
e.g.
threshold = 86
values ID
1 42 xxxxx
2 34 yyyyy
3 29 vvvvv
4 28 eeeee
should return line 3
import pandas as pd
df = pd.DataFrame(dict(values=[42, 34, 29, 28], ID=['x', 'y', 'z', 'e']))
threshold = 86
idx = df['values'].cumsum().searchsorted(threshold)
print(df.iloc[idx])
Output:
values 29
ID z
Name: 2, dtype: object
Note that df.values has a special meaning in pandas (it is the attribute that returns the underlying array), so df['values'] is different from df.values and necessary here.
This should work (using >= rather than ==, since the running sum can step over the threshold without ever equalling it):
df['new_values'] = df['values'].cumsum()
rows = df[df['new_values'] >= threshold].index.to_list()  # rows[0] is where the sum reaches the threshold
Another way:
df['values'].cumsum().ge(threshold).idxmax()
Out[131]: 3
df.loc[df['values'].cumsum().ge(threshold).idxmax()]
Out[133]:
values 29
ID vvvvv
Name: 3, dtype: object
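For clarity, the boolean trick above works because idxmax returns the label of the first True value. A minimal sketch against the frame from the first answer (0-based index, so the same row shows up as label 2):
import pandas as pd

df = pd.DataFrame(dict(values=[42, 34, 29, 28], ID=['x', 'y', 'z', 'e']))
reached = df['values'].cumsum().ge(86)  # [False, False, True, True]
print(reached.idxmax())                 # 2 -> first row where the running sum >= 86
print(df.loc[reached.idxmax()])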

group a list by date while summing the row values

This is the format of my data:
Date hits returning
2014/02/06 10 0
2014/02/06 25 0
2014/02/07 11 0
2014/02/07 31 1
2014/02/07 3 2
2014/02/08 6 0
2014/02/08 4 3
2014/02/08 17 0
2014/02/08 1 0
2014/02/09 6 0
2014/02/09 8 1
The required output is:
date, sum_hits, sum_returning, sum_total
2014/02/06 35 0 35
2014/02/07 44 3 47
2014/02/08 28 3 31
2014/02/09 14 1 15
The output is for use with Google Charts.
For getting the unique dates and summing the values per row, I am creating a dictionary and using the date as the key, something like:
# hits = <object with the input data>
data = {}
for h in hits:
    day = h.day_hour.strftime('%Y/%m/%d')
    if day in data:
        t_hits = int(data[day][0] + h.hits)
        t_returning = int(data[day][1] + h.returning)
        data[day] = [t_hits, t_returning, t_hits + t_returning]
    else:
        data[day] = [
            h.hits,
            h.returning,
            int(h.hits + h.returning)]
This creates something like:
{
    '2014/02/06': [35, 0, 35],
    '2014/02/07': [44, 3, 47],
    '2014/02/08': [28, 3, 31],
    '2014/02/09': [14, 1, 15]
}
And for creating the required output I am doing this:
array = []
for k, v in data.items():
    row = [k]
    row.extend(v)
    array.append(row)
which creates an array with the required format:
[
    ['2014/02/06', 35, 0, 35],
    ['2014/02/07', 44, 3, 47],
    ['2014/02/08', 28, 3, 31],
    ['2014/02/09', 14, 1, 15],
]
So my question basically is: is there a better way of doing this, or some built-in Python facility that could group rows by a field while summing the row values?
If your input is always sorted (or if you can sort it), you can use itertools.groupby to simplify some of this. groupby, as the name suggests, groups the input elements by the key, and gives you an iterable of (group_key, list_of_values_in_group). Something like the following should work:
import itertools

# the keyfunc extracts the grouping key from each input element
keyfunc = lambda row: row.day_hour.strftime("%Y/%m/%d")

data = []
for day, day_rows in itertools.groupby(hits, key=keyfunc):
    sum_hits = 0
    sum_returning = 0
    for row in day_rows:
        sum_hits += int(row.hits)
        sum_returning += int(row.returning)
    data.append([day, sum_hits, sum_returning, sum_hits + sum_returning])
# data now contains your desired output
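Since the rest of this page leans on pandas, a groupby version may also be worth noting. This is only a sketch: it assumes the input rows have already been loaded into a DataFrame named df with numeric hits and returning columns and a Date column, a layout not shown in the question:
import pandas as pd

# sum hits and returning per date, keeping Date as a regular column
out = df.groupby('Date', as_index=False)[['hits', 'returning']].sum()
out['sum_total'] = out['hits'] + out['returning']
# list-of-lists in the format the question builds by hand
array = out.values.tolist()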
