group a list by date while counting the row values - python

This is the format of my data:
Date hits returning
2014/02/06 10 0
2014/02/06 25 0
2014/02/07 11 0
2014/02/07 31 1
2014/02/07 3 2
2014/02/08 6 0
2014/02/08 4 3
2014/02/08 17 0
2014/02/08 1 0
2014/02/09 6 0
2014/02/09 8 1
The required output is:
date, sum_hits, sum_returning, sum_total
2014/02/06 35 0 35
2014/02/07 44 3 47
2014/02/08 28 3 31
2014/02/09 14 1 15
The output is for use with Google Charts.
To get the unique dates and sum the values per row, I am creating a dictionary using the date as the key, something like:
# hits = <object with the input data>
data = {}
for h in hits:
    day = h.day_hour.strftime('%Y/%m/%d')
    if day in data:
        t_hits = int(data[day][0] + h.hits)
        t_returning = int(data[day][1] + h.returning)
        data[day] = [t_hits, t_returning, t_hits + t_returning]
    else:
        data[day] = [
            h.hits,
            h.returning,
            int(h.hits + h.returning)]
This creates something like:
{
    '2014/02/06': [35, 0, 35],
    '2014/02/07': [44, 3, 47],
    '2014/02/08': [28, 3, 31],
    '2014/02/09': [14, 1, 15]
}
And for creating the required output I am doing this:
array = []
for k, v in data.items():
    row = [k]
    row.extend(v)
    array.append(row)
which creates an array with the required format:
[
    ['2014/02/06', 35, 0, 35],
    ['2014/02/07', 44, 3, 47],
    ['2014/02/08', 28, 3, 31],
    ['2014/02/09', 14, 1, 15],
]
So my question basically is: is there a better way of doing this, or some Python built-in that could let me group by row fields while summing the row values?

If your input is always sorted by date (or if you can sort it), you can use itertools.groupby to simplify some of this. groupby, as the name suggests, groups the input elements by key, yielding (group_key, iterator_of_values_in_group) pairs; note that it only groups consecutive elements, which is why the input must be sorted. Something like the following should work:
import itertools

# the keyfunc extracts the grouping key from each input element
keyfunc = lambda row: row.day_hour.strftime("%Y/%m/%d")

data = []
for day, day_rows in itertools.groupby(hits, key=keyfunc):
    sum_hits = 0
    sum_returning = 0
    for row in day_rows:
        sum_hits += int(row.hits)
        sum_returning += int(row.returning)
    data.append([day, sum_hits, sum_returning, sum_hits + sum_returning])
# data now contains your desired output
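If pandas is available, the same aggregation is only a few lines. This is just a supplementary sketch, assuming each element of hits exposes day_hour, hits and returning as in the question:

import pandas as pd

# build a frame from the input objects (attribute names taken from the question)
df = pd.DataFrame({
    'date': [h.day_hour.strftime('%Y/%m/%d') for h in hits],
    'hits': [int(h.hits) for h in hits],
    'returning': [int(h.returning) for h in hits],
})
summed = df.groupby('date', as_index=False).sum()
summed['total'] = summed['hits'] + summed['returning']
array = summed.values.tolist()  # list of [date, hits, returning, total] rows for Google Charts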

Related

Is there a more efficient or concise way to divide a df according to a list of indexes?

I'm trying to slice/divide the following dataframe
df = pd.DataFrame(
    {'time': [4, 10, 15, 6, 0, 20, 40, 11, 9, 12, 11, 25],
     'value': [0, 0, 0, 50, 100, 0, 0, 70, 100, 0, 100, 20]}
)
according to a list of indexes to split on:
[5, 7, 9]
The first and last items of the list are the first and last indexes of the dataframe. I'm trying to get the following four dataframes as a result (defined by the three given indexes and the beginning and end of the original df) each assigned to their own variable:
time value
0 4 0
1 10 0
2 15 0
3 6 50
4 0 100
time value
5 20 0
6 40 0
time value
7 11 70
8 9 100
time value
9 12 0
10 11 100
11 25 20
My current solution gives me a list of dataframes that I could then assign to variables manually by list index, but the code is a bit complex, and I'm wondering if there's a simpler/more efficient way to do this.
indexes = [5, 7, 9]
indexes.insert(0, 0)
indexes.append(df.index[-1] + 1)

i = 0
df_list = []
while i + 1 < len(indexes):
    df_list.append(df.iloc[indexes[i]:indexes[i+1]])
    i += 1
This all comes from my attempt to answer this question. I'm sure there's a better approach to that answer, but I did feel like there should be a simpler way to do this kind of slicing than what I came up with.
You can use np.split:
df_list = np.split(df, indexes)
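For the frame in the question this produces exactly the four pieces, split before positions 5, 7 and 9; a quick check sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'time': [4, 10, 15, 6, 0, 20, 40, 11, 9, 12, 11, 25],
     'value': [0, 0, 0, 50, 100, 0, 0, 70, 100, 0, 100, 20]}
)
first, second, third, fourth = np.split(df, [5, 7, 9])
print(second)
#    time  value
# 5    20      0
# 6    40      0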

Dynamically creating nested lists of sequential numbers

I have a dataframe which is a subset of another dataframe and contains the following indexes: 45, 46, 47, 51, 52
Example dataframe:
price count
45 3909.0 8
46 3908.75 8
47 3908.50 8
51 3907.75 8
52 3907.5 8
I want to make 2 lists, each being its own list of the indexes that are sequential. (Example of this data format)
list[0] = [45, 46, 47]
list[1] = [51, 52]
Problem: The following code causes this error on the second to last line:
IndexError: list assignment index out of range
same_width_nodes = df.loc[df['count'] == width]
i = same_width_nodes.index[0]
seq = 0
sequences = [[]]
sequences[seq] = []
for index, row in same_width_nodes.iterrows():
    if i == index:
        i += 1
        sequences[seq].append(index)
    else:
        seq += 1
        sequences[seq] = [index]
        i = index
Maybe there's a better way to achieve this, but I'd like to know why I can't create a new item in the sequences list as I am doing here, and how I should be doing it.
You can use this:
s_index=df.index.to_series()
l = s_index.groupby(s_index.diff().ne(1).cumsum()).agg(list).to_numpy()
Output:
l[0]
[45, 46, 47]
and
l[1]
[51, 52]
In steps:
First we take a rolling diff of the index; any step not equal to 1 is coded as True, and a cumsum over those flags then creates a new group label per consecutive run.
45 0
46 0
47 0
51 1
52 1
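For reference, a minimal snippet reproducing those labels (depending on how the first NaN diff is handled, the labels may start at 1 rather than 0; the grouping is identical either way):

s_index = df.index.to_series()
print(s_index.diff().ne(1).cumsum())
# 45    1
# 46    1
# 47    1
# 51    2
# 52    2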
Next, we use the groupby method with the new sequence labels to collect each group's index values into a nested list inside a list comprehension.
Setup.
df = pd.DataFrame([1,2,3,4,5],columns=['A'],index=[45,46, 47, 51, 52])
A
45 1
46 2
47 3
51 4
52 5
df['grp'] = df.assign(idx=df.index)['idx'].diff().fillna(1).ne(1).cumsum()
idx = [i.index.tolist() for _,i in df.groupby('grp')]
[[45, 46, 47], [51, 52]]
The issue is with this line
sequences[seq] = [index]
You are trying to assign to a list index that does not exist yet. Instead, do this:
sequences.append([index])
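For completeness, here is the loop with that fix applied. Note a second tweak (a judgment call, not from the original answer): i must be advanced to index + 1 when a new sequence starts, otherwise the row right after a gap would also be treated as a gap.

same_width_nodes = df.loc[df['count'] == width]

sequences = [[]]
i = same_width_nodes.index[0]
for index, row in same_width_nodes.iterrows():
    if i == index:
        sequences[-1].append(index)
    else:
        sequences.append([index])  # start a new group instead of indexing past the end
    i = index + 1  # the next row must have this index to stay in the same group
print(sequences)
# [[45, 46, 47], [51, 52]]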
I use diff to find where the index value jumps by more than 1, then iterate the rows as tuples and access their values by position.
index = [45, 46, 47, 51, 52]
price = [3909.0, 3908.75, 3908.50, 3907.75, 3907.5]
count = [8, 8, 8, 8, 8]
df = pd.DataFrame({'index': index, 'price': price, 'count': count})
df['diff'] = df['index'].diff().fillna(0)
print(df)

result_list = [[]]
seq = 0
for row in df.itertuples():
    index = row[1]
    diff = row[4]
    if diff <= 1:
        result_list[seq].append(index)
    else:
        seq += 1
        result_list.append([index])  # append a new group; inserting at a fixed position would misplace later groups
print(result_list)
output:
[[45, 46, 47], [51, 52]]

pandas replace all values of a column with values that increment by n starting at 0

Say I have a dataframe like so that I have read in from a file (note: *.ene is a txt file)
df = pd.read_fwf('filename.ene')
TS DENSITY STATS
1
2
3
1
2
3
I would like to only change the TS column. I wish to replace all the column values of 'TS' with the values from range(0,751,125). The desired output should look like so:
TS DENSITY STATS
0
125
250
500
625
750
I'm a bit lost and would like some insight regarding the code to do such a thing in a general format.
I used a for loop to store the values into a list:
K = (6 * 125) + 1
m = []
for i in range(0, K, 125):
    m.append(i)
I thought to use .replace like so:
df['TS'] = df['TS'].replace(old_value, m, inplace=True)
but was not sure what to put in place of old_value to select all the values of the 'TS' column or if this would even work as a method.
It's pretty straightforward; if you're replacing all the data, you just need to do
df['TS'] = m
Example:
import pandas as pd
data = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
df = pd.DataFrame(data, index=[0, 1, 2], columns=['a', 'b', 'c'])
print(df)
# a b c
# 0 10 20 30
# 1 40 50 60
# 2 70 80 90
df['a'] = [1,2,3]
print(df)
# a b c
# 0 1 20 30
# 1 2 50 60
# 2 3 80 90
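Applied to the question, you can also skip the loop that builds m and size the sequence to the frame directly. A sketch, assuming df comes from read_fwf as above (note that range(0, 751, 125) has seven values, so assigning it to a six-row frame would raise a length-mismatch error):

import pandas as pd

df = pd.read_fwf('filename.ene')
# one multiple of 125 per row, starting at 0; len(df) keeps the lengths matched
df['TS'] = [i * 125 for i in range(len(df))]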

Finding duplicates in a column, setting conditions, summing values from another column

I have a csv file and I'm currently using pandas module. Have not found the solution for my problem. Here is the sample, problem, and desired output csv.
Sample csv:
project, id, sec, code
1, 25, 50, 01
1, 25, 50, 12
1, 25, 45, 07
1, 5, 25, 03
1, 25, 20, 06
Problem:
I do not want to get rid of duplicated (id) rows, but rather sum the (sec) values into the (code) 01 row when duplicates are found alongside other codes such as 12, 7, and 6. I also need to know how to set conditions: if a code 7 row's sec is less than 60, it should not be added to the sum. I have used the following code to sort by columns; the .isin, however, gets rid of "id" 5. In a larger file there will be other duplicate "id"s with similar codes.
df = df.sort_values(by=['id'], ascending=[True])
df2 = df.copy()
sort1 = df2[df2['code'].isin(['01', '07', '06', '12'])]
Desired Output:
project, id, sec, code
1, 5, 25, 03
1, 25, 120, 01
1, 25, 50, 12
1, 25, 45, 07
1, 25, 20, 06
I have thought of parsing through the file but I'm stuck on the logic.
def edit_data(df):
    sum = 0
    with open(df) as file:
        next(file)
        for line in file:
            parts = line.split(',')
            code = float(parts[3])
            id = float(parts[1])
            sec = float(parts[2])
    return ?
Appreciate any help as I'm new in Python equivalent to 3 months experience. Thanks!
Let's try this:
import numpy as np

df = df.sort_values('id')
#Use boolean indexing to eliminate unwanted records, then groupby and sum, convert the results to dataframe with indexes of groups.
sumdf = df[~((df.code == 7) & (df.sec < 60))].groupby(['project','id'])['sec'].sum().to_frame()
#Find first record of the group using duplicated and again with boolean indexing set the sec column for those records to NaN.
df.loc[~df.duplicated(subset=['project','id']),'sec'] = np.nan
#Set the index of the original dataframe and use combined_first to replace those NaN with values from the summed, grouped dataframe.
df_out = df.set_index(['project','id']).combine_first(sumdf).reset_index().astype(int)
df_out
Output:
project id code sec
0 1 5 3 25
1 1 25 1 120
2 1 25 12 50
3 1 25 7 45
4 1 25 6 20
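A groupby/transform variant of the same idea, shown only as an alternative sketch; like the answer above, it assumes the code 01 row sorts first within its (project, id) group:

import pandas as pd

df = df.sort_values('id')
keep = ~((df.code == 7) & (df.sec < 60))  # rows eligible for the sum
# per-(project, id) sum of eligible sec values, broadcast back to every row
group_sum = df['sec'].where(keep).groupby([df['project'], df['id']]).transform('sum')
first = ~df.duplicated(subset=['project', 'id'])  # first row of each group
df.loc[first, 'sec'] = group_sum[first]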

split dataframe values into a specified number of groups and apply function - pandas

df=pd.DataFrame([1,4,1,3,2,8,3,6,3,7,3,1,2,9])
I'd like to split df into a specified number of groups and sum all elements in each group. For example, dividing df into 4 groups
1,4,1,3 2,8,3,6 3,7,3,1 2,9
would result in
9
19
14
11
I could do df.groupby(np.arange(len(df))//4).sum(), but this won't work for larger dataframes
For example
df1=pd.DataFrame([1,4,1,3,2,8,3,6,3,7,3,1,2,9,1,5,3,4])
df1.groupby(np.arange(len(df1))//4).sum()
creates 5 groups instead of 4
You can use numpy.array_split:
df=pd.DataFrame([1,4,1,3,2,8,3,6,3,7,3,1,2,9,1,5,3,4])
a = pd.Series([x.values.sum() for x in np.array_split(df, 4)])
print (a)
0 11
1 27
2 15
3 13
dtype: int64
Solution with concat and sum:
a = pd.concat(np.array_split(df, 4), keys=np.arange(4)).sum(level=0)
print (a)
0
0 11
1 27
2 15
3 13
Say you have this data frame:
df = pd.DataFrame([1,4,1,3,2,8,3,6,3,7,3,1,2,9])
You can achieve it using a list comprehension and loc:
group_size = 4
[df.loc[i:i+group_size-1].values.sum() for i in range(0, len(df), group_size)]
Output:
[9, 19, 14, 11]
I looked at the comments, and I thought you could use some explicit Python code when the "usual" pandas functions can't fulfill your needs.
So:
import pandas as pd

def get_sum(a, chunks):
    for k in range(0, len(a), chunks):  # len(a), not the global df
        yield a[k:k+chunks].values.sum()

df = pd.DataFrame([1,4,1,3,2,8,3,6,3,7,3,1,2,9])
group_size = list(get_sum(df, 4))
print(group_size)
Output:
[9, 19, 14, 11]
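If you prefer group labels over physical splits, you can build the same near-even partition that np.array_split uses and feed it to groupby; a sketch, assuming the 18-row df from the question:

import numpy as np
import pandas as pd

df = pd.DataFrame([1,4,1,3,2,8,3,6,3,7,3,1,2,9,1,5,3,4])
n = 4
# np.array_split gives the first len(df) % n groups one extra row
sizes = [len(df) // n + (1 if i < len(df) % n else 0) for i in range(n)]
labels = np.repeat(np.arange(n), sizes)
print(df.groupby(labels).sum())
#     0
# 0  11
# 1  27
# 2  15
# 3  13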
