Creating intervals in Python

I am new to programming, so I don't know much about it.
I have a dataset like this:
Type  Value
A        40
A        70
A       125
A       150
B        50
B        80
B       130
B       150
And I want it in this format:
Type  <60  >60  >90  >120
A       1    3    2     2
B       1    3    2     2
Basically, count and categorize the values.
This is what I tried:
def delay_tag(list_name):
    empty_list = []
    for i in range(0, len(list_name)):
        if list_name[i] < 60:
            empty_list.append('<60')
        elif list_name[i] > 60:
            empty_list.append('>60')
        elif list_name[i] >= 120:
            empty_list.append('>120')
        else:
            empty_list.append('>= 180')
    return empty_list

This may give you an idea.
import pandas as pd

df = pd.DataFrame({
    'Type': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Value': [40, 70, 125, 150, 50, 80, 130, 150]
})

df_lt60 = df[df['Value'] < 60]
print(df_lt60.groupby('Type').Value.nunique())

df_gt60 = df[df['Value'] >= 60]
print(df_gt60.groupby('Type').Value.nunique())

import pandas as pd

df = pd.read_csv('your_file.csv')
fun = lambda x: {'<60': x.lt(60).sum(), '>60': x.gt(60).sum(),
                 '>90': x.gt(90).sum(), '>120': x.gt(120).sum()}
pd.DataFrame(df.groupby('Type').Value.apply(fun)).reset_index().pivot(
    index='Type', columns='level_1', values='Value')
Out[76]:
level_1  <60  >120  >60  >90
Type
A          1     2    3    2
B          1     2    3    2


How to build a pandas dataframe in a recursive function?

I am trying to implement the 'Bottom-Up Computation' algorithm in data mining (https://www.aaai.org/Papers/FLAIRS/2003/Flairs03-050.pdf).
I need to use the 'pandas' library to create a dataframe and provide it to a recursive function, which should also return a dataframe as output. I am only able to produce the final sums as output, because I am unable to figure out how to dynamically build a dataframe.
Here is the python program:
import pandas as pd

def project_data(df, d):
    return df.iloc[:, d]

def select_data(df, d, val):
    col_name = df.columns[d]
    return df[df[col_name] == val]

def remove_first_dim(df):
    return df.iloc[:, 1:]

def slice_data_dim0(df, v):
    df_temp = select_data(df, 0, v)
    return remove_first_dim(df_temp)

def buc(df):
    dims = df.shape[1]
    if dims == 1:
        input_sum = sum(project_data(df, 0))
        print(input_sum)
    else:
        dim_vals = set(project_data(df, 0).values)
        for dim_val in dim_vals:
            sub_data = slice_data_dim0(df, dim_val)
            buc(sub_data)
        sub_data = remove_first_dim(df)
        buc(sub_data)

data = {'A': [1, 1, 1, 1, 2],
        'B': [1, 1, 2, 3, 1],
        'M': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data, columns=['A', 'B', 'M'])
buc(df)
I get the following output:
30
30
40
100
50
50
80
30
40
But what I need is a dataframe, like this (not necessarily formatted, but a data frame):
     A    B    M
0    1    1   30
1    1    2   30
2    1    3   40
3    1  ALL  100
4    2    1   50
5    2  ALL   50
6  ALL    1   80
7  ALL    2   30
8  ALL    3   40
9  ALL  ALL  150
How do I achieve this?
Unfortunately, pandas doesn't have built-in functionality for subtotals, so the trick is to calculate them on the side and concatenate them with the original dataframe.
from itertools import combinations
import numpy as np

dim = ['A', 'B']
vals = ['M']

df = pd.concat(
    [df]
    # subtotals:
    + [df.groupby(list(gr), as_index=False)[vals].sum()
       for r in range(len(dim) - 1)
       for gr in combinations(dim, r + 1)]
    # total:
    + [df.groupby(np.zeros(len(df)))[vals].sum()]
)\
    .sort_values(dim)\
    .reset_index(drop=True)\
    .fillna("ALL")
Output:
      A    B    M
0     1    1   10
1     1    1   20
2     1    2   30
3     1    3   40
4     1  ALL  100
5     2    1   50
6     2  ALL   50
7   ALL    1   80
8   ALL    2   30
9   ALL    3   40
10  ALL  ALL  150
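
As an aside, if a wide layout is acceptable, pivot_table with margins=True computes the per-dimension totals and the grand total in one call (a sketch using the same df, with margins_name='ALL' to match the labels above):
df.pivot_table(index='A', columns='B', values='M',
               aggfunc='sum', margins=True, margins_name='ALL')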

Multiprocessing functions for dataframes

I have an Excel sheet that consists of two columns: the first contains keywords and the second URLs.
I am writing a script to extract the groups of keywords that share the same three or more URLs.
I wrote the code below, but it takes around an hour to run the main part on a huge Excel sheet.
import pandas as pd
import numpy as np
import time

loop = 1
numerator = 0
continuee = []
df_list = []

for index in list(df.sort_values('Url').set_index('Url').index.unique()):
    if len(df.sort_values('Url').set_index('Url').loc[index].values) == 1:
        list1 = list(df.sort_values('Url').set_index('Url').loc[index].values)
    elif len(df.sort_values('Url').set_index('Url').loc[index].keywords.values) > 1:
        list1 = list(df.sort_values('Url').set_index('Url').loc[index].keywords.values)
    df1 = df[df.keywords.isin(list1)]
    df1 = df1[df1.Url.duplicated(keep=False)]
    df1 = df1.groupby('Url').filter(lambda x: x.Url.value_counts() == df1.keywords.nunique())
    df1 = df1.groupby('keywords').filter(lambda x: x.keywords.value_counts() >= 3)
    df1 = df1.groupby('Url').filter(lambda x: x.Url.value_counts() == df1.keywords.nunique())
    if df1.keywords.nunique() > 1:
        silos = list(df1.keywords.unique())
        df_list.append({numerator: silos})
        word = word[~(word.isin(silos))]
        numerator += 1
    else:
        singles = list(word[word.keywords.isin(list1)].keywords.unique())
        df_list.append({"single": singles})
        word = word[~(word.isin(singles))]
    print(loop)
    loop += 1

trial = pd.DataFrame(df_list)
if 'single' in list(trial.columns):
    for i in list(word.keywords.unique()):
        if i not in list(trial.single):
            df_list.append({"single": i})
else:
    for i in list(word.keywords.unique()):
        df_list.append({"single": i})
trial = pd.DataFrame(df_list)
I tried many times to use multiprocessing, but I failed because I don't really understand how it works with pandas. Is there a way to speed this up? Also, if I wanted to parallelize a couple of other functions, how would I do it? Many thanks in advance.
From what I can gather, this should be your solution:
by_size = df.groupby(df.columns.tolist()).size().reset_index()
three_or_more = by_size[by_size[0] >= 3].iloc[:, :-1]
Example:
>>> df
   keyword  url
0        2    2
1        4    3
2        2    1
3        4    3
4        1    1
5        2    1
6        4    1
7        2    1
8        1    1
9        3    3
>>> by_size = df.groupby(df.columns.tolist()).size().reset_index()
>>> by_size
   keyword  url  0
0        1    1  2
1        2    1  3
2        2    2  1
3        3    3  1
4        4    1  1
5        4    3  2
>>> three_or_more = by_size[by_size[0] >= 3].iloc[:, :-1]
>>> three_or_more
   keyword  url
1        2    1
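
On the multiprocessing part of the question: the usual pattern with pandas is to split the dataframe into independent chunks and map a worker function over them with multiprocessing.Pool, concatenating the partial results at the end. A minimal sketch (process_chunk is a hypothetical stand-in for your own function; note that a plain row-wise split is only valid when the chunks are truly independent, which for your URL-sharing logic would require splitting along group boundaries):
import pandas as pd
from multiprocessing import Pool

def process_chunk(chunk):
    # Hypothetical per-chunk work; replace with your real function.
    return chunk.groupby(chunk.columns.tolist()).size().reset_index()

if __name__ == '__main__':
    df = pd.DataFrame({'keyword': [2, 4, 2, 4, 1, 2, 4, 2, 1, 3],
                       'url':     [2, 3, 1, 3, 1, 1, 1, 1, 1, 3]})
    chunks = [df.iloc[i::4] for i in range(4)]  # four row-wise slices
    with Pool(processes=4) as pool:
        results = pool.map(process_chunk, chunks)
    out = pd.concat(results, ignore_index=True)
    print(out)
To run several different functions in parallel instead of one function over chunks, pool.apply_async(func, (df,)) can be called once per function and the results collected afterwards.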

Write if else loop in simple way to reduce time complexity

I have the below code
for i in range(index, len(df_2) + 1):
    if df_2.loc[i, 'Duration'] == 0:
        df_2.loc[i, 'Duration'] = df_2.loc[i, "idle_hrs"] + df_2.loc[i - 1, "Duration"]
How can I write this more simply to reduce the time complexity? Is there a way to write it in list-comprehension style?
You can use shift for the accumulation.
import pandas as pd

index = 2
df = pd.DataFrame({'Duration': [1, 0, 2, 0], 'idle_hrs': [10, 20, 30, 40]})
df

   Duration  idle_hrs
0         1        10
1         0        20
2         2        30
3         0        40

start_df = df[:index]
df.loc[df['Duration'] == 0, 'Duration'] = df['idle_hrs'] + df['Duration'].shift(1)
df.iloc[:index] = start_df
df

   Duration  idle_hrs
0       1.0        10
1       0.0        20
2       2.0        30
3      42.0        40
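
One caveat with the shift approach, in case your data has runs of consecutive zeros: shift(1) reads the original previous row, so the second zero in a run will not pick up the value just written above it the way the loop does. A sketch of one way to handle runs (assuming the same df as above; restore the leading rows with start_df as before if you only want to start at index):
zero = df['Duration'].eq(0)
base = df['Duration'].mask(zero).ffill().fillna(0)  # last non-zero Duration before each row
run_id = (~zero).cumsum()                           # label each run of consecutive zeros
acc = df['idle_hrs'].where(zero, 0).groupby(run_id).cumsum()
df.loc[zero, 'Duration'] = base[zero] + acc[zero]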

Alternatives to Dataframe.iterrows() or Dataframe.itertuples()?

My understanding of pandas dataframe vectorization (through pandas itself or through NumPy) is that it applies a function to an array, similar to .apply() (please correct me if I'm wrong). Suppose I have the following dataframe:
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'yellow', 'orange', 'green',
                             'white', 'black', 'brown', 'orange-red', 'teal',
                             'beige', 'mauve', 'cyan', 'goldenrod', 'auburn',
                             'azure', 'celadon', 'lavender', 'oak', 'chocolate'],
                   'group': [1, 1, 1, 1, 1,
                             1, 1, 1, 1, 1,
                             1, 2, 2, 2, 2,
                             4, 4, 5, 6, 7]})
df = df.set_index('color')
df
For this data, I want to apply a special counter for each unique value in group. Here's my current implementation:
df['C'] = 0
for value in set(df['group'].values):
    filtered_df = df[df['group'] == value]
    adj_counter = 0
    initialize_counter = -1
    spacing_counter = 20
    special_counters = [0, 1, -1, 2, -2, 3, -3, 4, -4, 5, -5, 6, -6, 7, -7]
    for color, rows in filtered_df.iterrows():
        if len(filtered_df.index) < 7:
            initialize_counter += 1
            df.loc[color, 'C'] = 46 + special_counters[initialize_counter]
        else:
            spacing_counter += 1
            if spacing_counter > 5:
                spacing_counter = 0
            df.loc[color, 'C'] = spacing_counter
df
Is there a faster way to implement this that doesn't involve iterrows or itertuples? Since the counting in the C column is very irregular, I'm not sure how I could implement it through apply or even through vectorization.
What you can do first is create the column 'C' with groupby on the column 'group' and cumcount; the result almost represents spacing_counter or initialize_counter, depending on whether len(filtered_df.index) < 7 or not.
df['C'] = df.groupby('group').cumcount()
Now you need to select the appropriate rows for the if and else parts of your code. One way is to use groupby again with transform to create a series holding the size of the group for each row. Then use loc on your df with this series: where the value is smaller than 7, map the values through special_counters; otherwise just take them modulo 6.
ser_size = df.groupby('group')['C'].transform('size')
df.loc[ser_size < 7, 'C'] = df.loc[ser_size < 7, 'C'].map(lambda x: 46 + special_counters[x])
df.loc[ser_size >= 7, 'C'] %= 6
At the end, you get the expected result:
print(df)
            group   C
color
red             1   0
blue            1   1
yellow          1   2
orange          1   3
green           1   4
white           1   5
black           1   0
brown           1   1
orange-red      1   2
teal            1   3
beige           1   4
mauve           2  46
cyan            2  47
goldenrod       2  45
auburn          2  48
azure           4  46
celadon         4  47
lavender        5  46
oak             6  46
chocolate       7  46

An elegant way to make transformation of something like transpose in pandas faster

I have a pandas.DataFrame called a with columns id, w, and t (shown as an image in the original post), and I want a DataFrame b in which each unique id is a row and the w values become columns holding the corresponding t values (also shown as an image); b is like a transpose of a.
To convert a to b, I use this code:
id_uni = a['id'].unique()
b = pd.DataFrame(columns=['id'] + [str(i) for i in range(1, 4)])
b['id'] = id_uni
for i in id_uni:
    for j in range(7):
        ind = (a['id'] == i) & (a['w'] == j)
        med = a.loc[ind, 't'].values
        if med.size:
            b.loc[b['id'] == i, str(j)] = med[0]
        else:
            b.loc[b['id'] == i, str(j)] = 0
This method is very brute force: I just use two for-loops to move every element from a to b, and it is very slow. Do you have a more efficient way to do it?
You can use pivot:
print(df.pivot(index='id', columns='w', values='t'))

w    1    2   3
id
0   54  147  12
1    1    0   1

df1 = df.pivot(index='id', columns='w', values='t').reset_index()
df1.columns.name = None
print(df1)

   id   1    2   3
0   0  54  147  12
1   1   1    0   1
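
Note that pivot raises an error if any (id, w) pair occurs more than once. If duplicates are possible in your data, pivot_table aggregates them instead, and fill_value=0 reproduces the else branch of the loop above (a sketch, assuming t is numeric):
df1 = df.pivot_table(index='id', columns='w', values='t',
                     aggfunc='first', fill_value=0).reset_index()
df1.columns.name = None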
