I am currently using Python and pandas to handle many of the repetitive tasks I need done for my master's thesis. At this point I have written some code (with help from Stack Overflow) that, based on some event dates in one file, finds a start and end date to use as a date range in another file. These dates are then located and appended to an empty list, which I can then output to Excel. However, using the code below I get a dataframe with 5 columns and 400,000+ rows (which is basically what I want), but not in the shape I need when outputting it to Excel. Below is my code:
end_date = pd.DataFrame(data=(df_sample['Date'] - pd.DateOffset(days=2)))
start_date = pd.DataFrame(data=(df_sample['Date'] - pd.offsets.BDay(n=252)))
merged_dates = pd.merge(end_date, start_date, left_index=True, right_index=True)

ff_factors = []
for index, row in merged_dates.iterrows():
    time_range = (df['Date'] > row['Date_y']) & (df['Date'] <= row['Date_x'])
    df_factor = df.loc[time_range]
    ff_factors.append(df_factor)

appended_data = pd.concat(ff_factors, axis=0)
I need the data to be 5 columns and 250 rows (the columns are variable identifiers) side by side, so that when outputting it to Excel I have, for example, columns A-E with 250 rows each. This then needs to be repeated for columns F-J, and so on. Using iloc, I can locate the first 250 observations with appended_data.iloc[0:250], giving both 5 columns and 250 rows, and then output that to Excel.
Is there any way for me to automate the process, so that after selecting the first 250 rows and outputting them to Excel, it selects the next 250 and outputs them next to the first 250, and so on?
I hope the above is precise and clear; otherwise I'm happy to elaborate!
EDIT:
The first picture illustrates what I get when outputting to Excel: 5 columns and 407,764 rows. What I need is to get this split up in the following way:
The second picture illustrates how I need the total sample to be split up: the first five columns with their corresponding 250 rows, as shown there. When I do the next split using iloc[250:500], I get the next 250 rows, which need to be added after the initial five columns, and so on.
You can do this with a combination of np.reshape, which can be made to behave as desired on individual columns and should be much faster than a loop through the rows, and pd.concat, to join the resulting dataframes back together:
def reshape_appended(df, target_rows, pad=4):
    df = df.copy()  # don't modify in-place
    # below line adds strings '0000', ..., '0004' to the column names
    # this ensures sorting the columns preserves the order
    df.columns = [str(i).zfill(pad) + df.columns[i] for i in range(len(df.columns))]
    # target number of new columns per column in df
    target_cols = len(df.index) // target_rows
    last_group = pd.DataFrame()
    # below conditional fires if there will be leftover rows - % is mod
    if len(df.index) % target_rows != 0:
        last_group = df.iloc[-(len(df.index) % target_rows):].reset_index(drop=True)
        df = df.iloc[:-(len(df.index) % target_rows)]  # keep rows that divide nicely
    # this is a large list comprehension, that I'll elaborate on below
    groups = [pd.DataFrame(df[col].values.reshape((target_rows, target_cols),
                                                  order='F'),
                           columns=[str(i).zfill(pad) + col for i in range(target_cols)])
              for col in df.columns]
    if not last_group.empty:  # if there are leftover rows, add them back
        last_group.columns = [pad*'9' + col for col in last_group.columns]
        groups.append(last_group)
    out = pd.concat(groups, axis=1).sort_index(axis=1)
    out.columns = out.columns.str[2*pad:]  # remove the extra characters in the column names
    return out
last_group takes care of any rows that don't divide evenly into sets of 250. The playing around with column names enforces proper sorting order.
df[col].values.reshape((target_rows, target_cols), order='F')
Reshapes the values in the column col of df into the shape specified by the tuple (target_rows, target_cols), with the ordering Fortran uses, indicated by F.
columns=[str(i).zfill(pad)+col for i in range(target_cols)]
is just giving names to these columns, with an eye to establishing proper ordering afterward.
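As a tiny illustration of the Fortran ordering (my own example, not part of the answer above): the values are laid out column by column rather than row by row.
import numpy as np

np.arange(6).reshape((3, 2), order='F')
# array([[0, 3],
#        [1, 4],
#        [2, 5]])
# column-major: the first three values fill column 0, the next three fill column 1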
Ex:
df = pd.DataFrame(np.random.randint(0, 10, (23, 3)), columns=list('abc'))
reshape_appended(df, 5)
Out[160]:
a b c a b c a b c a b c a b c
0 8 3 0 4 1 9 5 4 7 2 3 4 5.0 7.0 2.0
1 1 6 1 3 5 1 1 6 0 5 9 4 6.0 0.0 1.0
2 3 1 3 4 3 8 9 3 9 8 7 8 7.0 3.0 2.0
3 4 0 1 5 5 6 6 4 4 0 0 3 NaN NaN NaN
4 9 7 3 5 7 4 6 5 8 9 5 5 NaN NaN NaN
df
Out[161]:
a b c
0 8 3 0
1 1 6 1
2 3 1 3
3 4 0 1
4 9 7 3
5 4 1 9
6 3 5 1
7 4 3 8
8 5 5 6
9 5 7 4
10 5 4 7
11 1 6 0
12 9 3 9
13 6 4 4
14 6 5 8
15 2 3 4
16 5 9 4
17 8 7 8
18 0 0 3
19 9 5 5
20 5 7 2
21 6 0 1
22 7 3 2
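For the data in the question, the call would presumably be something along these lines (the output file name is my placeholder); each successive group of five columns in the result is the next 250-row slice:
reshape_appended(appended_data, 250).to_excel('reshaped_factors.xlsx', index=False)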
My best guess at solving the problem would be to loop until the counter is greater than the length, so:
i = 250      # counter (right limit of the slice)
j = 0        # left limit of the slice
chunks = []  # collect each 250-row slice
for x in range(len(appended_data)):
    chunks.append(appended_data.iloc[j:i])
    j = i
    i += 250
    if j >= len(appended_data):  # nothing left to slice
        break
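If you also want each slice written next to the previous one in Excel, one possible sketch (assuming the openpyxl engine is installed; the file and sheet names are placeholders of mine) uses pandas' ExcelWriter with an increasing startcol:
import pandas as pd

chunk = 250
with pd.ExcelWriter("output.xlsx", engine="openpyxl") as writer:
    for n, start in enumerate(range(0, len(appended_data), chunk)):
        block = appended_data.iloc[start:start + chunk].reset_index(drop=True)
        # shift each 250-row block to the right by the width of the blocks already written
        block.to_excel(writer, sheet_name="Sheet1", startcol=n * block.shape[1], index=False)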
I have a data frame with multiple columns (30-40) in a time series running continuously from 1 to 1440 minutes.
df
time colA colB colC.....
1 5 4 3
2 1 2 3
3 5 4 3
4 6 7 3
5 9 0 3
6 4 4 0
..
Now I want to add every two rows into one, but I want the interval of the 'time' index to stay equal to the number of rows I am adding. The resulting data frame is:
df
time colA colB colC.......
1 6 6 6
3 11 11 6
5 13 4 3
..
Here I added every two rows into one, and the time index interval is also 2 rows: 1, 3, 5, ...
Is it possible to achieve that?
Another way would be to group your data set every two rows and aggregate using sum on your 'colX' columns and mean on your 'time' column. Chaining astype(int) will truncate the resulting values (1.5 -> 1):
d = {col: 'sum' for col in [c for c in df.columns if c.startswith('col')]}
df.groupby(df.index // 2).agg({**d,'time': 'mean'}).astype(int)
prints back:
colA colB colC time
0 6 6 6 1
1 11 11 6 3
2 13 4 3 5
One way is to do the addition for all columns and then fix the time column:
df_new = df[1::2].reset_index(drop=True) + df[::2].reset_index(drop=True)
df_new['time'] = df[::2]['time'].values
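A note on the design choice here: the two reset_index(drop=True) calls are what make the odd and even rows line up positionally before the addition; without them pandas would align on the original index labels and produce NaNs instead of sums.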
I'm working on a project where my original dataframe is:
A B C label
0 1 2 2 Nan
1 2 4 5 7
2 3 6 5 Nan
3 4 8 7 Nan
4 5 10 3 8
5 6 12 4 8
But I have an array with new labels for certain points in the original dataframe (for which I only used columns A and B). Something like this:
X_labeled = [[2, 4], [3,6]]
y_labeled = [5,9]
My goal is to add the new labels to the original dataframe. I know that the combination of A and B is unique. What is the fastest way to assign each new label to the correct row?
This is my try:
y_labeled = np.array(y).astype('float64')
current_position = 0
for point in X_labeled:
    row = df.loc[(df['A'] == point[0]) & (df['B'] == point[1])]
    df.at[row.index, 'label'] = y_labeled[current_position]
    current_position += 1
Wanted output (rows with index 1 and 2 are changed):
A B C label
0 1 2 2 Nan
1 2 4 5 5
2 3 6 5 9
3 4 8 7 Nan
4 5 10 3 8
5 6 12 4 8
For small datasets this may be okay, but I'm currently using it for datasets with more than 25,000 labels. Is there a faster way?
Also, in some cases I use all columns except the column 'label'. That dataframe consists of 64 columns, so my method cannot easily be used there. Does someone have an idea to improve this?
Thanks in advance
The best solution is to make your arrays into a dataframe and use df.update():
new = pd.DataFrame(X_labeled, columns=['A', 'B'])
new['label'] = y_labeled
new = new.set_index(['A', 'B'])
df = df.set_index(['A', 'B'])
df.update(new)
df = df.reset_index()
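A note on why this works: update aligns on the index (and column names) and only copies over non-NaN values from new, so setting both frames to an ['A', 'B'] index is what matches each new label to the right row, while rows that don't appear in new keep their existing labels. Also note that update modifies df in place.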
Here's a numpy-based approach aimed at performance. To vectorize this we want a way to check membership of the rows of X_labeled in columns A and B. So what we can do is view each row of these two columns as a single 1D element (based on this answer), and then use np.in1d to index the dataframe and assign the values in y_labeled:
import numpy as np

X_labeled = [[2, 4], [3, 6]]
y_labeled = [5, 9]

a = df.values[:, :2].astype(int)  # indexing on A and B

def view_as_1d(a):
    a = np.ascontiguousarray(a)
    return a.view(np.dtype((np.void, a.dtype.itemsize * a.shape[-1])))

ix = np.in1d(view_as_1d(a), view_as_1d(X_labeled))
df.loc[ix, 'label'] = y_labeled
print(df)
A B C label
0 1 2 2 Nan
1 2 4 5 5
2 3 6 5 9
3 4 8 7 Nan
4 5 10 3 8
5 6 12 4 8
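One caveat worth noting (my observation, not from the answer): df.loc[ix, 'label'] = y_labeled assigns the values positionally, in the order the matching rows appear in df, so y_labeled must already be ordered accordingly. The void-view trick also requires both arrays to have the same dtype and itemsize, which is why the astype(int) on the dataframe side matters.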
I need help with optimising my code, because my solution is very slow.
I have 2 dataframes. One has 6 columns: 1 column with main items and 5 columns with recommended items. The second df contains sales data per order (each product in a separate row).
I need to check which product is flagged as a "main" product, which products are recommended, and which are just additional ones. If there is more than 1 main product in an order, I need to duplicate that order and set only 1 main product per duplicate.
I tried using pandas for that and found a working solution; however, I used itertuples, splitting both dfs by main items, etc. It gives me the right result, but 1 order takes almost 2 seconds to compute and I have more than 1 million of them.
import pandas as pd

promo = pd.DataFrame({'main_id': [2, 4, 6],
                      'recommended_1': [1, 2, 8],
                      'recommended_2': [8, 6, 9],
                      'recommended_3': [10, 9, 10],
                      'recommended_4': [12, 11, 11],
                      'recommended_5': [6, 7, 8]})
orders = pd.DataFrame({
    'order': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'],
    'product': [1, 2, 3, 2, 4, 9, 6, 9]
})

promo['recommended_list'] = promo[
    ['recommended_1', 'recommended_2',
     'recommended_3', 'recommended_4',
     'recommended_5']].values.tolist()

flag = pd.DataFrame(
    {'flag': orders['product'].isin(promo.main_id)}
)
flaged_orders = pd.concat([orders, flag], axis=1)
main_in_orders = pd.DataFrame(
    flaged_orders.query("flag").groupby(['order'])['product']
    .agg(lambda x: x.tolist())
)

order_holder = pd.DataFrame()
for index, row in main_in_orders.itertuples():
    for item in row:
        working_order = orders.query("order == @index")
        working_order.loc[working_order['product'] == item, 'kategoria'] = 'M'
        recommended_products = promo.loc[promo['main_id'] == item]['recommended_list'].iloc[0]
        working_order.loc[working_order['product'].isin(recommended_products), 'kategoria'] = 'R'
        working_order['main_id'] = item
        order_holder = pd.concat([order_holder, working_order])
# NaN values in this case would be "additional items"
print(order_holder)
So, can you help me with a faster alternative? Pointing me in some direction would be awesome, because I've been stuck on this for some time. Pandas is optional.
You can do two merges to get all the rows you want, then use np.select to create the column 'kategoria'. The first merge, with the inner method, keeps only the rows whose 'product' is a 'main_id'; the second merge, with the left method, creates the duplicates when there are multiple 'main_id' values for the same 'order'.
df_mainid = orders.merge(promo, left_on='product', right_on='main_id', how='inner')
print (df_mainid)
# order product main_id recommended_1 recommended_2 recommended_3 \
# 0 a 2 2 1 8 10
# 1 b 2 2 1 8 10
# 2 b 4 4 2 6 9
# 3 c 6 6 8 9 10
#
# recommended_4 recommended_5
# 0 12 6
# 1 12 6
# 2 11 7
# 3 11 8
So you get only the rows whose 'product' is a 'main_id'. Then:
df_merged = orders.merge(df_mainid.drop('product', axis=1), on=['order'], how='left')\
.sort_values(['order', 'main_id'])
print (df_merged)
# order product main_id recommended_1 recommended_2 recommended_3 \
# 0 a 1 2 1 8 10
# 1 a 2 2 1 8 10
# 2 a 3 2 1 8 10
# 3 b 2 2 1 8 10
# 5 b 4 2 1 8 10
# 7 b 9 2 1 8 10
# 4 b 2 4 2 6 9
# 6 b 4 4 2 6 9
# 8 b 9 4 2 6 9
# 9 c 6 6 8 9 10
# 10 c 9 6 8 9 10
# recommended_4 recommended_5
# 0 12 6
# 1 12 6
# 2 12 6
# 3 12 6
# 5 12 6
# 7 12 6
# 4 11 7
# 6 11 7
# 8 11 7
# 9 11 8
# 10 11 8
You get the duplicated 'order' rows if there are several 'main_id' values. Finally, create the column 'kategoria' with np.select. The first condition is 'product' equal to 'main_id', giving 'M'; the second condition is 'product' appearing in one of the columns starting with 'recommended', giving 'R'. At the end, drop the recommended columns to get the same output as your order_holder.
conds = [df_merged['product'].eq(df_merged.main_id),
         (df_merged['product'].to_numpy()[:, None]
          == df_merged.filter(like='recommended').to_numpy()).any(axis=1)]
choices = ['M', 'R']
df_merged['kategoria'] = np.select(conds, choices, np.nan)
df_merged = df_merged.drop(df_merged.filter(like='recommended').columns, axis=1)
print (df_merged)
order product main_id kategoria
0 a 1 2 R
1 a 2 2 M
2 a 3 2 nan
3 b 2 2 M
5 b 4 2 nan
7 b 9 2 nan
4 b 2 4 R
6 b 4 4 M
8 b 9 4 R
9 c 6 6 M
10 c 9 6 R
I have a pandas dataframe that looks like this. The rows and the columns have the same name.
name a b c d e f g
a 10 5 4 8 5 6 4
b 5 10 6 5 4 3 3
c - 4 9 3 6 5 7
d 6 9 8 6 6 8 2
e 8 5 4 4 14 9 6
f 3 3 - 4 5 14 7
g 4 5 8 9 6 7 10
I can get the 5 largest values with df['column_name'].nlargest(n=5), but if I have to return the top 50% of the largest values in descending order, is there anything built into pandas for it, or do I have to write a function for it? How can I get them? I am quite new to Python. Please help me out.
UPDATE: Let's take column a into consideration; it has the values 10, 5, -, 6, 8, 3 and 4. I have to sum them all up and get the top 50% of them. The total in this case is 36, and 50% of that is 18. So from column a, I want to select 10 and 8 only. Similarly, I want to go through all the other columns and select 50%.
Sorting is flexible :)
df.sort_values('column_name',ascending=False).head(int(df.shape[0]*.5))
Update: the frac argument is available only on .sample(), not on .head or .tail. df.sample(frac=.5) does give 50%, but head and tail expect only an int; df.head(frac=.5) fails with TypeError: head() got an unexpected keyword argument 'frac'.
Note: on int() vs round()
int(3.X) == 3 # True where X is any digit 0-9, since int() truncates
round(3.45) == 3 # True
round(3.5) == 4 # True
So when doing .head(int/round ...) do think of what behaviour fits your need.
Updated: Requirements
So let's take column a into consideration and it has values like 10, 5,-,6,8,3 and 4. I have to sum all of them up and get the top 50% of them. so the total, in this case, is 36. 50% of these values would be 18. So from column a, I want to select 10 and 8 only. Similarly, I want to go through all the other columns and select 50%. -Matt
A silly hack would be to sort, take the cumulative sum, divide it by the column total to get the running fraction, and then use that to select part of your sorted column, e.g.
import pandas as pd
from io import StringIO

data = pd.read_csv(
    StringIO("""name a b c d e f g
a 10 5 4 8 5 6 4
b 5 10 6 5 4 3 3
c - 4 9 3 6 5 7
d 6 9 8 6 6 8 2
e 8 5 4 4 14 9 6
f 3 3 - 4 5 14 7
g 4 5 8 9 6 7 10"""),
    sep=' ', index_col='name'
).dropna(axis=1).apply(
    pd.to_numeric, errors='coerce', downcast='signed')

sorted_a = data[['a']].sort_values(by='a', ascending=False)
x = sorted_a[(sorted_a.cumsum() / sorted_a.sum()) <= .5].dropna()
print(x)
Outcome:
         a
name
a     10.0
e      8.0
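To apply the same cutoff to every column rather than just 'a' (my own generalization, not part of the original answer):
# per column: sort descending and keep the values whose running total stays within 50% of the column sum
top_half = {
    col: data[col].sort_values(ascending=False)
                  .loc[lambda s: s.cumsum() / s.sum() <= 0.5]
    for col in data.columns
}
print(top_half['a'])  # keeps the 10 and the 8, as required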
You could sort the data frame and display only 90% of the data
df.sort_values('column_name',ascending=False).head(round(0.9*len(df)))
data.csv
name,a,b,c,d,e,f,g
a,10,5,4,8,5,6,4
b,5,10,6,5,4,3,3
c,-,4,9,3,6,5,7
d,6,9,8,6,6,8,2
e,8,5,4,4,14,9,6
f,3,3,-,4,5,14,7
g,4,5,8,9,6,7,10
test.py
#!/bin/python
import pandas as pd
def percentageOfList(l, p):
    return l[0:int(len(l) * p)]
df = pd.read_csv('data.csv')
print(percentageOfList(df.sort_values('b', ascending=False)['b'], 0.9))
I have a pandas DataFrame with columns "Time" and "A". For each row, df["Time"] is an integer timestamp and df["A"] is a float. I want to create a new column "B" which takes the value of df["A"] from the row that occurs at or immediately before five seconds in the future. I can do this iteratively as:
for i in df.index:
    df["B"][i] = df["A"][max(df[df["Time"] <= df["Time"][i]+5].index)]
However, the df has tens of thousands of records so this takes far too long, and I need to run this a few hundred times so my solution isn't really an option. I am somewhat new to pandas (and only somewhat less new to programming in general) so I'm not sure if there's an obvious solution to this supported by pandas.
It would help if I had a way of referencing the specific value of df["Time"] in each row while creating the column, so I could do something like:
df["B"] = df["A"][max(df[df["Time"] <= df["Time"][corresponding_row]+5].index)]
Thanks.
Edit: Here's an example of what my goal is. If the dataframe is as follows:
Time A
0 0
1 1
4 2
7 3
8 4
10 5
12 6
15 7
18 8
20 9
then I would like the result to be:
Time A B
0 0 2
1 1 2
4 2 4
7 3 6
8 4 6
10 5 7
12 6 7
15 7 9
18 8 9
20 9 9
where each line in B comes from the value of A in the row whose Time is greater by at most 5. So if Time is also the index, then df["B"][0] = df["A"][4], since 4 is the largest time which is at most 5 greater than 0. In code, 4 = max(df["Time"][df["Time"] <= 0+5]), which is why df["B"][0] is df["A"][4].
Use tshift. You may need to resample first to fill in any missing values. I don't have time to test this, but try this.
df['B'] = df.resample('s', how='ffill').tshift(5, freq='s').reindex_like(df)
And a tip for getting help here: if you provide a few rows of sample data and an example of your desired result, it's easy for us to copy/paste and try out a solution for you.
Edit
OK, looking at your example data, let's leave your Time column as integers.
In [59]: df
Out[59]:
A
Time
0 0
1 1
4 2
7 3
8 4
10 5
12 6
15 7
18 8
20 9
Make an array containing the first and last Time values and all the integers in between.
In [60]: index = np.arange(df.index.values.min(), df.index.values.max() + 1)
Make a new DataFrame with all the gaps filled in.
In [61]: df1 = df.reindex(index, method='ffill')
Make a new column with the same data shifted up by 5 -- that is, looking forward in time by 5 seconds.
In [62]: df1['B'] = df1.shift(-5)
And now drop all the filled-in times we added, taking only values from the original Time index.
In [63]: df1.reindex(df.index)
Out[63]:
A B
Time
0 0 2
1 1 2
4 2 4
7 3 6
8 4 6
10 5 7
12 6 7
15 7 9
18 8 NaN
20 9 NaN
How you fill in the last values, for which there is no "five seconds later", is up to you. Judging from your desired output, maybe use fillna with a constant value set to the last value in column A.
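For example, a minimal sketch of that last step (result is just a name I picked, not something from the answer above):
result = df1.reindex(df.index)
result['B'] = result['B'].fillna(result['A'].iloc[-1])  # fill the trailing NaNs with the last A value (9 here)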